NeurIPS 2025 | Past | Math & reasoning | Agents
NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
LAW
- Submission deadline: Sep 21, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (111)
Fetched from OpenReview (v2) on 2026-06-10.
- A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments
- ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language
- Acting Less is Reasoning More! Teaching Language Model to Act Efficiently
- Adapting Vision-Language Models for Evaluating World Models
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
- Agentic Design Patterns: A System-Theoretic Framework
- AgentMaster: A Modular Multi-Agent Framework with A2A and MCP Protocols via a Unified Conversational Interface
- AI Agents for Web Testing: A Case Study in the Wild
- AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
- Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol
- Are LLMs Generalist Hanabi Agents?
- Assessing Adaptive World Models in Machines with Novel Games
- ATLAS: Actor-Critic Task-completion with Look-ahead Action Simulation
- AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory
- Automated Reward Design for Gran Turismo
- Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference
- Behavioral Systems Require Behavioral Tests
- Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection
- Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation
- Blocks, Bots, and Bottlenecks: Studying Real-time and Adaptive Multi-Agent LLM Collaboration
- Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
- Bridging Symbols from Language and Hierarchical Reinforcement Learning with Active Imitation
- Bridging Tool Dependencies and Domain Knowledge: A Graph-Based Framework for In-Context Planning
- Can LLMs Reliably Evaluate Themselves? A Probabilistic VC Framework
- CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
- Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models
- CausalARC: Abstract Reasoning with Causal World Models
- Computer-Use Agents as Judges for Automatic GUI Design
- CORE: Full-Path Evaluation of LLM Agents Beyond Final State
- CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage
- Credit-Budgeted ICPC-Style Coding: When LLM Agents Must Pay for Every Decision
- DDCG: Decoupled Dual-Critic Guidance for Embodied Agents
- DeepPersona: Generative Engine for Scaling Deep Synthetic Personas
- Democratizing Agentic RAG: Distillation-Guided Policy Optimization for Compact Language Models
- Democratizing Microgrid Optimization: An LLM Agent for Dispatching Mobile Chargers to Construction Electric Vehicles
- Demystify the Potential of Large Language Models as World Models of Code
- DiffusionPack: Bin Packing with Custom Human Preferences
- Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
- DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration
- EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
- ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
- Evaluating LLM Planning in Partially Observable Environments via Observation Representations and Action Sequences
- Evaluating Long-Context Reasoning in LLM-Based WebAgents
- Every Answer Counts: Efficient Entity-Centric QA by Bayesian-Guided Subquery Sampling
- EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
- Gaze-Guided Multimodal LLMs for Social Scene Understanding
- GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments
- GenPlanX. Integrating LLMs and Classical AI for Generation of Plans and Execution
- GRIT: Teaching MLLMs to Think with Images
- Grounded-Retrieval Adversarial Imitation Loop: Integrating Language, Agent, and World Models
- GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
- Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
- HugAgent: Evaluating LLMs in Simulating Individual-Level Human Reasoning on Open-Ended Tasks
- Knot So Simple: A Minimalistic Environment for Spatial Reasoning
- Language-conditioned world model improves policy generalization by reading environmental descriptions
- Law in Silico: Simulating Legal Society with LLM-Based Agents
- Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback
- LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
- LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding
- LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
- Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
- Measuring Rhetorical Style in Scientific Writing with LLM Personas
- MetaSynth: Multi-Agent Metadata Generation from Implicit Feedback in Black-Box Systems
- Mind-Map Agent: Enhancing Cooperative Task Planning through Communication Alignment with Large Language Models
- MIRAI: Evaluating LLM Agents for International Event Forecasting
- Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications
- Modeling Open World Cognition as On-Demand Synthesis of Probabilistic Models
- Modeling Others' Minds as Code
- NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
- Observer, Not Player: Simulating Theory of Mind in Large Language Models through Game Observation
- Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025
- Planning with Generative Cognitive Maps
- Position: Hierarchical World Models with Causal Curation for Generalizing Agents
- Position: Human-Robot Interaction Demands a Shift From Static Privacy Controls to Dynamic Learning
- Position: The Physics-Physical Reasoning Interplay is Key for Future Embodied World Models
- QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting
- R2P: Reformulate–Retrieve–Program for Robust Mathematical Reasoning in LLMs
- RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- Reasoning Under Pressure: LLMs in Competitive Pokémon Battles
- RECOLLAB: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling
- RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
- ROSE: Reconstructing Objects, Scenes, and Trajectories from Casual Videos for Robotic Manipulation
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation
- SAPO: Safety-Aware Embodied Task Planning with fully Partially-Observable environment and physical constraints
- SCALAR: Self-Supervised Composition and Learning of Skills with LLM Planning and RL
- Scaling LLM Planning: NL2FLow for Parametric Workflow Problem Generation and Rigorous Evaluation
- Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning
- Social Behaviour and Strategic Adaptation of LLMs in Multiplayer Sequential Games
- Social World Models
- Spatial Mental Modeling from Limited Views
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- STRIDE: A Systematic Framework for Selecting AI Modalities—Agentic AI, AI Assistants, or LLM Calls
- Test-Time Scaling for Multistep Reasoning in Small Language Models via A* Search
- The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
- Trust, Risk, and Security in Agentic AI: A Short Survey
- UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
- ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making
- VideoAgent: Self-Improving Video Generation for Embodied Planning
- VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
- Who Gets the Reward & Who Gets the Blame? Evaluation-Aligned Post-Training for Multi-LLM Agents
- World Model Driven Episodic Memory for LLMs
- World Models must live in Parallel Worlds
- WorldAgen: Unified State-Action Prediction with Test-Time World Model Training