NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning

LAW

Submission deadline: Sep 21, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (111)

Fetched from OpenReview (v2) on 2026-06-10.

A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments
Manuel Cherep, Chengtian Ma, Abigail Xu, Maya Shaked, Patricia Maes, Nikhil Singh · PDF
ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language
Aly Lidayan, Jakob Brandt Bjorner, Satvik Golechha, Alane Suhr · PDF
Acting Less is Reasoning More! Teaching Language Model to Act Efficiently
Hongru WANG, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji · PDF
Adapting Vision-Language Models for Evaluating World Models
Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Stevenson, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot · PDF
AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
Manik Rana, Calissa Man, Jeffrey Paine, Anotida Expected Msiiwa, Ahan M R, Kevin Zhu, Vasu Sharma, Sunishchal Dev · PDF
Agentic Design Patterns: A System-Theoretic Framework
Dung Dao, Quy Minh Le, Hoang Thanh Lam, Duc-Trong Le, Quoc-Viet Pham, Barry O'Sullivan, Hoang D. Nguyen · PDF
AgentMaster: A Modular Multi-Agent Framework with A2A and MCP Protocols via a Unified Conversational Interface
Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju · PDF
AI Agents for Web Testing: A Case Study in the Wild
Naimeng Ye, Xiao Yu, Ruize Xu, Tianyi Peng, Zhou Yu · PDF
AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
Dewi Sid William Gould, George De Ath, Ben Carvell, Nick Pepper · PDF
Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol
Xinxing Ren, Caelum Forder, Qianbo Zang, Ahsen Tahir, Roman J. Georgio, Suman Deb, Peter Carroll, Önder GÜRCAN, Zekun Guo · PDF
Are LLMs Generalist Hanabi Agents?
Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege · PDF
Assessing Adaptive World Models in Machines with Novel Games
Lance Ying, Katherine M. Collins, Prafull Sharma, Cédric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R Allen, Joshua B. Tenenbaum · PDF
ATLAS: Actor-Critic Task-completion with Look-ahead Action Simulation
Jiali Cheng, Anjishnu Kumar, Rishi Rajasekaran, G Roshan Lal, Hani Ramezani, Oleg Rokhlenko, Omar Zia Khan, Sunny Chiu-Webster, Gang Hua, Hadi Amiri · PDF
AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory
Jitesh Jain, Shubham Maheshwari, Ning Yu, Wen-mei Hwu, Humphrey Shi · PDF
Automated Reward Design for Gran Turismo
Michel Ma, Takuma Seno, Kaushik Subramanian, Peter R. Wurman, Peter Stone, Craig Sherstan · PDF
Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference
Harris Song, Long Le · PDF
Behavioral Systems Require Behavioral Tests
Manuel Cherep, Nikhil Singh, Patricia Maes · PDF
Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection
Najmul Hasan, Prashanth BusiReddyGari · PDF
Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub · PDF
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Kaya Stechly, Karthik Valmeekam, Vardhan Palod, Atharva Gundawar, Subbarao Kambhampati · PDF
BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation
Fuyi Yang, Chenchen Ye, Mingyu Derek Ma, Yijia Xiao, Matthew Yang, Wei Wang · PDF
Blocks, Bots, and Bottlenecks: Studying Real-time and Adaptive Multi-Agent LLM Collaboration
Isadora White, Kolby Nottingham, Max Robinson, Ayush Parasbhai Maniar, Mehul Maheshwari, Hansen Lillemark, Lianhui Qin, Prithviraj Ammanabrolu · PDF
Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
Yifu QIU, Yftah Ziser, Anna Korhonen, Shay B Cohen, Edoardo Ponti · PDF
Bridging Symbols from Language and Hierarchical Reinforcement Learning with Active Imitation
Ziqi Ma, Sao Mai Nguyen, Philippe Xu · PDF
Bridging Tool Dependencies and Domain Knowledge: A Graph-Based Framework for In-Context Planning
Shengjie Liu, Li Dong, Zhenyu Zhang · PDF
Can LLMs Reliably Evaluate Themselves? A Probabilistic VC Framework
Jae Oh Woo, Mengdie Flora Wang, Rahul Ghosh, Baishali Chaudhury, Mun Young Kim · PDF
CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
Ming Li, Chenguang Wang, Tianyi Zhou · PDF
Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models
Jared Junkin, Samuel Nathanson · PDF
CausalARC: Abstract Reasoning with Causal World Models
Jacqueline R. M. A. Maasch, John Kalantari, Kia Khezeli · PDF
Computer-Use Agents as Judges for Automatic GUI Design
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou · PDF
CORE: Full-Path Evaluation of LLM Agents Beyond Final State
Panagiotis Michelakis, Yiannis Hadjiyianni, Dimitrios Stamoulis · PDF
CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage
Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo, Ziwei Zhu, Chris Jordan · PDF
Credit-Budgeted ICPC-Style Coding: When LLM Agents Must Pay for Every Decision
Lingfeng Zhou, Junhao Shi, Jin Gao, Dequan Wang · PDF
DDCG: Decoupled Dual-Critic Guidance for Embodied Agents
Shaojin Ma, Min Zhang, Hongyao Tang, Jianye HAO, YAN ZHENG · PDF
DeepPersona: Generative Engine for Scaling Deep Synthetic Personas
Zhen Wang, Yufan Zhou, Zhongyan Luo, Lyumanshan Ye, Adam Wood, Man Yao, Luoshang Pan · PDF
Democratizing Agentic RAG: Distillation-Guided Policy Optimization for Compact Language Models
Rikuto Kotoge, Mai Nishimura, Jiaxin Ma · PDF
Democratizing Microgrid Optimization: An LLM Agent for Dispatching Mobile Chargers to Construction Electric Vehicles
Daniela Rojas Lozano, Yuanyuan Shi · PDF
Demystify the Potential of Large Language Models as World Models of Code
Bohan Lyu, Siqiao Huang, Zichen Liang, Wenjia Yang, Qian Sun, Jiaming Zhang · PDF
DiffusionPack: Bin Packing with Custom Human Preferences
Anurag Maurya, Shivam Vats, Gautham Balachandran, Ravi Prakash · PDF
Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati · PDF
DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration
Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong · PDF
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu, Yinzhu Quan · PDF
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li · PDF
Evaluating LLM Planning in Partially Observable Environments via Observation Representations and Action Sequences
Hayeong Lee, Jun Ho Seo, Sunguk Shin, Jinho Lee, Myunsoo Kim, Minsuk Chang, Byung-Jun Lee · PDF
Evaluating Long-Context Reasoning in LLM-Based WebAgents
Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai · PDF
Every Answer Counts: Efficient Entity-Centric QA by Bayesian-Guided Subquery Sampling
Binyamin Perets, Zohar Shnaider, Dvir Aran, Shie Mannor · PDF
EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
Wenzhe Fan, Ning Yan, Masood S. Mortazavi · PDF
Gaze-Guided Multimodal LLMs for Social Scene Understanding
Shayan Nasiriboukani, Muhammad Awais, Sara Atito · PDF
GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments
Leela Krishna, Mengyang Zhao, Saicharithreddy Pasula, Harshit Rajgarhia, Abhishek Mukherji, Vasudevan Sundarababu · PDF
GenPlanX. Integrating LLMs and Classical AI for Generation of Plans and Execution
Daniel Borrajo, Giuseppe Canonaco, Tomás de la Rosa, Alfredo Garrachón Ruiz, Sriram Gopalakrishnan, Simerjot Kaur, Marianela Morales, Sunandita Patra, Alberto Pozanco, Keshav Ramani, Charese Smiley, Pietro Totis, Manuela Veloso · PDF
GRIT: Teaching MLLMs to Think with Images
Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Xinze Guan, Xin Eric Wang · PDF
Grounded-Retrieval Adversarial Imitation Loop: Integrating Language, Agent, and World Models
Liv G. d'Aliberti, Manoel Horta Ribeiro · PDF
GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Guohao Li, Zhen Han, Volker Tresp · PDF
Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU · PDF
HugAgent: Evaluating LLMs in Simulating Individual-Level Human Reasoning on Open-Ended Tasks
Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Jinhua Zhao, Paul Pu Liang, Luis Alberto Alonso Pastor, Kent Larson · PDF
Knot So Simple: A Minimalistic Environment for Spatial Reasoning
Zizhao Chen, Yoav Artzi · PDF
Language-conditioned world model improves policy generalization by reading environmental descriptions
Joe Nguyen, Stefan Lee · PDF
Law in Silico: Simulating Legal Society with LLM-Based Agents
Yiding Wang, Yuxuan Chen, Fanxu Meng, Xifan Chen, Xiaolei Yang, Muhan Zhang · PDF
Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback
Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li · PDF
LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin · PDF
LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding
Yu Yu, Qian Xie, Li Jin · PDF
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
Yiming Wang, Da Yin, Yuedong Cui, Zhiqian Li, Ruichen Zheng, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang · PDF
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Yicheng He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma, HAO SHEN, Hao Sun, Beibei Wang, Fangyijie Wang, Hao Wang, Haoran Wang, Yang Wang, Yifeng Wang, Zhaowei Wang, Ziyang Wang, Yifan Wu, Zikai Xiao, Chengxing Xie, Fan Yang, Junxiao Yang, Qianshuo Ye, Ziyu Ye, Guangtao Zeng, Yuwen Ebony Zhang, Zeyu Zhang, Zihao Zhu, Bernard Ghanem, Philip Torr, Guohao Li · PDF
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
Xinjie Shen, Mufei Li, Pan Li · PDF
Measuring Rhetorical Style in Scientific Writing with LLM Personas
Jingyi Qiu, Hong Chen, Zongyi Li · PDF
MetaSynth: Multi-Agent Metadata Generation from Implicit Feedback in Black-Box Systems
Shreeranjani srirangamsridharan, Ali Abavisani, Reza Yousefi Maragheh, Ramin Giahi, Kai Zhao, Jason Cho, Sushant Kumar · PDF
Mind-Map Agent: Enhancing Cooperative Task Planning through Communication Alignment with Large Language Models
HoBeom Jeon, Hyungmin Kim, Dohyung Kim, Minsu Jang, Jaehong Kim · PDF
MIRAI: Evaluating LLM Agents for International Event Forecasting
Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang · PDF
Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications
Aditi Tiwari, Akshit Bhalla, Darshan Ganesh Prasad · PDF
Modeling Open World Cognition as On-Demand Synthesis of Probabilistic Models
Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy J. O'Donnell, Alexander K. Lew, Jacob Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson · PDF
Modeling Others' Minds as Code
Kunal Jha, Aydan Yuenan Huang, Eric Ye, Natasha Jaques, Max Kleiman-Weiner · PDF
NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Wilka Carvalho, Vikram Srinivas Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha · PDF
Observer, Not Player: Simulating Theory of Mind in Large Language Models through Game Observation
Jerry Wang, Ting Yu Liu · PDF
Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025
Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru WANG, Sanfeng Wu, Mengdi Wang · PDF
Planning with Generative Cognitive Maps
Jeffrey Qin, Albert Yang, Cole Wyeth, Ziheng Xu, Kevin Ellis, Marta Kryven · PDF
Position: Hierarchical World Models with Causal Curation for Generalizing Agents
Fei Dai, Hanqi Zhou, Alison Gopnik · PDF
Position: Human-Robot Interaction Demands a Shift From Static Privacy Controls to Dynamic Learning
Shuning Zhang, Hong Jia, Simin Li, Ting Dang, Yongquan Hu, Xin Yi, Hewu Li · PDF
Position: The Physics-Physical Reasoning Interplay is Key for Future Embodied World Models
Terry Jingchen Zhang, Kun Xiang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Chengyu Yu, Xiaodan Liang · PDF
QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting
Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso · PDF
R2P: Reformulate–Retrieve–Program for Robust Mathematical Reasoning in LLMs
Yu Zhang, Shujun Peng, Xinhan Lin, Yang Hu, Shouyi Yin · PDF
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu · PDF
Reasoning Under Pressure: LLMs in Competitive Pokémon Battles
Tadisetty Sai Yashwanth, Dhatri C · PDF
RECOLLAB: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling
Conor Wallace, Umer Siddique, Yongcan Cao · PDF
RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati · PDF
ROSE: Reconstructing Objects, Scenes, and Trajectories from Casual Videos for Robotic Manipulation
Peihao Li, Haoran Geng, Jameson Crate, Yanbing Han, Junyi Zhang, Feishi Wang, Charlie Tianyue Cheng, Runpei Dong, Yen-Jen Wang, Haozhe Lou, Trevor Darrell, Pieter Abbeel, Jitendra Malik · PDF
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
Michael Y. Hu, Benjamin Van Durme, Jacob Andreas, Harsh Jhamtani · PDF
SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley · PDF
SAPO: Safety-Aware Embodied Task Planning with fully Partially-Observable environment and physical constraints
Hyungmin Kim, HoBeom Jeon, Dohyung Kim, Minsu Jang, Jaehong Kim · PDF
SCALAR: Self-Supervised Composition and Learning of Skills with LLM Planning and RL
Renos Zabounidis, Yue Wu, Simon Stepputtis, Tom Mitchell, Yuanzhi Li, Katia P. Sycara · PDF
Scaling LLM Planning: NL2FLow for Parametric Workflow Problem Generation and Rigorous Evaluation
Jungkoo Kang · PDF
Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning
Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li · PDF
Social Behaviour and Strategic Adaptation of LLMs in Multiplayer Sequential Games
Xijie Zeng, Frank Rudzicz, Marta Kryven · PDF
Social World Models
Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, Maarten Sap · PDF
Spatial Mental Modeling from Limited Views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei · PDF
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Paul Saldyt, Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas · PDF
STRIDE: A Systematic Framework for Selecting AI Modalities—Agentic AI, AI Assistants, or LLM Calls
Shubhi Asthana, Ruchi Mahindru, Bing Zhang, Hima Patel, Chad DeLuca · PDF
Test-Time Scaling for Multistep Reasoning in Small Language Models via A* Search
Alexander Braverman, Weitong Zhang, Quanquan Gu · PDF
The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Pengrui Han, Rafal Dariusz Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez · PDF
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam · PDF
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar · PDF
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Vaskar Nath, Pranav Vishnu Raja, Jane Yu, Claire Yoon, Sean M. Hendryx · PDF
Trust, Risk, and Security in Agentic AI: A Short Survey
Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis · PDF
UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James David Lyon, Srinivas Sunkara, Jindong Chen · PDF
ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making
Yitong Luo, Ziang Chen, Hou Hei Lam, Jiayu Zhan, Junqi Wang, Zhenliang Zhang, Xue Feng · PDF
VideoAgent: Self-Improving Video Generation for Embodied Planning
Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang · PDF
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou · PDF
What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang · PDF
Who Gets the Reward & Who Gets the Blame? Evaluation-Aligned Post-Training for Multi-LLM Agents
Chih-Hsuan Yang, Tanwi Mallick, Ian Foster, Amal Gueroudji, Rajeev Thakur · PDF
World Model Driven Episodic Memory for LLMs
Shreyas Rajesh, Pavan S Holur, Chenda Duan, David Chong, Vwani Roychowdhury · PDF
World Models must live in Parallel Worlds
Sahithya Ravi, Aditya Chinchure, Pushkar Shukla, Vered Shwartz, Leonid Sigal · PDF
WorldAgen: Unified State-Action Prediction with Test-Time World Model Training
Chi Wan, Kangrui Wang, Yuan Si, Pingyue Zhang, Huang Huang, Manling Li · PDF
