NeurIPS 2025 Past Agents

Workshop on Scaling Environments for Agents

SEA @ NeurIPS 2025

Submission deadline: Sep 3, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)

Submission portal: OpenReview

Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (93)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Multi-agent Reasoning Framework for Video Question Answering

    Abhi Kamboj, Gaurav Kumar, Krista Holden, Madhumitha Saravanan, Pradyumna Narayana · PDF
  2. Agent Context Protocols Enhance Collective Inference

    Arjun Beniwal, Devansh Bhardwaj, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R Narasimhan, Ameet Deshpande, Vishvak Murahari · PDF
  3. AgentCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration

    Harish Karthikeyan, Yue Guo, Udari Madhushani Sehwag, Leo de Castro, Antigoni Polychroniadou, Leo Ardon, Sumitra Ganesh · PDF
  4. Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios

    Hareeshwar Karthikeyan · PDF
  5. AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

    Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · PDF
  6. All Life is Problem Creation: Learning to Generate Environments that Maximize Performance Gain

    Titas Anciukevičius, Yuhui Wang, Piotr Piękos, Li Nanbo, Wenyi Wang, Jürgen Schmidhuber · PDF
  7. Are LLMs Generalist Hanabi Agents?

    Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege · PDF
  8. Automated Specialization of Stateful Agent Systems

    Myan Vu, Harrish Ayyanar, Pang Jiang, Anwiketh Reddy, Mayank Goel, Kevin Zhu · PDF
  9. Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

    Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alexander F Spies, Alessandra Russo, Michael D Dennis · PDF
  10. BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

    Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Ying Li, Aditi Bhaskar, Mohammed Zaman, Noah Goodman · PDF
  11. BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair

    Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen · PDF
  12. Characterizing Deep Research: A Benchmark and Formal Definition

    Abhinav Java, Ashmit Khandelwal, Sukruta Prakash Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma · PDF
  13. ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning

    Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li · PDF
  14. Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula

    Brennen Hill · PDF
  15. Code2MCP: Transforming Code Repositories into MCP Services

    Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Shaowu Pan, Min-Ling Zhang · PDF
  16. CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs

    Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein · PDF
  17. Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

    Brennen Hill, Mant Koh En Wei, Jishnuanandh Thangavel · PDF
  18. CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents

    Hanqing Yang, Narjes Nourzad, Shiyu Chen, Carlee Joe-Wong · PDF
  19. DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

    Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan V. Shah, Robert D. Hawkins, Junjie Hu, Timothy T. Rogers · PDF
  20. DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

    Chiyu Zhang, Marc-Alexandre Côté, Michael Albada, Anush Sankaran, Jack W Stokes, Tong Wang, Amir H. Abdi, William Blum, Muhammad Abdul-Mageed · PDF
  21. Enabling multi-agent collaboration in knowledge graph environments

    Iñaki Arango, Ayush Noori, Lucas Vittor, Joaquin Polonuer, Marinka Zitnik · PDF
  22. Enabling User-Created Multi-Agent Simulations: Interactive and Customizable 2D Environments to Study Team Dynamics with LLM Agents

    Mohammed Almutairi, Charles Chiang, Haoze Guo, Nandini Banerjee, Maria Milkowski, Daniel Nguyen, Michael G Yankoski, Tim Weninger, Svitlana Volkova, Trenton W. Ford, Diego Gomez-Zara · PDF
  23. EVOLVE-MEM: A Self-Adaptive Hierarchical Memory Architecture for Next-Generation Agentic AI Systems

    Rishi Ashish Shah, Ujjwal Kakar, Shashvat Singhal, Dinesh K Vishwakarma · PDF
  24. Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning

    Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah, Kevin Zhu · PDF
  25. Exploring Personality Trait Change of LLM-Based AI Systems

    Yuhan Ma, Junjie Wang · PDF
  26. Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation

    Aleksei Kudrinskii, Saibo Geng, Luca Beurer-Kellner, Marc Fischer · PDF
  27. Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL

    Shreyas Singh, Kunal Singh, Pradeep Moturi · PDF
  28. GEM: A Gym for Agentic LLMs

    Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Haotian Xu, Simon Yu, Chenmien Tan, Shaopan Xiong, Weixun Wang, Bo Liu, Hao Zhu, Weiyan Shi, Diyi Yang, Wee Sun Lee, Min Lin · PDF
  29. GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

    Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz · PDF
  30. Go-Browse: Training Web Agents with Structured Exploration

    Apurva Gandhi, Graham Neubig · PDF
  31. GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge

    Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Jiaoyan Chen, Steffen Staab, Yuan He, Evgeny Kharlamov · PDF
  32. GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

    Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Guohao Li, Zhen Han, Volker Tresp · PDF
  33. IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in Industrial Automation

    Xiaoran Yang, Yuyang Du, Kexin Chen, Soung Chang Liew, Jiamin Lu, Ziyu Guo, Xiaoyan Liu, Qun Yang, Shiqi Xu, Xingyu Fan, Yuchen Pan, Taoyong Cui, Hongyu Deng, Boris Düdder, Jianzhang Pan, Qun Fang, Pheng-Ann Heng · PDF
  34. Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties

    Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu · PDF
  35. Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI

    Christopher Lohse, Adrian Selk, Amadou Ba, Jonas Wahl, Marco Ruffini · PDF
  36. LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

    Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin · PDF
  37. LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

    Yiming Wang, Da Yin, Yuedong Cui, Zhiqian Li, Ruichen Zheng, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang · PDF
  38. Ludax: A GPU-Accelerated Domain Specific Language for Board Games

    Graham Todd, Alexander George Padula, Dennis J. N. J. Soemers, Julian Togelius · PDF
  39. MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization

    Yichen Han, Bojun Liu, Zhengpeng Zhou, Guanyu Liu, Zeng Zhang, Yang Yang, Wenli Wang, Isaac N Shi, Yunyan, Lewei He, Tianyu Shi · PDF
  40. MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

    Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty · PDF
  41. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

    Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow · PDF
  42. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

    Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li · PDF
  43. MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

    Darshan Girish Deshpande, Varun Prashant Gangal, Hersh Mehta, Jędrzej Rosłaniec, Anand Kannappan, Rebecca Qian, Peng Wang · PDF
  44. MIRAI: Evaluating LLM Agents for International Event Forecasting

    Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang · PDF
  45. Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji · PDF
  46. Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation

    Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang · PDF
  47. Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications

    Aditi Tiwari, Akshit Bhalla · PDF
  48. Natural Language Grounded Reinforcement Learning for Clinical Decision-Making in Virtual Patient Simulations

    Niyel Hassan, Benjamin Liu, Jason Tsai, Jeffrey K Jopling, Dana Lin, Edward Melcer, Cara Liebert · PDF
  49. On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems

    Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong · PDF
  50. OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

    Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang · PDF
  51. Paper2Video: Automatic Video Generation from Scientific Papers

    Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou · PDF
  52. Player-Coach Teamwork: Multi-agent Collaboration for Improving LLM Reasoning

    Heewon Park, Minhae Kwon · PDF
  53. PrivacyMAS: A Privacy-Preserving Multi-Agent System Framework

    Maryam Fatima · PDF
  54. Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

    Jacopo Teneggi, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar · PDF
  55. PuzzleJAX: A Benchmark for Reasoning and Learning

    Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Zehua Jiang, Muhammad Umair Nasir, Andrzej Banburski-Fahey, Julian Togelius · PDF
  56. RAISE: Reliable Agent Improvement via Simulated Experience

    Sahar Omidi Shayegan, Joshua Meyer, Victor Shih, Sebastian Sosa, Tianyi Peng, Kostis Kaffes, Eugene Wu, Andi Partovi, Mehdi Jamei · PDF
  57. RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

    Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu · PDF
  58. ReMAC: Large Language Model-Driven Reward Design for Multi-Agent Manipulation Collaboration

    Pengyi Li, Hongyao Tang, Yifu Yuan, Jianye Hao · PDF
  59. Revisiting Boids for Emergent Intelligence via Multi-Agent Collaborative Tool-Building

    Xisen Wang, Qi Zhang · PDF
  60. Revisiting Uncertainty Estimation and Calibration of Large Language Models

    Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu · PDF
  61. RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines

    Pengfei Yu, Dongming Shen, Silin Meng, Jaewon Lee, Weisu Yin, Andrea Yaoyun Cui, Zhenlin Xu, Yi Zhu, Xingjian Shi, Mu Li, Alex Smola · PDF
  62. Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey

    Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Zhiyuan Fan, Yi R. Fung · PDF
  63. Scaling Open-Ended Reasoning to Predict the Future

    Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping · PDF
  64. SEA: Stateful Execution Environment for Conversational Big Data Analytics

    Rohit Kumar, Ajay Anil Kumar · PDF
  65. SEDM: Scalable Self-Evolving Distributed Memory for Agents

    Haoran Xu, Jiacong Hu, Zhang Ke, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, Tianyu Shi · PDF
  66. See, Think, Act: Online Shopper Behavior Simulation with VLM Agents

    Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Jing Huang, Dakuo Wang · PDF
  67. Shaping Smart Personal Assistants through Generative Interactive Environments for Scalable Design and Evaluation

    Ziyi Xuan, Yiwen Wu, Vinod Namboodiri, Yu Yang · PDF
  68. Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

    Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Dakuo Wang · PDF
  69. Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning

    Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li · PDF
  70. SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models

    Xinxing Ren, Qianbo Zang, Zekun Guo · PDF
  71. SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

    Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark · PDF
  72. Steering Diffusion Policies with Value-Guided Denoising

    Hanming Ye · PDF
  73. The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents

    Mariana Meireles, Rupali Bhati, Niklas Lauffer, Cameron Allen · PDF
  74. The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum

    Brennen Hill · PDF
  75. Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar · PDF
  76. Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning

    Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Märtens, Julien Fauqueur · PDF
  77. Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation

    Maria Emilia Mazzolenis, Ruirui Zhang · PDF
  78. TutorTest: Evaluating Language Model-based Tutoring Policies Using Surrogate Tasks

    Aishwarya Mandyam · PDF
  79. Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

    Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Tong Che, Dimitris N. Metaxas · PDF
  80. UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

    Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach · PDF
  81. UserBench: An Interactive Gym Environment for User-Centric Agents

    Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang · PDF
  82. VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills

    Erik M. Lintunen · PDF
  83. Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow

    Gabrielle Gaudeau, Shinnosuke Tanaka, Defne Circi, Ian W Kennedy, Movina Moses, Mohab Elkaref · PDF
  84. VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

    Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou · PDF
  85. Vision-Language Models Unlock Task-Centric Latent Actions

    Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Lyubaykin Nikita, Vladislav Kurenkov · PDF
  86. WebArena Verified: Reliable Evaluation for Web Agents

    Amine El hattami, Megh Thakkar, Nicolas Chapados, Christopher Pal · PDF
  87. What Limits Agentic Systems Efficiency?

    Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman · PDF
  88. What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

    Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang · PDF
  89. When Agents go Astray: Course-Correcting SWE Agents with PRMs

    Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk · PDF
  90. When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents

    Matous Kozak, Roshanak Zilouchian Moghaddam, Kalpathy Sivaraman · PDF
  91. You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

    Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu · PDF
  92. YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models

    Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen · PDF
  93. Zephyrus: An Agentic Framework for Weather Science

    Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Ruijia Niu, Yasaman Jafari, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu · PDF