Workshop on Scaling Environments for Agents
SEA @ NeurIPS 2025
- Submission deadline: Sep 3, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (93)
Fetched from OpenReview (v2) on 2026-06-10.
- A Multi-agent Reasoning Framework for Video Question Answering
- Agent Context Protocols Enhance Collective Inference
- AgentCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration
- Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
- All Life is Problem Creation: Learning to Generate Environments that Maximize Performance Gain
- Are LLMs Generalist Hanabi Agents?
- Automated Specialization of Stateful Agent Systems
- Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs
- BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
- BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair
- Characterizing Deep Research: A Benchmark and Formal Definition
- ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning
- Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula
- Code2MCP: Transforming Code Repositories into MCP Services
- CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs
- Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models
- CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents
- DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates
- DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
- Enabling multi-agent collaboration in knowledge graph environments
- Enabling User-Created Multi-Agent Simulations: Interactive and Customizable 2D Environments to Study Team Dynamics with LLM Agents
- EVOLVE-MEM: A Self-Adaptive Hierarchical Memory Architecture for Next-Generation Agentic AI Systems
- Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning
- Exploring Personality Trait Change of LLM-Based AI Systems
- Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation
- Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL
- GEM: A Gym for Agentic LLMs
- GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
- Go-Browse: Training Web Agents with Structured Exploration
- GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge
- GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
- IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in Industrial Automation
- Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties
- Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI
- LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
- LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
- Ludax: A GPU-Accelerated Domain Specific Language for Board Games
- MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization
- MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
- MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
- MIRAI: Evaluating LLM Agents for International Event Forecasting
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
- Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications
- Natural Language Grounded Reinforcement Learning for Clinical Decision-Making in Virtual Patient Simulations
- On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems
- OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
- Paper2Video: Automatic Video Generation from Scientific Papers
- Player-Coach Teamwork: Multi-agent Collaboration for Improving LLM Reasoning
- PrivacyMAS: A Privacy-Preserving Multi-Agent System Framework
- Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents
- PuzzleJAX: A Benchmark for Reasoning and Learning
- RAISE: Reliable Agent Improvement via Simulated Experience
- RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- ReMAC: Large Language Model-Driven Reward Design for Multi-Agent Manipulation Collaboration
- Revisiting Boids for Emergent Intelligence via Multi-Agent Collaborative Tool-Building
- Revisiting Uncertainty Estimation and Calibration of Large Language Models
- RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines
- Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey
- Scaling Open-Ended Reasoning to Predict the Future
- SEA: Stateful Execution Environment for Conversational Big Data Analytics
- SEDM: Scalable Self-Evolving Distributed Memory for Agents
- See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
- Shaping Smart Personal Assistants through Generative Interactive Environments for Scalable Design and Evaluation
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning
- SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
- Steering Diffusion Policies with Value-Guided Denoising
- The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents
- The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning
- Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation
- TutorTest: Evaluating Language Model-based Tutoring Policies Using Surrogate Tasks
- Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning
- UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
- UserBench: An Interactive Gym Environment for User-Centric Agents
- VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills
- Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow
- VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
- Vision-Language Models Unlock Task-Centric Latent Actions
- WebArena Verified: Reliable Evaluation for Web Agents
- What Limits Agentic Systems Efficiency?
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
- When Agents go Astray: Course-Correcting SWE Agents with PRMs
- When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents
- You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
- YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models
- Zephyrus: An Agentic Framework for Weather Science