First Workshop on Multi-Turn Interactions in Large Language Models
MTI-LLM @ NeurIPS 2025
- Submission deadline: Sep 3, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (122)
Fetched from OpenReview (v2) on 2026-06-10.
- $\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
- $\textit{The Traitors}$: Deception and Trust in Multi-Agent Language Model Simulations
- A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
- A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
- A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
- AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
- Alignment via Competition: Emergent Alignment from Differently Misaligned Agents
- Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting
- Are LLMs Generalist Hanabi Agents?
- AsymPuzl: An Asymmetric Puzzle for multi-agent cooperation
- AURA: A Diagnostic Framework for Tracking User Satisfaction of Interactive Planning Agents
- Automating Deception: Scalable Multi-Turn LLM Jailbreaks
- Benchmarking Correctness and Security in Multi-Turn Code Generation
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
- BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
- CaRT: Teaching LLM Agents to Know When They Know Enough
- CEDA: Cross-modal Evaluation through Debate Agents for Robust Hallucination Detection
- Characterization and Detection of Incompleteness and Ambiguity in Multi-Turn Interactions with LLMs
- ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
- Collaborative Prediction: Tractable Information Aggregation via Agreement
- ConDABench: Interactive Evaluation of Language Models for Data Analysis
- Conformity, Inertia, and Value Alignment in Multi-Turn LLM Deliberation
- CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
- CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories
- Customer-R1: personalized simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
- Delay-of-Gratification as a Multi-Agent Survival Micro-benchmark for Long-Horizon LLMs: Social Exposure, Personas, and Tool Use Budgets
- DeLLMphi: A Multi-Turn Method for Multi-Agent Forecasting
- Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
- Disclosure Audits for LLM Agents
- Do Large Language Models Defend Their Beliefs Consistently?
- Efficient Reinforcement Learning for Optimizing Multi-turn Student Outcomes with LLM Tutors
- ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models
- Estimating the Empowerment of Language Model Agents
- ExploraTutor: A Dataset for Children’s Exploratory Dialogue by Integrating Multiple Educational theories
- Exploring exploration with foundation agents in interactive environments
- Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
- FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management
- Goal Alignment in LLM-Based User Simulators for Conversational AI
- Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
- How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench
- How to Train Your LLM Web Agent: A Statistical Diagnosis
- Improved Multi-Agent Collaboration with Multi-Turn Reinforcement Learning
- Improving Language Agents through BREW: Bootstrapping expeRientially-learned Environmental knoWledge
- Interleaved Reasoning for Large Language Models via Reinforcement Learning
- It's LIT! Reliability-Optimized LLMs with Inspectable Tools
- Language Models Rate Their Own Actions As Safer
- Large Language Models Develop Novel Social Biases Through Adaptive Exploration
- Learning to be Proactive from Missed User-Signals in Multi-turn Dialogues
- Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
- Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback
- Leveraging In-Context Learning for Language Model Agents
- LLM Rationalis? Measuring bargaining capabilities of AI negotiators
- MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations
- MAREval: A Multi-Agent Framework for Evaluating Natural Language Recommendation Explanations
- MELISSA: Multi-level Evaluation with LLM-based Integrated Self-Scrutiny and Auditing
- MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
- Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
- Modeling and Predicting Multi-Turn Answer Instability in Large Language Models
- Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
- Multi-Turn Human–LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol
- Multi-Turn LLM Systems for Diagnostic Decision-Making: Considerations, Biases, and Challenges
- MultiScale Contextual Bandits for Long Term Objectives
- ObjexMT: Objective Extraction and Metacognitive Calibration for LLM‑as‑a‑Judge under Multi‑Turn Jailbreaks
- Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users
- One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
- Open-Universe Assistance Games
- OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
- Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies
- OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs
- Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks
- ParetoMIL: Early Risk Detection in Dialogue under Weak Supervision
- PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
- Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models
- Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
- PrefDisco: Evaluating Proactive Personalization through Interactive Preference Discovery
- Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs
- PyVision: Agentic Vision with Dynamic Tooling
- Quantifying Information Gain and Redundancy in Multi-Turn LLM Conversations
- RAFFLES: Reasoning-based Attribution of Faults for LLM Systems
- RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- RefineBench: Evaluating Refinement Capability in Language Models
- REFRAG: Rethinking RAG based Decoding
- Reinforced Reasoning for Interactive Multi-step Embodied Planning
- Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
- Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design and Credit Assignment
- Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games
- Scalability of LLM-Based Multi-Agent Systems for Scientific Code Generation: A Preliminary Study
- Semantic Context for Tool Orchestration
- SENTINEL: Sentiment Evolution and Narrative Tracking in Extended LLM Interactions
- Show or Tell? Interactive Task Learning with Large Language Models
- SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
- SkyRL-SQL: Multi-turn SQL Data Agents via RL
- SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning
- Sotopia-RL: Reward Design for Social Intelligence
- Stability of Preference Alignment for Multi-Turn Control with LLM Policies
- StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production–Living Simulations with Stardew Valley
- State-Induced Risk Amplification of AI Agents
- Stop-RAG: Value-Based Retrieval Control for Iterative RAG
- Studying Coordination and Collusion in Multi-Agent LLM Code Reviews
- Task Completion Agents are Not Ideal Collaborators
- Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
- The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
- The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
- ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
- Toward Community-Driven Agents for Machine Learning Engineering
- Tracing Coordination Dynamics in Multi-Turn LLM Discussions
- Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation
- User-Assistant Bias in LLMs
- Verlog: Context-lite Multi-turn Reinforcement Learning framework for Long-Horizon LLM Agents
- VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
- WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
- WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
- WOLF: Werewolf-based Observations for LLM Deception and Falsehoods