
First Workshop on Multi-Turn Interactions in Large Language Models

MTI-LLM @ NeurIPS 2025

Submission deadline: Sep 3, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (122)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

    Deyu Zou, Yongqiang Chen, Jianxiang Wang, Garry YANG, Mufei Li, Qing Da, Pan Li, Yu Gong, James Cheng · PDF
  2. $\textit{The Traitors}$: Deception and Trust in Multi-Agent Language Model Simulations

    Pedro M. P. Curvo · PDF
  3. A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

    Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang, Xiangliang Zhang, Philip S. Yu, Xiyang Hu, Yue Zhao · PDF
  4. A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

    Ruiyi Wang, Prithviraj Ammanabrolu · PDF
  5. A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation

    Hong Je-Gal, Chanbin YI, Hyun-Suk Lee · PDF
  6. AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

    Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Ahan M R · PDF
  7. AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

    Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · PDF
  8. AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

    María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, Mario Leiva, Luca Nicolás Forziati Gangi, Eitan Sprejer, Francisca Gauna Selasco, Juan Gustavo Corvalan, Maria Vanina Martinez, Gerardo Simari · PDF
  9. Alignment via Competition: Emergent Alignment from Differently Misaligned Agents

    Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi · PDF
  10. Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting

    Shashidhar Reddy Javaji, Bhavul Gauri, Zining Zhu · PDF
  11. Are LLMs Generalist Hanabi Agents?

    Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege · PDF
  12. AsymPuzl: An Asymmetric Puzzle for multi-agent cooperation

    Xavier Cadet, Edward Koh, Peter Chin · PDF
  13. AURA: A Diagnostic Framework for Tracking User Satisfaction of Interactive Planning Agents

    Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür · PDF
  14. Automating Deception: Scalable Multi-Turn LLM Jailbreaks

    Adarsh Kumarappan, Ananya Mujoo · PDF
  15. Benchmarking Correctness and Security in Multi-Turn Code Generation

    Ruchit Rawal, Jeffrey Yang Fan Chiang, Jeffery Siyuan Tian, Aastha Mahajan, Tom Goldstein, Yizheng Chen · PDF
  16. Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

    Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu · PDF
  17. BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Sahel Sharifymoghaddam, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin · PDF
  18. BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks

    Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani · PDF
  19. CaRT: Teaching LLM Agents to Know When They Know Enough

    Grace Liu, Yuxiao Qu, Jeff Schneider, Aarti Singh, Aviral Kumar · PDF
  20. CEDA: Cross-modal Evaluation through Debate Agents for Robust Hallucination Detection

    Susmit Neogi, Wang Yun · PDF
  21. Characterization and Detection of Incompleteness and Ambiguity in Multi-Turn Interactions with LLMs

    Riya Naik, Ashwin Srinivasan, Swati Agarwal, Estrid He · PDF
  22. ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

    Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, hong yu · PDF
  23. Collaborative Prediction: Tractable Information Aggregation via Agreement

    Natalie Collina, Ira Globus-Harris, Surbhi Goel, Varun Gupta, Aaron Roth, Mirah Shi · PDF
  24. ConDABench: Interactive Evaluation of Language Models for Data Analysis

    Avik Dutta, Priyanshu Gupta, Hosein Hasanbeig, Rahul Pratap Singh, Harshit Nigam, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari · PDF
  25. Conformity, Inertia, and Value Alignment in Multi-Turn LLM Deliberation

    Pratik S. Sachdeva, Tom van Nuenen · PDF
  26. CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

    Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · PDF
  27. CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

    Yilong Lai, Yipin Yang, Jialong Wu, Zhenglin Wang, Ting Liang, Linjianguo, Keping Yang · PDF
  28. Customer-R1: personalized simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

    Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang · PDF
  29. Delay-of-Gratification as a Multi-Agent Survival Micro-benchmark for Long-Horizon LLMs: Social Exposure, Personas, and Tool Use Budgets

    Olga Manakina, Igor Bogdanov, Chung-Horng Lung · PDF
  30. DeLLMphi: A Multi-Turn Method for Multi-Agent Forecasting

    Andrew Robert Williams, Martin Weiss, Victoria Feere, Nasim Rahaman, Hugo Larochelle · PDF
  31. Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

    Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros, Matthew Lyle Olson, Tyler Marques · PDF
  32. Disclosure Audits for LLM Agents

    Saswat Das, Jameson Sandler, Ferdinando Fioretto · PDF
  33. Do Large Language Models Defend Their Beliefs Consistently?

    Arka Pal, Arthur Liang, Teo Kitanovski, Akilesh Potti, Micah Goldblum · PDF
  34. Efficient Reinforcement Learning for Optimizing Multi-turn Student Outcomes with LLM Tutors

    HyunJi Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar · PDF
  35. ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

    Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do, Yicheng Fu, Vasu Sharma, Sean O'Brien, Kevin Zhu · PDF
  36. Estimating the Empowerment of Language Model Agents

    Jinyeop Song, Jeff Gore, Max Kleiman-Weiner · PDF
  37. ExploraTutor: A Dataset for Children’s Exploratory Dialogue by Integrating Multiple Educational theories

    Siqi Xie, Yaxin Xu · PDF
  38. Exploring exploration with foundation agents in interactive environments

    Daniel P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, John Reid, David P Reichert, Drew A. Hudson, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Curtis Mozer, Jane X Wang · PDF
  39. Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL

    Shreyas Singh, Kunal Singh, Pradeep Moturi · PDF
  40. Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

    Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman · PDF
  41. FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

    Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu · PDF
  42. Goal Alignment in LLM-Based User Simulators for Conversational AI

    Shuhaib Mehri · PDF
  43. Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

    Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody · PDF
  44. How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench

    Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral · PDF
  45. How to Train Your LLM Web Agent: A Statistical Diagnosis

    Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia · PDF
  46. Improved Multi-Agent Collaboration with Multi-Turn Reinforcement Learning

    Shuo Liu, Tianle Chen, Christopher Amato · PDF
  47. Improving Language Agents through BREW: Bootstrapping expeRientially-learned Environmental knoWledge

    Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai, Roshni Iyer, Sumit Gulwani, Gustavo Soares · PDF
  48. Interleaved Reasoning for Large Language Models via Reinforcement Learning

    Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra · PDF
  49. It's LIT! Reliability-Optimized LLMs with Inspectable Tools

    Ruixin Zhang, Jon Donnelly, Zhicheng Guo, Ghazal Khalighinejad, Haiyang Huang, Alina Jade Barnett, Cynthia Rudin · PDF
  50. Language Models Rate Their Own Actions As Safer

    Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger · PDF
  51. Large Language Models Develop Novel Social Biases Through Adaptive Exploration

    Addison J. Wu, Ryan Liu, Xuechunzi Bai, Thomas L. Griffiths · PDF
  52. Learning to be Proactive from Missed User-Signals in Multi-turn Dialogues

    Saba Rahimi, Sivapriya Vellaichamy, Kelly Patel, Thomas Cook, Zhen Zeng, Sumitra Ganesh · PDF
  53. Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

    Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira · PDF
  54. Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback

    Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li · PDF
  55. Leveraging In-Context Learning for Language Model Agents

    Shivanshu Gupta, Sameer Singh, Ashish Sabharwal, Tushar Khot, Ben Bogin · PDF
  56. LLM Rationalis? Measuring bargaining capabilities of AI negotiators

    Cheril Shah, Akshit Agarwal, Kanak Garg, Mourad Heddaya · PDF
  57. MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

    Emre Can Acikgoz, Jinoh Oh, Joo Hyuk Jeon, Jie Hao, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan · PDF
  58. MAREval: A Multi-Agent Framework for Evaluating Natural Language Recommendation Explanations

    Reza Yousefi Maragheh, Jayesh Uddhav Kudase, Aysenur Inan, Ramin Giahi, Kai Zhao, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar · PDF
  59. MELISSA: Multi-level Evaluation with LLM-based Integrated Self-Scrutiny and Auditing

    Amirhossein Afsharrad, Sri Jaladi, Nima Yazdani, Ali Ansari, Seyed Shahabeddin Mousavi, Sanjay Lall · PDF
  60. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang · PDF
  61. Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

    Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, TIANSHU ZHANG, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su · PDF
  62. Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

    Jiahang He, Rishi Ramachandran, Neel Ramachandran, Aryan Katakam, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Aryan Shrivastava · PDF
  63. Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

    Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Dakuo Wang · PDF
  64. Multi-Turn Human–LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol

    Harshvardhan Mestha, Karan Bania, Shreyas V, Sidong Liu, Ashwin Srinivasan · PDF
  65. Multi-Turn LLM Systems for Diagnostic Decision-Making: Considerations, Biases, and Challenges

    Benjamin Liu, Sejong Kim, Drona Thoka, Varun Puttagunta, Kaylin Sheng, Mark Li, Kiran Nijjer, Adnan Ahmed, Thi Uyen Hanh Le, Sai Chidvilas Gudiboina, Ali Ugur, Kevin Zhu · PDF
  66. MultiScale Contextual Bandits for Long Term Objectives

    Richa Rastogi, Yuta Saito, Thorsten Joachims · PDF
  67. ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

    Hyunjun Kim, Junwoo Ha, Haon Park, Sangyoon Yu · PDF
  68. Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

    Melik Ozolcer, Sang Won Bae · PDF
  69. One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning

    Ritesh Goru, Shanay Mehta, Prateek Jain · PDF
  70. Open-Universe Assistance Games

    Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell · PDF
  71. OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

    Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang · PDF
  72. Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies

    Aksel Joonas Reedi, Corentin Léger, Julien Pourcel, Loris Gaven, Perrine Charriau, Guillaume Pourcel · PDF
  73. OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs

    Yifu Lu, Shengjie Liu, Li Dong · PDF
  74. Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks

    Lukas Beckenbauer, Johannes-Lucas Löwe, Ge Zheng, Alexandra Brintrup · PDF
  75. ParetoMIL: Early Risk Detection in Dialogue under Weak Supervision

    Avinash Baidya, Xinran Liang, Ruocheng Guo, Kamalika Das, Xiang Gao · PDF
  76. PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

    Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li · PDF
  77. Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

    Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür · PDF
  78. Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

    Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien · PDF
  79. PrefDisco: Evaluating Proactive Personalization through Interactive Preference Discovery

    Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov · PDF
  80. Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

    Shuhang Xu, Weijian Deng, Yixuan Zhou, Fangwei Zhong · PDF
  81. PyVision: Agentic Vision with Dynamic Tooling

    Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei · PDF
  82. Quantifying Information Gain and Redundancy in Multi-Turn LLM Conversations

    Abhiram Rao Gorle, Amit Kumar Singh Yadav, Tsachy Weissman · PDF
  83. RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

    Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Yuhui Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, Daben Liu · PDF
  84. RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

    Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu · PDF
  85. RefineBench: Evaluating Refinement Capability in Language Models

    Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi · PDF
  86. REFRAG: Rethinking RAG based Decoding

    Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan · PDF
  87. Reinforced Reasoning for Interactive Multi-step Embodied Planning

    Di Wu, Jiaxin Fan, Junzhe Zang, Guanbo Wang, Wei Yin, Wenhao Li, Bo Jin · PDF
  88. Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

    Vivek Kalyan, Martin Andrews · PDF
  89. Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design and Credit Assignment

    Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, Mingyi Hong · PDF
  90. Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

    Juntu Zhao, Jialing Zhang, Chongxuan Li, Dequan Wang · PDF
  91. Scalability of LLM-Based Multi-Agent Systems for Scientific Code Generation: A Preliminary Study

    Yuru wang, Kaiyan Zhang, Kai Tian, Sihang Zeng, Xingtai Lv, Ning Ding, Biqing Qi, Bowen Zhou · PDF
  92. Semantic Context for Tool Orchestration

    Robert Müller · PDF
  93. SENTINEL: Sentiment Evolution and Narrative Tracking in Extended LLM Interactions

    Pranav Anuraag, Ethan Xu, Alexander Arutchev, Asher Nerenberg · PDF
  94. Show or Tell? Interactive Task Learning with Large Language Models

    Jacob Sansom, Muhammad Khalifa, Honglak Lee, Joyce Chai · PDF
  95. SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

    Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun MA, Bo An · PDF
  96. SkyRL-SQL: Multi-turn SQL Data Agents via RL

    Shu Liu, Alan Zhu, Sumanth Hegde, Shiyi Cao, Shuo Yuan, Samion Suwito, Tyler Griggs, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica · PDF
  97. SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning

    Aayush Aluru, Myra N. Malik, Samarth Patankar, Spencer Kim, Kevin Zhu, Vasu Sharma, Sean O'Brien · PDF
  98. Sotopia-RL: Reward Design for Social Intelligence

    Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You · PDF
  99. Stability of Preference Alignment for Multi-Turn Control with LLM Policies

    Andrew Silva, Pradyumna Tambwekar, Deepak Edakkattil Gopinath, Jonathan DeCastro, Guy Rosman, Avinash Balachandran · PDF
  100. StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production–Living Simulations with Stardew Valley

    Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Li JiaGeng, Yitian Hong, Xinrun Wang, Bo An · PDF
  101. State-Induced Risk Amplification of AI Agents

    Rebecka Nordenlöw, Takayuki Osogami, Lauren Quigley, Sara E. Berger, Rachel K. E. Bellamy · PDF
  102. Stop-RAG: Value-Based Retrieval Control for Iterative RAG

    Jaewan Park, Solbee Cho, Jay-Yoon Lee · PDF
  103. Studying Coordination and Collusion in Multi-Agent LLM Code Reviews

    Jennifer Za, Aristeidis Panos, Roger Dearnaley, Samuel Albanie · PDF
  104. Task Completion Agents are Not Ideal Collaborators

    Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag · PDF
  105. Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

    Chenxing Wei, Hong Wang, Ying Tiffany He, Fei Yu, Yao Shu · PDF
  106. The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets

    Shenzhe Zhu, Jiao Sun, Yi Nian, Tobin South, Alex Pentland, Jiaxin Pei · PDF
  107. The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

    Shivam Ratnakar, Sanjay Raghavendra · PDF
  108. The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

    Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping · PDF
  109. The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents

    Mariana Meireles, Rupali Bhati, Niklas Lauffer, Cameron Allen · PDF
  110. Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar · PDF
  111. TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

    Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu · PDF
  112. ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

    Vaskar Nath, Pranav Vishnu Raja, Jane Yu, Claire Yoon, Sean M. Hendryx · PDF
  113. Toward Community-Driven Agents for Machine Learning Engineering

    Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang · PDF
  114. Tracing Coordination Dynamics in Multi-Turn LLM Discussions

    Angelina Parfenova, Jürgen Pfeffer, Alexander Denzler · PDF
  115. Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation

    Maria Emilia Mazzolenis, Ruirui Zhang · PDF
  116. User-Assistant Bias in LLMs

    Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie · PDF
  117. Verlog: Context-lite Multi-turn Reinforcement Learning framework for Long-Horizon LLM Agents

    Wentse Chen, Jiayu Chen, Hao Zhu, Jeff Schneider · PDF
  118. VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou · PDF
  119. WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation

    Yaoyao Qian, Yuanli Wang, Jinda Zhang, Yun Zong, Meixu Chen, Hanhan Zhou, Jindan Huang, Yifan Zeng, Xinyu Hu, Chan Hee Song, Danqing Zhang · PDF
  120. WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

    Yuxuan Lu, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Dakuo Wang · PDF
  121. What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

    Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang · PDF
  122. WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

    Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe, Spencer Kim, Vasu Sharma, Sean O'Brien, Kevin Zhu · PDF