ICLR 2026 Past Agents

Agentic AI in the Wild: From Hallucinations to Reliable Autonomy

Reliable_Autonomy

Submission deadline: Feb 6, 2026, 23:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (65)

Fetched from OpenReview (v2) on 2026-06-10.

  1. “TINY” Silent Hallucinations in Agentic AI: Hidden Failure Modes in Autonomous Systems

    Mahule Roy, Subhas Roy
  2. A Unified Definition of Hallucination: It’s The World Model, Stupid!

    Emmy Liu, Varun Prashant Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karanpartap Singh, Sachin Kumar, Steven Y. Feng
  3. Adversarial Iterative Unit Test Generation with Large Language Models

    Dongjun Lee, Juyong Lee, Changho Hwang, Kimin Lee
  4. AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

    Xuannan Liu, Xiao Yang, Zekun Li, Pei Pei Li, Ran He
  5. Agentic Pressure: The Endogenous Entropy of Reliable Autonomy

    Hengle Jiang, Ziying Luo, Ke Tang
  6. AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring

    Chun Chet Ng, Zhen Hao Chu, Jia Yu Lim, Boon Yin Yin, Low Wei Zeng, Jin Khye Tan
  7. Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

    Bardia Mohammadi, Nearchos Potamitis, Lars Henning Klein, Akhil Arora, Laurent Bindschaedler
  8. AutoBaxBuilder: Bootstrapping Code Security Benchmarking

    Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev
  9. Behavioral Continuity in Agentic LLMs: An Engineering Mental Structure Approach

    Ning Coeva
  10. Building Reliable Long-Form Generation via Hallucination Rejection Sampling

    Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal
  11. CA-BED: Conversation-Aware Bayesian Experimental Design

    Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni
  12. Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search

    Jacopo Minniti, Neil Band, Tim G. J. Rudner
  13. CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

    Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev
  14. CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems

    Kangkang Sun, Jun Wu, Jianhua Li, Minyi Guo, Xiuzhen Chen, Jianwei Huang
  15. Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

    Auksarapak Kietkajornrit, Nima Asgharbeygi, Jad Tarifi
  16. Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making

    Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder
  17. Don't Do That!: Guiding Embodied Systems through Large Language Model-based Constraint Generation

    Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche
  18. DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou
  19. E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

    Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen
  20. Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

    Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov
  21. Efficient Hallucination Detection in Automatic Code Generation

    Georgii Andriushchenko, Roman Garaev, Lyudmila Rvanova, Artem Shelmanov, Vladimir V. Ivanov
  22. Entropy Jurisprudence: Auditing Procedural Fidelity in LLM Normative Reasoning

    Chen Xiwei
  23. Epistemic Context Learning: Building Trust the Right Way in LLM Multi-Agent Systems

    Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zoey Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan
  24. Escaping the Mode: Multi Answer Reinforcement Learning in LMs

    Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
  25. Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

    Thibaud Gloaguen, Niels Mündler, Mark Niklas Mueller, Veselin Raychev, Martin Vechev
  26. From Bandit Regret to FDR Control: Online Selective Generation with Feedback Unlocking

    Minjae Lee, Yoonjae Jung, Sangdon Park
  27. From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator

    Brian Grinstead, Mariana Meireles, Christoph Kerschbaumer, Cameron Allen
  28. GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

    Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar
  29. HallucinationHunter: Fine-Grained Factual Grounding of Generated Text

    Peter Belcak, Yu Zhang, Shizhe Diao, David Eric Austin, Yonggan Fu, Ryan Angilly, Yingyan Celine Lin, Eileen Margaret Peters Long, Pavlo Molchanov
  30. Hierarchical Procedural Meta-Reasoning for Generalizable Multimodal Agents

    Yao Fu, Shengyi Qian, Pierluca D'Oro, Fanyi Xiao, Honglak Lee, Joseph Tighe, Manchen Wang
  31. Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

    Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi
  32. LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

    Amin Rakhsha, Thomas Hehn, Pietro Mazzaglia, Fabio Valerio Massoli, Arash Behboodi, Tribhuvanesh Orekondy
  33. Measuring Agents in Production

    Melissa Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis
  34. MedMMV: A Controllable Multimodal Multi-Agent Framework for Reliable and Verifiable Clinical Reasoning

    Hongjun Liu, Yinghao Zhu, Yuhui Wang, Yitao Long, Dennis Shasha, Lequan Yu, Chen Zhao
  35. Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

    Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Tianle Cai, Wenhao Huang
  36. No One Monitor Fits All: Oversight Strategies for Frontier Agents

    Neil Kale, Shashwat Saxena, Ziqian Zhong, Chen Henry Wu, Aditi Raghunathan
  37. Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization

    Junyoung Yang, Kyungmin Kim, Sangdon Park
  38. OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

    Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim
  39. Owl: Separating Generation from Evaluation to Detect Plausible Failures in Lifecycle Inventory Mapping

    Andrew Dumit, Krishna Rao, Shaena Ulissi, Steven Watson, P. James Joyce, Shuhan Bao, Jacob Feintzeig, Sangwon Suh
  40. Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

    Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev
  41. PersonaPlugin: A Multi-Source Persona Framework for LLM Personalization in Telecommunications

    Jinmo Kang, Minseop Lee, Songha Kim, Junho Shin, Changho Lee, Yeonghwan Jeon, Hyuncheol Jo
  42. Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

    Chanwoo Park, Ziyang Chen, Asuman E. Ozdaglar, Kaiqing Zhang
  43. PROBE: PROcess-Based BEnchmark for Hallucination Detection

    Yu Zhang, Peter Belcak, Shizhe Diao, Yonggan Fu, Shaona Ghosh, Morteza Mardani, Eileen Margaret Peters Long, Bei Yu, Pavlo Molchanov
  44. Quantifying Genuine Awareness in Hallucination Prediction: Disentangling Question-Side Shortcuts

    Yeongbin Seo, Dongha Lee, Jinyoung Yeo
  45. Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

    Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Minrui Xu, Yuge Zhang, Weiqing Liu, Jiang Bian
  46. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Router for LLM-as-a-Judge

    Wenbo Zhang, Lijinghua Zhang, Liner Xiang, Hengrui Cai
  47. Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation

    Minghe Shen, Ananth Balashankar, Adam Fisch, David Madras, Miguel R. D. Rodrigues
  48. RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

    Dylan R. Ashley, Gael Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
  49. SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibrations

    Qingni Wang, Yue Fan, Xin Eric Wang
  50. Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

    Karan Gupta, Pranav Vajreshwari, Yash Pandya, Akshay Nambi
  51. Scaling Agents for Computer Use

    Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang
  52. SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision Language Model Systems

    Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
  53. Semantic Grounding as a Hallucination Mitigation Layer for Reliable AI Agents

    Shivansh Tuteja, Tanvi Bisht, Jatin Bedi
  54. Semantic Self-Distillation for Language Model Uncertainty

    Edward Phillips, Sean Wu, Boyan Gao, David A. Clifton
  55. Steering Large Language Models Toward Clarification through Sparse Autoencoders

    Alisa Petrova, Alexey Kovalev
  56. TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents

    Jongwon Jeong, Jungtaek Kim, Kangwook Lee
  57. Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

    James Xu Zhao, Bryan Hooi, See-Kiong Ng
  58. The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

    Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama
  59. TINY: RepoMirage: Do Code Agents Really Understand Repository Structures?

    Hanyu Li, Yichi Zhang, Speed Zhu, Yinpeng Dong
  60. TSLM: Tree-Structured Language Modeling for Divergent Thinking

    Doyoung Kim, JaeHyeok Doo, Minjoon Seo
  61. Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

    Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li
  62. Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning

    Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
  63. WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

    Nathan J. Zhao
  64. Weight Space Detection of Backdoors in LoRA Adapters

    David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li, Maheep Chaudhary
  65. Zero-Shot LLM-Guided Autonomous Agent for Energy-Aware Resource Allocation in Embedded Systems

    Mohammad Pivezhandi, Mahdi Banisharif, Abusayeed Saifullah