ICLR 2026 Past Agents
Agentic AI in the Wild: From Hallucinations to Reliable Autonomy
Reliable_Autonomy
- Submission deadline: Feb 6, 2026, 23:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (65)
Fetched from OpenReview (v2) on 2026-06-10.
- “TINY” SILENT HALLUCINATIONS IN AGENTIC AI: HIDDEN FAILURE MODES IN AUTONOMOUS SYSTEMS
- A Unified Definition of Hallucination: It’s The World Model, Stupid!
- Adversarial Iterative Unit Test Generation with Large Language Models
- AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents
- Agentic Pressure: The Endogenous Entropy of Reliable Autonomy
- AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring
- Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
- AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- Behavioral Continuity in Agentic LLMs: An Engineering Mental Structure Approach
- Building Reliable Long-Form Generation via Hallucination Rejection Sampling
- CA-BED: Conversation-Aware Bayesian Experimental Design
- Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search
- CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
- CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
- Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs
- Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
- Don't Do That!: Guiding Embodied Systems through Large Language Model-based Constraint Generation
- DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents
- E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
- Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads
- Efficient Hallucination Detection in Automatic Code Generation
- Entropy Jurisprudence: Auditing Procedural Fidelity in LLM Normative Reasoning
- Epistemic Context Learning: Building Trust the Right Way in LLM Multi-Agent Systems
- Escaping the Mode: Multi Answer Reinforcement Learning in LMs
- Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
- From Bandit Regret to FDR Control: Online Selective Generation with Feedback Unlocking
- From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator
- GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
- HallucinationHunter: Fine-Grained Factual Grounding of Generated Text
- Hierarchical Procedural Meta-Reasoning for Generalizable Multimodal Agents
- Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
- LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents
- Measuring Agents in Production
- MedMMV: A Controllable Multimodal Multi-Agent Framework for Reliable and Verifiable Clinical Reasoning
- Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
- No One Monitor Fits All: Oversight Strategies for Frontier Agents
- Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization
- OPENAPPS: SIMULATING ENVIRONMENT VARIATIONS TO MEASURE UI-AGENT RELIABILITY
- Owl: Separating Generation from Evaluation to Detect Plausible Failures in Lifecycle Inventory Mapping
- Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
- PersonaPlugin: A Multi-Source Persona Framework for LLM Personalization in Telecommunications
- Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
- PROBE: PROcess-Based BEnchmark for Hallucination Detection
- Quantifying Genuine Awareness in Hallucination Prediction: Disentangling Question-Side Shortcuts
- Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Router for LLM-as-a-Judge
- Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation
- RPRA: Predicting an LLM-Judge for Efficient but Performant Inference
- SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibrations
- Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces
- Scaling Agents for Computer Use
- SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision Language Model Systems
- Semantic Grounding as a Hallucination Mitigation Layer for Reliable AI Agents
- Semantic Self-Distillation for Language Model Uncertainty
- Steering Large Language Models Toward Clarification through Sparse Autoencoders
- TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents
- Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
- The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models
- TINY: RepoMirage: Do Code Agents Really Understand Repository Structures?
- TSLM: Tree-Structured Language Modeling for Divergent Thinking
- Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities
- Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning
- WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
- Weight Space Detection of Backdoors in LoRA Adapters
- Zero-Shot LLM-Guided Autonomous Agent for Energy-Aware Resource Allocation in Embedded Systems