ICLR 2026 | Past | Agents | Safety & alignment | Privacy & security
Agents in the Wild: Safety, Security, and Beyond
ICLR 2026 AIWILD
- Submission deadline: Feb 13, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (150)
Fetched from OpenReview (v2) on 2026-06-10.
- A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
- A Framework for Formalizing LLM Agent Security
- A Survey on Agentic Security: Applications, Threats and Defenses
- Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
- Agent Properties for Multi-Agent Safety
- Agent Psychometrics: Task-Level Performance Prediction in Agentic Coding Benchmarks
- Agent That Matters: An Attribution Framework for Multi-Agent LLMs
- Agentic Browsers and the Same-Origin Policy
- Agentic Rubrics as Contextual Verifiers for SWE Agents
- Agentic Uncertainty Reveals Agentic Overconfidence
- Agentified Benchmarking for Logical Reasoning Agents
- AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents
- Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook
- AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems
- AI Organizations Are More Effective but Less Aligned than Individual Agents
- Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
- Are LLM Agents Exploitable Negotiators?
- Asymmetric Goal Drift in Coding Agents Under Value Conflict
- Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
- Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
- Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis
- Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
- BlueCodeAgent: A Blue Teaming Agent Powered by Automated Red Teaming for CodeGen AI
- Bridging the Gap between Theory of Mind and Action in LLMs
- Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks
- Characterizing Web Search in The Age of Generative AI
- ClawdPwned: Malicious Instructions in the OpenClaw AI Agent Skills repository
- CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
- Context Inference Attacks Without Jailbreaks
- Coordinating Coexisting Learning Agents in Shared Spectrum via Parameter Space Complementarity
- CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
- CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
- Critical Mass: Phase Transitions, Covert Coordination Detection, and Contagion Dynamics in Multi-Agent Systems
- CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
- Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
- Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
- Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
- Directional Embedding Smoothing for Robust Vision Language Models
- DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents
- Echoing: Identity Failures when LLM Agents Talk to Each Other
- Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
- Efficient Tree-Structured Deep Research with Adaptive Resource Allocation
- Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild
- Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents
- ESDAE: Evaluating Synthetic Data for Agent Evaluation
- Evaluating LLM Judges in Cybersecurity Script Analysis
- Evo-Guard: Self-Evolving GNN Guardrails for Adaptive Safety in GUI Agents
- Exposing Security Vulnerabilities in LLM Based Educational Grading Agents
- Federated Agent Reinforcement Learning
- FICO-BENCH: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale
- Forgetting-MarI: LLM Unlearning via Marginal Information Regularization
- Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
- From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator
- General Agent Evaluation
- GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
- GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
- Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol
- Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks
- Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
- How does information access affect LLM monitors' ability to detect sabotage?
- How LLMs Distort & Transform Our Language
- Human-Guided Harm Recovery for Computer Use Agents
- Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
- Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
- Large-scale online deanonymization with LLMs
- Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
- Leveraging RAG for Training-Free Alignment of LLMs
- LLM Agentic System Safety Requires Hybrid Alignment
- LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection
- LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment
- LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBITRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF-REGULATING MEMORY ARCHITECTURES
- Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
- Lost in the Noise: How Test-Time Reasoning Fails with Contextual Distractors
- Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
- Measuring Agents in Production
- META-GOVERNANCE ARCHITECTURES FOR MULTI-AGENT SYSTEM SAFETY, ALIGNMENT, GOVERNANCE, AND SECURITY
- Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
- Model Agreement via Anchoring
- More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
- NAAMSE: Framework for Evolutionary Security Evaluation of Agents
- NESSiE: The Necessary Safety Benchmark - Identifying Errors that should not Exist
- NesyProAct: Proactive Neural-Symbolic Control for Web Agents
- No One Monitor Fits All: Oversight Strategies for Frontier Agents
- Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
- Objective Misalignment in LLM-based Multi Agent Social Deception Game
- Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
- On Randomness in Agentic Evals
- OPENAPPS: SIMULATING ENVIRONMENT VARIATIONS TO MEASURE UI-AGENT RELIABILITY
- OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
- Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
- Persuasion Attacks Can Decrease Effectiveness of CoT Monitoring
- Physics-Guided Multimodal Multi-Agent Learning for Intelligent Transportation Systems
- Position: Agentic Systems Should be General
- Position: AI Development Should Prioritize Cognitive Security
- Position: Science is Collaborative—LLM for Science Should Be Too
- Position: We Must Proactively Address AI Safety Debt
- PrefPO: Pairwise Preference Prompt Optimization
- PriGuardAgent: Context-Aware Privacy Guardrails for Agentic Systems
- ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents
- Profit Is the Red Team: Stress-Testing Agents in Strategic Economic Interactions
- Prover-Verifier Games for AI Control
- Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
- Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
- Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
- Recalling Too Well: Sycophancy and Bias Amplification in Memory-Augmented Models
- Reference-Guided Machine Unlearning
- RepoMirage: Do Code Agents Really Understand Repository Structures?
- ResearchGym: Evaluating Language Model Agents on Real-World AI Research
- RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations
- SafePro: Evaluating the Safety of Professional-Level AI Agents
- Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces
- Scaling Agents for Computer Use
- Script Kiddie Uplift: Measuring Procedural Misuse Amplification in AI Agents
- Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
- SenseAct: Structuring GUI Actions for Reliable Planning and Verification
- Sound Agentic Science Requires Adversarial Experiments
- SPARK: Spectral Perturbation based Adversarial Attacks for KGRAG Agents
- SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems — A Case Study on Ethereum Clients
- Subliminal Signals in Preference Labels
- Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
- Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right
- T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search
- TamperBench: A Systematic Framework to Stress-Test LLM Safety Under Fine-Tuning and Tampering
- TamperTest: A Framework for Testing Tamper Resistance in Open-Weight LLMs
- The Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT
- The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
- The Controllability Trap: A Governance Framework for Military AI Agents
- The Reliability Gap in Agentic Evidence Verification for Materials Science
- The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
- Toward Reliable, Safe, and Secure LLMs for Scientific Applications
- Towards Predictive Models of Strategic Behaviour in Large Language Model Agents
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
- TRADERBENCH: HOW ROBUST ARE AI AGENTS IN ADVERSARIAL CAPITAL MARKETS?
- TSR: Trajectory‑Search Rollouts for Multi‑Turn RL of LLM Agents
- Uncertainty Drives Social Bias Changes in Quantized Large Language Models
- Uncertainty-Aware Self-Correction for Coding Agents
- Understanding Metacognition in Multi-Agent LLMs: Routing, Not Reasoning
- Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning
- VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
- Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning
- W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents
- When Agents Persuade: Rhetoric Generation and Mitigation in LLMs
- When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
- When Fuzzing Becomes Agentic: Semantic State Exploration in the Wild
- Why Do Language Model Agents Whistleblow?
- ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense