ICML 2026 · Past · Agents · Safety & alignment · Privacy & security
Second Workshop on Agents in the Wild: Safety, Security, and Beyond
ICML 2026 AIWILD
- Submission deadline: May 9, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (216)
Fetched from OpenReview (v2) on 2026-06-10.
-
A Multi-Model Self-Evolving Framework for Zero-Data Document Understanding via Axiomatic Synthetic Refinement
-
A Prompt-Masked Pilot for History-Dependent Safety Degradation in Multi-Turn Conversational Agents
-
A Systematic Investigation of RL-Jailbreaking in LLMs
-
ABRA: Agent Benchmark for Radiology Applications
-
Activation Steering for Tool-Poisoning Defense in Language-Model Agents
-
Adaptive Adversaries: A Multi-Turn, Multi-LLM Benchmark for LLM Agent Security
-
AF-ARENA: A Multi-Dimensional Evaluation Suite for Alignment Faking
-
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
-
Agentic Misalignment Deterrence: Incorporating Probability and Stakeholders to Decision-Making
-
Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning
-
AgentSociety: Incentivizing Agentic Social Intelligence
-
AI Agent Safety is a Reinforcement Learning Problem
-
AI Safeguards as Affordance Modulation: Embedded Population Assumptions in Agentic Systems
-
Aligning Language Models with Selective Prediction
-
Approve the Effect, Not the Tool Call: Preventing Stale Consent in Tool-Using Agents
-
Architecture Matters for Multi-Agent Security
-
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
-
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
-
ATLAS: Adaptive Topology-Level Attack Synthesis for Probing Multi-Agent Systems
-
Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
-
Attractor States Emerge in Multi-Turn LLM Conversations
-
Autoformalization of Agent Instructions into Policy-as-Code
-
AutoHoney: Automating, Deploying, and Evaluating Scheming Honeypots Across Production Codebases
-
Automata from Agent Traces: Failure and Next-Step Prediction
-
Automated interpretability and feature discovery in language models with agents
-
BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
-
BarrierSteer: LLM Safety via Learning Barrier Steering
-
Behavioral Code: Legible, Auditable Loops for Autonomous Agents
-
Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
-
Beyond Single-Model Injection: A Threat Model and Defense Architecture for Prompt Injection in Multi-Agent Systems
-
Bias and Discrimination in the Agentic Web and How Project NANDA Can Support Mitigation
-
BiasTrojan: LLM Judgers Are Easily Distorted by Few Hundreds of Contrastive Biased Training Data
-
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
-
Boundary Point Jailbreaking of Black-Box LLMs
-
Bridging Safety and Performance in Autonomous Systems using Offline Reinforcement Learning
-
CAD-bench: Benchmarking Language Models on Functional CAD Generation
-
Calibrated Deferral Routing for Cost-Efficient Guardrails
-
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
-
CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
-
Catching Infrastructure Sabotage When Coding Agents Are Insider Threats
-
Caught in the Act(ivation): Stopping Credential Exfiltration Before It Starts
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
-
Chain-of-Sanitized-Thoughts: Reducing PII Leakage in Chain-of-Thought Reasoning
-
ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
-
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
-
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
-
ClinSeekAgent: Automating Multi-modal Evidence Seeking for Agentic Clinical Reasoning
-
Coding Agents Don't Know When to Act
-
Communication Boundary Control for Safer Multi-Agent Language Agents
-
Component and Dimension Sparsity in Transformer Refusal Mechanisms
-
Compound AI System Reliability: A Failure Taxonomy and Resilience Pattern Catalog from 150 Production Incidents
-
Consensus–Bayesian Anomaly Detection in Agentic Access Graphs
-
Containment Verification: AI Safety Guarantees Independent of Alignment
-
CONTRA: Red-Teaming Configurations of Personalizable Agents
-
Contrastive Discovery: Open-Ended Scientific Discovery over Competing Explanations
-
Controlling Tool Use with Heading-Specific Activation Steering
-
Copy-on-Write Scoring: Application-Specific Agent Evaluations
-
Correcting Noise-Misspecified Operator Selection in Wild Compound LLM Agents
-
Coverage-Aware Test Generation for Conversational AI Agents
-
CPPO: Contrastive Perception Policy Optimization for VLM Agents
-
Cross-Agent Campaign Attribution: Linking Asynchronous Attacks Across LLM Agents
-
CrossAnchor: Image-Anchored Text Optimization Exposes Blind Spots in Multi-line Defenses of Agentic Systems
-
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation
-
Decomposing Smooth Agentic Inference Scaling
-
Digital Twin Builder: A Multi-Agent LLM System for Automated Industrial Digital Twin Development
-
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
-
Do AI Agents Write Less Maintainable Code Than Human Developers?
-
Does the Optimal Hallucination Detector for Agentic Tool Calls Depend on Model Scale?
-
Double-Helix Co-Training for Computer-Use Generator and Verifier Models
-
Dynamic Capability Scoping for Enterprise AI Agents: A Synthetic Dataset and Three-Source Permission Architecture
-
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
-
Evaluating Agentic Configuration Repair for Computer Networks
-
Evaluation Theater: How Structural Compliance Decouples from Cognitive Judgment in Deployed LLM Agents
-
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
-
Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software
-
Failure-Aware Query Refinement for Reliable Open-Vocabulary Home-Robot Perception
-
FaultLoc: Evaluating Coding Agents For Fault Localization
-
From Business Metrics to Behavioral Personas: Controllable User Simulation for Pre-Deployment Agent Testing
-
From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
-
From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models
-
From Self-Preservation to Peer-Preservation: A Staged Framing of Preservation-Oriented Misalignment in Frontier Models
-
From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
-
FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
-
Full-Season Agent Evaluation in Soybean Farm Operations under Real-World Agricultural Process Dynamics
-
Game-Theoretic Multi-LLM Routing for Safer Agents in the Wild
-
GameDevBench: Evaluating Agentic Capabilities Through Game Development
-
General Agent Evaluation
-
Goal-Drift Probes: Anticipating Multi-Turn LLM Agent Failure From Mid-Network Activations
-
Hidden in Memory: Sleeper Memory Poisoning in LLM Agents
-
Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench
-
HiMA: Efficient Hybrid Model Serving for Agentic Systems
-
Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
-
House Rules: Institutional Design in Multi-Agent LLM Code Markets
-
How can we assess human-agent interactions? Case studies in software agent design
-
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
-
How Should Your Agent Talk to Mine? Measuring the Utility–Security Frontier of Cross-Boundary Agentic Delegation
-
How Well Do Models Follow Their Constitutions?
-
How LLM Decision Agents Fail in the Wild: A Reproducible Failure-Cell Framework
-
InferenceBench: A Benchmark for Open-Ended LLM Inference Optimization by AI Agents
-
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
-
Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring
-
iOSWorld: A Benchmark for Personally Intelligent Phone Agents
-
Is Your LLM-as-a-Recommender Agent Trustable? LLMs' Recommendation is Easily Hacked by Biases (Preferences)
-
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
-
Knowing When to STOP, RECOVER, and SEARCH: A Modular Framework for GUI Automation
-
Latent Undertow: How Ordinary Typos Break Probes
-
Lateral Data Exfiltration in MCP: How One Compromised Server Captures Cross-Domain Agent Data
-
Laundering AI Authority with Adversarial Examples
-
Learning Stateful Predictive Knowledge From Experience
-
Learning to Inject: Automated Prompt Injection via Reinforcement Learning
-
Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
-
LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
-
Liability Frameworks for Agentic AI Systems
-
Life After Benchmark Saturation: A Case Study of CORE-Bench
-
Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing
-
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
-
LLM-Guided Planning for Multi-hop Reasoning over Multimodal Nuclear Regulatory Documents
-
LLMs Struggle to Rank Products Robustly
-
Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search
-
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
-
Making Open-Source Text LLM Watermarks Durable Against Merging
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
-
Mecha-nudges for Machines
-
Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI
-
Memory-Induced Tool-Drift in LLM Agents
-
Meta-Harness: Post-Training Reliable Agent Systems via Harness Search
-
Minibinder Lab: The Reliability Gap Of Agents For Designing High Quality Protein Binders
-
MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
-
Mitigating Over-Personalization in Language Models via Structured Memory
-
Mitigating Visual Hallucinations for Reliable Multimodal Agents
-
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
-
NEST: Nascent Encoded Steganographic Thoughts
-
Network-Level Prompt and Trait Leakage in Local Research Agents
-
NitroBox: Lightning-Fast Sandbox for Large-Scale RL Training
-
Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
-
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
-
Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
-
Online Boundary-Aware Memory for Case-Based Reasoning Agents
-
Open-World Evaluations for Measuring Frontier AI Capabilities
-
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
-
Out-of-Distribution Generalization of Risk Aversion in Language Models
-
Oversight is Not Compliance: Tacit Collusion in LLM Pricing Agents Under Antitrust Regulation
-
Parameters as Agentic Memory: Internalizing Long-Horizon Memories for Efficient LLM Agents
-
Peer-Preservation in Frontier Models
-
Plausible Deniability Guarantees for Whistleblowers
-
Position: LLM Social-Simulation Agents in the Wild Cannot Serve as Social Scientific Evidence Without an Identification Strategy
-
Position: Stop Hardcoding Multi-Agent Workflows That General Agents Will Outgrow
-
ReCode: Unify Plan and Action for Universal Granularity Control
-
Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
-
Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges
-
Remote Control: AI Control with User Actions
-
RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments
-
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
-
Reward Hacking in Rubric-Based Reinforcement Learning
-
Robust Multi-Agent LLMs under Byzantine Faults
-
SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation
-
Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
-
Safe Under Budget? Verification Budgets and Abstention Failures in Web Agents
-
SafeClawBench: An Operating-System Perspective on Evaluating the Security of Claw-like Agent Systems
-
Same Action, Different Justification: Path-Based Authorization for Irreversible Agent Actions
-
Same Biology, Different Scores: Quantifying the Tool-Use Confound in Agentic Biology Evaluation
-
Scaling Laws for Strategic Interactions
-
Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents
-
SkillOptimizer: Agent Skill Optimization Through Subskills Without Task Supervision
-
SlotGuard: Stop Oversharing Private Local Context in LLM Agent Transcripts
-
Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows
-
Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
-
Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models
-
Stop Comparing LLM Agents Without Disclosing the Harness
-
Stop Reporting System-Level AI Reasoning as Individual Model Capability
-
Structured Hallucination in Tool-Using Agents: Measuring and Mitigating LLM Synthesis Corruption in Production
-
SudoBench: A Contextual Authorization Benchmark for LLM Agents
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
-
Terminal Agents Suffice for Enterprise Automation
-
The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
-
The Monitorability Gap Between Reasoning in Thinking and Reasoning in the Output
-
The Safety Illusion of Greedy Decoding: Diagnosing Booster's Compliant Leakage and a Phase-2 Mitigation
-
The Token Tax: Measuring the Diminishing Returns of Test-Time Compute in Agentic Pipelines
-
Thought Virus: Spreading Subliminal Biases in Multi-Agent Systems
-
Tool Selection Bias Amplifies in Multi-turn User–Agent Interactions
-
Tool-Framing Bypasses LLM Safety: Procedural Abstraction Reduces Refusal Rates by Up to 40 Percentage Points Across Models
-
ToolFailBench: Diagnosing Tool-Use Failures in LLM Agents
-
Toward Scalable Terminal Task Synthesis via Skill Graphs
-
Towards Budget-Aware Agents: Do LLM Agents Know What They Will Spend?
-
Towards Predictive Models of Strategic Behaviour in Large Language Model Agents
-
TRACE: Capability-Targeted Agentic Training
-
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
-
Tracking the Behavioral Trajectories of Adapting Agents
-
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
-
Training Language Agents to Learn from Experience
-
Training ML Models with Predictable Failures
-
Untrusted Content Masking for Web Agents with Security Guarantees
-
VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation
-
VIGIL: A Reflective Runtime for Self-Healing LLM Agents
-
WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
-
WARP: A Wrapper-Based, Adaptive, Realistic Pipeline for Reliable Web-Agent Robustness Testing
-
We Let Agents Compete and They Tried to Cheat. KernelGuard: Defending GPU Competitions from Adversarial Agentic Systems
-
Web Agents Leak Sensitive Data on Simple Scalable Websites
-
WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
-
WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents
-
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
-
What Can One Bad Tool Call Destroy? Measuring and Minimizing Blast Radius in Agentic Tool Use
-
What Game-Theoretic Benchmarks Miss: Strategic Silence in Multi-Agent LLMs
-
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems
-
When Do Covert Channels Emerge? Probing Steganographic Capacity in Multimodal Agents via Diffusion VAE Latents
-
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
-
Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
-
Widening the Gap: Exploiting LLM Quantization via Outlier Injection
-
WinDOM: Self-Family Distillation for Small-Model GUI Grounding
-
Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing
-
WorldFork: Trace-Auditable Forecasting Agents in Open-Ended Domains
-
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
-
Your Cursor is Not Secure: Command Line Interface Agent Can Expose Realistic Risks Through Tactics, Techniques, and Procedures