ICLR 2026 · Past · Agents · Safety & alignment · Privacy & security

Agents in the Wild: Safety, Security, and Beyond

ICLR 2026 AIWILD


Submission deadline: Feb 13, 2026, 12:00 UTC
OpenReview-synced 2026-02-13 12:00 UTC (as of 2026-06-23). Extensions on OpenReview are applied automatically; verify on the website.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (150)

Fetched from OpenReview (v2) on 2026-06-10.

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli · PDF
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Demitri Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger · PDF
A Framework for Formalizing LLM Agent Security
Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Zhenqiang Gong, Chenguang Wang, Dawn Song · PDF
A Survey on Agentic Security: Applications, Threats and Defenses
Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez · PDF
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths · PDF
Agent Properties for Multi-Agent Safety
Cecilia Elena Tilli · PDF
Agent Psychometrics: Task-Level Performance Prediction in Agentic Coding Benchmarks
Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan · PDF
Agent That Matters: An Attribution Framework for Multi-Agent LLMs
MingYu Lu, Yushan Huang, Su-In Lee · PDF
Agentic Browsers and the Same-Origin Policy
Franziska Roesner, David Kohlbrenner · PDF
Agentic Rubrics as Contextual Verifiers for SWE Agents
Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He · PDF
Agentic Uncertainty Reveals Agentic Overconfidence
Jean Kaddour, Srijan Patel, Gbetondji Jean-Sebastien Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner · PDF
Agentified Benchmarking for Logical Reasoning Agents
Zhiyu Ni, Yifeng Xiao, Zheng Liang · PDF
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents
Emma Gouné, Akshat Naik, Patrick Quinn, Guillermo Bosch, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young · PDF
Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook
Yunbei Zhang, Kai Mei, Ming Liu, Janet Wang, Dimitris N. Metaxas, Xiao Wang, Jihun Hamm, Yingqiang Ge · PDF
AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems
Zhaohui Geoffrey Wang · PDF
AI Organizations Are More Effective but Less Aligned than Individual Agents
Judy Hanwen Shen, Daniel Zhu, Siddarth Srinivasan, Henry Sleight, Lawrence T. Wagner III, Morgan Jane Matthews, Jascha Sohl-Dickstein, Erik Jones · PDF
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · PDF
Are LLM Agents Exploitable Negotiators?
Ramzi Dakhmouche · PDF
Asymmetric Goal Drift in Coding Agents Under Value Conflict
Magnus Saebo, Spencer Gibson, Tyler Crosse, Achyutha Menon, Eyon Jang, Diogo Cruz · PDF
Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
Bardia Mohammadi, Nearchos Potamitis, Lars Henning Klein, Akhil Arora, Laurent Bindschaedler · PDF
Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · PDF
Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis
Jerick Shi · PDF
Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring
Monika Jotautaitė, Maria Angelica Martinez, Tyler Tracy, Ollie Matthews · PDF
Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Hassan Awadallah · PDF
BlueCodeAgent: A Blue Teaming Agent Powered by Automated Red Teaming for CodeGen AI
Chengquan Guo, Yuzhou Nie, Chulin Xie, Zinan Lin, Wenbo Guo, Bo Li · PDF
Bridging the Gap between Theory of Mind and Action in LLMs
Sehyeok Kang, Jihwan Oh, Se-Young Yun · PDF
Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks
Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh · PDF
Characterizing Web Search in The Age of Generative AI
Elisabeth Kirsten, Jost Große Perdekamp, Qinyuan Wu, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar · PDF
ClawdPwned: Malicious Instructions in the OpenClaw AI Agent Skills repository
Arjun Krishna · PDF
CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader · PDF
Context Inference Attacks Without Jailbreaks
Prince Jha, Samuele Poppi, Nils Lukas · PDF
Coordinating Coexisting Learning Agents in Shared Spectrum via Parameter Space Complementarity
MD ASHIKUL HAQUE, Haibo Zhang, Abusayeed Saifullah · PDF
CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
Kristen Pereira, Neelabh Sinha, Rajat Ghosh, Debojyoti Dutta · PDF
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad · PDF
Critical Mass: Phase Transitions, Covert Coordination Detection, and Contagion Dynamics in Multi-Agent Systems
Ben Jenkins · PDF
CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Wenbo Guo, Jingxuan He, Dawn Song · PDF
Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson · PDF
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q Knight · PDF
Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Caglar Yildirim · PDF
Directional Embedding Smoothing for Robust Vision Language Models
Ye Wang, Jing Liu, Toshiaki Koike-Akino · PDF
DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents
Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou · PDF
Echoing: Identity Failures when LLM Agents Talk to Each Other
Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese · PDF
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan · PDF
Efficient Tree-Structured Deep Research with Adaptive Resource Allocation
Lunyiu Nie, Nedim Lipka, Ryan A. Rossi, Swarat Chaudhuri · PDF
Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild
Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku · PDF
Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents
Hyunjun Kim · PDF
ESDAE: Evaluating Synthetic Data for Agent Evaluation
Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · PDF
Evaluating LLM Judges in Cybersecurity Script Analysis
Alexandra Daniela Damir, Apostu Alexandru-Mihai, Diana Bolocan, Andrei Preda, Ioana Croitoru, Mihaela Gaman, Laura Vasilie, Bilal Issa, Monica-Nicoleta Pascu · PDF
Evo-Guard: Self-Evolving GNN Guardrails for Adaptive Safety in GUI Agents
Yifei Song, Yilei Jiang, Yingshui Tan, Xiangyu Yue, Lian-Kuan Chen · PDF
Exposing Security Vulnerabilities in LLM Based Educational Grading Agents
Xueyi Li, Zhuoneng Zhou, Zitao Liu, Yongdong WU · PDF
Federated Agent Reinforcement Learning
Canyu Chen, Kangyu Zhu, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li, Dawn Song · PDF
FICO-BENCH: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale
Jianhong Tu, Nicholas Crispino, Kyle Montgomery, Chenguang Wang, Dawn Song · PDF
Forgetting-MarI: LLM Unlearning via Marginal Information Regularization
Shizhou Xu, Yuan Ni, Stefan Broecker, Thomas Strohmer · PDF
Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik · PDF
From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator
Brian Grinstead, Mariana Meireles, Christoph Kerschbaumer, Cameron Allen · PDF
General Agent Evaluation
Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, Michal Jacovi, Leshem Choshen, Liat Ein-Dor, Yoav Katz, Michal Shmueli-Scheuer · PDF
GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar · PDF
GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
Pepijn Cobben, X. Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin · PDF
Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol
Dhruv Patel · PDF
Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks
Saurabh Dingwani, Ayan Banerjee, Sandeep Gupta · PDF
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy · PDF
HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Sanmi Koyejo, Nigam Shah · PDF
How does information access affect LLM monitors' ability to detect sabotage?
Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward · PDF
How LLMs Distort & Transform Our Language
Marwa Abdulhai, Isadora White, Yanming Wan, Joel Z Leibo, Max Kleiman-Weiner, Natasha Jaques · PDF
Human-Guided Harm Recovery for Computer Use Agents
Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu · PDF
Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
Achyutha Menon, Magnus Saebo, Tyler Crosse, Spencer Gibson, Eyon Jang, Diogo Cruz · PDF
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler · PDF
Large-scale online deanonymization with LLMs
Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr · PDF
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Hassan Awadallah · PDF
Leveraging RAG for Training-Free Alignment of LLMs
John Timothy Halloran · PDF
LLM Agentic System Safety Requires Hybrid Alignment
Vincent Siu, Kyle Montgomery, Yujin Potter, Zhun Wang, Dawn Song, Chenguang Wang · PDF
LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection
Almog Hilel, Riddhi Bhagwat, Leshem Choshen, Idan Shenfeld, Jacob Andreas · PDF
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment
Chen Bo Calvin Zhang, Christina Q Knight, Nicholas Kruus, Jason Hausenloy, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Zifan Wang, Seth Donoughe, Julian Michael · PDF
LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBITRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF-REGULATING MEMORY ARCHITECTURES
Akash Das · PDF
Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, Seraphina Goldfarb-Tarrant · PDF
Lost in the Noise: How Test-Time Reasoning Fails with Contextual Distractors
Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo · PDF
Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · PDF
Measuring Agents in Production
Melissa Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis · PDF
META-GOVERNANCE ARCHITECTURES FOR MULTI-AGENT SYSTEM SAFETY, ALIGNMENT, GOVERNANCE, AND SECURITY
Himanshu Joshi, Shivani Shukla, Sunita Kumari, Manas Joshi · PDF
Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Colin Treleaven · PDF
Model Agreement via Anchoring
Eric Eaton, Surbhi Goel, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell · PDF
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
Advait Yadav, Sidney Black, Oliver Sourbut · PDF
NAAMSE: Framework for Evolutionary Security Evaluation of Agents
Kunal Pai, Parth Shah, Harshil Patel · PDF
NESSiE: The Necessary Safety Benchmark - Identifying Errors that should not Exist
Johannes Bertram, Jonas Geiping · PDF
NesyProAct: Proactive Neural-Symbolic Control for Web Agents
Keyi Xiang, Tianyi Tang, Jie-Jing Shao, Yueming Lyu, Ivor Tsang, Yew-Soon Ong, Haiyan Yin · PDF
No One Monitor Fits All: Oversight Strategies for Frontier Agents
Neil Kale, Shashwat Saxena, Ziqian Zhong, Chen Henry Wu, Aditi Raghunathan · PDF
Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio · PDF
Objective Misalignment in LLM-based Multi Agent Social Deception Game
Marylou Fauchard, Florian Carichon, Margarida Carvalho, Golnoosh Farnadi · PDF
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier · PDF
On Randomness in Agentic Evals
Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · PDF
OPENAPPS: SIMULATING ENVIRONMENT VARIATIONS TO MEASURE UI-AGENT RELIABILITY
Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim · PDF
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong · PDF
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev · PDF
Persuasion Attacks Can Decrease Effectiveness of CoT Monitoring
Jennifer Za, Julija Bainiaksina, Nikita Ostrovsky, Tanush Chopra, Victoria Krakovna · PDF
Physics-Guided Multimodal Multi-Agent Learning for Intelligent Transportation Systems
Zhen Tian, Yaqiong Zhang, Zhihao Lin, Fujiang Yuan, Yijun Lu, Wangjie lang, Xinyu Wang, Ning Lyu, Zhiguo Tao, Kaijie Chen, Aaron Wang · PDF
Position: Agentic Systems Should be General
Elron Bandel, Asaf Yehudai, Alexandre Lacoste, Avijit Ghosh, Graham Neubig, Margaret Mitchell, Michal Shmueli-Scheuer, Leshem Choshen · PDF
Position: AI Development Should Prioritize Cognitive Security
Batu El, Shiye Su, Aneesh Pappu, Peggy Yin, Julie Heng, Eric Heng, Ryan Z Wang, Andreas Haupt, James Zou · PDF
Position: Science is Collaborative—LLM for Science Should Be Too
Terry Jingchen Zhang, Wenyuan Jiang, Yongjin Yang, Sirui Lu, Bernhard Schölkopf, Zhijing Jin · PDF
Position: We Must Proactively Address AI Safety Debt
Peter Wallich, Raymond Douglas · PDF
PrefPO: Pairwise Preference Prompt Optimization
Rahul Singhal, Pradyumna Tambwekar, Karime Maamari · PDF
PriGuardAgent: Context-Aware Privacy Guardrails for Agentic Systems
Chulin Xie, Amit Dhurandhar, Bo Li · PDF
ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents
Lei Ding, Bin He, Chenguang Wang, Yang Liu · PDF
Profit Is the Red Team: Stress-Testing Agents in Strategic Economic Interactions
Shouqiao Wang, Marcello Politi, Samuele Marro, Davide Crapis · PDF
Prover-Verifier Games for AI Control
Joan Velja, Charlie Griffin, Alessandro Abate · PDF
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Subhadip Mitra · PDF
Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin · PDF
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg · PDF
Recalling Too Well: Sycophancy and Bias Amplification in Memory-Augmented Models
Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel · PDF
Reference-Guided Machine Unlearning
Jonas Mirlach, Sonia Laguna, Julia E Vogt · PDF
RepoMirage: Do Code Agents Really Understand Repository Structures?
Hanyu Li, Yichi Zhang, Speed Zhu, Yinpeng Dong · PDF
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan · PDF
RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations
Manasi Sharma, Brad Kenstler, Bing Liu · PDF
SafePro: Evaluating the Safety of Professional-Level AI Agents
Kaiwen Zhou, Shreedhar Jangam, Ashwin Nagarajan, Tejas Polu, Suhas Oruganti, Chengzhi Liu, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xin Eric Wang · PDF
Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces
Karan Gupta, Pranav Vajreshwari, Yash Pandya, Akshay Nambi, Ahmed Hassan Awadallah · PDF
Scaling Agents for Computer Use
Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang · PDF
Script Kiddie Uplift: Measuring Procedural Misuse Amplification in AI Agents
Zora Che, Julio Poveda, Aldana Belen Rodriguez, Yannis Yiming He, Chen Bo Calvin Zhang, Zifan Wang, Udari Madhushani Sehwag · PDF
Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng · PDF
SenseAct: Structuring GUI Actions for Reliable Planning and Verification
Cai Hongtian, Tianyi Ma, Jie-Jing Shao, Tianyi Tang, Ivor Tsang, Yueming Lyu, Haiyan Yin · PDF
Sound Agentic Science Requires Adversarial Experiments
Dionizije Fa, Marko Čuljak · PDF
SPARK: Spectral Perturbation based Adversarial Attacks for KGRAG Agents
Aditya Saibewar, Aditya Ramesh, Shivam Bhardwaj, Jatin Chauhan, Manohar Kaul · PDF
SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems — A Case Study on Ethereum Clients
Masato Kamba, Akiyoshi Sannai · PDF
Subliminal Signals in Preference Labels
Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer · PDF
Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Jacob Dang, Brian Yang Xie, Omar G. Younis · PDF
Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right
Shaochen Zhong, Jiamu Zhang, Hoang Anh Duy Le, Wenya Xie, Yifan Lu, Xintong Sun, Mohsen Hariri, Hongyi Liu, Guanchu Wang, Zhaozhuo Xu, Zirui Liu, Shuai Xu, Ning Xie, Li Li, Rui Chen, Ruixiang Tang, Xia Hu, Vipin Chaudhary · PDF
T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search
Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Hayeon Lee, Seanie Lee, Sung Ju Hwang · PDF
TamperBench: A Systematic Framework to Stress-Test LLM Safety Under Fine-Tuning and Tampering
Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Nayeema Nonta, Matthew Kowal, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla · PDF
TamperTest: A Framework for Testing Tamper Resistance in Open-Weight LLMs
Isabel Dahlgren, Aashiq Muhamed · PDF
The Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT
Abhisek Dash, Soumi Das, Elisabeth Kirsten, Qinyuan Wu, Sai Keerthana Karnam, Krishna P. Gummadi, Thorsten Holz, Muhammad Bilal Zafar, Savvas Zannettou · PDF
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason E Weston, Hongyuan Zhan · PDF
The Controllability Trap: A Governance Framework for Military AI Agents
Subramanyam Sahoo · PDF
The Reliability Gap in Agentic Evidence Verification for Materials Science
Albert Gong, James J. Kim, Anmol Kabra, Aaditya Panigrahi, Jiashuo Wang, Arjun B. Mulchandani, Michael Freeman, Fatmagul Katmer, Joshua Peters Wakefield, Linxi Zhao, Chao Wan, Akanksha Sarkar, Yoav Artzi, Leslie M Schoop, John Thickstun, Kilian Q Weinberger, Eun-Ah Kim, Peter I. Frazier, Jennifer J. Sun · PDF
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, Pan Li · PDF
Toward Reliable, Safe, and Secure LLMs for Scientific Applications
Saket Sanjeev Chaturvedi, Joshua Bergersona, Tanwi Mallick · PDF
Towards Predictive Models of Strategic Behaviour in Large Language Model Agents
Jennifer Za, Aristeidis Panos, Jan Cuhel, Samuel Albanie · PDF
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, Christos Ziakas, Elliott Thornley · PDF
TRADERBENCH: HOW ROBUST ARE AI AGENTS IN ADVERSARIAL CAPITAL MARKETS?
Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, Jing Xiong · PDF
TSR: Trajectory‑Search Rollouts for Multi‑Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · PDF
Uncertainty Drives Social Bias Changes in Quantized Large Language Models
Stanley Bryan Zamora Hua, Sanae Lotfi, Irene Y. Chen · PDF
Uncertainty-Aware Self-Correction for Coding Agents
Jason Almeida, Lokesh Sai Dasari, Anubhav Pal, Tinuade Adeleke, Sean Wu, Ruizhe Li · PDF
Understanding Metacognition in Multi-Agent LLMs: Routing, Not Reasoning
Mafizur Rahman, Lijun Qian · PDF
Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning
Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li · PDF
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen · PDF
Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning
Yunbei Zhang, Yingqiang Ge, Weijie Xu, Yuhui Xu, Jihun Hamm, Chandan K. Reddy · PDF
W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents
Xiaoqiang Lin, Jun Hao Liew, Silvio Savarese, Junnan Li · PDF
When Agents Persuade: Rhetoric Generation and Mitigation in LLMs
Julia Jose, Ritik Roongta, Rachel Greenstadt · PDF
When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Max Fomin · PDF
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun · PDF
When Fuzzing Becomes Agentic: Semantic State Exploration in the Wild
Andrew Yin, Zhaoling Chen, Qian Zhang, Heng Yin · PDF
Why Do Language Model Agents Whistleblow?
Kushal Agrawal, Frank Xiao, Guido Ernesto Bergman, Asa Cooper Stickland · PDF
ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense
Nancy Lau, Louis Sloot, Jyoutir Raj, Evan Harris, Giuseppe Marco Boscardin, Dan Zhao, Dylan Bowman, Mario Brajkovski, Jaideep Singh Chawla · PDF

Accepted papers (150)

☆A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

☆A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

☆A Framework for Formalizing LLM Agent Security

☆A Survey on Agentic Security: Applications, Threats and Defenses

☆Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

☆Agent Properties for Multi-Agent Safety

☆Agent Psychometrics: Task-Level Performance Prediction in Agentic Coding Benchmarks

☆Agent That Matters: An Attribution Framework for Multi-Agent LLMs

☆Agentic Browsers and the Same-Origin Policy

☆Agentic Rubrics as Contextual Verifiers for SWE Agents

☆Agentic Uncertainty Reveals Agentic Overconfidence

☆Agentified Benchmarking for Logical Reasoning Agents

☆AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents

☆Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook

☆AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

☆AI Organizations Are More Effective but Less Aligned than Individual Agents

☆Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

☆Are LLM Agents Exploitable Negotiators ?

☆Asymmetric Goal Drift in Coding Agents Under Value Conflict

☆Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

☆Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

☆Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis

☆Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring

☆Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

☆BlueCodeAgent: A Blue Teaming Agent Powered by Automated Red Teaming for CodeGen AI

☆Bridging the Gap between Theory of Mind and Action in LLMs

☆Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks

☆Characterizing Web Search in The Age of Generative AI

☆ClawdPwned: Malicious Instructions in the OpenClaw AI Agent Skills repository

☆CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

☆Context Inference Attacks Without Jailbreaks

☆Coordinating Coexisting Learning Agents in Shared Spectrum via Parameter Space Complementarity

☆CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents

☆CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

☆Critical Mass: Phase Transitions, Covert Coordination Detection, and Contagion Dynamics in Multi-Agent Systems

☆CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

☆Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

☆Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders

☆Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

☆Directional Embedding Smoothing for Robust Vision Language Models

☆DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents

☆Echoing: Identity Failures when LLM Agents Talk to Each Other

☆Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

☆Efficient Tree-Structured Deep Research with Adaptive Resource Allocation

☆Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild

☆Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

☆ESDAE: Evaluating Synthetic Data for Agent Evaluation

☆Evaluating LLM Judges in Cybersecurity Script Analysis

☆Evo-Guard: Self-Evolving GNN Guardrails for Adaptive Safety in GUI Agents

☆Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

☆Federated Agent Reinforcement Learning

☆FICO-BENCH: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale

☆Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

☆Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

☆From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator

☆General Agent Evaluation

☆GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

☆GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

☆Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

☆Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks

☆Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

☆HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

☆How does information access affect LLM monitors' ability to detect sabotage?

☆How LLMs Distort & Transform Our Language

☆Human-Guided Harm Recovery for Computer Use Agents

☆Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

☆Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

☆Large-scale online deanonymization with LLMs

☆Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

☆Leveraging RAG for Training-Free Alignment of LLMs

☆LLM Agentic System Safety Requires Hybrid Alignment

☆LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection

☆LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment

☆LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBI- TRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF- REGULATING MEMORY ARCHITECTURES

☆Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

☆Lost in the Noise: How Test-Time Reasoning Fails with Contextual Distractors

☆Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

☆Measuring Agents in Production

☆META-GOVERNANCE ARCHITECTURES FOR MULTI-AGENT SYSTEM SAFETY, ALIGNMENT, GOVERNANCE, AND SECURITY

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

A Framework for Formalizing LLM Agent Security

A Survey on Agentic Security: Applications, Threats and Defenses

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Agent Properties for Multi-Agent Safety

Agent Psychometrics: Task-Level Performance Prediction in Agentic Coding Benchmarks

Agent That Matters: An Attribution Framework for Multi-Agent LLMs

Agentic Browsers and the Same-Origin Policy

Agentic Rubrics as Contextual Verifiers for SWE Agents

Agentic Uncertainty Reveals Agentic Overconfidence

Agentified Benchmarking for Logical Reasoning Agents

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents

Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook

AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

AI Organizations Are More Effective but Less Aligned than Individual Agents

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Are LLM Agents Exploitable Negotiators?

Asymmetric Goal Drift in Coding Agents Under Value Conflict

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis

Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring

Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

BlueCodeAgent: A Blue Teaming Agent Powered by Automated Red Teaming for CodeGen AI

Bridging the Gap between Theory of Mind and Action in LLMs

Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks

Characterizing Web Search in The Age of Generative AI

ClawdPwned: Malicious Instructions in the OpenClaw AI Agent Skills repository

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Context Inference Attacks Without Jailbreaks

Coordinating Coexisting Learning Agents in Shared Spectrum via Parameter Space Complementarity

CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

Critical Mass: Phase Transitions, Covert Coordination Detection, and Contagion Dynamics in Multi-Agent Systems

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Directional Embedding Smoothing for Robust Vision Language Models

DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents

Echoing: Identity Failures when LLM Agents Talk to Each Other

Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

Efficient Tree-Structured Deep Research with Adaptive Resource Allocation

Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild

Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

ESDAE: Evaluating Synthetic Data for Agent Evaluation

Evaluating LLM Judges in Cybersecurity Script Analysis

Evo-Guard: Self-Evolving GNN Guardrails for Adaptive Safety in GUI Agents

Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

Federated Agent Reinforcement Learning

FICO-BENCH: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale

Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator

General Agent Evaluation

GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

How does information access affect LLM monitors' ability to detect sabotage?

How LLMs Distort & Transform Our Language

Human-Guided Harm Recovery for Computer Use Agents

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Large-scale online deanonymization with LLMs

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Leveraging RAG for Training-Free Alignment of LLMs

LLM Agentic System Safety Requires Hybrid Alignment

LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment

LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBITRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF-REGULATING MEMORY ARCHITECTURES

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

Lost in the Noise: How Test-Time Reasoning Fails with Contextual Distractors

Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

Measuring Agents in Production

META-GOVERNANCE ARCHITECTURES FOR MULTI-AGENT SYSTEM SAFETY, ALIGNMENT, GOVERNANCE, AND SECURITY

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs