ICLR 2026 · Past · Agents · Safety & alignment · Privacy & security

Agents in the Wild: Safety, Security, and Beyond

ICLR 2026 AIWILD

Submission deadline
Feb 13, 2026, 12:00 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (150)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

    Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli · PDF
  2. A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

    Usman Anwar, Julianna Piskorz, David D. Baek, David Demitri Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger · PDF
  3. A Framework for Formalizing LLM Agent Security

    Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Zhenqiang Gong, Chenguang Wang, Dawn Song · PDF
  4. A Survey on Agentic Security: Applications, Threats and Defenses

    Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez · PDF
  5. Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

    Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths · PDF
  6. Agent Properties for Multi-Agent Safety

    Cecilia Elena Tilli · PDF
  7. Agent Psychometrics: Task-Level Performance Prediction in Agentic Coding Benchmarks

    Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan · PDF
  8. Agent That Matters: An Attribution Framework for Multi-Agent LLMs

    MingYu Lu, Yushan Huang, Su-In Lee · PDF
  9. Agentic Browsers and the Same-Origin Policy

    Franziska Roesner, David Kohlbrenner · PDF
  10. Agentic Rubrics as Contextual Verifiers for SWE Agents

    Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He · PDF
  11. Agentic Uncertainty Reveals Agentic Overconfidence

    Jean Kaddour, Srijan Patel, Gbetondji Jean-Sebastien Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner · PDF
  12. Agentified Benchmarking for Logical Reasoning Agents

    Zhiyu Ni, Yifeng Xiao, Zheng Liang · PDF
  13. AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents

    Emma Gouné, Akshat Naik, Patrick Quinn, Guillermo Bosch, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young · PDF
  14. Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook

    Yunbei Zhang, Kai Mei, Ming Liu, Janet Wang, Dimitris N. Metaxas, Xiao Wang, Jihun Hamm, Yingqiang Ge · PDF
  15. AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

    Zhaohui Geoffrey Wang · PDF
  16. AI Organizations Are More Effective but Less Aligned than Individual Agents

    Judy Hanwen Shen, Daniel Zhu, Siddarth Srinivasan, Henry Sleight, Lawrence T. Wagner III, Morgan Jane Matthews, Jascha Sohl-Dickstein, Erik Jones · PDF
  17. Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

    Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · PDF
  18. Are LLM Agents Exploitable Negotiators?

    Ramzi Dakhmouche · PDF
  19. Asymmetric Goal Drift in Coding Agents Under Value Conflict

    Magnus Saebo, Spencer Gibson, Tyler Crosse, Achyutha Menon, Eyon Jang, Diogo Cruz · PDF
  20. Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

    Bardia Mohammadi, Nearchos Potamitis, Lars Henning Klein, Akhil Arora, Laurent Bindschaedler · PDF
  21. Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

    Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · PDF
  22. Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis

    Jerick Shi · PDF
  23. Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring

    Monika Jotautaitė, Maria Angelica Martinez, Tyler Tracy, Ollie Matthews · PDF
  24. Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

    Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Hassan Awadallah · PDF
  25. BlueCodeAgent: A Blue Teaming Agent Powered by Automated Red Teaming for CodeGen AI

    Chengquan Guo, Yuzhou Nie, Chulin Xie, Zinan Lin, Wenbo Guo, Bo Li · PDF
  26. Bridging the Gap between Theory of Mind and Action in LLMs

    Sehyeok Kang, Jihwan Oh, Se-Young Yun · PDF
  27. Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks

    Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh · PDF
  28. Characterizing Web Search in The Age of Generative AI

    Elisabeth Kirsten, Jost Große Perdekamp, Qinyuan Wu, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar · PDF
  29. ClawdPwned: Malicious Instructions in the OpenClaw AI Agent Skills repository

    Arjun Krishna · PDF
  30. CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

    Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader · PDF
  31. Context Inference Attacks Without Jailbreaks

    Prince Jha, Samuele Poppi, Nils Lukas · PDF
  32. Coordinating Coexisting Learning Agents in Shared Spectrum via Parameter Space Complementarity

    MD ASHIKUL HAQUE, Haibo Zhang, Abusayeed Saifullah · PDF
  33. CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents

    Kristen Pereira, Neelabh Sinha, Rajat Ghosh, Debojyoti Dutta · PDF
  34. CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

    Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad · PDF
  35. Critical Mass: Phase Transitions, Covert Coordination Detection, and Contagion Dynamics in Multi-Agent Systems

    Ben Jenkins · PDF
  36. CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

    Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Wenbo Guo, Jingxuan He, Dawn Song · PDF
  37. Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

    John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson · PDF
  38. Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders

    David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q Knight · PDF
  39. Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

    Caglar Yildirim · PDF
  40. Directional Embedding Smoothing for Robust Vision Language Models

    Ye Wang, Jing Liu, Toshiaki Koike-Akino · PDF
  41. DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou · PDF
  42. Echoing: Identity Failures when LLM Agents Talk to Each Other

    Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese · PDF
  43. Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

    Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan · PDF
  44. Efficient Tree-Structured Deep Research with Adaptive Resource Allocation

    Lunyiu Nie, Nedim Lipka, Ryan A. Rossi, Swarat Chaudhuri · PDF
  45. Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild

    Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku · PDF
  46. Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

    Hyunjun Kim · PDF
  47. ESDAE: Evaluating Synthetic Data for Agent Evaluation

    Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · PDF
  48. Evaluating LLM Judges in Cybersecurity Script Analysis

    Alexandra Daniela Damir, Apostu Alexandru-Mihai, Diana Bolocan, Andrei Preda, Ioana Croitoru, Mihaela Gaman, Laura Vasilie, Bilal Issa, Monica-Nicoleta Pascu · PDF
  49. Evo-Guard: Self-Evolving GNN Guardrails for Adaptive Safety in GUI Agents

    Yifei Song, Yilei Jiang, Yingshui Tan, Xiangyu Yue, Lian-Kuan Chen · PDF
  50. Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

    Xueyi Li, Zhuoneng Zhou, Zitao Liu, Yongdong WU · PDF
  51. Federated Agent Reinforcement Learning

    Canyu Chen, Kangyu Zhu, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li, Dawn Song · PDF
  52. FICO-BENCH: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale

    Jianhong Tu, Nicholas Crispino, Kyle Montgomery, Chenguang Wang, Dawn Song · PDF
  53. Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

    Shizhou Xu, Yuan Ni, Stefan Broecker, Thomas Strohmer · PDF
  54. Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

    Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik · PDF
  55. From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator

    Brian Grinstead, Mariana Meireles, Christoph Kerschbaumer, Cameron Allen · PDF
  56. General Agent Evaluation

    Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, Michal Jacovi, Leshem Choshen, Liat Ein-Dor, Yoav Katz, Michal Shmueli-Scheuer · PDF
  57. GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

    Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar · PDF
  58. GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

    Pepijn Cobben, X. Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin · PDF
  59. Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

    Dhruv Patel · PDF
  60. Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks

    Saurabh Dingwani, Ayan Banerjee, Sandeep Gupta · PDF
  61. Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

    Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy · PDF
  62. HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

    Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Sanmi Koyejo, Nigam Shah · PDF
  63. How does information access affect LLM monitors' ability to detect sabotage?

    Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward · PDF
  64. How LLMs Distort & Transform Our Language

    Marwa Abdulhai, Isadora White, Yanming Wan, Joel Z Leibo, Max Kleiman-Weiner, Natasha Jaques · PDF
  65. Human-Guided Harm Recovery for Computer Use Agents

    Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu · PDF
  66. Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

    Achyutha Menon, Magnus Saebo, Tyler Crosse, Spencer Gibson, Eyon Jang, Diogo Cruz · PDF
  67. Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

    Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler · PDF
  68. Large-scale online deanonymization with LLMs

    Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr · PDF
  69. Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

    Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Hassan Awadallah · PDF
  70. Leveraging RAG for Training-Free Alignment of LLMs

    John Timothy Halloran · PDF
  71. LLM Agentic System Safety Requires Hybrid Alignment

    Vincent Siu, Kyle Montgomery, Yujin Potter, Zhun Wang, Dawn Song, Chenguang Wang · PDF
  72. LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection

    Almog Hilel, Riddhi Bhagwat, Leshem Choshen, Idan Shenfeld, Jacob Andreas · PDF
  73. LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment

    Chen Bo Calvin Zhang, Christina Q Knight, Nicholas Kruus, Jason Hausenloy, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Zifan Wang, Seth Donoughe, Julian Michael · PDF
  74. LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBITRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF-REGULATING MEMORY ARCHITECTURES

    Akash Das · PDF
  75. Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

    Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, Seraphina Goldfarb-Tarrant · PDF
  76. Lost in the Noise: How Test-Time Reasoning Fails with Contextual Distractors

    Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo · PDF
  77. Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

    Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · PDF
  78. Measuring Agents in Production

    Melissa Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis · PDF
  79. META-GOVERNANCE ARCHITECTURES FOR MULTI-AGENT SYSTEM SAFETY, ALIGNMENT, GOVERNANCE, AND SECURITY

    Himanshu Joshi, Shivani Shukla, Sunita Kumari, Manas Joshi · PDF
  80. Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

    Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Colin Treleaven · PDF
  81. Model Agreement via Anchoring

    Eric Eaton, Surbhi Goel, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell · PDF
  82. More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

    Advait Yadav, Sidney Black, Oliver Sourbut · PDF
  83. NAAMSE: Framework for Evolutionary Security Evaluation of Agents

    Kunal Pai, Parth Shah, Harshil Patel · PDF
  84. NESSiE: The Necessary Safety Benchmark - Identifying Errors that should not Exist

    Johannes Bertram, Jonas Geiping · PDF
  85. NesyProAct: Proactive Neural-Symbolic Control for Web Agents

    Keyi Xiang, Tianyi Tang, Jie-Jing Shao, Yueming Lyu, Ivor Tsang, Yew-Soon Ong, Haiyan Yin · PDF
  86. No One Monitor Fits All: Oversight Strategies for Frontier Agents

    Neil Kale, Shashwat Saxena, Ziqian Zhong, Chen Henry Wu, Aditi Raghunathan · PDF
  87. Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback

    Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio · PDF
  88. Objective Misalignment in LLM-based Multi Agent Social Deception Game

    Marylou Fauchard, Florian Carichon, Margarida Carvalho, Golnoosh Farnadi · PDF
  89. Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

    Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier · PDF
  90. On Randomness in Agentic Evals

    Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · PDF
  91. OPENAPPS: SIMULATING ENVIRONMENT VARIATIONS TO MEASURE UI-AGENT RELIABILITY

    Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim · PDF
  92. OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

    Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong · PDF
  93. Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

    Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev · PDF
  94. Persuasion Attacks Can Decrease Effectiveness of CoT Monitoring

    Jennifer Za, Julija Bainiaksina, Nikita Ostrovsky, Tanush Chopra, Victoria Krakovna · PDF
  95. Physics-Guided Multimodal Multi-Agent Learning for Intelligent Transportation Systems

    Zhen Tian, Yaqiong Zhang, Zhihao Lin, Fujiang Yuan, Yijun Lu, Wangjie lang, Xinyu Wang, Ning Lyu, Zhiguo Tao, Kaijie Chen, Aaron Wang · PDF
  96. Position: Agentic Systems Should be General

    Elron Bandel, Asaf Yehudai, Alexandre Lacoste, Avijit Ghosh, Graham Neubig, Margaret Mitchell, Michal Shmueli-Scheuer, Leshem Choshen · PDF
  97. Position: AI Development Should Prioritize Cognitive Security

    Batu El, Shiye Su, Aneesh Pappu, Peggy Yin, Julie Heng, Eric Heng, Ryan Z Wang, Andreas Haupt, James Zou · PDF
  98. Position: Science is Collaborative—LLM for Science Should Be Too

    Terry Jingchen Zhang, Wenyuan Jiang, Yongjin Yang, Sirui Lu, Bernhard Schölkopf, Zhijing Jin · PDF
  99. Position: We Must Proactively Address AI Safety Debt

    Peter Wallich, Raymond Douglas · PDF
  100. PrefPO: Pairwise Preference Prompt Optimization

    Rahul Singhal, Pradyumna Tambwekar, Karime Maamari · PDF
  101. PriGuardAgent: Context-Aware Privacy Guardrails for Agentic Systems

    Chulin Xie, Amit Dhurandhar, Bo Li · PDF
  102. ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

    Lei Ding, Bin He, Chenguang Wang, Yang Liu · PDF
  103. Profit Is the Red Team: Stress-Testing Agents in Strategic Economic Interactions

    Shouqiao Wang, Marcello Politi, Samuele Marro, Davide Crapis · PDF
  104. Prover-Verifier Games for AI Control

    Joan Velja, Charlie Griffin, Alessandro Abate · PDF
  105. Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

    Subhadip Mitra · PDF
  106. Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

    David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin · PDF
  107. Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

    Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg · PDF
  108. Recalling Too Well: Sycophancy and Bias Amplification in Memory-Augmented Models

    Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel · PDF
  109. Reference-Guided Machine Unlearning

    Jonas Mirlach, Sonia Laguna, Julia E Vogt · PDF
  110. RepoMirage: Do Code Agents Really Understand Repository Structures?

    Hanyu Li, Yichi Zhang, Speed Zhu, Yinpeng Dong · PDF
  111. ResearchGym: Evaluating Language Model Agents on Real-World AI Research

    Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan · PDF
  112. RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations

    Manasi Sharma, Brad Kenstler, Bing Liu · PDF
  113. SafePro: Evaluating the Safety of Professional-Level AI Agents

    Kaiwen Zhou, Shreedhar Jangam, Ashwin Nagarajan, Tejas Polu, Suhas Oruganti, Chengzhi Liu, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xin Eric Wang · PDF
  114. Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

    Karan Gupta, Pranav Vajreshwari, Yash Pandya, Akshay Nambi, Ahmed Hassan Awadallah · PDF
  115. Scaling Agents for Computer Use

    Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang · PDF
  116. Script Kiddie Uplift: Measuring Procedural Misuse Amplification in AI Agents

    Zora Che, Julio Poveda, Aldana Belen Rodriguez, Yannis Yiming He, Chen Bo Calvin Zhang, Zifan Wang, Udari Madhushani Sehwag · PDF
  117. Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

    Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng · PDF
  118. SenseAct: Structuring GUI Actions for Reliable Planning and Verification

    Cai Hongtian, Tianyi Ma, Jie-Jing Shao, Tianyi Tang, Ivor Tsang, Yueming Lyu, Haiyan Yin · PDF
  119. Sound Agentic Science Requires Adversarial Experiments

    Dionizije Fa, Marko Čuljak · PDF
  120. SPARK: Spectral Perturbation based Adversarial Attacks for KGRAG Agents

    Aditya Saibewar, Aditya Ramesh, Shivam Bhardwaj, Jatin Chauhan, Manohar Kaul · PDF
  121. SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems — A Case Study on Ethereum Clients

    Masato Kamba, Akiyoshi Sannai · PDF
  122. Subliminal Signals in Preference Labels

    Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer · PDF
  123. Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

    Jacob Dang, Brian Yang Xie, Omar G. Younis · PDF
  124. Sweeping Promptable Spoofs under the DirtyRAG: A Practical, Query-Blind RAG Attack Done Right

    Shaochen Zhong, Jiamu Zhang, Hoang Anh Duy Le, Wenya Xie, Yifan Lu, Xintong Sun, Mohsen Hariri, Hongyi Liu, Guanchu Wang, Zhaozhuo Xu, Zirui Liu, Shuai Xu, Ning Xie, Li Li, Rui Chen, Ruixiang Tang, Xia Hu, Vipin Chaudhary · PDF
  125. T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

    Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Hayeon Lee, Seanie Lee, Sung Ju Hwang · PDF
  126. TamperBench: A Systematic Framework to Stress-Test LLM Safety Under Fine-Tuning and Tampering

    Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Nayeema Nonta, Matthew Kowal, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla · PDF
  127. TamperTest: A Framework for Testing Tamper Resistance in Open-Weight LLMs

    Isabel Dahlgren, Aashiq Muhamed · PDF
  128. The Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT

    Abhisek Dash, Soumi Das, Elisabeth Kirsten, Qinyuan Wu, Sai Keerthana Karnam, Krishna P. Gummadi, Thorsten Holz, Muhammad Bilal Zafar, Savvas Zannettou · PDF
  129. The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

    Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason E Weston, Hongyuan Zhan · PDF
  130. The Controllability Trap: A Governance Framework for Military AI Agents

    Subramanyam Sahoo · PDF
  131. The Reliability Gap in Agentic Evidence Verification for Materials Science

    Albert Gong, James J. Kim, Anmol Kabra, Aaditya Panigrahi, Jiashuo Wang, Arjun B. Mulchandani, Michael Freeman, Fatmagul Katmer, Joshua Peters Wakefield, Linxi Zhao, Chao Wan, Akanksha Sarkar, Yoav Artzi, Leslie M Schoop, John Thickstun, Kilian Q Weinberger, Eun-Ah Kim, Peter I. Frazier, Jennifer J. Sun · PDF
  132. The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

    Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, Pan Li · PDF
  133. Toward Reliable, Safe, and Secure LLMs for Scientific Applications

    Saket Sanjeev Chaturvedi, Joshua Bergersona, Tanwi Mallick · PDF
  134. Towards Predictive Models of Strategic Behaviour in Large Language Model Agents

    Jennifer Za, Aristeidis Panos, Jan Cuhel, Samuel Albanie · PDF
  135. Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

    Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, Christos Ziakas, Elliott Thornley · PDF
  136. TRADERBENCH: HOW ROBUST ARE AI AGENTS IN ADVERSARIAL CAPITAL MARKETS?

    Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, Jing Xiong · PDF
  137. TSR: Trajectory‑Search Rollouts for Multi‑Turn RL of LLM Agents

    Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · PDF
  138. Uncertainty Drives Social Bias Changes in Quantized Large Language Models

    Stanley Bryan Zamora Hua, Sanae Lotfi, Irene Y. Chen · PDF
  139. Uncertainty-Aware Self-Correction for Coding Agents

    Jason Almeida, Lokesh Sai Dasari, Anubhav Pal, Tinuade Adeleke, Sean Wu, Ruizhe Li · PDF
  140. Understanding Metacognition in Multi-Agent LLMs: Routing, Not Reasoning

    Mafizur Rahman, Lijun Qian · PDF
  141. Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning

    Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li · PDF
  142. VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

    Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen · PDF
  143. Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning

    Yunbei Zhang, Yingqiang Ge, Weijie Xu, Yuhui Xu, Jihun Hamm, Chandan K. Reddy · PDF
  144. W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents

    Xiaoqiang Lin, Jun Hao Liew, Silvio Savarese, Junnan Li · PDF
  145. When Agents Persuade: Rhetoric Generation and Mitigation in LLMs

    Julia Jose, Ritik Roongta, Rachel Greenstadt · PDF
  146. When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

    Max Fomin · PDF
  147. When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

    Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun · PDF
  148. When Fuzzing Becomes Agentic: Semantic State Exploration in the Wild

    Andrew Yin, Zhaoling Chen, Qian Zhang, Heng Yin · PDF
  149. Why Do Language Model Agents Whistleblow?

    Kushal Agrawal, Frank Xiao, Guido Ernesto Bergman, Asa Cooper Stickland · PDF
  150. ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

    Nancy Lau, Louis Sloot, Jyoutir Raj, Evan Harris, Giuseppe Marco Boscardin, Dan Zhao, Dylan Bowman, Mario Brajkovski, Jaideep Singh Chawla · PDF