
Trustworthy AI for Good (AI4GOOD) Workshop @ ICML 2026

AI4GOOD Workshop 2026

Submission deadline: May 10, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (187)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $\mathcal{D}^2$-Monitor: $\mathcal{D}$ynamic Safety Monitoring for $\mathcal{D}$iffusion LLMs via Hesitation-Aware Routing

    Aoxi Liu, Yupeng Chen, James Oldfield, Guan Zhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi · PDF
  2. A Generative Model of Contextual Integrity: Appropriate vs. Inappropriate Information Sharing

    Omer Kamal Ali Ebead, Juan Claude Formanek, Joel Z Leibo · PDF
  3. A Low-Rank Subspace Analysis of LLM Interventions

    Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu · PDF
  4. A Training-Dynamics View of Catastrophic Overfitting: Understanding and Prevention

    Jimin Yeom, Sungyoon Lee · PDF
  5. Adaptive Trimodal Fusion for Mental-Health Symptom Classification in Memes

    Arush Gumber · PDF
  6. Adversarial Review: Cooperative Code Review through Structured Disagreement

    Eric S. Qiu, Joyce Gill · PDF
  7. AI Governance in Social Work: A Triple Mandate-Informed Accountability Model

    Eunhye Ahn, Moon Choi, Claire R. McNellan, Lauri Goldkind · PDF
  8. AI-Mediated Communication Can Steer Collective Opinion

    Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter · PDF
  9. ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

    Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu · PDF
  10. Architecture Matters for Multi-Agent Security

    Ben Hagag, William L. Anderson, Christian Schroeder de Witt, Sarah Scheffler · PDF
  11. Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

    Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad, Joachim Schaeffer, Ram Potham, Tyler Tracy · PDF
  12. Attacking Medical Vision-Language Models with Query-Based Zero-Order Optimization

    Maxime Griot · PDF
  13. Attractor Inversion: A Geometric Account of Adversarial Manipulation in Human Decision-Making

    Leo Lorence George, Anushri Iyer, Abhishek Bakshi, Pavan Kulkarni · PDF
  14. Attractor States Emerge in Multi-Turn LLM Conversations

    Ting-Wen Ko, Jonas Geiping · PDF
  15. Auditable Bits or Covert Influence? Safe Revelation Complexity in Partially Observable Assistance Games

    Manoj Saravanan, Rohit Kumar Salla, Shrikar Reddy Kota · PDF
  16. Auditing Chain-of-Thought Faithfulness for Trustworthy AI: A Reproducible Corruption-Probe Protocol Across Eleven Frontier LLMs

    Ali Saffarini, Aram Bagdasarian · PDF
  17. Auditing Clinical Concept Fragmentation in Sparse Medical Vision–Language Representations

    Junah Jung, YeonGyu Han, Chang Min Park, Dongheon Lee · PDF
  18. Auditing Emotion-Vector-Steered Political Bias in Open-Weight LLMs

    Gabor Hollbeck, Baran Peters, Alexander von Recum, Kevin Riehl, Alec McGail, Julian Windeck, Kevin O'Sullivan, Robert Jakob · PDF
  19. Auditing LLMs for Hidden Behaviors using Model Diffing

    Atharv Naphade, Emil Ryd, Keshav Shenoy · PDF
  20. Auditing the Judge: Human-Grounded Bias Discovery, Quantification, and Mitigation in LLM Judges

    Hamin Koo, ChanJoo Jung, Fangzhao Wu, Jaehyung Kim · PDF
  21. Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

    Aryan Khurana, Aravind Ramana RN, Dhruv Kumar · PDF
  22. Balance Human Agency & AI Assistance in the Tussle for the "Right" to Choose, Own, Work, and Learn

    Zi-Yu Khoo, Yuriel Ryan, Nicole Heng Yim Oo, Hui En Pang, Eric J. W. Orlowski, Hakim Norhashim, Ruth Wan Theng Chew, Davin Choo, Rachael Hwee Ling Sim, Simon Chesterman, Jungpil Hahn, Bryan Kian Hsiang Low · PDF
  23. BarrierSteer: LLM Safety via Learning Barrier Steering

    Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao · PDF
  24. Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

    Joshua C. Yang, Maurice Flechtner, Damian Dailisan, Michiel A. Bakker · PDF
  25. BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

    Leonhard Waibl, Felix Michalak, Hadrien Mariaccia · PDF
  26. Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

    Xian Sun, Yingshuo Wang, Wei Gao, Lingdong Kong, Zexin Zhuang, Zhichao Fan, Wenlong Dong, Hrishikesh Paranjape, Zhiyuan Zheng · PDF
  27. Beyond Agreeable Chatbots: Context-Aware Safety Oversight for Trustworthy Patient-Facing LLMs

    Elham Nasarian, Abhilash Neog, Kwok-leung Tsui, Niyousha Hosseinichimeh · PDF
  28. Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

    Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu · PDF
  29. Beyond the Prompt: Leveraging Pre-Decoding States for Jailbreak Detection in dLLMs

    Adam Hazimeh, Amel Abdelraheem, Ke Wang, Mariam Salman, Ljiljana Dolamic, Gérôme Bovet, Pascal Frossard · PDF
  30. Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

    Abhilekh Borah · PDF
  31. Bridging the Gap Between Tort Law and Unforeseeable AI Errors

    Mitchell Chervu Johnston, Brian Matejek · PDF
  32. Can LLMs Contribute to Cooperative Fact-Checking? A Field Evaluation on X Community Notes

    Haiwen Li, Michiel A. Bakker · PDF
  33. Can LLMs deliberate? Benchmarking Collective Reasoning for Democratic AI Applications

    Maurice Flechtner · PDF
  34. CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

    Swapnil Parekh · PDF
  35. Capability Is Not Propensity: Measuring Pressure-Robust Cooperative Behavior in Civic LLM Agents

    Neel Tushar Shah, Manglam Kartik · PDF
  36. Certifying Robustness Large Language Models via Discrete-Continuous Randomized Smoothing

    Subin Jang, Jimin Yeom, Sungyoon Lee · PDF
  37. ChainMark: Model-Free LLM Watermarking with Closed-Form Calibration

    Chengheng Li Chen, Kyuhee Kim · PDF
  38. Closing the Welfare Outreach Gap: A Conversational Architecture and Cell-Level Eligibility Benchmark for Korean Welfare Recommendation

    Byeongmin Kang, Minwoo Han, Junhak Lee, Minsu Kim, Jihie Kim · PDF
  39. Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

    Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim · PDF
  40. Consensus‑Aware Bridge Maintenance Planning with Auditable Evidence and Multi‑Stakeholder AI Evaluation

    Takayuki Shinohara, Hidetaka Saomoto, J-katagiri · PDF
  41. Consistency Training Along the Transformer Stack

    Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa · PDF
  42. Context Over Content: Exposing Evaluation Faking in Automated Judges

    Manan Gupta, Inderjeet Jayakumar Nair, Lu Wang, Dhruv Kumar · PDF
  43. Contract Cards for Auditable Private Conformal Prediction

    Patrick Indri, Tamara Drucks, Georgios Spathoulas · PDF
  44. Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

    Abigail O'Neill, Alan Zhu, Mihran Miroyan, Narges Norouzi, Joseph E. Gonzalez · PDF
  45. CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

    Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin · PDF
  46. CPInj: Uncovering Prompt Injection Risks in Textual Collaborative Prompt Optimization

    Xinting Liao, Behnoosh Zamanlooy, Masoumeh Shafieinejad, D. B. Emerson, Ruinan Jin, Deval Pandya, Xiaoxiao Li · PDF
  47. Data Contradictions Are Uncertainty, Not Noise

    Adhiraj Chhoda · PDF
  48. Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

    Kevin Qinghong Lin, Batu El, Yuhong Shi, Pan Lu, Philip Torr, James Zou · PDF
  49. DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs

    Art Kanke · PDF
  50. Democratizing Agent Deployment Safety: A structural monitoring approach

    Preeti Ravindra, Rahul Tiwari · PDF
  51. DGN: Disagreement Graph Networks for Learning from Multiple Annotators

    Keyu Zhu · PDF
  52. Differential Auditing for Undesired Behavior

    Ishwar B Balappanawar, Venkata Hasith Vattikuti, Greta Kintzley, Ronan Azimi-Mancel, Satvik Golechha · PDF
  53. Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

    Phil Blandfort, Tushar Karayil, Alex McKenzie, Urja Pawar, Robert Graham, Dmitrii Krasheninnikov · PDF
  54. Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

    Shayan Talaei, Abhinav Chinta, Fnu Devvrit, Amin Karbasi, Azalia Mirhoseini, Amin Saberi · PDF
  55. DiTCarbon: Predictive Carbon Footprint Estimation for Diffusion Transformer Inference

    Jon Maravilla, Justin Kur, Kaiqi Zhao · PDF
  56. Do LLMs Follow Their Self-Reported Causal Graphs? A Graph-Contract Audit of Falsifiable Rationales for Trustworthy Decisions

    Amine M'Charrak, Thong Pham, Thomas Lukasiewicz, Yuxiao Dong, Shohei Shimizu · PDF
  57. Do LLMs Take Care of Their Own? Similarity Signals Can Induce Cooperation

    Akash Kundu, Emanuel Tewolde, Ratip Emin Berker, Samuel F. Brown, Vincent Conitzer · PDF
  58. Do Thinking Tokens Help with Safety?

    Narutatsu Ri, Abhishek Panigrahi, Sanjeev Arora · PDF
  59. Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents with Persona Attacks

    Arth Singh · PDF
  60. Efficient Safety Benchmarking via Item Response Theory

    Fabio Spagliardi, Mírian Silva, Ayan Datta, Aiden Zhou, Vamshi Krishna Bonagiri, Diogo Cruz · PDF
  61. Emergent Social Intelligence Risks in Generative Multi-Agent Systems

    Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V. Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang · PDF
  62. EmoPair: A New Paradigm for Measuring Emotional Affect

    Michael Leon Chrzan, Meghavarshini Krishnaswamy, Anika Alam, Bin Hu, Wenjing Gao, Jing Liu · PDF
  63. EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

    Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang · PDF
  64. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

    Yejun Yun, Eugene Koran, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Perez · PDF
  65. ESSA: Evolved Safety Specification Alignment

    Dongcho Park, Yeowon Jung, Sung Jun Cheon · PDF
  66. Eval Cooperativeness Mitigates Evaluation Gaming in LLMs

    Jasmine Xinze Li, Alexander Matt Turner · PDF
  67. Evaluating Cooperation in LLM Social Groups through Elected Leadership

    Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z Leibo, Zhijing Jin · PDF
  68. Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

    David Gros, Adam Gleave · PDF
  69. Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

    Vinith Menon Suriyakumar, Ayush Sekhari, Lena Stempfle, Robertson Wang, Michael Simpson, Rebecca S. Portnoff, Marzyeh Ghassemi, Ashia C. Wilson · PDF
  70. Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark

    Thibaud Gloaguen, Robin Staab, Mark Vero, Martin Vechev · PDF
  71. Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

    Lukas Struppek, Adam Gleave, Kellin Pelrine · PDF
  72. Fairness-Aware Low-Rank Representation Fine-Tuning

    Parameswaran Kamalaruban, Mark Anderson, Stuart Burrell, Maeve Madigan, Piotr Skalski, David Sutton · PDF
  73. Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

    Sangyeon Yoon, Wonje Jeung, Dongjae Jeon, Yoonjun Cho, Albert No · PDF
  74. Flag Game: Interpreting Decision Mechanisms of Bounded Social Agents

    Elizabeth Pavlova, Hidenori Tanaka · PDF
  75. Graph-Regularized Sparse Autoencoders for LLM Safety Steering

    Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri · PDF
  76. HABERMOLT: Delegating Deliberation to AI Representatives

    Joseph Low, Oscar Duys, Juan Claude Formanek, Michiel A. Bakker, Lewis Hammond · PDF
  77. Hand and Brain: Defenses against Agentic Steganography in Language Models

    Robert Krzyzanowski, Iván Arcuschin, Matthew Lee, Bryce Meyer, Georg Lange · PDF
  78. Hidden Commitment: When Language Models Silently Pick a Side and How Steering Can Surface It

    Samuel Dawit Assefa, Jae Won Cho · PDF
  79. Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench

    Vikhyath Kothamasu, Virginia Smith, Chhavi Yadav · PDF
  80. Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

    Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Niels Heinen, Jamie Hayes, Tianqi Fan, Luca Invernizzi, Martin Vechev · PDF
  81. Human-AI Collaborative Uncertainty Quantification

    Sima Noorani, Shayan Kiyani, George J. Pappas, Hamed Hassani · PDF
  82. I-Robot: Identifying Robotic and Human Motion in Humanoids

    Taehoon Kim, Jongwook Choi, Haeun Noh, Hwang Junyeup, Jongwon Choi · PDF
  83. Image Triaging for Budget-Aware Universal Attacks on Vision-Language Models

    Wei Yong Tan, Dongyue Lu, Wei Tsang Ooi · PDF
  84. In-Context Neurofeedback: Can Large Language Models Control Their Internal Representations through Privileged Access?

    Koshiro Aoki, Ryota Takatsuki, Gouki Minegishi, Yusuke Haruki, Daisuke Kawahara · PDF
  85. Innocuous-Seeming Data, Latent Ideology: Ideological Generalisation in Finetuned LLMs

    Robert Graham, Edward Stevinson, Yariv Barsheshat · PDF
  86. Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

    Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri · PDF
  87. Invisible Conflicts: Media Coverage Asymmetry and Categorical Failure in LLM Conflict Forecasting

    Poli Nemkova · PDF
  88. Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation

    Asim Mohamed, Martin Gubri · PDF
  89. IsoAct: Structure-Preserving Post-hoc Debiasing via Isometric Actions

    Sumin Park, Taero Kim, Subeen Park, Minhyeong Cho, Jong Hwan Lee, Kyungwoo Song · PDF
  90. Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Prompts

    Prarabdh Shukla, Ritik, Suhas Devraj Rao, Arpit Agarwal, Arjun Bhagoji · PDF
  91. Language Models Can Coarsely Modulate Entropy Under Instruction

    Luca Baroni, Kola Ayonrinde, Shi Feng, Puria Radmard · PDF
  92. Language Models can Learn High-Capacity Secure Steganography

    Georg Lange, Iván Arcuschin, Robert Krzyzanowski, Matthew Lee, Bryce Meyer · PDF
  93. LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

    Minju Gwak, Minseo Kwak, Guijin Son, Alan Ritter, Jaehyung Kim · PDF
  94. Learning from Self Critique and Refinement for Faithful LLM Summarization

    Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Cem Koc, Oncel Tuzel, Raviteja Vemulapalli · PDF
  95. LLM Persuasiveness Evaluation: A Structured Review of Automated Methods

    Kamile Dementaviciute, Guillaume Bied, Tijl De Bie · PDF
  96. Localizing Text Anonymization for Trustworthy AI: Extending RAT-Bench to Malaysian Microdata and PII

    David Hong Liang Chew, Zexi Yao, Nataša Krčo, Matthieu Meeus, Waqas Khalid Obeidy, Yves-Alexandre de Montjoye · PDF
  97. Making Open-Source Text LLM Watermarks Durable Against Merging

    Luisa Scharff, Thibaud Gloaguen, Robin Staab, Martin Vechev · PDF
  98. Making Visible, Making Invisible: How an AI Scribe Reshapes Documentation Authority in Social Work

    Eunki Joung, Hong Chul Nam, Geonhwi Hwang, Moohyun Lee, Hyun Kwon · PDF
  99. Making Your Action Policies Interpretable: Mixtures of Action Queries

    Suhyung Choi, Youngseok Joo, Hyundo Lee, Kyuhwan Shim, Kisung Shin, Chungwoo Lee, Minjeong Gu, Jun Ki Lee, Byoung-Tak Zhang · PDF
  100. Manipulation Is Task-Dependent: A Multi-Axis, Multi-Environment Evaluation of Frontier LLMs

    Adeeb Zaman, Erik Nordby, Fred Heiding · PDF
  101. Marking the Wrong Symptoms: Evaluating LLM Watermarks in Medical Texts

    Melanie Rieff, Robin Staab, Thibaud Gloaguen, Stefan Hegselmann, Martin Vechev · PDF
  102. Matching Ranks Over Probability Yields Truly Deep Safety Alignment

    Jason Vega, Gagandeep Singh · PDF
  103. Measuring Weak-to-Strong Legibility of Reasoning Models

    Dani Roytburg, Shreya Sridhar, Daphne Ippolito · PDF
  104. Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

    Xuanqiang Angelo Huang, Charlie Tharas, Samuele Marro, Van Q. Truong, Bernhard Schölkopf, Emanuele La Malfa, Zhijing Jin · PDF
  105. Mechanisms for Aggregated Individual Reporting Should be Established for Post-Deployment Evaluation

    Jessica Dai, Inioluwa Deborah Raji, Benjamin Recht, Irene Y. Chen · PDF
  106. Medical Model Synthesis Architectures: A Case Study

    Katherine M. Collins, Marlene Berke, Ilia Sucholutsky, Ayman Ali, Adrian Weller, Timothy J. O'Donnell, Tyler Brooke-Wilson, Lionel Wong, Joshua B. Tenenbaum · PDF
  107. Minionese: Comprehensive Benchmark and Mechanistic Study of Multilingual LLM Safety

    Ayushi Mehrotra, Chigozirim Ifebi, Brent Kong · PDF
  108. Mitigating Watermark Forgery in Generative Models via Randomized Key Selection

    Toluwani Aremu, Noor Hazim Hussein, Munachiso Samuel Nwadike, Samuele Poppi, Jie Zhang, Karthik Nandakumar, Neil Zhenqiang Gong, Nils Lukas · PDF
  109. MMDiff: Multimodal Model Diffing for Feature Discovery and Control

    Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff · PDF
  110. Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences

    Batu El, James Zou · PDF
  111. Multi-Agent AI Systems Need Institutional Design, Not Just Model-Level Alignment

    Van Q. Truong, Xuanqiang Angelo Huang, Erivan Inan, Ryan Faulkner, Joel Christoph, Terry Jingchen Zhang, David Guzman Piedrahita, Zhijing Jin · PDF
  112. Narrow Secret Loyalty Dodges Black-Box Audits

    Alfie Lamerton, Fabien Roger · PDF
  113. NEMO: Benchmarking Natural-Language Explanations of Vision Model Errors

    Nam Hyeon-Woo, Yoonsu Kim, Kihoon Son, Juho Kim, Tae-Hyun Oh · PDF
  114. NEST: Nascent Encoded Steganographic Thoughts

    Artem Karpov · PDF
  115. Norm Enforcement for AI Agents: Robustly Shaping Behavior in Multi-Agent Systems

    Yaowen Ye, Jacob Steinhardt · PDF
  116. Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback

    Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio · PDF
  117. Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

    Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham · PDF
  118. Operational Alignment: An Auditing Framework for Trustworthy AI in Consequential Decisions

    Anthony Cruz · PDF
  119. Optimizing Message-Driven Recruitment on Networks

    Tzeh Yuan Neoh, Davin Choo, Milind Tambe · PDF
  120. Persona‑Conditioned Adversarial Prompting (PCAP): Multi‑Identity Red‑Teaming for Enhanced Adversarial Prompt Discovery

    Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith · PDF
  121. PlainProbe: A Stable Cross-Entropy Baseline for Data-Scarce Deepfake Detection

    Youngjoon Cho, Soyoun Bang, Heechul Jung · PDF
  122. Plausible Deniability Guarantees for Whistleblowers

    Leo Richter, Matt J. Kusner · PDF
  123. PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

    Rohan Khetan, Ashna Khetan · PDF
  124. Position: Collaboration Between the City and the Machine Learning Community is Crucial to Efficient Autonomous Vehicles Routing

    Anastasia Psarou, Ahmet Onur Akman, Łukasz Gorczyca, Michał Hoffmann, Grzegorz Jamróz, Rafal Kucharski · PDF
  125. Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

    Xisen Jin, Michael Duan, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren · PDF
  126. Provably Optimal Learning Algorithms for Assistance Games

    Nivasini Ananthakrishnan, Mark Bedaywi, Michael I. Jordan, Stuart Russell, Nika Haghtalab · PDF
  127. Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

    Megha Srivastava, Jonathan Ouyang, Eric Ziyang Zhou, Andrew Silva, Emily Sumner, Dorsa Sadigh, Yuchen Cui, Deepak Edakkattil Gopinath, Guy Rosman · PDF
  128. Quantamination: Dynamic Quantization Can Leak Your Data Across the Batch

    Hanna Foerster, Ilia Shumailov, Cheng Zhang, Yiren Zhao, Jamie Hayes, Robert D. Mullins · PDF
  129. Quantifying Faithful Confidence Expression in Large Reasoning Models

    Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan · PDF
  130. Quantifying Risk of Epistemic Harm from the Use of AI Surrogates in Social Science Research

    Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff · PDF
  131. RAGEN-2: Reasoning Collapse in Agentic RL

    Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li · PDF
  132. RAVR-S: State-Sensitive Verification and Repair for Trustworthy Rule-Governed LLM Dialogue

    Yaroslav Pelekhov · PDF
  133. Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

    Arth Singh · PDF
  134. Reasoning Up the Instruction Ladder for Controllable Language Models

    Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · PDF
  135. REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

    Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu · PDF
  136. ReCord: Replay Coordination for Safe and Robust Population-Based Training in Autonomous Driving

    Hyeon-Chang Jeon, KyungJoong Kim · PDF
  137. Reimagining Meaningful Model Multiplicity

    Ira Globus-Harris, Nikhil Garg · PDF
  138. Retrieval Shift as a Source of Demographic Bias in Medical RAG

    Harim Lee, Dasom Lee · PDF
  139. Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

    Narges Babadi, Hadis Karimipour · PDF
  140. RLSpoofer: A Sample-Efficient Black-Box Spoofing Attack for Stress-Testing LLM Watermarks

    Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Wenbin Dai, Jieren Kuang, Shiyu Liang · PDF
  141. SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

    Qingni Wang, Yue Fan, Xin Eric Wang · PDF
  142. Safer by Diffusion, Broken by Context: Diffusion LLM’s Safety Blessing and Its Failure Mode

    Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu · PDF
  143. Safety Cost of Steering Vectors Is Separable and Reducible

    Yuxiao Li, Gjergji Kasneci · PDF
  144. Safety-Anchored Fine-Tuning: Diagnosing and Preventing Safety Collapse in Large Language Models via Adversarial Alignment Anchoring

    Mohammad Kadiri, Kush Patil · PDF
  145. Same Facts, Different Updates: Inference Setup Shapes LLM Behavior in Medical Allocation

    Spencer Gibson, Tyler Crosse, Magnus Saebo, Achyutha Menon, Eyon Jang, Diogo Cruz · PDF
  146. Scaling Trends for Lie Detector Oversight in Preference Learning

    Oskar John Hollinsworth, Ann-Kathrin Dombrowski, Sam Adam-Day, Adam Gleave, Chris Cundy · PDF
  147. Selective Safety Steering via Value-Filtered Decoding

    Bat-Sheva Einbinder, Hen Davidov, Yee Whye Teh, Yarin Gal, Yaniv Romano · PDF
  148. SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution

    Aojie Yuan, Yi Nian, Haiyue Zhang, Yue Zhao · PDF
  149. Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

    Edward Sun, Dmitrii Troitskii · PDF
  150. Social Choice Foundations for Simulation-Augmented Generation

    Sonja Kraiczy, Smitha Milli, Ratip Emin Berker, Avinandan Bose, Brandon Amos, Jamelle Watson-Daniels, Maximilian Nickel, Edith Elkind, Ariel D. Procaccia · PDF
  151. StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization

    Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu · PDF
  152. Steering LLMs to Assist Humans via Scalable Interactive Oversight

    Enyu Zhou, Zhiheng Xi, MaLong, Zhihao Zhang, Shihan Dou, Zhikai Lei, Guoteng Wang, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang · PDF
  153. StegoBench: Evaluating steganography potential in language models through supervised learning

    Bryce Meyer, Robert Krzyzanowski, Iván Arcuschin, Matthew Lee, Georg Lange · PDF
  154. Stop Reporting System-Level AI Reasoning as Individual Model Capability

    Adhiraj Chhoda · PDF
  155. Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

    Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin · PDF
  156. Structural Safety Generalisation in Agentic AI Setups

    Jake Gardner · PDF
  157. StylisticBias: A Few Human Visual Cues Drive Most Social Bias in MLLMs

    Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner · PDF
  158. Subliminal Transfer of Positional Biases in Language Models

    Jiahang He, Anya Singh, Cabrel Happi, Varun Nair, Vidyut Baradwaj, Jai Relan · PDF
  159. SURE: Judge-Aware Safety Update Review for Public-Interest LLM Deployment

    YeonGyu Han, Junah Jung, Dongheon Lee · PDF
  160. The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

    Nikolay Radev, Lennart J Haas, Benjamin Arnav, Pablo Bernabeu-Perez · PDF
  161. The Bottleneck in AI Governance: Evidence from 1,419 State Bills

    Mansur Ali Khan, Mehmet Efe Akengin, Osman Salahuddin, Ahmad A Rushdi · PDF
  162. The Broken Telephone Changes Tone: Examining Nuanced Linguistic Cues in LLM Chains-of-Translation

    Quang Minh Nguyen, Maida Aizaz, Braahmi Padmakumar · PDF
  163. The Character of Confabulation: Operationalizing a Clinical Typology for Reasoning-Mode Language Models

    Parichaye Grover, Vivek Kumar Sehgal · PDF
  164. The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

    Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari · PDF
  165. Three Years of r/ChatGPT: Societal Impact Evaluations from Social Media Data

    Jessica Dai, Sean Garcia, Emma Pierson, Benjamin Recht, Nika Haghtalab · PDF
  166. Tool-Framing Bypasses LLM Safety: Procedural Abstraction Reduces Refusal Rates by Up to 40 Percentage Points Across Models

    Kevin Power · PDF
  167. Toward Dealing with Unverbalized Eval Awareness

    Hieu M. Vu, William Saunders · PDF
  168. Toward Trustworthy LLM Router Ecosystems: Incentive-Compatible Cryptographic Mitigations

    Kwok Wai Lui, Shenbao Lu, Han Wu, Xin Yang, Wenyuan Jiang · PDF
  169. Towards Budget-Aware Agents: Do LLM Agents Know What They Will Spend?

    Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li · PDF
  170. Towards Predictive Models of Strategic Behaviour in Large Language Model Agents

    Jennifer Za, Aristeidis Panos, Jan Cuhel · PDF
  171. Training ML Models with Predictable Failures

    Will Schwarzer, Scott Niekum · PDF
  172. Treat Bias as Noise: Training Bias-Robust LLM Reasoning via Reinforcement Learning

    Qian Wang, Xuandong Zhao, Zirui Zhang, Zhanzhi Lou, Nuo Chen, Dawn Song, Bingsheng He · PDF
  173. Two Wrongs, No Right: Opposing Measurement Failures in LLM Annotators for Civic Discourse

    Varun Kotte · PDF
  174. Understanding Consistency Through Internal Representations in Large Vision-Language Models

    Thanh Quoc Hung Le, Quang H Nguyen, Hyeonjeong Ha, Zhenhailong Wang, Hoang Phan, Mohit Bansal, Heng Ji, Khoa D Doan · PDF
  175. Unmasking the Hidden Fairness, Bias, and Safety Costs of Compression with Mixture-of-Expert Models

    Elizabeth Szentmiklossy, Mike Lasby, Qiangqiang Mao, Shaina Raza, Yani Ioannou · PDF
  176. Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

    Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi · PDF
  177. Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems

    Eric S. Qiu, Joyce Gill · PDF
  178. WARP: Measuring and Mitigating Evaluation Awareness in Browser-Agent Safety Benchmarks

    Jasmine Xinze Li, Ashton Chew, Maxwell Lin, Eliot Krzysztof Jones, Xiaohan Fu, Andy Zou · PDF
  179. Watermarking for Proprietary Dataset Protection

    John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein · PDF
  180. Watershed: A Unified Benchmark for End-to-End Data Provenance Evaluation

    John Russell Himawan, Gregory Kang Ruey Lau, Bryan Kian Hsiang Low · PDF
  181. Weight-Level Defenses Improve LLM Agent Adversarial Robustness

    Mehmet Ozdincer, Samuel Simko, Bernhard Schölkopf, Zhijing Jin · PDF
  182. What do Uncertainty Lens tell about Emergent Misalignment?

    Aleksei Beglov, Daniil Korbut, Elena Tutubalina, Mikhail Seleznyov · PDF
  183. When Do Covert Channels Emerge? Probing Steganographic Capacity in Multimodal Agents via Diffusion VAEs Latents

    Joy Zheyun Yang, Tushar Nagar, Catherine Ge-Wang · PDF
  184. When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs

    Boris Marinov, Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu · PDF
  185. Where Do Agents Differ? Interpretable Rule Discovery for Performance Differences Across Models and Data

    Sascha Xu, Antoine Gautier, Jilles Vreeken · PDF
  186. Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

    Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner · PDF
  187. Widening the Gap: Exploiting LLM Quantization via Outlier Injection

    Xiaohua Zhan, Kazuki Egashira, Robin Staab, Mark Vero, Martin Vechev · PDF