ICLR 2026 · Past · Safety & alignment · Interpretability

ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities

ICLR 2026 Trustworthy AI

Submission deadline
Feb 3, 2026, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (144)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

    Usman Anwar, Julianna Piskorz, David D. Baek, David Demitri Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger
  2. A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

    Harry Mayne, Justin Singh Kang, Dewi Sid William Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel
  3. AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments

    Renukanandan Tumu, Aditya Singh, Rahul Mangharam
  4. Agentic Uncertainty Reveals Agentic Overconfidence

    Jean Kaddour, Srijan Patel, Gbetondji Jean-Sebastien Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
  5. AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

    Emma Gouné, Akshat Naik, Patrick Quinn, Guillermo Bosch, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young
  6. Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks

    Kevin Lee, Duncan Halverson, Pablo Andres Millan Arias
  7. Attention Sinks in Diffusion Language Models

    Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
  8. Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution

    Zixun Luo, Yuhang Fan, Hengyu Lin, Youzhi Zhang
  9. AutoBaxBuilder: Bootstrapping Code Security Benchmarking

    Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev
  10. Backdoor Attacks on Decentralised Post-Training

    Oguzhan Ersoy, Nikolay Blagoev, Jona te Lintelo, Stefanos Koffas, Marina Krček, Stjepan Picek
  11. BackFed: A Standardized and Efficient Benchmark Framework for Backdoor Attacks in Federated Learning

    Thinh Dao, Thuy Dung Nguyen, Khoa D Doan, Kok-Seng Wong
  12. BarrierSteer: LLM Safety via Learning Barrier Steering

    Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao
  13. Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks

    Guido Freire, Agustín E. Martínez-Suñé, Viviana Cotik
  14. Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

    Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Ruishan Liu
  15. Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection

    Jason Kong, Lanxiang Hu, Flavio Ponzina, Tajana Rosing
  16. Black-box Optimization of LLM Outputs by Asking for Directions

    Jie Zhang, Meng Ding, Yang Liu, Jue Hong, Florian Tramèr
  17. Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models

    Sara Matijevic, Christopher Yau
  18. Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

    Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee
  19. BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration

    Aishwarya Mandyam, Wenhui Sophia Lu, Wing Hung Wong, John Duchi, Barbara E Engelhardt
  20. Byzantine Machine Learning: MultiKrum and an Optimal Notion of Robustness

    Gilles Bareilles, Wassim Bouaziz, Julien Fageot, El-Mahdi El-Mhamdi
  21. Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

    Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano
  22. Causal Analysis of Representation Drift for Robust Deployment

    Thomas Y Chen, Daniel Xu
  23. Closing the Distribution Gap in Adversarial Training for LLMs

    Chengzhi Martin Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
  24. Collaborative Threshold Watermarking

    Tameem Bakr, Anish Ambreth, Nils Lukas
  25. Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

    Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba
  26. Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

    Seonglae Cho, Zekun Wu, Adriano Koshiyama
  27. Deception in Dialogue: Evaluating and Mitigating Deceptive Behavior in Large Language Models

    Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
  28. Delta-Crosscoder: Robust Crosscoder in Narrow Fine-Tuning Regimes

    Aly M. Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi
  29. Diff Mining: Logit Differences Reveal Finetuning Objectives

    Greg Kocher, Robert West, Clément Dumas, Julian Minder
  30. Digging Deeper: Learning Multi-Level Concept Hierarchies

    Oscar Hill, Mateo Espinosa Zarlenga, Mateja Jamnik
  31. Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

    Kundan Krishna, Joseph Yitan Cheng, Charles Maalouf, Leon Alexander Gatys
  32. Disentangling goal and framing for detecting LLM jailbreaks

    Amirhossein Farzam, Majid Behbahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro
  33. DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

    Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
  34. Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making

    Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder
  35. Dual-Objective Reinforcement Learning with novel Hamilton-Jacobi-Bellman formulations

    William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Uday Shinde, Sylvia Herbert
  36. Efficient Refusal Ablation in LLM through Optimal Transport

    Geraldin Nanfack, Elvis Dohmatob
  37. Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

    Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan
  38. Enabling Preference-driven Unlearning in Few-step Distilled Text-to-Image Diffusion Models

    Gaurav Patel, Jun Fang, Greg Ver Steeg, Qiang Qiu, Sravan Sripada
  39. Endogenous Resistance to Activation Steering in Language Models

    Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Michael Vaiana, Diogo S de Lucena, Judd Rosenblatt, Michael S. A. Graziano
  40. Enhancing Deep Neural Network Reliability with Refinement and Calibration

    Ramya Hebbalaguppe, K.N Ajay Shastry, Soumya Suvra Ghosal, Chetan Arora
  41. Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-tuning

    Ranganath Krishnan, Piyush Khanna, Omesh Tickoo
  42. Evolving Safety Landscape of Multi-modal Large Language Models: A Survey of Emerging Threats and Safeguards

    Xi Li, Shu Zhao, Xiaohan Zou, Fei Zhao, Fuxiao Liu, Yusen Zhang, Cheng Han, Yushun Dong, Jiaqi Wang
  43. Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

    Ajinkya Mohgaonkar, Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan Günnemann
  44. Expert Selections In MoE Models Reveal (Almost) As Much As Text

    Amir Nuriyev, Gabriel Kulp
  45. Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration

    Dongkyu Cho, Miao Zhang, Gregory D Lyng, Rumi Chunara
  46. Explainability Is Not a Feature: A Position on Trustworthy AI

    Gabriel Banaggia, Eduardo Soares, Renato Cerqueira, Emilio Vital Brazil, Simone Barbosa
  47. Explaining Grokking in Transformers through the Lens of Inductive Bias

    Jaisidh Singh, Diganta Misra, Antonio Orvieto
  48. Fairness Failure Modes of Multimodal LLMs

    Canyu Chen, Anglin Cai, Joan Nwatu, Jianshu Zhang, Yale Li, Han Liu, Jessica Hullman, Rada Mihalcea, Kathleen McKeown, Manling Li
  49. Fault-Tolerant Preference Alignment via Multi-Agent Verification

    Elias Hossain, Maryam Rahimimovassagh, Subash Neupane, Mohammad Jahid Ibna Basher, Ivan Garibay, Niloofar Yousefi
  50. Federated Agent Reinforcement Learning

    Canyu Chen, Kangyu Zhu, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li, Dawn Song
  51. FedGraph: Defending Federated Large Language Model Fine-Tuning Against Backdoor Attacks via Graph-Based Aggregation

    Xi Chen, Chunyi Zhou, Rui Zeng, Xiaogang Xu, Zhe Liu, Shouling Ji
  52. Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models

    Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
  53. Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

    Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal
  54. From Data to Behavior: Predicting Unintended Model Behaviors Before Training

    Mengru Wang, Zhenqian Xu, Junfeng Fang, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang
  55. Frontier Models Can Take Actions at Low Probabilities

    Alex Serrano, Wen Xing, David Lindner, Erik Jenner
  56. Geometry-Aware Crossover for Effective and Efficient Evolutionary Attacks

    Hyo Seo Kim, Gang Luo, Can Chen, Binghui Wang, Yue Duan, Ren Wang
  57. GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

    Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar
  58. Google's LLM Watermarking System is Vulnerable to Layer Inflation Attack

    Romina Omidi, Yun Dong, Binghui Wang
  59. GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

    Pepijn Cobben, X. Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin
  60. GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video

    Zhenhao Zhu, Yue Liu, Yanpei Guo, Wenjie Qu, Cancan Chen, Yufei He, Yibo Li, Yulin Chen, Tianyi Wu, Huiying Xu, Xinzhong Zhu, Jiaheng Zhang
  61. Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning

    JinShan Liu, Ken Li, Jiazhe Wei, Bin Shi, Bo Dong
  62. Hierarchical Retrieval at Scale: Bridging Transparency and Efficiency

    Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
  63. How does information access affect LLM monitors' ability to detect sabotage?

    Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward
  64. Human-Guided Harm Recovery for Computer Use Agents

    Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu
  65. Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

    Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
  66. Improving Semantic Uncertainty Quantification in Question Answering via Token-Level Temperature Scaling

    Tom A. Lamb, Desi R. Ivanova, Philip Torr, Tim G. J. Rudner
  67. Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

    Ariel Fogel, Omer Hofman, Eilon Cohen, Roman Vainshtein
  68. Inference-Time Safety for Code LLMs via Retrieval-Augmented Revision

    Manisha Mukherjee, Vincent Josua Hellendoorn
  69. Instruction Following by Principled Attention Boosting of Large Language Models

    Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong
  70. Investigating Data Interventions for Subgroup Fairness: An ICU Case Study

    Erin Tan, Judy Hanwen Shen, Irene Y. Chen
  71. Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

    Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini
  72. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

    Linh Le, David Williams-King, Mohamed Amine Merzouk, Aton Kamanda, Adam Oberman
  73. Learn to be Unlearned: Optimizing Language Models for Unlearning via Clustered Gradient Routing

    Vincent Hanke, Jing Xu, Martin Pawelczyk, Michael Backes, Adam Dziedzic, Franziska Boenisch
  74. Learning Minimal Contexts: How Chain-of-Thought Induces Out-of-Distribution Generalization

    Yu Wang, Fu-Chieh Chang, Pei-Yuan Wu
  75. Leveraging RAG for Training-Free Alignment of LLMs

    John Timothy Halloran
  76. Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

    Ji Qi, Mingxiao Liu, Viet Ho Tam Thuc Do, Yuzhe Li, Zhuoshi Pan, Gene Cheung, H. Vicky Zhao
  77. LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

    Praney Goyal, Marcel Mateos Salles, Pradyut Sekhsaria, Hai Huang, Randall Balestriero
  78. MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

    Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Ruizhe Li, Maheep Chaudhary
  79. Memorization Dynamics in Knowledge Distillation for Language Models

    Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, David A. Smith, Zheng Xu, Diego Garcia-Olano
  80. Mitigating Legibility Tax with Decoupled Prover-Verifier Games

    Yegon Kim, Juho Lee
  81. Mitigating Reward Hacking with RL Training Interventions

    Aria Wong, Joshua Engels, Neel Nanda
  82. MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks

    Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji
  83. Model Organisms for Generalization Resistance Under Distribution Shift

    Jou Barzdukas, Jack Peck, Julian Schulz, Paulius Rauba, Lennie Wells
  84. Monitoring Emergent Reward Hacking During Generation via Internal Activations

    Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
  85. Moral Preferences of LLMs Under Directed Contextual Influence

    Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
  86. Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

    Max McGuinness, Alex Serrano, Luke Bailey, Scott Emmons
  87. No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

    Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
  88. No One Monitor Fits All: Oversight Strategies for Frontier Agents

    Neil Kale, Shashwat Saxena, Ziqian Zhong, Chen Henry Wu, Aditi Raghunathan
  89. Nonparametric Variational Differential Privacy via Embedding Parameter Clipping

    Dina El Zein, Shashi Kumar, James Henderson
  90. Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

    Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier
  91. OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation

    Aarush Aggarwal, Akshat Tomar, Amritanshu Tiwari, Sargam Goyal
  92. On the Effects of Adversarial Perturbations on Distribution Robustness

    Yipei Wang, Zhaoying Pan, Xiaoqian Wang
  93. Paranoid Monitors: How Long Context Breaks LLM Agent Supervision

    Alicia Yang, Aashiq Muhamed, Mona T. Diab, Virginia Smith
  94. Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models

    Huzaifa Arif, Pin-Yu Chen, Keerthiram Murugesan, Alex Gittens, Payel Das, Ching-Yun Ko
  95. Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

    Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev
  96. Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity

    Rachel Lawrence, Jacqueline R. M. A. Maasch
  97. Post-hoc Stochastic Concept Bottleneck Models

    Wiktor Hoffmann, Sonia Laguna, Moritz Vandenhirtz, Emanuele Palumbo, Julia E Vogt
  98. Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

    Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri
  99. Prototype-Based Selective Prediction for Multimodal Instruction Models

    Eduardo Soares, Emilio Vital Brazil, Plamen P Angelov, Victor Y. Shirasuna, Renato Cerqueira
  100. Query Circuits: Explaining How Language Models Answer User Prompts

    Tung-Yu Wu, Fazl Barez
  101. RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

    Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
  102. RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

    Quy-Anh Dang, Chris Ngo, Truong-Son Hy
  103. Representational de-collapse: Interactions between supervised finetuning and in-context learning in language models

    Abrar Elidrisi, Andrew M Saxe, Jin Hwa Lee, Basile Confavreux
  104. Robust AI Evaluation through Maximal Lotteries

    Hadi Khalaf, Serena Lutong Wang, Daniel Halpern, Itai Shapira, Flavio Calmon, Ariel D. Procaccia
  105. Robust Feature Attribution via Integrated Sensitivity Gradients

    Rukmangadh Sai Myana, Sumit Kumar Jha, Yanzhao Wu
  106. Robust Object Detection via Kronecker Tensor Decomposition: Theory, Algorithms, and Applications

    Salman Ahmadi-Asl, Roman Garaev, Hamidreza Behjoo, Asad Masood Khattak, Manuel Mazzara
  107. RouterInterp: Superposed Specialisation in MoE Routing

    Ilya Lasy, Nora Yinuo Cai, Kola Ayonrinde
  108. SafeGuide: Adaptive Inference-Time Safety Control for Diffusion Models

    Tong Zhou, Juyang Bai, Xiaolin Xu, Shaolei Ren
  109. SafetyPairs: Isolating Safety Critical Image Features With Counterfactual Image Generation

    Alec Helbling, Shruti Palaskar, Kundan Krishna, Duen Horng Chau, Leon Alexander Gatys, Joseph Yitan Cheng
  110. SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

    Vegard Flovik
  111. Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection

    Lin Yulong, Pablo Bernabeu-Perez, Benjamin Arnav, Lennie Wells, Mary Phuong
  112. Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles

    Xinzhu Liang, Joseph Lukens, Sanjaya Lohani, Thomas A. Searles, Brian T. Kirby, Xin Qiu, Kody J. H. Law
  113. Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

    Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
  114. Selective Disclosure: Controlling Information Leakage in DocVQA Explanations

    Kangsoo Jung, Mohamed Ali Souibgui, Changkyu Choi, Catuscia Palamidessi
  115. Simple LLM Baselines are Competitive for Model Diffing

    Elias Kempf, Simon Schrodi, Bartosz Cywiński, Thomas Brox, Neel Nanda, Arthur Conmy
  116. Sparse Circuits of Vision Language Alignment

    Huizhen Shu, Xuying Li
  117. Stability-Aware Prompt Optimization for Clinical Data Abstraction

    Arinbjörn Kolbeinsson, Daniel R. Timbie, Sajjan Narsinghani, Sanjay Hariharan
  118. Stress-Testing Alignment Audits with Prompt-Level Strategic Deception

    Oliver Daniels, Benjamin M. Marlin, Perusha Moodley, David Lindner
  119. SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection

    Nasimeh Heydaribeni, Ruisi Zhang, Tara Javidi, Cristina Nita-Rotaru, Farinaz Koushanfar
  120. Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

    Jacek Duszenko
  121. Test-Time Training Undermines Existing Safety Guardrails

    Simone Antonelli, Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski
  122. ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

    Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat
  123. The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

    Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason E Weston, Hongyuan Zhan
  124. The Realignment Problem: When Right becomes Wrong in LLMs

    Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Vivek Srivastava, Shirish Karande, Murari Mandal
  125. The Rogue Scalpel: Activation Steering Compromises LLM Safety

    Anton Korznikov, Andrey V. Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
  126. The Semantic Imprinting Hypothesis: How Semantic Watermarks Survive Prompt-based Editing

    Sung Ju Lee, Nam Ik Cho
  127. Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

    Bethan Evans, Jared Tanner
  128. Tightening Optimality Gap with Confidence Through Conformal Prediction

    Miao Li, Michael Klamkin, Russell Bent, Pascal Van Hentenryck
  129. Towards Statistical Verification for Trustworthy AI

    Blossom Metevier, Max Springer, Bohdan Turbal, Aleksandra Korolova
  130. Training with Honeypots: Reshaping How LLMs Fail

    Samuel Simko, Punya Syon Pandey, Zhijing Jin, Bernhard Schölkopf
  131. TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model

    Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang
  132. Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

    Ziyuan Chen, Yujin Jeong, Tobias Braun, Anna Rohrbach
  133. Uncertainty Drives Social Bias Changes in Quantized Large Language Models

    Stanley Bryan Zamora Hua, Sanae Lotfi, Irene Y. Chen
  134. Understanding Adversarial Transfer Across Modalities: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

    Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo
  135. Understanding Empirical Unlearning with Combinatorial Interpretability

    Shingo Kodama, Niv Cohen, Micah Adler, Nir N Shavit
  136. Unifying Perspectives on Learning Biases: A Data-Centric Intervention for Holistic Fairness, Robustness, and Generalization

    Patrick Vincent, Innocent Nyalala
  137. Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

    Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
  138. Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models

    Sidney Bender, Marco Morik
  139. VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

    Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han
  140. Watermarking Discrete Diffusion Language Models

    Avi Bagchi, Akhil Bhimaraju, Moulik Choraria, Daniel Alabi, Lav R. Varshney
  141. When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

    Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li
  142. When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

    Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang
  143. When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

    Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li
  144. Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

    Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang