ICLR 2026 | Past | Safety & alignment | Interpretability
ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities
ICLR 2026 Trustworthy AI
- Submission deadline: Feb 3, 2026, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (144)
Fetched from OpenReview (v2) on 2026-06-10.
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
- A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
- AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments
- Agentic Uncertainty Reveals Agentic Overconfidence
- AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
- Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks
- Attention Sinks in Diffusion Language Models
- Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution
- AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- Backdoor Attacks on Decentralised Post-Training
- BackFed: A Standardized and Efficient Benchmark Framework for Backdoor Attacks in Federated Learning
- BarrierSteer: LLM Safety via Learning Barrier Steering
- Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks
- Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
- Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection
- Black-box Optimization of LLM Outputs by Asking for Directions
- Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models
- Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
- BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration
- Byzantine Machine Learning: MultiKrum and an Optimal Notion of Robustness
- Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
- Causal Analysis of Representation Drift for Robust Deployment
- Closing the Distribution Gap in Adversarial Training for LLMs
- Collaborative Threshold Watermarking
- Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
- Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
- Deception in Dialogue: Evaluating and Mitigating Deceptive Behavior in Large Language Models
- DELTA-CROSSCODER: ROBUST CROSSCODER IN NARROW FINE-TUNING REGIMES
- Diff Mining: Logit Differences Reveal Finetuning Objectives
- Digging Deeper: Learning Multi-Level Concept Hierarchies
- Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
- Disentangling goal and framing for detecting LLM jailbreaks
- DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
- Dual-Objective Reinforcement Learning with novel Hamilton-Jacobi-Bellman formulations
- Efficient Refusal Ablation in LLM through Optimal Transport
- Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
- Enabling Preference-driven Unlearning in Few-step Distilled Text-to-Image Diffusion Models
- Endogenous Resistance to Activation Steering in Language Models
- Enhancing Deep Neural Network Reliability with Refinement and Calibration
- Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-tuning
- Evolving Safety Landscape of Multi-modal Large Language Models: A Survey of Emerging Threats and Safeguards
- Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
- Expert Selections In MoE Models Reveal (Almost) As Much As Text
- Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
- Explainability Is Not a Feature: A Position on Trustworthy AI
- Explaining Grokking in Transformers through the Lens of Inductive Bias
- Fairness Failure Modes of Multimodal LLMs
- Fault-Tolerant Preference Alignment via Multi-Agent Verification
- Federated Agent Reinforcement Learning
- FedGraph: Defending Federated Large Language Model Fine-Tuning Against Backdoor Attacks via Graph-Based Aggregation
- Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
- Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models
- From Data to Behavior: Predicting Unintended Model Behaviors Before Training
- Frontier Models Can Take Actions at Low Probabilities
- Geometry-Aware Crossover for Effective and Efficient Evolutionary Attacks
- GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
- Google's LLM Watermarking System is Vulnerable to Layer Inflation Attack
- GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
- GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
- Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning
- Hierarchical Retrieval at Scale: Bridging Transparency and Efficiency
- How does information access affect LLM monitors' ability to detect sabotage?
- Human-Guided Harm Recovery for Computer Use Agents
- Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
- Improving Semantic Uncertainty Quantification in Question Answering via Token-Level Temperature Scaling
- Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
- INFERENCE-TIME SAFETY FOR CODE LLMS VIA RETRIEVAL-AUGMENTED REVISION
- Instruction Following by Principled Attention Boosting of Large Language Models
- Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
- Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning
- Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
- Learn to be Unlearned: Optimizing Language Models for Unlearning via Clustered Gradient Routing
- Learning Minimal Contexts: How Chain-of-Thought Induces Out-of-Distribution Generalization
- Leveraging RAG for Training-Free Alignment of LLMs
- Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast
- LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
- Memorization Dynamics in Knowledge Distillation for Language Models
- Mitigating Legibility Tax with Decoupled Prover-Verifier Games
- Mitigating Reward Hacking with RL Training Interventions
- MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks
- Model Organisms for Generalization Resistance Under Distribution Shift
- MONITORING EMERGENT REWARD HACKING DURING GENERATION VIA INTERNAL ACTIVATIONS
- Moral Preferences of LLMs Under Directed Contextual Influence
- Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
- No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
- No One Monitor Fits All: Oversight Strategies for Frontier Agents
- Nonparametric Variational Differential Privacy via Embedding Parameter Clipping
- Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
- OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation
- On the Effects of Adversarial Perturbations on Distribution Robustness
- Paranoid Monitors: How Long Context Breaks LLM Agent Supervision
- Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models
- Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
- Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity
- Post-hoc Stochastic Concept Bottleneck Models
- Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
- Prototype-Based Selective Prediction for Multimodal Instruction Models
- Query Circuits: Explaining How Language Models Answer User Prompts
- RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
- RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
- Representational de-collapse: Interactions between supervised finetuning and in-context learning in language models
- Robust AI Evaluation through Maximal Lotteries
- Robust Feature Attribution via Integrated Sensitivity Gradients
- Robust Object Detection via Kronecker Tensor Decomposition: Theory, Algorithms, and Applications
- RouterInterp: Superposed Specialisation in MoE Routing
- SafeGuide: Adaptive Inference-Time Safety Control for Diffusion Models
- SafetyPairs: Isolating Safety Critical Image Features With Counterfactual Image Generation
- SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
- Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection
- Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles
- Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
- Selective Disclosure: Controlling Information Leakage in DocVQA Explanations
- Simple LLM Baselines are Competitive for Model Diffing
- Sparse Circuits of Vision Language Alignment
- Stability-Aware Prompt Optimization for Clinical Data Abstraction
- Stress-Testing Alignment Audits with Prompt-Level Strategic Deception
- SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection
- Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
- Test-Time Training Undermines Existing Safety Guardrails
- ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
- The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
- The Realignment Problem: When Right becomes Wrong in LLMs
- The Rogue Scalpel: Activation Steering Compromises LLM Safety
- The Semantic Imprinting Hypothesis: How Semantic Watermarks Survive Prompt-based Editing
- Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks
- TIGHTENING OPTIMALITY GAP WITH CONFIDENCE THROUGH CONFORMAL PREDICTION
- Towards Statistical Verification for Trustworthy AI
- Training with Honeypots: Reshaping How LLMs Fail
- TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model
- Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
- Uncertainty Drives Social Bias Changes in Quantized Large Language Models
- Understanding Adversarial Transfer Across Modalities: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
- Understanding Empirical Unlearning with Combinatorial Interpretability
- Unifying Perspectives on Learning Biases: A Data-Centric Intervention for Holistic Fairness, Robustness, and Generalization
- Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
- Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models
- VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models
- Watermarking Discrete Diffusion Language Models
- When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
- When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
- When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics