Trustworthy AI for Good (AI4GOOD) Workshop @ ICML 2026
AI4GOOD Workshop 2026
- Submission deadline: May 10, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (187)
Fetched from OpenReview (v2) on 2026-06-10.
- $\mathcal{D}^2$-Monitor: $\mathcal{D}$ynamic Safety Monitoring for $\mathcal{D}$iffusion LLMs via Hesitation-Aware Routing
- A Generative Model of Contextual Integrity: Appropriate vs. Inappropriate Information Sharing
- A Low-Rank Subspace Analysis of LLM Interventions
- A Training-Dynamics View of Catastrophic Overfitting: Understanding and Prevention
- Adaptive Trimodal Fusion for Mental-Health Symptom Classification in Memes
- Adversarial Review: Cooperative Code Review through Structured Disagreement
- AI Governance in Social Work: A Triple Mandate-Informed Accountability Model
- AI-Mediated Communication Can Steer Collective Opinion
- ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
- Architecture Matters for Multi-Agent Security
- Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
- Attacking Medical Vision-Language Models with Query-Based Zero-Order Optimization
- Attractor Inversion: A Geometric Account of Adversarial Manipulation in Human Decision-Making
- Attractor States Emerge in Multi-Turn LLM Conversations
- Auditable Bits or Covert Influence? Safe Revelation Complexity in Partially Observable Assistance Games
- Auditing Chain-of-Thought Faithfulness for Trustworthy AI: A Reproducible Corruption-Probe Protocol Across Eleven Frontier LLMs
- Auditing Clinical Concept Fragmentation in Sparse Medical Vision–Language Representations
- Auditing Emotion-Vector-Steered Political Bias in Open-Weight LLMs
- Auditing LLMs for Hidden Behaviors using Model Diffing
- Auditing the Judge: Human-Grounded Bias Discovery, Quantification, and Mitigation in LLM Judges
- Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models
- Balance Human Agency & AI Assistance in the Tussle for the "Right" to Choose, Own, Work, and Learn
- BarrierSteer: LLM Safety via Learning Barrier Steering
- Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation
- BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems
- Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation
- Beyond Agreeable Chatbots: Context-Aware Safety Oversight for Trustworthy Patient-Facing LLMs
- Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
- Beyond the Prompt: Leveraging Pre-Decoding States for Jailbreak Detection in dLLMs
- Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies
- Bridging the Gap Between Tort Law and Unforeseeable AI Errors
- Can LLMs Contribute to Cooperative Fact-Checking? A Field Evaluation on X Community Notes
- Can LLMs deliberate? Benchmarking Collective Reasoning for Democratic AI Applications
- CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
- Capability Is Not Propensity: Measuring Pressure-Robust Cooperative Behavior in Civic LLM Agents
- Certifying Robustness Large Language Models via Discrete-Continuous Randomized Smoothing
- ChainMark: Model-Free LLM Watermarking with Closed-Form Calibration
- Closing the Welfare Outreach Gap: A Conversational Architecture and Cell-Level Eligibility Benchmark for Korean Welfare Recommendation
- Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models
- Consensus‑Aware Bridge Maintenance Planning with Auditable Evidence and Multi‑Stakeholder AI Evaluation
- Consistency Training Along the Transformer Stack
- Context Over Content: Exposing Evaluation Faking in Automated Judges
- Contract Cards for Auditable Private Conformal Prediction
- Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest
- CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas
- CPInj: Uncovering Prompt Injection Risks in Textual Collaborative Prompt Optimization
- Data Contradictions Are Uncertainty, Not Noise
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
- DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs
- Democratizing Agent Deployment Safety: A structural monitoring approach
- DGN: Disagreement Graph Networks for Learning from Multiple Annotators
- Differential Auditing for Undesired Behavior
- Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
- Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
- DiTCarbon: Predictive Carbon Footprint Estimation for Diffusion Transformer Inference
- Do LLMs Follow Their Self-Reported Causal Graphs? A Graph-Contract Audit of Falsifiable Rationales for Trustworthy Decisions
- Do LLMs Take Care of Their Own? Similarity Signals Can Induce Cooperation
- Do Thinking Tokens Help with Safety?
- Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents with Persona Attacks
- Efficient Safety Benchmarking via Item Response Theory
- Emergent Social Intelligence Risks in Generative Multi-Agent Systems
- EmoPair: A New Paradigm for Measuring Emotional Affect
- EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
- Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
- ESSA: Evolved Safety Specification Alignment
- Eval Cooperativeness Mitigates Evaluation Gaming in LLMs
- Evaluating Cooperation in LLM Social Groups through Elected Leadership
- Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
- Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
- Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark
- Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
- Fairness-Aware Low-Rank Representation Fine-Tuning
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
- Flag Game: Interpreting Decision Mechanisms of Bounded Social Agents
- Graph-Regularized Sparse Autoencoders for LLM Safety Steering
- HABERMOLT: Delegating Deliberation to AI Representatives
- Hand and Brain: Defenses against Agentic Steganography in Language Models
- Hidden Commitment: When Language Models Silently Pick a Side and How Steering Can Surface It
- Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench
- Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
- Human-AI Collaborative Uncertainty Quantification
- I-Robot: Identifying Robotic and Human Motion in Humanoids
- Image Triaging for Budget-Aware Universal Attacks on Vision-Language Models
- In-Context Neurofeedback: Can Large Language Models Control Their Internal Representations through Privileged Access?
- Innocuous-Seeming Data, Latent Ideology: Ideological Generalisation in Finetuned LLMs
- Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
- Invisible Conflicts: Media Coverage Asymmetry and Categorical Failure in LLM Conflict Forecasting
- Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation
- IsoAct: Structure-Preserving Post-hoc Debiasing via Isometric Actions
- Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Prompts
- Language Models Can Coarsely Modulate Entropy Under Instruction
- Language Models can Learn High-Capacity Secure Steganography
- LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
- Learning from Self Critique and Refinement for Faithful LLM Summarization
- LLM Persuasiveness Evaluation: A Structured Review of Automated Methods
- Localizing Text Anonymization for Trustworthy AI: Extending RAT-Bench to Malaysian Microdata and PII
- Making Open-Source Text LLM Watermarks Durable Against Merging
- Making Visible, Making Invisible: How an AI Scribe Reshapes Documentation Authority in Social Work
- Making Your Action Policies Interpretable: Mixtures of Action Queries
- Manipulation Is Task-Dependent: A Multi-Axis, Multi-Environment Evaluation of Frontier LLMs
- Marking the Wrong Symptoms: Evaluating LLM Watermarks in Medical Texts
- Matching Ranks Over Probability Yields Truly Deep Safety Alignment
- Measuring Weak-to-Strong Legibility of Reasoning Models
- Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI
- Mechanisms for Aggregated Individual Reporting Should be Established for Post-Deployment Evaluation
- Medical Model Synthesis Architectures: A Case Study
- Minionese: Comprehensive Benchmark and Mechanistic Study of Multilingual LLM Safety
- Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
- MMDiff: Multimodal Model Diffing for Feature Discovery and Control
- Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
- Multi-Agent AI Systems Need Institutional Design, Not Just Model-Level Alignment
- Narrow Secret Loyalty Dodges Black-Box Audits
- NEMO: Benchmarking Natural-Language Explanations of Vision Model Errors
- NEST: Nascent Encoded Steganographic Thoughts
- Norm Enforcement for AI Agents: Robustly Shaping Behavior in Multi-Agent Systems
- Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
- Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
- Operational Alignment: An Auditing Framework for Trustworthy AI in Consequential Decisions
- Optimizing Message-Driven Recruitment on Networks
- Persona‑Conditioned Adversarial Prompting (PCAP): Multi‑Identity Red‑Teaming for Enhanced Adversarial Prompt Discovery
- PlainProbe: A Stable Cross-Entropy Baseline for Data-Scarce Deepfake Detection
- Plausible Deniability Guarantees for Whistleblowers
- PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
- Position: Collaboration Between the City and the Machine Learning Community is Crucial to Efficient Autonomous Vehicles Routing
- Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
- Provably Optimal Learning Algorithms for Assistance Games
- Proximal State Nudging: Reducing Skill Atrophy from AI Assistance
- Quantamination: Dynamic Quantization Can Leak Your Data Across the Batch
- Quantifying Faithful Confidence Expression in Large Reasoning Models
- Quantifying Risk of Epistemic Harm from the Use of AI Surrogates in Social Science Research
- RAGEN-2: Reasoning Collapse in Agentic RL
- RAVR-S: State-Sensitive Verification and Repair for Trustworthy Rule-Governed LLM Dialogue
- Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
- Reasoning Up the Instruction Ladder for Controllable Language Models
- REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading
- ReCord: Replay Coordination for Safe and Robust Population-Based Training in Autonomous Driving
- Reimagining Meaningful Model Multiplicity
- Retrieval Shift as a Source of Demographic Bias in Medical RAG
- Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
- RLSpoofer: A Sample-Efficient Black-Box Spoofing Attack for Stress-Testing LLM Watermarks
- SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration
- Safer by Diffusion, Broken by Context: Diffusion LLM’s Safety Blessing and Its Failure Mode
- Safety Cost of Steering Vectors Is Separable and Reducible
- Safety-Anchored Fine-Tuning: Diagnosing and Preventing Safety Collapse in Large Language Models via Adversarial Alignment Anchoring
- Same Facts, Different Updates: Inference Setup Shapes LLM Behavior in Medical Allocation
- Scaling Trends for Lie Detector Oversight in Preference Learning
- Selective Safety Steering via Value-Filtered Decoding
- SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution
- Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates
- Social Choice Foundations for Simulation-Augmented Generation
- StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
- Steering LLMs to Assist Humans via Scalable Interactive Oversight
- StegoBench: Evaluating steganography potential in language models through supervised learning
- Stop Reporting System-Level AI Reasoning as Individual Model Capability
- Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust
- Structural Safety Generalisation in Agentic AI Setups
- StylisticBias: A Few Human Visual Cues Drive Most Social Bias in MLLMs
- Subliminal Transfer of Positional Biases in Language Models
- SURE: Judge-Aware Safety Update Review for Public-Interest LLM Deployment
- The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
- The Bottleneck in AI Governance: Evidence from 1,419 State Bills
- The Broken Telephone Changes Tone: Examining Nuanced Linguistic Cues in LLM Chains-of-Translation
- The Character of Confabulation: Operationalizing a Clinical Typology for Reasoning-Mode Language Models
- The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
- Three Years of r/ChatGPT: Societal Impact Evaluations from Social Media Data
- Tool-Framing Bypasses LLM Safety: Procedural Abstraction Reduces Refusal Rates by Up to 40 Percentage Points Across Models
- Toward Dealing with Unverbalized Eval Awareness
- Toward Trustworthy LLM Router Ecosystems: Incentive-Compatible Cryptographic Mitigations
- Towards Budget-Aware Agents: Do LLM Agents Know What They Will Spend?
- Towards Predictive Models of Strategic Behaviour in Large Language Model Agents
- Training ML Models with Predictable Failures
- Treat Bias as Noise: Training Bias-Robust LLM Reasoning via Reinforcement Learning
- Two Wrongs, No Right: Opposing Measurement Failures in LLM Annotators for Civic Discourse
- Understanding Consistency Through Internal Representations in Large Vision-Language Models
- Unmasking the Hidden Fairness, Bias, and Safety Costs of Compression with Mixture-of-Expert Models
- Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
- Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems
- WARP: Measuring and Mitigating Evaluation Awareness in Browser-Agent Safety Benchmarks
- Watermarking for Proprietary Dataset Protection
- Watershed: A Unified Benchmark for End-to-End Data Provenance Evaluation
- Weight-Level Defenses Improve LLM Agent Adversarial Robustness
- What do Uncertainty Lens tell about Emergent Misalignment?
- When Do Covert Channels Emerge? Probing Steganographic Capacity in Multimodal Agents via Diffusion VAEs Latents
- When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs
- Where Do Agents Differ? Interpretable Rule Discovery for Performance Differences Across Models and Data
- Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
- Widening the Gap: Exploiting LLM Quantization via Outlier Injection