ICML 2025 · Past · Large language models · Fairness & ethics
ICML 2025 Workshop on Reliable and Responsible Foundation Models
ICML 2025 R2-FM Workshop
- Submission deadline: May 31, 2025, 12:01 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (120)
Fetched from OpenReview (v2) on 2026-06-10.
- (Im)possibility of Automated Hallucination Detection in Large Language Models
- A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
- A Statistical Physics of Language Model Reasoning
- A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Accountability Attribution: Tracing Model Behavior to Training Processes
- Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
- Advancing LLM Safe Alignment with Safety Representation Ranking
- Adversarial Manipulation of Reasoning Models using Internal Representations
- ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making
- Aligned Textual Scoring Rule
- Alignment of Large Language Models with Constrained Learning
- Angular Steering: Behavior Control via Rotation in Activation Space
- ASNO: An Interpretable Attention-Based Spatio-Temporal Neural Operator for Robust Scientific Machine Learning
- Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
- Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
- Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
- Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization
- BiasGUARRD: Enhancing Fairness and Reliability in LLM Conflict Resolution Through Agentic Debiasing
- Can We Infer Confidential Properties of Training Data from LLMs?
- Capability-Based Scaling Laws for LLM Red-Teaming
- Circuit Discovery Helps To Detect LLM Jailbreaking
- Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models
- Conformal Risk Minimization with Variance Reduction
- Consistency in Language Models: Current Landscape, Challenges, and Future Directions
- Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- Data Shifts Hurt CoT: A Theoretical Study
- Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
- Defending Against Prompt Injection with a Few DefensiveTokens
- DINGO: Constrained Inference for Diffusion LLMs
- Distilling Safe LLM Systems via Soft Prompts
- Do Sparse Autoencoders Generalize? A Case Study of Answerability
- Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
- Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
- Don’t Think Twice! Over-Reasoning Impairs Confidence Calibration
- Doubly Robust Alignment for Large Language Models
- Dynamic Risk Assessments for Offensive Cybersecurity Agents
- Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs
- Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning
- Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
- Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
- Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study
- Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns
- Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
- Extracting memorized pieces of (copyrighted) books from open-weight language models
- Finetuning-Activated Backdoors in LLMs
- Focus on This, Not That! Steering LLMs with Adaptive Feature Specification
- Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning
- From Tasks to Teams: A Risk-First Evaluation Framework for Multi-Agent LLM Systems in Finance
- GenAI Copyright Evidence with Operational Meaning
- GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
- GPT, But Backwards: Exactly Inverting Language Model Outputs
- GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
- Improving Commonsense Reasoning and Reliability in LLMs Through Cognitive-Inspired Prompting Frameworks
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
- In-Context Watermarks for Large Language Models
- Investigating Tool-Memory Conflicts in Tool-Augmented LLMs
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Learning on LLM Output Signatures for Gray-Box Behavior Analysis
- Learning Robust 3D Representation from CLIP via Dual Denoising
- Lifelong Safety Alignment for Language Models
- Lookahead Bias in Pretrained Language Models
- LoRA Merging with SVD: Understanding Interference and Preserving Performance
- MARVEL: Modular Abstention for Reliable and Versatile Expert LLMs
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
- Model Organisms for Emergent Misalignment
- Multi-Modal Medical Image Augmentation for Controlled Heterogeneity and Fair Outcomes
- On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
- On Learning Verifiers for Chain-of-Thought Reasoning
- On the Scoring Functions for RAG-based Conformal Factuality
- One Stone, Two Birds: Enhancing Adversarial Defense Through the Lens of Distributional Discrepancy
- Persuade Me If You Can: Evaluating AI Agent Influence on Safety Monitors
- Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
- Position: Agent-Specific Trustworthiness Risk as a Research Priority
- Position: Membership Inference Attack Should Move On to Distributional Statistics for Distilled Generative Models
- Position: Reasoning LLMs are Wandering Solution Explorers
- Predicting the Performance of Black-box Language Models with Follow-up Queries
- Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction
- RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
- Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
- Reward Shaping to Mitigate Reward Hacking in RLHF
- Robust and Interpretable Relational Reasoning with Large Language Models and Symbolic Solvers
- Robust LLM Fingerprinting via Domain-Specific Watermarks
- RoMa: A Robust Model Watermarking Scheme for Protecting IP in Diffusion Models
- SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
- Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
- Semi-Nonnegative GPT: Towards Monosemantic Representations
- Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries
- SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
- Simple Mechanistic Explanations for Out-Of-Context Reasoning
- SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
- State Space Models: A Naturally Robust Alternative to Transformers in Computer Vision
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
- Steering Language Model Refusal with Sparse Autoencoders
- Steering LLM Reasoning Through Bias-Only Adaptation
- Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
- Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
- The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
- The Geometries of Truth Are Orthogonal Across Tasks
- The Geometry of Forgetting: Analyzing Machine Unlearning through Local Learning Coefficients
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
- The Necessity for Intervention Fidelity: Unintended Side Effects When Steering LLMs
- Thought calibration: Efficient and confident test-time scaling
- Towards Secure Model Sharing with Approximate Fingerprints
- Transferable Visual Adversarial Attacks for Proprietary Multimodal Large Language Models
- Transformers Don't In-Context Learn Least Squares Regression
- TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models
- Uncertainty Quantification for Multimodal Large Language Models
- UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
- Valid Inference with Synthetic Data from Language Models
- Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision
- Visual Instruction Bottleneck Tuning
- Visual Language Models as Zero-Shot Deepfake Detectors
- Watermarking Autoregressive Image Generation
- Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors
- What do Geometric Hallucination Detection Metrics Actually Measure?
- When Meaning Doesn’t Matter: Exposing Guard Model Fragility via Paraphrasing