ICLR 2024 Workshop on Reliable and Responsible Foundation Models

ICLR 2024 R2-FM Workshop

Submission deadline: Feb 11, 2024, 12:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (66)

Fetched from OpenReview (v2) on 2026-06-10.

©Plug-in Authorization for Human Content Copyright Protection in Text-to-Image Model
Chao Zhou, Huishuai Zhang, Jiang Bian, Weiming Zhang, Nenghai Yu · PDF
A StrongREJECT for Empty Jailbreaks
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer · PDF
Actions Speak Louder than Words: Superficial Fairness Alignment in LLMs
Qiyao Wei, Alex James Chan, Lea Goetz, David Watson, Mihaela van der Schaar · PDF
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models
Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, Shu-Tao Xia · PDF
AI-Generated Images Introduce Invisible Relevance Bias to Text-Image Retrieval
Shicheng Xu, Danyang Hou, Liang Pang, Jingcheng Deng, Jun Xu, Huawei Shen, Xueqi Cheng · PDF
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao · PDF
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson · PDF
Augmentation Alone Leads to Generalization
Runtian Zhai, Bingbin Liu, Andrej Risteski, J Zico Kolter, Pradeep Kumar Ravikumar · PDF
AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition
Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj, Huaxiu Yao · PDF
Boosting Jailbreak Attack With Momentum
Yihao Zhang, Zeming Wei · PDF
Can Generative Multimodal Models Count to Ten?
Sunayana Rane, Alexander Ku, Jason Michael Baldridge, Ian Tenney, Thomas L. Griffiths, Been Kim · PDF
Can Large Language Models Achieve Calibration with In-Context Learning?
Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, Ivan Vulić · PDF
Can Large Language Models Reason Robustly with Noisy Rationales?
Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, Bo Han · PDF
Chain-of-Verification Reduces Hallucination in Large Language Models
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason E Weston · PDF
Composing Knowledge and Compression Interventions for Language Models
Arinbjörn Kolbeinsson, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Jayant Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen · PDF
Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation
Ruixin Yang, Dheeraj Rajagopal, Shirley Anugrah Hayati, Bin Hu, Dongyeop Kang · PDF
Dataset MKSL for measuring adequate response performance by knowledge level
NohMyongSung, Cho Ung Hui · PDF
Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models
Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo · PDF
Evaluating Model Bias Requires Characterizing its Mistakes
Isabela Albuquerque, Jessica Schrouff, David Warde-Farley, Ali Taylan Cemgil, Sven Gowal, Olivia Wiles · PDF
Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison · PDF
Explaining latent representations of generative models with large multimodal models
Mengdan Zhu, Zhenke Liu, Bo Pan, Abhinav Angirekula, Liang Zhao · PDF
Explicit Knowledge Factorization Meets In-Context Learning: What Do We Gain?
Sarthak Mittal, Eric Elmoznino, Leo Gagnon, Sangnie Bhardwaj, Dhanya Sridhar, Guillaume Lajoie · PDF
Exploring the Robustness of In-Context Learning with Noisy Labels
Chen Cheng, Xinzhi Yu, Haodong Wen, Jingsong Sun, Guanzhang Yue, Yihao Zhang, Zeming Wei · PDF
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou · PDF
Hijacking Context in Large Multi-modal Models
Joonhyun Jeong · PDF
How many Opinions does your LLM have? Improving Uncertainty Estimation in NLG
Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter · PDF
How to train your VIT for OOD detection
Maximilian Müller, Matthias Hein · PDF
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty
Chen Ling, Xujiang Zhao, Xuchao Zhang, Yanchi Liu, Wei Cheng, Haoyu Wang, Zhengzhang Chen, Mika Oishi, Takao Osaki, Katsushi Matsuda, Liang Zhao, Haifeng Chen · PDF
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation
Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He · PDF
Instruction Tuning for Secure Code Generation
Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev · PDF
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora · PDF
LARGE LANGUAGE MODEL CASCADES WITH MIXTURE OF THOUGHT REPRESENTATIONS FOR COST-EFFICIENT REASONING
Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao · PDF
Large Language Models are Anonymizers
Robin Staab, Mark Vero, Mislav Balunovic, Martin Vechev · PDF
Mapping Social Choice Theory to RLHF
Jessica Dai, Eve Fleisig · PDF
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger · PDF
Memorization and Privacy Risks in Domain-Specific Large Language Models
Xinyu Yang, Zichen Wen, Wenjie Qu, Zhaorun Chen, Zhiying Xiang, Beidi Chen, Huaxiu Yao · PDF
On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models
Ken Liu, Zhoujie Ding, Berivan Isik, Sanmi Koyejo · PDF
Personalized Language Modeling from Personalized Human Feedback
Xinyu Li, Zachary Chase Lipton, Liu Leqi · PDF
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner · PDF
Preventing Memorized Completions through White-Box Filtering
Oam Patel, Rowan Wang · PDF
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore A Papineni, Sanjiv Kumar, Andrej Risteski · PDF
Prompting for Robustness: Extracting Robust Classifiers from Foundation Models
Amrith Setlur, Saurabh Garg, Virginia Smith, Sergey Levine · PDF
ProTransformer: Robustify Transformers via Plug-and-Play Paradigm
Zhichao Hou, Weizhi Gao, Yuchen Shen, Xiaorui Liu · PDF
Questioning the Survey Responses of Large Language Models
Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner · PDF
RAMBLA: A FRAMEWORK FOR EVALUATING THE RELIABILITY OF LLMS AS ASSISTANTS IN THE BIOMEDICAL DOMAIN
William James Bolton, Rafael Poyiadzi, Edward Morrell, Gabriela van Bergen Gonzalez Bueno, Lea Goetz · PDF
Re-Ex: Revising after Explanation reduces the Factual Errors in LLM Responses
Juyeon Kim, Jeongeun Lee, YoonHo Chang, CHANYEOL CHOI, Jun-Seong Kim, Jy-yong Sohn · PDF
Robust CLIP: Unsupervised Adversarial Fine-tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein · PDF
Scaling Compute Is Not All You Need for Adversarial Robustness
Edoardo Debenedetti, Zishen Wan, Maksym Andriushchenko, Vikash Sehwag, Kshitij Bhardwaj, Bhavya Kailkhura · PDF
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
Ailin Deng, Zhirui Chen, Bryan Hooi · PDF
Self-Alignment of Large Language Models via Social Scene Simulation
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen · PDF
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu · PDF
Setting the Record Straight on Transformer Oversmoothing
Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, Matt Kusner · PDF
Skip \n: A simple method to reduce hallucination in Large Vision-Language Models
Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou · PDF
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu · PDF
texplain: Post-hoc Textual Explanation of Image Classifiers with Pre-trained Language Models
Saeid Asgari, Aliasghar Khani, Amir Hosein Khasahmadi, Aditya Sanghi, Karl D.D. Willis, Ali Mahdavi Amiri · PDF
THE BIAS OF HARMFUL LABEL ASSOCIATIONS IN VISION-LANGUAGE MODELS
Caner Hazirbas, Alicia Yi Sun, Yonathan Efroni, Mark Ibrahim · PDF
Towards Logically Consistent Language Models via Probabilistic Reasoning
Diego Calanzone, Antonio Vergari, Stefano Teso · PDF
Towards Personalized AI: Early-stopping Low-Rank Adaptation of Foundation Models
Zihao Luo, Di Wang, Yun Sing Koh, Jingfeng Zhang · PDF
Unified Hallucination Detection for Multimodal Large Language Models
Xiang Chen, Chenxi Wang, Ningyu Zhang, Yida Xue, xiaoyan yang, Yue Shen, Jinjie GU, Huajun Chen · PDF
Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation
Zhengyue Zhao, Jinhao Duan, Xing Hu, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen · PDF
Unsolvable Problem Detection for Vision Language Models
Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa · PDF
Value Augmented Sampling: Predict Your Rewards To Align Language Models
Seungwook Han, Idan Shenfeld, Akash Srivastava, Yoon Kim, Pulkit Agrawal · PDF
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Katie Matton, Robert Ness, Emre Kiciman · PDF
Watermark Stealing in Large Language Models
Nikola Jovanović, Robin Staab, Martin Vechev · PDF
WAVES: Benchmarking the Robustness of Image Watermarks
Mucong Ding, Tahseen Rabbani, Bang An, Aakriti Agrawal, Yuancheng Xu, Chenghao Deng, Sicheng Zhu, Abdirisak Mohamed, Yuxin Wen, Tom Goldstein, Furong Huang · PDF
WorldBench: Quantifying Geographic Disparities in LLM Factual Recall
Mazda Moayeri, Soheil Feizi · PDF
