ICML 2024 · Past · Safety & alignment

ICML 2024 Workshop on Models of Human Feedback for AI Alignment

ICML 2024 Workshop MHFAIA


Submission deadline: Jun 1, 2024, 18:00 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (60)

Fetched from OpenReview (v2) on 2026-06-10.

"You just can’t go around killing people" Explaining Agent Behavior to a Human Terminator
Uri Menkes, Ofra Amir, Assaf Hallak · PDF
A Theoretical Framework for Partially Observed Reward-States in RLHF
Chinmaya Kausik, Mirco Mutti, Aldo Pacchiano, Ambuj Tewari · PDF
Accelerating Best-of-N via Speculative Rejection
Ruiqi Zhang, Momin Haider, Ming Yin, Jiahao Qiu, Mengdi Wang, Peter Bartlett, Andrea Zanette · PDF
Adversarial Multi-dueling Bandits
Pratik Gajane · PDF
AI Alignment with Changing and Influenceable Reward Functions
Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan · PDF
Aligning Crowd Feedback via Distributional Preference Reward Modeling
Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu · PDF
Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang · PDF
AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning
Paul Nitschke, Lars Lien Ankile, Eura Nofshin, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan · PDF
Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham · PDF
Bootstrapping Language Models with DPO Implicit Rewards
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin · PDF
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover · PDF
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries
Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Eyke Hüllermeier, Paul Weng, Yifei Zhu · PDF
Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels
Zhuorui Ye, Stephanie Milani, Fei Fang, Geoffrey J. Gordon · PDF
Cross-Domain Knowledge Transfer for RL via Preference Consistency
Ting-Hsuan Huang, Ping-Chun Hsieh · PDF
Distributional Preference Alignment of LLMs via Optimal Transport
Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jarret Ross · PDF
DPM: Dual Preferences-based Multi-Agent Reinforcement Learning
Sehyeok Kang, Yongsik Lee, Se-Young Yun · PDF
DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang · PDF
Efficient Inverse Reinforcement Learning without Compounding Errors
Nicolas Espinosa Dice, Gokul Swamy, Sanjiban Choudhury, Wen Sun · PDF
Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
Yangfan He, Yuxuan Bai, Tianyu Shi · PDF
Filtered Direct Preference Optimization
Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu · PDF
Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents
David Hyland, Tomáš Gavenčiak, Lancelot Da Costa, Conor Heins, Vojtech Kovarik, Julian Gutierrez, Michael J. Wooldridge, Jan Kulveit · PDF
Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints
Haoyuan Sun, Yuxin Zheng, Yifei Zhao, Yongzhe Chang, Xueqian Wang · PDF
Hummer: Towards Limited Competitive Preference Dataset
Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng · PDF
Informed Meta-Learning
Kasia Kobalczyk, Mihaela van der Schaar · PDF
Inverse Reinforcement Learning from Demonstrations for LLM Alignment
Hao Sun, Mihaela van der Schaar · PDF
Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment
Amin Memarian, Touraj Laleh, Irina Rish, Ardavan S. Nobandegani · PDF
Is poisoning a real threat to LLM alignment? Maybe more so than you think
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang · PDF
Language Alignment via Nash-learning and Adaptive feedback
Ari Azarafrooz, Farshid Faal · PDF
Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception
Xuanzhou Chen, Austin Xu, Jingyan Wang, Ashwin Pananjady · PDF
Learning to Assist Humans without Inferring Rewards
Vivek Myers, Evan Ellis, Benjamin Eysenbach, Sergey Levine, Anca Dragan · PDF
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Bedi, Mengdi Wang · PDF
Modeling the Plurality of Human Preferences via Ideal Points
Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak · PDF
Models That Prove Their Own Correctness
Noga Amit, Shafi Goldwasser, Orr Paradise, Guy N. Rothblum · PDF
Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
Jingwu Tang, Gokul Swamy, Fei Fang, Steven Wu · PDF
MultiScale Policy Learning for Alignment with Long Term Objectives
Richa Rastogi, Yuta Saito, Thorsten Joachims · PDF
New Desiderata for Direct Preference Optimization
Xiangkun Hu, Tong He, David Wipf · PDF
Off-Policy Evaluation from Logged Human Feedback
Aniruddha Bhargava, Lalit K Jain, Branislav Kveton, Ge Liu, Subhojyoti Mukherjee · PDF
Optimal Design for Human Feedback
Subhojyoti Mukherjee, Anusha Lalitha, Kousha Kalantari, Aniket Anand Deshmukh, Ge Liu, Yifei Ma, Branislav Kveton · PDF
Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback
Zhirui Chen, Vincent Y. F. Tan · PDF
PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling
Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi · PDF
Preference Elicitation for Offline Reinforcement Learning
Alizée Pace, Bernhard Schölkopf, Gunnar Ratsch, Giorgia Ramponi · PDF
Preference Learning Algorithms Do Not Learn Preference Rankings
Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho · PDF
Prompt Optimization with Human Feedback
Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, Bryan Kian Hsiang Low · PDF
Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias
Yi Chen, Ramya Korlakai Vinayak · PDF
REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao, Jonathan Daniel Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun · PDF
Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe · PDF
Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input
Belen Martin Urcelay, Andreas Krause, Giorgia Ramponi · PDF
Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences
Taku Yamagata, Tobias Oberkofler, Timo Kaufmann, Viktor Bengs, Eyke Hüllermeier, Raul Santos-Rodriguez · PDF
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami · PDF
Revisiting Successor Features for Inverse Reinforcement Learning
Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother, Irina Rish, Glen Berseth, Sanjiban Choudhury · PDF
RLHF and IIA: Perverse Incentives
Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy · PDF
Scalable Oversight by Accounting for Unreliable Feedback
Shivam Singhal, Cassidy Laidlaw, Anca Dragan · PDF
Scalably Solving Assistance Games
Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan · PDF
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, W. Bradley Knox, Chelsea Finn, Scott Niekum · PDF
Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
Haoyu Wang, Guozheng Ma, Ziqiao Meng, Zeyu Qin, Li Shen, Zhong Zhang, Bingzhe Wu, Liu Liu, Yatao Bian, Tingyang Xu, Xueqian Wang, Peilin Zhao · PDF
Stochastic Concept Bottleneck Models
Moritz Vandenhirtz, Sonia Laguna, Ričards Marcinkevičs, Julia E Vogt · PDF
Towards Aligning Language Models with Textual Feedback
Saüc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan · PDF
Towards Safe Large Language Models for Medicine
Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju · PDF
Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback
Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu · PDF
Weak-to-Strong Extrapolation Expedites Alignment
Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng · PDF
