ICML 2024 Workshop on Models of Human Feedback for AI Alignment

ICML 2024 Workshop MHFAIA

Submission deadline: Jun 1, 2024, 18:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (60)

Fetched from OpenReview (v2) on 2026-06-10.

  1. "You just can’t go around killing people" Explaining Agent Behavior to a Human Terminator

    Uri Menkes, Ofra Amir, Assaf Hallak · PDF
  2. A Theoretical Framework for Partially Observed Reward-States in RLHF

    Chinmaya Kausik, Mirco Mutti, Aldo Pacchiano, Ambuj Tewari · PDF
  3. Accelerating Best-of-N via Speculative Rejection

    Ruiqi Zhang, Momin Haider, Ming Yin, Jiahao Qiu, Mengdi Wang, Peter Bartlett, Andrea Zanette · PDF
  4. Adversarial Multi-dueling Bandits

    Pratik Gajane · PDF
  5. AI Alignment with Changing and Influenceable Reward Functions

    Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan · PDF
  6. Aligning Crowd Feedback via Distributional Preference Reward Modeling

    Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu · PDF
  7. Aligning Large Language Models with Representation Editing: A Control Perspective

    Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang · PDF
  8. AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning

    Paul Nitschke, Lars Lien Ankile, Eura Nofshin, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan · PDF
  9. Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation

    Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham · PDF
  10. Bootstrapping Language Models with DPO Implicit Rewards

    Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin · PDF
  11. Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

    Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover · PDF
  12. Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries

    Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Eyke Hüllermeier, Paul Weng, Yifei Zhu · PDF
  13. Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels

    Zhuorui Ye, Stephanie Milani, Fei Fang, Geoffrey J. Gordon · PDF
  14. Cross-Domain Knowledge Transfer for RL via Preference Consistency

    Ting-Hsuan Huang, Ping-Chun Hsieh · PDF
  15. Distributional Preference Alignment of LLMs via Optimal Transport

    Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jarret Ross · PDF
  16. DPM: Dual Preferences-based Multi-Agent Reinforcement Learning

    Sehyeok Kang, Yongsik Lee, Se-Young Yun · PDF
  17. DPO Meets PPO: Reinforced Token Optimization for RLHF

    Han Zhong, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang · PDF
  18. Efficient Inverse Reinforcement Learning without Compounding Errors

    Nicolas Espinosa Dice, Gokul Swamy, Sanjiban Choudhury, Wen Sun · PDF
  19. Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy

    Yangfan He, Yuxuan Bai, Tianyu Shi · PDF
  20. Filtered Direct Preference Optimization

    Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu · PDF
  21. Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents

    David Hyland, Tomáš Gavenčiak, Lancelot Da Costa, Conor Heins, Vojtech Kovarik, Julian Gutierrez, Michael J. Wooldridge, Jan Kulveit · PDF
  22. Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints

    Haoyuan Sun, Yuxin Zheng, Yifei Zhao, Yongzhe Chang, Xueqian Wang · PDF
  23. Hummer: Towards Limited Competitive Preference Dataset

    Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng · PDF
  24. Informed Meta-Learning

    Kasia Kobalczyk, Mihaela van der Schaar · PDF
  25. Inverse Reinforcement Learning from Demonstrations for LLM Alignment

    Hao Sun, Mihaela van der Schaar · PDF
  26. Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment

    Amin Memarian, Touraj Laleh, Irina Rish, Ardavan S. Nobandegani · PDF
  27. Is poisoning a real threat to LLM alignment? Maybe more so than you think

    Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang · PDF
  28. Language Alignment via Nash-learning and Adaptive feedback

    Ari Azarafrooz, Farshid Faal · PDF
  29. Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception

    Xuanzhou Chen, Austin Xu, Jingyan Wang, Ashwin Pananjady · PDF
  30. Learning to Assist Humans without Inferring Rewards

    Vivek Myers, Evan Ellis, Benjamin Eysenbach, Sergey Levine, Anca Dragan · PDF
  31. MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

    Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Bedi, Mengdi Wang · PDF
  32. Modeling the Plurality of Human Preferences via Ideal Points

    Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak · PDF
  33. Models That Prove Their Own Correctness

    Noga Amit, Shafi Goldwasser, Orr Paradise, Guy N. Rothblum · PDF
  34. Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

    Jingwu Tang, Gokul Swamy, Fei Fang, Steven Wu · PDF
  35. MultiScale Policy Learning for Alignment with Long Term Objectives

    Richa Rastogi, Yuta Saito, Thorsten Joachims · PDF
  36. New Desiderata for Direct Preference Optimization

    Xiangkun Hu, Tong He, David Wipf · PDF
  37. Off-Policy Evaluation from Logged Human Feedback

    Aniruddha Bhargava, Lalit K Jain, Branislav Kveton, Ge Liu, Subhojyoti Mukherjee · PDF
  38. Optimal Design for Human Feedback

    Subhojyoti Mukherjee, Anusha Lalitha, Kousha Kalantari, Aniket Anand Deshmukh, Ge Liu, Yifei Ma, Branislav Kveton · PDF
  39. Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback

    Zhirui Chen, Vincent Y. F. Tan · PDF
  40. PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

    Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi · PDF
  41. Preference Elicitation for Offline Reinforcement Learning

    Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi · PDF
  42. Preference Learning Algorithms Do Not Learn Preference Rankings

    Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho · PDF
  43. Prompt Optimization with Human Feedback

    Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, Bryan Kian Hsiang Low · PDF
  44. Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias

    Yi Chen, Ramya Korlakai Vinayak · PDF
  45. REBEL: Reinforcement Learning via Regressing Relative Rewards

    Zhaolin Gao, Jonathan Daniel Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun · PDF
  46. Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

    Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe · PDF
  47. Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input

    Belen Martin Urcelay, Andreas Krause, Giorgia Ramponi · PDF
  48. Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences

    Taku Yamagata, Tobias Oberkofler, Timo Kaufmann, Viktor Bengs, Eyke Hüllermeier, Raul Santos-Rodriguez · PDF
  49. Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

    Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami · PDF
  50. Revisiting Successor Features for Inverse Reinforcement Learning

    Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother, Irina Rish, Glen Berseth, Sanjiban Choudhury · PDF
  51. RLHF and IIA: Perverse Incentives

    Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy · PDF
  52. Scalable Oversight by Accounting for Unreliable Feedback

    Shivam Singhal, Cassidy Laidlaw, Anca Dragan · PDF
  53. Scalably Solving Assistance Games

    Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan · PDF
  54. Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, W. Bradley Knox, Chelsea Finn, Scott Niekum · PDF
  55. Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping

    Haoyu Wang, Guozheng Ma, Ziqiao Meng, Zeyu Qin, Li Shen, Zhong Zhang, Bingzhe Wu, Liu Liu, Yatao Bian, Tingyang Xu, Xueqian Wang, Peilin Zhao · PDF
  56. Stochastic Concept Bottleneck Models

    Moritz Vandenhirtz, Sonia Laguna, Ričards Marcinkevičs, Julia E Vogt · PDF
  57. Towards Aligning Language Models with Textual Feedback

    Saüc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan · PDF
  58. Towards Safe Large Language Models for Medicine

    Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju · PDF
  59. Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback

    Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu · PDF
  60. Weak-to-Strong Extrapolation Expedites Alignment

    Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng · PDF