ICML 2025 · Past · Safety & alignment

2nd Workshop on Models of Human Feedback for AI Alignment

MoFA

Submission deadline: May 28, 2025, 13:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (68)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Unified Perspective on Reward Distillation Through Ratio Matching

    Kenan Hasanaliyev, Schwinn Saereesitthipitak, Rohan Sanda · PDF
  2. ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

    Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low · PDF
  3. Advancing LLM Safe Alignment with Safety Representation Ranking

    Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, Yisen Wang · PDF
  4. Aggregated Individual Reporting for Post-Deployment Evaluation

    Jessica Dai, Inioluwa Deborah Raji, Benjamin Recht, Irene Y. Chen · PDF
  5. Aligned Textual Scoring Rule

    Yuxuan Lu, Yifan Wu, Jason Hartline, Michael Curry · PDF
  6. Aligning Neural Style Representations for Style-based Clustering

    Abhishek Dangeti, Pavan Gajula, Vikram Jamwal, Vivek Srivastava · PDF
  7. Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

    Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun · PDF
  8. Alignment of Large Language Models with Constrained Learning

    Botong Zhang, Shuo Li, Ignacio Hounie, Osbert Bastani, Dongsheng Ding, Alejandro Ribeiro · PDF
  9. Angular Steering: Behavior Control via Rotation in Activation Space

    Hieu M. Vu, Tan Minh Nguyen · PDF
  10. Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering

    Chien-Hua Chen, Chang Chih Meng, Li-Ni Fu, Hen-Hsen Huang, I-Chen Wu · PDF
  11. BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Human Annotations and Rationale Indicators

    KMA Solaiman · PDF
  12. Composition and Alignment of Diffusion Models using Constrained Learning

    Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro · PDF
  13. Configurable Preference Tuning with Rubric-Guided Synthetic Data

    Victor Gallego · PDF
  14. Copilot Arena: A Platform for Code LLM Evaluation in the Wild

    Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar · PDF
  15. CUDA: Capturing Uncertainty and Diversity in Preference Feedback Augmentation

    Sehyeok Kang, Jaewook Jeong, Se-Young Yun · PDF
  16. Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset

    Lily H Zhang, Smitha Milli, Karen Long Jusko, Jonathan Smith, Brandon Amos, Wassim Bouaziz, Jack Kussman, Manon Revel, Lisa Titus, Bhaktipriya Radharapu, Jane Yu, Vidya Sarma, Kristopher Rose, Maximilian Nickel · PDF
  17. CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

    Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stanczak, Aishwarya Agrawal · PDF
  18. Deep Context-Dependent Choice Model

    Shuhan Zhang, Zhi Wang, Rui Gao, Shuang Li · PDF
  19. Do Language Models Understand Discrimination? Testing Alignment with Human Legal Reasoning under the ECHR

    Tatiana Botskina · PDF
  20. Doctor Approved: Generating Medically Accurate Skin Disease Images through AI–Expert Feedback

    Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm · PDF
  21. Doubly Robust Alignment for Large Language Models

    Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi · PDF
  22. Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

    Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick · PDF
  23. Dynamic Guardian Models: Realtime Content Moderation With User-Defined Policies

    Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph James Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein · PDF
  24. EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

    Sara Fish, Julia Shephard, Minkai Li, Ran I Shorrer, Yannai A. Gonczarowski · PDF
  25. Efficient Generative Models Personalization via Optimal Experimental Design

    Guy Schacht, Mojmir Mutny, Riccardo De Santi, Ziyad Sheebaelhamd, Andreas Krause · PDF
  26. Empirical Studies on the Limitations of Direct Preference Optimization, and a Possible Quick Fix

    Jiarui Yao, Yong Lin, Tong Zhang · PDF
  27. Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

    Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques · PDF
  28. Entropy Controllable Direct Preference Optimization

    Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka · PDF
  29. Expected Reward Prediction, with Applications to Model Routing

    Kenan Hasanaliyev, Silas Alberti, Jenny Hamer, Dheeraj Rajagopal, Kevin Robinson, Jasper Snoek, Victor Veitch, Alexander Nicholas D'Amour · PDF
  30. Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes

    Kasia Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar · PDF
  31. Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

    Matteo Gallici, Haitz Sáez de Ocáriz Borde · PDF
  32. FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users

    Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn · PDF
  33. Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value

    Ryan Lowe, Joe Edelman, Tan Zhi-Xuan, Oliver Klingefjord, Ellie Hain, Vincent Wang, Atrisha Sarkar, Michiel A. Bakker, Fazl Barez, Matija Franklin, Andreas Haupt, Jobst Heitzig, Wesley H. Holliday, Julian Jara-Ettinger, Atoosa Kasirzadeh, Ryan Othniel Kearns, James Ravi Kirkpatrick, Andrew Koh, Joel Lehman, Sydney Levine, Manon Revel, Ivan Vendrov · PDF
  34. Geometry-Aware Preference Learning for 3D Texture Generation

    AmirHossein Zamani, Tianhao Xie, Amir Aghdam, Tiberiu Popa, Eugene Belilovsky · PDF
  35. Human Feedback Guided Reinforcement Learning for Unknown Temporal Tasks via Weighted Finite Automata

    Nathaniel Smith, Nicholas Hirsch, Yu Wang · PDF
  36. Implicit User Feedback in Human-LLM Dialogues: Informative to Understand Users yet Noisy as a Learning Signal

    Yuhan Liu, Michael JQ Zhang, Eunsol Choi · PDF
  37. Improvement-Guided Iterative DPO for Diffusion Models

    Ying Fan, Fei Deng, Yang Zhao, Sahil Singla, Rahul Jain, Tingbo Hou, Kangwook Lee, Feng Yang, Deepak Ramachandran, Qifei Wang · PDF
  38. In-Context Alignment at Scale: When More is Less

    Neelabh Madan, Lakshmi Subramanian · PDF
  39. In-Context Personalized Alignment with Feedback History under Counterfactual Evaluation

    Xisen Jin, Zheng Li, Zhenwei Dai, Hui Liu, Xianfeng Tang, Chen Luo, Rahul Goutam, Xiang Ren, Qi He · PDF
  40. Inference-Time Reward Hacking in Large Language Models

    Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio Calmon · PDF
  41. KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

    Lennie Wells, Edward James Young, Jason Ross Brown, Sergio Bacallado · PDF
  42. Language Model Personalization via Reward Factorization

    Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano · PDF
  43. Learning interpretable descriptions of human preferences

    Rajiv Movva, Emma Pierson · PDF
  44. LoRe: Personalizing LLMs via Low-Rank Reward Modeling

    Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel · PDF
  45. Mechanism Design for Alignment via Human Feedback

    Julian Manyika, Michael J. Wooldridge, Jiarui Gan · PDF
  46. Mimicking Human Intuition: Cognitive Belief-Driven Reinforcement Learning

    Xingrui Gu, Guanren Qiao, Chuyi Jiang · PDF
  47. Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

    Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun · PDF
  48. Multi-Task Reward Learning from Human Ratings

    Mingkang Wu, Devin White, Evelyn Rose, Vernon Lawhern, Nicholas R Waytowich, Yongcan Cao · PDF
  49. On the strength of Goodhart's law

    Adrien Majka, Wassim Bouaziz, El-Mahdi El-Mhamdi · PDF
  50. Online Learning and Equilibrium Computation with Ranking Feedback

    Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman E. Ozdaglar, Kaiqing Zhang · PDF
  51. Playing the Data: Video Games as a Tool to Annotate and Train Models on Large Datasets

    Parham Ghasemloo Gheidari, Kai-Hsiang Chang, Roman Sarrazin-Gendron, Renata Mutalova, Alexander Butyaev, Attila Szantner, Jérôme Waldispühl · PDF
  52. Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs

    Emilio Barkett, Olivia Long, Madhavendra Thakur · PDF
  53. ReDit: Reward Dithering for Improved LLM Policy Optimization

    Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu · PDF
  54. Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting

    Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu, Vasu Sharma, Kevin Zhu · PDF
  55. Robust Multi-Objective Controlled Decoding of Large Language Models

    Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic · PDF
  56. Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

    Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi · PDF
  57. Robust Reward Modeling via Causal Rubrics

    Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup · PDF
  58. Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

    Taiye Chen, Zeming Wei, Ang Li, Yisen Wang · PDF
  59. Selective Preference Aggregation

    Shreyas Kadekodi, Hayden McTavish, Berk Ustun · PDF
  60. Self-Concordant Preference Learning from Noisy Labels

    Shiv Shankar, Madalina Fiterau · PDF
  61. The Strong, weak and benign Goodhart’s law. An independence-free and paradigm-agnostic formalisation

    Adrien Majka, El-Mahdi El-Mhamdi · PDF
  62. Theoretical Analysis of KL-regularized RLHF with Multiple Reference Models

    Gholamali Aminian, Amir R. Asadi, Idan Shenfeld, Youssef Mroueh · PDF
  63. Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

    Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Tong Zhang, Quanquan Gu · PDF
  64. Tracing Human-like Traits in LLMs: Origins, Real-World Manifestation, and Controllability

    Pengrui Han, Rafal Dariusz Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez · PDF
  65. Unanchoring the Mind: DAE-Guided Counterfactual Reasoning for Rare Disease Diagnosis

    Yuting Yan, Yinghao Fu, Wendi Ren, Shuang Li · PDF
  66. Understanding Likelihood Over-optimisation in Direct Alignment Algorithms

    Zhengyan Shi, Sander Land, Acyr Locatelli, Matthieu Geist, Max Bartolo · PDF
  67. Vertical Moral Growth: A Novel Developmental Framework for Human Feedback Quality in AI Alignment

    Taichiro Endo · PDF
  68. What Matters when Modeling Human Behavior using Imitation Learning?

    Aneri Muni, Esther Derman, Vincent Taboga, Pierre-Luc Bacon, Erick Delage · PDF