ICLR 2025 · Past · Safety & alignment

ICLR 2025 Workshop on Bidirectional Human-AI Alignment

ICLR 2025 Bi-Align Workshop

Submission deadline: Feb 16, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (71)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Benchmark for Scalable Oversight Mechanisms

    Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery · PDF
  2. A Pilot Study of Weak-to-Strong Generalization in Safety, Toxicity, and Legal Reasoning

    Ruimeng Ye, Yang Xiao, Bo Hui · PDF
  3. A Roadmap for Human-Agent Moral Alignment: Integrating Pre-defined Intrinsic Rewards and Learned Reward Models

    Elizaveta Tennant, Stephen Hailes, Mirco Musolesi · PDF
  4. A Sociotechnical Perspective on Aligning AI with Pluralistic Human Values

    Dalia Ali, Aysenur Kocak, Dora Zhao, Allison Koenecke, Orestis Papakyriakopoulos · PDF
  5. Active Human Feedback Collection via Neural Contextual Dueling Bandits

    Arun Verma, Xiaoqiang Lin, Zhongxiang Dai, Daniela Rus, Bryan Kian Hsiang Low · PDF
  6. Addressing and Visualizing Misalignments in Human Task-Solving Trajectories

    Sejin Kim, Hosung Lee, Sundong Kim · PDF
  7. AI Systematically Rewires the Flow of Ideas

    Zhonghao He, Tianyi Qiu, Tao Lin, Moshe Glickman, Atoosa Kasirzadeh, John Wihbey, Max Kleiman-Weiner · PDF
  8. AI-enhanced semantic feature norms for 786 concepts

    Siddharth Suresh, Kushin Mukherjee, Tyler Giallanza, Xizheng Yu, Mia Patil, Jonathan D. Cohen, Timothy T. Rogers · PDF
  9. Aligning LLMs with Domain Invariant Reward Models

    David Wu, Sanjiban Choudhury · PDF
  10. Augmenting Image Annotation: A Human–LMM Collaborative Framework for Efficient Object Selection and Label Generation

    HE ZHANG, Xinyi Fu, John Millar Carroll · PDF
  11. Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

    Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu · PDF
  12. Bidirectional Alignment for Inclusive Narrative Generation

    Ken Kawamura · PDF
  13. Broaden your SCOPE! Efficient Conversation Planning for LLMs using Semantic Space

    Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, Bryan Kian Hsiang Low · PDF
  14. Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

    Jiawei Huang, Bingcong Li, Christoph Dann, Niao He · PDF
  15. Cooperative Agency-Centered LLMs

    Iyadunni J. Adenuga · PDF
  16. CoPL: Collaborative Preference Learning for Personalizing LLMs

    Youngbin Choi, Seunghyuk Cho, Minjong Lee, MoonJeong Park, Yesong Ko, Jungseul Ok, Dongwoo Kim · PDF
  17. CTRL-Rec: Controlling Recommender Systems With Natural Language

    Micah Carroll, Adeline Foote, Marcus Williams, Anca Dragan, W. Bradley Knox, Smitha Milli · PDF
  18. Data-adaptive Safety Rules for Training Reward Models

    Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li · PDF
  19. Decision Preference Alignment for Large-Scale Agents Based on Reward Model Generation

    Zheng Jiaoling, Xu Weifeng, Luo Qian, Dang Wanli, Geng Long, Gao Guokang, Ren Yulin, Fan Xingyu · PDF
  20. Drift: Efficient Implicit Personalization of Large Language Models

    Minbeom Kim, Kang-il Lee, Seongho Joo, Hwaran Lee, Kyomin Jung · PDF
  21. Envision Human-AI Perceptual Alignment from a Multimodal Interaction Perspective

    Shu Zhong, Marianna Obrist · PDF
  22. Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment

    Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Alice Oh, Meeyoung Cha · PDF
  23. From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions

    Ruben Weijers, Denton Wu, Hannah Betts, Tamara Jacod, Yuxiang Guan, Vidya Sujaya, Kushal Dev, Toshali Goel, William Delooze, Reihaneh Rabbany, Ying Wu, Jean-François Godbout, Kellin Pelrine · PDF
  24. Human Alignment: How Much We Adapt to LLMs?

    Cazalets Tanguy, Ruben Janssens, Tony Belpaeme, Joni Dambre · PDF
  25. Inference-time Alignment in Continuous Space

    Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng · PDF
  26. InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models with Human Feedback

    Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou · PDF
  27. Learning From Diverse Experts: Behavior Alignment Through Multi-Objective Inverse Reinforcement Learning

    Jun-Jie Yang, Qian-You Zhang, Chia-Heng Hsu, Xi Liu, Ping-Chun Hsieh · PDF
  28. Mitigating Societal Cognitive Overload in the Age of AI: Challenges and Directions

    Salem Lahlou · PDF
  29. Monitoring LLM Agents for Sequentially Contextual Harm

    Chen Yueh-Han, Nitish Joshi, Yulin Chen, He He, Rico Angell · PDF
  30. Moral Alignment for LLM Agents

    Elizaveta Tennant, Stephen Hailes, Mirco Musolesi · PDF
  31. Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds

    Edward Chen, Sang T. Truong, Natalie Dullerud, Sanmi Koyejo, Carlos Guestrin · PDF
  32. Negotiative Alignment: An interactive approach to human-AI co-adaptation for clinical applications

    Florence Xini Doo, Nikhil Shah, Pranav Kulkarni, Vishwa Sanjay Parekh, Heng Huang · PDF
  33. Observability of Latent States in Generative AI Models

    Tian Yu Liu, Stefano Soatto, Matteo Marchi, Pratik Chaudhari, Paulo Tabuada · PDF
  34. Online Learning with Ranking Feedback and An Application to Equilibrium Computation

    Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman E. Ozdaglar, Kaiqing Zhang · PDF
  35. Order Independence With Finetuning

    Katrina Brown, Reid McIlroy-Young · PDF
  36. Outlier-Aware Preference Optimization for Large Language Models

    Pragya Srivastava, Sai Soumya Nalli, Amit Deshpande, Amit Sharma · PDF
  37. PARSE-Ego4D: Toward Bidirectionally Aligned Action Recommendations for Egocentric Videos

    Steven Abreu, Tiffany D Do, Karan Ahuja, Eric J Gonzalez, Lee Payne, Daniel McDuff, Mar Gonzalez-Franco · PDF
  38. Patterns and Mechanisms of Contrastive Activation Engineering

    Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali · PDF
  39. PILAF: Optimal Human Preference Sampling for Reward Modeling

    Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan · PDF
  40. Policy Prototyping for LLMs: Pluralistic Alignment via Interactive and Collaborative Policymaking

    Kevin Feng, Inyoung Cheong, Quan Ze Chen, Amy X Zhang · PDF
  41. Position: Interpretability is a Bidirectional Communication Problem

    Kola Ayonrinde · PDF
  42. Preference Optimization for Concept Bottleneck Models

    Emiliano Penaloza, Tianyue H. Zhang, Laurent Charlin, Mateo Espinosa Zarlenga · PDF
  43. Preference-Based Alignment of Discrete Diffusion Models

    Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel · PDF
  44. Probing Mechanical Reasoning in Large Vision Language Models

    Haoran Sun, Yijiang Li, Qingying Gao, Haiyun Lyu, Dezhi Luo, Hokin Deng · PDF
  45. Processing, Priming, Probing: Human Interventions for Explainability Alignment

    Kenza Amara · PDF
  46. Representational Alignment Supports Effective Teaching

    Ilia Sucholutsky, Katherine M. Collins, Maya Malaviya, Nori Jacoby, Weiyang Liu, Theodore Sumers, Michalis Korakakis, Umang Bhatt, Mark K Ho, Joshua B. Tenenbaum, Bradley C. Love, Zachary Pardos, Adrian Weller, Thomas L. Griffiths · PDF
  47. Representational Difference Clustering

    Neehar Kondapaneni, Emily Gu, Oisin Mac Aodha, Pietro Perona · PDF
  48. Rethinking AI Cultural Alignment

    Michal Bravansky, Filip Trhlík, Fazl Barez · PDF
  49. Rethinking Anti-Misinformation AI

    Vidya Sujaya, Kellin Pelrine, Andreea Musulan, Reihaneh Rabbany · PDF
  50. SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

    Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran · PDF
  51. Scalably Solving Assistance Games

    Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan · PDF
  52. Shared Similarity Between Humans and Chatbots: Exploring Human Willingness to Seek Social Support From Chatbots

    Zicheng Zhu, Tianqi Song, Jefferson Lim, Chi-Lan Yang, Yi-Chieh Lee · PDF
  53. Societal Alignment Frameworks Can Improve LLM Alignment

    Karolina Stanczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, Jeremy Barnes, Jason Stanley, Jessica Montgomery, Richard Zemel, Nicolas Papernot, Nicolas Chapados, Denis Therien, Timothy P Lillicrap, Ana Marasovic, Sylvie Delacroix, Gillian K Hadfield, Siva Reddy · PDF
  54. Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

    Judy Hanwen Shen, Carlos Guestrin · PDF
  55. Superalignment with Dynamic Human Values

    Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek · PDF
  56. SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment

    Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan · PDF
  57. Sycophancy Claims about Language Models: The Missing Human-in-the-Loop

    Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci · PDF
  58. Symmetry-Breaking Augmentations for Ad Hoc Teamwork

    Ravi Hammond, Dustin Craggs, Mingyu Guo, Jakob Nicolaus Foerster, Ian Reid · PDF
  59. The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics

    Tarun Raheja, Nilay Pochhi · PDF
  60. The Human Visual System Can Inspire New Interaction Paradigms for LLMs

    Diana Robinson, Neil D Lawrence · PDF
  61. The Lock-in Hypothesis: Stagnation by Algorithm

    Tianyi Qiu, Zhonghao He, Tejasveer Chugh, Max Kleiman-Weiner · PDF
  62. Towards LVLM-Aided Alignment of Task-Specific Vision Models

    Alexander Koebler, Christian Greisinger, Jan Paulus, Ingo Thon, Florian Buettner · PDF
  63. TraCeS: Trajectory Based Credit Assignment From Sparse Safety Feedback

    Siow Meng Low, Akshat Kumar · PDF
  64. TRIG-Bench: A Benchmark for Text-Rich Image Grounding

    Ming Li, Ruiyi Zhang, Jian Chen, Tianyi Zhou · PDF
  65. Trustworthy AI Must Account for Intersectionality

    Jesse C. Cresswell · PDF
  66. Understanding (Un)Reliability of Steering Vectors in Language Models

    Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov · PDF
  67. Value Alignment in the Global South: A Multidimensional Approach to Norm Elicitation in Indian Contexts

    Atmadeep Ghoshal, Martim Brandao, Ruba Abu-Salma · PDF
  68. ValueMap: Mapping Crowdsourced Human Values to Computational Scores for Bi-directional Alignment

    Priya Ronald DCosta, Rupkatha Hira · PDF
  69. Vision Language Models Know Law of Conservation without Understanding More-or-Less

    Dezhi Luo, Haiyun Lyu, Qingying Gao, Haoran Sun, Yijiang Li, Hokin Deng · PDF
  70. Vision Language Models See What You Want but not What You See

    Qingying Gao, Yijiang Li, Haiyun Lyu, Haoran Sun, Dezhi Luo, Hokin Deng · PDF
  71. We Shape AI, and Thereafter AI Shape Us: Humans Align with AI through Social Influences

    Jingshu Li, Tianqi Song, Beichen Xue, Yi-Chieh Lee · PDF