ICLR 2025 Workshop on Bidirectional Human-AI Alignment
ICLR 2025 Bi-Align Workshop
- Submission deadline: Feb 16, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (71)
Fetched from OpenReview (v2) on 2026-06-10.
- A Benchmark for Scalable Oversight Mechanisms
- A Pilot Study of Weak-to-Strong Generalization in Safety, Toxicity, and Legal Reasoning
- A Roadmap for Human-Agent Moral Alignment: Integrating Pre-defined Intrinsic Rewards and Learned Reward Models
- A Sociotechnical Perspective on Aligning AI with Pluralistic Human Values
- Active Human Feedback Collection via Neural Contextual Dueling Bandits
- Addressing and Visualizing Misalignments in Human Task-Solving Trajectories
- AI Systematically Rewires the Flow of Ideas
- AI-enhanced semantic feature norms for 786 concepts
- Aligning LLMs with Domain Invariant Reward Models
- Augmenting Image Annotation: A Human–LMM Collaborative Framework for Efficient Object Selection and Label Generation
- Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment
- Bidirectional Alignment for Inclusive Narrative Generation
- Broaden your SCOPE! Efficient Conversation Planning for LLMs using Semantic Space
- Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
- Cooperative Agency-Centered LLMs
- CoPL: Collaborative Preference Learning for Personalizing LLMs
- CTRL-Rec: Controlling Recommender Systems With Natural Language
- Data-adaptive Safety Rules for Training Reward Models
- Decision Preference Alignment for Large-Scale Agents Based on Reward Model Generation
- Drift: Efficient Implicit Personalization of Large Language Models
- Envision Human-AI Perceptual Alignment from a Multimodal Interaction Perspective
- Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment
- From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions
- Human Alignment: How Much We Adapt to LLMs?
- Inference-time Alignment in Continuous Space
- InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models with Human Feedback
- Learning From Diverse Experts: Behavior Alignment Through Multi-Objective Inverse Reinforcement Learning
- Mitigating Societal Cognitive Overload in the Age of AI: Challenges and Directions
- Monitoring LLM Agents for Sequentially Contextual Harm
- Moral Alignment for LLM Agents
- Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds
- Negotiative Alignment: An interactive approach to human-AI co-adaptation for clinical applications
- Observability of Latent States in Generative AI Models
- Online Learning with Ranking Feedback and An Application to Equilibrium Computation
- Order Independence With Finetuning
- Outlier-Aware Preference Optimization for Large Language Models
- PARSE-Ego4D: Toward Bidirectionally Aligned Action Recommendations for Egocentric Videos
- Patterns and Mechanisms of Contrastive Activation Engineering
- PILAF: Optimal Human Preference Sampling for Reward Modeling
- Policy Prototyping for LLMs: Pluralistic Alignment via Interactive and Collaborative Policymaking
- Position: Interpretability is a Bidirectional Communication Problem
- Preference Optimization for Concept Bottleneck Models
- Preference-Based Alignment of Discrete Diffusion Models
- Probing Mechanical Reasoning in Large Vision Language Models
- Processing, Priming, Probing: Human Interventions for Explainability Alignment
- Representational Alignment Supports Effective Teaching
- Representational Difference Clustering
- Rethinking AI Cultural Alignment
- Rethinking Anti-Misinformation AI
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
- Scalably Solving Assistance Games
- Shared Similarity Between Humans and Chatbots: Exploring Human Willingness to Seek Social Support From Chatbots
- Societal Alignment Frameworks Can Improve LLM Alignment
- Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
- Superalignment with Dynamic Human Values
- SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment
- Sycophancy Claims about Language Models: The Missing Human-in-the-Loop
- Symmetry-Breaking Augmentations for Ad Hoc Teamwork
- The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics
- The Human Visual System Can Inspire New Interaction Paradigms for LLMs
- The Lock-in Hypothesis: Stagnation by Algorithm
- Towards LVLM-Aided Alignment of Task-Specific Vision Models
- TraCeS: Trajectory Based Credit Assignment From Sparse Safety Feedback
- TRIG-Bench: A Benchmark for Text-Rich Image Grounding
- Trustworthy AI Must Account for Intersectionality
- Understanding (Un)Reliability of Steering Vectors in Language Models
- Value Alignment in the Global South: A Multidimensional Approach to Norm Elicitation in Indian Contexts
- ValueMap: Mapping Crowdsourced Human Values to Computational Scores for Bi-directional Alignment
- Vision Language Models Know Law of Conservation without Understanding More-or-Less
- Vision Language Models See What You Want but not What You See
- We Shape AI, and Thereafter AI Shape Us: Humans Align with AI through Social Influences