ICML 2024 Workshop on Models of Human Feedback for AI Alignment (MHFAIA)
- Submission deadline: Jun 1, 2024, 18:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (60)
Fetched from OpenReview (v2) on 2026-06-10.
- "You just can’t go around killing people" Explaining Agent Behavior to a Human Terminator
- A Theoretical Framework for Partially Observed Reward-States in RLHF
- Accelerating Best-of-N via Speculative Rejection
- Adversarial Multi-dueling Bandits
- AI Alignment with Changing and Influenceable Reward Functions
- Aligning Crowd Feedback via Distributional Preference Reward Modeling
- Aligning Large Language Models with Representation Editing: A Control Perspective
- AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning
- Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
- Bootstrapping Language Models with DPO Implicit Rewards
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
- Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries
- Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels
- Cross-Domain Knowledge Transfer for RL via Preference Consistency
- Distributional Preference Alignment of LLMs via Optimal Transport
- DPM: Dual Preferences-based Multi-Agent Reinforcement Learning
- DPO Meets PPO: Reinforced Token Optimization for RLHF
- Efficient Inverse Reinforcement Learning without Compounding Errors
- Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
- Filtered Direct Preference Optimization
- Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents
- Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints
- Hummer: Towards Limited Competitive Preference Dataset
- Informed Meta-Learning
- Inverse Reinforcement Learning from Demonstrations for LLM Alignment
- Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment
- Is poisoning a real threat to LLM alignment? Maybe more so than you think
- Language Alignment via Nash-learning and Adaptive feedback
- Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception
- Learning to Assist Humans without Inferring Rewards
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- Modeling the Plurality of Human Preferences via Ideal Points
- Models That Prove Their Own Correctness
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
- MultiScale Policy Learning for Alignment with Long Term Objectives
- New Desiderata for Direct Preference Optimization
- Off-Policy Evaluation from Logged Human Feedback
- Optimal Design for Human Feedback
- Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback
- PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling
- Preference Elicitation for Offline Reinforcement Learning
- Preference Learning Algorithms Do Not Learn Preference Rankings
- Prompt Optimization with Human Feedback
- Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias
- REBEL: Reinforcement Learning via Regressing Relative Rewards
- Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
- Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input
- Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences
- Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
- Revisiting Successor Features for Inverse Reinforcement Learning
- RLHF and IIA: Perverse Incentives
- Scalable Oversight by Accounting for Unreliable Feedback
- Scalably Solving Assistance Games
- Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
- Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
- Stochastic Concept Bottleneck Models
- Towards Aligning Language Models with Textual Feedback
- Towards Safe Large Language Models for Medicine
- Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback
- Weak-to-Strong Extrapolation Expedites Alignment