ICML 2025 Past Safety & alignment
2nd Workshop on Models of Human Feedback for AI Alignment
MoFA
- Submission deadline: May 28, 2025, 13:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (68)
Fetched from OpenReview (v2) on 2026-06-10.
- A Unified Perspective on Reward Distillation Through Ratio Matching
- ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
- Advancing LLM Safe Alignment with Safety Representation Ranking
- Aggregated Individual Reporting for Post-Deployment Evaluation
- Aligned Textual Scoring Rule
- Aligning Neural Style Representations for Style-based Clustering
- Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model
- Alignment of Large Language Models with Constrained Learning
- Angular Steering: Behavior Control via Rotation in Activation Space
- Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering
- BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Human Annotations and Rationale Indicators
- Composition and Alignment of Diffusion Models using Constrained Learning
- Configurable Preference Tuning with Rubric-Guided Synthetic Data
- Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- CUDA: Capturing Uncertainty and Diversity in Preference Feedback Augmentation
- Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset
- CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
- Deep Context-Dependent Choice Model
- Do Language Models Understand Discrimination? Testing Alignment with Human Legal Reasoning under the ECHR
- Doctor Approved: Generating Medically Accurate Skin Disease Images through AI–Expert Feedback
- Doubly Robust Alignment for Large Language Models
- Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
- Dynamic Guardian Models: Realtime Content Moderation With User-Defined Policies
- EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments
- Efficient Generative Models Personalization via Optimal Experimental Design
- Empirical Studies on the Limitations of Direct Preference Optimization, and a Possible Quick Fix
- Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
- Entropy Controllable Direct Preference Optimization
- Expected Reward Prediction, with Applications to Model Routing
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes
- Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
- FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users
- Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value
- Geometry-Aware Preference Learning for 3D Texture Generation
- Human Feedback Guided Reinforcement Learning for Unknown Temporal Tasks via Weighted Finite Automata
- Implicit User Feedback in Human-LLM Dialogues: Informative to Understand Users yet Noisy as a Learning Signal
- Improvement-Guided Iterative DPO for Diffusion Models
- In-Context Alignment at Scale: When More is Less
- In-Context Personalized Alignment with Feedback History under Counterfactual Evaluation
- Inference-Time Reward Hacking in Large Language Models
- KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF
- Language Model Personalization via Reward Factorization
- Learning interpretable descriptions of human preferences
- LoRe: Personalizing LLMs via Low-Rank Reward Modeling
- Mechanism Design for Alignment via Human Feedback
- Mimicking Human Intuition: Cognitive Belief-Driven Reinforcement Learning
- Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
- Multi-Task Reward Learning from Human Ratings
- On the strength of goodhart's law
- Online Learning and Equilibrium Computation with Ranking Feedback
- Playing the Data: Video Games as a Tool to Annotate and Train Models on Large Datasets
- Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs
- ReDit: Reward Dithering for Improved LLM Policy Optimization
- Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting
- Robust Multi-Objective Controlled Decoding of Large Language Models
- Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
- Robust Reward Modeling via Causal Rubrics
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
- Selective Preference Aggregation
- Self-Concordant Preference Learning from Noisy Labels
- The Strong, weak and benign Goodhart’s law. An independence-free and paradigm-agnostic formalisation
- Theoretical Analysis of KL-regularized RLHF with Multiple Reference Models
- Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
- Tracing Human-like Traits in LLMs: Origins, Real-World Manifestation, and Controllability
- Unanchoring the Mind: DAE-Guided Counterfactual Reasoning for Rare Disease Diagnosis
- Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
- Vertical Moral Growth: A Novel Developmental Framework for Human Feedback Quality in AI Alignment
- What Matters when Modeling Human Behavior using Imitation Learning?