NeurIPS 2025 | Past | Topics: Safety & alignment, Reinforcement learning, Theory
NeurIPS 2025 Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists
ARLET
- Submission deadline: Sep 3, 2025, 13:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (101)
Fetched from OpenReview (v2) on 2026-06-10.
- A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning
- A Reinforcement Learning Approach for Health-Behavioural Recommendations to Reduce Cancer Risk
- A Theoretical Analysis of Information Bottlenecks for Zero-Shot Transfer in Reinforcement Learning
- Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
- Active Learning for Stochastic Contextual Linear Bandits
- Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
- All Roads Lead to Likelihood: The Value of RL in Fine-Tuning
- Automatic Reward Shaping from Multi-Objective Human Heuristics
- Bandit and Delayed Feedback in Online Structured Prediction
- Bandit Learning on Dynamic Graphs
- Behavior-Aware Off-Policy Selection in High-Stake Human-Centric Environments
- Beyond Marginals: Capturing Correlated Returns through Joint Distributional Reinforcement Learning
- Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
- Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners
- Bootstrap Ensemble Uncertainty for State-Adaptive Regularization in Offline Reinforcement Learning
- Compute-Optimal Scaling for Value-Based Deep RL
- Constrained Linear Thompson Sampling
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning
- Data-Dependent Regret Bounds for MABs with Constraints
- Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
- DHP: Discrete Hierarchical Planning for HRL Agents
- Efficient Adversarial Attacks on High-dimensional Offline Bandits
- Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
- Efficiently Robust In-Context Reinforcement Learning with Adversarial Generalization and Adaptation
- Enhancing Diversity in Large Language Models via Determinantal Point Processes
- Exploration Implies Data Augmentation: Reachability and Generalisation in Contextual MDPs
- Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment
- Fictive Learning Augments Model-Based Reinforcement Learning in the Two-Step Task
- floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
- From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
- Generating Auxiliary Tasks with Reinforcement Learning
- Horizon Reduction Makes RL Scalable
- How to Provably Improve Return Conditioned Supervised Learning?
- Human-Inspired Multi-Level Reinforcement Learning
- Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning
- Idea: Bridging Theoretical Fairness Definitions with Multi-Agent Coordination in the Real World
- Idea: Fairness Constraints as Reliability Guarantees for RLHF Reward Models
- Idea: Sharpe Ratio-Optimized Thompson Sampling for Risk-Aware Online Learning
- Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards
- Improved Training Mechanisms for Reinforcement Learning via Online Model Selection
- Improving Value Estimation Critically Enhances Vanilla Policy Gradient
- Intent-Based Reward Inference for Value-Aligned Reinforcement Learning
- Large Language Model-Enhanced RL for Diverse and Novel Recommendations
- Learning a Pessimistic Reward in RLHF: KL Regularization is Not Necessary
- Linear Dynamics meets Linear MDPs: Closed-Form Optimal Policies via Reinforcement Learning
- LLM-Driven Policy Diffusion: Enhancing Generalization in Offline Reinforcement Learning
- Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism
- MOBODY: Model-Based Off-Dynamics Offline Reinforcement Learning
- On the relation of bisimulation, model irrelevance, and corresponding regret bounds
- Open Problem: Order Optimal Regret Bounds for Non-Markovian Rewards
- Optimal Regret Bounds for Policy Optimization in Contextual Bandits
- Optimistic Actor-Critic with Parametric Policies: Unifying Sample Efficiency and Practicality
- Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback
- Outcome-based Exploration for LLM Reasoning
- Policy Compatible Skill Incremental Learning via Lazy Learning Interface
- Policy Gradient Guidance Enables Test Time Control
- Policy Optimization in CMDPs with Bandit Feedback: Learning with Stochastic and Adversarial Constraints
- Policy Search via Bayesian Optimization with Temporal Difference Gaussian Processes
- Policy Testing in Markov Decision Processes
- Principled Learning-to-Communicate in Cooperative MARL: An Information-Structure Perspective
- Provably Efficient and Agile Randomized Q-Learning
- Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
- Real-World Reinforcement Learning of Active Perception Behaviors
- Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
- Replicable Reinforcement Learning with Linear Function Approximation
- Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning
- Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
- Reward Model Overoptimisation in Iterated RLHF
- RL's Razor: Why On-Policy Reinforcement Learning Forgets Less
- Robust Constrained Offline Reinforcement Learning with Linear Function Approximation
- Robust Policy Gradient Optimization through Parameter Perturbation in Reinforcement Learning
- Safe Exploration via Policy Priors
- Safe Guaranteed Dynamics Exploration with Probabilistic Models
- Safe, Trust Region Policy Optimization for Constrained Reinforcement Learning
- Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
- Scaling Offline RL via Efficient and Expressive Shortcut Models
- Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
- Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
- Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning
- Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
- Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game
- State Entropy Regularization for Robust Reinforcement Learning
- Steering Diffusion Policies with Value-Guided Denoising
- Structure Matters: Dynamic Policy Gradient
- SUSD: Structured Unsupervised Skill Discovery through State Factorization
- TARC: Time-Adaptive Robotic Control
- Test Time Risk Adaption with Mixture of Agents
- The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
- The Minimax Complexity of Preference-Based Decision Making in Multi-Objective Reinforcement Learning
- The Role of Preference Data and Unembeddings in the Convergence Rate of DPO
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- Towards Parameter-Free Temporal Difference Learning
- Towards shutdownable agents via stochastic choice
- Uncertainty-Aware Policy-Preserving Abstractions with Abstention for One-Shot Decisions
- Unifying Agent Interaction and World Information for Multi-agent Coordination
- Unsupervised Contrastive Goal Reaching
- What Makes a Reward Model a Good Teacher? An Optimization Perspective
- When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
- When Maximum Entropy Misleads Policy Optimization