ICML 2026 · Past · Reinforcement learning, Optimization, Datasets
Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning
DEMO 2026
- Submission deadline: May 9, 2026, 23:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (142)
Fetched from OpenReview (v2) on 2026-06-10.
- A Differentiable Bayesian Optimization Framework via Variational Mutual Information Estimation
- A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise
- A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search
- A Mutual Information Lower Bound for Multimodal Regression Active Learning
- A Planning-Based Reinforcement Learning Approach to Numerical Optimization
- Abstraction for Offline Goal-Conditioned Reinforcement Learning
- Action-Free Offline RL via Demonstrator Diversity
- AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates
- Adaptive Querying with AI Persona Priors
- Adaptive Stratified Active Statistical Inference
- Aligning Flow Map Policies with Optimal $Q$-Guidance
- An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
- ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate
- AsyncOPD: How Stale Can On-Policy Distillation Be?
- Auditing Offline Demonstration Pruning for Online Robot Deployment
- Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization
- Bayesian Optimization with Early Trial Termination for Speeding Up Parallel Neural Network Training
- Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
- Belief-Aware Decision Transformers for Offline-to-Online Decision-Making under Partial Observability: A Geosteering Case Study
- Bellman–Whitney Envelopes: Sharp Partial Identification in Offline Control under Support Holes
- Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors
- BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
- Boosting Direct Preference Optimization with Penalization
- Boosting for Reinforcement Learning in Structured MDPs
- Can K Heads Explore Better Than One in Online Reinforcement Learning?
- Can We Really Learn One Representation to Optimize All Rewards?
- Clarifying Uncertainty Quantification in Off-Policy Evaluation: Beyond Effective Sample Sizes, Towards Confidence Intervals
- CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
- CombiLatent: Neural Combinatorial Optimization via Latent Space Search under Sinkhorn Divergence Regularization
- Combinatorial Allocation Bandits with Nonlinear Arm Utility
- Conformal Candidate Certification for Offline Model-Based Optimization
- CoRDE: Concept-Prior Routed Diffusion Experts for Structural Generalization in Robot Manipulation
- Cost-Aware Learning
- Curvature-Aware Active Statistical Inference: Reducing Labeling via Data Coherence
- Decision Titan: Test-Time Training for Long-Term Memory in Offline Reinforcement Learning
- Disentangled Differentiable Model Predictive Control for Data-efficient and Interpretable Imitation Learning
- Dual Advantage Fields
- DUO: Diffusion Models for Universal Offline Black-Box Optimization
- Efficient Algorithms for Contextual Apple Tasting with Log-Loss
- Efficient Cost-Aware LLM Evaluation via Bayesian Bandit Gittins Indices
- Efficient Off-Policy RL for Video Generation via Forward-Consistent Reward Matching
- Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
- ElementMindX: Offline Supplier-Substitution Ranking for Natural-Language Trade-Shock Decision Support
- Fairness of Exposure in Stochastic Multiple-play Multi-armed Bandits
- Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
- FASTER: Value-Guided Sampling for Fast RL
- FICReg: Forward-Inverse Consistency Regularization for Latent World Models
- Flow-Based Offline Reinforcement Learning for Voltage Regulation in Distribution Networks
- Forgetting to Improve: Principled Data Removal in Active Learning
- Freeze the Policy, Infer the Goal: Cross-Domain Imitation with World Models
- From Offline Evidence to Online Action: A Decision Framework for Imperfect Offline Evaluation
- From Offline Global Information to Online Decentralized Policies in Edge Network Scheduling
- From Offline Trajectories to Online Adaptation: A Multimodal JEPA Pretraining Study on Pokemon Red
- From Static Policies to Adaptive Priors in Offline Reinforcement Learning
- Future Information-Directed Sampling for Bayesian Nonstationary Bandits
- Globally Convergent Offline Reinforcement Learning with Smoothed Bellman Residual Minimization
- Good Experience Maximization
- Hazard Compression: Catastrophic Forgetting in Diffusion-Based Generative Replay for Safe Reinforcement Learning
- Hidden Failure Modes in Latent World-Model Planning from Offline Data
- How Many Initial Points Does Bayesian Optimization Need?
- Imitating the Imperfect: Offline-to-Online Robust Imitation Learning from Heterogeneous Demonstrators
- Improve Reasoning Ability by Reinforcing Only from Positive Rollouts
- Improving Multi-Agent Coordination with a Drift-Aware RL Objective
- In-context Latent Space Bayesian Optimization
- In-Context Pure Exploration in Continuous Decision Spaces
- Information-Directed Offline-to-Online Reinforcement Learning
- Instability and Interpretability Discrepancies Between CNNs and Vision Transformers in Keratoconus Detection
- Is Temporal-Difference Learning the Only Path to Stitching in RL?
- Learning from the Right Mistakes: When Do Low-Performing Data Help Offline Policy Gradients?
- Learning Insider-Threat Intervention Policies from Offline Logs
- Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning
- Learning through Adaptive Queries: a Directional Derivative Approach
- Learning to Orchestrate Heterogeneous Agents under Uncertainty
- Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
- Less Tuning, Better Planning: Simplifying Offline Model-Based Planning
- Leveraging Instruction Tuning and Merging for Reasoning Model Adaptation
- Leveraging Offline Supervision for Efficient and Generalizable Reinforcement Learning in Large-Scale Vision–Language–Action Models
- LLM-PriorCB: Textual Contextual Bandits with LLM-Induced Priors
- Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism
- Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning
- Meta-GC-TTT: Training Offline Goal-Conditioned Policies for Test-Time Adaptation
- MIRT: Multi-Dimensional IRT for SLO-Adaptive Multi-Agent Routing
- Molten Pot: Evaluations & Datasets for Social Offline Reinforcement Learning
- Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation
- Neutral Reward Filtering for Fair Offline-to-Online Diffusion Alignment
- Noisy-Space Policy Gradient for Diffusion Policies in Offline Reinforcement Learning
- Offline Multi-Agent Reinforcement Learning for Objective-Weight Adaptation in Three-Sided Marketplace Dispatch
- Offline Policy Learning for Clinical-Trial Strategy
- Offline Policy Learning under Compliance Uncertainty: Adoption-Aware Decision-Making with Observational-to-RCT Calibration Drift
- Offline Preference Learning with Clustering and Active Data-Augmentation
- On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
- On the Role of Proposal Support in Diffusion-Based Offline RL for Sequential Decision-Making
- Online Regret Minimization in Linear Bandits with Offline data.
- Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies
- Orchestrating LLMs as Hierarchical Multi-Agent Reinforcement Learning System for Automotive Software Development
- PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation
- Pessimism’s Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models.
- Pitfalls and Remedies for Multi-Task Bayesian Optimization
- Policy-Only Power Sampling for Vision-Language-Action Control
- Position: Offline-Dataset Evaluation for Online Decision-Making Needs an Identification Standard
- Practical Bayesian Optimization for Scientific Discovery
- Provably Stable Neural Dynamics via Koopman Operator Certificates
- PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
- Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
- Qantara: Bridge-Flow Training for Multi-Paradigm JEPA Control
- Rationale-Guided Policy Optimization: Learning to Reason with Adaptive Rationale Scaffolding
- Receding-Horizon Control via Drifting Models
- Receding-Horizon Execution for Action Chunking in Offline-to-Online Reinforcement Learning
- Rethinking Bayesian Optimization for Co-Optimizing LLM Training Configurations
- REVES: REvision and VErification–Augmented Training for Test-Time Scaling
- Reward-Wise Value Estimation for Multi-Reward Optimization in Large Language Models
- RLRank: Distilling Offline Oracles into Online Policies for Document Reranking
- Safe-CDT: Adaptive Target Scheduling for Safe Cross-Domain Deployment of Constrained Decision Transformers
- SALT: Learning State- and Temporally-Abstracted World Models for Offline Long-Horizon Decision-Making
- Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
- Sampling-Based Safe Reinforcement Learning
- Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition
- Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities
- Spectral Perturbation Bounds for Experience Replay: A Bias–Variance Decomposition for Offline Decision-Making
- Static Benchmarks Are Broken: The Case for Dynamic Evaluation of LLMs
- Statistical Complexity of Soft Bellman Residual Minimization
- Structured Behavioral Heterogeneity as Latent Regime Constraints
- The Illusion of State: Sharp Memory-Decay Bounds in Linear SSMs
- The Three Regimes of Offline-to-Online Reinforcement Learning
- Tight Gap-Dependent Regret Bounds and Problem-Independent Bounds for Cost-aware Cascading Bandits
- Towards Adapting Contrastive RL to the Offline Setting
- TRACER: Trust-Calibrated Offline-to-Online Reinforcement Learning
- Transfer-Ready Critics: Auditing Conservatism Footprints for Offline-to-Online RL
- Trust the Batch, Online or Offline: Adaptive Policy Optimization for Post-Training
- UA2C: Uncertainty-Aware Adaptive Action Chunking for Offline-to-Online Decision-Making in Mixed Traffic
- Uncertainty-Guided Reward Labeling for Reinforcement Learning under Limited Feedback
- Unified Latent Steering and Residual Refinement for Online Improvement of Diffusion Policy Models
- UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning
- Utilizing Historical Data for Neural Bandits with Domain Shift
- V-VLAPS: Value-Guided Planning for Vision-Language-Action Models
- VLA Grounder: Language-Conditioning Space Optimization for Black-Box VLA Models
- What Makes Value Learning Efficient in Residual Reinforcement Learning?
- When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
- When Loss Signals Dominate Context: Adaptive Expert Routing in the Loss-Dominance Regime
- When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction
- When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
- XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies