ICLR 2026 Past Large language models
The 1st Workshop on Scaling Post-training for LLMs
SPOT
- Submission deadline: Feb 7, 2026, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (64)
Fetched from OpenReview (v2) on 2026-06-10.
- A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
- Actor-Curator: Scalable Adaptive Curriculum Learning for LLM Post-Training
- Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum
- Beyond Scalar Critics: A Distributional Perspective on Reinforcement Learning with Verifiable Rewards for LLMs
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
- Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search
- CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
- Compute-Efficient GRPO Training
- Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking
- Counterfactual Credit Assignment for Policy Optimization
- Coverage Improvement and Fast Convergence of On-policy Preference Learning
- CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
- DELTA4: Sparse Matrix-Vector Multiplication for Low Sparsity
- DGPO: Decoupled Gradient Policy Optimization for RLVR in LLMs
- Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
- Efficient and Stable Scaling of Reinforcement Learning for LLMs via Dynamic Allocation and Gradient Modulation
- Efficient RL Training for LLMs with Experience Replay
- Entropy-Aware On-Policy Distillation of Language Models
- Escaping the Mode: Multi Answer Reinforcement Learning in LMs
- Execution-Grounded Credit Assignment for GRPO in Code Generation
- Expanding the Capabilities of Reinforcement Learning via Text Feedback
- F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
- Federated Agent Reinforcement Learning
- From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
- GEOMA: Geometric and Econometric Objectives for Multi-Reward Alignment
- Hierarchical Agenda Reasoning for Strategic Multi-Turn Dialogue Agents
- Is the Importance Ratio Necessary for Stable Reinforcement Learning in LLMs?
- IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
- Jointly Reinforcing Diversity and Quality in Language Model Generations
- Learning Discriminative Process Reward Models without Step Labels
- Learning Useful Supervision for Reinforcement Learning in Reasoning Models
- Making Complex Reasoning Student-Friendly: A Hybrid LLM-to-SLM Distillation Framework
- Maximum Likelihood Reinforcement Learning
- Mix Early, Forget Less: Data Mixing During Pretraining Builds Resistance to Forgetting
- Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
- NyoomFloat12: Lossless 12-bit Weight Compression for Post-Training Inference
- On quantizing the state of the Muon optimizer
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
- Privileged Information Distillation for Language Models
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck
- Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL
- Recontextualization Mitigates Specification Gaming without Modifying the Specification
- Reinforcement Learning via Self-Distillation
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes
- RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
- RL Excursions during Pre-training: How early is too early for On-policy Learning?
- RL-VLA$^3$: Reinforcement Learning VLA Accelerating via Full Asynchronism
- Scaling Reward Modeling without Human Supervision
- Scaling Search-Augmented LLM Reasoning via Adaptive Information Control
- Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
- Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
- Sparse Attention for Efficient LLM Reinforcement Learning
- TestSmith: Reinforcement Learning for Unit Test Generation with Synthetic Perturbations
- TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
- Towards Understanding the Benefits of Online Imitation Learning
- Training-Free Dynamic Upcycling of Expert Language Models
- V1: Unifying Generation and Self-Verification for Parallel Reasoners
- VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
- Weight Decay Improves Language Model Plasticity
- Weight Space Detection of Backdoors in LoRA Adapters
- When Tokens Decay and Turns Amplify: A Dual-Granularity Framework for Multi-Turn Preference Optimization