ICLR 2026 · Past · Large language models

The 1st Workshop on Scaling Post-training for LLMs

SPOT

Submission deadline
Feb 7, 2026, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (64)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

    Xiaocan Li, Zheng Shen, Shiliang Wu · PDF
  2. Actor-Curator: Scalable Adaptive Curriculum Learning for LLM Post-Training

    Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue · PDF
  3. Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

    Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong · PDF
  4. Beyond Scalar Critics: A Distributional Perspective on Reinforcement Learning with Verifiable Rewards for LLMs

    Jinyi Liu, Yiboyun Chen, Hongyao Tang, Yi Ma, Shuyue Hu, Yang Chen, Fei Ni, Qiaosheng Zhang, Lei Bai, Yan Zheng, Jianye Hao · PDF
  5. BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills

    Atharv Sonwane, Isadora White, Hyunji Lee, Matheus Pereira, Lucas Caccia, Minseon Kim, Zhengyan Shi, Chinmay Singh, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan · PDF
  6. Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search

    Jacopo Minniti, Neil Band, Tim G. J. Rudner · PDF
  7. CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning

    Tianshi Xu, Yuteng Chen, Meng Li · PDF
  8. Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

    Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten · PDF
  9. Compute-Efficient GRPO Training

    Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta · PDF
  10. Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

    Sushant Mehta · PDF
  11. Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking

    Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · PDF
  12. Counterfactual Credit Assignment for Policy Optimization

    Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu · PDF
  13. Coverage Improvement and Fast Convergence of On-policy Preference Learning

    Juno Kim, Jihun Yun, Jason D. Lee, Kwang-Sung Jun · PDF
  14. CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

    Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad · PDF
  15. DELTA4: Sparse Matrix-Vector Multiplication for Low Sparsity

    Vladimír Macko, Vladimír Boža · PDF
  16. DGPO: Decoupled Gradient Policy Optimization for RLVR in LLMs

    Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Cong Qin, Chaowen Hu, Binbin Zheng, Zekai Shao · PDF
  17. Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

    Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul N. Whatmough · PDF
  18. Efficient and Stable Scaling of Reinforcement Learning for LLMs via Dynamic Allocation and Gradient Modulation

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu · PDF
  19. Efficient RL Training for LLMs with Experience Replay

    Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Rémi Munos · PDF
  20. Entropy-Aware On-Policy Distillation of Language Models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee · PDF
  21. Escaping the Mode: Multi Answer Reinforcement Learning in LMs

    Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim · PDF
  22. Execution-Grounded Credit Assignment for GRPO in Code Generation

    Abhijit Kumar, Shikhar Gupta, Natalya Kumar · PDF
  23. Expanding the Capabilities of Reinforcement Learning via Text Feedback

    Yuda Song, Lili Chen, Fahim Tajwar, Rémi Munos, Deepak Pathak, Drew Bagnell, Aarti Singh, Andrea Zanette · PDF
  24. F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

    Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov · PDF
  25. Federated Agent Reinforcement Learning

    Canyu Chen, Kangyu Zhu, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li, Dawn Song · PDF
  26. From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

    Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong · PDF
  27. GEOMA: Geometric and Econometric Objectives for Multi-Reward Alignment

    Taneesh Gupta, Pragya Srivastava, Rahul Madhavan, Karthikeyan Shanmugam, Aravindan Raghuveer · PDF
  28. Hierarchical Agenda Reasoning for Strategic Multi-Turn Dialogue Agents

    Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Aviral Kumar, Sergey Levine · PDF
  29. Is the Importance Ratio Necessary for Stable Reinforcement Learning in LLMs?

    Shuibai Zhang, Junhyuck Kim, Gyeongman Kim, Jaewoong Cho · PDF
  30. IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

    Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric P. Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor W. Killian, Aviral Kumar · PDF
  31. Jointly Reinforcing Diversity and Quality in Language Model Generations

    Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason E Weston, Jack Lanchantin, Tianlu Wang · PDF
  32. Learning Discriminative Process Reward Models without Step Labels

    Kingsley Kim, Haolin Liu, Chen-Yu Wei · PDF
  33. Learning Useful Supervision for Reinforcement Learning in Reasoning Models

    Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong · PDF
  34. Making Complex Reasoning Student-Friendly: A Hybrid LLM-to-SLM Distillation Framework

    Yongjin Yang, Yinghui He, Jiarui Liu, Zhijing Jin · PDF
  35. Maximum Likelihood Reinforcement Learning

    Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette · PDF
  36. Mix Early, Forget Less: Data Mixing During Pretraining Builds Resistance to Forgetting

    Lawrence Feng, Gaurav Rohit Ghosal, Jacob Mitchell Springer, Ziqian Zhong, Aditi Raghunathan · PDF
  37. Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

    Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu · PDF
  38. NyoomFloat12: Lossless 12-bit Weight Compression for Post-Training Inference

    Sylvie Liberman, Tianyi Zhang, Daniel Y Fu · PDF
  39. On quantizing the state of the Muon optimizer

    Aman Gupta, Rafael Celente, Abhishek Shivanna, Daniel Thomas Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, Sathiya Keerthi · PDF
  40. Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

    Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham M. Kakade · PDF
  41. Privileged Information Distillation for Language Models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia · PDF
  42. QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

    Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Kurt Keutzer · PDF
  43. Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

    Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi · PDF
  44. Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL

    Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar · PDF
  45. Recontextualization Mitigates Specification Gaming without Modifying the Specification

    Ariana Azarbal, Victor Gillioz, Vladimir Ivanov, Bryce Woodworth, Jacob Drori, Nevan Wichers, Aram Ebtekar, Alex Cloud, Alexander Matt Turner · PDF
  46. Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Deen Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause · PDF
  47. Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

    Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie · PDF
  48. RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

    Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Ng · PDF
  49. RL Excursions during Pre-training: How early is too early for On-policy Learning?

    Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham M. Kakade · PDF
  50. RL-VLA³: Reinforcement Learning VLA Accelerating via Full Asynchronism

    Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, LUOMINGXI, Hongke Zhao, Likang Wu, Xiaotie Deng, Xi Xiao, Sheng Wen, Yicheng Gong, Junwu Xiong · PDF
  51. Scaling Reward Modeling without Human Supervision

    Jingxuan Fan, Yueying Li, Zhenting Qi, Dinghuai Zhang, Kianté Brantley, Sham M. Kakade, Hanlin Zhang · PDF
  52. Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

    Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James Clayton Kerce, Faramarz Fekri · PDF
  53. Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

    Hai Huang, Yann LeCun, Randall Balestriero · PDF
  54. Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

    Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville · PDF
  55. Sparse Attention for Efficient LLM Reinforcement Learning

    Yang Zhou, Ranajoy Sadhukhan, Zhaofeng Sun, Zhuoming Chen, Souvik Kundu, Saket Dingliwal, Sai Muralidhar Jayanthi, Aram Galstyan, Haizhong Zheng, Beidi Chen · PDF
  56. TestSmith: Reinforcement Learning for Unit Test Generation with Synthetic Perturbations

    Stanley Yu, Roger Jin · PDF
  57. TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

    Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, Rameswar Panda · PDF
  58. Towards Understanding the Benefits of Online Imitation Learning

    Huaqing Zhang, Jingchu Gai, Juno Kim, Bingbin Liu, Andrej Risteski · PDF
  59. Training-Free Dynamic Upcycling of Expert Language Models

    Eros Fanì, Oguzhan Ersoy · PDF
  60. V1: Unifying Generation and Self-Verification for Parallel Reasoners

    Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, Rishabh Tiwari, Long Lian, Yucheng Lu, Boyi Li, Alane Suhr, Ben Athiwaratkun, Kurt Keutzer · PDF
  61. VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

    Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen · PDF
  62. Weight Decay Improves Language Model Plasticity

    Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham M. Kakade · PDF
  63. Weight Space Detection of Backdoors in LoRA Adapters

    David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li, Maheep Chaudhary · PDF
  64. When Tokens Decay and Turns Amplify: A Dual-Granularity Framework for Multi-Turn Preference Optimization

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Haolin Shi, Cong Qin, Chaowen Hu · PDF