ICML 2025 · Past · Efficiency · ML systems · Large language models

ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

ES-FoMo

Unverified seed entry. Some fields are estimates — confirm everything on the official website before planning a submission.

Submission deadline: May 26, 2025, 23:59 AoE (UTC−12). SEED estimate of the historical deadline; verify.
Workshop day: Jul 19, 2025
Submission portal: OpenReview
Notes: SEED DATA. Name and website are from the OpenReview venue record; the workshop date is estimated; verify.

Accepted papers (146)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

    Toshiaki Koike-Akino, Jing Liu, Ye Wang · PDF
  2. A Minimalist Optimizer Design for LLM Pretraining

    Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong · PDF
  3. A Survey on Prompt Tuning

    Zongqian Li, Yixuan Su, Nigel Collier · PDF
  4. ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models

    Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma · PDF
  5. Accelerated Test-Time Scaling with Model-Free Speculative Sampling

    Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati · PDF
  6. Accelerating Linear Attention Design by Unifying Forward & Backward Propagation

    Zhen Qin, Xuyang Shen, Dong Li, Yiran Zhong · PDF
  7. Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

    Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, Beidi Chen · PDF
  8. Adaptive Backbone Selection for Efficient and Real-Time Vision Inference

    Syed Amir Hamza, Alexander Jesser · PDF
  9. Adaptive Self-improvement LLM Agentic System for ML Library Development

    Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun · PDF
  10. An Efficient Row-Based Sparse Fine-Tuning with Low Quantization Error

    Cen-Jhih Li, Aditya Bhaskara · PDF
  11. Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

    Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma · PDF
  12. AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    Wei Fu, Jiaxuan Gao, Shusheng Xu, Zhiyu Mei, Chen Zhu, Xujie Shen, Chuyi He, Guo Wei, Jun Mei, WANG JIASHU, Tongkai Yang, Binhang Yuan, Yi Wu · PDF
  13. Autoregressive Language Modeling by Compressed Sequence Mixing

    Jatin Prakash, Aahlad Manas Puli, Rajesh Ranganath · PDF
  14. AWP: Activation-aware Weight Pruning and Quantization with Projected Gradient Descent

    Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand · PDF
  15. Balancing LoRA Performance and Efficiency with Simple Shard Sharing

    Jiale Kang, Qingyu Yin · PDF
  16. Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

    Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh · PDF
  17. Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

    Gholamali Aminian, Idan Shenfeld, Amir R. Asadi, Ahmad Beirami, Youssef Mroueh · PDF
  18. Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

    Vaibhav Singh, Paul Janson, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien · PDF
  19. BlockBPE: Parallel BPE Tokenization

    Amos You · PDF
  20. BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning

    Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, Samet Oymak · PDF
  21. Byzantine-Resilient Zero-Order Optimization for Scalable Federated Fine-Tuning of Large Language Models

    Maximilian Egger, Mayank Bakshi, Rawad Bitar · PDF
  22. Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference

    Nearchos Potamitis, Lars Henning Klein, Chongyang Xu, Attreyee Mukherjee, Bardia Mohammadi, Niket Tandon, Laurent Bindschaedler, Akhil Arora · PDF
  23. CarbonGearRL: Precision-Elastic, Carbon-Aware Scheduling for Foundation-Model Training

    Thomas Y Chen · PDF
  24. Cartridges: Lightweight and general-purpose long context representations via self-study

    Sabri Eyuboglu, Ryan Saul Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Ruoyu Liu, Atri Rudra, James Y. Zou, Azalia Mirhoseini, Christopher Re · PDF
  25. Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

    Austin Silveria, Soham V. Govande, Daniel Y Fu · PDF
  26. CoDM: A Co-design Framework for Efficient Sparse Diffusion Models

    Xiaolong Wu, Xiang Gao, Xiyun Song, Zongfang Lin, Heather Yu, David Gu · PDF
  27. Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

    Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati · PDF
  28. Compressing Large Language Models to Any Size Without Re-Computation

    Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann · PDF
  29. ConMeZO: Adaptive Directional Sampling for Gradient-Free Finetuning of Language Models

    Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil · PDF
  30. Context-lite Multi-turn Reinforcement Learning for LLM Agents

    Wentse Chen, Jiayu Chen, Hao Zhu, Jeff Schneider · PDF
  31. Continuous Autoregressive Generation with Mixture of Gaussians

    Alex Quach, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini · PDF
  32. Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching

    Qizheng Zhang, Michael Wornow, Kunle Olukotun · PDF
  33. d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover · PDF
  34. Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

    Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, yelong shen · PDF
  35. DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

    Yuheng Wu, Jianwen Xie, Denghui Zhang, Zhaozhuo Xu · PDF
  36. Demystifying Language Model Forgetting with Low-rank Example Associations

    Xisen Jin, Xiang Ren · PDF
  37. DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

    Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath · PDF
  38. Early Attentive Sparsification Accelerates Neural Speech Transcription

    Zifei Xu, Sayeh Sharify, Hesham Mostafa, Tristan J Webb, Wanzin Yazar, Xin Wang · PDF
  39. Efficient and Accurate KV-cache Management for Long-Sequence LLMs

    Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li · PDF
  40. Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

    Guoliang He, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, Eiko Yoneki · PDF
  41. Efficient Temporal Tokenization for Mobility Prediction with Large Language Models

    Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · PDF
  42. Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

    Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang · PDF
  43. Exchangeability in Neural Network Architectures and its Application to Dynamic Pruning

    Pu Luke Yi, Tianlang Chen, Yifan Yang, Sara Achour · PDF
  44. Exploring Diffusion Transformer Designs via Grafting

    Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei · PDF
  45. Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

    Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma · PDF
  46. Flexi-LoRA: Efficient LoRA Finetuning with Input-Adaptive Dynamic Ranks

    Zongqian Li, Yixuan Su, Han Zhou, Zihao Fu, Nigel Collier · PDF
  47. Foreign Sparse Attention: Effective Distillation into Sparse Attention

    Vijaykaarti Sundarapandiyan, Tom Goldstein, Ashwinee Panda · PDF
  48. FPTQuant: Function-Preserving Transforms for LLM Quantization

    Boris van Breugel, Yelysei Bondarenko, Paul N. Whatmough, Markus Nagel · PDF
  49. FrugalRAG: Learning to retrieve and reason for multi-hop QA

    Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma · PDF
  50. GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

    Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping · PDF
  51. GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

    Martin Andrews, Sam Witteveen · PDF
  52. Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation

    Yehjin Shin, Seojin Kim, Noseong Park · PDF
  53. Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

    Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis · PDF
  54. HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations

    Marco Federici, Riccardo Del Chiaro, Boris van Breugel, Paul N. Whatmough, Markus Nagel · PDF
  55. Hardware-Efficient Attention for Fast Decoding

    Ted Zadouri, Hubert Strauss, Tri Dao · PDF
  56. How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

    Tuan Anh Tran, Duy Minh Ho Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D Doan, Roger Wattenhofer, Vien Anh Ngo, Mathias Niepert, Daniel Sonntag, Paul Swoboda · PDF
  57. How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

    Ayeong Lee, Ethan Che, Tianyi Peng · PDF
  58. InterLoRA: An Adaptive LoRA Structure Based on The Mechanistic Interpretability of Transformer

    Jihao Gu, Zelin Wang, Yibo Zhang, Ping Gong, Zhisong Bie · PDF
  59. Is Visual Prompting the Right Setup for Knowledge Transfer in new Foundation Models?

    Niclas Hergenröther, Antonio Orvieto · PDF
  60. Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers

    Sarthak Mittal, Divyat Mahajan, Guillaume Lajoie, Mohammad Pezeshki · PDF
  61. JSONSchemaBench: Evaluating Constrained Decoding with LLMs on Efficiency, Coverage and Quality

    Saibo Geng, Hudson Cooper, Michal Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, Harsha Nori · PDF
  62. Kevin: Multi-Turn RL for Generating CUDA Kernels

    Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti · PDF
  63. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song · PDF
  64. Language System: A Lightweight Ranking Framework for Language Models

    Chenheng Zhang, Tianqi Du, Jizhe Zhang, Mingqing Xiao, Yifei Wang, Yisen Wang, Zhouchen Lin · PDF
  65. Large Reasoning Models Know How to Think Efficiently

    Zeyu XING, Xing Li, Huiling Zhen, Xianzhi Yu, Mingxuan Yuan, Sinno Jialin Pan · PDF
  66. LATTICE: Learning to Efficiently Compress the Memory

    Mahdi Karami, Vahab Mirrokni · PDF
  67. Learning Adaptive Parallel Reasoning with Language Models

    Jiayi Pan, Xiuyu Li, Long Lian, Charlie Victor Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr · PDF
  68. Learning to Discover Abstractions for LLM Reasoning

    Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar · PDF
  69. Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

    Shivam Chandhok, Qian Yang, Oscar Mañas, Kanishk Jain, Aishwarya Agrawal, Leonid Sigal · PDF
  70. LOGAH: Initialize Large Transformers via Small Graph HyperNetworks

    Xinyu Zhou, Boris Knyazev, Alexia Jolicoeur-Martineau, Jie Fu · PDF
  71. LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

    Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An · PDF
  72. LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs

    Reza Arabpour, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios · PDF
  73. LoRA Merging with SVD: Understanding Interference and Preserving Performance

    Dennis Tang, Prateek Yadav, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal · PDF
  74. Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement

    Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak · PDF
  75. Mamba Drafters for Speculative Decoding

    Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati · PDF
  76. MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

    Mugilan Ganesan, Shane Segal, Ankur Aggarwal, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa · PDF
  77. MatMuls are Enough for Efficient and Performant Linear-Time Attention

    Andrew Argatkiny, Ilya Makarov · PDF
  78. Mitigating Over-Smoothing in Mamba2 via Spectral Domain Analysis

    Seojin Kim, Yehjin Shin, Noseong Park · PDF
  79. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun · PDF
  80. Model Parallelism With Subnetwork Data Parallelism

    Vaibhav Singh, Zafir Khalid, Eugene Belilovsky, Edouard Oyallon · PDF
  81. MTraining: Efficient Distributed Training for Ultra-Long Contexts via Dynamic Sparse Attention

    Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu · PDF
  82. Mu-Parametrization for Mixture of Experts

    Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski · PDF
  83. MuLoCo: Muon is a practical inner optimizer for DiLoCo

    Benjamin Thérien, Xiaolong Huang, Irina Rish, Eugene Belilovsky · PDF
  84. Multi-stream Sequence Learning

    Mohamed Elsayed, A. Rupam Mahmood · PDF
  85. Multi-student Diffusion Distillation for Better One-step Generators

    Yanke Song, Jonathan Lorraine, Weili Nie, Karsten Kreis, James Lucas · PDF
  86. Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective

    Leo Gagnon, Eric Elmoznino, Sarthak Mittal, Tom Marty, Tejas Kasetty, Dhanya Sridhar, Guillaume Lajoie · PDF
  87. One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning

    Ritesh Goru, Shanay Mehta, Prateek Jain · PDF
  88. Optimal Formats for Weight Quantisation

    Douglas Orr, Luka Ribar, Carlo Luschi · PDF
  89. Outlier-Free Genomic Foundation Models for Resource-Efficient Training and Low-Bit Inference

    Chenghao Qiu, Haozheng Luo, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu · PDF
  90. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

    Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang · PDF
  91. Partition Generative Modeling: Masked Modeling Without Masks

    Justin Deschenaux, Lan Tran, Caglar Gulcehre · PDF
  92. PiKE: Adaptive Data Mixing for Large-Scale Multi-Task Learning Under Low Gradient Conflicts

    Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni · PDF
  93. PiKV: KV Cache Management System for MoE Architecture

    Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang · PDF
  94. pLSTM: parallelizable Linear Source Transition Mark networks

    Korbinian Pöppel, Richard Freinschlag, Thomas Schmied, Wei Lin, Sepp Hochreiter · PDF
  95. PoLAR: Polar-Decomposed Low-Rank Adapter Representation

    Kai Lion, Liang Zhang, Bingcong Li, Niao He · PDF
  96. PoTPTQ: A Two-step Power-of-Two Post-training for LLMs

    Xinyu Wang, Vahid Partovi Nia, Peng Lu, Jerry Huang, Xiao-Wen Chang, Boxing Chen, Yufei Cui · PDF
  97. Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models

    Aneesh Muppidi, Katrina Brown, Rana Shahout · PDF
  98. Privacy Isn’t Free: Benchmarking the Systems Cost of Privacy-Preserving ML

    Nnaemeka Casmir Obiefuna, Samuel Oladayo Oyeneye, Similoluwa Odunaiya, Iremide Blessing Oyelaja, Steven Kolawole · PDF
  99. Private Zeroth-Order Optimization with Public Data

    Xuchen Gong, Tian Li · PDF
  100. Proof-of-Concept for Private Local-to-Cloud LLM Chat via Trusted Execution Environments

    Avanika Narayan, Dan Biderman, Christopher Re · PDF
  101. PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning

    Zongqian Li, Yixuan Su, Nigel Collier · PDF
  102. Q-Adam-mini: Memory-Efficient 8-bit Quantized Optimizer for Large Language Model Training

    Yizhou Han, Chaohao Yang, Congliang Chen, Xingjian Wang, Ruoyu Sun · PDF
  103. QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models

    Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu · PDF
  104. Quartet: Native FP4 Training Can Be Optimal for Large Language Models

    Roberto L. Castro, Andrei Panferov, Rush Tabesh, Jiale Chen, Oliver Sieberling, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh · PDF
  105. Radio: Rate–Distortion Optimization for Large Language Model Compression

    Sean I. Young · PDF
  106. Resource-efficient Inference with Foundation Model Programs

    Lunyiu Nie, Zhimin Ding, Kevin Yu, Marco Cheung, Chris Jermaine, Swarat Chaudhuri · PDF
  107. Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

    Beomsik Cho, Jaehyung Kim · PDF
  108. SageAttention2++: A More Efficient Implementation of SageAttention2

    Jintao Zhang, Xiaoming Xu, Jia wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen · PDF
  109. SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

    Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar · PDF
  110. Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

    Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa · PDF
  111. Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling

    Mónika Farsang, Ramin Hasani, Radu Grosu · PDF
  112. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein · PDF
  113. SD$^2$: Self-Distilled Sparse Drafters

    Mike Lasby, Nish Sinnadurai, Valavan Manohararajah, Sean Lie, Yani Ioannou, Vithursan Thangarasa · PDF
  114. Shrinking the Generation-Verification Gap with Weak Verifiers

    Jon Saad-Falcon, E. Kelly Buchanan, Mayee F Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Re · PDF
  115. SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling

    Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You · PDF
  116. SpecCoT: Accelerating Chain-of-Thought Reasoning through Speculative Exploration

    Junhan Shi, Yijia Zhu, Zhenning Shi, Dan Zhao, Qing Li, Yong Jiang · PDF
  117. SPECS: Faster Test-Time Scaling through Speculative Drafts

    Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun · PDF
  118. Speeding up Speculative Decoding via Sequential Approximate Verification

    Meiyu Zhong, Noel Teku, Ravi Tandon · PDF
  119. Steering LLM Reasoning Through Bias-Only Adaptation

    Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, Daniil Gavrilov · PDF
  120. Tail-Optimized Caching for LLM Inference

    Wenxin Zhang, Yueying Li, Tianyi Peng, Ciamac C. Moallemi · PDF
  121. Tensor Product Attention Is All You Need

    Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew C Yao · PDF
  122. The Road Not Taken: Hindsight Exploration for LLMs in Multi-Turn RL

    Huaxiaoyue Wang, Sanjiban Choudhury · PDF
  123. Thinformer: Guaranteed Attention Approximation via Low-Rank Thinning

    Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey · PDF
  124. Think Clearly: Improving Reasoning via Redundant Token Pruning

    Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati · PDF
  125. ThinkingViT: Nested Thinking Vision Transformer for Elastic Inference

    Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel · PDF
  126. Tiny Reward Models

    Sarah Pan · PDF
  127. TinyServe: Query-Aware Cache Selection for Efficient LLM Inference

    Dong Liu, Yanxuan Yu · PDF
  128. TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper

    Suzhongling, Rong Fu, Weihan Cao, Jianfei Gao, Minxi Jin, PeiZhilin, Hui Wang · PDF
  129. TORCHSIM: High Fidelity Runtime and Memory Estimation for Distributed Training

    Sanket Purandare, Emma Yang, Andrew Zhao, Qitong Wang, Wei Feng, Alban Desmaison, Andrew Gu, Tianyu Liu, Less Wright, Gokul Nadathur, Stratos Idreos · PDF
  130. Toward Dataset Distillation for Regression Problems

    Jamie Mahowald, Ravi Srinivasan, Zhangyang Wang · PDF
  131. Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

    Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Suzhongling, Liang Liu, PeiZhilin, Hengjie Li, Xingcheng Zhang, Weiming Zhang · PDF
  132. Towards Large Scale Training on Apple Silicon

    Tycho F. A. van der Ouderaa, Mohamed Baioumy, Matt Beton, Seth Howes, Gelu Vrabie, Alex Cheema · PDF
  133. Towards Understanding Orthogonalization in Muon

    Valentyn Boreiko, Zhiqi Bu, Sheng Zha · PDF
  134. Towards Understanding Self-Pretraining for Sequence Classification

    Omar Coser, Antonio Orvieto · PDF
  135. Training Language Models to Reason Efficiently

    Daman Arora, Andrea Zanette · PDF
  136. Training-free LLM Verification via Recycling Few-shot Examples

    Dongseok Lee, JIMYUNG HONG, Dongyoung Kim, Jaehyung Kim · PDF
  137. Training-Free Semantic Deferrals for Open-Ended LLM Cascades

    Duncan Soiffer, Steven Kolawole, Virginia Smith · PDF
  138. Ultra-Efficient and Effective Large Language Models with Multi-Boolean Architectures

    Ba-Hien Tran, Van Minh Nguyen · PDF
  139. Unbounded Memory and Consistent Imagination via Unified Diffusion–SSM World Models

    Jia-Hua Lee, Bor-Jiun Lin, Wei-Fang Sun, Chun-Yi Lee · PDF
  140. Unified Scaling Laws for Compressed Representations

    Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh · PDF
  141. Vision Language Model Distillation Using Partial Information Decomposition

    Stephen D. Liang · PDF
  142. VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

    Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyuan Lu, Christopher Lott, Mingu Lee · PDF
  143. VScan: A Two-Stage Visual Token Reduction Framework for Accelerating Large Vision-Language Models

    Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia P. Sycara, Haitao Mi, Dong Yu · PDF
  144. WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

    Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen · PDF
  145. Zero-Shot Conversion to Monarch-Structured Attention

    Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano · PDF
  146. zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

    Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West · PDF