ICML 2026 · Past · Large language models

AdaptFM: Resource-Adaptive Foundation Model Inference

Submission deadline: May 8, 2026, 23:59 AoE (UTC−12) (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (124)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Recipe for an Elastic Mixture: One Mixture-of-Experts for Every Resource Budget

    Chloe Chia · PDF
  2. A Tale of Two Temperatures: Simple, Efficient, and Diverse Sampling from Diffusion Language Models

    Theo X. Olausson, Metod Jazbec, Xi Wang, Armando Solar-Lezama, Christian A. Naesseth, Stephan Mandt, Eric Nalisnick · PDF
  3. A3: an Analytical Low-Rank Approximation Framework for Attention

    Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, Christos-Savvas Bouganis, George Anthony Constantinides, Wayne Luk, Yiren Zhao · PDF
  4. AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

    Beshr IslamBouli, David Jin · PDF
  5. Accelerating LLM Inference via Vector Index Based Output Embeddings

    Martin Loretz, Sepp Hochreiter · PDF
  6. Activation Quantization of Vision Encoders Needs Prefixing Registers

    Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee · PDF
  7. Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

    Nico Harder, Daniel Becking, Karsten Mueller, Wojciech Samek · PDF
  8. Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification

    Shaddin Dughmi, Mahdi Haghifam, Yusuf Hakan Kalayci · PDF
  9. Adaptive Safety Probing for Resource-Efficient Vision-Language-Action Models

    Seongbin Park, Fan Zhang, Hossein Khalili, Nader Sehatbakhsh · PDF
  10. AgentKV: Phase-Aware KV Eviction for Agentic LLMs

    Taowen Tony Liu, Jeffrey T. H. Wong, Can Xiao, Bowen Yang, Hao Mark Chen, Yiren Zhao · PDF
  11. AgentRouter: Heterogeneous Model Routing for Cost-Optimal Multi-Step Agentic Workflows

    Rudrendu Kumar Paul, Sourav Nandy · PDF
  12. Alignment Collapse Under KV Cache Quantization: A 35-Minute Audit for Quantized LLM Deployments

    Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou · PDF
  13. BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

    Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun · PDF
  14. Beyond Imitation: A Resource Adaptive Embedder that Outperforms its 14× Larger Teacher on Financial Retrieval

    Ailar Mahdizadeh, Aria Salari, Sohail Rajabi, Shahriar Mirabbasi, Panos Nasiopoulos, Alireza Morsali · PDF
  15. Block-Based Double Decoders

    Asher Labovich, Vanessa Alexander, Chaitanya Harsha, Benjamin Bradley · PDF
  16. Block-Level Recursion: Adaptive Test-Time Routing in Large Language Models

    Kristiyan Sakalyan, Sanghwan Kim, Leo Schwinn, Quentin Bouniot, Zeynep Akata · PDF
  17. Cache You Later: Post-Compression KV Repair for Long-Context Agentic LLM Inference

    Andrew Rusli, Shreyan Paliwal, Henry Zhang, Michael Jiao · PDF
  18. CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

    Ziteng Sun, Adrian Benton, Sadik Yagiz Yetim, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh · PDF
  19. CARES: Context-Aware Resolution Selector for VLMs

    Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz · PDF
  20. Characterizing self-speculative decoding approaches for accelerating LLMs

    Jungmin Ha, Karthik Ganesan, Anh Nguyen, Tanvir Ahmed, Andreas Moshovos · PDF
  21. CLAWS: Calibration-Aware Activation Sparsity for Instruction-Tuned LLMs

    Noah Cylich, Karen Mosoyan, Henry Ndubuaku · PDF
  22. COAT: COrrelation-Aware Orthogonal Transform for LLM Quantization

    Indranil Patra, AZHAR YOUSUF, Manu Mathew, Chandra Sekhar Seelamantula · PDF
  23. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    Soo Min Kwon, Ziteng Sun, Ananda Theertha Suresh, Himanshu Jain, Sanjiv Kumar · PDF
  24. Convergence-Gated Distillation for Resource-Adaptive Reinforcement Learning Agents

    Bruce Changlong Xu, Jay J. Park, Vivek Buch · PDF
  25. CoupledNorm: Efficient Normalization via Shared RMS Statistics

    Martin Loretz, Sepp Hochreiter · PDF
  26. Cross-Tokenizer LLM Distillation through a Byte-Level Interface

    Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, Davide Buffelli · PDF
  27. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu · PDF
  28. Decoupling Spatial and Semantic Token Compression for Vision-Language Model Acceleration

    Seunghun Moon, Jaehyun Pyun, Hyunwoo Yu, Suk-Ju Kang · PDF
  29. DIPA: Difficulty-Informed Probabilistic Allocation of Test-Time Compute via Training-Free Proxies

    Wenyang Hu, Yao Shu, See-Kiong Ng, Bryan Kian Hsiang Low · PDF
  30. Distill, Suppress, and Fuse: Cross-Modal Knowledge Integration for Optical Flow-Free Temporal Action Segmentation

    Seungjin Han, Gyeong-Hyeon Kim, Eunwoo Kim · PDF
  31. DREAM-MoE: Downstream Routing Error-Aware Margin-Preserving Quantization for Mixture-of-Experts Large Language Models

    Hancheol Park, Geonho Lee, Tae-Ho Kim · PDF
  32. DropKV: Decoupling Residual-Output Perturbation for Near-Optimal KV-Cache Eviction

    Aozhong Zhang, Selcuk Gurses, Yanxia Deng, Naigang Wang, Chi-Chun Liu, Davis Wertheimer, Derrick Liu, Xin Li, Zi Yang, Felix X.-F. Ye, Penghang Yin · PDF
  33. Dropping the Anchor: Statistical Context Summarization for Distributed Systems via Pulsar Attention

    Aryan Sood, Shantanu Acharya · PDF
  34. Efficient Encoder-Only Context Compression via Marginal Contribution Scoring

    Thao Do, Dinh Phu Tran, An Vo, Seon Kwon Kim, Daeyoung Kim · PDF
  35. EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

    Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Richard Charles Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang · PDF
  36. Empirical Analysis of Layer Redundancy in Diffusion Language Models

    Yuto Karashima, Hiroaki Ito, Hikari Otsuka, Guanxi Lu, Tatsuya Kaneko, Masato Motomura, Daichi Fujiki · PDF
  37. EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

    Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo · PDF
  38. Fast Inference via Hierarchical Speculative Decoding

    Clara Mohri, Amir Globerson, Haim Kaplan, Yishay Mansour, Tal Schuster · PDF
  39. Fault Robustness of Custom Floating-Point and Integer Formats: Datatype Selection as a Reliability-Aware Compression Decision

    R S Haripriya, Jaynarayan T Tudu · PDF
  40. Fixed-Point Reasoning: Stable and Adaptive Deep Looped Models

    Sajad Movahedi, Shlomo Libo Feigin, Vera Milovanović, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto · PDF
  41. FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment

    Riccardo Zaccone, Stefanos Laskaridis, Marco Ciccone, Samuel Horváth · PDF
  42. Fully Nested Transformers

    Avi Trost, Alexander Yun, John Cooper, Gabriel Orlanski, Frederic Sala · PDF
  43. Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

    Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee · PDF
  44. GreenMoE: Exploiting Dynamic Load Imbalance for Energy-Efficient Long-Context MoE Training

    Laiyi Li, Zhenheng Tang, Peijie Dong, Qiang Wang · PDF
  45. Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

    Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu · PDF
  46. HYBRIDKV: Exploiting Head-Dominant Reconstruction for Efficient Query-Agnostic KV Cache Compression

    Changwoo Baek, Kyeongbo Kong · PDF
  47. HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

    Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li · PDF
  48. Implicit Off-Diagonal Curvature Modeling via Gradient Projection for Post-Training Quantization of Vision Transformers

    Jincheol Yang, Jaemin Choi, Nahyun Lim, Yun-Seong Jeong, Matti Alexander Zinke, Hyunwoo Yu, Bongjoon Hyun, Kyomin Sohn, Suk-Ju Kang · PDF
  49. Improving Cascade Routing for Structured Attribute Generation with Heterogeneous Confidence

    Fatemeh Mansoori, Andrea Scarinci, Aditya Aggarwal, Suleiman A. Khan, Ashwin Chandramouli · PDF
  50. IR3DE: A Linear Router for Large Language Models

    Eros Fanì, Oguzhan Ersoy · PDF
  51. Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

    Dylan Bouchard · PDF
  52. Jacobian-guided Noise Injection for Quantization Robustness in Large Language Models

    Deepanshu Pandey, Nahush Lele, Arnav Chavan, Sankalp Dayal, Deepak Gupta · PDF
  53. KVgrad: Query-Agnostic KV Cache Eviction via Gradient-based Global Importance Scoring

    Jihwan Kwak, Sunghwan Joo, Jung Yoon Hwang, Jaeseok Byun, Taesup Moon · PDF
  54. Latent Cache Flow: Model-to-Model Communication Without Text

    Maximillian Rossi, Prajwal Raghunath, Eugene Wu · PDF
  55. Layer Verification Accelerates Speculative Tree Decoding

    Jaeyoung Cha, Hanseul Cho, Chulhee Yun · PDF
  56. Layout and Fusion Trade-offs for Mixture-of-Experts Inference under Single-Node Tensor Parallelism

    June Yong Yang, Inhyuk Cho, Taehyeon Kim, Yu Jin Kim, Moontae Lee · PDF
  57. LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

    Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi · PDF
  58. Learning Adaptive LLM Decoding

    Huangyuan Su, Zhe Ye, Samuel Tenka, Aidan Z.H. Yang, Soonho Kong, Udaya Ghai · PDF
  59. Learning Adaptive Reasoning Budgets via Constraint-Rectified Training

    Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness Shroff · PDF
  60. Learning When to Attend: Conditional Memory Access for Long-Context LLMs

    Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto · PDF
  61. Leech Lattice Vector Quantization for Efficient LLM Compression

    Tycho F. A. van der Ouderaa, Mart Van Baalen, Paul N. Whatmough, Markus Nagel · PDF
  62. LExI: Layer-Adaptive Active Experts for Efficient MoE Inference

    Krishna Teja Chitty-Venkata, Murali Emani · PDF
  63. LLM Family Expansion via Distillation and Quantization

    Andrei Panferov, Davit Melikidze, Dan Alistarh · PDF
  64. LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

    Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu · PDF
  65. Low Dimensional Embeddings for Model Capability Understanding

    Shivam Patel, William Cocke, Gauri Joshi · PDF
  66. MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

    Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee · PDF
  67. MatMLA: Matryoshka Multi-Head Latent Attention

    Kevin Li, Berlin Chen, Caitlin Wang, Aakash Lahoti, Albert Gu, Tri Dao · PDF
  68. MineDraft: A Framework for Batch Parallel Speculative Decoding

    Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low · PDF
  69. Modality-Aware Block Rotation for Vision-Language-Action Model Quantization

    U-Yeong Kim, Suk-Ju Kang · PDF
  70. MoNe: Modular Neural Memory for Efficient Long Context Inference

    Wonguk Cho, Kyubyung Chae, Tribhuvanesh Orekondy, Sunghyun Park, Hyoungwoo Park, Jeongho Kim, Arash Behboodi, Kyuwoong Hwang, Sungrack Yun · PDF
  71. MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

    Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horváth · PDF
  72. Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

    Kevin Li, Asher Trockman, Ananda Theertha Suresh, Ziteng Sun · PDF
  73. Multi-Token Prediction via Self-Distillation

    John Kirchenbauer, Abhimanyu Hans, Brian R. Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein · PDF
  74. Neural Weight Compression for Language Models

    Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee · PDF
  75. NOSA: Native and Offloadable Sparse Attention

    Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu · PDF
  76. On State Reduction in Linear Attention

    Philipp Nazari, T. Konstantin Rusch · PDF
  77. On the Optimal Reasoning Length for RL-Trained Language Models

    Daisuke Nohara, Taishi Nakamura, Rio Yokota · PDF
  78. One Simple Trick for Improving the Performance of Energy-Limited Local Inference and Training

    Erik Schultheis, Maximilian Kleinegger, Dan Alistarh · PDF
  79. OriCache: Orientation-Guided Feature Caching for DiT Acceleration

    Joonsik Nam, Hyunwoo Yu, Suk-Ju Kang · PDF
  80. Prelude: Execution-Class Aware Serving for Decision-Style LLM Inference

    Minzhou Pan, Yuzhou Nie, Ruilin Zhou, Yuheng Tang, Jingyang Zhang, Dawn Song, Bo Li, Wenbo Guo · PDF
  81. PRESTO: Prefix-Aligned Tree Drafting for Diffusion Speculative Decoding

    Zheng Wang, Zhifan Ye, Yonggan Fu, Qi Cheng, Ziyan Wang, Feng Zhu, Haozhe Zhao, Humphrey Shi, Pavlo Molchanov, Minjia Zhang · PDF
  82. Pruning and Distilling Mixture-of-Experts into Dense Language Models

    Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho · PDF
  83. QJL is 1-bit Compressive Sensing: An Equivalence and Its Consequences for KV Cache Compression in LLMs

    Mohammad Babakmehr · PDF
  84. Re-evaluating Confidence Remasking in Masked Diffusion Language Models

    Stipe Frkovic, Metod Jazbec, Dan Zhang, Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick · PDF
  85. Recency/Frequency Adaptive KV Caching for Large Language Model Serving

    Yang Shen, Meghana Madhyastha, Robert Underwood, Bogdan Nicolae, Randal Burns · PDF
  86. Recovering Selectivity with LTI State Space Operators for Portable Long-Context Inference

    Minseon Gwak, N. Benjamin Erichson, PooGyeon Park · PDF
  87. Reducing Attention Distribution Error with Unified Tail Aggregation for Sparse Attention

    Hyunwoo Yu, Jongbeom Lee, Jaemin Choi, Jincheol Yang, Yubin Cho, Joonsik Nam, Seunghun Moon, Jung-Woo Chang, Bongjoon Hyun, Kyomin Sohn, Kyeongbo Kong, Suk-Ju Kang · PDF
  88. Referring Video Object Segmentation via Language-aligned Track Selection

    Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim · PDF
  89. Relaxed On-Policy Distillation: Selective Credit Allocation for Scaling Reasoning Efficiently

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron · PDF
  90. Resource-Adaptive Foundation Model Reasoning via Semantic Coverage

    Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar · PDF
  91. Resource-Adaptivity Beyond the Model: Sensor Control for Quantized On-Device Vision

    Hongjun Suh, Woojin Jang, Hyung-Sin Kim · PDF
  92. Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

    Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, YoungJin Heo, Suin Cho, Seong-hun Kim, Woosang Lim, Gaeul Kwon · PDF
  93. Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

    Duc Anh Nguyen, Huu Binh Ta, Duc-Nhuan Le, Tan Minh Nguyen, Toan Tran · PDF
  94. SFPruner: Single-Forward Visual Token Subset Selection for Resource-Efficient Multimodal Foundation Model Inference

    Jouwon Song, Woohyeong Kim, Seungjae Baek, Kyeongbo Kong · PDF
  95. ShadowSpec: Towards Zero Speculation Overhead for Substitute Speculative Decoding

    Kuan-Cheng Lin, Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Brian J Chan, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu · PDF
  96. Sigmoid Attention as a Better Substrate for Learned KV Cache Eviction

    Isaac Li · PDF
  97. SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

    Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu · PDF
  98. SparseSAM: Structured Sparsification of Activations in Segment Anything Models

    Hoai-Chau Tran, Chi H Nguyen, Duy Minh Ho Nguyen, Mathias Niepert, Fan Lai, Khoa D Doan · PDF
  99. Speedrunning GPT3: Training an (Almost-) GPT3-175B-Quality Model in Under 10K USD

    Georgios Vlassis, Erik Schultheis, Matin Ansaripour, Andrei Panferov, Dan Alistarh · PDF
  100. SpiralFovea: Input-Adaptive Foveated Tokenization as a Third Lever of Resource-Adaptive Inference

    Kyan Mahajan, Mohammad Saqlain · PDF
  101. SRA-MoE: Output-Aware Selective Router Alignment for MoE Quantization

    Geonho Lee, Hancheol Park, Seunghyun Lee, Jungwook Choi, Tae-Ho Kim · PDF
  102. Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping

    Hsun-Yu Kuo, El Mahdi Chayti, Patrik Reizinger, Wieland Brendel, Martin Jaggi · PDF
  103. Staircase Streaming for Low-Latency Multi-Agent Inference

    Junlin Wang, Jue WANG, Zhen Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou · PDF
  104. Step-Tagging Early-Stopping: Toward controlling the generation of Language Reasoning Models through black-box step monitoring

    Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John Kelleher · PDF
  105. StreamAttention: Energy-Efficient and High-Utilization Attention on Systolic Hardware

    Olav Førland, H. T. Kung · PDF
  106. StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

    Duy Minh Ho Nguyen, Tuan Anh Tran, Thuy-Duong Khanh Nguyen, Siwei Xie, Trung Quoc Nguyen, Mai Thanh Nhat Truong, Daniel Palenicek, An Thai Le, Michael Barz, Eric Hannus, TrungTin Nguyen, Tuan Quang Dam, Tran Nguyen Le, Ngan Le, Minh Nhat VU, Khoa D Doan, Vien Anh Ngo, Pengtao Xie, James Zou, Daniel Sonntag, Jan Peters, Mathias Niepert · PDF
  107. Structural Outlier-Aware Post-Training Quantization for Monocular Depth Estimation

    Yun-Seong Jeong, Jincheol Yang, Nahyun Lim, Jaemin Choi, Matti Alexander Zinke, Sungwook Choi, Sung-Sik Cho, Suk-Ju Kang · PDF
  108. Structure-Preserving Adaptive Post-Training Quantization for Monocular Depth Estimation

    Jaemin Choi, Jincheol Yang, Nahyun Lim, Yun-Seong Jeong, Matti Alexander Zinke, Hyunwoo Yu, Suk-Ju Kang · PDF
  109. SubspacePath Pruner: Inference-time Pruning via Probe-based Representation–Parameter Coupling

    Zhiren Gong, Yikun Hou, Fan Wu, CHE WANG, Fuyao Zhang, Tiantong Wu, Yurong Hao, Jiaming Zhang, Yiyang Duan, Tiantong Wang, Fei Huang, Chau Yuen, Wei Yang Bryan Lim · PDF
  110. TEAM: Temporal–Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

    Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li · PDF
  111. TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

    Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim · PDF
  112. Think Deep, Think Fast: Investigating Inference-Time Scaling And The Reasoning Floor

    Junlin Wang, Shang Zhu, Jon Saad-Falcon, Ben Athiwaratkun, Qingyang Wu, Jue WANG, Shuaiwen Leon Song, Ce Zhang, Bhuwan Dhingra, James Zou · PDF
  113. Training Continuous Chain of Thought Models: A Tale of Two Regimes

    Varun Yerram, He He, Eunsol Choi · PDF
  114. Understanding Layer Patching in Model Size Interpolation

    Sara Kangaslahti, Jonathan Geuter, Nihal V. Nayak, Marco Fumero, Francesco Locatello, David Alvarez-Melis · PDF
  115. VEDJE: Video-Efficient Discriminative Joint Encoder for Scalable Video-Text Retrieval

    Shahaf Wagner, Gabriele Serussi, Dan Ben Ami, Tomer Galanti, Chaim Baskin · PDF
  116. Vision Token Pruning via Query–Vision Interaction Decomposition

    Harshithanjani Athi, Sravan Kumar Ankireddy, Jianzhong Charlie Zhang, Hyeji Kim · PDF
  117. Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

    Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee · PDF
  118. What Matters for NVFP4 Training? A Scaling Study of Low-Precision Pre-Training Recipes

    Anjulie Agrusa, Andrei Panferov, Elizabeth Wei, Keith Wyss, Paul Gibbons, Erik Schultheis, Tijmen Blankevoort, Dan Alistarh · PDF
  119. When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

    Luoming Zhang, Yuwei Ren, Kuizhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang · PDF
  120. Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

    Mujtaba Farhan, Ashwinee Panda, Maheep Chaudhary, Sean Wu · PDF
  121. WildCat: Near-Linear Attention in Theory and Practice

    Tobias Schröder, Lester Mackey · PDF
  122. XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference

    Daniil Vankov, Nikita Ivkin, Jaime Campos Salas, Kyle R. Ulrich, Xiang Song, Ashish Khetan, George Karypis · PDF
  123. You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations

    Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz-Ziv, Avi Mendelson · PDF
  124. Zero-Shot Quantization for Vision-Language-Action Models via Trajectory Curvature and Attention Guidance

    Sung-hwan Han, Youngmin Yi · PDF