ICLR 2025 · Past · Large language models · Efficiency

Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

SLLM

Submission deadline
Feb 8, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (70)

Fetched from OpenReview (v2) on 2026-06-10.

  1. 2SSP: A Two-Stage Framework for Structured Pruning of LLMs

    Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca · PDF
  2. Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

    Daniel HAZIZA, Timothy Chou, Dhruv Choudhary, Jesse Cai, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut · PDF
  3. Antipodal Pairing and Mechanistic Signals in Dense SAE Latents

    Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan · PDF
  4. Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected

    Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci · PDF
  5. CAMEx: Curvature-aware Merging of Experts

    Dung Viet Nguyen, Minh Hoang Nguyen, Luc Nguyen, Rachel S.Y. Teo, Tan Minh Nguyen, Linh Duy Tran · PDF
  6. ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters

    Kamer Ali Yuksel, Hassan Sawaf · PDF
  7. ClusterGen: Token Generation in Sublinear Time and Memory with Clustering KV Cache

    Amir Zandieh, Insu Han, Amin Karbasi, Vahab Mirrokni · PDF
  8. Compressed sparse tiles for memory-efficient unstructured and semi-structured sparsity

    Mike Lasby, Max Zimmer, Sebastian Pokutta, Erik Schultheis · PDF
  9. Contextual Sparsity as a Tool for Mechanistic Understanding of Retrieval in Hybrid Foundation Models

    Davide Zani, Kurt Felix Michalak, Steven Abreu · PDF
  10. DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression

    Boyko Borisov, Xiaozhe Yao, Nezihe Merve Gürel, Ana Klimovic · PDF
  11. Differentiable Attention Sparsity via Structured $D$-Gating

    Chris Kolb, Bernd Bischl, David Rügamer · PDF
  12. Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

    Chengsong Huang, Langlin Huang, Jiaxin Huang · PDF
  13. Efficient Transformers via MPO-Based Low-Rank Factorization and Pruning

    Sam Mikhak, Venkata Sai Gummidi, Praneeth Medepalli, Kevin Zhu · PDF
  14. Evaluating LLM Memorization Using Soft Token Sparsity

    Zhili Feng, Yixuan Even Xu, Pratyush Maini, Alexander Robey, Avi Schwarzschild, J Zico Kolter · PDF
  15. EvoPress: Accurate Dynamic Model Compression via Evolutionary Search

    Oliver Sieberling, Denis Kuznedelev, Dan Alistarh · PDF
  16. Exploring the dual lottery ticket hypothesis in finetuning through specialised sparsification

    Sampreeth R S, Arindam Biswas, Pabitra Mitra, BISWAJIT BASU · PDF
  17. Faster, Cheaper, Just as Good: Cost- and Latency-Constrained Routing for LLMs

    Javid Lakha, Minlan Yu, Rana Shahout · PDF
  18. From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

    Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik · PDF
  19. High Frequency Latents Are Features, Not Bugs

    Xiaoqing Sun, Joshua Engels, Max Tegmark · PDF
  20. How Can Representation Dimension Dominate Structurally Pruned LLMs?

    Mingxue Xu, Lisa Alazraki, Danilo Mandic · PDF
  21. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

    Zhao Song, Jing Xiong, Chiwun Yang · PDF
  22. InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

    Tony Zhang, Rickard Brannvall · PDF
  23. Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

    Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Michał Krutul, Jan Małaśnicki, Maciej Stefaniak, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur · PDF
  24. KurTail: Kurtosis-Based LLM Quantization

    Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi · PDF
  25. LEWIS (LayEr WIse Sparsity) - A Training Free Guided Model Merging Approach

    Hetarth Chopra, Vidhi Rambhia, Vikram S. Adve · PDF
  26. Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

    Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos · PDF
  27. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference

    Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Lingjie Li, Changran Hu, Bo Li, Urmish Thakker · PDF
  28. LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

    CHEN Han, Zicong Jiang, Zining Zhang, Bingsheng He, Luo Pingyi, Mian Lu, Yuqiang Chen · PDF
  29. LoRA Without Forgetting: Freezing and Sparse Masking for Low-Rank Adaptation

    Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein · PDF
  30. LoRAM: Low-Rank Adaptation of Large Language Models on Manifold

    Xiaowen Jiang, Xun Wang, Sebastian U Stich · PDF
  31. Low-rank Adapting Models for Sparse Autoencoders

    Matthew Chen, Joshua Engels, Max Tegmark · PDF
  32. Low-Rank is Required for Pruning LLMs

    Stephen Zhang, Vardan Papyan · PDF
  33. Matryoshka Quantization

    Pranav Ajit Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati · PDF
  34. Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

    Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, LILI YU · PDF
  35. MobiLlama: Towards Accurate & Lightweight Fully Transparent GPT

    Omkar Chakradhar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Timothy Baldwin, Eric P. Xing, Fahad Shahbaz Khan · PDF
  36. MoE Lens - An Expert Is All You Need

    Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, Shivam Raval · PDF
  37. NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

    Lawrence Ray Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin Yang · PDF
  38. On multi-token prediction for efficient LLM inference

    Somesh Mehra, Javier Alonso Garcia, Lukas Mauch · PDF
  39. On the Spatial Structure of Mixture-of-Experts in Transformers

    Daniel Bershatsky, Ivan Oseledets · PDF
  40. One Must Imagine Experts Happy: Rebalancing Neural Routers via Constrained Optimization

    Kushal Thaman · PDF
  41. Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

    Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin El-Nouby, Joshua M. Susskind, Vimal Thilak · PDF
  42. Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models

    Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci · PDF
  43. Post-LoRA Restoration: Utilizing Transferability of Low-Rank Adapter in Quantized Foundation Models

    Yuto Kanda, Kenji Hatano · PDF
  44. Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference

    Iñaki Arango, Ayush Noori, Yepeng Huang, Rana Shahout, Minlan Yu · PDF
  45. Pruning as a Defense: Reducing Memorization in Large Language Models

    Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu · PDF
  46. Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression

    Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric Villemonte de la Clergerie, Benoît Sagot · PDF
  47. QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

    Amir Zandieh, Majid Daliri, Insu Han · PDF
  48. QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activation

    Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh · PDF
  49. ReALLM: a general framework for LLM compression and fine-tuning

    Lisa Bedin, Louis Leconte, Van Minh Nguyen, Eric Moulines · PDF
  50. Recovery-on-the-line: Linear trends in post-quantization performance recovery

    Shashata Sawmya, Shuvom Sadhuka, Ragulan Sivakumar, Nir N Shavit, Dan Alistarh, Bonnie Berger · PDF
  51. ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

    Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang · PDF
  52. RLMedusa: Reinforcement Learning for Multiple Decoding Heads to Accelerate LLM Inference

    Aadit Juneja, Parsa Idehpour · PDF
  53. Robustly identifying concepts introduced during chat fine-tuning using crosscoders

    Julian Minder, Clément Dumas, Bilal Chughtai, Neel Nanda · PDF
  54. S2-Attention: Hardware-Aware Context Sharding Among Attention Heads

    Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song · PDF
  55. Scalable Continual Learning: Adaptive MoEs for Expanding Task Sets

    Adrian Candocia, Omer Mustafa Inan, Raaghav Agarwal, Aamod Varma, Mark A. Davenport · PDF
  56. Scaling Laws and Efficient Inference for Ternary Language Models

    Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture-Harpin, Prashant Shishodia, Majid Behbahani, Irina Rish, Yuriy Nevmyvaka · PDF
  57. Scaling Sparse Feature Circuits For Studying In-Context Learning

    Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda · PDF
  58. SpargeAttn: Training-Free Sparse Attention Accelerating Any Model Inference

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia wei, Haocheng Xi, Jun Zhu, Jianfei Chen · PDF
  59. Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front

    Alessandro Pierro, Steven Abreu, Jonathan Timcheck, Philipp Stratmann, Sumit Bam Shrestha · PDF
  60. Sparse Gradient Compression for Fine-Tuning Large Language Models

    David H. Yang, Mohammad Mohammadi Amiri, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen · PDF
  61. Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks

    Jialin Zhao, Yingtao Zhang, Xinghang Li, Huaping Liu, Carlo Vittorio Cannistraci · PDF
  62. SPEX: Scaling Feature Interaction Explanations for LLMs

    Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Bin Yu, Kannan Ramchandran · PDF
  63. Steering Fine-Tuning Generalization with Targeted Concept Ablation

    Helena Casademunt, Caden Juang, Senthooran Rajamanoharan, Neel Nanda · PDF
  64. Symmetric Pruning for Large Language Models

    Kai Yi, Peter Richtárik · PDF
  65. TASP: Preserving Training Dynamics in Transformers via NTK-Aware Structured Pruning

    Mengting Ai, Tianxin Wei, Jingrui He · PDF
  66. The Surprising Effectiveness of Randomness in LLM Pruning

    Shuyao Xu, Liu Jiayao, Zhenfeng He, Cheng Peng, Weidi Xu · PDF
  67. Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs

    Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan J Webb, Xin Wang · PDF
  68. Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification

    Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja · PDF
  69. Wanda++: Pruning Large Language Models via Regional Gradients

    Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar · PDF
  70. Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

    Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca · PDF