ICLR 2025 · Past · Large language models · Efficiency

Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

SLLM

Submission deadline: Feb 8, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (70)

Fetched from OpenReview (v2) on 2026-06-10.

2SSP: A Two-Stage Framework for Structured Pruning of LLMs
Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca · PDF
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Daniel HAZIZA, Timothy Chou, Dhruv Choudhary, Jesse Cai, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut · PDF
Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan · PDF
Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected
Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci · PDF
CAMEx: Curvature-aware Merging of Experts
Dung Viet Nguyen, Minh Hoang Nguyen, Luc Nguyen, Rachel S.Y. Teo, Tan Minh Nguyen, Linh Duy Tran · PDF
ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters
Kamer Ali Yuksel, Hassan Sawaf · PDF
ClusterGen: Token Generation in Sublinear Time and Memory with Clustering KV Cache
Amir Zandieh, Insu Han, Amin Karbasi, Vahab Mirrokni · PDF
Compressed sparse tiles for memory-efficient unstructured and semi-structured sparsity
Mike Lasby, Max Zimmer, Sebastian Pokutta, Erik Schultheis · PDF
Contextual Sparsity as a Tool for Mechanistic Understanding of Retrieval in Hybrid Foundation Models
Davide Zani, Kurt Felix Michalak, Steven Abreu · PDF
DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression
Boyko Borisov, Xiaozhe Yao, Nezihe Merve Gürel, Ana Klimovic · PDF
Differentiable Attention Sparsity via Structured $D$-Gating
Chris Kolb, Bernd Bischl, David Rügamer · PDF
Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning
Chengsong Huang, Langlin Huang, Jiaxin Huang · PDF
Efficient Transformers via MPO-Based Low-Rank Factorization and Pruning
Sam Mikhak, Venkata Sai Gummidi, Praneeth Medepalli, Kevin Zhu · PDF
Evaluating LLM Memorization Using Soft Token Sparsity
Zhili Feng, Yixuan Even Xu, Pratyush Maini, Alexander Robey, Avi Schwarzschild, J Zico Kolter · PDF
EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
Oliver Sieberling, Denis Kuznedelev, Dan Alistarh · PDF
Exploring the dual lottery ticket hypothesis in finetuning through specialised sparsification
Sampreeth R S, Arindam Biswas, Pabitra Mitra, BISWAJIT BASU · PDF
Faster, Cheaper, Just as Good: Cost- and Latency-Constrained Routing for LLMs
Javid Lakha, Minlan Yu, Rana Shahout · PDF
From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik · PDF
High Frequency Latents Are Features, Not Bugs
Xiaoqing Sun, Joshua Engels, Max Tegmark · PDF
How Can Representation Dimension Dominate Structurally Pruned LLMs?
Mingxue Xu, Lisa Alazraki, Danilo Mandic · PDF
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
Zhao Song, Jing Xiong, Chiwun Yang · PDF
InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer
Tony Zhang, Rickard Brannvall · PDF
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Michał Krutul, Jan Małaśnicki, Maciej Stefaniak, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur · PDF
KURTAIL: KURTOSIS-BASED LLM QUANTIZATION
Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi · PDF
LEWIS (LayEr WIse Sparsity) - A Training Free Guided Model Merging Approach
Hetarth Chopra, Vidhi Rambhia, Vikram S. Adve · PDF
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos · PDF
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Lingjie Li, Changran Hu, Bo Li, Urmish Thakker · PDF
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
CHEN Han, Zicong Jiang, Zining Zhang, Bingsheng He, Luo Pingyi, Mian Lu, Yuqiang Chen · PDF
LoRA Without Forgetting: Freezing and Sparse Masking for Low-Rank Adaptation
Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein · PDF
LoRAM: Low-Rank Adaptation of Large Language Models on Manifold
Xiaowen Jiang, Xun Wang, Sebastian U Stich · PDF
Low-rank Adapting Models for Sparse Autoencoders
Matthew Chen, Joshua Engels, Max Tegmark · PDF
Low-Rank is Required for Pruning LLMs
Stephen Zhang, Vardan Papyan · PDF
Matryoshka Quantization
Pranav Ajit Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati · PDF
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, LILI YU · PDF
MobiLlama: Towards Accurate & Lightweight Fully Transparent GPT
Omkar Chakradhar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Timothy Baldwin, Eric P. Xing, Fahad Shahbaz Khan · PDF
MoE Lens - An Expert Is All You Need
Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, Shivam Raval · PDF
NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
Lawrence Ray Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin Yang · PDF
On multi-token prediction for efficient LLM inference
Somesh Mehra, Javier Alonso Garcia, Lukas Mauch · PDF
On the Spatial Structure of Mixture-of-Experts in Transformers
Daniel Bershatsky, Ivan Oseledets · PDF
One Must Imagine Experts Happy: Rebalancing Neural Routers via Constrained Optimization
Kushal Thaman · PDF
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin El-Nouby, Joshua M. Susskind, Vimal Thilak · PDF
Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci · PDF
Post-LoRA Restoration: Utilizing Transferability of Low-Rank Adapter in Quantized Foundation Models
Yuto Kanda, Kenji Hatano · PDF
Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference
Iñaki Arango, Ayush Noori, Yepeng Huang, Rana Shahout, Minlan Yu · PDF
PRUNING AS A DEFENSE: REDUCING MEMORIZATION IN LARGE LANGUAGE MODELS
Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu · PDF
Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression
Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric Villemonte de la Clergerie, Benoît Sagot · PDF
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
Amir Zandieh, Majid Daliri, Insu Han · PDF
QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activation
Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh · PDF
ReALLM: a general framework for LLM compression and fine-tuning
Lisa Bedin, Louis Leconte, Van Minh Nguyen, Eric Moulines · PDF
Recovery-on-the-line: Linear trends in post-quantization performance recovery
Shashata Sawmya, Shuvom Sadhuka, Ragulan Sivakumar, Nir N Shavit, Dan Alistarh, Bonnie Berger · PDF
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang · PDF
RLMedusa: Reinforcement Learning for Multiple Decoding Heads to Accelerate LLM Inference
Aadit Juneja, Parsa Idehpour · PDF
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Julian Minder, Clément Dumas, Bilal Chughtai, Neel Nanda · PDF
S2-ATTENTION: HARDWARE-AWARE CONTEXT SHARDING AMONG ATTENTION HEADS
Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song · PDF
Scalable Continual Learning: Adaptive MoEs for Expanding Task Sets
Adrian Candocia, Omer Mustafa Inan, Raaghav Agarwal, Aamod Varma, Mark A. Davenport · PDF
Scaling Laws and Efficient Inference for Ternary Language Models
Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture-Harpin, Prashant Shishodia, Majid Behbahani, Irina Rish, Yuriy Nevmyvaka · PDF
Scaling Sparse Feature Circuits For Studying In-Context Learning
Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda · PDF
SpargeAttn: Training-Free Sparse Attention Accelerating Any Model Inference
Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia wei, Haocheng Xi, Jun Zhu, Jianfei Chen · PDF
Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front
Alessandro Pierro, Steven Abreu, Jonathan Timcheck, Philipp Stratmann, Sumit Bam Shrestha · PDF
Sparse Gradient Compression for Fine-Tuning Large Language Models
David H. Yang, Mohammad Mohammadi Amiri, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen · PDF
Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
Jialin Zhao, Yingtao Zhang, Xinghang Li, Huaping Liu, Carlo Vittorio Cannistraci · PDF
SPEX: Scaling Feature Interaction Explanations for LLMs
Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Bin Yu, Kannan Ramchandran · PDF
Steering Fine-Tuning Generalization with Targeted Concept Ablation
Helena Casademunt, Caden Juang, Senthooran Rajamanoharan, Neel Nanda · PDF
Symmetric Pruning for Large Language Models
Kai Yi, Peter Richtárik · PDF
TASP: Preserving Training Dynamics in Transformers via NTK-Aware Structured Pruning
Mengting Ai, Tianxin Wei, Jingrui He · PDF
The Surprising Effectiveness of Randomness in LLM Pruning
Shuyao Xu, Liu Jiayao, Zhenfeng He, Cheng Peng, Weidi Xu · PDF
Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs
Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan J Webb, Xin Wang · PDF
Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja · PDF
Wanda++: Pruning Large Language Models via Regional Gradients
Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar · PDF
Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca · PDF
