ICLR 2025 | Past | Large language models | Efficiency
Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
SLLM
- Submission deadline: Feb 8, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (70)
Fetched from OpenReview (v2) on 2026-06-10.
- 2SSP: A Two-Stage Framework for Structured Pruning of LLMs
- Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
- Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
- Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected
- CAMEx: Curvature-aware Merging of Experts
- ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters
- ClusterGen: Token Generation in Sublinear Time and Memory with Clustering KV Cache
- Compressed sparse tiles for memory-efficient unstructured and semi-structured sparsity
- Contextual Sparsity as a Tool for Mechanistic Understanding of Retrieval in Hybrid Foundation Models
- DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression
- Differentiable Attention Sparsity via Structured $D$-Gating
- Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning
- Efficient Transformers via MPO-Based Low-Rank Factorization and Pruning
- Evaluating LLM Memorization Using Soft Token Sparsity
- EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
- Exploring the dual lottery ticket hypothesis in finetuning through specialised sparsification
- Faster, Cheaper, Just as Good: Cost- and Latency-Constrained Routing for LLMs
- From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
- High Frequency Latents Are Features, Not Bugs
- How Can Representation Dimension Dominate Structurally Pruned LLMs?
- How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
- InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer
- Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
- KURTAIL: KURTOSIS-BASED LLM QUANTIZATION
- LEWIS (LayEr WIse Sparsity) - A Training Free Guided Model Merging Approach
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
- LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
- LoRA Without Forgetting: Freezing and Sparse Masking for Low-Rank Adaptation
- LoRAM: Low-Rank Adaptation of Large Language Models on Manifold
- Low-rank Adapting Models for Sparse Autoencoders
- Low-Rank is Required for Pruning LLMs
- Matryoshka Quantization
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
- MobiLlama: Towards Accurate & Lightweight Fully Transparent GPT
- MoE Lens - An Expert Is All You Need
- NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
- On multi-token prediction for efficient LLM inference
- On the Spatial Structure of Mixture-of-Experts in Transformers
- One Must Imagine Experts Happy: Rebalancing Neural Routers via Constrained Optimization
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
- Post-LoRA Restoration: Utilizing Transferability of Low-Rank Adapter in Quantized Foundation Models
- Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference
- PRUNING AS A DEFENSE: REDUCING MEMORIZATION IN LARGE LANGUAGE MODELS
- Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression
- QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
- QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activation
- ReALLM: a general framework for LLM compression and fine-tuning
- Recovery-on-the-line: Linear trends in post-quantization performance recovery
- ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
- RLMedusa: Reinforcement Learning for Multiple Decoding Heads to Accelerate LLM Inference
- Robustly identifying concepts introduced during chat fine-tuning using crosscoders
- S2-ATTENTION: HARDWARE-AWARE CONTEXT SHARDING AMONG ATTENTION HEADS
- Scalable Continual Learning: Adaptive MoEs for Expanding Task Sets
- Scaling Laws and Efficient Inference for Ternary Language Models
- Scaling Sparse Feature Circuits For Studying In-Context Learning
- SpargeAttn: Training-Free Sparse Attention Accelerating Any Model Inference
- Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front
- Sparse Gradient Compression for Fine-Tuning Large Language Models
- Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
- SPEX: Scaling Feature Interaction Explanations for LLMs
- Steering Fine-Tuning Generalization with Targeted Concept Ablation
- Symmetric Pruning for Large Language Models
- TASP: Preserving Training Dynamics in Transformers via NTK-Aware Structured Pruning
- The Surprising Effectiveness of Randomness in LLM Pruning
- Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs
- Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
- Wanda++: Pruning Large Language Models via Regional Gradients
- Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training