ICML 2024 · Past · Large language models, Efficiency, ML systems
Workshop on Efficient Systems for Foundation Models II @ ICML2024
ES-FoMo-II 2024
- Submission deadline: Jun 4, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (80)
Fetched from OpenReview (v2) on 2026-06-10.
- AdaInf: Adaptive Inference for Resource-Constrained Foundation Models
- Adam-mini: Use Fewer Learning Rates To Gain More
- AdaNF: Quantization Group Adaptive NormalFloat for Low Bit Fine-tuning of LLMs
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
- Block Verification Accelerates Speculative Decoding
- Can Transformers Solve Least Squares to High Precision?
- Characterizing Prompt Compression Methods for Long Context Inference
- CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules
- CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model
- Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
- DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation
- Does your data spark joy? Performance gains from domain upsampling at the end of training
- Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference
- Efficient multi-prompt evaluation of LLMs
- Efficient Training of Language Models with Compact and Consistent Next Token Distributions
- Enhancing Stability for Large Models Training in Constrained Bandwidth Networks
- Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
- Exploring and Improving Drafts in Blockwise Parallel Decoding
- Exploring Monotonicity in Early-Exiting Language Models
- ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement
- Exponential Quantum Communication Advantage in Distributed Inference and Learning
- Fast Adaptation and Robust Quantization of Multi-Modal Foundation Models from Associative Memory: A Case Study in SpeechLM
- Fast and Memory-Efficient Multi-Sequence Generation via Structured Masking
- Fast yet Safe: Early-Exiting with Risk Control
- Fewer Truncations Improve Language Modeling
- GPTVQ: The Blessing of Dimensionality for LLM Quantization
- GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
- Hardware-Efficient Quantization for Green Custom Foundation Models
- HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- Implicit Optimization Bias of Next-token Prediction in Linear Models
- In Defense of Structural Sparse Adapters for Concurrent LLM Serving
- Janus: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences
- Just read twice: closing the recall gap for recurrent language models
- LAuReL: Learned Augmented Residual Layer
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
- Learned Best-Effort LLM Serving
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
- Low Rank Quantization-Aware Training for LLMs
- Low-rank Linearization of Large Language Models
- Mamba-PTQ: Outlier Channels in Recurrent Large Language Models
- MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
- Mobile and Edge Evaluation of Large Language Models
- MoRe Fine-Tuning with 10x Fewer Parameters
- NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
- Optimised Grouped-Query Attention Mechanism for Transformers
- Optimistic Verifiable Training by Controlling Hardware Nondeterminism
- OutEffHop: A Principled Outlier-Efficient Attention Layer from Dense Associative Memory Models
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs
- Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
- PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
- Pretrained Hybrids with MAD Skills
- Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones
- Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
- Quantum-PEFT: Ultra parameter-efficient fine-tuning
- Revealing the Utilized Rank of Subspaces of Learning in Neural Networks
- Revisiting Cascaded Ensembles for Efficient Inference
- Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
- Scavenging Hyena: Distilling Transformers into Long Convolution Models
- Seeded LoRA: Collaborative Fine-Tuning Through Seed Initialization of Adapters
- Simple linear attention language models balance the recall-throughput tradeoff
- SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
- SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors
- Task Addition and Weight Disentanglement in Closed-Vocabulary Models
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
- TinyAgent: Quantization-aware Model Compression and Adaptation for On-device LLM Agent Deployment
- Towards Efficient Large-Scale Language-3D Representation Learning
- Towards smaller language models via layer looping
- Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing
- Training-Free Acceleration of ViTs with Delayed Spatial Merging
- Understanding and Minimising Outlier Features in Neural Network Training
- Unlocking the Global Synergies in Low-Rank Adapters
- Why Transformers Need Adam: A Hessian Perspective
- xLSTM: Extended Long Short-Term Memory
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity