ICML 2025 · Past · Efficiency · ML systems · Large language models
ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
ES-FoMo
Unverified seed entry. Some fields are estimates — confirm everything on the official website before planning a submission.
- Submission deadline: May 26, 2025, 23:59 AoE (UTC−12). SEED estimate of the historical deadline; verify.
- Workshop day: Jul 19, 2025
- Submission portal: OpenReview
- Notes: SEED DATA. Name/website from the OpenReview venue record; workshop date estimated; verify.
Accepted papers (146)
Fetched from OpenReview (v2) on 2026-06-10.
- $\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
- A Minimalist Optimizer Design for LLM Pretraining
- A Survey on Prompt Tuning
- ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling
- Accelerating Linear Attention Design by Unifying Forward & Backward Propagation
- Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts
- Adaptive Backbone Selection for Efficient and Real-Time Vision Inference
- Adaptive Self-improvement LLM Agentic System for ML Library Development
- An Efficient Row-Based Sparse Fine-Tuning with Low Quantization Error
- Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
- AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Autoregressive Language Modeling by Compressed Sequence Mixing
- AWP: Activation-aware Weight Pruning and Quantization with Projected Gradient Descent
- Balancing LoRA Performance and Efficiency with Simple Shard Sharing
- Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
- Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis
- Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
- BlockBPE: Parallel BPE Tokenization
- BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning
- Byzantine-Resilient Zero-Order Optimization for Scalable Federated Fine-Tuning of Large Language Models
- Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference
- CarbonGearRL: Precision-Elastic, Carbon-Aware Scheduling for Foundation-Model Training
- Cartridges: Lightweight and general-purpose long context representations via self-study
- Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
- CoDM: A Co-design Framework for Efficient Sparse Diffusion Models
- Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers
- Compressing Large Language Models to Any Size Without Re-Computation
- ConMeZO: Adaptive Directional Sampling for Gradient-Free Finetuning of Language Models
- Context-lite Multi-turn Reinforcement Learning for LLM Agents
- Continuous Autoregressive Generation with Mixture of Gaussians
- Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching
- d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
- DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
- Demystifying Language Model Forgetting with Low-rank Example Associations
- DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness
- Early Attentive Sparsification Accelerates Neural Speech Transcription
- Efficient and Accurate KV-cache Management for Long-Sequence LLMs
- Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
- Efficient Temporal Tokenization for Mobility Prediction with Large Language Models
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
- Exchangeability in Neural Network Architectures and its Application to Dynamic Pruning
- Exploring Diffusion Transformer Designs via Grafting
- Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
- Flexi-LoRA: Efficient LoRA Finetuning with Input-Adaptive Dynamic Ranks
- Foreign Sparse Attention: Effective Distillation into Sparse Attention
- FPTQuant: Function-Preserving Transforms for LLM Quantization
- FrugalRAG: Learning to retrieve and reason for multi-hop QA
- GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
- GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
- Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
- Guided Speculative Inference for Efficient Test-Time Alignment of LLMs
- HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations
- Hardware-Efficient Attention for Fast Decoding
- How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?
- How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
- InterLoRA: An Adaptive LoRA Structure Based on The Mechanistic Interpretability of Transformer
- Is Visual Prompting the Right Setup for Knowledge Transfer in new Foundation Models?
- Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers
- JSONSchemaBench: Evaluating Constrained Decoding with LLMs on Efficiency, Coverage and Quality
- Kevin: Multi-Turn RL for Generating CUDA Kernels
- KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
- Language System: A Lightweight Ranking Framework for Language Models
- Large Reasoning Models Know How to Think Efficiently
- LATTICE: Learning to Efficiently Compress the Memory
- Learning Adaptive Parallel Reasoning with Language Models
- Learning to Discover Abstractions for LLM Reasoning
- Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
- LOGAH: Initialize Large Transformers via Small Graph HyperNetworks
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
- LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs
- LoRA Merging with SVD: Understanding Interference and Preserving Performance
- Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
- Mamba Drafters for Speculative Decoding
- MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models
- MatMuls are Enough for Efficient and Performant Linear-Time Attention
- Mitigating Over-Smoothing in Mamba2 via Spectral Domain Analysis
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking
- Model Parallelism With Subnetwork Data Parallelism
- MTraining: Efficient Distributed Training for Ultra-Long Contexts via Dynamic Sparse Attention
- Mu-Parametrization for Mixture of Experts
- MuLoCo: Muon is a practical inner optimizer for DiLoCo
- Multi-stream Sequence Learning
- Multi-student Diffusion Distillation for Better One-step Generators
- Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective
- One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
- Optimal Formats for Weight Quantisation
- Outlier-Free Genomic Foundation Models for Resource-Efficient Training and Low-Bit Inference
- Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
- Partition Generative Modeling: Masked Modeling Without Masks
- PiKE: Adaptive Data Mixing for Large-Scale Multi-Task Learning Under Low Gradient Conflicts
- PiKV: KV Cache Management System for MoE Architecture
- pLSTM: parallelizable Linear Source Transition Mark networks
- PoLAR: Polar-Decomposed Low-Rank Adapter Representation
- PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
- Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models
- Privacy Isn’t Free: Benchmarking the Systems Cost of Privacy-Preserving ML
- Private Zeroth-Order Optimization with Public Data
- Proof-of-Concept for Private Local-to-Cloud LLM Chat via Trusted Execution Environments
- PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
- Q-Adam-mini: Memory-Efficient 8-bit Quantized Optimizer for Large Language Model Training
- QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models
- Quartet: Native FP4 Training Can Be Optimal for Large Language Models
- Radio: Rate–Distortion Optimization for Large Language Model Compression
- Resource-efficient Inference with Foundation Model Programs
- Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
- SageAttention2++: A More Efficient Implementation of SageAttention2
- SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression
- Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights
- Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- SD$^2$: Self-Distilled Sparse Drafters
- Shrinking the Generation-Verification Gap with Weak Verifiers
- SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling
- SpecCoT: Accelerating Chain-of-Thought Reasoning through Speculative Exploration
- SPECS: Faster Test-Time Scaling through Speculative Drafts
- Speeding up Speculative Decoding via Sequential Approximate Verification
- Steering LLM Reasoning Through Bias-Only Adaptation
- Tail-Optimized Caching for LLM Inference
- Tensor Product Attention Is All You Need
- The Road Not Taken: Hindsight Exploration for LLMs in Multi-Turn RL
- Thinformer: Guaranteed Attention Approximation via Low-Rank Thinning
- Think Clearly: Improving Reasoning via Redundant Token Pruning
- ThinkingViT: Nested Thinking Vision Transformer for Elastic Inference
- Tiny Reward Models
- TinyServe: Query-Aware Cache Selection for Efficient LLM Inference
- TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper
- TORCHSIM: High Fidelity Runtime and Memory Estimation for Distributed Training
- Toward Dataset Distillation for Regression Problems
- Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models
- Towards Large Scale Training on Apple Silicon
- Towards Understanding Orthogonalization in Muon
- Towards Understanding Self-Pretraining for Sequence Classification
- Training Language Models to Reason Efficiently
- Training-free LLM Verification via Recycling Few-shot Examples
- Training-Free Semantic Deferrals for Open-Ended LLM Cascades
- Ultra-Efficient and Effective Large Language Models with Multi-Boolean Architectures
- Unbounded Memory and Consistent Imagination via Unified Diffusion–SSM World Models
- Unified Scaling Laws for Compressed Representations
- Vision Language Model Distillation Using Partial Information Decomposition
- VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
- VScan: A Two-Stage Visual Token Reduction Framework for Accelerating Large Vision-Language Models
- WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
- Zero-Shot Conversion to Monarch-Structured Attention
- zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression