ICML 2026 Past Large language models
AdaptFM: Resource-Adaptive Foundation Model Inference
- Submission deadline: May 8, 2026, 23:59 AoE (UTC−12), imported from OpenReview; check the website for extensions
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (124)
Fetched from OpenReview (v2) on 2026-06-10.
- A Recipe for an Elastic Mixture: One Mixture-of-Experts for Every Resource Budget
- A Tale of Two Temperatures: Simple, Efficient, and Diverse Sampling from Diffusion Language Models
- A3: an Analytical Low-Rank Approximation Framework for Attention
- AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
- Accelerating LLM Inference via Vector Index Based Output Embeddings
- Activation Quantization of Vision Encoders Needs Prefixing Registers
- Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
- Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification
- Adaptive Safety Probing for Resource-Efficient Vision-Language-Action Models
- AgentKV: Phase-Aware KV Eviction for Agentic LLMs
- AgentRouter: Heterogeneous Model Routing for Cost-Optimal Multi-Step Agentic Workflows
- Alignment Collapse Under KV Cache Quantization: A 35-Minute Audit for Quantized LLM Deployments
- BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
- Beyond Imitation: A Resource Adaptive Embedder that Outperforms its 14× Larger Teacher on Financial Retrieval
- Block-Based Double Decoders
- Block-Level Recursion: Adaptive Test-Time Routing in Large Language Models
- Cache You Later: Post-Compression KV Repair for Long-Context Agentic LLM Inference
- CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding
- CARES: Context-Aware Resolution Selector for VLMs
- Characterizing self-speculative decoding approaches for accelerating LLMs
- CLAWS: Calibration-Aware Activation Sparsity for Instruction-Tuned LLMs
- COAT: COrrelation-Aware Orthogonal Transform for LLM Quantization
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
- Convergence-Gated Distillation for Resource-Adaptive Reinforcement Learning Agents
- CoupledNorm: Efficient Normalization via Shared RMS Statistics
- Cross-Tokenizer LLM Distillation through a Byte-Level Interface
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
- Decoupling Spatial and Semantic Token Compression for Vision-Language Model Acceleration
- DIPA: Difficulty-Informed Probabilistic Allocation of Test-Time Compute via Training-Free Proxies
- Distill, Suppress, and Fuse: Cross-Modal Knowledge Integration for Optical Flow-Free Temporal Action Segmentation
- DREAM-MoE: Downstream Routing Error-Aware Margin-Preserving Quantization for Mixture-of-Experts Large Language Models
- DropKV: Decoupling Residual-Output Perturbation for Near-Optimal KV-Cache Eviction
- Dropping the Anchor: Statistical Context Summarization for Distributed Systems via Pulsar Attention
- Efficient Encoder-Only Context Compression via Marginal Contribution Scoring
- EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
- Empirical Analysis of Layer Redundancy in Diffusion Language Models
- EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
- Fast Inference via Hierarchical Speculative Decoding
- Fault Robustness of Custom Floating-Point and Integer Formats: Datatype Selection as a Reliability-Aware Compression Decision
- Fixed-Point Reasoning: Stable and Adaptive Deep Looped Models
- FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment
- Fully Nested Transformers
- Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
- GreenMoE: Exploiting Dynamic Load Imbalance for Energy-Efficient Long-Context MoE Training
- Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
- HYBRIDKV: Exploiting Head-Dominant Reconstruction for Efficient Query-Agnostic KV Cache Compression
- HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
- Implicit Off-Diagonal Curvature Modeling via Gradient Projection for Post-Training Quantization of Vision Transformers
- Improving Cascade Routing for Structured Attribute Generation with Heterogeneous Confidence
- IR3DE: A Linear Router for Large Language Models
- Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
- Jacobian-guided Noise Injection for Quantization Robustness in Large Language Models
- KVgrad: Query-Agnostic KV Cache Eviction via Gradient-based Global Importance Scoring
- Latent Cache Flow: Model-to-Model Communication Without Text
- Layer Verification Accelerates Speculative Tree Decoding
- Layout and Fusion Trade-offs for Mixture-of-Experts Inference under Single-Node Tensor Parallelism
- LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
- Learning Adaptive LLM Decoding
- Learning Adaptive Reasoning Budgets via Constraint-Rectified Training
- Learning When to Attend: Conditional Memory Access for Long-Context LLMs
- Leech Lattice Vector Quantization for Efficient LLM Compression
- LExI: Layer-Adaptive Active Experts for Efficient MoE Inference
- LLM Family Expansion via Distillation and Quantization
- LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning
- Low Dimensional Embeddings for Model Capability Understanding
- MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM
- MatMLA: Matryoshka Multi-Head Latent Attention
- MineDraft: A Framework for Batch Parallel Speculative Decoding
- Modality-Aware Block Rotation for Vision-Language-Action Model Quantization
- MoNe: Modular Neural Memory for Efficient Long Context Inference
- MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
- Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
- Multi-Token Prediction via Self-Distillation
- Neural Weight Compression for Language Models
- NOSA: Native and Offloadable Sparse Attention
- On State Reduction in Linear Attention
- On the Optimal Reasoning Length for RL-Trained Language Models
- One Simple Trick for Improving the Performance of Energy-Limited Local Inference and Training
- OriCache: Orientation-Guided Feature Caching for DiT Acceleration
- Prelude: Execution-Class Aware Serving for Decision-Style LLM Inference
- PRESTO: Prefix-Aligned Tree Drafting for Diffusion Speculative Decoding
- Pruning and Distilling Mixture-of-Experts into Dense Language Models
- QJL is 1-bit Compressive Sensing: An Equivalence and Its Consequences for KV Cache Compression in LLMs
- Re-evaluating Confidence Remasking in Masked Diffusion Language Models
- Recency/Frequency Adaptive KV Caching for Large Language Model Serving
- Recovering Selectivity with LTI State Space Operators for Portable Long-Context Inference
- Reducing Attention Distribution Error with Unified Tail Aggregation for Sparse Attention
- Referring Video Object Segmentation via Language-aligned Track Selection
- Relaxed On-Policy Distillation: Selective Credit Allocation for Scaling Reasoning Efficiently
- Resource-Adaptive Foundation Model Reasoning via Semantic Coverage
- Resource-Adaptivity Beyond the Model: Sensor Control for Quantized On-Device Vision
- Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
- Selective Sinkhorn Routing for Improved Sparse Mixture of Experts
- SFPruner: Single-Forward Visual Token Subset Selection for Resource-Efficient Multimodal Foundation Model Inference
- ShadowSpec: Towards Zero Speculation Overhead for Substitute Speculative Decoding
- Sigmoid Attention as a Better Substrate for Learned KV Cache Eviction
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
- SparseSAM: Structured Sparsification of Activations in Segment Anything Models
- Speedrunning GPT3: Training an (Almost-) GPT3-175B-Quality Model in Under 10K USD
- SpiralFovea: Input-Adaptive Foveated Tokenization as a Third Lever of Resource-Adaptive Inference
- SRA-MoE: Output-Aware Selective Router Alignment for MoE Quantization
- Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping
- Staircase Streaming for Low-Latency Multi-Agent Inference
- Step-Tagging Early-Stopping: Toward controlling the generation of Language Reasoning Models through black-box step monitoring
- StreamAttention: Energy-Efficient and High-Utilization Attention on Systolic Hardware
- StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
- Structural Outlier-Aware Post-Training Quantization for Monocular Depth Estimation
- Structure-Preserving Adaptive Post-Training Quantization for Monocular Depth Estimation
- SubspacePath Pruner: Inference-time Pruning via Probe-based Representation–Parameter Coupling
- TEAM: Temporal–Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
- TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
- Think Deep, Think Fast: Investigating Inference-Time Scaling And The Reasoning Floor
- Training Continuous Chain of Thought Models: A Tale of Two Regimes
- Understanding Layer Patching in Model Size Interpolation
- VEDJE: Video-Efficient Discriminative Joint Encoder for Scalable Video-Text Retrieval
- Vision Token Pruning via Query–Vision Interaction Decomposition
- Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity
- What Matters for NVFP4 Training? A Scaling Study of Low-Precision Pre-Training Recipes
- When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
- Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
- WildCat: Near-Linear Attention in Theory and Practice
- XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference
- You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations
- Zero-Shot Quantization for Vision-Language-Action Models via Trajectory Curvature and Attention Guidance