ICML 2025 Workshop on Methods and Opportunities at Small Scale
MOSS@ICML2025
- Submission deadline: May 27, 2025, 15:50 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (61)
Fetched from OpenReview (v2) on 2026-06-10.
- AdaptMI: Adaptive Skill-based In-context Math Instructions for Small Language Models
- An Empirical Investigation of Initialization Strategies for Kolmogorov–Arnold Networks
- Approximate Message Passing on General Factor Graphs using Shallow Neural Networks
- CaliPSo: Calibrated Predictive Models with Sharpness as Loss Function
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning
- Cross-Validation Error Dynamics in Smaller Datasets
- Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
- Decomposed Learning: An Avenue for Mitigating Grokking
- Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
- Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
- Dynamic Low-Rank Training with Spectral Regularization: Achieving Robustness in Compressed Representations
- Effective Reinforcement Learning for Reasoning in Language Models
- Efficient B-Tree Insertions Using Proximal Policy Optimization and Hierarchical Attention Models
- Emergence of Hebbian Dynamics in Regularized Non-Local Learners
- Emergence, pretraining loss and associative recall: a toy model
- Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness
- Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
- Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
- Exploring Diverse Solutions for Underdetermined Problems
- Extrapolation by Association: Length Generalization Transfer in Transformers
- Foundation Models on a Budget: Approximating Blocks in Large Vision Models
- From SGD to Spectra: A Theory of Neural Network Weight Dynamics
- Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
- Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
- Geometry of Rank Constraints in Shallow Polynomial Neural Networks
- Gradient descent in presence of extreme flatness and steepness
- How Much Context Does Natural Language Actually Require? An Analysis Using LLMs as Statistical Oracles
- Improving Pathfinding with Anchoring Tokens
- In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
- Is Visual Prompting the Right Setup for Knowledge Transfer in new Foundation Models?
- Koopman Autoencoders Learn Neural Representation Dynamics
- Learning Gaussian Mixture Models via Transformer Measure Flows
- LiteByte: Efficient and Fast-Adapting MLPs for Online Byte-Level Prediction
- Measuring Memorization and Generalization in Forecasting Models via Structured Perturbations of Chaotic Systems
- Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks
- Neural Stochastic Differential Equations on Compact State-Spaces
- On the Emergence of Position Bias in Transformers
- Optimizing Explanations: Nuances Matter When Evaluation Metrics Become Loss Functions
- Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs
- Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models
- Permutations as a testbed for studying the effect of input representations on learning
- Personalizing AI Interventions in Multiple Health Behavioral Change Settings
- Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
- Pruning Increases Orderedness in Weight-Tied Recurrent Computation
- Quantitative Bounds for Length Generalization in Transformers
- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- Restoring Task-Relevant Information in Synthetic Data: A Small-Scale V-Information View
- Review, Remask, Refine: Process-Guided Block Diffusion for Text Generation
- Stats or Facts: Decomposing Generalization in Language Models with Small-Scale Models
- SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference
- The Necessity for Intervention Fidelity: Unintended Side Effects When Steering LLMs
- TinyServe: Query-Aware Cache Selection for Efficient LLM Inference
- Towards Understanding Self-Pretraining for Sequence Classification
- Transformers May Learn to Classify In-Context by Context-Adaptive Kernel Gradient Descent
- Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning
- Understanding Attention Glitches with Threshold Relative Attention
- Understanding How Chess-Playing Language Models Compute Linear Board Representations
- Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
- What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
- Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
- ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training