
ICML 2025 Workshop on Methods and Opportunities at Small Scale

MOSS@ICML2025


Submission deadline: May 27, 2025, 15:50 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (61)

Fetched from OpenReview (v2) on 2026-06-10.

AdaptMI: Adaptive Skill-based In-context Math Instructions for Small Language Models
Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora · PDF
An Empirical Investigation of Initialization Strategies for Kolmogorov–Arnold Networks
Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang · PDF
Approximate Message Passing on General Factor Graphs using Shallow Neural Networks
Leonhard Hennicke, Jan Lemcke, Rainer Schlosser, Ralf Herbrich · PDF
CaliPSo: Calibrated Predictive Models with Sharpness as Loss Function
Alexandre Capone, Kamron Zaidi, Tianyu Xu, Brian Yang, Geoff Pleiss, Jeff Schneider · PDF
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak · PDF
Cross-Validation Error Dynamics in Smaller Datasets
Bethany Austhof, Lev Reyzin · PDF
Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
Freya Behrens, Lenka Zdeborova
Decomposed Learning: An Avenue for Mitigating Grokking
Gabryel Mason-Williams, Israel Mason-Williams · PDF
Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
Jaeha Lee, Gio Huh, Ning Su, Tony Yue Yu · PDF
Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
Xinyi Wang, Shawn Tan, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen · PDF
Dynamic Low-Rank Training with Spectral Regularization: Achieving Robustness in Compressed Representations
Steffen Schotthöfer, H. Lexie Yang, Stefan Schnake · PDF
Effective Reinforcement Learning for Reasoning in Language Models
Lianghuan Huang, Shuo Li, Sagnik Anupam, Insup Lee, Osbert Bastani · PDF
Efficient B-Tree Insertions Using Proximal Policy Optimization and Hierarchical Attention Models
Alexander Kastius, Nick Lechtenbörger, Felix Schulz, Johann Schulze Tast, Rainer Schlosser, Ralf Herbrich · PDF
Emergence of Hebbian Dynamics in Regularized Non-Local Learners
David Aaron Koplow, Tomaso Poggio, Liu Ziyin · PDF
Emergence, pretraining loss and associative recall: a toy model
Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai · PDF
Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness
Jackson Sam Michaels, Sidong Zhang, Madalina Fiterau · PDF
Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
Rahul Raja, Arpita Vats · PDF
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba E. Ba · PDF
Exploring Diverse Solutions for Underdetermined Problems
Eric Volkmann, Andreas Radler, Johannes Brandstetter, Arturs Berzins · PDF
Extrapolation by Association: Length Generalization Transfer in Transformers
Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos · PDF
Foundation Models on a Budget: Approximating Blocks in Large Vision Models
Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E Vogt · PDF
From SGD to Spectra: A Theory of Neural Network Weight Dynamics
Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula · PDF
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
Lillian Sun, Martin Pawelczyk, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju · PDF
Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
Siva Rajesh Kasa, Sumegh Roychowdhury, Karan Gupta, Yaswanth Biruduraju, Santhosh Kumar Kasa, Ashutosh Kumar, Pattisapu Nikhil Priyatam, Arindam Bhattacharya, Shailendra Agarwal, Vijay Huddar · PDF
Geometry of Rank Constraints in Shallow Polynomial Neural Networks
Param Mody, Maksym Zubkov · PDF
Gradient descent in presence of extreme flatness and steepness
Dravyansh Sharma
How Much Context Does Natural Language Actually Require? An Analysis Using LLMs as Statistical Oracles
Vala Vakilian, Sadegh Mahdavi, Christos Thrampoulidis · PDF
Improving Pathfinding with Anchoring Tokens
Huaqing Zhang, Bingbin Liu, Juno Kim, Andrej Risteski · PDF
In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis · PDF
Is Visual Prompting the Right Setup for Knowledge Transfer in new Foundation Models?
Niclas Hergenröther, Antonio Orvieto · PDF
Koopman Autoencoders Learn Neural Representation Dynamics
Nishant Suresh Aswani, Saif Jabari · PDF
Learning Gaussian Mixture Models via Transformer Measure Flows
Aleksandr Zimin, Anastasiia Kutakh, Yury Polyanskiy, Philippe Rigollet · PDF
LiteByte: Efficient and Fast-Adapting MLPs for Online Byte-Level Prediction
Yu Mao, Yuyan Lin, Xue Liu, Chun Jason Xue · PDF
Measuring Memorization and Generalization in Forecasting Models via Structured Perturbations of Chaotic Systems
Max Kanwal, Caryn Tran · PDF
Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks
Shakir Yousefi, Andreas Plesner, Till Aczel, Roger Wattenhofer · PDF
Neural Stochastic Differential Equations on Compact State-Spaces
Yue-Jane Liu, Malinda Lu, Matthew K. Nock, Yaniv Yacoby · PDF
On the Emergence of Position Bias in Transformers
Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie · PDF
Optimizing Explanations: Nuances Matter When Evaluation Metrics Become Loss Functions
Jonas B Raedler, Hiwot Belay Tadesse, Weiwei Pan, Finale Doshi-Velez · PDF
Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs
Behnoush Khavari, Jayesh Khullar, Mehran Shakerinava, Jerry Huang, Siamak Ravanbakhsh, Sarath Chandar · PDF
Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models
Changhyun Choi, Sungha Kim, H. Jin Kim · PDF
Permutations as a testbed for studying the effect of input representations on learning
Sarah McGuire Scullen, Davis Brown, Robert Jasper, Henry Kvinge, Helen Jenne · PDF
Personalizing AI Interventions in Multiple Health Behavioral Change Settings
Samantha Marks, Michelle Chang, Eura Nofshin, Weiwei Pan, Finale Doshi-Velez · PDF
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba E. Ba · PDF
Pruning Increases Orderedness in Weight-Tied Recurrent Computation
Yiding Song · PDF
Quantitative Bounds for Length Generalization in Transformers
Zachary Izzo, Eshaan Nichani, Jason D. Lee · PDF
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian · PDF
Restoring Task-Relevant Information in Synthetic Data: A Small-Scale V-Information View
Sid Bharthulwar · PDF
Review, Remask, Refine: Process-Guided Block Diffusion for Text Generation
Nikita Mounier, Parsa Idehpour · PDF
Stats or Facts: Decomposing Generalization in Language Models with Small-Scale Models
Tina Behnia, Puneesh Deora, Christos Thrampoulidis · PDF
SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference
Jake Levi, Mark van der Wilk · PDF
The Necessity for Intervention Fidelity: Unintended Side Effects When Steering LLMs
Jonas B Raedler, Weiyue Li, Alyssa Mia Taliotis, Manasvi Goyal, Siddharth Swaroop, Weiwei Pan · PDF
TinyServe: Query-Aware Cache Selection for Efficient LLM Inference
Dong Liu, Yanxuan Yu · PDF
Towards Understanding Self-Pretraining for Sequence Classification
Omar Coser, Antonio Orvieto · PDF
Transformers May Learn to Classify In-Context by Context-Adaptive Kernel Gradient Descent
Sara Dragutinović, Andrew M Saxe, Aaditya K Singh · PDF
Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning
Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney · PDF
Understanding Attention Glitches with Threshold Relative Attention
Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao · PDF
Understanding How Chess-Playing Language Models Compute Linear Board Representations
Aaron Mei · PDF
Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
Annalisa Belloni, Lorenzo Noci, Antonio Orvieto · PDF
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
Pulkit Gopalani, Wei Hu · PDF
Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
Yize Zhao, Christos Thrampoulidis · PDF
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
Feijiang Han, Xiaodong Yu, Jianheng Tang, Qingyun Zeng, Licheng Guo, Lyle Ungar · PDF
