
High-dimensional Learning Dynamics 2026

HiLD at ICML 2026


Submission deadline: May 12, 2026, 12:00 UTC
OpenReview-synced 2026-05-12 12:00 UTC (as of 2026-06-23) — extensions on OpenReview are applied automatically; verify on the website.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (167)

Fetched from OpenReview (v2) on 2026-06-10.

$\delta$-Regularized Gradient Clipping for Stable Optimization: Analysis and Empirical Evaluation
Katsiaryna Novikava, Anna Lytova, Omar Rivasplata · PDF
A "feature ODE" describing the learning behavior of shallow MLPs on simple functions
Joseph Turnbull, Berkan Ottlik, James B Simon · PDF
A $p$-adic Perspective on Low-Bit Training of Neural Networks
Daniel Bershatsky, Marina Munkhoeva, Ivan Oseledets · PDF
A Compute-Matched Study of Hidden Layer Distillation for LLM Pre-Training
Maxime Guigon, Lucas Dixon, Michael Eli Sander · PDF
A Coulomb Particle Model for Learning Kernel Attention in Transformers
Masoud Badiei Khuzani, Sharath Honnaiah, Atiq Islam, Alex Cozzi, Abraham Bagherjeiran · PDF
A Data-Scaling Sweet Spot in Structured Algorithmic Learning
Shin So, Kyelim Lee, Albert No · PDF
A Geometric Perspective on Stabilizing Value Conflict Resolution
Saket Reddy, Andy Liu · PDF
A Horizon-Dependent Intrinsic-Dimension Theory of Scaling for Biological Forecasting
Bryan Cheng, Austin Jin, Jasper Zhang, Arnav Pemmaraju, Brendan Lo, Joshua Chang · PDF
A loss curvature account of fine-tuning fragility
Ivaylo Dimitrov, Leo Karoubi, Sunny Howard, Dmitrii Krasheninnikov · PDF
A Quadratic Lens on Muon: Orthogonalization, Invariance, and Implicit Preconditioning
Egor Shulgin, Sam Laing, Antonio Orvieto, Peter Richtárik · PDF
A Simple and Efficient Measure of Loss Landscape Curvature
Hee-Sung Kim, Sungyoon Lee · PDF
Activation Functions Control Finite-Width Concentration in Wide Neural Networks
Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta · PDF
Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Bingbin Liu, Rachit Bansal, Depen Morwani, David Alvarez-Melis, Sham M. Kakade · PDF
AMUSE: Anytime Muon with Stable Gradient Evaluation
Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, Chulhee Yun · PDF
Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
Alexandru Meterez, Pranav Ajit Nair, Depen Morwani, Cengiz Pehlevan, Sham M. Kakade · PDF
Asymmetric Scaling Laws from Sparse Features
John Sous · PDF
Be Greedy, Stay Linear: Universally Robust Feature Engineering
Yunze Leng, Rohan Ghosh, Mehul Motani · PDF
Beyond the Hessian Edge: The Stochastic Stability Cocycle of Mini-Batch SGD
Manoj Saravanan · PDF
BLADE: Binary Learning via Algebraic Dual Estimation for the Exact Edge of Stability in 1-Bit Networks
Şuayp Talha Kocabay, Talha Rüzgar Akkuş · PDF
BReD: Stabilizing Quantized EMA Dynamics for Memory-Efficient Large-Scale Training
Heshen Zhan, Youhan Huang, Yunke Peng, Yao Wang, Linghui Kong, Ziwei Zhu, Qingyu Han, Yaoyuan Wang, Congliang Chen, Ruoyu Sun · PDF
Causal Volterra Dynamics of Mamba
Ming-Ching Chang, Ting Yu Tsai, Davis Wertheimer, Felix X.-F. Ye · PDF
Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization
Marcelina Marjankowska, Paolo Barucca, Valerio Modugno · PDF
Common Origins, Divergent Destinations: The Development of Cross-Layer Alignment Under GELU and SwiGLU
Delbert Bray Jr. · PDF
Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods
Depen Morwani, Alexandru Meterez, Pranav Ajit Nair, Sham M. Kakade · PDF
Compute-Optimal Scaling Laws for the Generalization Phase Transition in Grokking
Anish Kataria · PDF
Compute-Optimal Training as Stochastic Optimal Control
Rohan Keyur Dalal · PDF
Continuous Sparsification via Minimizing Movement
Hoang Pham, Tom Jacobs, Binh Nguyen, Rebekka Burkholz, Long Tran-Thanh · PDF
Critical Batch Size for LLM Policy Optimization
Rachit Bansal, Clara Mohri, Natalie Abreu, David Alvarez-Melis, Sham M. Kakade · PDF
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang · PDF
Deep Learning as Neural Low-Degree Filtering: A Theory of Hierarchical Feature Learning
Yatin Dandi, Matteo Vilucchio, Luca Arnaboldi, Hugo Tabanelli, Florent Krzakala · PDF
DeltaMomentum: A Key-Value based Anisotropic Momentum Update via Delta Rule
Euijin Hong, Guannan Qu · PDF
Depth scaling and Muon enable balanced expert usage in MoE training
Xi Wang, Soufiane Hayou, Eric Nalisnick · PDF
Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
Egor Shulgin, Jörg K.H. Franke, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Korbinian Pöppel, Bernhard Schölkopf, Aaron Klein, Peter Richtárik, Antonio Orvieto · PDF
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Hari Kishan Prakash, Charles H Martin · PDF
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis · PDF
Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent
Shota Imai, Sota Nishiyama, Masaaki Imaizumi · PDF
Dimension-Free Scaling Laws for Invariant Score Matching
Behrooz Tahmasebi, Melanie Weber · PDF
DOSA: Dynamic Online State Allocation for Adaptive Optimizers via Per-Tensor Sketched Smoothness Tests
Rohin Maganti, Rahul Maganti · PDF
Dynamics of Nonlinear Feature Learning in Two-Layer GCNs on XOR-CSBM
Zen Inagaki, Guillaume Braun, Masaaki Imaizumi · PDF
DynMuon: A Dynamic Spectral Shaping View of Muon
Fangzhou Wu, Rikhav Shah, Sandeep Silwal, Qiuyi Zhang · PDF
Early Alignment without Neural Collapse in Two-Layer ReLU Networks on Gaussian XOR
Shi Dong · PDF
Edge of Stability Selectively Shapes Learning Across the Data Distribution
Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano · PDF
Effective Dimension Ratios under Symmetry Augmentation
Hikaru Matsuoka · PDF
Effects of width-dependent model hyperparameters and $\ell_2$-regularization on the loss landscape of two-layer ReLU networks
Haruka Eshima, Makoto Yamada · PDF
Efficient Clustering with Provable Guardrails for LLM Inference at Scale
Longshaokan Wang, Waitsang Keung, Punit Ghodasara, Roman Zhuang Wang, Ali Dashti, Francesc Moreno-Noguer · PDF
Empirical Model-Size Scaling for Neural PDE Solvers on the LQR-HJB Benchmark
Rohan Keyur Dalal · PDF
Explaining Data Mixing Scaling Laws
Rui Dai, Shuran Zheng · PDF
Fast Learning Rate Transfer for Gradient Descent in Sketched Linear Regression
Garrett Wen, Alberto Bietti, Nikhil Ghosh, Theodor Misiakiewicz, Denny Wu · PDF
Feature Learning in High-Dimensions under Structured Covariance: Scaling Laws in Quadratic Networks
Qingyuan Yu, Nuri Mert Vural, Xin T. Tong, Murat A Erdogdu · PDF
Fixed-Point Reasoning: Stable and Adaptive Deep Looped Models
Sajad Movahedi, Shlomo Libo Feigin, Vera Milovanović, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto · PDF
Generalization Analysis of Linear Knowledge Distillation
Taesun Yeom, Taehyeok Ha, Jaeho Lee · PDF
Geometry, Not Scale Alone, Predicts Sparse Recovery of Causal Subspaces
Socrates Osorio, Joy Zheyun Yang · PDF
Global Linear Convergence of Inexact TD Under Generalized Smoothness
Alokendu Mazumder, Ila Ananta, Punit Rathore · PDF
Gradient Descent on Two ReLU Neurons: Global Landscape and Bifurcation Dynamics
Binghua Li, Mengzhe Li, Denny Wu, Tianhao Wang · PDF
Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate
Yingzhen Yang · PDF
Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou · PDF
High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory
Sota Nishiyama, Masaaki Imaizumi · PDF
HORST: Composing Optimizer Geometries for Sparse Transformer Training
Tom Jacobs, Rohan Jain, Rebekka Burkholz · PDF
Hourglass MLP: Rethinking the Shape of Residual Architectures
Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu · PDF
How Cross-Entropy Learns Data Modes: Emergence and Implicit Bias in the Unconstrained Features Model
Arman Lotfalikhani, Mohammad Moshtaghifar, Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis · PDF
How does feature learning change the function space evolution?
João Lobo, Bruno Loureiro, Long Tran-Thanh, Fanghui Liu · PDF
How Does Orthogonalization Adapt to the Neural-Network Hessian Structure? A Gradient Self Outer-Product Analysis at Initialization
Shenyang Deng, Shuhua Yu, Yaoqing Yang · PDF
How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?
Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar · PDF
How Excess Latent Dimensionality Delays Memorization in Diffusion Models
Trevor Chen, Ryan Shahbaba, Avni Garg, Kevin Peng · PDF
How the Hessian-Spectrum of Linear Networks Depends on Data
Jasraj Singh · PDF
How to Scale Mixture-of-Experts: From μP to the Maximally Scale-Stable Parameterization
Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia · PDF
How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
Julius Girardin, Emanuele Troiani, Yizhou Xu, Vittorio Erba, Florent Krzakala, Lenka Zdeborová · PDF
In-Context Benign Overfitting: A Feature-Selection Model in In-Context Linear Regression
Puneesh Deora, Bhavya Vasudeva, Christos Thrampoulidis · PDF
Internal Data Repetition Destroys Language Models
Jessica Chudnovsky, Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Sanmi Koyejo, David L. Donoho · PDF
Is your LLM a Sequence Model on the Training History? The Origins and Consequences of Anticipation
Szilvia Ujváry, Bruno Kacper Mlodozeniec, José Miguel Hernández-Lobato, Ferenc Huszár, Dmitrii Krasheninnikov · PDF
KiteNorm: Variance Regularisation for Stable and Scalable Post-LN Transformers
Leon A. Trochelmann, Sajad Movahedi, Shiwei Liu, Antonio Orvieto · PDF
Layer Collapse in Diffusion Language Models
Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu · PDF
Learnability and Competition in High-Dimensional Multi-Component ICA
Eser İlke Genc, Samet Demir, Zafer Dogan · PDF
Learning Dynamics of LISP: A Gradient-Free Constraint-Satisfaction Family Containing Backpropagation
Vardan Grigoryants, Alexander Hakobyan · PDF
Learning High-Dimensional Transient Neural Dynamics for Zero-Shot Cross-Subject Reconstruction
Anima Kujur, Zahra Monfared · PDF
Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh · PDF
Learning Rates Do Not Transfer Across Double Descent
Dayal Singh Kalra, Maissam Barkeshli · PDF
Learning with Synthetic Data via SGD in High-Dimensional Linear Regression
Jichu Li, Difan Zou · PDF
Learning-Forgetting Optimality in Supervised Finetuning: A Cliff Perspective
Albert Catalan-Tatjer, Jonas Geiping · PDF
Lightweight Surrogate-Assisted Language Model Pretraining
Nathan Truong · PDF
Linear Loss Classification: Efficient Training Through Neural Collapse
Wonyeong Song, Donghwan Kim · PDF
LoRA-Lens: Training Induces Spectral Compression in Low-Rank Adapters
Zhiyuan Gao · PDF
Loss and Optimizer as Two Essential Mechanisms Behind Knowledge Distillation
Satoki Ishikawa, Sameer Satish Deshmukh, Sakina Fatima, Takumi Honda, Rio Yokota · PDF
M-seq Initialization: Using Pseudo-Random Binary Sequences to Initialize Deep Neural Networks
Zanya Gonzalez Tellez, Antonio Orvieto · PDF
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich · PDF
MIDUS: Memory-Infused Depth Up-Scaling
Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song · PDF
Mini-batch Noise Lowers Sharpness via Dominant-Subspace Fluctuations
Jun-Ho So, Dongwook Shin · PDF
Mode Collapse Emerges from Low-Rank Biases in the Learning Dynamics of Generative Models
Julian Brandon, Bruno Loureiro, N Alex Cayco Gajic, Arthur Pellegrino · PDF
Model Behavior and Predictive Stability Under Severe Class Imbalance in High-Dimensional Classification
Linxi Li, Li Xing · PDF
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Yiding Song, Hanming Ye · PDF
Momentum Acceleration of Normalized Steepest Descent at the Edge of Stability
Beining Wu, Tianhao Wang, Zhiyuan Li · PDF
NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama
Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma · PDF
Neural Neural Scaling Laws
Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho · PDF
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu · PDF
Noise-driven escape from metastable phases explains grokking in deep neural networks
Ibrahim Talha Ersoy, Karoline Wiesner · PDF
Objective-Induced Conditional Mismatch in Sequence Diffusion Models
Matevz Matjasec, Antonio Orvieto · PDF
On How Muon Reshapes Skill Learning Dynamics
Alex Reyes Aranda, Vincent-Daniel Yun, Vatsal Sharan, Bhavya Vasudeva · PDF
On Lipschitz Explosion in Deep Neural Networks with Normalization: Consequences for Optimization and Adversarial Robustness
Ashkan Soleymani, Reyhaneh Hosseinpourkhoshkbari, Hadi Daneshmand, Patrick Jaillet · PDF
On the Convergence of Low-Precision LoRA Training
Dechen Zhang, Xuan Tang, Difan Zou · PDF
On the Mean-field Analysis of Normalized Steepest Descent via Linear Minimization Oracles
Yongtao Wu, Fanghui Liu, Taiji Suzuki, Volkan Cevher · PDF
On the Optimizer Dependence of Neural Scaling Laws
Shourya Vir Jain, Vansh Ramani · PDF
On the Surprising Effectiveness of Masking Updates in LLM Training
Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie · PDF
Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
Samet Demir, Zafer Dogan · PDF
Optimal learning rate scaling depends on data in deep scalar linear networks
Yedi Zhang, Peter E. Latham, Leena Chennuru Vankadara, Andrew M Saxe · PDF
Optimal scaling laws in learning hierarchical multi-index models
Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard · PDF
Optimal Scaling Needs Optimal Norm
Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim · PDF
Optimistic Online Learning for Data Mixture Optimization
Sofija Orlovic, Francesco Tonin, Volkan Cevher · PDF
Orthogonal Gradient Constraints Shape Noisy-Label Memorization Dynamics
Richard Mai · PDF
Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization
Kristi Topollai, Allan Ma, Tolga Dimlioglu, Sui Jiet Tay, Anna Ewa Choromanska · PDF
Pathwise EMA: An Intrinsic Clock for Weight Averaging
Nghi Dao, Adam Block · PDF
Physics-Guided Policy Optimization with Self-Distillation
Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei · PDF
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
Daniel Wolfson, Tal Wagner · PDF
Practical Muon Accelerates Projected Feature Learning in Scaling-Law Models
Chryseis Xinyi Liu, Manuel Paez · PDF
Predicting Cross-Domain RAG Retrieval Quality using Von Neumann Graph Entropy
John Carlsson, Daniel Barcklow, Nikita Seleznev, Senthil Kumar, Ke Xu · PDF
Provable Data Scaling Law for Meta Learning via Complexity Minimization
Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui · PDF
Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent
Ahanaf Hasan Ariq · PDF
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Dayal Singh Kalra, Maissam Barkeshli · PDF
Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization
Srijan Tiwari, Aditya Chauhan, Manjot Singh · PDF
Random Sparse Subnetworks Suffice for RLVR: The Multiple Ticket Hypothesis
Israel Adewuyi, Solomon Okibe, Vladimir V. Ivanov · PDF
Rank Allocation in Low-Rank Optimizers
Ansh Tiwari · PDF
Rank-One Potential Geometry for Normalized Optimizers
Insung Yun, Sungyoon Lee · PDF
Refresh-Scaling the Memory of Balanced Adam
Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Orti · PDF
Regularizing Optimizer Updates via Feasible-Set Projection
Insung Yun, Sungyoon Lee · PDF
Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage
Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin · PDF
Representation Stability in High-Dimensional Noisy Time Series via Koopman-Based Features
Sucheta Ghosh, Zahra Monfared · PDF
Reset-and-Discard (ReD) Improves Coverage at every Budget under Inference Power-Law Scaling
Sagi Meir, Tommer D. Keidar, Noam Itzhak Levi, Shlomi Reuveni, Barak Hirshberg · PDF
Rethinking Bregman Divergences in Kronecker-Factored Optimizers
Bing Liu, WenJie Zhou, Chengcheng Zhao · PDF
Reward-Aware Population Scaling of Evolutionary Strategies in LLM Fine-Tuning
Sung Cho, Gyubin Han · PDF
RRD: Routing-and-Residual Distillation for Efficient MoE Recovery in Large Language Models
Hoyoon Byun, Kangjun Noh, SoMin Kim, Heedong Kim, Jaeyoon Shim, Sungjun Lim, Youngjun Choi, Kyungwoo Song · PDF
Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions
Matthew J. Liu, Wei Hang Zheng, Vidhan Purohit, Siqi Xie, Chieh-En Li, Jerry Li, Noah Flynn · PDF
Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
Arie Wortsman Zurich, Hugo Tabanelli, Yatin Dandi, Florent Krzakala, Bruno Loureiro · PDF
Scaling Theory for SlowRunning: Model size, Ensembling, and Training Horizon in the Multi-Epoch Regime
Zhen Yang, Yidi Miao, Blake Bordelon · PDF
Scaling with Recursion in Masked Discrete Diffusion Models
Alba Carballo-Castro, Julianna Piskorz, Paulius Rauba, Mihaela van der Schaar, Pascal Frossard · PDF
Self-Distillation for Data-Scarce Language Model Pretraining
Javid Lakha, Nihal V. Nayak, Bingbin Liu, Sham M. Kakade, David Alvarez-Melis · PDF
Self-Influence Governs Generalization: A von Mises Expansion Approach
John Sous, Anirvan M. Sengupta · PDF
Sequential Correlations Change In-Context Learning: Effective Context Length and Architectural Mismatch
Mary Letey, Yue M. Lu, Cengiz Pehlevan, Jacob A Zavatone-Veth · PDF
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee · PDF
Sharp Generalization for Shallow Neural Networks with Channel Attention
Yingzhen Yang · PDF
Signal Frequency Imbalance and Ill-Conditioning
Tianyue H. Zhang, Francis Bach, Frederik Kunstner · PDF
Small for Small: Exploring Optimal Teacher in Knowledge Distillation with Limited Data
Minjae Park, Taesun Yeom, Jaeho Lee · PDF
Spectral Equalization Minimizes Total Training Energy: A Control-Theoretic Account of Muon's Advantage
Euijin Hong · PDF
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
Thomas Tulinski, Simona Cocco, Remi Monasson, Jorge Fernandez-de-Cossio-Diaz · PDF
SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Omanshu Thapliyal · PDF
Stabilizing Continuous-Time Kolen–Pollack Learning with a Scale-Balance Condition
Marc Gong Bacvanski, Liu Ziyin, Tomaso Poggio · PDF
Stochastic Gradient Descent on the Linear Bigram Model: Bias-Variance Scaling and Critical Batch Size
Zeyu Bian, Dmitriy Drusvyatskiy, Tianhao Wang · PDF
Structure and Scale in Simplicial Sequence Modelling
Matthew Farrugia-Roberts · PDF
Task-Dependent Inference-Compute Scaling Frontiers: Diffusion vs. Autoregressive Language Models
Khurram Khalil, Ripan Kumar Kundu · PDF
Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Arpan Mukherjee, Marcello Bullo, Debabrota Basu, Deniz Gunduz · PDF
The Propagation Field: A Geometric Substrate Theory of Deep Learning
Xingrui Gu · PDF
Timescale Separation in Sparse Dictionary Learning: Reconstruction Converges Before Reproducibility
Bright Liu · PDF
Too Sharp, Too Sure: When Calibration Follows Curvature
Alessandro Morosini, Matea Gjika, Tomaso Poggio, Pierfrancesco Beneventano · PDF
Towards Understanding Momentum Acceleration in River-Valley Loss Landscape
Miao Lu, Zeyu Bian, Kaiyue Wen, Beining Wu, Siyu Chen, Tianhao Wang, Zhiyuan Li · PDF
Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
Gautam Goel, Mahdi Soltanolkotabi, Peter Bartlett · PDF
Training Transformers for KV Cache Compressibility
Yoav Gelberg, Yam Eitan, Michael M. Bronstein, Yarin Gal, Haggai Maron · PDF
Transformers Can Learn Multiclass Classification In-Context: Isotropy Governs Generalization
Daehan Yoon, Chulhee Yun · PDF
Understanding Clipping in Zeroth-order Optimization
Saket Gollapudi, Elisa Bertino, Sewoong Oh · PDF
Understanding Feature Learning Dynamics in Isotropic Regularizers via BHEP Statistics
Jeongyou Lee, Sungyoon Lee · PDF
Understanding Polyak's Momentum in Deep Learning May Require Rethinking Non-Convex Optimization
Donghwa Kim, Chulhee Yun · PDF
Uniform Spectral Growth under Factor-wise Muon Orthogonalization in Matrix Factorization and LoRA
Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun · PDF
Weight Anisotropy in Mean-Field Theory: Learning on Isotropic Data
Niclas Alexander Göring, Chris Mingard, Yoonsoo Nam, Jake Reid, Ard A. Louis · PDF
What it means by learning in a neural network: easing the knot
Atiya Ibnat Tasnim, Rushmila Shehreen Khan, Md. Shahriar Karim · PDF
When and Why Grouping Attention Heads Accelerates Muon Optimization
Hongtao Zhang, WenJie Zhou, Wei Chen, Xueqi Cheng · PDF
Why Adversarial Diffusion Trains More Stably Than GANs: A Local Jacobian View
Florian Ochs · PDF
Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation
Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein · PDF
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana · PDF
Why Routers Freeze: Infinite Width Learning Dynamics for Mixture of Experts
Anish Dhir, Volkan Cevher, Leena Chennuru Vankadara · PDF
Worker Disagreement Reveals Sharp Directions in Local SGD
Tolga Dimlioglu, Kristi Topollai, Anna Ewa Choromanska · PDF
