
High-dimensional Learning Dynamics 2026

HiLD at ICML 2026

Submission deadline: May 12, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (167)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $\delta$-Regularized Gradient Clipping for Stable Optimization: Analysis and Empirical Evaluation

    Katsiaryna Novikava, Anna Lytova, Omar Rivasplata · PDF
  2. A "feature ODE" describing the learning behavior of shallow MLPs on simple functions

    Joseph Turnbull, Berkan Ottlik, James B Simon · PDF
  3. A $p$-adic Perspective on Low-Bit Training of Neural Networks

    Daniel Bershatsky, Marina Munkhoeva, Ivan Oseledets · PDF
  4. A Compute-Matched Study of Hidden Layer Distillation for LLM Pre-Training

    Maxime Guigon, Lucas Dixon, Michael Eli Sander · PDF
  5. A Coulomb Particle Model for Learning Kernel Attention in Transformers

    Masoud Badiei Khuzani, Sharath Honnaiah, Atiq Islam, Alex Cozzi, Abraham Bagherjeiran · PDF
  6. A Data-Scaling Sweet Spot in Structured Algorithmic Learning

    Shin So, Kyelim Lee, Albert No · PDF
  7. A Geometric Perspective on Stabilizing Value Conflict Resolution

    Saket Reddy, Andy Liu · PDF
  8. A Horizon-Dependent Intrinsic-Dimension Theory of Scaling for Biological Forecasting

    Bryan Cheng, Austin Jin, Jasper Zhang, Arnav Pemmaraju, Brendan Lo, Joshua Chang · PDF
  9. A loss curvature account of fine-tuning fragility

    Ivaylo Dimitrov, Leo Karoubi, Sunny Howard, Dmitrii Krasheninnikov · PDF
  10. A Quadratic Lens on Muon: Orthogonalization, Invariance, and Implicit Preconditioning

    Egor Shulgin, Sam Laing, Antonio Orvieto, Peter Richtárik · PDF
  11. A Simple and Efficient Measure of Loss Landscape Curvature

    Hee-Sung Kim, Sungyoon Lee · PDF
  12. Activation Functions Control Finite-Width Concentration in Wide Neural Networks

    Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta · PDF
  13. Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

    Bingbin Liu, Rachit Bansal, Depen Morwani, David Alvarez-Melis, Sham M. Kakade · PDF
  14. AMUSE: Anytime Muon with Stable Gradient Evaluation

    Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, Chulhee Yun · PDF
  15. Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

    Alexandru Meterez, Pranav Ajit Nair, Depen Morwani, Cengiz Pehlevan, Sham M. Kakade · PDF
  16. Asymmetric Scaling Laws from Sparse Features

    John Sous · PDF
  17. Be Greedy, Stay Linear: Universally Robust Feature Engineering

    Yunze Leng, Rohan Ghosh, Mehul Motani · PDF
  18. Beyond the Hessian Edge: The Stochastic Stability Cocycle of Mini-Batch SGD

    Manoj Saravanan · PDF
  19. BLADE: Binary Learning via Algebraic Dual Estimation for the Exact Edge of Stability in 1-Bit Networks

    Şuayp Talha Kocabay, Talha Rüzgar Akkuş · PDF
  20. BReD: Stabilizing Quantized EMA Dynamics for Memory-Efficient Large-Scale Training

    Heshen Zhan, Youhan Huang, Yunke Peng, Yao Wang, Linghui Kong, Ziwei Zhu, Qingyu Han, Yaoyuan Wang, Congliang Chen, Ruoyu Sun · PDF
  21. Causal Volterra Dynamics of Mamba

    Ming-Ching Chang, Ting Yu Tsai, Davis Wertheimer, Felix X.-F. Ye · PDF
  22. Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization

    Marcelina Marjankowska, Paolo Barucca, Valerio Modugno · PDF
  23. Common Origins, Divergent Destinations: The Development of Cross-Layer Alignment Under GELU and SwiGLU

    Delbert Bray Jr. · PDF
  24. Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

    Depen Morwani, Alexandru Meterez, Pranav Ajit Nair, Sham M. Kakade · PDF
  25. Compute-Optimal Scaling Laws for the Generalization Phase Transition in Grokking

    Anish Kataria · PDF
  26. Compute-Optimal Training as Stochastic Optimal Control

    Rohan Keyur Dalal · PDF
  27. Continuous Sparsification via Minimizing Movement

    Hoang Pham, Tom Jacobs, Binh Nguyen, Rebekka Burkholz, Long Tran-Thanh · PDF
  28. Critical Batch Size for LLM Policy Optimization

    Rachit Bansal, Clara Mohri, Natalie Abreu, David Alvarez-Melis, Sham M. Kakade · PDF
  29. Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

    Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang · PDF
  30. Deep Learning as Neural Low-Degree Filtering: A Theory of Hierarchical Feature Learning

    Yatin Dandi, Matteo Vilucchio, Luca Arnaboldi, Hugo Tabanelli, Florent Krzakala · PDF
  31. DeltaMomentum: A Key-Value based Anisotropic Momentum Update via Delta Rule

    Euijin Hong, Guannan Qu · PDF
  32. Depth scaling and Muon enable balanced expert usage in MoE training

    Xi Wang, Soufiane Hayou, Eric Nalisnick · PDF
  33. Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

    Egor Shulgin, Jörg K.H. Franke, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Korbinian Pöppel, Bernhard Schölkopf, Aaron Klein, Peter Richtárik, Antonio Orvieto · PDF
  34. Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

    Hari Kishan Prakash, Charles H. Martin · PDF
  35. Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics

    Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis · PDF
  36. Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent

    Shota Imai, Sota Nishiyama, Masaaki Imaizumi · PDF
  37. Dimension-Free Scaling Laws for Invariant Score Matching

    Behrooz Tahmasebi, Melanie Weber · PDF
  38. DOSA: Dynamic Online State Allocation for Adaptive Optimizers via Per-Tensor Sketched Smoothness Tests

    Rohin Maganti, Rahul Maganti · PDF
  39. Dynamics of Nonlinear Feature Learning in Two-Layer GCNs on XOR-CSBM

    Zen Inagaki, Guillaume Braun, Masaaki Imaizumi · PDF
  40. DynMuon: A Dynamic Spectral Shaping View of Muon

    Fangzhou Wu, Rikhav Shah, Sandeep Silwal, Qiuyi Zhang · PDF
  41. Early Alignment without Neural Collapse in Two-Layer ReLU Networks on Gaussian XOR

    Shi Dong · PDF
  42. Edge of Stability Selectively Shapes Learning Across the Data Distribution

    Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano · PDF
  43. Effective Dimension Ratios under Symmetry Augmentation

    Hikaru Matsuoka · PDF
  44. Effects of width-dependent model hyperparameters and $\ell_2$-regularization on the loss landscape of two-layer ReLU networks

    Haruka Eshima, Makoto Yamada · PDF
  45. Efficient Clustering with Provable Guardrails for LLM Inference at Scale

    Longshaokan Wang, Waitsang Keung, Punit Ghodasara, Roman Zhuang Wang, Ali Dashti, Francesc Moreno-Noguer · PDF
  46. Empirical Model-Size Scaling for Neural PDE Solvers on the LQR-HJB Benchmark

    Rohan Keyur Dalal · PDF
  47. Explaining Data Mixing Scaling Laws

    Rui Dai, Shuran Zheng · PDF
  48. Fast Learning Rate Transfer for Gradient Descent in Sketched Linear Regression

    Garrett Wen, Alberto Bietti, Nikhil Ghosh, Theodor Misiakiewicz, Denny Wu · PDF
  49. Feature Learning in High-Dimensions under Structured Covariance: Scaling Laws in Quadratic Networks

    Qingyuan Yu, Nuri Mert Vural, Xin T. Tong, Murat A Erdogdu · PDF
  50. Fixed-Point Reasoning: Stable and Adaptive Deep Looped Models

    Sajad Movahedi, Shlomo Libo Feigin, Vera Milovanović, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto · PDF
  51. Generalization Analysis of Linear Knowledge Distillation

    Taesun Yeom, Taehyeok Ha, Jaeho Lee · PDF
  52. Geometry, Not Scale Alone, Predicts Sparse Recovery of Causal Subspaces

    Socrates Osorio, Joy Zheyun Yang · PDF
  53. Global Linear Convergence of Inexact TD Under Generalized Smoothness

    Alokendu Mazumder, Ila Ananta, Punit Rathore · PDF
  54. Gradient Descent on Two ReLU Neurons: Global Landscape and Bifurcation Dynamics

    Binghua Li, Mengzhe Li, Denny Wu, Tianhao Wang · PDF
  55. Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

    Yingzhen Yang · PDF
  56. Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

    Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou · PDF
  57. High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory

    Sota Nishiyama, Masaaki Imaizumi · PDF
  58. HORST: Composing Optimizer Geometries for Sparse Transformer Training

    Tom Jacobs, Rohan Jain, Rebekka Burkholz · PDF
  59. Hourglass MLP: Rethinking the Shape of Residual Architectures

    Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu · PDF
  60. How Cross-Entropy Learns Data Modes: Emergence and Implicit Bias in the Unconstrained Features Model

    Arman Lotfalikhani, Mohammad Moshtaghifar, Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis · PDF
  61. How does feature learning change the function space evolution?

    João Lobo, Bruno Loureiro, Long Tran-Thanh, Fanghui Liu · PDF
  62. How Does Orthogonalization Adapt to the Neural-Network Hessian Structure? A Gradient Self Outer-Product Analysis at Initialization

    Shenyang Deng, Shuhua Yu, Yaoqing Yang · PDF
  63. How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

    Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar · PDF
  64. How Excess Latent Dimensionality Delays Memorization in Diffusion Models

    Trevor Chen, Ryan Shahbaba, Avni Garg, Kevin Peng · PDF
  65. How the Hessian-Spectrum of Linear Networks Depends on Data

    Jasraj Singh · PDF
  66. How to Scale Mixture-of-Experts: From μP to the Maximally Scale-Stable Parameterization

    Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia · PDF
  67. How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

    Julius Girardin, Emanuele Troiani, Yizhou Xu, Vittorio Erba, Florent Krzakala, Lenka Zdeborová · PDF
  68. In-Context Benign Overfitting: A Feature-Selection Model in In-Context Linear Regression

    Puneesh Deora, Bhavya Vasudeva, Christos Thrampoulidis · PDF
  69. Internal Data Repetition Destroys Language Models

    Jessica Chudnovsky, Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Sanmi Koyejo, David L. Donoho · PDF
  70. Is your LLM a Sequence Model on the Training History? The Origins and Consequences of Anticipation

    Szilvia Ujváry, Bruno Kacper Mlodozeniec, José Miguel Hernández-Lobato, Ferenc Huszár, Dmitrii Krasheninnikov · PDF
  71. KiteNorm: Variance Regularisation for Stable and Scalable Post-LN Transformers

    Leon A. Trochelmann, Sajad Movahedi, Shiwei Liu, Antonio Orvieto · PDF
  72. Layer Collapse in Diffusion Language Models

    Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu · PDF
  73. Learnability and Competition in High-Dimensional Multi-Component ICA

    Eser İlke Genc, Samet Demir, Zafer Dogan · PDF
  74. Learning Dynamics of LISP: A Gradient-Free Constraint-Satisfaction Family Containing Backpropagation

    Vardan Grigoryants, Alexander Hakobyan · PDF
  75. Learning High-Dimensional Transient Neural Dynamics for Zero-Shot Cross-Subject Reconstruction

    Anima Kujur, Zahra Monfared · PDF
  76. Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

    Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh · PDF
  77. Learning Rates Do Not Transfer Across Double Descent

    Dayal Singh Kalra, Maissam Barkeshli · PDF
  78. Learning with Synthetic Data via SGD in High-Dimensional Linear Regression

    Jichu Li, Difan Zou · PDF
  79. Learning-Forgetting Optimality in Supervised Finetuning: A Cliff Perspective

    Albert Catalan-Tatjer, Jonas Geiping · PDF
  80. Lightweight Surrogate-Assisted Language Model Pretraining

    Nathan Truong · PDF
  81. Linear Loss Classification: Efficient Training Through Neural Collapse

    Wonyeong Song, Donghwan Kim · PDF
  82. LoRA-Lens: Training Induces Spectral Compression in Low-Rank Adapters

    Zhiyuan Gao · PDF
  83. Loss and Optimizer as Two Essential Mechanisms Behind Knowledge Distillation

    Satoki Ishikawa, Sameer Satish Deshmukh, Sakina Fatima, Takumi Honda, Rio Yokota · PDF
  84. M-seq Initialization: Using Pseudo-Random Binary Sequences to Initialize Deep Neural Networks

    Zanya Gonzalez Tellez, Antonio Orvieto · PDF
  85. Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

    Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich · PDF
  86. MIDUS: Memory-Infused Depth Up-Scaling

    Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song · PDF
  87. Mini-batch Noise Lowers Sharpness via Dominant-Subspace Fluctuations

    Jun-Ho So, Dongwook Shin · PDF
  88. Mode Collapse Emerges from Low-Rank Biases in the Learning Dynamics of Generative Models

    Julian Brandon, Bruno Loureiro, N Alex Cayco Gajic, Arthur Pellegrino · PDF
  89. Model Behavior and Predictive Stability Under Severe Class Imbalance in High-Dimensional Classification

    Linxi Li, Li Xing · PDF
  90. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

    Yiding Song, Hanming Ye · PDF
  91. Momentum Acceleration of Normalized Steepest Descent at the Edge of Stability

    Beining Wu, Tianhao Wang, Zhiyuan Li · PDF
  92. NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

    Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma · PDF
  93. Neural Neural Scaling Laws

    Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho · PDF
  94. Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

    Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu · PDF
  95. Noise-driven escape from metastable phases explains grokking in deep neural networks

    Ibrahim Talha Ersoy, Karoline Wiesner · PDF
  96. Objective-Induced Conditional Mismatch in Sequence Diffusion Models

    Matevz Matjasec, Antonio Orvieto · PDF
  97. On How Muon Reshapes Skill Learning Dynamics

    Alex Reyes Aranda, Vincent-Daniel Yun, Vatsal Sharan, Bhavya Vasudeva · PDF
  98. On Lipschitz Explosion in Deep Neural Networks with Normalization: Consequences for Optimization and Adversarial Robustness

    Ashkan Soleymani, Reyhaneh Hosseinpourkhoshkbari, Hadi Daneshmand, Patrick Jaillet · PDF
  99. On the Convergence of Low-Precision LoRA Training

    Dechen Zhang, Xuan Tang, Difan Zou · PDF
  100. On the Mean-field Analysis of Normalized Steepest Descent via Linear Minimization Oracles

    Yongtao Wu, Fanghui Liu, Taiji Suzuki, Volkan Cevher · PDF
  101. On the Optimizer Dependence of Neural Scaling Laws

    Shourya Vir Jain, Vansh Ramani · PDF
  102. On the Surprising Effectiveness of Masking Updates in LLM Training

    Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie · PDF
  103. Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions

    Samet Demir, Zafer Dogan · PDF
  104. Optimal learning rate scaling depends on data in deep scalar linear networks

    Yedi Zhang, Peter E. Latham, Leena Chennuru Vankadara, Andrew M Saxe · PDF
  105. Optimal scaling laws in learning hierarchical multi-index models

    Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard · PDF
  106. Optimal Scaling Needs Optimal Norm

    Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim · PDF
  107. Optimistic Online Learning for Data Mixture Optimization

    Sofija Orlovic, Francesco Tonin, Volkan Cevher · PDF
  108. Orthogonal Gradient Constraints Shape Noisy-Label Memorization Dynamics

    Richard Mai · PDF
  109. Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

    Kristi Topollai, Allan Ma, Tolga Dimlioglu, Sui Jiet Tay, Anna Ewa Choromanska · PDF
  110. Pathwise EMA: An Intrinsic Clock for Weight Averaging

    Nghi Dao, Adam Block · PDF
  111. Physics-Guided Policy Optimization with Self-Distillation

    Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei · PDF
  112. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

    Daniel Wolfson, Tal Wagner · PDF
  113. Practical Muon Accelerates Projected Feature Learning in Scaling-Law Models

    Chryseis Xinyi Liu, Manuel Paez · PDF
  114. Predicting Cross-Domain RAG Retrieval Quality using Von Neumann Graph Entropy

    John Carlsson, Daniel Barcklow, Nikita Seleznev, Senthil Kumar, Ke Xu · PDF
  115. Provable Data Scaling Law for Meta Learning via Complexity Minimization

    Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui · PDF
  116. Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

    Ahanaf Hasan Ariq · PDF
  117. Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

    Dayal Singh Kalra, Maissam Barkeshli · PDF
  118. Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

    Srijan Tiwari, Aditya Chauhan, Manjot Singh · PDF
  119. The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR

    Israel Adewuyi, Solomon Okibe, Vladimir V. Ivanov · PDF
  120. Rank Allocation in Low-Rank Optimizers

    Ansh Tiwari · PDF
  121. Rank-One Potential Geometry for Normalized Optimizers

    Insung Yun, Sungyoon Lee · PDF
  122. Refresh-Scaling the Memory of Balanced Adam

    Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Orti · PDF
  123. Regularizing Optimizer Updates via Feasible-Set Projection

    Insung Yun, Sungyoon Lee · PDF
  124. Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

    Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin · PDF
  125. Representation Stability in High-Dimensional Noisy Time Series via Koopman-Based Features

    Sucheta Ghosh, Zahra Monfared · PDF
  126. Reset-and-Discard (ReD) Improves Coverage at every Budget under Inference Power-Law Scaling

    Sagi Meir, Tommer D. Keidar, Noam Itzhak Levi, Shlomi Reuveni, Barak Hirshberg · PDF
  127. Rethinking Bregman Divergences in Kronecker-Factored Optimizers

    Bing Liu, WenJie Zhou, Chengcheng Zhao · PDF
  128. Reward-Aware Population Scaling of Evolutionary Strategies in LLM Fine-Tuning

    Sung Cho, Gyubin Han · PDF
  129. RRD: Routing-and-Residual Distillation for Efficient MoE Recovery in Large Language Models

    Hoyoon Byun, Kangjun Noh, SoMin Kim, Heedong Kim, Jaeyoon Shim, Sungjun Lim, Youngjun Choi, Kyungwoo Song · PDF
  130. Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions

    Matthew J. Liu, Wei Hang Zheng, Vidhan Purohit, Siqi Xie, Chieh-En Li, Jerry Li, Noah Flynn · PDF
  131. Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

    Arie Wortsman Zurich, Hugo Tabanelli, Yatin Dandi, Florent Krzakala, Bruno Loureiro · PDF
  132. Scaling Theory for SlowRunning: Model size, Ensembling, and Training Horizon in the Multi-Epoch Regime

    Zhen Yang, Yidi Miao, Blake Bordelon · PDF
  133. Scaling with Recursion in Masked Discrete Diffusion Models

    Alba Carballo-Castro, Julianna Piskorz, Paulius Rauba, Mihaela van der Schaar, Pascal Frossard · PDF
  134. Self-Distillation for Data-Scarce Language Model Pretraining

    Javid Lakha, Nihal V. Nayak, Bingbin Liu, Sham M. Kakade, David Alvarez-Melis · PDF
  135. Self-Influence Governs Generalization: A von Mises Expansion Approach

    John Sous, Anirvan M. Sengupta · PDF
  136. Sequential Correlations Change In-Context Learning: Effective Context Length and Architectural Mismatch

    Mary Letey, Yue M. Lu, Cengiz Pehlevan, Jacob A Zavatone-Veth · PDF
  137. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

    Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee · PDF
  138. Sharp Generalization for Shallow Neural Networks with Channel Attention

    Yingzhen Yang · PDF
  139. Signal Frequency Imbalance and Ill-Conditioning

    Tianyue H. Zhang, Francis Bach, Frederik Kunstner · PDF
  140. Small for Small: Exploring Optimal Teacher in Knowledge Distillation with Limited Data

    Minjae Park, Taesun Yeom, Jaeho Lee · PDF
  141. Spectral Equalization Minimizes Total Training Energy: A Control-Theoretic Account of Muon's Advantage

    Euijin Hong · PDF
  142. Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

    Thomas Tulinski, Simona Cocco, Remi Monasson, Jorge Fernandez-de-Cossio-Diaz · PDF
  143. SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

    Omanshu Thapliyal · PDF
  144. Stabilizing Continuous-Time Kolen–Pollack Learning with a Scale-Balance Condition

    Marc Gong Bacvanski, Liu Ziyin, Tomaso Poggio · PDF
  145. Stochastic Gradient Descent on the Linear Bigram Model: Bias-Variance Scaling and Critical Batch Size

    Zeyu Bian, Dmitriy Drusvyatskiy, Tianhao Wang · PDF
  146. Structure and Scale in Simplicial Sequence Modelling

    Matthew Farrugia-Roberts · PDF
  147. Task-Dependent Inference-Compute Scaling Frontiers: Diffusion vs. Autoregressive Language Models

    Khurram Khalil, Ripan Kumar Kundu · PDF
  148. Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality

    Arpan Mukherjee, Marcello Bullo, Debabrota Basu, Deniz Gunduz · PDF
  149. The Propagation Field: A Geometric Substrate Theory of Deep Learning

    Xingrui Gu · PDF
  150. Timescale Separation in Sparse Dictionary Learning: Reconstruction Converges Before Reproducibility

    Bright Liu · PDF
  151. Too Sharp, Too Sure: When Calibration Follows Curvature

    Alessandro Morosini, Matea Gjika, Tomaso Poggio, Pierfrancesco Beneventano · PDF
  152. Towards Understanding Momentum Acceleration in River-Valley Loss Landscape

    Miao Lu, Zeyu Bian, Kaiyue Wen, Beining Wu, Siyu Chen, Tianhao Wang, Zhiyuan Li · PDF
  153. Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

    Gautam Goel, Mahdi Soltanolkotabi, Peter Bartlett · PDF
  154. Training Transformers for KV Cache Compressibility

    Yoav Gelberg, Yam Eitan, Michael M. Bronstein, Yarin Gal, Haggai Maron · PDF
  155. Transformers Can Learn Multiclass Classification In-Context: Isotropy Governs Generalization

    Daehan Yoon, Chulhee Yun · PDF
  156. Understanding Clipping in Zeroth-order Optimization

    Saket Gollapudi, Elisa Bertino, Sewoong Oh · PDF
  157. Understanding Feature Learning Dynamics in Isotropic Regularizers via BHEP Statistics

    Jeongyou Lee, Sungyoon Lee · PDF
  158. Understanding Polyak's Momentum in Deep Learning May Require Rethinking Non-Convex Optimization

    Donghwa Kim, Chulhee Yun · PDF
  159. Uniform Spectral Growth under Factor-wise Muon Orthogonalization in Matrix Factorization and LoRA

    Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun · PDF
  160. Weight Anisotropy in Mean-Field Theory: Learning on Isotropic Data

    Niclas Alexander Göring, Chris Mingard, Yoonsoo Nam, Jake Reid, Ard A. Louis · PDF
  161. What it means by learning in a neural network: easing the knot

    Atiya Ibnat Tasnim, Rushmila Shehreen Khan, Md. Shahriar Karim · PDF
  162. When and Why Grouping Attention Heads Accelerates Muon Optimization

    Hongtao Zhang, WenJie Zhou, Wei Chen, Xueqi Cheng · PDF
  163. Why Adversarial Diffusion Trains More Stably Than GANs: A Local Jacobian View

    Florian Ochs · PDF
  164. Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

    Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein · PDF
  165. Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

    Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana · PDF
  166. Why Routers Freeze: Infinite Width Learning Dynamics for Mixture of Experts

    Anish Dhir, Volkan Cevher, Leena Chennuru Vankadara · PDF
  167. Worker Disagreement Reveals Sharp Directions in Local SGD

    Tolga Dimlioglu, Kristi Topollai, Anna Ewa Choromanska · PDF