
High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning

HiLD at ICML 2024


Submission deadline: May 29, 2024, 04:30 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview

Accepted papers (73)

Fetched from OpenReview (v2) on 2026-06-10.

A Hessian-Aware Stochastic Differential Equation for Modelling SGD
Xiang Li, Zebang Shen, Liang Zhang, Niao He · PDF
A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention
Hugo Cui, Freya Behrens, Florent Krzakala, Lenka Zdeborova · PDF
A Random Matrix Analysis of Learning with Noisy Labels
Aymane El Firdoussi, Mohamed El Amine Seddik · PDF
A Unified Approach to Feature Learning in Bayesian Neural Networks
Noa Rubin, Zohar Ringel, Inbar Seroussi, Moritz Helias · PDF
A Universal Class of Sharpness-Aware Minimization Algorithms
Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet · PDF
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
Shuo Xie, Mohamad Amin Mohamadi, Zhiyuan Li · PDF
All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models
Charumathi Badrinath, Usha Bhalla, Alex Oesterling, Suraj Srinivas, Himabindu Lakkaraju · PDF
An exactly solvable model for emergence and scaling laws
Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, Ard A. Louis · PDF
Analysing feature learning of gradient descent using periodic functions
Jaehui Hwang, Taeyoung Kim, Hongseok Yang · PDF
Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training
Atli Kosson, Bettina Messmer, Martin Jaggi · PDF
Asymptotic Dynamics for Delayed Feature Learning in a Toy Model
Blake Bordelon, Tanishq Kumar, Samuel J. Gershman, Cengiz Pehlevan · PDF
Boundary between noise and information applied to filtering neural network weight matrices
Max Staats, Matthias Thamm, Bernd Rosenow · PDF
Closed form of the Hessian spectrum for some Neural Networks
Sidak Pal Singh, Thomas Hofmann · PDF
Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances
Marcel Kühn, Bernd Rosenow · PDF
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry · PDF
Deep Networks Always Grok and Here is Why
Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
Do Parameters Reveal More than Loss for Membership Inference?
Anshuman Suri, Xiao Zhang, David Evans · PDF
Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun · PDF
Early Period of Training Impacts Out-of-Distribution Generalization
Chen Cecilia Liu, Iryna Gurevych · PDF
Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution
Naoki Yoshida, Shogo Nakakita, Masaaki Imaizumi · PDF
Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
Moritz Haas, Jin Xu, Volkan Cevher, Leena Chennuru Vankadara · PDF
Emergent representations in networks trained with the Forward-Forward algorithm
Niccolo Tosato, Lorenzo Basile, Emanuele Ballarin, Giuseppe De Alteriis, Alberto Cazzaniga, Alessio Ansuini · PDF
Exploring the development of complexity over depth and time in deep neural networks
Hannah Pinson, Aurélien Boland, Vincent Ginis, Mykola Pechenizkiy · PDF
Expressivity of Neural Networks with Fixed Weights and Learned Biases
Ezekiel Williams, Avery Hee-Woon Ryoo, Thomas Jiralerspong, Alexandre Payeur, Matthew G Perich, Luca Mazzucato, Guillaume Lajoie · PDF
Feature Learning Dynamics under Grokking in a Sparse Parity Task
Javier Sanguino Bautiste, Gregor Bachmann, Bobby He, Lorenzo Noci, Thomas Hofmann · PDF
Fine-grained Analysis of In-context Linear Estimation
Yingcong Li, Ankit Singh Rawat, Samet Oymak · PDF
Fundamental limits of weak learnability in high-dimensional multi-index models
Emanuele Troiani, Yatin Dandi, Leonardo Defilippis, Lenka Zdeborova, Bruno Loureiro, Florent Krzakala · PDF
Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning
Daniel Kunin, Allan Raventos, Clémentine Carla Juliette Dominé, Feng Chen, David Klindt, Andrew M Saxe, Surya Ganguli · PDF
Gradient descent induces alignment between weights and the pre-activation tangents for deep non-linear networks
Daniel Beaglehole, Ioannis Mitliagkas, Atish Agarwala · PDF
Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks
Chenyang Zhang, Gao Peifeng, Difan Zou, Yuan Cao · PDF
Gradient Descent with Polyak’s Momentum Finds Flatter Minima via Large Catapults
Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun · PDF
Gradient Dissent in Language Model Training and Saturation
Andrei Mircea, Ekaterina Lobacheva, Irina Rish · PDF
Hidden Learning Dynamics of Capability before Behavior in Diffusion Models
Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, Hidenori Tanaka · PDF
How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?
Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen · PDF
How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion
Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu · PDF
How Truncating Weights Improves Reasoning in Language Models
Lei Chen, Joan Bruna, Alberto Bietti · PDF
InfoNCE: Identifying the Gap Between Theory and Practice
Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel · PDF
Interpolated-MLPs: Controllable Inductive Bias
Sean Wu, Jordan Hong, keybai, Gregor Bachmann · PDF
Landscaping Linear Mode Connectivity
Sidak Pal Singh, Linara Adilova, Michael Kamp, Asja Fischer, Bernhard Schölkopf, Thomas Hofmann · PDF
Latent functional maps
Marco Fumero, Marco Pegoraro, Valentino Maiorca, Francesco Locatello, Emanuele Rodolà · PDF
Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics
Alireza Mousavi-Hosseini, Denny Wu, Murat A Erdogdu · PDF
Linear Weight Interpolation Leads to Transient Performance Gains
Gaurav Iyer, Gintare Karolina Dziugaite, David Rolnick · PDF
Looking at Deep Learning Phenomena Through a Telescoping Lens
Alan Jeffares, Alicia Curth, Mihaela van der Schaar · PDF
Loss landscape geometry reveals stagewise development of transformers
George Wang, Matthew Farrugia-Roberts, Jesse Hoogland, Liam Carroll, Susan Wei, Daniel Murfet · PDF
Merging Text Transformer Models from Different Initializations
Neha Verma, Maha Elbayad · PDF
Neural collapse versus low-rank bias: Is deep neural collapse really optimal?
Peter Súkeník, Marco Mondelli, Christoph H. Lampert · PDF
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu · PDF
Neural Symmetry Detection for Learning Neural Network Constraints
Alex Gabel, Rick Quax, Stratis Gavves · PDF
Nonconvex Meta-optimization for Deep Learning
Xinyi Chen, Evan Dogariu, Zhou Lu, Elad Hazan · PDF
On the metastability of learning algorithms in physics-informed neural networks: a case study on Schrödinger operators
Alessandro Maria Selvitella · PDF
Probability Tools for Sequential Random Projection
Yingru Li · PDF
Progress Measures for Grokking on Real-world Tasks
Satvik Golechha · PDF
Provable Benefit of Cutout and CutMix for Feature Learning
Junsoo Oh, Chulhee Yun · PDF
Provable Tempered Overfitting of Minimal Nets and Typical Nets
Itamar Harel, William M. Hoza, Gal Vardi, Itay Evron, Nathan Srebro, Daniel Soudry · PDF
Random matrix theory analysis of neural network weight matrices
Matthias Thamm, Max Staats, Bernd Rosenow · PDF
Rank Minimization, Alignment and Weight Decay in Neural Networks
David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Henrique Pamplona Savarese, Gal Vardi, Karen Livescu, Michael Maire, Matthew Walter · PDF
ReLU Characteristic Activation Analysis
Wenlin Chen, Hong Ge · PDF
Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions
Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, Ludovic Stephan · PDF
SGD vs GD: Rank Deficiency in Linear Networks
Aditya Varre, Margarita Sagitova, Nicolas Flammarion · PDF
Simple, unified analysis of Johnson-Lindenstrauss with applications
Yingru Li · PDF
The Butterfly Effect: Tiny Perturbations Cause Neural Network Training to Diverge
Gül Sena Altıntaş, Devin Kwok, David Rolnick · PDF
The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof
Derek Lim, Theo Putterman, Robin Walters, Haggai Maron, Stefanie Jegelka · PDF
The Hidden Pitfalls of the Cosine Similarity Loss
Andrew Draganov, Sharvaree Vadgama, Erik J Bekkers · PDF
The Implicit Bias of Adam on Separable Data
Chenyang Zhang, Difan Zou, Yuan Cao · PDF
The optimization landscape of Spectral neural network
Chenghui Li, Rishi Sonthalia, Nicolas Garcia Trillos · PDF
Three Mechanisms of Feature Learning in an Analytically Solvable Model
Yizhou Xu, Liu Ziyin · PDF
Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models
Weihang Xu, Maryam Fazel, Simon Shaolei Du · PDF
u-μP: The Unit-Scaled Maximal Update Parametrization
Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Yuri Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr · PDF
Understanding Adversarially Robust Generalization via Weight-Curvature Index
Yuelin Xu, Xiao Zhang · PDF
Understanding Nonlinear Implicit Bias via Region Counts in Input Space
Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang · PDF
When Are Bias-Free ReLU Networks Like Linear Networks?
Yedi Zhang, Andrew M Saxe, Peter E. Latham · PDF
Where Do Large Learning Rates Lead Us? A Feature Learning Perspective
Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry Vetrov · PDF
Why Pruning and Conditional Computation Work: A High-Dimensional Perspective
Erdem Koyuncu · PDF
