High-dimensional Learning Dynamics 2025

HiLD at ICML 2025

Submission deadline: May 22, 2025, 15:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview

Accepted papers (84)

Fetched from OpenReview (v2) on 2026-06-10.

A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention
Nandan Kumar Jha, Brandon Reagen · PDF
A simple connection from loss flatness to compressed neural representations
Shirui Chen, Stefano Recanatesi, Eric Todd Shea-Brown · PDF
A solvable generative model with a linear, one-step denoiser
Indranil Halder · PDF
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Xinghan Li, Haodong Wen, Kaifeng Lyu · PDF
Adapting to High Dimensional Concepts with Metalearning
Max Gupta · PDF
Attention with Trained Embeddings Provably Selects Important Tokens
Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli · PDF
Bayes optimal learning of attention-indexed models
Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborova · PDF
Bayesian Influence Functions for Scalable Data Attribution
Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland · PDF
Benignity of loss landscape with weight decay requires both large overparametrization and initialization
Etienne Boursier, Matthew Bowditch, Matthias Englert, Ranko Lazic · PDF
Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping
Simone Bombari, Inbar Seroussi, Marco Mondelli · PDF
Catalyst: Structured Pruning with Robust Bifurcation Dynamics
Jaeheun Jung, Donghun Lee · PDF
Data Free Metrics Are Not Reparameterisation Invariant Under the Critical and Robust Layer Phenomena
Gabryel Mason-Williams, Israel Mason-Williams, Fredrik Dahlqvist · PDF
Data-Free Transformer Quantization Using Parameter-Space Symmetry
Lucas Laird, Bo Zhao, Rose Yu, Robin Walters · PDF
Different simultaneous mechanisms for in-context recall have distinct learning dynamics
Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai · PDF
Emergence of Hebbian Dynamics in Regularized Non-Local Learners
David Aaron Koplow, Tomaso Poggio, Liu Ziyin · PDF
Emergent Linear Separability of Unseen Data Points in High-dimensional Last-Layer Feature Space
Taehun Cha, Donghun Lee · PDF
Emergent Specialization: Rare Token Neurons in Language Models
Jing Liu, Haozheng Wang, Yueheng Li · PDF
Exact Learning of Permutations for Nonzero Binary Inputs with Logarithmic Training Size and Quadratic Ensemble Complexity
George Giapitzakis, Artur Back de Luca, Kimon Fountoulakis · PDF
Exploration Behavior of Untrained Policies
Jacob Adamczyk · PDF
Exploring L2-Phase Transitions on Error Landscapes
Ibrahim Talha Ersoy, Karoline Wiesner · PDF
Feature learning is decoupled from generalization in high capacity neural networks
Niclas Alexander Göring, Charles London, Abdurrahman Hadi Erturk, Chris Mingard, Yoonsoo Nam, Ard A. Louis · PDF
From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD
Konstantinos Christopher Tsiolis, Alireza Mousavi-Hosseini, Murat A Erdogdu · PDF
From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning
Junsoo Oh, Jerry Song, Chulhee Yun · PDF
Fundamental Limits of Learning Single-Index Models under Structured Data
Jivan Waber, Alireza Mousavi-Hosseini, Murat A Erdogdu · PDF
Generalisation and Safety Critical Evaluations at Sharp Minima: A Geometric Reappraisal
Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis · PDF
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik · PDF
Grokking and Generalization Collapse: Insights from HTSR theory
Hari Kishan Prakash, Charles H. Martin · PDF
How Compositional Generalization and Creativity Improve as Diffusion Models are Trained
Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart · PDF
How Transformers Get Rich: Training Dynamics Analysis
Mingze Wang, Ruoxi Yu, Weinan E, Lei Wu · PDF
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank Solutions
Baekrok Shin, Chulhee Yun · PDF
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data
Chen Fan, Mark Schmidt, Christos Thrampoulidis · PDF
In Search of Adam’s Secret Sauce
Antonio Orvieto, Robert M. Gower · PDF
Information-Geometric Neural Granger Causality
Pauline Bourigault, Danilo Mandic · PDF
Input differentiation via negative computation
Linghao Kong, Angelina Ning, Nir N Shavit · PDF
Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
Teodora Srećković, Jonas Geiping, Antonio Orvieto · PDF
Jacobian Alignment Explains Grokking and Centroid Alignment Identifies It
Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
Langevin Learning Dynamics in Lazy and Non-Lazy Wide Neural Networks
Yehonatan Avidan, Haim Sompolinsky · PDF
Latent Concept Disentanglement in Transformer-based Language Models
Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy · PDF
Learning curves theory of hierarchically compositional data with power-law distributed features
Francesco Cagnetta, Hyunmo Kang, Matthieu Wyart · PDF
Learning how to step in gradient-based optimization: beyond convexity and smoothness
Dravyansh Sharma · PDF
Low Rank Gradients and Where To Find Them
Rishi Sonthalia, Michael Murray, Guido Montufar · PDF
Lyapunov Learning at the Onset of Chaos
Alessandro Londei, Denise Lanzieri, Matteo Benati, Vittorio Loreto · PDF
Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník, Christoph H. Lampert, Marco Mondelli · PDF
New Evidence of the Two-Phase Learning Dynamics of Neural Networks
Zhanpeng Zhou, Yongyi Yang, Mahito Sugiyama, Junchi Yan · PDF
On Generalization of Spectral Gradient Descent: A Case Study on Imbalanced Data
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis · PDF
On the Existence of Hidden Subnetworks Within a Randomly Weighted Multi-Head Attention Mechanism
Hikari Otsuka, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura, Daiki Chijiwa · PDF
On the Interaction of Noise, Compression, and Adaptivity under $(L_0,L_1)$-Smoothness: An SDE Approach
Enea Monzio Compagnoni, Rustem Islamov, Antonio Orvieto, Eduard Gorbunov · PDF
On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD
Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan · PDF
On the Performance of Differentially Private Optimization with Heavy-Tail Class Imbalance
Qiaoyue Tang, Alain Zhiyanov, Mathias Lécuyer · PDF
Origins of Creativity in Attention Based Diffusion Models
Emma Lucia Byrnes Finn, T. Anderson Keller, Manos Theodosis, Demba E. Ba · PDF
Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy
Karthik Viswanathan, Sang Eon Park · PDF
Quantitative Bounds for Length Generalization in Transformers
Zachary Izzo, Eshaan Nichani, Jason D. Lee · PDF
Quantization and the Bottom of the Loss Landscape
Luca Di Carlo, Daniel T. Bernstein, David J. Schwab · PDF
Reactivation: Empirical NTK Dynamics Under Task Shifts
Yuzhi Liu, Zixuan Chen, Zirui Zhang, Yufei Liu, Giulia Lanzillotta · PDF
Reduce and Conquer: Independent Component Analysis at linear sample complexity
Fabiola Ricci, Lorenzo Bardone, Sebastian Goldt · PDF
Rethinking Memorization–Generalization Trade-Off in Generative Models
Jiseok Chae, Kyuwon Kim, Donghwan Kim · PDF
Revisiting the Goldilocks Zone in Inhomogeneous Networks
Zacharie Garnier Cuchet, Sarath Chandar, Ekaterina Lobacheva · PDF
Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting
Jiping Li, Rishi Sonthalia · PDF
Selective Prediction via Training Dynamics
Stephan Rabanser, Anvith Thudi, Kimia Hamidieh, Adam Dziedzic, Israfil Bahceci, Akram Bin Sediq, Hamza Sokun, Nicolas Papernot · PDF
Spectral Dynamics of Contrastive Learning with Spurious Features
Naghmeh Ghanooni, Dennis Wagner, Waleed Mustafa, Anthony Widjaja Lin, Sophie Fellenz, Marius Kloft · PDF
Studying Data Complexity and Learned Structure in Neural Networks with Bayesian Probes
Maxwell Adam, Zach Furman, Wilson Wu, Philipp Alexander Kreer, Jesse Hoogland · PDF
Symmetries in Weight Space Learning: To Retain or Remove?
Fynn Kiwitt, Behrooz Tahmasebi, Stefanie Jegelka · PDF
The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets
Yujun Kim, Chaewon Moon, Chulhee Yun · PDF
The Interplay Between Implicit Bias and Adversarial Robustness in Linear Convolutional Neural Networks
Aurélien Boland, Hannah Pinson · PDF
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
Vittorio Erba, Emanuele Troiani, Lenka Zdeborova, Florent Krzakala · PDF
The Price of Robustness: Stable Classifiers Need Overparameterization
Jonas von Berg, Adalbert Fono, Massimiliano Datres, Sohir Maskey, Gitta Kutyniok · PDF
The Shape of Generalization through the Lens of Norm-based Capacity Control
Yichen Wang, Yudong Chen, Lorenzo Rosasco, Fanghui Liu · PDF
The Silent Helper: How Implicit Regularization Enhances Group Robustness
Nahal Mirzaie, Mahdi Ghaznavi, Hosna Oyarhoseini, Alireza Alipanah, Erfan Sobhaei, Ali Abbasi, Amirmahdi Farzane, Hossein Jafarinia, Parsa Sharifi Sedeh, Arefe Boushehrian, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban · PDF
Theoretical Guarantees and Training Dynamics of Contrastive Learning: How Misaligned Data Influence Feature Purity
Jiawei Sun, Shuai Zhang, Hongkang Li, Meng Wang · PDF
Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun · PDF
Topology-Aware Robust Representation Balancing for Estimating Causal Effects
Amirhossein Farzam, Ahmed Aloui, Vahid Tarokh, Guillermo Sapiro · PDF
Towards an Optimal Control Perspective of ResNet Training
Jens Püttschneider, Simon Heilig, Asja Fischer, Timm Faulwasser · PDF
Towards Understanding Orthogonalization in Muon
Valentyn Boreiko, Zhiqi Bu, Sheng Zha · PDF
Tracing the representation geometry of language models from pretraining to post-training
Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Guillaume Lajoie, Blake Aaron Richards · PDF
Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang, Aaditya K Singh, Peter E. Latham, Andrew M Saxe · PDF
Two-point deterministic equivalence for SGD in random feature models
Alexander Atanasov, Blake Bordelon, Jacob A Zavatone-Veth, Courtney Paquette, Cengiz Pehlevan · PDF
Understanding Generalization in Diffusion Models via Probability Flow Distance
Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu · PDF
Understanding Lookahead Dynamics Through Laplace Transforms
Aniket Sanyal, Tatjana Chavdarova · PDF
Understanding Mamba in In-Context Learning with Outliers: A Theoretical Generalization Analysis
Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang · PDF
Understanding Normalization Layers for Sparse Training
Mohammed Adnan, Ekansh Sharma, Rahul Krishnan, Yani Ioannou · PDF
Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
Annalisa Belloni, Lorenzo Noci, Antonio Orvieto · PDF
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
Pulkit Gopalani, Wei Hu · PDF
When Can You Get Away with Low Memory Adam?
Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein · PDF
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective
Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A Erdogdu · PDF
