High-dimensional Learning Dynamics 2026
HiLD at ICML 2026
- Submission deadline: May 12, 2026, 12:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (167)
Fetched from OpenReview (v2) on 2026-06-10.
- $\delta$-Regularized Gradient Clipping for Stable Optimization: Analysis and Empirical Evaluation
- A "feature ODE" describing the learning behavior of shallow MLPs on simple functions
- A $p$-adic Perspective on Low-Bit Training of Neural Networks
- A Compute-Matched Study of Hidden Layer Distillation for LLM Pre-Training
- A Coulomb Particle Model for Learning Kernel Attention in Transformers
- A Data-Scaling Sweet Spot in Structured Algorithmic Learning
- A Geometric Perspective on Stabilizing Value Conflict Resolution
- A Horizon-Dependent Intrinsic-Dimension Theory of Scaling for Biological Forecasting
- A loss curvature account of fine-tuning fragility
- A Quadratic Lens on Muon: Orthogonalization, Invariance, and Implicit Preconditioning
- A Simple and Efficient Measure of Loss Landscape Curvature
- Activation Functions Control Finite-Width Concentration in Wide Neural Networks
- Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
- AMUSE: Anytime Muon with Stable Gradient Evaluation
- Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
- Asymmetric Scaling Laws from Sparse Features
- Be Greedy, Stay Linear: Universally Robust Feature Engineering
- Beyond the Hessian Edge: The Stochastic Stability Cocycle of Mini-Batch SGD
- BLADE: Binary Learning via Algebraic Dual Estimation for the Exact Edge of Stability in 1-Bit Networks
- BReD: Stabilizing Quantized EMA Dynamics for Memory-Efficient Large-Scale Training
- Causal Volterra Dynamics of Mamba
- Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization
- Common Origins, Divergent Destinations: The Development of Cross-Layer Alignment Under GELU and SwiGLU
- Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods
- Compute-Optimal Scaling Laws for the Generalization Phase Transition in Grokking
- Compute-Optimal Training as Stochastic Optimal Control
- Continuous Sparsification via Minimizing Movement
- Critical Batch Size for LLM Policy Optimization
- Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
- Deep Learning as Neural Low-Degree Filtering: A Theory of Hierarchical Feature Learning
- DeltaMomentum: A Key-Value based Anisotropic Momentum Update via Delta Rule
- Depth scaling and Muon enable balanced expert usage in MoE training
- Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
- Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
- Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
- Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent
- Dimension-Free Scaling Laws for Invariant Score Matching
- DOSA: Dynamic Online State Allocation for Adaptive Optimizers via Per-Tensor Sketched Smoothness Tests
- Dynamics of Nonlinear Feature Learning in Two-Layer GCNs on XOR-CSBM
- DynMuon: A Dynamic Spectral Shaping View of Muon
- Early Alignment without Neural Collapse in Two-Layer ReLU Networks on Gaussian XOR
- Edge of Stability Selectively Shapes Learning Across the Data Distribution
- Effective Dimension Ratios under Symmetry Augmentation
- Effects of width-dependent model hyperparameters and $\ell_2$-regularization on the loss landscape of two-layer ReLU networks
- Efficient Clustering with Provable Guardrails for LLM Inference at Scale
- Empirical Model-Size Scaling for Neural PDE Solvers on the LQR-HJB Benchmark
- Explaining Data Mixing Scaling Laws
- Fast Learning Rate Transfer for Gradient Descent in Sketched Linear Regression
- Feature Learning in High-Dimensions under Structured Covariance: Scaling Laws in Quadratic Networks
- Fixed-Point Reasoning: Stable and Adaptive Deep Looped Models
- Generalization Analysis of Linear Knowledge Distillation
- Geometry, Not Scale Alone, Predicts Sparse Recovery of Causal Subspaces
- Global Linear Convergence of Inexact TD Under Generalized Smoothness
- Gradient Descent on Two ReLU Neurons: Global Landscape and Bifurcation Dynamics
- Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate
- Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
- High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory
- HORST: Composing Optimizer Geometries for Sparse Transformer Training
- Hourglass MLP: Rethinking the Shape of Residual Architectures
- How Cross-Entropy Learns Data Modes: Emergence and Implicit Bias in the Unconstrained Features Model
- How does feature learning change the function space evolution?
- How Does Orthogonalization Adapt to the Neural-Network Hessian Structure? A Gradient Self Outer-Product Analysis at Initialization
- How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?
- How Excess Latent Dimensionality Delays Memorization in Diffusion Models
- How the Hessian-Spectrum of Linear Networks Depends on Data
- How to Scale Mixture-of-Experts: From μP to the Maximally Scale-Stable Parameterization
- How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
- In-Context Benign Overfitting: A Feature-Selection Model in In-Context Linear Regression
- Internal Data Repetition Destroys Language Models
- Is your LLM a Sequence Model on the Training History? The Origins and Consequences of Anticipation
- KiteNorm: Variance Regularisation for Stable and Scalable Post-LN Transformers
- Layer Collapse in Diffusion Language Models
- Learnability and Competition in High-Dimensional Multi-Component ICA
- Learning Dynamics of LISP: A Gradient-Free Constraint-Satisfaction Family Containing Backpropagation
- Learning High-Dimensional Transient Neural Dynamics for Zero-Shot Cross-Subject Reconstruction
- Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
- Learning Rates Do Not Transfer Across Double Descent
- Learning with Synthetic Data via SGD in High-Dimensional Linear Regression
- Learning-Forgetting Optimality in Supervised Finetuning: A Cliff Perspective
- Lightweight Surrogate-Assisted Language Model Pretraining
- Linear Loss Classification: Efficient Training Through Neural Collapse
- LoRA-Lens: Training Induces Spectral Compression in Low-Rank Adapters
- Loss and Optimizer as Two Essential Mechanisms Behind Knowledge Distillation
- M-seq Initialization: Using Pseudo-Random Binary Sequences to Initialize Deep Neural Networks
- Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
- MIDUS: Memory-Infused Depth Up-Scaling
- Mini-batch Noise Lowers Sharpness via Dominant-Subspace Fluctuations
- Mode Collapse Emerges from Low-Rank Biases in the Learning Dynamics of Generative Models
- Model Behavior and Predictive Stability Under Severe Class Imbalance in High-Dimensional Classification
- Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
- Momentum Acceleration of Normalized Steepest Descent at the Edge of Stability
- NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama
- Neural Neural Scaling Laws
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
- Noise-driven escape from metastable phases explains grokking in deep neural networks
- Objective-Induced Conditional Mismatch in Sequence Diffusion Models
- On How Muon Reshapes Skill Learning Dynamics
- On Lipschitz Explosion in Deep Neural Networks with Normalization: Consequences for Optimization and Adversarial Robustness
- On the Convergence of Low-Precision LoRA Training
- On the Mean-field Analysis of Normalized Steepest Descent via Linear Minimization Oracles
- On the Optimizer Dependence of Neural Scaling Laws
- On the Surprising Effectiveness of Masking Updates in LLM Training
- Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
- Optimal learning rate scaling depends on data in deep scalar linear networks
- Optimal scaling laws in learning hierarchical multi-index models
- Optimal Scaling Needs Optimal Norm
- Optimistic Online Learning for Data Mixture Optimization
- Orthogonal Gradient Constraints Shape Noisy-Label Memorization Dynamics
- Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization
- Pathwise EMA: An Intrinsic Clock for Weight Averaging
- Physics-Guided Policy Optimization with Self-Distillation
- Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
- Practical Muon Accelerates Projected Feature Learning in Scaling-Law Models
- Predicting Cross-Domain RAG Retrieval Quality using Von Neumann Graph Entropy
- Provable Data Scaling Law for Meta Learning via Complexity Minimization
- Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent
- Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
- Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization
- The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR
- Rank Allocation in Low-Rank Optimizers
- Rank-One Potential Geometry for Normalized Optimizers
- Refresh-Scaling the Memory of Balanced Adam
- Regularizing Optimizer Updates via Feasible-Set Projection
- Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage
- Representation Stability in High-Dimensional Noisy Time Series via Koopman-Based Features
- Reset-and-Discard (ReD) Improves Coverage at every Budget under Inference Power-Law Scaling
- Rethinking Bregman Divergences in Kronecker-Factored Optimizers
- Reward-Aware Population Scaling of Evolutionary Strategies in LLM Fine-Tuning
- RRD: Routing-and-Residual Distillation for Efficient MoE Recovery in Large Language Models
- Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions
- Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
- Scaling Theory for SlowRunning: Model size, Ensembling, and Training Horizon in the Multi-Epoch Regime
- Scaling with Recursion in Masked Discrete Diffusion Models
- Self-Distillation for Data-Scarce Language Model Pretraining
- Self-Influence Governs Generalization: A von Mises Expansion Approach
- Sequential Correlations Change In-Context Learning: Effective Context Length and Architectural Mismatch
- Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
- Sharp Generalization for Shallow Neural Networks with Channel Attention
- Signal Frequency Imbalance and Ill-Conditioning
- Small for Small: Exploring Optimal Teacher in Knowledge Distillation with Limited Data
- Spectral Equalization Minimizes Total Training Energy: A Control-Theoretic Account of Muon's Advantage
- Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
- SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
- Stabilizing Continuous-Time Kolen–Pollack Learning with a Scale-Balance Condition
- Stochastic Gradient Descent on the Linear Bigram Model: Bias-Variance Scaling and Critical Batch Size
- Structure and Scale in Simplicial Sequence Modelling
- Task-Dependent Inference-Compute Scaling Frontiers: Diffusion vs. Autoregressive Language Models
- Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
- The Propagation Field: A Geometric Substrate Theory of Deep Learning
- Timescale Separation in Sparse Dictionary Learning: Reconstruction Converges Before Reproducibility
- Too Sharp, Too Sure: When Calibration Follows Curvature
- Towards Understanding Momentum Acceleration in River-Valley Loss Landscape
- Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
- Training Transformers for KV Cache Compressibility
- Transformers Can Learn Multiclass Classification In-Context: Isotropy Governs Generalization
- Understanding Clipping in Zeroth-order Optimization
- Understanding Feature Learning Dynamics in Isotropic Regularizers via BHEP Statistics
- Understanding Polyak's Momentum in Deep Learning May Require Rethinking Non-Convex Optimization
- Uniform Spectral Growth under Factor-wise Muon Orthogonalization in Matrix Factorization and LoRA
- Weight Anisotropy in Mean-Field Theory: Learning on Isotropic Data
- What it means by learning in a neural network: easing the knot
- When and Why Grouping Attention Heads Accelerates Muon Optimization
- Why Adversarial Diffusion Trains More Stably Than GANs: A Local Jacobian View
- Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Why Routers Freeze: Infinite Width Learning Dynamics for Mixture of Experts
- Worker Disagreement Reveals Sharp Directions in Local SGD