
NeurIPS 2024 Workshop on Mathematics of Modern Machine Learning

M3L


Submission deadline: Oct 2, 2024, 19:00 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (81)

Fetched from OpenReview (v2) on 2026-06-10.

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
William Merrill, Ashish Sabharwal · PDF
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen · PDF
A Theoretical Framework for Federated Domain Generalization with Gradient Alignment
Mahdiyar Molahasani, Milad Soltany, Farhad Pourpanah, Michael Greenspan, Ali Etemad · PDF
A Theory of Initialisation's Impact on Specialisation
Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli · PDF
Accumulating Data Avoids Model Collapse
Joshua Kazdan, Apratim Dey, Rylan Schaeffer, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo · PDF
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael Jordan, Song Mei · PDF
Adversarial Attacks as Near-Zero Eigenvalues in the Empirical Kernel of Neural Networks
Ouns El Harzli, Bernardo Cuenca Grau · PDF
Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data
Binghui Li, Yuanzhi Li · PDF
Algorithmic Stability of Minimum-Norm Interpolating Deep Neural Networks
Ouns El Harzli, Yoonsoo Nam, Ilja Kuzborskij, Bernardo Cuenca Grau, Ard A. Louis · PDF
An empirical study of the $(L_0, L_1)$-smoothness condition
Y Cooper · PDF
Bayesian Treatment of the Spectrum of the Empirical Kernel in (Sub)Linear-Width Neural Networks
Ouns El Harzli, Bernardo Cuenca Grau · PDF
Benign Overfitting in Out-of-Distribution Generalization of Linear Models
Shange Tang, Jiayun Wu, Jianqing Fan, Chi Jin · PDF
Benign Overfitting in Single-Head Attention
Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi · PDF
Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training
Anchit Jain, Rozhin Nobahari, Aristide Baratin, Stefano Sarao Mannelli · PDF
Can Bayesian Neural Networks Make Confident Predictions?
Katharine Fisher · PDF
Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model
Siyu Chen, Beining Wu, Miao Lu, Zhuoran Yang, Tianhao Wang · PDF
Classifier-Free Guidance is a Predictor-Corrector
Arwen Bradley, Preetum Nakkiran · PDF
Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning
Alexey Rukhovich, Alexander Podolskiy, Irina Piontkovskaya · PDF
Comparing Implicit and Denoising Score-Matching Objectives
Artem Artemev, Ayan Das, Farhang Nabiei, Alberto Bernacchia · PDF
Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization
Matan Schliserman, Tomer Koren · PDF
Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets
Yuandong Tian · PDF
Continuous-Time Analysis of Adaptive Optimization and Normalization
Rhys Gould, Hidenori Tanaka · PDF
Convergence of Distributed Adaptive Optimization with Local Updates
Ziheng Cheng, Margalit Glasgow · PDF
Convergence Properties of Hyperbolic Neural Networks on Riemannian Manifolds
Nico Alvarado, Sebastian Burgos · PDF
Declarative characterizations of direct preference alignment algorithms
Kyle Richardson, Vivek Srikumar, Ashish Sabharwal · PDF
Depth Extrapolation of Decoders Trained on Nested Structures
Emile R Richard · PDF
Diffusion Model Learns Low-Dimensional Distributions via Subspace Clustering
Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu · PDF
Diffusion Models With Learned Adaptive Noise Processes
Subham Sekhar Sahoo, Aaron Gokaslan, Christopher De Sa, Volodymyr Kuleshov · PDF
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam · PDF
Does Machine Bring in Extra Bias in Learning? Approximating Discrimination Within Models Quickly
Yijun Bian, Yujie Luo, Ping Xu · PDF
Dynamics of Concept Learning and Compositional Generalization
Yongyi Yang, Core Francisco Park, Ekdeep Singh Lubana, Maya Okawa, Wei Hu, Hidenori Tanaka · PDF
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause · PDF
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin · PDF
Exploring Task Affinities through NTK Alignment and Early Training Dynamics in Multi-Task Learning
Yoann Morello, Emilie Gregoire, Sam Verboven · PDF
Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks
Nikolaos Tsilivis, Gal Vardi, Julia Kempe · PDF
From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks
Clémentine Carla Juliette Dominé, Nicolas Anguita, Alexandra Maria Proca, Lukas Braun, Daniel Kunin, Pedro A. M. Mediano, Andrew M Saxe · PDF
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency
Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang · PDF
Geometric Deep Learning with Quasiconformal Neural Networks: An Introduction
Nico Alvarado, Hans Lobel · PDF
Harnessing the Power of Vicinity-Informed Analysis for Classification under Covariate Shift
Mitsuhiro Fujikawa, Youhei Akimoto, Jun Sakuma, Kazuto Fukuchi · PDF
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner, Alan Milligan, Robin Yadav, Mark Schmidt, Alberto Bietti · PDF
HERTA: A High-Efficiency and Rigorous Training Algorithm for Unfolded Graph Neural Networks
Yongyi Yang, Jiaming Yang, Wei Hu, Michal Derezinski · PDF
How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework
Yinuo Ren, Haoxuan Chen, Grant M. Rotskoff, Lexing Ying · PDF
How do students become teachers: A dynamical analysis for two-layer neural networks
Zhenyu Zhu, Fanghui Liu, Volkan Cevher · PDF
Implicit Bias of Adam versus Gradient Descent in One-Hidden-Layer Neural Networks
Bhavya Vasudeva, Vatsal Sharan, Mahdi Soltanolkotabi · PDF
Improving the Gaussian Approximation in Neural Networks: Para-Gaussians and Edgeworth Expansions
Mihai Nica, Janosch Ortmann · PDF
In-Context Learning by Linear Attention: Exact Asymptotics and Experiments
Yue Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan · PDF
Increasing Fairness via Combination with Learning Guarantees
Yijun Bian, Kun Zhang · PDF
Information-Theoretic Foundations for Neural Scaling Laws
Hong Jun Jeon, Benjamin Van Roy · PDF
Information-Theoretic Generalization Bounds for Batch Reinforcement Learning
Xingtu Liu · PDF
Label Noise: Ignorance Is Bliss
Yilun Zhu, Jianxin Zhang, Aditya Gangrade, Clayton Scott · PDF
Leveraging Intermediate Neural Collapse with Simplex ETFs for Efficient Deep Neural Networks
Emily Liu · PDF
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Yuda Song, Hanlin Zhang, Carson Eisenach, Sham M. Kakade, Dean Foster, Udaya Ghai · PDF
Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error
Ally Yalei Du, Lin Yang, Ruosong Wang · PDF
Mixture of Parrots: Mixtures of experts improve memorization more than reasoning
Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach · PDF
On the Implicit Relation between Low-Rank Adaptation and Differential Privacy
Saber Malekmohammadi, Golnoosh Farnadi · PDF
On Your Mark, Get Set, Warmup!
Dayal Singh Kalra, Maissam Barkeshli · PDF
Optimal Protocols for Continual Learning via Statistical Physics and Control Theory
Francesco Mori, Stefano Sarao Mannelli, Francesca Mignacco · PDF
Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression
Juno Kim, Dimitri Meunier, Arthur Gretton, Taiji Suzuki, Zhu Li · PDF
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan · PDF
Optimizing Fine-Tuning Efficiency: Gradient Subspace Tracking on Grassmann Manifolds for Large Language Models
Sahar Rajabi, Sirisha Rambhatla · PDF
Parameter Symmetry and Noise Equilibrium of Stochastic Gradient Descent
Liu Ziyin, Mingze Wang, Hongchao Li, Lei Wu · PDF
Progressive distillation induces an implicit curriculum
Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel · PDF
Provable unlearning in topic modeling and downstream tasks
Stanley Wei, Sadhika Malladi, Sanjeev Arora, Amartya Sanyal · PDF
Provable weak-to-strong generalization via benign overfitting
David Xing Wu, Anant Sahai · PDF
Robust Feature Learning for Multi-Index Models in High Dimensions
Alireza Mousavi-Hosseini, Adel Javanmard, Murat A Erdogdu · PDF
Sample compression unleashed: New generalization bounds for real valued losses
Mathieu Bazinet, Valentina Zantedeschi, Pascal Germain · PDF
Self-Improvement in Language Models: The Sharpening Mechanism
Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy · PDF
SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network
Tomer Galanti, Zachary S Siegel, Aparna Gupte, Tomaso A Poggio · PDF
Simple and Effective Masked Diffusion Language Models
Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Yair Schiff, Edgar Mariano Marroquin, Justin T Chiu, Alexander M Rush, Volodymyr Kuleshov · PDF
The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi, Runlong Zhou, Simon Shaolei Du · PDF
The GAN is dead; long live the GAN! A Modern GAN Baseline
Nick Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin · PDF
Towards characterizing the value of edge embeddings in Graph Neural Networks
Dhruv Rohatgi, Tanya Marwah, Zachary Chase Lipton, Jianfeng Lu, Ankur Moitra, Andrej Risteski · PDF
Towards Principled Graph Transformers
Luis Müller, Daniel Kusuma, Blai Bonet, Christopher Morris · PDF
Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study
Pengfei He, Yingqian Cui, Han Xu, Hui Liu, Makoto Yamada, Jiliang Tang, Yue Xing · PDF
Transformers are Efficient Compilers, Provably
Xiyu Zhai, Runlong Zhou, Liao Zhang, Simon Shaolei Du · PDF
Transformers Provably Solve Parity Efficiently with Chain of Thought
Juno Kim, Taiji Suzuki · PDF
Understanding Diffusion-based Representation Learning via Low-Dimensional Modeling
Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu · PDF
Understanding Factual Recall in Transformers via Associative Memories
Eshaan Nichani, Jason D. Lee, Alberto Bietti · PDF
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin · PDF
Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos
Dayal Singh Kalra, Tianyu He, Maissam Barkeshli · PDF
Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
Riccardo Grazzi, Julien Siems, Jörg K.H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil · PDF
