OPT 2024: Optimization for Machine Learning
NeurIPS 2024 Workshop
- Submission deadline: Sep 28, 2024, 12:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (106)
Fetched from OpenReview (v2) on 2026-06-10.
- $\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
- A Continuous Variable Optimization method for the Quadratic Assignment Problem
- A fast and efficient randomized quasi-Newton method
- A Stochastic Algorithm for Sinkhorn Distance-Regularized Distributionally Robust Optimization
- A theoretical study of the $(L_0,L_1)$-smoothness condition in deep learning
- A Unified Convergence Theory for Large Language Model Efficient Fine-tuning
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training
- Adaptive Partitioning Schemes for Black-Box Optimization
- Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models
- AdEMAMix: Better and Faster Training with Older Gradients
- Aggregating Data for Optimal and Private Learning
- Aligned Multi-Objective Optimization
- Amplitude Modulated Riemannian Optimization for QAP
- An Elementary Predictor Obtaining $2\sqrt{T}$ Distance to Calibration
- Applications of fractional calculus in learned optimization
- Batch size invariant Adam
- BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks
- Communication-efficient Algorithms Under Generalized Smoothness Assumptions
- Communication-Efficient Loss Minimization over Heterogeneous Data with Federated Hierarchical Ensemble Aggregation via Distillation
- Connections between Schedule-Free SGD, Accelerated SGD Variants, and Weight Averaging
- Consensus Based Optimization Accelerates Gradient Descent
- Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks
- DADA: Dual Averaging with Distance Adaptation
- Deconstructing What Makes a Good Optimizer for Language Models
- Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
- Differentially Private Random Block Coordinate Descent
- Dimensionality Reduction Techniques for Global Bayesian Optimisation
- Discrete-Continuous Variational Optimization with Local Gradients
- DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction
- Distributionally Robust Linear Regression With Block Lewis Weights
- Don't Be So Positive: Negative Step Sizes in Second-Order Methods
- Dual Feature Reduction for the Sparse-Group Lasso and its Adaptive Variant
- Dueling in the Dark: An Efficient and Optimal Mirror Descent Approach for Online Optimization with Adversarial Preferences
- Efficient Levenberg-Marquardt for SLAM
- Estimating Vote Choice in U.S. Elections with Approximate Poisson-Binomial Logistic Regression
- Extra-Gradient and Optimistic Gradient Descent Converge in Iterates Faster than $O(1/\sqrt{T})$ in All Monotone Lipschitz Variational Inequalities
- Fast Convergence of Softmax Policy Mirror Ascent for Bandits & Tabular MDPs
- Fast decentralized gradient tracking for federated learning with local updates: From mini to minimax optimization
- From Gradient Clipping to Normalization for Heavy Tailed SGD
- Glocal Smoothness: Line Search can really help!
- Graph Neural Networks for Hyperparameter Inference in Ising Solvers
- Hierarchical Simplicity Bias of Neural Networks
- High Dimensional First Order Mini-Batch Algorithms on Quadratic Problems
- How Does Critical Batch Size Scale in Pre-training?
- Improving Deep Learning Speed and Performance through Synaptic Neural Balance
- In the Search for Optimal Portfolios of Counterstrategies in the Large Imperfect Information Games
- Incentivizing Truthful Collaboration in Heterogeneous Federated Learning
- Intuitive Analysis of the Quantization based Optimization : From establishing a SDE to Quantum Mechanical Perspective
- Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials
- Learning Morphisms with Gauss-Newton Approximation for Growing Networks
- Linear Attention Sequence Parallelism
- Lion's sign noise can make training more stable
- Local Curvature Descent: Squeezing More Curvature out of Standard and Polyak Gradient Descent
- LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression
- Memory Efficient Adaptive Stochastic Optimization via Subset-Norm
- Memory-Efficient Large Language Model (LLM) Training and Fine-Tuning via Gradient Subspace Tracking
- MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times
- Modularity aided consistent attributed graph clustering via coarsening
- Multi Objective Regionalized Bayesian Optimization via Entropy Search
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time
- Multimodal Federated Learning with Model Personalization
- Neural Entropic Multimarginal Optimal Transport
- Neural Networks with Complex-Valued Weights Have No Spurious Local Minima
- Nonlinear tomographic reconstruction via nonsmooth optimization
- Nonmonotone Line Searches Operate at the Edge of Stability
- Normalization Matters for Optimization Performance on Graph Neural Networks
- Old Optimizer, New Norm: An Anthology
- On the Convergence of DP-SGD with Adaptive Clipping
- On the Convergence of FedProx with Extrapolation and Inexact Prox
- On the Crucial Role of Initialization for Matrix Factorization
- On the Hardness of Meaningful Local Guarantees in Nonsmooth Nonconvex Optimization
- On the Hypomonotone Class of Variational Inequalities
- On the Inherent Privacy of Two Point Zeroth Order Projected Gradient Descent
- Online Nonconvex Bilevel Optimization with Bregman Divergences
- Optimal Transport for Probabilistic Circuits
- Optimizing Attention
- Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior
- Path Integral Optimiser: Global Optimisation via Neural Schrödinger-Föllmer Diffusion
- Personalized Federated Learning via Low-Rank Matrix Factorization
- Policy Optimization for Strictly Batch Imitation Learning
- Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training
- Remove Symmetries to Control Model Expressivity and Improve Optimization
- Revisiting the Initial Steps in Adaptive Gradient Descent Optimization
- Role of Parametrization in Learning Dynamics of Recurrent Neural Networks
- Scalable Second-Order Optimization Algorithms for Minimizing Low-rank Functions
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
- Second-Order Forward-Mode Automatic Differentiation for Optimization
- SICNN: Sparsity-induced Input Convex Neural Network for Optimal Transport
- Simple and Scalable Federated Learning with Uncertainty via Improved Variational Online Newton
- SOAP: Improving and Stabilizing Shampoo using Adam
- Solving hidden monotone variational inequalities with surrogate losses
- SPAM: Stochastic Proximal Point Method with Momentum Variance Reduction for Nonconvex Cross-Device Federated Learning
- Spurious Stationarity and Hardness Results for Mirror Descent
- Statistical Inference in Latent Convex Objectives with Stream Data
- Stochastic Proximal Point Methods for Monotone Inclusions under Expected Similarity
- Stochastic Quasi-Variational Inequalities: Convergence Analysis Beyond Strong Monotonicity
- Structured Regularization on the SPD Manifold
- Tensor-GaLore: Memory-Efficient Training via Gradient Tensor Decomposition
- The Crucial Role of Samplers in Online Direct Preference Optimization
- The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization
- Tight Lower Bounds and Improved Convergence in Performative Prediction
- u-$\mu$P: The Unit-Scaled Maximal Update Parametrization
- Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback
- Understanding Adam Requires Better Rotation Dependent Assumptions
- WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average
- Weak to Strong Learning from Aggregate Labels