Awesome-Optimizers

Welcome to Awesome Optimizers, a meticulously curated collection of optimization algorithms implemented in PyTorch, designed to serve the diverse needs of the machine learning research community.

If this repository has been helpful to you, please consider giving it a ⭐️ to show your support. Your support helps us reach more researchers and contributes to the growth of this resource. Thank you! ☺️

Introduction

In the domain of machine learning, the selection of an appropriate optimizer is of equal significance to the architectural design of the model itself.

We have meticulously assembled a compendium of preeminent optimizers from the recent scholarly landscape, complemented by lucid roadmaps and pedagogical tutorial notebooks. Our insights, garnered from active engagement in pertinent projects, further enrich this repository. We warmly welcome contributions to this project, including the latest optimizers, tutorial notebooks, or any other valuable resources that can benefit researchers in our community.

We present a meticulously curated roadmap of optimizers, as depicted in the

This roadmap is continuously updated to reflect the latest advancements. Should you identify any errors or omissions in this repository, please do not hesitate to open an issue or submit a pull request. Our ongoing survey is in the process of being updated, and this represents the most current iteration.

For those seeking to explore the interconnections among relevant papers, we recommend utilizing Connected Papers, a tool that visualizes the academic landscape through a graph representation. To export a paper's BibTeX citation, consult the paper's arXiv or Semantic Scholar entry for a professionally formatted reference.

Awesome Optimizers

Awesome Optimizers List

Here is a list of some popular optimizers and their corresponding papers:

Optimizer Name	Paper	Year	Advantages
SGD	On the importance of initialization and momentum in deep learning	1999	Simple and effective; foundational for many other optimizers.
Rprop	Rprop - A Fast Adaptive Learning Algorithm	2000	Adaptive step sizes per parameter; fast convergence for small networks.
AdaGrad	Adaptive Subgradient Methods for Online Learning and Stochastic Optimization	2011	Adaptive learning rates; effective for sparse data.
RMSprop	Lecture 6.5 - rmsprop, COURSERA: Neural Networks for Machine Learning	2012	Adaptive learning rates; suitable for non-stationary objectives.
AdaDelta	ADADELTA: An Adaptive Learning Rate Method	2012	Adaptive learning rates without manual tuning; addresses AdaGrad's diminishing learning rates.
Adam	Adam: A Method for Stochastic Optimization	2014	Combines best properties of AdaGrad and RMSprop; widely used and effective.
LARS	Large Batch Training of Convolutional Networks	2017	Enables large batch training with stability; improves training efficiency.
AdamW	Decoupled Weight Decay Regularization	2017	Fixes weight decay regularization in Adam; improves generalization.
SWATS	Improving Generalization Performance by Switching from Adam to SGD	2017	Hybrid approach combining Adam and SGD; improves generalization.
Shampoo	Shampoo: Preconditioned Stochastic Tensor Optimization	2018	Preconditions stochastic tensor optimization; improves convergence.
QHAdam	Quasi-hyperbolic momentum and Adam for deep learning	2018	Combines quasi-hyperbolic terms with Adam; balances momentum and adaptivity.
QHM	Quasi-hyperbolic momentum and Adam for deep learning	2018	Introduces quasi-hyperbolic momentum; balances Nesterov momentum and SGD.
Yogi	Adaptive Methods for Nonconvex Optimization	2018	Improved update rule for adaptive methods; handles nonconvex optimization better.
AdaFactor	AdaFactor: Adaptive Learning Rates with Sublinear Memory Cost	2018	Reduces memory usage compared to Adam; suitable for large models.
AggMo	Aggregated Momentum: Stability Through Passive Damping	2018	Uses multiple momentum terms; improves stability and convergence.
PID	A PID Controller Approach for Stochastic Optimization of Deep Networks	2018	Employs PID control principles; improves convergence and stability.
AccSGD	Accelerating Stochastic Gradient Descent via Online Learning to Learn	2018	Accelerates SGD by learning to adapt the learning rate online.
AdaBound	Adaptive Gradient Methods with Dynamic Bound of Learning Rate	2019	Bounds the learning rate dynamically; combines benefits of adaptive and SGD methods.
LAMB	Large Batch Optimization for Deep Learning: Training BERT in 76 minutes	2019	Enables large batch training for BERT; improves training efficiency.
Lookahead	Lookahead Optimizer: k steps forward, 1 step back	2019	Combines with other optimizers to improve convergence and stability.
RAdam	On the Variance of the Adaptive Learning Rate and Beyond	2019	Rectifies variance of the adaptive learning rate; improves stability.
AdaMod	AdaMod: An Adaptive Momentum Method for Stochastic Gradient Descent	2019	Modulates the momentum term adaptively; improves stability and convergence.
Ranger	Ranger: A Hybrid Optimizer for Deep Learning	2019	Combines RAdam and Lookahead; improves convergence and generalization.
NAdam	Incorporating Nesterov Momentum into Adam	2019	Combines Nesterov momentum with Adam; improves convergence.
NovoGrad	Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks	2019	Uses layer-wise adaptive moments; efficient for deep networks.
DiffGRAD	DiffGrad: An Optimization Method for Convolutional Neural Networks	2019	Differentiates the gradient history; improves convergence.
Adahessian	ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning	2020	Uses Hessian information adaptively; suitable for nonconvex optimization.
AdaBelief	AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients	2020	Adapts stepsizes based on the belief in observed gradients; improves convergence.
AdamP	Slowing Down the Weight Norm Increase in Momentum-based Optimizers	2020	Mitigates weight norm increase; improves generalization.
SGDP	Slowing Down the Weight Norm Increase in Momentum-based Optimizers	2020	Prevents excessive weight norm increase; improves stability.
Apollo	Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization	2020	Adaptive quasi-Newton method; efficient for nonconvex optimization.
SAM	Sharpness-Aware Minimization for Efficiently Improving Generalization	2020	Minimizes sharpness of the loss landscape; improves generalization.
MADGRAD	Adaptive Gradient Methods with Dynamic Bound of Learning Rate	2021	Dynamically bounds the learning rate; improves stability.
LION	LION: Lévy-inspired Optimizer for Deep Learning	2021	Inspired by Lévy flights; explores the loss landscape efficiently.
Adan	Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models	2022	Adaptive Nesterov momentum; faster optimization for deep models.
CAME	CAME: Confidence-guided Adaptive Memory Efficient Optimization	2023	Adaptive and memory-efficient; improves optimization with confidence guidance.
Sophia	Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training	2023	Scalable second-order optimizer; efficient for large-scale pre-training.
SVRG	A Coefficient Makes SVRG Effective	2023	Adaptive and memory-efficient, enhancing optimization with a confidence-guided adjustment of variance reduction strength.
Prodigy	Prodigy: An Expeditiously Adaptive Parameter-Free Learner	2024	Prodigy is an adaptive, parameter-free optimizer that dynamically adjusts the learning rate by accurately estimating the distance to the optimal solution, significantly improving convergence speed and solution quality.
Adam-mini	Adam-mini: Use Fewer Learning Rates To Gain More	2024	Reduces the number of learning rates; simplifies hyperparameter tuning.

Optimizer Paradigm Definition

Click to Expand

Algorithm: General Algorithm of Optimizer for DNNs

Input:

DNN parameters $\theta = {\theta_l}_{l=1}^{L}$
Initial learning rate $\text{lr}$
Weight decays $\omega = {\omega_l}_{l=1}^{L}$
Loss function $\mathcal{L}$
Dataset $\mathcal{D}$

Initialization:

Parameters $\theta^{0} = {\theta_{l}^{0}}_{l=1}^{L}$
Learning rates ${\alpha_i^0}_{l=1}^{L} \leftarrow \text{lr}$

Procedure:

Our Latest Work: Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Click to Expand

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Siyuan Li^1,2*, Juanxi Tian^1*, Zedong Wang^1*, Luyuan Zhang¹, Zicheng Liu^1,2*,Weiyang Jin¹, Yang Liu³, Baigui Sun³, Stan Z. Li^1†

¹Westlake University, ²Zhejiang University, ²Damo Academy

Abstract This paper delves into the interplay between vision backbones and optimizers, unvealing an inter-dependent phenomenon termed backbone-optimizer coupling bias (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with SGD families, while recent architectures like ViTs and ConvNeXt share a tight coupling with the adaptive learning rate ones. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions on backbones and optimizers, stimulate further explorations, and thereby contribute to more robust vision systems. The source code and models are publicly available.

Backbone-Optimizer Coupling Bias (BOCB) is a phenomenon we observed during the bench-marking, which arises from the intricate interplay between the design principles of vision backbones and the inherent properties of optimizers.

Code: https://github.com/Westlake-AI/Backbone-vs-Optimizer

Benchmark

Click to Expand

Benchmark of the universal Vision Backbone

To illustrate the performance differences of 20 optimizers across various vision backbones under optimal parameter settings, we have included the figure

and

These figures provides clear visual representation of how different optimizers perform in different scenarios.

Benchmark of the LMFlow

In LMFlow, I have contributed a customizable optimization fine-tuning feature, enabling you to freely select an optimizer for testing, with the intention to incorporate additional optimization techniques in forthcoming updates.
custom_optimizers Below is a comprehensive Benchmark table detailing the fine-tuning of GPT2 with various optimizers on the alpaca dataset (The default hyperparameter setting is num_epoch=0.1. For reference only.).

Optimizer Name	Train Loss
RMSprop	2.4016
LION-32bit	2.4041
Adam	2.4292
AdamP	2.4295
AdamW	2.4469
AdaFactor	2.4543
AdaBound	2.4547
AdamWScheduleFree	2.4677
Adan	2.5063
NAdam	2.5569
AdaBelief	2.5857
AdaMax	2.5924
RAdam	2.6104
AdaDelta	2.6298
AdaGrad	2.8657
Yogi	2.9314
NovoGrad	3.1071
Sophia	3.1517
LAMB	3.2350
LARS	3.3329
SGDScheduleFree	3.3541
SGDP	3.3567
SGD	3.3734

Four categories of optimizers

Click to Expand

I have categorized classic optimizers into four main types, as shown in the following image:

This classification helps in understanding the underlying principles and applications of these optimizers.

Recommended Hyperparameter Settings

Click to Expand

Herein, we present a set of recommended hyperparameters for use under various visual backbone scenarios. Please note that these suggestions are based on extensive research and empirical findings, and should be considered a useful starting point rather than a one-size-fits-all solution. As with any machine learning application, the optimal hyperparameters may vary depending on the specific characteristics of your dataset and the computational resources available to you. We encourage practitioners to use these recommendations as a foundation and to further fine-tune them to suit their unique requirements and constraints.

Contribution

The main maintainer is Juanxi Tian (@Juanxi Tian).

Future contributors are welcome, and feel free to send pull requests in hopes that Awesome-Optimizers can become a more mature toolbox in the machine learning community.

[back to top]

Name		Name	Last commit message	Last commit date
Latest commit History 98 Commits
Fig		Fig
Optimizers		Optimizers
Tutorial		Tutorial
example		example
.DS_Store		.DS_Store
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Awesome-Optimizers

Table of Contents

Introduction

Awesome Optimizers

Awesome Optimizers List

Optimizer Paradigm Definition

Our Latest Work: Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Benchmark

Benchmark of the universal Vision Backbone

Benchmark of the LMFlow

Four categories of optimizers

Recommended Hyperparameter Settings

Contribution

About

Releases

Packages

Contributors 2

Languages

License

tianshijing/Awesome-Optimizers

Folders and files

Latest commit

History

Repository files navigation

Awesome-Optimizers

Table of Contents

Introduction

Awesome Optimizers

Awesome Optimizers List

Optimizer Paradigm Definition

Our Latest Work: Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Benchmark

Benchmark of the universal Vision Backbone

Benchmark of the LMFlow

Four categories of optimizers

Recommended Hyperparameter Settings

Contribution

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages