**Stochastic Smooth Nonconvex Optimization**

In progress. Upcoming: full sets of assumptions, convergence results and comparisons

**Notations**. Throughout this post, we denote step-sizes by \(\gamma\).

## Unconstrained Optimization

We consider unconstrained optimization problems of the form

\[\underset{x\in\mathbb{R}^m}{\mathrm{minimize}}\; F(x) \quad\text{with either}\quad \begin{cases} F(x)\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x) \quad \text{(finite-sum)}\\[0.4ex] F(x) \triangleq \mathbb{E}_{s\sim\mathbb{P}}\left[ f(x,s)\right]\quad \text{(stochastic)}\end{cases}\]

We begin by presenting some adaptive methods based on exponential moving averages, which are very popular in the deep learning community.

**ADAM** [Kingma and Ba, 2015]. The terms \(m_k\) and \(v_k\) are the exponential moving averages of the gradient and squared gradient, respectively. The parameter \(\epsilon\), originally introduced to avoid precision issues…
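In the notation above, a single ADAM iteration can be sketched as follows. This is a minimal NumPy sketch of the bias-corrected update from the original paper; the function name and the default hyper-parameters are illustrative choices.

```python
import numpy as np

def adam_step(x, g, m, v, k, gamma=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM iteration (illustrative sketch).

    m, v: exponential moving averages of the gradient g and of g**2;
    k: 1-based iteration counter, used for bias correction."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**k)          # bias-corrected first moment
    v_hat = v / (1 - beta2**k)          # bias-corrected second moment
    x = x - gamma * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

On a one-dimensional quadratic such as \(F(x)=x^2\), iterating this update with the exact gradient \(g_k = 2x_k\) drives the iterate toward the minimizer.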

However, ADAM does not come with general convergence guarantees. Even worse, non-convergence can occur in simple online convex settings with constant minibatch sizes (see the work of Reddi et al., 2018).

**YOGI** [Zaheer et al., 2018]. The YOGI method gets its name from the Sanskrit word yuj, meaning to add. It is a slight variant of ADAM in which the magnitude of the difference between \(v_{k+1}\) and \(v_k\) depends only on \(g_k^2\), as opposed to depending on both \(v_k\) and \(g_k^2\). Hence, when \(v_k\) is much larger than \(g_k^2\), ADAM can rapidly increase the effective learning rate, while YOGI does so in a controlled fashion.
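The only change with respect to ADAM is thus the update of \(v\). A minimal sketch, with the same illustrative naming conventions as above (the additive, sign-based \(v\)-update is the defining feature; the default hyper-parameters are illustrative):

```python
import numpy as np

def yogi_step(x, g, m, v, gamma=1e-2, beta1=0.9, beta2=0.999, eps=1e-3):
    """One YOGI iteration (illustrative sketch).

    Identical to ADAM except for v: the change v_{k+1} - v_k equals
    -(1 - beta2) * sign(v_k - g**2) * g**2, whose magnitude depends
    only on g**2, never on the size of v_k itself."""
    m = beta1 * m + (1 - beta1) * g
    v = v - (1 - beta2) * np.sign(v - g**2) * g**2
    x = x - gamma * m / (np.sqrt(v) + eps)
    return x, m, v
```

Note that when \(v_k > g_k^2\), the effective learning rate increases by at most \((1-\beta_2)g_k^2\) per step, which is the "controlled fashion" mentioned above.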

**SAG** [Roux et al., 2012]. The *Stochastic Average Gradient* method ..

**SCSG** [Lei et al., 2017]. The *non-convex Stochastically Controlled Stochastic Gradient* method ..

**SARAH** [Nguyen et al., 2017]. The *StochAstic Recursive grAdient algoritHm* …

**SPIDER** [Fang et al., 2018].

**SpiderBoost** [Wang et al., 2019].

**Geom-SARAH** [Horváth et al., 2020]. The *Geometrized stochastic recursive gradient* method, called *Geom-SARAH*, is a double-loop procedure that can be seen as a combination of the SCSG method and the SARAH biased gradient estimator. Note that it exploits a randomization technique based on the *geometric distribution*, which allows certain terms to telescope across the outer and inner loops, thereby simplifying the analysis of the algorithm.
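As a rough illustration of the double-loop structure, here is a sketch on a toy finite-sum quadratic. Everything here (the toy problem, the function name, and the parameters \(\gamma\), \(q\), \(b\)) is an illustrative choice, not the paper's exact tuned variant; the two defining ingredients are the geometrically distributed inner-loop length and the SARAH recursion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum problem: F(x) = (1/n) * sum_i (x - a_i)^2 / 2,
# whose minimizer is the mean of the a_i.
a = rng.normal(size=20)
grad_i = lambda x, i: x - a[i]              # gradient of the i-th component
full_grad = lambda x: x - a.mean()

def geom_sarah_epoch(x, gamma=0.3, q=0.2, b=2):
    """One outer iteration of a Geom-SARAH-style scheme (sketch): the
    inner-loop length is drawn from a geometric distribution, and v is
    the SARAH recursive (biased) gradient estimator."""
    v = full_grad(x)                        # anchor: full gradient at the snapshot
    for _ in range(rng.geometric(q)):       # geometrically distributed length
        x_new = x - gamma * v
        idx = rng.integers(a.size, size=b)  # minibatch of components
        # SARAH recursion: v_t = g_B(x_t) - g_B(x_{t-1}) + v_{t-1}
        v += (np.mean([grad_i(x_new, i) for i in idx])
              - np.mean([grad_i(x, i) for i in idx]))
        x = x_new
    return x
```

Iterating `geom_sarah_epoch` from any starting point drives the iterate toward the minimizer of this toy objective.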

**PAGE** [Li et al., 2021].

## Constrained Optimization

We now consider constrained optimization problems of the form

\[\underset{x\in\Omega}{\mathrm{minimize}}\; F(x) \quad\text{with either}\quad \begin{cases} F(x)\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x) \quad \text{(finite-sum)}\\[0.4ex] F(x) \triangleq \mathbb{E}_{s\sim\mathbb{P}}\left[ f(x,s)\right]\quad \text{(stochastic)}\end{cases}\]where \(\Omega\subseteq\mathbb{R}^{m}\) is convex and compact.

**SFW** [Reddi et al., 2016]. The *Stochastic Frank-Wolfe* algorithm …

**SVFW** [Reddi et al., 2016]. This algorithm can be seen as a *non-convex* variant of the *Stochastic Variance Reduced Frank-Wolfe* method devised in [Hazan and Luo, 2016]. As such, it is also epoch-based. At the end of each epoch, the full gradient is computed at the current iterate; this gradient is then used to control the variance of the stochastic gradients in the inner loop.
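The variance-reduction step can be sketched as follows over an \(\ell_1\)-ball constraint. The oracle `lmo_l1`, the function names, and the parameters \(T\), \(b\), \(\gamma\) are illustrative assumptions rather than the paper's exact schedule; the key ingredients are the SVRG-style gradient estimator and the projection-free linear-minimization update.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmo_l1(g, r=1.0):
    """Linear minimization oracle over the l1-ball of radius r:
    argmin over ||s||_1 <= r of <g, s> is a signed vertex."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -r * np.sign(g[i])
    return s

def svfw_epoch(x, grad_i, full_grad, n, T=20, b=4, gamma=0.1):
    """One epoch of an SVRG-style Frank-Wolfe scheme (sketch): the full
    gradient at the snapshot is reused to reduce the variance of the
    minibatch gradients along the inner loop."""
    x_tilde = x.copy()
    mu = full_grad(x_tilde)                   # full gradient at the snapshot
    for _ in range(T):
        idx = rng.integers(n, size=b)
        # variance-reduced estimator: g_B(x) - g_B(x_tilde) + mu
        v = (np.mean([grad_i(x, i) for i in idx], axis=0)
             - np.mean([grad_i(x_tilde, i) for i in idx], axis=0) + mu)
        s = lmo_l1(v)                          # Frank-Wolfe direction
        x = x + gamma * (s - x)                # convex step: stays in the ball
    return x
```

Since each update is a convex combination of the current iterate and a point of the constraint set, the iterates remain feasible without any projection.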

## References

**[Fang et al., 2018]** C. Fang, C.J. Li, Z. Lin and T. Zhang. "SPIDER - Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2018)

**[Hazan and Luo, 2016]** E. Hazan and H. Luo. "Variance-Reduced and Projection-Free Stochastic Optimization". Proceedings of the International Conference on Machine Learning (ICML) (2016)

**[Horváth et al., 2020]** S. Horváth, L. Lei, P. Richtárik and M.I. Jordan. "Adaptivity of stochastic gradient methods for nonconvex optimization". Annual Workshop on Optimization for Machine Learning (OPT) (2020)

**[Kingma and Ba, 2015]** D.P. Kingma and J. Ba. "Adam - A Method for Stochastic Optimization". International Conference on Learning Representations (ICLR) (2015)

**[Lei et al., 2017]** L. Lei, C. Ju, J. Chen and M.I. Jordan. "Non-convex Finite-Sum Optimization Via SCSG Methods". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2017)

**[Li et al., 2021]** Z. Li, H. Bao, X. Zhang and P. Richtárik. "PAGE - A simple and optimal probabilistic gradient estimator for nonconvex optimization". Proceedings of the International Conference on Machine Learning (ICML) (2021)

**[Nguyen et al., 2017]** L.M. Nguyen, J. Liu, K. Scheinberg and M. Takáč. "SARAH - A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". Proceedings of the International Conference on Machine Learning (ICML) (2017)

**[Reddi et al., 2016]** S. Reddi, S. Sra, B. Poczos and A. Smola. "Stochastic Frank-Wolfe methods for nonconvex optimization". Annual Allerton Conference on Communication, Control, and Computing (2016)

**[Roux et al., 2012]** N. Roux, M. Schmidt and F. Bach. "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2012)

**[Wang et al., 2019]** Z. Wang, K. Ji, Y. Zhou, Y. Liang and V. Tarokh. "SpiderBoost and Momentum - Faster Variance Reduction Algorithms". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)

**[Zaheer et al., 2018]** M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar. "Adaptive Methods for Nonconvex Optimization". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2018)