Stochastic Smooth Nonconvex Optimization

15 November 2021 - 3 mins read time
Tags: Nonconvex Smooth Optimization

In progress. Upcoming: full sets of assumptions, convergence results and comparisons

Notations. Throughout this post, we denote by \(\gamma\) the step-sizes.

Unconstrained Optimization

We consider peculiar optimization problems of the form

\[\underset{x\in\mathbb{R}^m}{\mathrm{minimize}}\; F(x) \quad\text{with either}\quad \begin{cases} F(x)\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x) \quad \text{(finite-sum)}\\[0.4ex] F(x) \triangleq \mathbb{E}_{s\sim\mathbb{P}}\left[ f(x,s)\right]\quad \text{(stochastic)}\end{cases}\]

We begin by presenting some exponential moving average based adaptive methods, which are very popular in the deep learning community.

ADAM [Kingma and Ba, 2015] . The terms \(m_k\) and \(v_k\) are the exponential moving averages of the gradient and squared gradient, respectively. The parameter \(\epsilon\), originally introduced to avoid precision issues…

\[\begin{array}{l} x_0\in\mathbb{R}^{m}\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Draw a sample } s_k \text{ from } \mathbb{P}\\ g_k = \nabla f(x_k,s_k)\\ m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k\\ v_{k+1} = v_{k} - (1-\beta_2) (v_k - g_k^2)\\ x_{k+1} = x_k - \gamma_k \frac{m_k}{\sqrt{v_k} + \epsilon} \end{array}\right. \end{array}\]

However, ADAM do not benefits from convergence guarantees. Even worse, non-convergence can be found in simple online convex settings with constant minibatch sizes (see work of Reddi, 2016)

YOGI [Zaheer et al., 2018] . The YOGI method gets its name the Sanskrit word yuj meaning to add. It is a slight variant of ADAM where the magnitude of the difference of \(v_{k+1}\) and \(v_k\) in YOGI only depends on \(g_k^2\) as opposed to dependence on both \(v_k\) and \(g_k^2\). Henceforth, when \(v_k\) is much larger than \(g_k^2\), ADAM can rapidly increase the effective learning rate while YOGI does it in a controlled fashion.

\[\begin{array}{l} x_0\in\mathbb{R}^{m}\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Draw a sample } s_k \text{ from } \mathbb{P}\\ g_k = \nabla f(x_k,s_k)\\ m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k\\ v_{k+1} = v_{k} - (1-\beta_2) \mathrm{sign}(v_k - g_k^2) g_k^2\\ x_{k+1} = x_k - \gamma_k \frac{m_k}{\sqrt{v_k} + \epsilon} \end{array}\right. \end{array}\]

SAG [Roux et al, 2012] . The Stochastic Average Gradient method ..

SCSG [Lei et al., 2017] . The non-convex Stochastically Controlled Stochastic Gradient method ..

SARAH [Nguyen et al., 2017] . The StochAstic Recursive grAdient algoritHm …

SPIDER [Fang et al., 2018] .

SpiderBoost [Wang et al., 2019] .

Geom-SARAH [Hovath et al., 2020] . The Geometrized stochastic recursive gradient method, called Geom-SARAH, is a double-loop procedure which can be seen as a combination of the SCSG method and the SARAH biased gradient estimator. Note that, it exploits a randomization technique based on the geometric distribution which allows certain terms to telescope across the outer loop and the inner loop, hence simplifying the analysis of the algorithm.

\[\begin{array}{l} x_0^{(0)}\in\mathbb{R}^m, \text{ expected inner-loop queries } \{m_k\} \\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Sample a batch } J_k \text{ of size } b_k\\ v_{k+1}^{(0)} = \frac{1}{b_k}\sum_{i\in J_k} \nabla f_i(x_{k+1}^{(0)})\\ \text{Sample } M_k\sim\mathrm{Geom}(\eta_k)\quad\text{s.t.}\quad\mathbb{E} M_k = m_k/b_k\\ \text{for}\;m=0,1,\ldots,M_k-1\\[0.4ex] \left\lfloor\begin{array}{l} x_{k+1}^{(m+1)} = x_{k+1}^{(m)} - \gamma_k v_{k+1}^{(m)}\\ \text{Sample a batch } I_m \text{ of size } \tilde{b}_m\\ v^{(m+1)}_{k+1} = \frac{1}{\tilde{b}_m}\sum_{i\in I_m}\left( \nabla f_i(x_{k+1}^{(m+1)}) - f_i({x}_{k+1}^{(m)})\right) + v_{k+1}^{(m+1)}\\ \end{array}\right. \end{array}\right. \end{array}\]

PAGE [Li et al., 2021] .

Constrained Optimization

and where \(\Omega\subseteq\mathbb{R}^{m}\) is convex and compact.

SFW [Reddi et al., 2016] . The Stochastic Frank-Wolf algorithm …

\[\begin{array}{l} x_0^{(M)}\in\Omega\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Uniformly randomly pick samples} \{s_1,\ldots,s_{b_k}\} \text{ according to } \mathbb{P}\\ v_{k+1} = \mathrm{arg}\;\max_{v\in\Omega} \langle v, -\frac{1}{b_k}\sum_{i=1}^{b_k} \nabla f(x_k, s_i)\rangle\\ x_{k+1} = x_{k} + \rho_k (v_{k+1} - x_k) \end{array}\right. \end{array}\]

SVFW [Reddi et al., 2016] . This algorithm can be seen as a non-convex variant of the Stochastic Variance Reduced Frank-Wolf devised in [Hazan and Luo, 2016] . As such, it is also epoch-based. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is used for controlling the variance of the stochastic gradients in the inner loop.

\[\begin{array}{l} x_0^{(M)}\in\Omega\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \tilde{x}_{k} = x_k^{(M)}\\ \tilde{g}_k = \nabla F(\tilde{x}_k) = \frac{1}{n}\sum_{i=1}^n f(\tilde{x}_k)\\ \text{for}\;m=0,1,\ldots,M-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Uniformly pick batch } I_m \text{ (with replacement) of size } b_m\\ v^{(m)}_{k+1} = \mathrm{arg}\;\max_{v\in\Omega} \langle v, -\frac{1}{b_m}\left(\sum_{i\in I_m} \nabla f_i(x_{k+1}^{(m)}) - f_i(\tilde{x}_k) + \tilde{g}_k\right)\rangle\\ x_{k+1}^{(m+1)} = x_{k+1}^{(m)} + \rho_k (v_{k+1}^{(m)} - x_{k+1}^{(m)}) \end{array}\right. \end{array}\right. \end{array}\]

References

[Fang et al., 2018] C. Fang, C.J. Li, Z. Lin and T. Zhang. "SPIDER - Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2018)

[Hazan and Luo, 2016] E. Hazan and H. Luo. "Variance-Reduced and Projection-Free Stochastic Optimization". Proceedings of the International Conference on Machine Learning (ICML) (2016)

[Hovath et al., 2020] S. Horváth, L. Lei, P. Richtárik and M.I. Jordan. "Adaptivity of stochastic gradient methods for nonconvex optimization". Annual Workshop on Optimization for Machine Learning (OPT) (2020)

[Kingma and Ba, 2015] D.P. Kingma and J. Ba. "Adam - A Method for Stochastic Optimization". International Conference on Learning Representations (ICLR) (2015)

[Lei et al., 2017] L. Lei, C. Ju, J. Chen and M.I. Jordan. "Non-convex Finite-Sum Optimization Via SCSG Methods". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2017)

[Li et al., 2021] Z. Li, H. Bao, X. Zhang and P. Richtárik. "PAGE - A simple and optimal probabilistic gradient estimator for nonconvex optimization". Proceedings of the International Conference on Machine Learning (ICML) (2021)

[Nguyen et al., 2017] L.M. Nguyen, J. Liu, K. Scheinberg and M. Takáč. "SARAH- A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". Proceedings of the International Conference on Machine Learning (ICML) (2017)

[Reddi et al., 2016] S. Reddi, S. Sra, B. Poczos and A. Smola. "Stochastic Frank-Wolfe methods for nonconvex optimization". Annual Allerton Conference on Communication, Control, and Computing (2016)

[Roux et al, 2012] N. Roux, M. Schmidt and F. Bach. "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2012)

[Wang et al., 2019] Z. Wang, K. Ji, Y. Zhou, Y. Liang and V. Tarokh. "SpiderBoost and Momentum - Faster Variance Reduction Algorithms". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)

[Zaheer et al., 2018] M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar. "Adaptive Methods for Nonconvex Optimization". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2018)