Stochastic Smooth Nonconvex Optimization
In progress. Upcoming: full sets of assumptions, convergence results, and comparisons.
Notation. Throughout this post, we denote by \(\gamma\) (or \(\gamma_k\)) the step sizes.
Unconstrained Optimization
We consider unconstrained optimization problems of the form
\[\underset{x\in\mathbb{R}^m}{\mathrm{minimize}}\; F(x) \quad\text{with either}\quad \begin{cases} F(x)\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x) \quad \text{(finite-sum)}\\[0.4ex] F(x) \triangleq \mathbb{E}_{s\sim\mathbb{P}}\left[ f(x,s)\right]\quad \text{(stochastic)}\end{cases}\]The finite-sum case can be seen as a special instance of the stochastic one by taking \(\mathbb{P}\) uniform over the indices \(\{1,\ldots,n\}\) and \(f(x,i)=F_i(x)\). We begin by presenting some exponential-moving-average-based adaptive methods, which are very popular in the deep learning community.
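Before getting to the algorithms, the following minimal NumPy sketch makes the two formulations concrete on a toy finite-sum least-squares objective, viewed both through its full gradient and through an unbiased minibatch estimate; the data, dimensions, and function names are illustrative assumptions, not part of the methods below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 5                        # number of components F_i, dimension of x
A = rng.normal(size=(n, m))          # toy data defining F_i(x) = 0.5 * (a_i^T x - b_i)^2
b = rng.normal(size=n)

def full_gradient(x):
    """Gradient of the finite-sum objective F(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2."""
    return A.T @ (A @ x - b) / n

def stochastic_gradient(x, batch_size=10):
    """Unbiased gradient estimate: average over a uniformly drawn minibatch of indices."""
    idx = rng.integers(n, size=batch_size)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch_size

x = np.zeros(m)
print(np.linalg.norm(full_gradient(x) - stochastic_gradient(x)))  # small, but nonzero
```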
ADAM [Kingma and Ba, 2015]. The terms \(m_k\) and \(v_k\) are exponential moving averages of the gradient and of the squared gradient, respectively. The parameter \(\epsilon\), originally introduced to avoid numerical precision issues…
\[\begin{array}{l} x_0\in\mathbb{R}^{m},\; m_0 = v_0 = 0\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Draw a sample } s_k \text{ from } \mathbb{P}\\ g_k = \nabla f(x_k,s_k)\\ m_{k+1} = \beta_1 m_{k} + (1-\beta_1) g_k\\ v_{k+1} = v_{k} - (1-\beta_2) (v_k - g_k^2)\\ x_{k+1} = x_k - \gamma_k \frac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon} \end{array}\right. \end{array}\]However, ADAM does not come with general convergence guarantees. Even worse, non-convergence can occur already in simple online convex settings with constant minibatch sizes (see [Reddi et al., 2018]).
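A minimal NumPy sketch of the update above, matching the pseudocode (which omits the bias correction of the original paper); the gradient oracle and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adam(grad_sample, x0, steps=1000, gamma=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM-style update: exponential moving averages of the gradient (m)
    and of the squared gradient (v), without bias correction."""
    x = x0.copy()
    m = np.zeros_like(x0)
    v = np.zeros_like(x0)
    for _ in range(steps):
        g = grad_sample(x)                      # stochastic gradient at x
        m = beta1 * m + (1 - beta1) * g         # first-moment EMA
        v = v - (1 - beta2) * (v - g**2)        # same as beta2*v + (1-beta2)*g**2
        x = x - gamma * m / (np.sqrt(v) + eps)  # coordinate-wise adaptive step
    return x

# usage with any stochastic gradient oracle, e.g. a noisy quadratic:
rng = np.random.default_rng(0)
x_final = adam(lambda x: 2 * x + 0.1 * rng.normal(size=x.shape), np.ones(5))
```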
YOGI [Zaheer et al., 2018]. The YOGI method gets its name from the Sanskrit word yuj, meaning to add. It is a slight variant of ADAM in which the magnitude of the difference \(v_{k+1}-v_k\) depends only on \(g_k^2\), as opposed to depending on both \(v_k\) and \(g_k^2\). Hence, when \(v_k\) is much larger than \(g_k^2\), ADAM can rapidly increase the effective learning rate, whereas YOGI does so in a controlled fashion (a short numerical sketch of the two updates is given below).
\[\begin{array}{l} x_0\in\mathbb{R}^{m},\; m_0 = v_0 = 0\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Draw a sample } s_k \text{ from } \mathbb{P}\\ g_k = \nabla f(x_k,s_k)\\ m_{k+1} = \beta_1 m_{k} + (1-\beta_1) g_k\\ v_{k+1} = v_{k} - (1-\beta_2)\, \mathrm{sign}(v_k - g_k^2)\, g_k^2\\ x_{k+1} = x_k - \gamma_k \frac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon} \end{array}\right. \end{array}\]SAG [Roux et al., 2012]. The Stochastic Average Gradient method …
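Coming back to the two adaptive methods above, the only change from ADAM to YOGI is the second-moment recursion. Here is a minimal numerical sketch of the two updates side by side; the gradient sample and the value of \(v\) are illustrative assumptions.

```python
import numpy as np

beta2 = 0.999
g = np.array([0.3, -0.1])          # illustrative stochastic gradient sample
v = np.full_like(g, 0.5)           # current second-moment estimate, much larger than g**2

v_adam = v - (1 - beta2) * (v - g**2)                  # multiplicative pull of v toward g**2
v_yogi = v - (1 - beta2) * np.sign(v - g**2) * g**2    # additive change, magnitude set by g**2 only

# both decrease toward g**2, but YOGI's per-step change is bounded by (1-beta2)*g**2
print(v_adam, v_yogi)
```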
SCSG [Lei et al., 2017]. The non-convex Stochastically Controlled Stochastic Gradient method …
SARAH [Nguyen et al., 2017]. The StochAstic Recursive grAdient algoritHm …
SPIDER [Fang et al., 2018].
SpiderBoost [Wang et al., 2019].
Geom-SARAH [Horváth et al., 2020]. The Geometrized stochastic recursive gradient method, called Geom-SARAH, is a double-loop procedure which can be seen as a combination of the SCSG method and the SARAH biased gradient estimator. Note that it exploits a randomization technique based on the geometric distribution, which allows certain terms to telescope across the outer and inner loops, thereby simplifying the analysis of the algorithm (a short numerical sketch of one outer iteration is given below).
\[\begin{array}{l} x_0^{(0)}\in\mathbb{R}^m, \text{ expected inner-loop queries } \{m_k\} \\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Sample a batch } J_k \text{ of size } b_k\\ v_{k}^{(0)} = \frac{1}{b_k}\sum_{i\in J_k} \nabla f_i(x_{k}^{(0)})\\ \text{Sample } M_k\sim\mathrm{Geom}(\eta_k)\quad\text{s.t.}\quad\mathbb{E} M_k = m_k/b_k\\ \text{for}\;m=0,1,\ldots,M_k-1\\[0.4ex] \left\lfloor\begin{array}{l} x_{k}^{(m+1)} = x_{k}^{(m)} - \gamma_k v_{k}^{(m)}\\ \text{Sample a batch } I_m \text{ of size } \tilde{b}_m\\ v^{(m+1)}_{k} = \frac{1}{\tilde{b}_m}\sum_{i\in I_m}\left( \nabla f_i(x_{k}^{(m+1)}) - \nabla f_i({x}_{k}^{(m)})\right) + v_{k}^{(m)} \end{array}\right.\\ x_{k+1}^{(0)} = x_{k}^{(M_k)} \end{array}\right. \end{array}\]PAGE [Li et al., 2021].
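Coming back to Geom-SARAH, here is a minimal NumPy sketch of one outer iteration of the procedure above, on an assumed toy finite-sum least-squares objective; the data, batch sizes, step size, and geometric parameter are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))        # toy data: f_i(x) = 0.5*(a_i^T x - b_i)^2
b = rng.normal(size=n)

def grad_batch(x, idx):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def geom_sarah_epoch(x, gamma=0.05, b_k=32, m_k=64, bt=8):
    """One outer iteration: large-batch anchor gradient, then a geometrically
    distributed number of SARAH-type recursive updates."""
    J = rng.integers(n, size=b_k)
    v = grad_batch(x, J)                                  # v_k^(0), anchor estimate
    M = rng.geometric(b_k / m_k)                          # M_k ~ Geom, E[M_k] = m_k/b_k
    for _ in range(M):
        x_new = x - gamma * v                             # gradient step with current estimate
        I = rng.integers(n, size=bt)
        v = grad_batch(x_new, I) - grad_batch(x, I) + v   # SARAH recursive estimator
        x = x_new
    return x

x = np.zeros(d)
for _ in range(20):
    x = geom_sarah_epoch(x)
```

The anchor estimate is refreshed from a large batch once per epoch, while the recursive correction keeps it tracking the gradient at the current inner iterate between refreshes.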
Constrained Optimization
We now consider constrained optimization problems of the form\[\underset{x\in\Omega}{\mathrm{minimize}}\; F(x) \quad\text{with either}\quad \begin{cases} F(x)\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x) \quad \text{(finite-sum)}\\[0.4ex] F(x) \triangleq \mathbb{E}_{s\sim\mathbb{P}}\left[ f(x,s)\right]\quad \text{(stochastic)}\end{cases}\]where \(\Omega\subseteq\mathbb{R}^{m}\) is convex and compact.
SFW [Reddi et al., 2016]. The Stochastic Frank-Wolfe algorithm …
\[\begin{array}{l} x_0\in\Omega\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Draw samples } \{s_1,\ldots,s_{b_k}\} \text{ i.i.d. from } \mathbb{P}\\ v_{k+1} = \mathrm{arg}\;\max_{v\in\Omega} \langle v, -\frac{1}{b_k}\sum_{i=1}^{b_k} \nabla f(x_k, s_i)\rangle\\ x_{k+1} = x_{k} + \rho_k (v_{k+1} - x_k) \end{array}\right. \end{array}\]SVFW [Reddi et al., 2016]. This algorithm can be seen as a non-convex variant of the Stochastic Variance Reduced Frank-Wolfe method devised in [Hazan and Luo, 2016]. As such, it is also epoch-based: at the beginning of each epoch, the full gradient is computed at the current iterate, and this gradient is used to control the variance of the stochastic gradients in the inner loop.
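Before SVFW's pseudocode, here is a minimal NumPy sketch of the plain SFW step above, taking \(\Omega\) to be the probability simplex purely as an illustrative assumption (there the linear subproblem is solved by picking a vertex); the data, batch size, and step-size schedule are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))        # toy data: f_i(x) = 0.5*(a_i^T x - b_i)^2
b = rng.normal(size=n)

def grad_batch(x, idx):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def sfw_step(x, k, b_k=32):
    """One Stochastic Frank-Wolfe step over the probability simplex."""
    g = grad_batch(x, rng.integers(n, size=b_k))  # minibatch gradient estimate
    v = np.zeros_like(x)
    v[np.argmin(g)] = 1.0             # arg max_{v in simplex} <v, -g> is a vertex
    rho = 2.0 / (k + 2)               # illustrative step-size schedule
    return x + rho * (v - x)          # convex combination stays in Omega

x = np.full(d, 1.0 / d)               # start at the simplex barycenter
for k in range(100):
    x = sfw_step(x, k)
```

SVFW below replaces the plain minibatch gradient in the linear subproblem by a variance-reduced estimate anchored at the epoch's reference point.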
\[\begin{array}{l} x_0^{(M)}\in\Omega\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} \tilde{x}_{k} = x_k^{(M)},\quad x_{k+1}^{(0)} = \tilde{x}_k\\ \tilde{g}_k = \nabla F(\tilde{x}_k) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(\tilde{x}_k)\\ \text{for}\;m=0,1,\ldots,M-1\\[0.4ex] \left\lfloor\begin{array}{l} \text{Uniformly pick a batch } I_m \text{ (with replacement) of size } b_m\\ v^{(m)}_{k+1} = \mathrm{arg}\;\max_{v\in\Omega} \langle v, -\left(\frac{1}{b_m}\sum_{i\in I_m}\left( \nabla f_i(x_{k+1}^{(m)}) - \nabla f_i(\tilde{x}_k)\right) + \tilde{g}_k\right)\rangle\\ x_{k+1}^{(m+1)} = x_{k+1}^{(m)} + \rho_k (v_{k+1}^{(m)} - x_{k+1}^{(m)}) \end{array}\right. \end{array}\right. \end{array}\]References
- [Fang et al., 2018] C. Fang, C.J. Li, Z. Lin and T. Zhang. "SPIDER - Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2018)
- [Hazan and Luo, 2016] E. Hazan and H. Luo. "Variance-Reduced and Projection-Free Stochastic Optimization". Proceedings of the International Conference on Machine Learning (ICML) (2016)
- [Horváth et al., 2020] S. Horváth, L. Lei, P. Richtárik and M.I. Jordan. "Adaptivity of stochastic gradient methods for nonconvex optimization". Annual Workshop on Optimization for Machine Learning (OPT) (2020)
- [Kingma and Ba, 2015] D.P. Kingma and J. Ba. "Adam - A Method for Stochastic Optimization". International Conference on Learning Representations (ICLR) (2015)
- [Lei et al., 2017] L. Lei, C. Ju, J. Chen and M.I. Jordan. "Non-convex Finite-Sum Optimization Via SCSG Methods". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2017)
- [Li et al., 2021] Z. Li, H. Bao, X. Zhang and P. Richtárik. "PAGE - A simple and optimal probabilistic gradient estimator for nonconvex optimization". Proceedings of the International Conference on Machine Learning (ICML) (2021)
- [Nguyen et al., 2017] L.M. Nguyen, J. Liu, K. Scheinberg and M. Takáč. "SARAH - A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". Proceedings of the International Conference on Machine Learning (ICML) (2017)
- [Reddi et al., 2016] S. Reddi, S. Sra, B. Poczos and A. Smola. "Stochastic Frank-Wolfe methods for nonconvex optimization". Annual Allerton Conference on Communication, Control, and Computing (2016)
- [Reddi et al., 2018] S.J. Reddi, S. Kale and S. Kumar. "On the Convergence of Adam and Beyond". International Conference on Learning Representations (ICLR) (2018)
- [Roux et al., 2012] N. Le Roux, M. Schmidt and F. Bach. "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2012)
- [Wang et al., 2019] Z. Wang, K. Ji, Y. Zhou, Y. Liang and V. Tarokh. "SpiderBoost and Momentum - Faster Variance Reduction Algorithms". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)
- [Zaheer et al., 2018] M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar. "Adaptive Methods for Nonconvex Optimization". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2018)