Stochastic Smooth Nonconvex Optimization
In progress. Upcoming: full sets of assumptions, convergence results and comparisons
Notation. Throughout this post, we denote step sizes by $\gamma$.
Unconstrained Optimization
We consider unconstrained optimization problems of the form
$$
\underset{x\in\mathbb{R}^m}{\operatorname{minimize}}\; F(x)
\qquad \text{with either} \qquad
\begin{cases}
F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} F_i(x) & \text{(finite-sum)}\\[4pt]
F(x) \triangleq \mathbb{E}_{s\sim P}\left[f(x,s)\right] & \text{(stochastic)}
\end{cases}
$$

We begin by presenting some exponential-moving-average-based adaptive methods, which are very popular in the deep learning community.
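Before turning to these methods, here is a small NumPy sketch of what a finite-sum objective and an unbiased minibatch gradient estimator can look like. The specific Cauchy-type loss, the synthetic data and the batch size are assumptions made purely for illustration; none of the methods below prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 20
A, y = rng.normal(size=(n, m)), rng.normal(size=n)

def F(x):
    """Finite-sum objective F(x) = (1/n) sum_i F_i(x), with a smooth nonconvex
    Cauchy-type loss F_i(x) = log(1 + (a_i.x - y_i)^2 / 2) chosen for illustration."""
    r = A @ x - y
    return np.mean(np.log1p(0.5 * r ** 2))

def stochastic_grad(x, batch_size=32):
    """Unbiased minibatch estimator of the full gradient (1/n) sum_i grad F_i(x)."""
    idx = rng.choice(n, size=batch_size, replace=False)
    r = A[idx] @ x - y[idx]
    # grad F_i(x) = (r_i / (1 + r_i^2 / 2)) * a_i
    return (A[idx] * (r / (1.0 + 0.5 * r ** 2))[:, None]).mean(axis=0)
```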
ADAM [Kingma and Ba, 2015]. The terms $m_k$ and $v_k$ are exponential moving averages of the gradient and of the squared gradient, respectively. The parameter $\epsilon$, originally introduced to avoid precision issues…
$$
\begin{array}{l}
x_0 \in \mathbb{R}^m\\
\text{for } k=0,1,\ldots,K-1\\
\quad\left\lfloor
\begin{array}{l}
\text{Draw a sample } s_k \text{ from } P\\
g_k = \nabla f(x_k, s_k)\\
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k\\
v_k = v_{k-1} - (1-\beta_2)\,(v_{k-1} - g_k^2)\\
x_{k+1} = x_k - \gamma_k \dfrac{m_k}{\sqrt{v_k}+\epsilon}
\end{array}
\right.
\end{array}
$$

However, ADAM does not benefit from convergence guarantees. Even worse, non-convergence can occur in simple online convex settings with constant minibatch sizes (see the work of Reddi et al., 2018).
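To make the update concrete, here is a minimal NumPy sketch of the ADAM iteration written exactly as in the pseudocode above; the gradient oracle `grad_f` and the default hyper-parameter values are illustrative placeholders.

```python
import numpy as np

def adam(grad_f, x0, n_steps, gamma=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One-sample ADAM loop, following the pseudocode above.

    grad_f(x) should return a stochastic gradient of F at x (problem-dependent,
    e.g. the minibatch estimator sketched earlier in this post)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # exponential moving average of gradients
    v = np.zeros_like(x)   # exponential moving average of squared gradients
    for _ in range(n_steps):
        g = grad_f(x)                          # g_k = grad f(x_k, s_k)
        m = beta1 * m + (1.0 - beta1) * g      # m_k
        v = v - (1.0 - beta2) * (v - g ** 2)   # equals beta2 * v + (1 - beta2) * g_k^2
        x = x - gamma * m / (np.sqrt(v) + eps)
    return x
```

As in the pseudocode, the bias-correction factors of the original ADAM paper are omitted here.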
YOGI [Zaheer et al., 2018]. The YOGI method gets its name from the Sanskrit word yuj, meaning to add. It is a slight variant of ADAM in which the magnitude of the change from $v_{k-1}$ to $v_k$ depends only on $g_k^2$, as opposed to depending on both $v_{k-1}$ and $g_k^2$. Hence, when $v_{k-1}$ is much larger than $g_k^2$, ADAM can rapidly increase the effective learning rate, whereas YOGI does so in a controlled fashion.
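In code, the only change with respect to ADAM is the second-moment update; a short NumPy sketch of the contrast, with illustrative values (the full YOGI pseudocode is given below):

```python
import numpy as np

beta2 = 0.999
g = np.array([0.1, -2.0])    # current stochastic gradient (illustrative values)
v = np.array([4.0, 1e-4])    # current second-moment estimate

# ADAM: the change |v_new - v| = (1 - beta2) * |v - g**2| can be large when v >> g**2 ...
v_adam = v - (1.0 - beta2) * (v - g ** 2)
# ... while YOGI caps it at (1 - beta2) * g**2, whatever the gap between v and g**2.
v_yogi = v - (1.0 - beta2) * np.sign(v - g ** 2) * g ** 2
```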
$$
\begin{array}{l}
x_0 \in \mathbb{R}^m\\
\text{for } k=0,1,\ldots,K-1\\
\quad\left\lfloor
\begin{array}{l}
\text{Draw a sample } s_k \text{ from } P\\
g_k = \nabla f(x_k, s_k)\\
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k\\
v_k = v_{k-1} - (1-\beta_2)\,\operatorname{sign}(v_{k-1} - g_k^2)\, g_k^2\\
x_{k+1} = x_k - \gamma_k \dfrac{m_k}{\sqrt{v_k}+\epsilon}
\end{array}
\right.
\end{array}
$$

SAG [Roux et al., 2012]. The Stochastic Average Gradient method…
SCSG [Lei et al., 2017]. The non-convex Stochastically Controlled Stochastic Gradient method…
SARAH [Nguyen et al., 2017]. The StochAstic Recursive grAdient algoritHm…
SPIDER [Fang et al., 2018].
SpiderBoost [Wang et al., 2019].
Geom-SARAH [Horváth et al., 2020]. The Geometrized stochastic recursive gradient method, called Geom-SARAH, is a double-loop procedure which can be seen as a combination of the SCSG method and the SARAH biased gradient estimator. Note that it exploits a randomization technique based on the geometric distribution, which allows certain terms to telescope across the outer and inner loops, hence simplifying the analysis of the algorithm.
$$
\begin{array}{l}
x^{(0)}_{0} \in \mathbb{R}^m,\ \text{expected inner-loop queries } \{m_k\}\\
\text{for } k=0,1,\ldots,K-1\\
\quad\left\lfloor
\begin{array}{l}
\text{Sample a batch } J_k \text{ of size } b_k\\
v^{(0)}_{k+1} = \frac{1}{b_k}\sum_{i\in J_k} \nabla f_i\big(x^{(0)}_{k+1}\big)\\
\text{Sample } M_k \sim \mathrm{Geom}(\eta_k) \text{ s.t. } \mathbb{E}\,M_k = m_k/b_k\\
\text{for } m=0,1,\ldots,M_k-1\\
\quad\left\lfloor
\begin{array}{l}
x^{(m+1)}_{k+1} = x^{(m)}_{k+1} - \gamma_k\, v^{(m)}_{k+1}\\
\text{Sample a batch } I_m \text{ of size } \tilde{b}_m\\
v^{(m+1)}_{k+1} = \frac{1}{\tilde{b}_m}\sum_{i\in I_m}\Big(\nabla f_i\big(x^{(m+1)}_{k+1}\big) - \nabla f_i\big(x^{(m)}_{k+1}\big)\Big) + v^{(m)}_{k+1}
\end{array}
\right.
\end{array}
\right.
\end{array}
$$

PAGE [Li et al., 2021].
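To complement the Geom-SARAH pseudocode above, here is a finite-sum NumPy sketch of the double loop. The gradient oracle `grad_batch`, the batch sizes, the step size and the particular geometric parametrization are assumptions made for illustration, not the exact choices analyzed in [Horváth et al., 2020].

```python
import numpy as np

def geom_sarah(grad_batch, n, x0, n_epochs, gamma=0.05, b=32, b_tilde=8, m_exp=256, seed=0):
    """Finite-sum sketch of a Geom-SARAH-style double loop.

    grad_batch(x, idx) should return the average gradient of the f_i, i in idx, at x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_epochs):
        # Outer step: batch gradient estimate at the current iterate.
        J = rng.choice(n, size=b, replace=False)
        v = grad_batch(x, J)
        # Random number of inner steps with expectation m_exp / b
        # (geometric on {0, 1, 2, ...}; one possible parametrization).
        M = rng.geometric(b / (m_exp + b)) - 1
        for _ in range(M):
            x_prev = x.copy()
            x = x - gamma * v                                   # inner gradient step
            I = rng.choice(n, size=b_tilde, replace=False)
            v = grad_batch(x, I) - grad_batch(x_prev, I) + v    # SARAH recursive estimator
    return x
```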
Constrained Optimization
We now turn to constrained problems of the form

$$
\underset{x\in\Omega}{\operatorname{minimize}}\; F(x)
\qquad \text{with either} \qquad
\begin{cases}
F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} F_i(x) & \text{(finite-sum)}\\[4pt]
F(x) \triangleq \mathbb{E}_{s\sim P}\left[f(x,s)\right] & \text{(stochastic)}
\end{cases}
$$

where $\Omega\subseteq\mathbb{R}^m$ is convex and compact.
SFW [Reddi et al., 2016]. The Stochastic Frank-Wolfe algorithm…
$$
\begin{array}{l}
x_0 \in \Omega\\
\text{for } k=0,1,\ldots,K-1\\
\quad\left\lfloor
\begin{array}{l}
\text{Uniformly randomly pick samples } \{s_1,\ldots,s_{b_k}\} \text{ according to } P\\
v_{k+1} = \underset{v\in\Omega}{\operatorname{argmax}} \left\langle v, -\frac{1}{b_k}\sum_{i=1}^{b_k} \nabla f(x_k, s_i) \right\rangle\\
x_{k+1} = x_k + \rho_k\,(v_{k+1} - x_k)
\end{array}
\right.
\end{array}
$$

SVFW [Reddi et al., 2016]. This algorithm can be seen as a non-convex variant of the Stochastic Variance Reduced Frank-Wolfe method devised in [Hazan and Luo, 2016]. As such, it is also epoch-based. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is then used to control the variance of the stochastic gradients in the inner loop.
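Both SFW above and SVFW below access the constraint set only through a linear minimization oracle $\operatorname{argmax}_{v\in\Omega}\langle v, -g\rangle$. As an illustration, here is a minimal NumPy sketch of the SFW iteration assuming, purely for the example, that $\Omega$ is an $\ell_1$-ball of radius $r$; the gradient oracle and the step-size schedule are placeholders.

```python
import numpy as np

def lmo_l1_ball(g, r=1.0):
    """Linear minimization oracle: argmax over ||v||_1 <= r of <v, -g>."""
    j = np.argmax(np.abs(g))
    v = np.zeros_like(g, dtype=float)
    v[j] = -r * np.sign(g[j])
    return v

def sfw(grad_f, x0, n_iters, r=1.0):
    """Stochastic Frank-Wolfe sketch on an l1-ball (illustrative choice of Omega).

    grad_f(x) should return a minibatch stochastic gradient of F at x."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_iters):
        g = grad_f(x)                 # stochastic gradient estimate
        v = lmo_l1_ball(g, r)         # Frank-Wolfe vertex of Omega
        rho = 2.0 / (k + 2.0)         # illustrative step-size schedule
        x = x + rho * (v - x)         # convex combination stays in Omega
    return x
```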
$$
\begin{array}{l}
x^{(M)}_{0} \in \Omega\\
\text{for } k=0,1,\ldots,K-1\\
\quad\left\lfloor
\begin{array}{l}
\tilde{x}_k = x^{(M)}_k\\
\tilde{g}_k = \nabla F(\tilde{x}_k) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x}_k)\\
x^{(0)}_{k+1} = \tilde{x}_k\\
\text{for } m=0,1,\ldots,M-1\\
\quad\left\lfloor
\begin{array}{l}
\text{Uniformly pick a batch } I_m \text{ (with replacement) of size } b_m\\
v^{(m)}_{k+1} = \underset{v\in\Omega}{\operatorname{argmax}} \left\langle v, -\left(\frac{1}{b_m}\sum_{i\in I_m}\Big(\nabla f_i\big(x^{(m)}_{k+1}\big) - \nabla f_i(\tilde{x}_k)\Big) + \tilde{g}_k\right) \right\rangle\\
x^{(m+1)}_{k+1} = x^{(m)}_{k+1} + \rho_k\big(v^{(m)}_{k+1} - x^{(m)}_{k+1}\big)
\end{array}
\right.
\end{array}
\right.
\end{array}
$$
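The variance-reduction mechanism of SVFW is easiest to see on the gradient estimator itself; a short sketch, with an assumed per-batch gradient oracle `grad_batch`:

```python
def svfw_gradient_estimate(grad_batch, x, x_snap, g_snap, idx):
    """SVRG-style estimator used in SVFW's inner loop.

    grad_batch(x, idx) returns the average gradient of the f_i, i in idx, at x;
    g_snap is the full gradient at the snapshot x_snap (computed once per epoch).
    The estimator is unbiased, and its variance shrinks as x approaches x_snap."""
    return grad_batch(x, idx) - grad_batch(x_snap, idx) + g_snap
```

Inside the inner loop, this estimator simply replaces the plain minibatch gradient used by the SFW sketch above before being passed to the same linear minimization oracle.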
References
- [Fang et al., 2018] C. Fang, C.J. Li, Z. Lin and T. Zhang. "SPIDER - Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2018)
- [Hazan and Luo, 2016] E. Hazan and H. Luo. "Variance-Reduced and Projection-Free Stochastic Optimization". Proceedings of the International Conference on Machine Learning (ICML) (2016)
- [Horváth et al., 2020] S. Horváth, L. Lei, P. Richtárik and M.I. Jordan. "Adaptivity of stochastic gradient methods for nonconvex optimization". Annual Workshop on Optimization for Machine Learning (OPT) (2020)
- [Kingma and Ba, 2015] D.P. Kingma and J. Ba. "Adam - A Method for Stochastic Optimization". International Conference on Learning Representations (ICLR) (2015)
- [Lei et al., 2017] L. Lei, C. Ju, J. Chen and M.I. Jordan. "Non-convex Finite-Sum Optimization Via SCSG Methods". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2017)
- [Li et al., 2021] Z. Li, H. Bao, X. Zhang and P. Richtárik. "PAGE - A simple and optimal probabilistic gradient estimator for nonconvex optimization". Proceedings of the International Conference on Machine Learning (ICML) (2021)
- [Nguyen et al., 2017] L.M. Nguyen, J. Liu, K. Scheinberg and M. Takáč. "SARAH - A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". Proceedings of the International Conference on Machine Learning (ICML) (2017)
- [Reddi et al., 2016] S. Reddi, S. Sra, B. Poczos and A. Smola. "Stochastic Frank-Wolfe methods for nonconvex optimization". Annual Allerton Conference on Communication, Control, and Computing (2016)
- [Roux et al., 2012] N. Roux, M. Schmidt and F. Bach. "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2012)
- [Wang et al., 2019] Z. Wang, K. Ji, Y. Zhou, Y. Liang and V. Tarokh. "SpiderBoost and Momentum - Faster Variance Reduction Algorithms". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)
- [Zaheer et al., 2018] M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar. "Adaptive Methods for Nonconvex Optimization". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2018)