
Stochastic Smooth Nonconvex Optimization

In progress. Upcoming: full sets of assumptions, convergence results, and comparisons.

Notation. Throughout this post, we denote the step-sizes by $\gamma$.

Unconstrained Optimization

We consider optimization problems of the form

$$
\underset{x \in \mathbb{R}^m}{\text{minimize}} \; F(x)
\quad \text{with either} \quad
\begin{cases}
F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} F_i(x) & \text{(finite-sum)} \\
F(x) \triangleq \mathbb{E}_{s \sim P}\left[ f(x, s) \right] & \text{(stochastic)}
\end{cases}
$$

We begin by presenting some adaptive methods based on exponential moving averages, which are very popular in the deep learning community.

ADAM [Kingma and Ba, 2015]. The terms $m_k$ and $v_k$ are the exponential moving averages of the gradient and squared gradient, respectively. The parameter $\epsilon$, originally introduced to avoid precision issues…

$$
\begin{array}{l}
x_0 \in \mathbb{R}^m \\
\textbf{for } k = 0, 1, \ldots, K-1 \\
\quad \text{Draw a sample } s_k \text{ from } P \\
\quad g_k = \nabla f(x_k, s_k) \\
\quad m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k \\
\quad v_{k+1} = v_k - (1 - \beta_2)\left(v_k - g_k^2\right) \\
\quad x_{k+1} = x_k - \gamma_k \frac{m_k}{\sqrt{v_{k+1}} + \epsilon}
\end{array}
$$
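To make the recursion concrete, here is a minimal NumPy sketch of a single ADAM step as written above, without the bias correction of the original paper; the function name, default constants, and the toy quadratic used for illustration are assumptions, not a reference implementation.

```python
import numpy as np

def adam_step(x, m, v, g, gamma=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of the recursion above (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * g          # EMA of the gradient
    v = v - (1.0 - beta2) * (v - g**2)         # EMA of the squared gradient
    x = x - gamma * m / (np.sqrt(v) + eps)     # coordinate-wise adaptive step
    return x, m, v

# Toy illustration: F(x) = E_s[0.5 * ||x - s||^2] with s ~ N(0, I), minimized at x = 0.
rng = np.random.default_rng(0)
x, m, v = np.ones(5), np.zeros(5), np.zeros(5)
for k in range(2000):
    s = rng.standard_normal(5)
    g = x - s                                  # stochastic gradient of f(x, s)
    x, m, v = adam_step(x, m, v, g)
print(np.linalg.norm(x))                       # drifts toward a neighbourhood of the minimizer 0
```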

However, ADAM does not benefit from convergence guarantees. Even worse, non-convergence can be exhibited in simple online convex settings with constant minibatch sizes (see [Reddi et al., 2018]).

YOGI [Zaheer et al., 2018]. The YOGI method gets its name from the Sanskrit word yuj, meaning to add. It is a slight variant of ADAM in which the magnitude of the difference between $v_{k+1}$ and $v_k$ depends only on $g_k^2$, as opposed to depending on both $v_k$ and $g_k^2$. Hence, when $v_k$ is much larger than $g_k^2$, ADAM can rapidly increase the effective learning rate, while YOGI does so in a controlled fashion.

$$
\begin{array}{l}
x_0 \in \mathbb{R}^m \\
\textbf{for } k = 0, 1, \ldots, K-1 \\
\quad \text{Draw a sample } s_k \text{ from } P \\
\quad g_k = \nabla f(x_k, s_k) \\
\quad m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k \\
\quad v_{k+1} = v_k - (1 - \beta_2)\,\mathrm{sign}\!\left(v_k - g_k^2\right) g_k^2 \\
\quad x_{k+1} = x_k - \gamma_k \frac{m_k}{\sqrt{v_{k+1}} + \epsilon}
\end{array}
$$
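Compared with the ADAM sketch above, only the $v$-update changes: the sign-based rule makes $|v_{k+1} - v_k| = (1-\beta_2)\, g_k^2$ regardless of the current value of $v_k$. Below is a hedged sketch reusing the assumed names of the previous snippet; the larger default $\epsilon$ is also an illustrative assumption.

```python
import numpy as np

def yogi_step(x, m, v, g, gamma=1e-2, beta1=0.9, beta2=0.999, eps=1e-3):
    """One iteration of the recursion above; only the v-update differs from ADAM."""
    m = beta1 * m + (1.0 - beta1) * g
    v = v - (1.0 - beta2) * np.sign(v - g**2) * g**2   # additive: |dv| = (1 - beta2) * g^2
    x = x - gamma * m / (np.sqrt(v) + eps)
    return x, m, v
```

It can be dropped into the toy loop of the ADAM sketch in place of adam_step.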

SAG [Roux et al., 2012]. The Stochastic Average Gradient method…

SCSG [Lei et al., 2017]. The non-convex Stochastically Controlled Stochastic Gradient method…

SARAH [Nguyen et al., 2017]. The StochAstic Recursive grAdient algoritHm…

SPIDER [Fang et al., 2018].

SpiderBoost [Wang et al., 2019].

Geom-SARAH [Horváth et al., 2020]. The Geometrized stochastic recursive gradient method, called Geom-SARAH, is a double-loop procedure that can be seen as a combination of the SCSG method and the SARAH biased gradient estimator. Note that it exploits a randomization technique based on the geometric distribution, which allows certain terms to telescope across the outer and inner loops, hence simplifying the analysis of the algorithm.

$$
\begin{array}{l}
x_0^{(0)} \in \mathbb{R}^m, \ \text{expected inner-loop queries } \{m_k\} \\
\textbf{for } k = 0, 1, \ldots, K-1 \\
\quad \text{Sample a batch } J_k \text{ of size } b_k \\
\quad v_{k+1}^{(0)} = \frac{1}{b_k} \sum_{i \in J_k} \nabla f_i\big(x_{k+1}^{(0)}\big) \\
\quad \text{Sample } M_k \sim \mathrm{Geom}(\eta_k) \ \text{s.t. } \mathbb{E}[M_k] = m_k / b_k \\
\quad \textbf{for } m = 0, 1, \ldots, M_k - 1 \\
\quad \quad x_{k+1}^{(m+1)} = x_{k+1}^{(m)} - \gamma_k v_{k+1}^{(m)} \\
\quad \quad \text{Sample a batch } I_m \text{ of size } \tilde{b}_m \\
\quad \quad v_{k+1}^{(m+1)} = \frac{1}{\tilde{b}_m} \sum_{i \in I_m} \left( \nabla f_i\big(x_{k+1}^{(m+1)}\big) - \nabla f_i\big(x_{k+1}^{(m)}\big) \right) + v_{k+1}^{(m)}
\end{array}
$$
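As a rough illustration of this double-loop structure in the finite-sum setting, here is a sketch under simplifying assumptions (not the authors' implementation): grad_batch is an assumed mini-batch gradient oracle, batches are drawn without replacement, and the geometric law is parameterized so that the expected inner-loop length is $m_k / b_k$.

```python
import numpy as np

def geom_sarah(x0, grad_batch, n, K, gamma, b, b_tilde, m_expected, seed=0):
    """Sketch of the double loop above for F = (1/n) * sum_i f_i.

    grad_batch(x, idx) is assumed to return (1/len(idx)) * sum_{i in idx} grad f_i(x).
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(K):
        # Outer step: large-batch gradient estimate anchoring the SARAH recursion.
        J = rng.choice(n, size=b, replace=False)
        v = grad_batch(x, J)
        # Geometric inner-loop length with E[M] = m_expected / b (support {1, 2, ...}).
        M = rng.geometric(min(1.0, b / m_expected))
        for _ in range(M):
            x_prev = x.copy()
            x = x - gamma * v                      # step with the biased recursive estimator
            I = rng.choice(n, size=b_tilde, replace=False)
            # SARAH update: correct v with the gradient difference on the fresh batch I.
            v = grad_batch(x, I) - grad_batch(x_prev, I) + v
    return x
```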

PAGE [Li et al., 2021].

Constrained Optimization

We now consider constrained optimization problems of the form

$$
\underset{x \in \Omega}{\text{minimize}} \; F(x)
\quad \text{with either} \quad
\begin{cases}
F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} F_i(x) & \text{(finite-sum)} \\
F(x) \triangleq \mathbb{E}_{s \sim P}\left[ f(x, s) \right] & \text{(stochastic)}
\end{cases}
$$

where $\Omega \subset \mathbb{R}^m$ is convex and compact.

SFW [Reddi et al., 2016]. The Stochastic Frank-Wolfe algorithm…

$$
\begin{array}{l}
x_0 \in \Omega \\
\textbf{for } k = 0, 1, \ldots, K-1 \\
\quad \text{Uniformly randomly pick samples } \{s_1, \ldots, s_{b_k}\} \text{ according to } P \\
\quad v_{k+1} = \underset{v \in \Omega}{\arg\max} \ \Big\langle v, \, -\frac{1}{b_k} \sum_{i=1}^{b_k} \nabla f(x_k, s_i) \Big\rangle \\
\quad x_{k+1} = x_k + \rho_k \left( v_{k+1} - x_k \right)
\end{array}
$$
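Below is a minimal NumPy sketch of this loop; the feasible set (an $\ell_1$-ball), its linear minimization oracle, the step-size rule $\rho_k = 2/(k+2)$, and the signature of the stoch_grad oracle are all illustrative assumptions.

```python
import numpy as np

def lmo_l1_ball(g, radius=1.0):
    """Linear minimization oracle for the l1-ball: argmin_{||v||_1 <= radius} <v, g>."""
    j = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[j] = -radius * np.sign(g[j])
    return v

def sfw(x0, stoch_grad, K, batch_size, lmo=lmo_l1_ball, seed=0):
    """Sketch of the stochastic Frank-Wolfe loop above (x0 assumed to lie in Omega)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(K):
        g = stoch_grad(x, rng, batch_size)   # (1/b_k) * sum_i grad f(x_k, s_i), assumed oracle
        v = lmo(g)                           # linear subproblem over Omega
        rho = 2.0 / (k + 2.0)                # one common Frank-Wolfe step-size choice
        x = x + rho * (v - x)                # convex combination, so x stays in Omega
    return x
```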

SVFW [Reddi et al., 2016]. This algorithm can be seen as a non-convex variant of the Stochastic Variance Reduced Frank-Wolfe method devised in [Hazan and Luo, 2016]. As such, it is also epoch-based. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is used for controlling the variance of the stochastic gradients in the inner loop.

$$
\begin{array}{l}
x_0^{(M)} \in \Omega \\
\textbf{for } k = 0, 1, \ldots, K-1 \\
\quad \tilde{x}_k = x_k^{(M)}, \quad x_{k+1}^{(0)} = \tilde{x}_k \\
\quad \tilde{g}_k = \nabla F(\tilde{x}_k) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\tilde{x}_k) \\
\quad \textbf{for } m = 0, 1, \ldots, M-1 \\
\quad \quad \text{Uniformly pick batch } I_m \text{ (with replacement) of size } b_m \\
\quad \quad v_{k+1}^{(m)} = \underset{v \in \Omega}{\arg\max} \ \Big\langle v, \, -\Big( \frac{1}{b_m} \sum_{i \in I_m} \big( \nabla f_i\big(x_{k+1}^{(m)}\big) - \nabla f_i(\tilde{x}_k) \big) + \tilde{g}_k \Big) \Big\rangle \\
\quad \quad x_{k+1}^{(m+1)} = x_{k+1}^{(m)} + \rho_k \left( v_{k+1}^{(m)} - x_{k+1}^{(m)} \right)
\end{array}
$$
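In the same hedged style, here is a sketch of this epoch-based scheme for the finite-sum case; grad_full, grad_batch, the oracle lmo over $\Omega$, and the constant step-size rho are assumptions supplied by the caller, and x0 is assumed to lie in $\Omega$.

```python
import numpy as np

def svfw(x0, grad_full, grad_batch, n, K, M, batch_size, lmo, rho, seed=0):
    """Sketch of the epoch-based loop above for F = (1/n) * sum_i f_i."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(K):
        x_snap = x.copy()                      # snapshot ~x_k at the start of the epoch
        g_snap = grad_full(x_snap)             # full gradient at the snapshot
        for m in range(M):
            I = rng.integers(0, n, size=batch_size)   # mini-batch drawn with replacement
            # Variance-reduced gradient estimator built from the snapshot.
            g = grad_batch(x, I) - grad_batch(x_snap, I) + g_snap
            v = lmo(g)                         # linear subproblem over Omega
            x = x + rho * (v - x)              # convex combination, so x stays in Omega
    return x
```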

References