**Nonsmooth Nonconvex Composite Optimization**

In progress. Upcoming: Uniformization of notations, full sets of assumptions, convergence results and comparisons

## 1. Nonconvex Optimization Problem with Convex Nonsmooth Term

We consider the following generic optimization problem

\[\underset{x\in\mathbb{R}^{m}}{\mathrm{minimize}}\; \left\{\mathcal{L}(x) \triangleq F(x) + R(X) \right\}\]where \(F\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x)\) has a finite-sum structure, and the function \(R\) is a possibly non-smooth simple convex function. Here, \(R\) is said to be simple in the sense that its proximity operator has a closed form expression. Moreover, throughout this section, we will resort to the following assumptions.

**Assumptions.**
Throughout this section, we consider the following assumptions.

- A1. \(\mathcal{L}\) is bounded from below and \(\mathrm{dom}\;\mathcal{L}\neq \emptyset\).
- A2. \(R\colon\mathbb{R}^m\to\mathbb{R}\cup\{+\infty\}\) is a proper, convex and lower semi-continuous
- A3. Each \(F_i\colon\mathbb{R}^m\to\mathbb{R}\) is continuously differentiable on an open set \(\Omega\supset\overline{\mathrm{dom}\;R}\). Moreover, there exist \(L\in(0,+\infty)\) such that,
- A3-i (\(L\) smoothness). F has an \(L\)-Lipschitz continuous gradient on \(\mathrm{dom}\;R\)
- A3-ii (\(L\)-individual smoothness). Each \(F_i\) has a L-Lipschitz continuous gradient on \(\mathrm{dom}\;J\)
- A3-iii (\(L\)-average smoothness). \(\frac{1}{n}\sum_{i-1}^n \| \nabla F_i(x)-\nabla F(y)\|^2 \leq L^2 \| x - y\|^2\).

- A4 (Bounded variance) There exist \(\sigma\in(0,\infty)\) such that \(\frac{1}{n}\sum_{i-1}^n \| \nabla F_i(x)-\nabla F(x)\|^2 \leq \sigma^2\) for every \(x\in\mathrm{dom}\; F\).

Assumptions A1 and A2 are basic assumptions used in optimization. A1 usually holds since \(F\) and \(R\) typically stand for a loss function and a regularizer, respectively. Hence, they are usually nonnegative or bounded from below, and the domain of \(R\) intersects the domain of \(F\). Also note that, most of convex regularizers popularly encountered also satisfy A2. The next assumption A3 deals with the smoothness of \(F\) and its components. As such, full-batch algorithms will rely on A3-i while stochastic algorithms, which fully exploits the finite-sum nature of \(F\), will typically rely on A3-ii. Note that some exceptions may rely on the weaker assumption A3-iii. Finally, A4 is standard in stochastic optimization. It appears useful when \(n\) is extremely large since passing over \(n\) data points is exhaustive or impossible.

Actually, in some works, slightly weaker assumptions may be required. However, the assumptions stated above are general enough to encapsulate many optimization problems.

### 1.1. Full-batch algorithms

We begin by presenting some algorithms which do not take into account the finite-sum nature of \(F\). In their original forms, they allow for inexact gradient computations and/or inexact computations of the proximal points. However, here, for the sake of simplicity, we will not show such aspects and solely deal with exact computations.

**NIPS** [Sra, 2012]. The *Nonconvex Inexact Proximal Splitting* method hinges on the splitting into smooth and nonsmooth parts. Without inexact gradient computation, it boils down to the following nonconvex forward-backward algorithm.

**VMILAn** [Bonettini et al., 2017]. The *Variable Metric Inexact Line-search Algorithm (new version)* …

The relaxation parameter \(\rho_k\) is determined to yield a sufficient decrease in objective value.

### 1.2. Stochastic variance-reduced algorithms

We now present a variety of proximal variance reduction stochastic gradient algorithms.

**ProxSVRG** [Reddi et al., 2016]. This algorithm is a nonconvex variant of the *Proximal Stochastic Variance Reduced Gradient* method devised in [Xiao and Zhang, 2014]. Note that ProxSVRG is not a
fully incremental algorithm since it requires calculation of the full gradient once per epoch. Convergence results.

**ProxSAGA** [Reddi et al., 2016]. By hinging on the work of [Defazio et al., 2014], the authors have devised a nonconvex proximal variant of SAGA as follows.

**ProxSpiderBoost** [Wang et al., 2019]. SpiderBoost uses the same gradient estimator as SARAH and SPIDER.

**ProxSARAH** [Pham et al., 2020]. The ProxSARAH algorithm differs from the *StochAstic Recursive grAdient algoritHm* (SARAH) by its proximal step followed by an additional averaging step. Note that, for \(\rho_m=1\), it boils down to the vanilla proximal SARAH which is similar to ProxSVRG and ProxSpiderBoost. Convergence results.

Comparison of *Stochastic First-order Oracle* (SFO) complexity

Algorithms | SFO | Step-size |
---|---|---|

ProxSVRG | \(\mathcal{O}(n+n^{2/3}\epsilon^{-2})\) | \(\mathcal{O}(\frac{1}{nL})\to \mathcal{O}(\frac{1}{L})\) |

ProxSAGA | \(\mathcal{O}(n+n^{2/3}\epsilon^{-2})\) | \(\mathcal{O}(\frac{1}{nL})\to \mathcal{O}(\frac{1}{L})\) |

ProxSpiderBoost | \(\mathcal{O}(n+n^{1/2}\epsilon^{-2})\) | \(\mathcal{O}(\frac{1}{L})\) |

ProxSARAH | \(\mathcal{O}(n+n^{1/2}\epsilon^{-2})\) | \(\mathcal{O}(\frac{1}{\sqrt{n}L})\to \mathcal{O}(\frac{1}{L})\) |

### 1.3. Incremental algorithms

**PIAG** [Peng et al., 2019]. The key idea of the *Proximal Incremental Aggregated Gradient* algorithm is to construct an *inexact gradient* to substitute for the full gradient at each iteration. This inexact gradient is devised by evaluating \(F\) at past iterates \(x_{k - \tau_{k,i}}\) where the time-varying delays \(\tau_{k,i}\in\{0,\ldots,\tau\}\) for some maximum delay parameter \(\tau\in\mathbb{N}^+\).

## 2. Nonconvex Optimization Problem with Nonconvex Nonsmooth Term

We now consider a variant where the nonsmooth part can be nonconvex. To this effect, we consider optimization problems of the form

\[\underset{x\in\mathbb{R}^{m}}{\mathrm{minimize}}\; \left\{\mathcal{L}(x) \triangleq F(x) + G(X) + R(X) \right\}\]where the smooth part \(F\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x)\) has a finite-sum structure and the nonsmooth part is divided into two terms \(R\) and \(G\) which are convex and nonconvex, respectively. As in Section 1, we assume that both \(G\) and \(R\) have efficiently computable proximal operators.

**VRSPA** [Metel and Takeda, 2021]. The *Variance Reduced Stochastic Proximal Algorithm* is a variant of MBSPA, devised by the same authors, which takes advantage of the finite-sum nature of \(F\). Given some parameter \(\lambda>0\), the algorithm reads as follows.
Convergence results.

## 3. Nonconvex Block-Structured Optimization Problem

We consider the following block-structured optimization problem

\[\underset{x\in\mathbb{R}^{m_1}, y\in\mathbb{R}^{m_2}}{\mathrm{minimize}}\; \left\{\mathcal{L}(x,y) \triangleq R(x) + F(x,y) + Q(y) \right\}\]where \(F\triangleq \frac{1}{n}\sum_{i=1}^n F_i(x,y)\) has a finite-sum structure, and functions \(R\) and \(Q\) are possibly non-smooth functions.

**Assumptions.**
Throughout this section, we consider the following assumptions.

- \(R\colon \mathbb{R}^{m_1} \to \mathbb{R}\cup \{+\infty\}\) and \(Q\colon \mathbb{R}^{m_1} \to \mathbb{R}\cup \{+\infty\}\) are proper lower semi-continuous functions that are bounded from below.
- \(F_i\colon \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \to \mathbb{R}\) are finite-valued, differentiable and their gradients are \(M\)-Lipschitz continuous on bounded sets of \(\mathbb{R}^{m_1} \times \mathbb{R}^{m_2}\).
- The objective \(\mathcal{L}\) is bounded from below.

No convexity assumption is imposed on any of the functions \(R\), \(F_i\), \(Q\).

**PAM** [Attouch et al., 2010]. The *Proximal Alternating Minimization* method

**PALM** [Bolte et al., 2013]. The *Proximal Alternating Linearized Minimization* method circumvent the limitation of PAM by replacing the subproblems with their proximal linearizations :

This results in the following PALM’s iterations:

\[\begin{array}{l}(x_0,y_0)\in\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\\ \text{for}\;k=0,1,\ldots,K-1\\[0.4ex] \left\lfloor\begin{array}{l} x_{k+1} \in \mathrm{prox}_{\gamma_{1,k} R}\; \left( x_k - \gamma_{1,k} \nabla_x F(x_k,y_k) \right)\\ y_{k+1} \in \mathrm{prox}_{\gamma_{2,k} Q}\; \left( y_k - \gamma_{2,k} \nabla_y F(x_{k+1},y_k) \right) \end{array}\right.\end{array}\]**iPALM** [Pock and Sabach, 2016].

**GiPALM** [Gao et al., 2019]. The *Gauss-Seidel type iPALM* …

**SPRING** [Driggs et al., 2021]. The *Stochastic PRoximal alternatING linearized minimization* algorithm is a randomized version of PALM where the gradients are replaced by random estimates \(\tilde{\nabla} F\) formed using the gradients estimated on mini-batches.

## Notations

Notations | Meaning | Type and range |
---|---|---|

\(\gamma\) | Step-size | positive real |

\(\rho\) | Relaxation parameter | real in \((0,1]\) |

\(\alpha\), \(\beta\) | Inertial parameter | real in \([0,1]\) |

\(I\), \(J\) | Mini-batch | finite set of integers |

\(b\) | Size of mini-batch | positive integer |

\(g\), \(\tilde{g}\), \(\bar{g}\) | Gradient (instant, average or approximation) | real matrices |

## References

**[Attouch et al., 2010]**H. Attouch, J. Bolte, P. Redont and A. Soubeyran. "Proximal alternating minimization and projection methods for nonconvex problems - An approach based on the Kurdyka-Lojasiewicz inequality". Mathematics of Operations Research (2010)**[Bolte et al., 2013]**J. Bolte, S. Sabach and M. Teboulle. "Proximal alternating linearized minimization for nonconvex and nonsmooth problems". Mathematical Programming (2013)**[Bonettini et al., 2017]**S. Bonettini, I. Loris, F. Porta, M. Prato and S. Rebegoldi. "On the convergence of a linesearch based proximal-gradient method for nonconvex optimization". Journal of Inverse Problems (2017)**[Defazio et al., 2014]**A. Defazio, F. Bach and S. Lacoste-Julien. "SAGA - A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2014)**[Driggs et al., 2021]**D. Driggs, J. Tanq, J. Liang, M. Davies and C.-B. Schönlieb. "SPRING - A fast stochastic proximal alternating method for non-smooth non-convex optimization". ArXiv preprint (2021)**[Gao et al., 2019]**X. Gao, X. Cai and D. Han. "A Gauss-Seidel type inertial proximal alternating linearized minimization for a class of nonconvex optimization problems". Journal of Global Optimization (2019)**[Metel and Takeda, 2021]**M.R. Metel and A. Takeda. "Stochastic Proximal Methods for Non-Smooth Non-Convex Constrained Sparse Optimization". Journal of Machine Learning Research (JMLR) (2021)**[Peng et al., 2019]**W. Peng, H. Zhang and X. Zhang. "Nonconvex proximal incremental aggregated gradient method with linear convergence". Journal of Optimization Theory and Applications (2019)**[Pham et al., 2020]**N.H. Pham, L.M. Nguyen, D.T. Phan and Q. Tran-Dinh. "ProxSARAH - An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization". Journal of Machine Learning Research (JMLR) (2020)**[Pock and Sabach, 2016]**T. Pock and S. Sabach. "Inertial Proximal Alternating Linearized Minimization (iPALM) for Nonconvex and Nonsmooth Problems". SIAM Journal on Imaging Sciences (2016)**[Reddi et al., 2016]**S. Reddi, S. Sra, B. Poczos and A. Smola. "Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization". Proceedings of the International Conference on Neural Information Processing Systems (NIPS) (2016)**[Sra, 2012]**S. Sra. "Scalable nonconvex inexact proximal splitting". Advances in Neural Information Processing Systems (2012)**[Wang et al., 2019]**Z. Wang, K. Ji, Y. Zhou, Y. Liang and V. Tarokh. "SpiderBoost and Momentum - Faster Variance Reduction Algorithms". Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)**[Xiao and Zhang, 2014]**L. Xiao and T. Zhang. "A proximal stochastic gradient method with progressive variance reduction". SIAM Journal on Optimization (2014)