Convex optimization and randomness

[math] \newcommand{\LS}{\mathtt{line\_search}} \newcommand{\mI}{\mathrm{I}} \newcommand{\mB}{\mathrm{B}} \newcommand{\conv}{\mathrm{conv}} \newcommand{\inte}{\mathrm{int}} \newcommand{\tg}{\tilde{g}} \newcommand{\tx}{\tilde{x}} \newcommand{\ty}{\tilde{y}} \newcommand{\tz}{\tilde{z}} \newcommand{\id}{\mathrm{id}} \newcommand{\K}{\mathrm{KL}} \newcommand{\kl}{\mathrm{kl}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\tr}{\mathrm{Tr}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\KL}{\mathrm{KL}} \newcommand{\LG}{\overline{\log}(K)} \newcommand{\ocP}{\overline{\mathcal{P}}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cD}{\mathcal{D}} \newcommand{\oD}{\overline{\mathcal{D}}} \newcommand{\oR}{\overline{R}} \newcommand{\wh}{\widehat} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\todo}{{\bf TO DO } } \newcommand{\Ber}{\mathop{\mathrm{Ber}}} \newcommand{\beq}{\begin{equation}} \newcommand{\eeq}{\end{equation}} \newcommand{\beqa}{\begin{eqnarray}} \newcommand{\eeqa}{\end{eqnarray}} \newcommand{\beqan}{\begin{eqnarray*}} \newcommand{\eeqan}{\end{eqnarray*}} \newcommand{\qed}{\hfill\BlackBox} \newcommand{\charfct}{\ds1} % \newcommand{\Fcal}{\mathcal{F}} \newcommand{\Xcal}{\mathcal{X}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Gcal}{\mathcal{G}} \newcommand{\Nat}{\mathbb{N}} \newcommand{\mathds}{\mathbb}[/math]

In this chapter we explore the interplay between optimization and randomness. A key insight, going back to [1], is that first order methods are quite robust: the gradients do not have to be computed exactly to ensure progress towards the optimum. Indeed, since these methods usually take many small steps, as long as the gradients are correct on average, the error introduced by the gradient approximations will eventually vanish. As we will see below this intuition is correct for non-smooth optimization (since the steps are indeed small) but the picture is more subtle in the case of smooth optimization (recall from Chapter that in this case we take long steps).

We introduce now the main object of this chapter: a (first order) {\em stochastic} oracle for a convex function [math]f : \cX \rightarrow \R[/math] takes as input a point [math]x \in \cX[/math] and outputs a random variable [math]\tg(x)[/math] such that [math]\E \ \tg(x) \in \partial f(x)[/math]. In the case where the query point [math]x[/math] is a random variable (possibly obtained from previous queries to the oracle), one assumes that [math]\E \ (\tg(x) | x) \in \partial f(x)[/math]. The unbiasedness assumption by itself is not enough to obtain rates of convergence; one also needs to make assumptions about the fluctuations of [math]\tg(x)[/math]. Essentially, in the non-smooth case we will assume that there exists [math]B \gt 0[/math] such that [math]\E \|\tg(x)\|_*^2 \leq B^2[/math] for all [math]x \in \cX[/math], while in the smooth case we assume that there exists [math]\sigma \gt 0[/math] such that [math]\E \|\tg(x) - \nabla f(x)\|_*^2 \leq \sigma^2[/math] for all [math]x \in \cX[/math]. We also note that the situation with a {\em biased} oracle is quite different, and we refer to [2][3] for some works in this direction.

The two canonical examples of a stochastic oracle in machine learning are as follows. Let [math]f(x) = \E_{\xi} \ell(x, \xi)[/math] where [math]\ell(x, \xi)[/math] should be interpreted as the loss of predictor [math]x[/math] on the example [math]\xi[/math]. We assume that [math]\ell(\cdot, \xi)[/math] is a (differentiable[Notes 1]) convex function for any [math]\xi[/math]. The goal is to find a predictor with minimal expected loss, that is to minimize [math]f[/math]. When queried at [math]x[/math] the stochastic oracle can draw [math]\xi[/math] from the unknown distribution and report [math]\nabla_x \ell(x, \xi)[/math]. One obviously has [math]\E_{\xi} \nabla_x \ell(x, \xi) \in \partial f(x)[/math].

The second example is the one described in Section Some convex optimization problems in machine learning, where one wants to minimize [math]f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)[/math]. In this situation a stochastic oracle can be obtained by selecting uniformly at random [math]I \in [m][/math] and reporting [math]\nabla f_I(x)[/math].

Observe that the stochastic oracles in the two above cases are quite different. Consider the standard situation where one has access to a data set of i.i.d. samples [math]\xi_1, \dots, \xi_m[/math]. Thus in the first case, where one wants to minimize the {\em expected loss}, one is limited to [math]m[/math] queries to the oracle, that is to a {\em single pass} over the data (indeed one cannot ensure that the conditional expectations are correct if a data point is used twice). On the contrary for the {\em empirical loss}, where [math]f_i(x) = \ell(x, \xi_i)[/math], one can do as many passes as one wishes.
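To make the finite-sum oracle concrete, here is a minimal Python sketch; the least-squares losses, the data, and the sample sizes are illustrative assumptions, not from the text. Querying the oracle amounts to sampling an index uniformly at random and returning the gradient of the corresponding [math]f_i[/math], which is unbiased for [math]\nabla f[/math].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical loss f(x) = (1/m) sum_i f_i(x) with
# f_i(x) = 0.5 * (a_i^T x - b_i)^2 (synthetic data).
m, n = 200, 10
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def full_gradient(x):
    # exact gradient of f
    return A.T @ (A @ x - b) / m

def stochastic_oracle(x):
    # pick i uniformly at random and return the gradient of f_i at x
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(n)
# sanity check: averaging many oracle answers approximates the exact gradient
est = np.mean([stochastic_oracle(x) for _ in range(5000)], axis=0)
print(np.linalg.norm(est - full_gradient(x)))  # small
```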

Non-smooth stochastic optimization

We initiate our study with stochastic mirror descent (S-MD) which is defined as follows: [math]x_1 \in \argmin_{\cX \cap \cD} \Phi(x)[/math], and

[[math]] x_{t+1} = \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta \tilde{g}(x_t)^{\top} x + D_{\Phi}(x,x_t) . [[/math]]

In this case equation eq:vfMD rewrites as

[[math]] \sum_{s=1}^t \tg(x_s)^{\top} (x_s - x) \leq \frac{R^2}{\eta} + \frac{\eta}{2 \rho} \sum_{s=1}^t \|\tg(x_s)\|_*^2 . [[/math]]

This immediately yields a rate of convergence thanks to the following simple observation based on the tower rule:

[[math]] \begin{eqnarray*} \E f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - f(x) & \leq & \frac{1}{t} \E \sum_{s=1}^t (f(x_s) - f(x)) \\ & \leq & \frac{1}{t} \E \sum_{s=1}^t \E(\tg(x_s) | x_s)^{\top} (x_s - x) \\ & = & \frac{1}{t} \E \sum_{s=1}^t \tg(x_s)^{\top} (x_s - x) . \end{eqnarray*} [[/math]]

We just proved the following theorem.

Theorem

Let [math]\Phi[/math] be a mirror map [math]1[/math]-strongly convex on [math]\mathcal{X} \cap \mathcal{D}[/math] with respect to [math]\|\cdot\|[/math], and let [math]R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)[/math]. Let [math]f[/math] be convex. Furthermore assume that the stochastic oracle is such that [math]\E \|\tg(x)\|_*^2 \leq B^2[/math]. Then S-MD with [math]\eta = \frac{R}{B} \sqrt{\frac{2}{t}}[/math] satisfies

[[math]] \E f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq R B \sqrt{\frac{2}{t}} . [[/math]]
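As an illustration of S-MD beyond the Euclidean setup, the following sketch runs stochastic mirror descent on the simplex with the negative entropy mirror map, so the mirror step is a multiplicative update followed by a renormalization. The linear objective, the Gaussian noise, and the constants are illustrative assumptions chosen to roughly match the step size of the theorem; this is a sketch, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# S-MD on the simplex with the negative-entropy mirror map Phi(x) = sum_i x_i log x_i.
# Illustrative objective: f(x) = E_xi <c + xi, x> for a cost vector c and zero-mean noise xi.
n, T = 50, 2000
c = rng.standard_normal(n)

def noisy_gradient(x):
    return c + rng.standard_normal(n)        # unbiased: E g~(x) = c = grad f(x)

B = np.abs(c).max() + 3.0                    # heuristic scale for ||g~||_inf (noise is Gaussian)
R2 = np.log(n)                               # sup Phi - Phi(x_1) when x_1 is uniform
eta = np.sqrt(2 * R2 / T) / B                # step size from the theorem, up to constants

x = np.full(n, 1.0 / n)
avg = np.zeros(n)
for _ in range(T):
    g = noisy_gradient(x)
    x = x * np.exp(-eta * g)                 # multiplicative update = entropic mirror step
    x /= x.sum()                             # Bregman projection onto the simplex
    avg += x / T

print(c @ avg, c.min())                      # f(average iterate) vs. the minimum over the simplex
```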

Similarly, in the Euclidean and strongly convex case, one can directly generalize Theorem. Precisely we consider stochastic gradient descent (SGD), that is S-MD with [math]\Phi(x) = \frac12 \|x\|_2^2[/math], with time-varying step size [math](\eta_t)_{t \geq 1}[/math], that is

[[math]] x_{t+1} = \Pi_{\cX}(x_t - \eta_t \tg(x_t)) . [[/math]]

Theorem

Let [math]f[/math] be [math]\alpha[/math]-strongly convex, and assume that the stochastic oracle is such that [math]\E \|\tg(x)\|_*^2 \leq B^2[/math]. Then SGD with [math]\eta_s = \frac{2}{\alpha (s+1)}[/math] satisfies

[[math]] f \left(\sum_{s=1}^t \frac{2 s}{t(t+1)} x_s \right) - f(x^*) \leq \frac{2 B^2}{\alpha (t+1)} . [[/math]]
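A minimal sketch of this result follows; the quadratic objective, the Euclidean ball constraint, and the data are illustrative assumptions. It uses the step sizes [math]\eta_s = \frac{2}{\alpha(s+1)}[/math] and the weighted averaging [math]\sum_s \frac{2s}{t(t+1)} x_s[/math] from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-strongly convex problem: f(x) = (1/m) sum_i 0.5 * ||x - a_i||^2,
# minimized at the mean of the a_i.
m, n, T = 500, 5, 3000
a = rng.standard_normal((m, n))
alpha = 1.0
radius = 10.0                                      # constraint set: Euclidean ball of this radius

def oracle(x):
    i = rng.integers(m)
    return x - a[i]                                # gradient of f_i at x, unbiased for grad f(x)

def project(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

x = np.zeros(n)
weighted_avg = np.zeros(n)
for s in range(1, T + 1):
    eta = 2.0 / (alpha * (s + 1))                  # step size from the theorem
    x = project(x - eta * oracle(x))
    weighted_avg += 2.0 * s / (T * (T + 1)) * x    # weights 2s/(t(t+1)) sum to 1

x_star = a.mean(axis=0)
print(np.linalg.norm(weighted_avg - x_star))       # should be small
```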

Smooth stochastic optimization and mini-batch SGD

In the previous section we showed that, for non-smooth optimization, there is basically no cost for having a stochastic oracle instead of an exact oracle. Unfortunately one can show (see e.g. [4]) that smoothness does not bring any acceleration for a general stochastic oracle[Notes 2]. This is in sharp contrast with the exact oracle case where we showed that gradient descent attains a [math]1/t[/math] rate (instead of [math]1/\sqrt{t}[/math] for non-smooth), and this could even be improved to [math]1/t^2[/math] thanks to Nesterov's accelerated gradient descent. The next result interpolates between the [math]1/\sqrt{t}[/math] for stochastic smooth optimization, and the [math]1/t[/math] for deterministic smooth optimization. We will use it to propose a useful modification of SGD in the smooth case. The proof is extracted from [5].

Theorem

Let [math]\Phi[/math] be a mirror map [math]1[/math]-strongly convex on [math]\mathcal{X} \cap \mathcal{D}[/math] w.r.t. [math]\|\cdot\|[/math], and let [math]R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)[/math]. Let [math]f[/math] be convex and [math]\beta[/math]-smooth w.r.t. [math]\|\cdot\|[/math]. Furthermore assume that the stochastic oracle is such that [math]\E \|\nabla f(x) - \tg(x)\|_*^2 \leq \sigma^2[/math]. Then S-MD with stepsize [math]\frac{1}{\beta + 1/\eta}[/math] and [math]\eta = \frac{R}{\sigma} \sqrt{\frac{2}{t}}[/math] satisfies

[[math]] \E f\bigg(\frac{1}{t} \sum_{s=1}^t x_{s+1} \bigg) - f(x^*) \leq R \sigma \sqrt{\frac{2}{t}} + \frac{\beta R^2}{t} . [[/math]]


Show Proof

Using [math]\beta[/math]-smoothness, Cauchy-Schwarz (together with the inequality [math]2 ab \leq c a^2+ b^2 / c[/math] for any [math]c \gt 0[/math], applied here with [math]c = \eta[/math]), and the 1-strong convexity of [math]\Phi[/math], one obtains

[[math]] \begin{align*} & f(x_{s+1}) - f(x_s) \\ & \leq \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & = \tg_s^{\top} (x_{s+1} - x_s) + (\nabla f(x_s) - \tg_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & \leq \tg_s^{\top} (x_{s+1} - x_s) + \frac{\eta}{2} \|\nabla f(x_s) - \tg_s\|_*^2 + \frac12 (\beta + 1/\eta) \|x_{s+1} - x_s\|^2 \\ & \leq \tg_s^{\top} (x_{s+1} - x_s) + \frac{\eta}{2} \|\nabla f(x_s) - \tg_s\|_*^2 + (\beta + 1/\eta) D_{\Phi}(x_{s+1}, x_s) . \end{align*} [[/math]]
Observe that, using the same argument as to derive eq:pourplustard1, one has

[[math]] \frac{1}{\beta + 1/\eta} \tg_s^{\top} (x_{s+1} - x^*) \leq D_{\Phi} (x^*, x_s) - D_{\Phi}(x^*, x_{s+1}) - D_{\Phi}(x_{s+1}, x_s) . [[/math]]
Thus

[[math]] \begin{align*} & f(x_{s+1}) \\ & \leq f(x_s) + \tg_s^{\top}(x^* - x_s) + (\beta + 1/\eta) \left(D_{\Phi} (x^*, x_s) - D_{\Phi}(x^*, x_{s+1})\right) \\ & \qquad + \frac{\eta}{2} \|\nabla f(x_s) - \tg_s\|_*^2 \\ & \leq f(x^*) + (\tg_s-\nabla f(x_s))^{\top}(x^* - x_s) \\ & \qquad + (\beta + 1/\eta) \left(D_{\Phi} (x^*, x_s) - D_{\Phi}(x^*, x_{s+1})\right) + \frac{\eta}{2} \|\nabla f(x_s) - \tg_s\|_*^2 . \end{align*} [[/math]]
In particular this yields

[[math]] \E f(x_{s+1}) - f(x^*) \leq (\beta + 1/\eta) \E \left(D_{\Phi} (x^*, x_s) - D_{\Phi}(x^*, x_{s+1})\right) + \frac{\eta \sigma^2}{2} . [[/math]]
By summing this inequality from [math]s=1[/math] to [math]s=t[/math] one can easily conclude with the standard argument.

We can now propose the following modification of SGD based on the idea of {\em mini-batches}. Let [math]m \in \N[/math], then mini-batch SGD iterates the following equation:

[[math]] x_{t+1} = \Pi_{\cX}\left(x_t - \frac{\eta}{m} \sum_{i=1}^m \tg_i(x_t)\right) , [[/math]]

where [math]\tg_i(x_t), i=1,\dots,m[/math] are independent random variables (conditionally on [math]x_t[/math]) obtained from repeated queries to the stochastic oracle. Assuming that [math]f[/math] is [math]\beta[/math]-smooth and that the stochastic oracle is such that [math]\|\tg(x)\|_2 \leq B[/math], one can obtain a rate of convergence for mini-batch SGD with Theorem. Indeed one can apply this result with the modified stochastic oracle that returns [math]\frac{1}{m} \sum_{i=1}^m \tg_i(x)[/math]; it satisfies

[[math]] \E \| \frac1{m} \sum_{i=1}^m \tg_i(x) - \nabla f(x) \|_2^2 = \frac{1}{m}\E \| \tg_1(x) - \nabla f(x) \|_2^2 \leq \frac{2 B^2}{m} . [[/math]]

Thus one obtains that with [math]t[/math] calls to the (original) stochastic oracle, that is [math]t/m[/math] iterations of the mini-batch SGD, one has a suboptimality gap bounded by

[[math]] R \sqrt{\frac{2 B^2}{m}} \sqrt{\frac{2}{t/m}} + \frac{\beta R^2}{t/m} = 2 \frac{R B}{\sqrt{t}} + \frac{m \beta R^2}{t} . [[/math]]

Thus as long as [math]m \leq \frac{B}{R \beta} \sqrt{t}[/math] one obtains, with mini-batch SGD and [math]t[/math] calls to the oracle, a point which is [math]3\frac{R B}{\sqrt{t}}[/math]-optimal. Mini-batch SGD can be a better option than basic SGD in at least two situations: (i) When the computation for an iteration of mini-batch SGD can be distributed between multiple processors. Indeed a central unit can send the message to the processors that estimates of the gradient at point [math]x_s[/math] have to be computed, then each processor can work independently and send back the estimate they obtained. (ii) Even in a serial setting mini-batch SGD can sometimes be advantageous, in particular if some calculations can be re-used to compute several estimated gradients at the same point.
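The following sketch illustrates the mini-batch idea; the least-squares instance, the batch size, and the step size are illustrative assumptions. Averaging the oracle answers over a batch divides the variance of the gradient estimate by the batch size, which is exactly what drives the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mini-batch SGD on a synthetic least-squares problem:
# f(x) = (1/N) sum_i 0.5 * (a_i^T x - b_i)^2.
N, n, batch, T = 1000, 20, 32, 500
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)
eta = 0.05

def minibatch_gradient(x):
    idx = rng.integers(N, size=batch)      # `batch` independent oracle queries
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch    # averaged estimate: variance divided by `batch`

x = np.zeros(n)
for _ in range(T):
    x -= eta * minibatch_gradient(x)       # unconstrained here, so the projection is the identity

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda z: 0.5 * np.mean((A @ z - b) ** 2)
print(f(x) - f(x_star))                    # suboptimality gap
```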

Sum of smooth and strongly convex functions

Let us examine in more detail the main example from Section Some convex optimization problems in machine learning. That is, one is interested in the unconstrained minimization of

[[math]] f(x) = \frac1{m} \sum_{i=1}^m f_i(x) , [[/math]]

where [math]f_1, \dots, f_m[/math] are [math]\beta[/math]-smooth and convex functions, and [math]f[/math] is [math]\alpha[/math]-strongly convex. Typically in machine learning [math]\alpha[/math] can be as small as [math]1/m[/math], while [math]\beta[/math] is of order of a constant. In other words the condition number [math]\kappa= \beta / \alpha[/math] can be as large as [math]\Omega(m)[/math]. Let us now compare the basic gradient descent, that is

[[math]] x_{t+1} = x_t - \frac{\eta}{m} \sum_{i=1}^m \nabla f_i(x_t) , [[/math]]

to SGD

[[math]] x_{t+1} = x_t - \eta \nabla f_{i_t}(x_t) , [[/math]]

where [math]i_t[/math] is drawn uniformly at random in [math][m][/math] (independently of everything else). Theorem shows that gradient descent requires [math]O(m \kappa \log(1/\epsilon))[/math] gradient computations (which can be improved to [math]O(m \sqrt{\kappa} \log(1/\epsilon))[/math] with Nesterov's accelerated gradient descent), while Theorem shows that SGD (with appropriate averaging) requires [math]O(1/ (\alpha \epsilon))[/math] gradient computations. Thus one can obtain a low accuracy solution reasonably fast with SGD, but for high accuracy the basic gradient descent is more suitable. Can we get the best of both worlds? This question was answered positively in [6] with SAG (Stochastic Averaged Gradient) and in [7] with SDCA (Stochastic Dual Coordinate Ascent). These methods require only [math]O((m+\kappa) \log(1/\epsilon))[/math] gradient computations. We describe below the SVRG (Stochastic Variance Reduced Gradient descent) algorithm from [8] which makes the main ideas of SAG and SDCA more transparent (see also [9] for more on the relation between these different methods). We also observe that a natural question is whether one can obtain a Nesterov-accelerated version of these algorithms that would need only [math]O((m + \sqrt{m \kappa}) \log(1/\epsilon))[/math], see [10][11][12] for recent works on this question.

To obtain a linear rate of convergence one needs to take ``big steps", that is the step-size should be of order of a constant. In SGD the step-size is typically of order [math]1/\sqrt{t}[/math] because of the variance introduced by the stochastic oracle. The idea of SVRG is to ``center" the output of the stochastic oracle in order to reduce the variance. Precisely instead of feeding [math]\nabla f_{i}(x)[/math] into the gradient descent one would use [math]\nabla f_i(x) - \nabla f_i(y) + \nabla f(y)[/math] where [math]y[/math] is a centering sequence. This is a sensible idea since, when [math]x[/math] and [math]y[/math] are close to the optimum, [math]\nabla f_i(x) - \nabla f_i(y)[/math] should have a small variance, and of course [math]\nabla f(y)[/math] will also be small (note that [math]\nabla f_i(x)[/math] by itself is not necessarily small). This intuition is made formal with the following lemma.

Lemma

Let [math]f_1, \dots f_m[/math] be [math]\beta[/math]-smooth convex functions on [math]\R^n[/math], and [math]i[/math] be a random variable uniformly distributed in [math][m][/math]. Then

[[math]] \E \| \nabla f_i(x) - \nabla f_i(x^*) \|_2^2 \leq 2 \beta (f(x) - f(x^*)) . [[/math]]


Show Proof

Let [math]g_i(x) = f_i(x) - f_i(x^*) - \nabla f_i(x^*)^{\top} (x - x^*)[/math]. By convexity of [math]f_i[/math] one has [math]g_i(x) \geq 0[/math] for any [math]x[/math] and in particular using eq:onestepofgd this yields [math]- g_i(x) \leq - \frac{1}{2\beta} \|\nabla g_i(x)\|_2^2[/math] which can be equivalently written as

[[math]] \| \nabla f_i(x) - \nabla f_i(x^*) \|_2^2 \leq 2 \beta (f_i(x) - f_i(x^*) - \nabla f_i(x^*)^{\top} (x - x^*)) . [[/math]]
Taking expectation with respect to [math]i[/math] and observing that [math]\E \nabla f_i(x^*) = \nabla f(x^*) = 0[/math] yields the claimed bound.

On the other hand the computation of [math]\nabla f(y)[/math] is expensive (it requires [math]m[/math] gradient computations), and thus the centering sequence should be updated more rarely than the main sequence. These ideas lead to the following epoch-based algorithm. Let [math]y^{(1)} \in \R^n[/math] be an arbitrary initial point. For [math]s=1, 2 \ldots[/math], let [math]x_1^{(s)}=y^{(s)}[/math]. For [math]t=1, \dots, k[/math] let

[[math]] x_{t+1}^{(s)} = x_t^{(s)} - \eta \left( \nabla f_{i_t^{(s)}}(x_t^{(s)}) - \nabla f_{i_t^{(s)}} (y^{(s)}) + \nabla f(y^{(s)}) \right) , [[/math]]

where [math]i_t^{(s)}[/math] is drawn uniformly at random (and independently of everything else) in [math][m][/math]. Also let

[[math]] y^{(s+1)} = \frac1{k} \sum_{t=1}^k x_t^{(s)} . [[/math]]
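A compact sketch of these SVRG epochs on an illustrative ridge least-squares instance is given below; the data, the smoothness and strong convexity estimates, and the number of epochs are assumptions made for the example, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# SVRG on f(x) = (1/m) sum_i f_i(x) with f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2.
m, n, lam = 500, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def grad_i(i, x):
    return A[i] * (A[i] @ x - b[i]) + lam * x

def full_grad(x):
    return A.T @ (A @ x - b) / m + lam * x

alpha = np.linalg.eigvalsh(A.T @ A / m + lam * np.eye(n))[0]   # strong convexity of f
beta = (np.linalg.norm(A, axis=1) ** 2).max() + lam            # smoothness of each f_i
eta, k = 1.0 / (10 * beta), int(20 * beta / alpha)             # parameters from the theorem

y = np.zeros(n)
for _ in range(50):                                            # epochs
    mu = full_grad(y)                          # expensive: one full gradient per epoch
    x, iterates = y.copy(), []
    for _ in range(k):
        i = rng.integers(m)
        v = grad_i(i, x) - grad_i(i, y) + mu   # centered (variance-reduced) estimate
        x -= eta * v
        iterates.append(x.copy())
    y = np.mean(iterates, axis=0)              # new centering point: average of the epoch

x_star = np.linalg.solve(A.T @ A / m + lam * np.eye(n), A.T @ b / m)
f = lambda z: 0.5 * np.mean((A @ z - b) ** 2) + 0.5 * lam * z @ z
print(f(y) - f(x_star))                        # decreases geometrically with the number of epochs
```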

Theorem

Let [math]f_1, \dots f_m[/math] be [math]\beta[/math]-smooth convex functions on [math]\R^n[/math] and [math]f[/math] be [math]\alpha[/math]-strongly convex. Then SVRG with [math]\eta = \frac{1}{10\beta}[/math] and [math]k = 20 \kappa[/math] satisfies

[[math]] \E f(y^{(s+1)}) - f(x^*) \leq 0.9^s (f(y^{(1)}) - f(x^*)) . [[/math]]


Show Proof

We fix a phase [math]s \geq 1[/math] and we denote by [math]\E[/math] the expectation taken with respect to [math]i_1^{(s)}, \dots, i_k^{(s)}[/math]. We show below that

[[math]] \E f(y^{(s+1)}) - f(x^*) = \E f\left(\frac1{k} \sum_{t=1}^k x_t^{(s)}\right) - f(x^*) \leq 0.9 (f(y^{(s)}) - f(x^*)) , [[/math]]
which clearly implies the theorem. To simplify the notation in the following we drop the dependency on [math]s[/math], that is we want to show that

[[math]] \begin{equation} \label{eq:SVRG0} \E f\left(\frac1{k} \sum_{t=1}^k x_t\right) - f(x^*) \leq 0.9 (f(y) - f(x^*)) . \end{equation} [[/math]]
We start as for the proof of Theorem (analysis of gradient descent for smooth and strongly convex functions) with

[[math]] \begin{equation} \label{eq:SVRG1} \|x_{t+1} - x^*\|_2^2 = \|x_t - x^*\|_2^2 - 2 \eta v_t^{\top}(x_t - x^*) + \eta^2 \|v_t\|_2^2 , \end{equation} [[/math]]
where

[[math]] v_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t} (y) + \nabla f(y) . [[/math]]
Using Lemma, we upper bound [math]\E_{i_t} \|v_t\|_2^2[/math] as follows (also recall that [math]\E\|X-\E(X)\|_2^2 \leq \E\|X\|_2^2[/math], and [math]\E_{i_t} \nabla f_{i_t}(x^*) = 0[/math]):

[[math]] \begin{align} & \E_{i_t} \|v_t\|_2^2 \notag \\ & \leq 2 \E_{i_t} \|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^*) \|_2^2 + 2 \E_{i_t} \|\nabla f_{i_t}(y) - \nabla f_{i_t}(x^*) - \nabla f(y) \|_2^2 \notag \\ & \leq 2 \E_{i_t} \|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^*) \|_2^2 + 2 \E_{i_t} \|\nabla f_{i_t}(y) - \nabla f_{i_t}(x^*) \|_2^2 \notag \\ & \leq 4 \beta (f(x_t) - f(x^*) + f(y) - f(x^*)) . \label{eq:SVRG2} \end{align} [[/math]]
Also observe that

[[math]] \E_{i_t} v_t^{\top}(x_t - x^*) = \nabla f(x_t)^{\top} (x_t - x^*) \geq f(x_t) - f(x^*) , [[/math]]
and thus plugging this into \eqref{eq:SVRG1} together with \eqref{eq:SVRG2} one obtains

[[math]] \begin{eqnarray*} \E_{i_t} \|x_{t+1} - x^*\|_2^2 & \leq & \|x_t - x^*\|_2^2 - 2 \eta (1 - 2 \beta \eta) (f(x_t) - f(x^*)) \\ & & + 4 \beta \eta^2 (f(y) - f(x^*)) . \end{eqnarray*} [[/math]]
Summing the above inequality over [math]t=1, \dots, k[/math] yields

[[math]] \begin{eqnarray*} \E \|x_{k+1} - x^*\|_2^2 & \leq & \|x_1 - x^*\|_2^2 - 2 \eta (1 - 2 \beta \eta) \E \sum_{t=1}^k (f(x_t) - f(x^*)) \\ & & + 4 \beta \eta^2 k (f(y) - f(x^*)) . \end{eqnarray*} [[/math]]
Noting that [math]x_1 = y[/math] and that by [math]\alpha[/math]-strong convexity one has [math]f(x) - f(x^*) \geq \frac{\alpha}{2} \|x - x^*\|_2^2[/math], one can rearrange the above display to obtain

[[math]] \E f\left(\frac1{k} \sum_{t=1}^k x_t\right) - f(x^*) \leq \left(\frac{1}{\alpha \eta (1 - 2 \beta \eta) k} + \frac{2 \beta \eta}{1- 2\beta \eta} \right) (f(y) - f(x^*)) . [[/math]]
Using that [math]\eta = \frac{1}{10\beta}[/math] and [math]k = 20 \kappa[/math] finally yields \eqref{eq:SVRG0} which itself concludes the proof.

Random coordinate descent

We assume throughout this section that [math]f[/math] is a convex and differentiable function on [math]\R^n[/math], with a unique[Notes 3] minimizer [math]x^*[/math]. We investigate one of the simplest possible schemes to optimize [math]f[/math], the random coordinate descent (RCD) method. In the following we denote [math]\nabla_i f(x) = \frac{\partial f}{\partial x_i} (x)[/math]. RCD is defined as follows, with an arbitrary initial point [math]x_1 \in \R^n[/math],

[[math]] x_{s+1} = x_s - \eta \nabla_{i_s} f(x_s) e_{i_s} , [[/math]]

where [math]i_s[/math] is drawn uniformly at random from [math][n][/math] (and independently of everything else). One can view RCD as SGD with the specific oracle [math]\tg(x) = n \nabla_{I} f(x) e_I[/math] where [math]I[/math] is drawn uniformly at random from [math][n][/math]. Clearly [math]\E \tg(x) = \nabla f(x)[/math], and furthermore

[[math]] \E \|\tg(x)\|_2^2 = \frac{1}{n}\sum_{i=1}^n \|n \nabla_{i} f(x) e_i\|_2^2 = n \|\nabla f(x)\|_2^2 . [[/math]]

Thus using Theorem (with [math]\Phi(x) = \frac12 \|x\|_2^2[/math], that is S-MD being SGD) one immediately obtains the following result.

Theorem

Let [math]f[/math] be convex and [math]L[/math]-Lipschitz on [math]\R^n[/math], then RCD with [math]\eta = \frac{R}{L} \sqrt{\frac{2}{n t}}[/math] satisfies

[[math]] \E f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq R L \sqrt{\frac{2 n}{t}} . [[/math]]

Somewhat unsurprisingly RCD requires [math]n[/math] times more iterations than gradient descent to obtain the same accuracy. In the next section, we will see that this statement can be greatly improved by taking into account directional smoothness.

RCD for coordinate-smooth optimization

We assume now directional smoothness for [math]f[/math], that is there exist [math]\beta_1, \dots, \beta_n[/math] such that for any [math]i \in [n], x \in \R^n[/math] and [math]u \in \R[/math],

[[math]] | \nabla_i f(x+u e_i) - \nabla_i f(x) | \leq \beta_i |u| . [[/math]]

If [math]f[/math] is twice differentiable then this is equivalent to [math](\nabla^2 f(x))_{i,i} \leq \beta_i[/math]. In particular, since the maximal eigenvalue of a matrix is upper bounded by its trace, one can see that the directional smoothness implies that [math]f[/math] is [math]\beta[/math]-smooth with [math]\beta \leq \sum_{i=1}^n \beta_i[/math]. We now study the following ``aggressive" RCD, where the step-sizes are of order of the inverse smoothness:

[[math]] x_{s+1} = x_s - \frac{1}{\beta_{i_s}} \nabla_{i_s} f(x_s) e_{i_s} . [[/math]]

Furthermore we study a more general sampling distribution than uniform, precisely for [math]\gamma \geq 0[/math] we assume that [math]i_s[/math] is drawn (independently) from the distribution [math]p_{\gamma}[/math] defined by

[[math]] p_{\gamma}(i) = \frac{\beta_i^{\gamma}}{\sum_{j=1}^n \beta_j^{\gamma}}, i \in [n] . [[/math]]

This algorithm was proposed in [13], and we denote it by RCD([math]\gamma[/math]). Observe that, up to a preprocessing step of complexity [math]O(n)[/math], one can sample from [math]p_{\gamma}[/math] in time [math]O(\log(n))[/math]. The following rate of convergence is derived in [13], using the dual norms [math]\|\cdot\|_{[\gamma]}, \|\cdot\|_{[\gamma]}^*[/math] defined by

[[math]] \|x\|_{[\gamma]} = \sqrt{\sum_{i=1}^n \beta_i^{\gamma} x_i^2} , \;\; \text{and} \;\; \|x\|_{[\gamma]}^* = \sqrt{\sum_{i=1}^n \frac1{\beta_i^{\gamma}} x_i^2} . [[/math]]

Theorem

Let [math]f[/math] be convex and such that [math]u \in \R \mapsto f(x + u e_i)[/math] is [math]\beta_i[/math]-smooth for any [math]i \in [n], x \in \R^n[/math]. Then RCD([math]\gamma[/math]) satisfies for [math]t \geq 2[/math],

[[math]] \E f(x_{t}) - f(x^*) \leq \frac{2 R_{1 - \gamma}^2(x_1) \sum_{i=1}^n \beta_i^{\gamma}}{t-1} , [[/math]]
where

[[math]] R_{1-\gamma}(x_1) = \sup_{x \in \R^n : f(x) \leq f(x_1)} \|x - x^*\|_{[1-\gamma]} . [[/math]]
Recall from Theorem that in this context the basic gradient descent attains a rate of [math]\beta \|x_1 - x^*\|_2^2 / t[/math] where [math]\beta \leq \sum_{i=1}^n \beta_i[/math] (see the discussion above). Thus we see that RCD([math]1[/math]) greatly improves upon gradient descent for functions where [math]\beta[/math] is of order of [math]\sum_{i=1}^n \beta_i[/math]. Indeed in this case both methods attain the same accuracy after a fixed number of iterations, but the iterations of coordinate descent are potentially much cheaper than the iterations of gradient descent.


Show Proof

By applying eq:onestepofgd to the [math]\beta_i[/math]-smooth function [math]u \in \R \mapsto f(x + u e_i)[/math] one obtains

[[math]] f\left(x - \frac{1}{\beta_i} \nabla_i f(x) e_i\right) - f(x) \leq - \frac{1}{2 \beta_i} (\nabla_i f(x))^2 . [[/math]]
We use this as follows:

[[math]] \begin{eqnarray*} \E_{i_s} f(x_{s+1}) - f(x_s) & = & \sum_{i=1}^n p_{\gamma}(i) \left(f\left(x_s - \frac{1}{\beta_i} \nabla_i f(x_s) e_i\right) - f(x_s) \right) \\ & \leq & - \sum_{i=1}^n \frac{p_{\gamma}(i)}{2 \beta_i} (\nabla_i f(x_s))^2 \\ & = & - \frac{1}{2 \sum_{i=1}^n \beta_i^{\gamma}} \left(\|\nabla f(x_s)\|_{[1-\gamma]}^*\right)^2 . \end{eqnarray*} [[/math]]
Denote [math]\delta_s = \E f(x_s) - f(x^*)[/math]. Observe that the above calculation can be used to show that [math]f(x_{s+1}) \leq f(x_s)[/math] and thus one has, by definition of [math]R_{1-\gamma}(x_1)[/math],

[[math]] \begin{eqnarray*} \delta_s & \leq & \nabla f(x_s)^{\top} (x_s - x^*) \\ & \leq & \|x_s - x^*\|_{[1-\gamma]} \|\nabla f(x_s)\|_{[1-\gamma]}^* \\ & \leq & R_{1-\gamma}(x_1) \|\nabla f(x_s)\|_{[1-\gamma]}^* . \end{eqnarray*} [[/math]]
Thus putting together the above calculations one obtains

[[math]] \delta_{s+1} \leq \delta_s - \frac{1}{2 R_{1 - \gamma}^2(x_1) \sum_{i=1}^n \beta_i^{\gamma} } \delta_s^2 . [[/math]]
The proof can be concluded with computations similar to those for Theorem.

We discussed above the specific case of [math]\gamma = 1[/math]. Both [math]\gamma=0[/math] and [math]\gamma=1/2[/math] also have an interesting behavior, and we refer to [13] for more details. The latter paper also contains a discussion of high probability results and potential acceleration à la Nesterov. We also refer to [14] for a discussion of RCD in a distributed setting.
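A short sketch of RCD([math]\gamma[/math]) on a convex quadratic follows; the instance and the number of iterations are illustrative assumptions. The directional smoothness constants are the diagonal entries of the Hessian, coordinates are sampled from [math]p_{\gamma}[/math], and each coordinate is updated with step size [math]1/\beta_i[/math].

```python
import numpy as np

rng = np.random.default_rng(0)

# RCD(gamma) on f(x) = 0.5 * x^T Q x - q^T x with Q positive definite (synthetic instance).
n, gamma, T = 100, 1.0, 50000
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)
q = rng.standard_normal(n)

beta = np.diag(Q).copy()                    # directional smoothness: beta_i = Q[i, i]
p = beta ** gamma
p /= p.sum()                                # sampling distribution p_gamma

x = np.zeros(n)
for _ in range(T):
    i = rng.choice(n, p=p)
    grad_i = Q[i] @ x - q[i]                # i-th partial derivative of f, costs O(n)
    x[i] -= grad_i / beta[i]                # coordinate step with step size 1/beta_i

x_star = np.linalg.solve(Q, q)
f = lambda z: 0.5 * z @ Q @ z - q @ z
print(f(x) - f(x_star))                     # suboptimality after T coordinate steps
```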

RCD for smooth and strongly convex optimization

If in addition to directional smoothness one also assumes strong convexity, then RCD attains in fact a linear rate.

Theorem

Let [math]\gamma \geq 0[/math]. Let [math]f[/math] be [math]\alpha[/math]-strongly convex w.r.t. [math]\|\cdot\|_{[1-\gamma]}[/math], and such that [math]u \in \R \mapsto f(x + u e_i)[/math] is [math]\beta_i[/math]-smooth for any [math]i \in [n], x \in \R^n[/math]. Let [math]\kappa_{\gamma} = \frac{\sum_{i=1}^n \beta_i^{\gamma}}{\alpha}[/math], then RCD([math]\gamma[/math]) satisfies

[[math]] \E f(x_{t+1}) - f(x^*) \leq \left(1 - \frac1{\kappa_{\gamma}}\right)^t (f(x_1) - f(x^*)) . [[/math]]

Show Proof

In the proof of Theorem we showed that

[[math]] \delta_{s+1} \leq \delta_s - \frac{1}{2 \sum_{i=1}^n \beta_i^{\gamma}} \left(\|\nabla f(x_s)\|_{[1-\gamma]}^*\right)^2 . [[/math]]
On the other hand Lemma shows that

[[math]] \left(\|\nabla f(x_s)\|_{[1-\gamma]}^*\right)^2 \geq 2 \alpha \delta_s . [[/math]]
The proof is concluded with straightforward calculations.

We use the following elementary lemma.

Lemma

Let [math]f[/math] be [math]\alpha[/math]-strongly convex w.r.t. [math]\| \cdot\|[/math] on [math]\R^n[/math], then

[[math]] f(x) - f(x^*) \leq \frac1{2\alpha} \|\nabla f(x)\|_*^2 . [[/math]]


Show Proof

By strong convexity, Hölder’s inequality, and an elementary calculation,

[[math]] \begin{eqnarray*} f(x) - f(y) & \leq & \nabla f(x)^{\top} (x-y) - \frac{\alpha}{2} \|x-y\|^2 \\ & \leq & \|\nabla f(x)\|_* \|x-y\| - \frac{\alpha}{2} \|x-y\|^2 \\ & \leq & \frac1{2\alpha} \|\nabla f(x)\|_*^2 , \end{eqnarray*} [[/math]]
which concludes the proof by taking [math]y = x^*[/math].

Acceleration by randomization for saddle points

We explore now the use of randomness for saddle point computations. That is, we consider the context of Section Saddle point computation, with a stochastic oracle of the following form: given [math]z=(x,y) \in \cX \times \cY[/math] it outputs [math]\tg(z) = (\tg_{\cX}(x,y), \tg_{\cY}(x,y))[/math] where [math]\E \ (\tg_{\cX}(x,y) | x,y) \in \partial_x \phi(x,y)[/math], and [math]\E \ (\tg_{\cY}(x,y) | x,y) \in \partial_y (-\phi(x,y))[/math]. Instead of using true subgradients as in SP-MD (see Section Saddle Point Mirror Descent (SP-MD)) we use here the outputs of the stochastic oracle. We refer to the resulting algorithm as S-SP-MD (Stochastic Saddle Point Mirror Descent). Using the same reasoning as in Section Non-smooth stochastic optimization and Section Saddle Point Mirror Descent (SP-MD) one can derive the following theorem.

Theorem

Assume that the stochastic oracle is such that [math]\E \left(\|\tg_{\cX}(x,y)\|_{\cX}^* \right)^2 \leq B_{\cX}^2[/math], and [math]\E \left(\|\tg_{\cY}(x,y)\|_{\cY}^* \right)^2 \leq B_{\cY}^2[/math]. Then S-SP-MD with [math]a= \frac{B_{\cX}}{R_{\cX}}[/math], [math]b=\frac{B_{\cY}}{R_{\cY}}[/math], and [math]\eta=\sqrt{\frac{2}{t}}[/math] satisfies

[[math]] \E \left( \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \right) \leq (R_{\cX} B_{\cX} + R_{\cY} B_{\cY}) \sqrt{\frac{2}{t}}. [[/math]]

Using S-SP-MD we revisit the examples of Section Matrix games and Section Linear classification. In both cases one has [math]\phi(x,y) = x^{\top} A y[/math] (with [math]A_i[/math] being the [math]i^{th}[/math] column of [math]A[/math]), and thus [math]\nabla_x \phi(x,y) = Ay[/math] and [math]\nabla_y \phi(x,y) = A^{\top} x[/math].

Matrix games. Here [math]x \in \Delta_n[/math] and [math]y \in \Delta_m[/math]. Thus there is a quite natural stochastic oracle:

[[math]] \begin{equation} \label{eq:oraclematrixgame} \tg_{\cX}(x,y) = A_I, \; \text{where} \; I \in [m] \; \text{is drawn according to} \; y \in \Delta_m , \end{equation} [[/math]]

and [math]\forall i \in [m][/math],

[[math]] \begin{equation} \label{eq:oraclematrixgame2} \tg_{\cY}(x,y)(i) = A_i(J), \; \text{where} \; J \in [n] \; \text{is drawn according to} \; x \in \Delta_n . \end{equation} [[/math]]

Clearly [math]\|\tg_{\cX}(x,y)\|_{\infty} \leq \|A\|_{\mathrm{max}}[/math] and [math]\|\tg_{\cY}(x,y)\|_{\infty} \leq \|A\|_{\mathrm{max}}[/math], which implies that S-SP-MD attains an [math]\epsilon[/math]-optimal pair of points with [math]O\left(\|A\|_{\mathrm{max}}^2 \log(n+m) / \epsilon^2 \right)[/math] iterations. Furthermore the computational complexity of a step of S-SP-MD is dominated by drawing the indices [math]I[/math] and [math]J[/math] which takes [math]O(n + m)[/math]. Thus overall the complexity of getting an [math]\epsilon[/math]-optimal Nash equilibrium with S-SP-MD is [math]O\left(\|A\|_{\mathrm{max}}^2 (n + m) \log(n+m) / \epsilon^2 \right)[/math]. While the dependency on [math]\epsilon[/math] is worse than for SP-MP (see Section Matrix games), the dependency on the dimensions is [math]\tilde{O}(n+m)[/math] instead of [math]\tilde{O}(nm)[/math]. In particular, quite astonishingly, this is {\em sublinear} in the size of the matrix [math]A[/math]. The possibility of sublinear algorithms for this problem was first observed in [15].
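To make this concrete, here is a minimal sketch of S-SP-MD for the matrix game; the random payoff matrix, the single step size, and the iteration count are illustrative assumptions, and the theorem's weighted mirror map is simplified to one entropic step on each simplex. Each iteration touches only one column and one row of [math]A[/math], hence the [math]O(n+m)[/math] cost per step.

```python
import numpy as np

rng = np.random.default_rng(0)

# S-SP-MD for min_{x in Delta_n} max_{y in Delta_m} x^T A y with the sampling oracles above.
n, m, T = 200, 300, 20000
A = rng.random((n, m))
eta = np.sqrt(2.0 / T) / np.abs(A).max()        # step scaled by ||A||_max, up to constants

x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
x_avg, y_avg = np.zeros(n), np.zeros(m)
for _ in range(T):
    I = rng.choice(m, p=y)                      # column index drawn according to y
    J = rng.choice(n, p=x)                      # row index drawn according to x
    gx = A[:, I]                                # unbiased estimate of grad_x = A y
    gy = A[J, :]                                # unbiased estimate of grad_y = A^T x
    x *= np.exp(-eta * gx); x /= x.sum()        # entropic mirror step (x plays min)
    y *= np.exp(+eta * gy); y /= y.sum()        # entropic mirror step (y plays max)
    x_avg += x / T; y_avg += y / T

# duality gap of the averaged pair: max_y phi(x_avg, y) - min_x phi(x, y_avg)
print((A.T @ x_avg).max() - (A @ y_avg).min())
```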

Linear classification. Here [math]x \in \mB_{2,n}[/math] and [math]y \in \Delta_m[/math]. Thus the stochastic oracle for the [math]x[/math]-subgradient can be taken as in \eqref{eq:oraclematrixgame} but for the [math]y[/math]-subgradient we modify \eqref{eq:oraclematrixgame2} as follows. For a vector [math]x[/math] we denote by [math]x^2[/math] the vector such that [math]x^2(i) = x(i)^2[/math]. For all [math]i \in [m][/math],

[[math]] \tg_{\cY}(x,y)(i) = \frac{\|x\|_2^2}{x(J)} A_i(J), \; \text{where} \; J \in [n] \; \text{is drawn according to} \; \frac{x^2}{\|x\|_2^2} \in \Delta_n . [[/math]]

Note that one indeed has [math]\E (\tg_{\cY}(x,y)(i) | x,y) = \sum_{j=1}^n x(j) A_i(j) = (A^{\top} x)(i)[/math]. Furthermore [math]\|\tg_{\cX}(x,y)\|_2 \leq B[/math], and

[[math]] \E (\|\tg_{\cY}(x,y)\|_{\infty}^2 | x,y) = \sum_{j=1}^n \frac{x(j)^2}{\|x\|_2^2} \max_{i \in [m]} \left(\frac{\|x\|_2^2}{x(j)} A_i(j)\right)^2 \leq \sum_{j=1}^n \max_{i \in [m]} A_i(j)^2 . [[/math]]

Unfortunately this last term can be [math]O(n)[/math]. However it turns out that one can do a more careful analysis of mirror descent in terms of local norms, which allows one to prove that the ``local variance" is dimension-free. We refer to [16] for more details on these local norms, and to [17] for the specific details in the linear classification situation.
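As a quick numerical check of the importance-sampling oracle above (the data and sample sizes are illustrative assumptions), one can verify unbiasedness empirically: drawing [math]J[/math] with probability [math]x(J)^2/\|x\|_2^2[/math] and rescaling by [math]\|x\|_2^2/x(J)[/math] reproduces [math]A^{\top} x[/math] on average.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 50, 30
A = rng.standard_normal((n, m))
x = rng.standard_normal(n)
x /= 2 * np.linalg.norm(x)                 # a point inside the Euclidean unit ball

p = x ** 2 / np.dot(x, x)                  # sampling distribution over coordinates
J = rng.choice(n, size=100000, p=p)        # many independent oracle calls
scale = np.dot(x, x) / x[J]                # importance weights ||x||_2^2 / x(J)
est = np.mean(scale[:, None] * A[J, :], axis=0)

print(np.max(np.abs(est - A.T @ x)))       # close to zero: the oracle is unbiased
```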

Convex relaxation and randomized rounding

In this section we briefly discuss the concept of convex relaxation, and the use of randomization to find approximate solutions. By now there is an enormous literature on these topics, and we refer to [18] for further pointers. We study here the seminal example of [math]\mathrm{MAXCUT}[/math]. This problem can be described as follows. Let [math]A \in \R_+^{n \times n}[/math] be a symmetric matrix of non-negative weights. The entry [math]A_{i,j}[/math] is interpreted as a measure of the ``dissimilarity" between point [math]i[/math] and point [math]j[/math]. The goal is to find a partition of [math][n][/math] into two sets, [math]S \subset [n][/math] and [math]S^c[/math], so as to maximize the total dissimilarity between the two groups: [math]\sum_{i \in S, j \in S^c} A_{i,j}[/math]. Equivalently [math]\mathrm{MAXCUT}[/math] corresponds to the following optimization problem:

[[math]] \begin{equation} \label{eq:maxcut1} \max_{x \in \{-1,1\}^n} \frac12 \sum_{i,j =1}^n A_{i,j} (x_i - x_j)^2 . \end{equation} [[/math]]

Viewing [math]A[/math] as the (weighted) adjacency matrix of a graph, one can rewrite \eqref{eq:maxcut1} as follows, using the graph Laplacian [math]L=D-A[/math] where [math]D[/math] is the diagonal matrix with entries [math](\sum_{j=1}^n A_{i,j})_{i \in [n]}[/math],

[[math]] \begin{equation} \label{eq:maxcut2} \max_{x \in \{-1,1\}^n} x^{\top} L x . \end{equation} [[/math]]

It turns out that this optimization problem is [math]\mathbf{NP}[/math]-hard, that is the existence of a polynomial time algorithm to solve \eqref{eq:maxcut2} would prove that [math]\mathbf{P} = \mathbf{NP}[/math]. The combinatorial difficulty of this problem stems from the hypercube constraint. Indeed if one replaces [math]\{-1,1\}^n[/math] by the Euclidean sphere, then one obtains an efficiently solvable problem (it is the problem of computing the maximal eigenvalue of [math]L[/math]). We show now that, while \eqref{eq:maxcut2} is a difficult optimization problem, it is in fact possible to find relatively good {\em approximate} solutions by using the power of randomization. Let [math]\zeta[/math] be uniformly drawn on the hypercube [math]\{-1,1\}^n[/math], then clearly

[[math]] \E \ \zeta^{\top} L \zeta = \sum_{i,j=1, i \neq j}^n A_{i,j} \geq \frac{1}{2} \max_{x \in \{-1,1\}^n} x^{\top} L x . [[/math]]

This means that, on average, [math]\zeta[/math] is a [math]1/2[/math]-approximate solution to \eqref{eq:maxcut2}. Furthermore it is immediate that the above expectation bound implies that, with probability at least [math]\epsilon[/math], [math]\zeta[/math] is a [math](1/2-\epsilon)[/math]-approximate solution. Thus by repeatedly sampling uniformly from the hypercube one can get arbitrarily close (with probability approaching [math]1[/math]) to a [math]1/2[/math]-approximation of [math]\mathrm{MAXCUT}[/math]. Next we show that one can obtain an even better approximation ratio by combining the power of convex optimization and randomization. This approach was pioneered by [19]. The Goemans-Williamson algorithm is based on the following inequality

[[math]] \max_{x \in \{-1,1\}^n} x^{\top} L x = \max_{x \in \{-1,1\}^n} \langle L, xx^{\top} \rangle \leq \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i \in [n]} \langle L, X \rangle . [[/math]]

The right hand side in the above display is known as the {\em convex (or SDP) relaxation} of [math]\mathrm{MAXCUT}[/math]. The convex relaxation is an SDP and thus one can find its solution efficiently with Interior Point Methods (see Section Interior point methods). The following result states both the Goemans-Williamson strategy and the corresponding approximation ratio.

Theorem

Let [math]\Sigma[/math] be the solution to the SDP relaxation of [math]\mathrm{MAXCUT}[/math]. Let [math]\xi \sim \cN(0, \Sigma)[/math] and [math]\zeta = \mathrm{sign}(\xi) \in \{-1,1\}^n[/math]. Then

[[math]] \E \ \zeta^{\top} L \zeta \geq 0.878 \max_{x \in \{-1,1\}^n} x^{\top} L x . [[/math]]

Show Proof

We shall use the following inequality:

[[math]] \begin{equation} \label{eq:dependsonL} 1 - \frac{2}{\pi} \mathrm{arcsin}(t) \geq 0.878 (1-t), \ \forall t \in [-1,1] . \end{equation} [[/math]]
Also remark that for [math]X \in \R^{n \times n}[/math] such that [math]X_{i,i}=1[/math], one has

[[math]] \langle L, X \rangle = \sum_{i,j=1}^n A_{i,j} (1 - X_{i,j}) , [[/math]]
and in particular for [math]x \in \{-1,1\}^n[/math], [math]x^{\top} L x = \sum_{i,j=1}^n A_{i,j} (1 - x_i x_j)[/math]. Thus, using Lemma, and the facts that [math]A_{i,j} \geq 0[/math] and [math]|\Sigma_{i,j}| \leq 1[/math] (see the proof of Lemma), one has

[[math]] \begin{eqnarray*} \E \ \zeta^{\top} L \zeta & = & \sum_{i,j=1}^n A_{i,j} \left(1- \frac{2}{\pi} \mathrm{arcsin} \left(\Sigma_{i,j}\right)\right) \\ & \geq & 0.878 \sum_{i,j=1}^n A_{i,j} \left(1- \Sigma_{i,j}\right) \\ & = & 0.878 \ \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i \in [n]} \langle L, X \rangle \\ & \geq & 0.878 \max_{x \in \{-1,1\}^n} x^{\top} L x . \end{eqnarray*} [[/math]]

The proof of this result is based on the following elementary geometric lemma.

Lemma

Let [math]\xi \sim \mathcal{N}(0,\Sigma)[/math] with [math]\Sigma_{i,i}=1[/math] for [math]i \in [n][/math], and [math]\zeta = \mathrm{sign}(\xi)[/math]. Then

[[math]] \E \ \zeta_i \zeta_j = \frac{2}{\pi} \mathrm{arcsin} \left(\Sigma_{i,j}\right) . [[/math]]


Show Proof

Let [math]V \in \R^{n \times n}[/math] (with [math]i^{th}[/math] row [math]V_i^{\top}[/math]) be such that [math]\Sigma = V V^{\top}[/math]. Note that since [math]\Sigma_{i,i}=1[/math] one has [math]\|V_i\|_2 = 1[/math] (remark also that necessarily [math]|\Sigma_{i,j}| \leq 1[/math], which will be important in the proof of Theorem). Let [math]\epsilon \sim \mathcal{N}(0,\mI_n)[/math] be such that [math]\xi = V \epsilon[/math]. Then [math]\zeta_i = \mathrm{sign}(V_i^{\top} \epsilon)[/math], and in particular

[[math]] \begin{eqnarray*} \E \ \zeta_i \zeta_j & = & \P(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon \geq 0) + \P(V_i^{\top} \epsilon \leq 0 \ \text{and} \ V_j^{\top} \epsilon \leq 0) \\ & & - \P(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon \lt 0) - \P(V_i^{\top} \epsilon \lt 0 \ \text{and} \ V_j^{\top} \epsilon \geq 0) \\ & = & 2 \P(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon \geq 0) - 2 \P(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon \lt 0) \\ & = & \P(V_j^{\top} \epsilon \geq 0 | V_i^{\top} \epsilon \geq 0) - \P(V_j^{\top} \epsilon \lt 0 | V_i^{\top} \epsilon \geq 0) \\ & = & 1 - 2 \P(V_j^{\top} \epsilon \lt 0 | V_i^{\top} \epsilon \geq 0). \end{eqnarray*} [[/math]]
Now a quick picture shows that [math]\P(V_j^{\top} \epsilon \lt 0 | V_i^{\top} \epsilon \geq 0) = \frac{1}{\pi} \mathrm{arccos}(V_i^{\top} V_j)[/math] (recall that [math]\epsilon / \|\epsilon\|_2[/math] is uniform on the Euclidean sphere). Using the fact that [math]V_i^{\top} V_j = \Sigma_{i,j}[/math] and [math]\mathrm{arccos}(x) = \frac{\pi}{2} - \mathrm{arcsin}(x)[/math] concludes the proof.
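Putting the relaxation and the rounding together, here is a minimal numerical sketch of the Goemans-Williamson procedure on a small random graph. It assumes the cvxpy package (with its default SDP-capable solver) is available; the graph, the eigenvalue clipping used before the Gaussian sampling, and the problem size are assumptions of the example, not part of the text.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

# Small random weighted graph and its Laplacian.
n = 20
W = np.triu(rng.random((n, n)), 1)
A = W + W.T
L = np.diag(A.sum(axis=1)) - A

# SDP relaxation: maximize <L, X> over PSD matrices with unit diagonal.
X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Maximize(cp.trace(L @ X)), [cp.diag(X) == 1])
prob.solve()
Sigma = X.value

# Randomized rounding: zeta = sign(xi) with xi ~ N(0, Sigma).
w, V = np.linalg.eigh(Sigma)
root = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))   # a square root of Sigma
zeta = np.sign(root @ rng.standard_normal(n))

print(zeta @ L @ zeta, prob.value)   # cut value of the rounded point vs. SDP upper bound
```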

Theorem depends on the form of the Laplacian [math]L[/math] (insofar as \eqref{eq:dependsonL} was used). We show next a result from [20] that applies to any positive semi-definite matrix, at the expense of the constant of approximation. Precisely we are now interested in the following optimization problem:

[[math]] \begin{equation} \label{eq:quad} \max_{x \in \{-1,1\}^n} x^{\top} B x . \end{equation} [[/math]]

The corresponding SDP relaxation is

[[math]] \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i \in [n]} \langle B, X \rangle . [[/math]]

Theorem

Let [math]\Sigma[/math] be the solution to the SDP relaxation of \eqref{eq:quad}. Let [math]\xi \sim \cN(0, \Sigma)[/math] and [math]\zeta = \mathrm{sign}(\xi) \in \{-1,1\}^n[/math]. Then

[[math]] \E \ \zeta^{\top} B \zeta \geq \frac{2}{\pi} \max_{x \in \{-1,1\}^n} x^{\top} B x . [[/math]]


Show Proof

Lemma shows that

[[math]] \E \ \zeta^{\top} B \zeta = \sum_{i,j=1}^n B_{i,j} \frac{2}{\pi} \mathrm{arcsin} \left(\Sigma_{i,j}\right) = \frac{2}{\pi} \langle B, \mathrm{arcsin}(\Sigma) \rangle . [[/math]]
Thus to prove the result it is enough to show that [math]\langle B, \mathrm{arcsin}(\Sigma) \rangle \geq \langle B, \Sigma \rangle[/math], which is itself implied by [math]\mathrm{arcsin}(\Sigma) \succeq \Sigma[/math] (the implication is true since [math]B[/math] is positive semi-definite, just write the eigendecomposition). Now we prove the latter inequality via a Taylor expansion. Indeed recall that [math]|\Sigma_{i,j}| \leq 1[/math] and thus denoting by [math]A^{\circ \alpha}[/math] the matrix where the entries are raised to the power [math]\alpha[/math] one has

[[math]] \mathrm{arcsin}(\Sigma) = \sum_{k=0}^{+\infty} \frac{{2k \choose k}}{4^k (2k +1)} \Sigma^{\circ (2k+1)} = \Sigma + \sum_{k=1}^{+\infty} \frac{{2k \choose k}}{4^k (2k +1)} \Sigma^{\circ (2k+1)}. [[/math]]
Finally one can conclude using the fact that if [math]A,B \succeq 0[/math] then [math]A \circ B \succeq 0[/math]. This can be seen by writing [math]A= V V^{\top}[/math], [math]B=U U^{\top}[/math], and thus

[[math]] (A \circ B)_{i,j} = V_i^{\top} V_j U_i^{\top} U_j = \mathrm{Tr}(U_j V_j^{\top} V_i U_i^{\top}) = \langle V_i U_i^{\top}, V_j U_j^{\top} \rangle . [[/math]]
In other words [math]A \circ B[/math] is a Gram matrix and thus it is positive semi-definite.

Random walk based methods

Randomization naturally suggests itself in the center of gravity method (see Section The center of gravity method), as a way to circumvent the exact calculation of the center of gravity. This idea was proposed and developed in [21]. We give below a condensed version of the main ideas of this paper. Assuming that one can draw independent points [math]X_1, \dots, X_N[/math] uniformly at random from the current set [math]\cS_t[/math], one could replace [math]c_t[/math] by [math]\hat{c}_t = \frac{1}{N} \sum_{i=1}^N X_i[/math]. [21] proved the following generalization of Lemma for the situation where one cuts a convex set through a point close to the center of gravity. Recall that a convex set [math]\cK[/math] is in isotropic position if [math]\E X = 0[/math] and [math]\E X X^{\top} = \mI_n[/math], where [math]X[/math] is a random variable drawn uniformly at random from [math]\cK[/math]. Note in particular that this implies [math]\E \|X\|_2^2 = n[/math]. We also say that [math]\cK[/math] is in near-isotropic position if [math]\frac{1}{2} \mI_n \preceq \E X X^{\top} \preceq \frac3{2} \mI_n[/math].

Lemma

Let [math]\cK[/math] be a convex set in isotropic position. Then for any [math]w \in \R^n, w \neq 0[/math], [math]z \in \R^n[/math], one has

[[math]] \mathrm{Vol} \left( \cK \cap \{x \in \R^n : (x-z)^{\top} w \geq 0\} \right) \geq \left(\frac{1}{e} - \|z\|_2\right) \mathrm{Vol} (\cK) . [[/math]]

Thus if one can ensure that [math]\cS_t[/math] is in (near) isotropic position, and [math]\|c_t - \hat{c}_t\|_2[/math] is small (say smaller than [math]0.1[/math]), then the randomized center of gravity method (which replaces [math]c_t[/math] by [math]\hat{c}_t[/math]) will converge at the same speed as the original center of gravity method. Assuming that [math]\cS_t[/math] is in isotropic position one immediately obtains [math]\E \|c_t - \hat{c}_t\|_2^2 = \frac{n}{N}[/math], and thus by Chebyshev’s inequality one has [math]\P(\|c_t - \hat{c}_t\|_2 \gt 0.1) \leq 100 \frac{n}{N}[/math]. In other words with [math]N = O(n)[/math] one can ensure that the randomized center of gravity method makes progress on a constant fraction of the iterations (to ensure progress at every step one would need a larger value of [math]N[/math] because of a union bound, but this is unnecessary). Let us now consider the issue of putting [math]\cS_t[/math] in near-isotropic position. Let [math]\hat{\Sigma}_t = \frac1{N} \sum_{i=1}^N (X_i-\hat{c}_t) (X_i-\hat{c}_t)^{\top}[/math]. [22] showed that as long as [math]N= \tilde{\Omega}(n)[/math], one has with high probability (say at least probability [math]1-1/n^2[/math]) that the set [math]\hat{\Sigma}_t^{-1/2} (\cS_t - \hat{c}_t)[/math] is in near-isotropic position.

Thus it only remains to explain how to sample from a near-isotropic convex set [math]\cK[/math]. This is where random walk ideas come into the picture. The hit-and-run walk[Notes 4] is described as follows: at a point [math]x \in \cK[/math], let [math]\cL[/math] be a line that goes through [math]x[/math] in a direction taken uniformly at random, then move to a point chosen uniformly at random in [math]\cL \cap \cK[/math]. [23] showed that if the starting point of the hit-and-run walk is chosen from a distribution ``close enough" to the uniform distribution on [math]\cK[/math], then after [math]O(n^3)[/math] steps the distribution of the last point is [math]\epsilon[/math] away (in total variation) from the uniform distribution on [math]\cK[/math]. In the randomized center of gravity method one can obtain a good initial distribution for [math]\cS_t[/math] by using the distribution that was obtained for [math]\cS_{t-1}[/math]. In order to initialize the entire process correctly we start here with [math]\cS_1 = [-L, L]^n \supset \cX[/math] (in Section The center of gravity method we used [math]\cS_1 = \cX[/math]), and thus we also have to use a {\em separation oracle} at iterations where [math]\hat{c}_t \not\in \cX[/math], just like we did for the ellipsoid method (see Section The ellipsoid method).

Wrapping up the above discussion, we showed (informally) that to attain an [math]\epsilon[/math]-optimal point with the randomized center of gravity method one needs [math]\tilde{O}(n)[/math] iterations; each iteration requires [math]\tilde{O}(n)[/math] random samples from [math]\cS_t[/math] (in order to put it in isotropic position) as well as a call to either the separation oracle or the first order oracle; and each sample costs [math]\tilde{O}(n^3)[/math] steps of the random walk. Thus overall one needs [math]\tilde{O}(n)[/math] calls to the separation oracle and the first order oracle, as well as [math]\tilde{O}(n^5)[/math] steps of the random walk.
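For completeness, here is a toy sketch of the hit-and-run walk on a polytope given by linear inequalities (a hypercube in this example); the body, the burn-in length, and the number of steps are illustrative assumptions, and no attempt is made to reproduce the mixing-time guarantees cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hit-and-run inside K = {x : C x <= d}; here K = [-1, 1]^n.
n = 10
C = np.vstack([np.eye(n), -np.eye(n)])
d = np.ones(2 * n)

def hit_and_run_step(x):
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                       # uniformly random direction
    # along x + t*u, each constraint gives t * (C u)_i <= (d - C x)_i
    cu, slack = C @ u, d - C @ x
    t_max = np.min(slack[cu > 1e-12] / cu[cu > 1e-12])
    t_min = np.max(slack[cu < -1e-12] / cu[cu < -1e-12])
    return x + rng.uniform(t_min, t_max) * u     # uniform point on the chord

x = np.zeros(n)
samples = []
for t in range(20000):
    x = hit_and_run_step(x)
    if t > 5000:                                 # discard a burn-in phase
        samples.append(x.copy())

print(np.array(samples).mean(axis=0))            # approximately 0: the walk mixes towards uniform on K
```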

General references

Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].

References

  1. H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22: 400--407, 1951.
  2. A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19 (3): 1171--1183, 2008.
  3. M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458--1466, 2011.
  4. A. Tsybakov. Optimal rates of aggregation. In Conference on Learning Theory (COLT), pages 303--313, 2003.
  5. O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13: 165--202, 2012.
  6. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
  7. S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14: 567--599, 2013a.
  8. R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.
  9. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014.
  10. S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2013b.
  11. Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Arxiv preprint arXiv:1409.3257, 2014.
  12. A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. Arxiv preprint arXiv:1410.0723, 2014.
  13. Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22: 341--362, 2012.
  14. P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Arxiv preprint arXiv:1212.0873, 2012.
  15. M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18: 53--58, 1995.
  16. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 (1): 1--122, 2012.
  17. K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine learning. Journal of the ACM, 2012.
  18. B. Barak. Sum of squares upper bounds, lower bounds, and open questions. Lecture Notes, 2014.
  19. M. Goemans and D. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42 (6): 1115--1145, 1995.
  20. Y. Nesterov. Quality of semidefinite relaxation for nonconvex quadratic optimization. CORE Discussion Papers 1997019, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 1997.
  21. D. Bertsimas and S. Vempala. Solving convex programs by random walks. Journal of the ACM, 51: 540--556, 2004.
  22. M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164: 60--72, 1999.
  23. L. Lovász. Hit-and-run mixes fast. Math. Prog., 86: 443--461, 1998.

Notes

  1. We assume differentiability only for sake of notation here.
  2. While true in general, this statement does not say anything about specific functions/oracles. For example it was shown in [1] that acceleration can be obtained for the square loss and the logistic loss.
  3. Uniqueness is only assumed for sake of notation.
  4. Other random walks are known for this problem but hit-and-run is the one with the sharpest theoretical guarantees. Curiously we note that one of those walks is closely connected to projected gradient descent, see [2].