Almost dimension-free convex optimization in non-Euclidean spaces

[math] \newcommand{\LS}{\mathtt{line\_search}} \newcommand{\mI}{\mathrm{I}} \newcommand{\mB}{\mathrm{B}} \newcommand{\conv}{\mathrm{conv}} \newcommand{\inte}{\mathrm{int}} \newcommand{\tg}{\tilde{g}} \newcommand{\tx}{\tilde{x}} \newcommand{\ty}{\tilde{y}} \newcommand{\tz}{\tilde{z}} \newcommand{\id}{\mathrm{id}} \newcommand{\K}{\mathrm{KL}} \newcommand{\kl}{\mathrm{kl}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\tr}{\mathrm{Tr}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\KL}{\mathrm{KL}} \newcommand{\LG}{\overline{\log}(K)} \newcommand{\ocP}{\overline{\mathcal{P}}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cD}{\mathcal{D}} \newcommand{\oD}{\overline{\mathcal{D}}} \newcommand{\oR}{\overline{R}} \newcommand{\wh}{\widehat} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\todo}{{\bf TO DO } } \newcommand{\Ber}{\mathop{\mathrm{Ber}}} \newcommand{\beq}{\begin{equation}} \newcommand{\eeq}{\end{equation}} \newcommand{\beqa}{\begin{eqnarray}} \newcommand{\eeqa}{\end{eqnarray}} \newcommand{\beqan}{\begin{eqnarray*}} \newcommand{\eeqan}{\end{eqnarray*}} \newcommand{\qed}{\hfill\BlackBox} \newcommand{\charfct}{\ds1} % \newcommand{\Fcal}{\mathcal{F}} \newcommand{\Xcal}{\mathcal{X}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Gcal}{\mathcal{G}} \newcommand{\Nat}{\mathbb{N}} \newcommand{\mathds}{\mathbb}[/math]

In the previous chapter we showed that dimension-free oracle complexity is possible when the objective function [math]f[/math] and the constraint set [math]\cX[/math] are well-behaved in the Euclidean norm; e.g., if for all points [math]x \in \cX[/math] and all subgradients [math]g \in \partial f(x)[/math], the norms [math]\|x\|_2[/math] and [math]\|g\|_2[/math] are independent of the ambient dimension [math]n[/math]. If this assumption is not met then the gradient descent techniques of Chapter may lose their dimension-free convergence rates. For instance, consider a differentiable convex function [math]f[/math] defined on the Euclidean ball [math]\mB_{2,n}[/math] and such that [math]\|\nabla f(x)\|_{\infty} \leq 1, \forall x \in \mB_{2,n}[/math]. This only implies [math]\|\nabla f(x)\|_{2} \leq \sqrt{n}[/math], and thus projected gradient descent will converge to the minimum of [math]f[/math] on [math]\mB_{2,n}[/math] at a rate [math]\sqrt{n / t}[/math]. In this chapter we describe the method of [1], known as mirror descent, which makes it possible to find the minimum of such functions [math]f[/math] over the [math]\ell_1[/math]-ball (instead of the Euclidean ball) at the much faster rate [math]\sqrt{\log(n) / t}[/math]. This is only one example of the potential of mirror descent. This chapter is devoted to the description of mirror descent and some of its alternatives. The presentation is inspired by [2], [Chapter 11, [3]], [4][5][6].

In order to describe the intuition behind the method let us abstract the situation for a moment and forget that we are doing optimization in finite dimension. We already observed that projected gradient descent works in an arbitrary Hilbert space [math]\mathcal{H}[/math]. Suppose now that we are interested in the more general situation of optimization in some Banach space [math]\mathcal{B}[/math]. In other words the norm that we use to measure the various quantities of interest does not derive from an inner product (think of [math]\mathcal{B} = \ell_1[/math] for example). In that case the gradient descent strategy does not even make sense: the gradients (more formally the Fréchet derivatives) [math]\nabla f(x)[/math] are elements of the dual space [math]\mathcal{B}^*[/math], so the computation [math]x - \eta \nabla f(x)[/math] is simply not defined. We did not have this problem for optimization in a Hilbert space [math]\mathcal{H}[/math] since by the Riesz representation theorem [math]\mathcal{H}^*[/math] is isometric to [math]\mathcal{H}[/math]. The great insight of Nemirovski and Yudin is that one can still perform a gradient descent by first mapping the point [math]x \in \mathcal{B}[/math] into the dual space [math]\mathcal{B}^*[/math], then performing the gradient update in the dual space, and finally mapping the resulting point back to the primal space [math]\mathcal{B}[/math]. Of course the new point in the primal space might lie outside of the constraint set [math]\mathcal{X} \subset \mathcal{B}[/math], and thus we need a way to project it back onto the constraint set [math]\mathcal{X}[/math]. Both the primal/dual mapping and the projection are based on the concept of a mirror map, which is the key element of the scheme. Mirror maps are defined in Section Mirror maps, and the above scheme is formally described in Section Mirror descent. In the rest of this chapter we fix an arbitrary norm [math]\|\cdot\|[/math] on [math]\R^n[/math] and a compact convex set [math]\cX \subset \R^n[/math]. The dual norm [math]\|\cdot\|_*[/math] is defined as [math]\|g\|_* = \sup_{x \in \mathbb{R}^n : \|x\| \leq 1} g^{\top} x[/math]. We say that a convex function [math]f : \cX \rightarrow \R[/math] is (i) [math]L[/math]-Lipschitz w.r.t. [math]\|\cdot\|[/math] if [math]\forall x \in \cX, g \in \partial f(x), \|g\|_* \leq L[/math], (ii) [math]\beta[/math]-smooth w.r.t. [math]\|\cdot\|[/math] if [math]\|\nabla f(x) - \nabla f(y) \|_* \leq \beta \|x-y\|, \forall x, y \in \cX[/math], and (iii) [math]\alpha[/math]-strongly convex w.r.t. [math]\|\cdot\|[/math] if

[[math]] f(x) - f(y) \leq g^{\top} (x - y) - \frac{\alpha}{2} \|x - y \|^2 , \forall x, y \in \cX, g \in \partial f(x). [[/math]]

We also define the Bregman divergence associated to [math]f[/math] as

[[math]] D_{f}(x,y) = f(x) - f(y) - \nabla f(y)^{\top} (x - y) . [[/math]]

The following identity will be useful several times:

[[math]] \begin{equation} \label{eq:useful1} (\nabla f(x) - \nabla f(y))^{\top}(x-z) = D_{f}(x,y) + D_{f}(z,x) - D_{f}(z,y) . \end{equation} [[/math]]
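Since every argument in this chapter manipulates Bregman divergences, it may help to see the identity in action. The following sketch (our own illustration, using the arbitrary smooth convex choice [math]f(x) = \log \sum_i e^{x_i}[/math]) checks \eqref{eq:useful1} numerically:

```python
import numpy as np

# Numerical sanity check of the three-point identity (eq:useful1),
# here with the arbitrary smooth convex choice f(x) = log(sum(exp(x))).

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / e.sum()

def bregman(x, y):
    # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
    return f(x) - f(y) - grad_f(y) @ (x - y)

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 5))
lhs = (grad_f(x) - grad_f(y)) @ (x - z)
rhs = bregman(x, y) + bregman(z, x) - bregman(z, y)
assert np.isclose(lhs, rhs)  # holds up to floating-point error
```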


Mirror maps

Let [math]\cD \subset \R^n[/math] be a convex open set such that [math]\mathcal{X}[/math] is included in its closure, that is [math]\mathcal{X} \subset \overline{\mathcal{D}}[/math], and [math]\mathcal{X} \cap \mathcal{D} \neq \emptyset[/math]. We say that [math]\Phi : \cD \rightarrow \R[/math] is a mirror map if it satisfies the following properties[Notes 1]:

  • (i) [math]\Phi[/math] is strictly convex and differentiable.
  • (ii) The gradient of [math]\Phi[/math] takes all possible values, that is [math]\nabla \Phi(\cD) = \R^n[/math].
  • (iii) The gradient of [math]\Phi[/math] diverges on the boundary of [math]\cD[/math], that is
    [[math]] \lim_{x \rightarrow \partial \mathcal{D}} \|\nabla \Phi(x)\| = + \infty . [[/math]]

In mirror descent the gradient of the mirror map [math]\Phi[/math] is used to map points from the "primal" to the "dual" (note that all points lie in [math]\R^n[/math] so the notions of primal and dual spaces only have an intuitive meaning). Precisely, a point [math]x \in \mathcal{X} \cap \mathcal{D}[/math] is mapped to [math]\nabla \Phi(x)[/math], from which one takes a gradient step to get to [math]\nabla \Phi(x) - \eta \nabla f(x)[/math]. Property (ii) then allows us to write the resulting point as [math]\nabla \Phi(y) = \nabla \Phi(x) - \eta \nabla f(x)[/math] for some [math]y \in \cD[/math]. The primal point [math]y[/math] may lie outside of the set of constraints [math]\cX[/math], in which case one has to project back onto [math]\cX[/math]. In mirror descent this projection is done via the Bregman divergence associated to [math]\Phi[/math]. Precisely one defines

[[math]] \Pi_{\cX}^{\Phi} (y) = \argmin_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y) . [[/math]]

Properties (i) and (iii) ensure the existence and uniqueness of this projection (in particular since [math]x \mapsto D_{\Phi}(x,y)[/math] is locally increasing on the boundary of [math]\mathcal{D}[/math]). The following lemma shows that the Bregman divergence essentially behaves as the Euclidean norm squared in terms of projections (recall Lemma).

Lemma

Let [math]x \in \cX \cap \cD[/math] and [math]y \in \cD[/math], then

[[math]] (\nabla \Phi(\Pi_{\cX}^{\Phi}(y)) - \nabla \Phi(y))^{\top} (\Pi^{\Phi}_{\cX}(y) - x) \leq 0 , [[/math]]
which also implies

[[math]] D_{\Phi}(x, \Pi^{\Phi}_{\cX}(y)) + D_{\Phi}(\Pi^{\Phi}_{\cX}(y), y) \leq D_{\Phi}(x,y) . [[/math]]


Show Proof

The proof is an immediate corollary of Proposition together with the fact that [math]\nabla_x D_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y)[/math].

Mirror descent

We can now describe the mirror descent strategy based on a mirror map [math]\Phi[/math]. Let [math]x_1 \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x)[/math]. Then for [math]t \geq 1[/math], let [math]y_{t+1} \in \mathcal{D}[/math] such that

[[math]] \begin{equation} \label{eq:MD1} \nabla \Phi(y_{t+1}) = \nabla \Phi(x_{t}) - \eta g_t, \ \text{where} \ g_t \in \partial f(x_t) , \end{equation} [[/math]]

and

[[math]] \begin{equation} \label{eq:MD2} x_{t+1} \in \Pi_{\cX}^{\Phi} (y_{t+1}) . \end{equation} [[/math]]

See Figure for an illustration of this procedure.

Illustration of mirror descent.
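To make the scheme concrete, here is a minimal Python sketch of \eqref{eq:MD1} and \eqref{eq:MD2} (our own illustration, not code from [1]); the mirror map is supplied through its gradient, the inverse of its gradient, and the Bregman projection:

```python
import numpy as np

def mirror_descent(subgrad, grad_phi, grad_phi_inv, bregman_proj, x1, eta, t):
    """Generic mirror descent sketch: step in the dual via grad_phi,
    map back via its inverse, then Bregman-project onto the constraints."""
    x = x1.astype(float)
    avg = np.zeros_like(x)
    for _ in range(t):
        avg += x / t                             # running average (1/t) sum x_s
        g = subgrad(x)                           # g_t in the subdifferential of f
        y = grad_phi_inv(grad_phi(x) - eta * g)  # eq. (MD1): gradient step in the dual
        x = bregman_proj(y)                      # eq. (MD2): project back onto X
    return avg
```

Taking grad_phi and grad_phi_inv to be the identity and bregman_proj the Euclidean projection recovers projected subgradient descent, in line with the ball setup below.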
Theorem

Let [math]\Phi[/math] be a mirror map [math]\rho[/math]-strongly convex on [math]\mathcal{X} \cap \mathcal{D}[/math] w.r.t. [math]\|\cdot\|[/math]. Let [math]R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)[/math], and [math]f[/math] be convex and [math]L[/math]-Lipschitz w.r.t. [math]\|\cdot\|[/math]. Then mirror descent with [math]\eta = \frac{R}{L} \sqrt{\frac{2 \rho}{t}}[/math] satisfies

[[math]] f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - f(x^*) \leq RL \sqrt{\frac{2}{\rho t}} . [[/math]]


Show Proof

Let [math]x \in \mathcal{X} \cap \mathcal{D}[/math]. The claimed bound will be obtained by taking a limit [math]x \rightarrow x^*[/math]. Now by convexity of [math]f[/math], the definition of mirror descent, equation \eqref{eq:useful1}, and Lemma, one has

[[math]] \begin{align*} & f(x_s) - f(x) \\ & \leq g_s^{\top} (x_s - x) \\ & = \frac{1}{\eta} (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_s - x) \\ & = \frac{1}{\eta} \bigg( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, y_{s+1}) \bigg) \\ & \leq \frac{1}{\eta} \bigg( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, x_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) \bigg) . \end{align*} [[/math]]
The term [math]D_{\Phi}(x, x_s) - D_{\Phi}(x, x_{s+1})[/math] will lead to a telescopic sum when summing over [math]s=1[/math] to [math]s=t[/math], and it remains to bound the other term as follows using [math]\rho[/math]-strong convexity of the mirror map and [math]a z - b z^2 \leq \frac{a^2}{4 b}, \forall z \in \R[/math]:

[[math]] \begin{align*} & D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) \\ & = \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi(y_{s+1})^{\top} (x_{s} - x_{s+1}) \\ & \leq (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_{s} - x_{s+1}) - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ & = \eta g_s^{\top} (x_{s} - x_{s+1}) - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \eta L \|x_{s} - x_{s+1}\| - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \frac{(\eta L)^2}{2 \rho}. \end{align*} [[/math]]
We proved

[[math]] \sum_{s=1}^t \bigg(f(x_s) - f(x)\bigg) \leq \frac{D_{\Phi}(x,x_1)}{\eta} + \eta \frac{L^2 t}{2 \rho}, [[/math]]
which concludes the proof up to trivial computation: since [math]x_1[/math] minimizes [math]\Phi[/math] on [math]\mathcal{X} \cap \mathcal{D}[/math] one has [math]\nabla \Phi(x_1)^{\top}(x - x_1) \geq 0[/math], hence [math]D_{\Phi}(x,x_1) \leq \Phi(x) - \Phi(x_1) \leq R^2[/math], and plugging in the value of [math]\eta[/math] yields the claimed bound.

We observe that one can rewrite mirror descent as follows:

[[math]] \begin{eqnarray} x_{t+1} & = & \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ D_{\Phi}(x,y_{t+1}) \notag \\ & = & \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \nabla \Phi(y_{t+1})^{\top} x \label{eq:MD3} \\ & = & \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - (\nabla \Phi(x_{t}) - \eta g_t)^{\top} x \notag \\ & = & \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta g_t^{\top} x + D_{\Phi}(x,x_t) . \label{eq:MDproxview} \end{eqnarray} [[/math]]

This last expression is often taken as the definition of mirror descent (see [2]). It gives a proximal point of view on mirror descent: the method is trying to minimize the local linearization of the function while not moving too far away from the previous point, with distances measured via the Bregman divergence of the mirror map.

Standard setups for mirror descent

"Ball setup." The simplest version of mirror descent is obtained by taking [math]\Phi(x) = \frac{1}{2} \|x\|^2_2[/math] on [math]\mathcal{D} = \mathbb{R}^n[/math]. The function [math]\Phi[/math] is a mirror map which is [math]1[/math]-strongly convex w.r.t. [math]\|\cdot\|_2[/math], and furthermore the associated Bregman divergence is given by [math]D_{\Phi}(x,y) = \frac{1}{2} \|x - y\|^2_2[/math]. Thus in that case mirror descent is exactly equivalent to projected subgradient descent, and the rate of convergence obtained in Theorem recovers our earlier result on projected subgradient descent.

"Simplex setup." A more interesting choice of a mirror map is given by the negative entropy

[[math]] \Phi(x) = \sum_{i=1}^n x(i) \log x(i), [[/math]]

on [math]\mathcal{D} = \mathbb{R}_{++}^n[/math]. In that case the gradient update [math]\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t)[/math] can be written equivalently as

[[math]] y_{t+1}(i) = x_{t}(i) \exp\big(- \eta [\nabla f(x_t) ](i) \big) , \ i=1, \dots, n. [[/math]]

The Bregman divergence of this mirror map is given by [math]D_{\Phi}(x,y) = \sum_{i=1}^n x(i) \log \frac{x(i)}{y(i)}[/math] (also known as the Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex [math]\Delta_n = \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x(i) = 1\}[/math] amounts to a simple renormalization [math]y \mapsto y / \|y\|_1[/math]. Furthermore it is also easy to verify that [math]\Phi[/math] is [math]1[/math]-strongly convex w.r.t. [math]\|\cdot\|_1[/math] on [math]\Delta_n[/math] (this result is known as Pinsker's inequality). Note also that for [math]\mathcal{X} = \Delta_n[/math] one has [math]x_1 = (1/n, \dots, 1/n)[/math] and [math]R^2 = \log n[/math]. The above observations imply that when minimizing on the simplex [math]\Delta_n[/math] a function [math]f[/math] with subgradients bounded in [math]\ell_{\infty}[/math]-norm, mirror descent with the negative entropy achieves a rate of convergence of order [math]\sqrt{\frac{\log n}{t}}[/math]. On the other hand, regular subgradient descent achieves only a rate of order [math]\sqrt{\frac{n}{t}}[/math] in this case!
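Concretely, the simplex setup amounts to a multiplicative update followed by an [math]\ell_1[/math] renormalization. Below is a short numpy sketch (our illustration), run on an arbitrary linear objective with the step size prescribed by Theorem (here [math]\rho = 1[/math]):

```python
import numpy as np

def simplex_mirror_descent(subgrad, n, eta, t):
    """Mirror descent with the negative entropy on the simplex:
    multiplicative update, then renormalization (the KL projection)."""
    x = np.full(n, 1.0 / n)                  # x_1 minimizes neg-entropy on the simplex
    avg = np.zeros(n)
    for _ in range(t):
        avg += x / t
        x = x * np.exp(-eta * subgrad(x))    # y_{t+1}(i) = x_t(i) exp(-eta g_t(i))
        x /= x.sum()                         # Bregman projection = renormalization
    return avg

# Arbitrary test case: minimize f(x) = <c, x> over the simplex (optimum: min(c) = 0).
n, t = 100, 1000
c = np.linspace(0.0, 1.0, n)                 # ||c||_inf <= 1, so L = 1
eta = np.sqrt(2 * np.log(n) / t)             # theorem's step with rho = 1, R^2 = log n
x_bar = simplex_mirror_descent(lambda x: c, n, eta, t)
print(c @ x_bar)                             # gap <= R L sqrt(2 / t) ~ 0.1 here
```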

"Spectrahedron setup." We consider here functions defined on matrices, and we are interested in minimizing a function [math]f[/math] on the spectrahedron [math]\mathcal{S}_n[/math] defined as:

[[math]] \mathcal{S}_n = \left\{X \in \mathbb{S}_+^n : \mathrm{Tr}(X) = 1 \right\} . [[/math]]

In this setting we consider the mirror map on [math]\mathcal{D} = \mathbb{S}_{++}^n[/math] given by the negative von Neumann entropy:

[[math]] \Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X) , [[/math]]

where [math]\lambda_1(X), \dots, \lambda_n(X)[/math] are the eigenvalues of [math]X[/math]. It can be shown that the gradient update [math]\nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta \nabla f(X_t)[/math] can be written equivalently as

[[math]] Y_{t+1} = \exp\big(\log X_t - \eta \nabla f(X_t) \big) , [[/math]]

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on [math]\mathcal{S}_n[/math] is a simple trace renormalization. With highly non-trivial computation one can show that [math]\Phi[/math] is [math]\frac{1}{2}[/math]-strongly convex with respect to the Schatten [math]1[/math]-norm defined as

[[math]] \|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|. [[/math]]

It is easy to see that for [math]\mathcal{X} = \mathcal{S}_n[/math] one has [math]x_1 = \frac{1}{n} \mI_n[/math] and [math]R^2 = \log n[/math]. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
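A single step of this update can be sketched as follows, assuming [math]X \in \mathbb{S}_{++}^n[/math] so that the matrix logarithm and exponential reduce to applying [math]\log[/math] and [math]\exp[/math] to the eigenvalues (illustrative code, not an optimized implementation):

```python
import numpy as np

def spectrahedron_md_step(X, G, eta):
    """One mirror descent step on the spectrahedron: Y = exp(log X - eta G),
    then trace renormalization. Assumes X symmetric positive definite."""
    lam, U = np.linalg.eigh(X)
    log_X = (U * np.log(lam)) @ U.T        # matrix log via the eigenvalues
    mu, V = np.linalg.eigh(log_X - eta * G)
    Y = (V * np.exp(mu)) @ V.T             # matrix exp via the eigenvalues
    return Y / np.trace(Y)                 # projection onto S_n
```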

Lazy mirror descent, aka Nesterov’s dual averaging

In this section we consider a slightly more efficient version of mirror descent for which we can prove that Theorem still holds true. This alternative algorithm can be advantageous in some situations (such as distributed settings), but the basic mirror descent scheme remains important for extensions considered later in this text (saddle points, stochastic oracles, ...). In lazy mirror descent, also commonly known as Nesterov's dual averaging or simply dual averaging, one replaces \eqref{eq:MD1} by

[[math]] \nabla \Phi(y_{t+1}) = \nabla \Phi(y_{t}) - \eta g_t , [[/math]]

and also [math]y_1[/math] is such that [math]\nabla \Phi(y_1) = 0[/math]. In other words, instead of going back and forth between the primal and the dual, dual averaging simply averages the gradients in the dual, and when asked for a point in the primal it maps the current dual point to the primal using the same methodology as mirror descent. In particular using \eqref{eq:MD3} one immediately sees that dual averaging is defined by:

[[math]] \begin{equation} \label{eq:DA0} x_t = \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta \sum_{s=1}^{t-1} g_s^{\top} x + \Phi(x) . \end{equation} [[/math]]
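On the simplex with the negative entropy, \eqref{eq:DA0} even admits the closed form [math]x_t(i) \propto \exp\big(-\eta \sum_{s=1}^{t-1} g_s(i)\big)[/math], so the lazy scheme can be sketched as follows (our illustration; the max-subtraction is only for numerical stability):

```python
import numpy as np

def dual_averaging_simplex(subgrad, n, eta, t):
    """Lazy mirror descent on the simplex: only the running sum of
    gradients is maintained; x_t is recomputed from it in closed form."""
    g_sum = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(t):
        z = -eta * g_sum
        w = np.exp(z - z.max())      # subtract the max for numerical stability
        x = w / w.sum()              # x_t from eq. (DA0)
        avg += x / t
        g_sum += subgrad(x)          # the "lazy" dual update
    return avg
```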

Theorem

Let [math]\Phi[/math] be a mirror map [math]\rho[/math]-strongly convex on [math]\mathcal{X} \cap \mathcal{D}[/math] w.r.t. [math]\|\cdot\|[/math]. Let [math]R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)[/math], and [math]f[/math] be convex and [math]L[/math]-Lipschitz w.r.t. [math]\|\cdot\|[/math]. Then dual averaging with [math]\eta = \frac{R}{L} \sqrt{\frac{\rho}{2 t}}[/math] satisfies

[[math]] f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - f(x^*) \leq 2 RL \sqrt{\frac{2}{\rho t}} . [[/math]]


Show Proof

We define [math]\psi_t(x) = \eta \sum_{s=1}^{t} g_s^{\top} x + \Phi(x)[/math], so that [math]x_t \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} \psi_{t-1}(x)[/math]. Since [math]\Phi[/math] is [math]\rho[/math]-strongly convex one clearly has that [math]\psi_t[/math] is [math]\rho[/math]-strongly convex, and thus

[[math]] \begin{eqnarray*} \psi_t(x_{t+1}) - \psi_t(x_t) & \leq & \nabla \psi_t(x_{t+1})^{\top}(x_{t+1} - x_{t}) - \frac{\rho}{2} \|x_{t+1} - x_t\|^2 \\ & \leq & - \frac{\rho}{2} \|x_{t+1} - x_t\|^2 , \end{eqnarray*} [[/math]]
where the second inequality comes from the first order optimality condition for [math]x_{t+1}[/math] (see Proposition). Next observe that

[[math]] \begin{eqnarray*} \psi_t(x_{t+1}) - \psi_t(x_t) & = & \psi_{t-1}(x_{t+1}) - \psi_{t-1}(x_t) + \eta g_t^{\top} (x_{t+1} - x_t) \\ & \geq & \eta g_t^{\top} (x_{t+1} - x_t) . \end{eqnarray*} [[/math]]
Putting together the two above displays and using Cauchy-Schwarz (with the assumption [math]\|g_t\|_* \leq L[/math]) one obtains

[[math]] \frac{\rho}{2} \|x_{t+1} - x_t\|^2 \leq \eta g_t^{\top} (x_t - x_{t+1}) \leq \eta L \|x_t - x_{t+1} \|. [[/math]]
In particular this shows that [math]\|x_{t+1} - x_t\| \leq \frac{2 \eta L}{\rho}[/math] and thus with the above display

[[math]] \begin{equation} \label{eq:DA1} g_t^{\top} (x_t - x_{t+1}) \leq \frac{2 \eta L^2}{\rho} . \end{equation} [[/math]]
Now we claim that for any [math]x \in \cX \cap \cD[/math],

[[math]] \begin{equation} \label{eq:DA2} \sum_{s=1}^t g_s^{\top} (x_s - x) \leq \sum_{s=1}^t g_s^{\top} (x_s - x_{s+1}) + \frac{\Phi(x) - \Phi(x_1)}{\eta} , \end{equation} [[/math]]
which would clearly conclude the proof thanks to \eqref{eq:DA1} and straightforward computations. Equation \eqref{eq:DA2} is equivalent to

[[math]] \sum_{s=1}^t g_s^{\top} x_{s+1} + \frac{\Phi(x_1)}{\eta} \leq \sum_{s=1}^t g_s^{\top} x + \frac{\Phi(x)}{\eta} , [[/math]]
and we now prove the latter equation by induction. At [math]t=0[/math] it is true since [math]x_1 \in \argmin_{x \in \cX \cap \cD} \Phi(x)[/math]. The following inequalities prove the inductive step, where we use the induction hypothesis at [math]x=x_{t+1}[/math] for the first inequality, and the definition of [math]x_{t+1}[/math] for the second inequality:

[[math]] \sum_{s=1}^{t} g_s^{\top} x_{s+1} + \frac{\Phi(x_1)}{\eta} \leq g_{t}^{\top}x_{t+1} + \sum_{s=1}^{t-1} g_s^{\top} x_{t+1} + \frac{\Phi(x_{t+1})}{\eta} \leq \sum_{s=1}^{t} g_s^{\top} x + \frac{\Phi(x)}{\eta} . [[/math]]

Mirror prox

It can be shown that mirror descent accelerates for smooth functions to the rate [math]1/t[/math]. We will prove this result in Chapter (see Theorem). We describe here a variant of mirror descent which also attains the rate [math]1/t[/math] for smooth functions. This method is called mirror prox and it was introduced in [7]. The true power of mirror prox will reveal itself later in the text when we deal with smooth representations of non-smooth functions as well as stochastic oracles[Notes 2]. Mirror prox is described by the following equations:

[[math]] \begin{align*} & \nabla \Phi(y_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(x_t), \\ & y_{t+1} \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}') , \\ & \nabla \Phi(x_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(y_{t+1}), \\ & x_{t+1} \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,x_{t+1}') . \end{align*} [[/math]]

In words the algorithm first makes a step of mirror descent to go from [math]x_t[/math] to [math]y_{t+1}[/math], and then it makes a similar step to obtain [math]x_{t+1}[/math], starting again from [math]x_t[/math] but this time using the gradient of [math]f[/math] evaluated at [math]y_{t+1}[/math] (instead of [math]x_t[/math]); see Figure for an illustration. The following result justifies the procedure.

Illustration of mirror prox.
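In the simplex setup of the previous sections the two stages of mirror prox again reduce to multiplicative updates and renormalizations, as in the following sketch (our illustration, not code from [7]):

```python
import numpy as np

def mirror_prox_simplex(grad, n, eta, t):
    """Mirror prox with the negative entropy on the simplex: an extrapolation
    step to y_{t+1}, then a step from x_t using the gradient at y_{t+1}."""
    x = np.full(n, 1.0 / n)
    avg = np.zeros(n)
    for _ in range(t):
        y = x * np.exp(-eta * grad(x))   # step from x_t with grad f(x_t)
        y /= y.sum()
        avg += y / t                     # the theorem averages the y iterates
        x = x * np.exp(-eta * grad(y))   # step from x_t with grad f(y_{t+1})
        x /= x.sum()
    return avg
```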
Theorem

Let [math]\Phi[/math] be a mirror map [math]\rho[/math]-strongly convex on [math]\mathcal{X} \cap \mathcal{D}[/math] w.r.t. [math]\|\cdot\|[/math]. Let [math]R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)[/math], and [math]f[/math] be convex and [math]\beta[/math]-smooth w.r.t. [math]\|\cdot\|[/math]. Then mirror prox with [math]\eta = \frac{\rho}{\beta}[/math] satisfies

[[math]] f\bigg(\frac{1}{t} \sum_{s=1}^t y_{s+1} \bigg) - f(x^*) \leq \frac{\beta R^2}{\rho t} . [[/math]]


Show Proof

Let [math]x \in \mathcal{X} \cap \mathcal{D}[/math]. We write

[[math]] \begin{eqnarray*} f(y_{t+1}) - f(x) & \leq & \nabla f(y_{t+1})^{\top} (y_{t+1} - x) \\ & = & \nabla f(y_{t+1})^{\top} (x_{t+1} - x) + \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \\ & & + (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) . \end{eqnarray*} [[/math]]
We will now bound separately these three terms. For the first one, using the definition of the method, Lemma, and equation \eqref{eq:useful1}, one gets

[[math]] \begin{align*} & \eta \nabla f(y_{t+1})^{\top} (x_{t+1} - x) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}'))^{\top} (x_{t+1} - x) \\ & \leq ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}))^{\top} (x_{t+1} - x) \\ & = D_{\Phi}(x,x_t) - D_{\Phi}(x, x_{t+1}) - D_{\Phi}(x_{t+1}, x_t) . \end{align*} [[/math]]
For the second term, using the same properties as above and the strong convexity of the mirror map, one obtains

[[math]] \begin{align} & \eta \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \notag\\ & = ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}'))^{\top} (y_{t+1} - x_{t+1}) \notag\\ & \leq ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}))^{\top} (y_{t+1} - x_{t+1}) \notag\\ & = D_{\Phi}(x_{t+1},x_t) - D_{\Phi}(x_{t+1}, y_{t+1}) - D_{\Phi}(y_{t+1}, x_t) \label{eq:pourplustard1}\\ & \leq D_{\Phi}(x_{t+1},x_t) - \frac{\rho}{2} \|x_{t+1} - y_{t+1} \|^2 - \frac{\rho}{2} \|y_{t+1} - x_t\|^2 . \notag \end{align} [[/math]]
Finally for the last term, using Cauchy-Schwarz, [math]\beta[/math]-smoothness, and [math]2 ab \leq a^2 + b^2[/math] one gets

[[math]] \begin{align*} & (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) \\ & \leq \|\nabla f(y_{t+1}) - \nabla f(x_t)\|_* \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \beta \|y_{t+1} - x_t\| \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \frac{\beta}{2} \|y_{t+1} - x_t\|^2 + \frac{\beta}{2} \|y_{t+1} - x_{t+1} \|^2 . \end{align*} [[/math]]
Thus summing up these three terms and using that [math]\eta = \frac{\rho}{\beta}[/math] one gets

[[math]] f(y_{t+1}) - f(x) \leq \frac{D_{\Phi}(x,x_t) - D_{\Phi}(x,x_{t+1})}{\eta} . [[/math]]
The proof is concluded with straightforward computations.

The vector field point of view on MD, DA, and MP

In this section we consider a mirror map [math]\Phi[/math] that satisfies the assumptions from Theorem. By inspecting the proof of Theorem one can see that for arbitrary vectors [math]g_1, \dots, g_t \in \R^n[/math] the mirror descent strategy described by \eqref{eq:MD1} and \eqref{eq:MD2} (or alternatively by \eqref{eq:MDproxview}) satisfies for any [math]x \in \cX \cap \cD[/math],

[[math]] \begin{equation} \label{eq:vfMD} \sum_{s=1}^t g_s^{\top} (x_s - x) \leq \frac{R^2}{\eta} + \frac{\eta}{2 \rho} \sum_{s=1}^t \|g_s\|_*^2 . \end{equation} [[/math]]

The observation that the sequence of vectors [math](g_s)[/math] does not have to come from the subgradients of a {\em fixed} function [math]f[/math] is the starting point for the theory of online learning, see [6] for more details. In this monograph we will use this observation to generalize mirror descent to saddle point calculations as well as stochastic settings. We note that we could also use dual averaging (defined by \eqref{eq:DA0}) which satisfies

[[math]] \sum_{s=1}^t g_s^{\top} (x_s - x) \leq \frac{R^2}{\eta} + \frac{2 \eta}{\rho} \sum_{s=1}^t \|g_s\|_*^2 . [[/math]]

In order to generalize mirror prox we simply replace the gradient [math]\nabla f[/math] by an arbitrary vector field [math]g: \cX \rightarrow \R^n[/math] which yields the following equations:

[[math]] \begin{align*} & \nabla \Phi(y_{t+1}') = \nabla \Phi(x_{t}) - \eta g(x_t), \\ & y_{t+1} \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}') , \\ & \nabla \Phi(x_{t+1}') = \nabla \Phi(x_{t}) - \eta g(y_{t+1}), \\ & x_{t+1} \in \argmin_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,x_{t+1}') . \end{align*} [[/math]]

Under the assumption that the vector field is [math]\beta[/math]-Lipschitz w.r.t. [math]\|\cdot\|[/math], i.e., [math]\|g(x) - g(y)\|_* \leq \beta \|x-y\|[/math] one obtains with [math]\eta = \frac{\rho}{\beta}[/math]

[[math]] \begin{equation} \label{eq:vfMP} \sum_{s=1}^t g(y_{s+1})^{\top}(y_{s+1} - x) \leq \frac{\beta R^2}{\rho}. \end{equation} [[/math]]
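As a preview of the saddle point applications mentioned above, the following sketch (our illustration, anticipating later chapters rather than quoting them) applies this vector field version of mirror prox to the bilinear game [math]\min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y[/math], using the entropy setup on each simplex and the monotone field [math]g(x,y) = (Ay, -A^{\top}x)[/math]:

```python
import numpy as np

def mirror_prox_game(A, eta, t):
    """Vector-field mirror prox for min_x max_y x^T A y over two simplices,
    with g(x, y) = (A y, -A^T x) and entropy mirror maps on each block."""
    n, m = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(t):
        # extrapolation step, using the field evaluated at (x, y)
        u = x * np.exp(-eta * (A @ y))
        u /= u.sum()
        v = y * np.exp(eta * (A.T @ x))   # the y-player ascends
        v /= v.sum()
        x_avg += u / t
        y_avg += v / t
        # main step from (x, y), using the field evaluated at (u, v)
        x = x * np.exp(-eta * (A @ v))
        x /= x.sum()
        y = y * np.exp(eta * (A.T @ u))
        y /= y.sum()
    return x_avg, y_avg

# The averaged iterates approach a Nash equilibrium of the game.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x_bar, y_bar = mirror_prox_game(A, eta=0.1, t=2000)
print(np.max(A.T @ x_bar) - np.min(A @ y_bar))  # duality gap, near 0
```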

General references

Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].

References

  1. A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
  2. A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167--175, 2003.
  3. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  4. A. Rakhlin. Lecture notes on online learning. 2009.
  5. E. Hazan. The convex optimization approach to regret minimization. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 287--303. MIT Press, 2011.
  6. S. Bubeck. Introduction to online optimization. Lecture Notes, 2011.
  7. A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229--251, 2004.

Notes

  1. Assumption (ii) can be relaxed in some cases, see for example [1].
  2. Basically mirror prox allows for a smooth vector field point of view (see Section The vector field point of view on MD, DA, and MP), while mirror descent does not.