Beyond the black-box model

[math] \newcommand{\LS}{\mathtt{line\_search}} \newcommand{\mI}{\mathrm{I}} \newcommand{\mB}{\mathrm{B}} \newcommand{\conv}{\mathrm{conv}} \newcommand{\inte}{\mathrm{int}} \newcommand{\tg}{\tilde{g}} \newcommand{\tx}{\tilde{x}} \newcommand{\ty}{\tilde{y}} \newcommand{\tz}{\tilde{z}} \newcommand{\id}{\mathrm{id}} \newcommand{\K}{\mathrm{KL}} \newcommand{\kl}{\mathrm{kl}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\tr}{\mathrm{Tr}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\KL}{\mathrm{KL}} \newcommand{\LG}{\overline{\log}(K)} \newcommand{\ocP}{\overline{\mathcal{P}}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cD}{\mathcal{D}} \newcommand{\oD}{\overline{\mathcal{D}}} \newcommand{\oR}{\overline{R}} \newcommand{\wh}{\widehat} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\todo}{{\bf TO DO } } \newcommand{\Ber}{\mathop{\mathrm{Ber}}} \newcommand{\beq}{\begin{equation}} \newcommand{\eeq}{\end{equation}} \newcommand{\beqa}{\begin{eqnarray}} \newcommand{\eeqa}{\end{eqnarray}} \newcommand{\beqan}{\begin{eqnarray*}} \newcommand{\eeqan}{\end{eqnarray*}} \newcommand{\qed}{\hfill\BlackBox} \newcommand{\charfct}{\ds1} % \newcommand{\Fcal}{\mathcal{F}} \newcommand{\Xcal}{\mathcal{X}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Gcal}{\mathcal{G}} \newcommand{\Nat}{\mathbb{N}} \newcommand{\mathds}{\mathbb}[/math]

In the black-box model non-smoothness dramatically deteriorates the rate of convergence of first order methods from [math]1/t^2[/math] to [math]1/\sqrt{t}[/math]. However, as we already pointed out in Section Structured optimization, we (almost) always know the function to be optimized {\em globally}. In particular the "source" of non-smoothness can often be identified. For instance the LASSO objective (see Section Some convex optimization problems in machine learning) is non-smooth, but it is a sum of a smooth part (the least squares fit) and a {\em simple} non-smooth part (the [math]\ell_1[/math]-norm). Using this specific structure we will propose in Section Sum of a smooth and a simple non-smooth term a first order method with a [math]1/t^2[/math] convergence rate, despite the non-smoothness. In Section Smooth saddle-point representation of a non-smooth function we consider another type of non-smoothness that can effectively be overcome, where the function is the maximum of smooth functions. Finally we conclude this chapter with a concise description of interior point methods, for which the structural assumption is made on the constraint set rather than on the objective function.

Sum of a smooth and a simple non-smooth term

We consider here the following problem[Notes 1]:

[[math]] \min_{x \in \R^n} f(x) + g(x) , [[/math]]

where [math]f[/math] is convex and [math]\beta[/math]-smooth, and [math]g[/math] is convex. We assume that [math]f[/math] can be accessed through a first order oracle, and that [math]g[/math] is known and "simple". What we mean by simplicity will be clear from the description of the algorithm. For instance a separable function, that is [math]g(x) = \sum_{i=1}^n g_i(x(i))[/math], will be considered as simple. The prime example is [math]g(x) = \|x\|_1[/math]. This section is inspired from [1] (see also [2][3]).


ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall that gradient descent on the smooth function [math]f[/math] can be written as (see eq:MDproxview)

[[math]] x_{t+1} = \argmin_{x \in \mathbb{R}^n} \eta \nabla f(x_t)^{\top} x + \frac{1}{2} \|x - x_t\|^2_2 . [[/math]]

Here one wants to minimize [math]f+g[/math], and [math]g[/math] is assumed to be known and "simple". Thus it seems quite natural to consider the following update rule, where only [math]f[/math] is locally approximated with a first order oracle:

[[math]] \begin{eqnarray} x_{t+1} & = & \argmin_{x \in \mathbb{R}^n} \eta (g(x) + \nabla f(x_t)^{\top} x) + \frac{1}{2} \|x - x_t\|^2_2 \notag \\ & = & \argmin_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \label{eq:proxoperator} \end{eqnarray} [[/math]]

The algorithm described by the above iteration is known as ISTA (Iterative Shrinkage-Thresholding Algorithm). In terms of convergence rate it is easy to show that ISTA has the same convergence rate on [math]f+g[/math] as gradient descent on [math]f[/math]. More precisely with [math]\eta=\frac{1}{\beta}[/math] one has

[[math]] f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} . [[/math]]

This improved convergence rate over a subgradient descent directly on [math]f+g[/math] comes at a price: in general \eqref{eq:proxoperator} may be a difficult optimization problem by itself, and this is why one needs to assume that [math]g[/math] is simple. For instance if [math]g[/math] can be written as [math]g(x) = \sum_{i=1}^n g_i(x(i))[/math] then one can compute [math]x_{t+1}[/math] by solving [math]n[/math] convex problems in dimension [math]1[/math]. In the case where [math]g(x) = \lambda \|x\|_1[/math] this one-dimensional problem is given by:

[[math]] \min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} . [[/math]]

Elementary computations show that this problem has an analytical solution given by [math]\tau_{\lambda \eta}(x_0)[/math], where [math]\tau[/math] is the shrinkage operator (hence the name ISTA), defined by

[[math]] \tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) . [[/math]]

Much more is known about \eqref{eq:proxoperator} (which is called the {\em proximal operator} of [math]g[/math]), and in fact entire monographs have been written about this equation, see e.g. [4][5].
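To make this concrete, here is a minimal NumPy sketch of ISTA applied to the LASSO objective [math]\frac12 \|Ax - b\|_2^2 + \lambda \|x\|_1[/math]; the problem data and the iteration count are placeholder choices, not prescriptions.

```python
import numpy as np

def soft_threshold(x, alpha):
    # Shrinkage operator: tau_alpha(x) = (|x| - alpha)_+ sign(x)
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, reg, num_iters=500):
    # ISTA for min_x 0.5 * ||A x - b||_2^2 + reg * ||x||_1. The smooth part
    # is beta-smooth with beta = ||A||_op^2, and we take eta = 1/beta as above.
    beta = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                          # gradient of the smooth part
        x = soft_threshold(x - grad / beta, reg / beta)   # proximal step on the l1 term
    return x
```

Note that each iteration costs one gradient evaluation plus a coordinate-wise shrinkage, in line with the observation that a separable [math]g[/math] reduces the proximal step to [math]n[/math] one-dimensional problems.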

FISTA (Fast ISTA)

An obvious idea is to combine Nesterov’s accelerated gradient descent (which results in a [math]1/t^2[/math] rate to optimize [math]f[/math]) with ISTA. This results in FISTA (Fast ISTA) which is described as follows. Let

[[math]] \lambda_0 = 0, \ \lambda_{t} = \frac{1 + \sqrt{1+ 4 \lambda_{t-1}^2}}{2}, \ \text{and} \ \gamma_t = \frac{1 - \lambda_t}{\lambda_{t+1}}. [[/math]]

Let [math]x_1 = y_1[/math] be an arbitrary initial point, and

[[math]] \begin{eqnarray*} y_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \|x - (x_t - \frac1{\beta} \nabla f(x_t)) \|_2^2 , \\ x_{t+1} & = & (1 - \gamma_t) y_{t+1} + \gamma_t y_t . \end{eqnarray*} [[/math]]

Again it is easy to show that the rate of convergence of FISTA on [math]f+g[/math] is similar to the one of Nesterov's accelerated gradient descent on [math]f[/math], more precisely:

[[math]] f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t^2} . [[/math]]
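As a companion to the ISTA sketch above, here is a sketch of FISTA on the same LASSO instance, reusing soft_threshold from the previous snippet (again all problem data are placeholders).

```python
import numpy as np

def fista(A, b, reg, num_iters=500):
    # FISTA for min_x 0.5 * ||A x - b||_2^2 + reg * ||x||_1,
    # following the lambda_t / gamma_t recursion displayed above.
    beta = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])   # x_t
    y = x.copy()               # y_t
    lam = 1.0                  # lambda_1 = (1 + sqrt(1 + 4 lambda_0^2)) / 2 with lambda_0 = 0
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        y_next = soft_threshold(x - grad / beta, reg / beta)  # prox-gradient step
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        gamma = (1 - lam) / lam_next                          # gamma_t = (1 - lambda_t) / lambda_{t+1}
        x = (1 - gamma) * y_next + gamma * y                  # momentum combination
        y, lam = y_next, lam_next
    return y
```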

CMD and RDA

ISTA and FISTA assume smoothness in the Euclidean metric. Quite naturally one can also use these ideas in a non-Euclidean setting. Starting from eq:MDproxview one obtains the CMD (Composite Mirror Descent) algorithm of [6], while with eq:DA0 one obtains the RDA (Regularized Dual Averaging) of [7]. We refer to these papers for more details.

Smooth saddle-point representation of a non-smooth function

Quite often the non-smoothness of a function [math]f[/math] comes from a [math]\max[/math] operation. More precisely non-smooth functions can often be represented as

[[math]] \begin{equation} \label{eq:sprepresentation} f(x) = \max_{1 \leq i \leq m} f_i(x) , \end{equation} [[/math]]

where the functions [math]f_i[/math] are smooth. This was the case for instance with the function we used to prove the black-box lower bound [math]1/\sqrt{t}[/math] for non-smooth optimization in Theorem. We will see now that by using this structural representation one can in fact attain a rate of [math]1/t[/math]. This was first observed in [8], which proposed Nesterov's smoothing technique. Here we will present the alternative method of [9], which we find more transparent (yet another version is the Chambolle-Pock algorithm, see [10]). Most of what is described in this section can be found in [11][12]. In the next subsection we introduce the more general problem of saddle point computation. We then proceed to apply a modified version of mirror descent to this problem, which will be useful both in Chapter and as a warm-up for the more powerful modified mirror prox that we introduce next.

Saddle point computation

Let [math]\cX \subset \R^n[/math], [math]\cY \subset \R^m[/math] be compact and convex sets. Let [math]\phi : \cX \times \cY \rightarrow \mathbb{R}[/math] be a continuous function, such that [math]\phi(\cdot, y)[/math] is convex and [math]\phi(x, \cdot)[/math] is concave. We write [math]g_{\cX}(x,y)[/math] (respectively [math]g_{\cY}(x,y)[/math]) for an element of [math]\partial_x \phi(x,y)[/math] (respectively [math]\partial_y (-\phi(x,y))[/math]). We are interested in computing

[[math]] \min_{x \in \cX} \max_{y \in \cY} \phi(x,y) . [[/math]]

By Sion's minimax theorem there exists a pair [math](x^*, y^*) \in \cX \times \cY[/math] such that

[[math]] \phi(x^*,y^*) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(x,y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x,y) . [[/math]]

We will explore algorithms that produce a candidate pair of solutions [math](\tx, \ty) \in \cX \times \cY[/math]. The quality of [math](\tx, \ty)[/math] is evaluated through the so-called duality gap[Notes 2]

[[math]] \max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) . [[/math]]

The key observation is that the duality gap can be controlled similarly to the suboptimality gap [math]f(x) - f(x^*)[/math] in a simple convex optimization problem. Indeed for any [math](x, y) \in \cX \times \cY[/math],

[[math]] \phi(\tx,\ty) - \phi(x,\ty) \leq g_{\cX}(\tx,\ty)^{\top} (\tx-x), [[/math]]

and

[[math]] - \phi(\tx,\ty) - (- \phi(\tx,y)) \leq g_{\cY}(\tx,\ty)^{\top} (\ty-y) . [[/math]]

In particular, using the notation [math]z = (x,y) \in \cZ := \cX \times \cY[/math] and [math]g(z) = (g_{\cX}(x,y), g_{\cY}(x,y))[/math] we just proved

[[math]] \begin{equation} \label{eq:keysp} \max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) \leq g(\tz)^{\top} (\tz - z) , \end{equation} [[/math]]

for some [math]z \in \mathcal{Z}.[/math] In view of the vector field point of view developed in Section The vector field point of view on MD, DA, and MP, this suggests performing mirror descent in the [math]\cZ[/math]-space with the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math]. We will assume in the next subsections that [math]\cX[/math] is equipped with a mirror map [math]\Phi_{\cX}[/math] (defined on [math]\cD_{\cX}[/math]) which is [math]1[/math]-strongly convex w.r.t. a norm [math]\|\cdot\|_{\cX}[/math] on [math]\cX \cap \cD_{\cX}[/math]. We denote [math]R^2_{\cX} = \sup_{x \in \cX} \Phi_{\cX}(x) - \min_{x \in \cX} \Phi_{\cX}(x)[/math]. We define similar quantities for the space [math]\cY[/math].

Saddle Point Mirror Descent (SP-MD)

We consider here mirror descent on the space [math]\cZ = \cX \times \cY[/math] with the mirror map [math]\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)[/math] (defined on [math]\cD = \cD_{\cX} \times \cD_{\cY}[/math]), where [math]a, b \in \R_+[/math] are to be defined later, and with the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math] defined in the previous subsection. We call the resulting algorithm SP-MD (Saddle Point Mirror Descent). It can be described succinctly as follows. Let [math]z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)[/math]. Then for [math]t \geq 1[/math], let

[[math]] z_{t+1} \in \argmin_{z \in \cZ \cap \cD} \ \eta g_t^{\top} z + D_{\Phi}(z,z_t) , [[/math]]

where [math]g_t = (g_{\cX,t}, g_{\cY,t})[/math] with [math]g_{\cX,t} \in \partial_x \phi(x_t,y_t)[/math] and [math]g_{\cY,t} \in \partial_y (- \phi(x_t,y_t))[/math].

Theorem

Assume that [math]\phi(\cdot, y)[/math] is [math]L_{\cX}[/math]-Lipschitz w.r.t. [math]\|\cdot\|_{\cX}[/math], that is [math]\|g_{\cX}(x,y)\|_{\cX}^* \leq L_{\cX}, \forall (x, y) \in \cX \times \cY[/math]. Similarly assume that [math]\phi(x, \cdot)[/math] is [math]L_{\cY}[/math]-Lipschitz w.r.t. [math]\|\cdot\|_{\cY}[/math]. Then SP-MD with [math]a= \frac{L_{\cX}}{R_{\cX}}[/math], [math]b=\frac{L_{\cY}}{R_{\cY}}[/math], and [math]\eta=\sqrt{\frac{2}{t}}[/math] satisfies

[[math]] \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq (R_{\cX} L_{\cX} + R_{\cY} L_{\cY}) \sqrt{\frac{2}{t}}. [[/math]]


Proof

First we endow [math]\mathcal{Z}[/math] with the norm [math]\|\cdot\|_{\cZ}[/math] defined by

[[math]] \|z\|_{\cZ} = \sqrt{a \|x\|_{\mathcal{X}}^2 + b \|y\|_{\mathcal{Y}}^2} . [[/math]]
It is immediate that [math]\Phi[/math] is [math]1[/math]-strongly convex with respect to [math]\|\cdot\|_{\mathcal{Z}}[/math] on [math]\mathcal{Z} \cap \mathcal{D}[/math]. Furthermore one can easily check that

[[math]] \|z\|_{\mathcal{Z}}^* = \sqrt{\frac1{a} \left(\|x\|_{\mathcal{X}}^*\right)^2 + \frac1{b} \left(\|y\|_{\mathcal{Y}}^*\right)^2} , [[/math]]
and thus the vector field [math](g_t)[/math] used in the SP-MD satisfies:

[[math]] \|g_t\|_{\mathcal{Z}}^* \leq \sqrt{\frac{L_{\mathcal{X}}^2}{a} + \frac{L_{\mathcal{Y}}^2}{b}} . [[/math]]
Using eq:vfMD together with \eqref{eq:keysp} and the values of [math]a, b[/math] and [math]\eta[/math] concludes the proof.
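To illustrate SP-MD, here is a sketch for the bilinear case [math]\phi(x,y) = x^{\top} A y[/math] over Euclidean unit balls; with the Euclidean mirror map each step reduces to a projected gradient step, and the bilinear instance and step size choice (constants elided) are illustrative assumptions.

```python
import numpy as np

def project_ball(v):
    # Euclidean projection onto the unit ball
    norm = np.linalg.norm(v)
    return v / norm if norm > 1 else v

def sp_md_bilinear(A, num_iters=1000):
    # SP-MD for phi(x, y) = x^T A y over Euclidean unit balls, with the mirror
    # map Phi(z) = ||x||_2^2 / 2 + ||y||_2^2 / 2 (steps become projections).
    n, m = A.shape
    x, y = np.zeros(n), np.zeros(m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    eta = np.sqrt(2.0 / num_iters)    # step size of the form eta = sqrt(2/t)
    for _ in range(num_iters):
        gx, gy = A @ y, -A.T @ x      # the vector field g(z_t)
        x = project_ball(x - eta * gx)
        y = project_ball(y - eta * gy)
        x_avg += x / num_iters        # the theorem bounds the gap of the averages
        y_avg += y / num_iters
    return x_avg, y_avg
```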

Saddle Point Mirror Prox (SP-MP)

We now consider the most interesting situation in the context of this chapter, where the function [math]\phi[/math] is smooth. Precisely we say that [math]\phi[/math] is [math](\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})[/math]-smooth if for any [math]x, x' \in \cX, y, y' \in \cY[/math],

[[math]] \begin{align*} & \|\nabla_x \phi(x,y) - \nabla_x \phi(x',y) \|_{\mathcal{X}}^* \leq \beta_{11} \|x-x'\|_{\mathcal{X}} , \\ & \|\nabla_x \phi(x,y) - \nabla_x \phi(x,y') \|_{\mathcal{X}}^* \leq \beta_{12} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x,y') \|_{\mathcal{Y}}^* \leq \beta_{22} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x',y) \|_{\mathcal{Y}}^* \leq \beta_{21} \|x-x'\|_{\mathcal{X}} . \end{align*} [[/math]]

This will imply the Lipschitzness of the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math] under the appropriate norm. Thus we use here mirror prox on the space [math]\cZ[/math] with the mirror map [math]\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)[/math] and the vector field [math]g[/math]. The resulting algorithm is called SP-MP (Saddle Point Mirror Prox) and we can describe it succinctly as follows. Let [math]z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)[/math]. Then for [math]t \geq 1[/math], let [math]z_t=(x_t,y_t)[/math] and [math]w_t=(u_t, v_t)[/math] be defined by

[[math]] \begin{eqnarray*} w_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(x_t, y_t), - \nabla_y \phi(x_t,y_t))^{\top} z + D_{\Phi}(z,z_t) \\ z_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(u_{t+1}, v_{t+1}), - \nabla_y \phi(u_{t+1},v_{t+1}))^{\top} z + D_{\Phi}(z,z_t) . \end{eqnarray*} [[/math]]


Theorem

Assume that [math]\phi[/math] is [math](\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})[/math]-smooth. Then SP-MP with [math]a= \frac{1}{R_{\cX}^2}[/math], [math]b=\frac{1}{R_{\cY}^2}[/math], and [math]\eta= 1 / \left(2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \right)[/math] satisfies

[[math]] \begin{align*} & \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t u_{s+1},y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t v_{s+1} \right) \\ & \leq \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \frac{4}{t} . \end{align*} [[/math]]


Proof

In light of the proof of Theorem and eq:vfMP it clearly suffices to show that the vector field [math]g(z) = (\nabla_x \phi(x,y), - \nabla_y \phi(x,y))[/math] is [math]\beta[/math]-Lipschitz w.r.t. [math]\|z\|_{\cZ} = \sqrt{\frac{1}{R_{\cX}^2} \|x\|_{\mathcal{X}}^2 + \frac{1}{R_{\cY}^2} \|y\|_{\mathcal{Y}}^2}[/math] with [math]\beta = 2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right)[/math]. In other words one needs to show that

[[math]] \|g(z) - g(z')\|_{\cZ}^* \leq \beta \|z - z'\|_{\cZ} , [[/math]]
which can be done with straightforward calculations (by introducing [math]g(x',y)[/math] and using the definition of smoothness for [math]\phi[/math]).

Applications

We investigate briefly three applications for SP-MD and SP-MP.

Minimizing a maximum of smooth functions

The problem \eqref{eq:sprepresentation} (when [math]f[/math] has to be minimized over [math]\cX[/math]) can be rewritten as

[[math]] \min_{x \in \cX} \max_{y \in \Delta_m} \vec{f}(x)^{\top} y , [[/math]]

where [math]\vec{f}(x) = (f_1(x), \dots, f_m(x)) \in \R^m[/math]. We assume that the functions [math]f_i[/math] are [math]L[/math]-Lipschitz and [math]\beta[/math]-smooth w.r.t. some norm [math]\|\cdot\|_{\cX}[/math]. Let us study the smoothness of [math]\phi(x,y) = \vec{f}(x)^{\top} y[/math] when [math]\cX[/math] is equipped with [math]\|\cdot\|_{\cX}[/math] and [math]\Delta_m[/math] is equipped with [math]\|\cdot\|_1[/math]. On the one hand [math]\nabla_y \phi(x,y) = \vec{f}(x)[/math], in particular one immediately has [math]\beta_{22}=0[/math], and furthermore

[[math]] \|\vec{f}(x) - \vec{f}(x') \|_{\infty} \leq L \|x-x'\|_{\mathcal{X}} , [[/math]]

that is [math]\beta_{21}=L[/math]. On the other hand [math]\nabla_x \phi(x,y) = \sum_{i=1}^m y_i \nabla f_i(x)[/math], and thus

[[math]] \begin{align*} & \|\sum_{i=1}^m y(i) (\nabla f_i(x) - \nabla f_i(x')) \|_{\cX}^* \leq \beta \|x-x'\|_{\cX} , \\ & \|\sum_{i=1}^m (y(i)-y'(i)) \nabla f_i(x) \|_{\cX}^* \leq L\|y-y'\|_1 , \end{align*} [[/math]]

that is [math]\beta_{11} = \beta[/math] and [math]\beta_{12} = L[/math]. Thus using SP-MP with some mirror map on [math]\cX[/math] and the negentropy on [math]\Delta_m[/math] (see the "simplex setup" in Section Standard setups for mirror descent), one obtains an [math]\epsilon[/math]-optimal point of [math]f(x) = \max_{1 \leq i \leq m} f_i(x)[/math] in [math]O\left(\frac{\beta R_{\cX}^2 + L R_{\cX} \sqrt{\log(m)}}{\epsilon} \right)[/math] iterations. Furthermore an iteration of SP-MP has a computational complexity of the order of a step of mirror descent in [math]\cX[/math] on the function [math]x \mapsto \sum_{i=1}^m y(i) f_i(x)[/math] (plus [math]O(m)[/math] for the update in the [math]\cY[/math]-space). Thus by using the structure of [math]f[/math] we were able to obtain a much better rate than black-box procedures (which would have required [math]\Omega(1/\epsilon^2)[/math] iterations as [math]f[/math] is potentially non-smooth).

Matrix games

Let [math]A \in \R^{n \times m}[/math], we denote [math]\|A\|_{\mathrm{max}}[/math] for the maximal entry (in absolute value) of [math]A[/math], and [math]A_i \in \R^n[/math] for the [math]i^{th}[/math] column of [math]A[/math]. We consider the problem of computing a Nash equilibrium for the zero-sum game corresponding to the loss matrix [math]A[/math], that is we want to solve

[[math]] \min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y . [[/math]]

Here we equip both [math]\Delta_n[/math] and [math]\Delta_m[/math] with [math]\|\cdot\|_1[/math]. Let [math]\phi(x,y) = x^{\top} A y[/math]. Using that [math]\nabla_x \phi(x,y) = Ay[/math] and [math]\nabla_y \phi(x,y) = A^{\top} x[/math] one immediately obtains [math]\beta_{11} = \beta_{22} = 0[/math]. Furthermore since

[[math]] \|A(y - y') \|_{\infty} = \|\sum_{i=1}^m (y(i) - y'(i)) A_i \|_{\infty} \leq \|A\|_{\mathrm{max}} \|y - y'\|_1 , [[/math]]

one also has [math]\beta_{12} = \beta_{21} = \|A\|_{\mathrm{max}}[/math]. Thus SP-MP with the negentropy on both [math]\Delta_n[/math] and [math]\Delta_m[/math] attains an [math]\epsilon[/math]-optimal pair of mixed strategies with [math]O\left(\|A\|_{\mathrm{max}} \sqrt{\log(n) \log(m)} / \epsilon \right)[/math] iterations. Furthermore the computational complexity of a step of SP-MP is dominated by the matrix-vector multiplications which are [math]O(n m)[/math]. Thus overall the complexity of getting an [math]\epsilon[/math]-optimal Nash equilibrium with SP-MP is [math]O\left(\|A\|_{\mathrm{max}} n m \sqrt{\log(n) \log(m)} / \epsilon \right)[/math].
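The following sketch implements SP-MP for a matrix game, with the negentropy mirror map on both simplices so that each step is a multiplicative weights update; folding the weights [math]a = 1/\log n[/math], [math]b = 1/\log m[/math] into per-block step sizes is our reading of the theorem above, and the snippet should be taken as a sketch rather than a reference implementation.

```python
import numpy as np

def md_simplex_step(p, grad, eta):
    # Entropic mirror step on the simplex = multiplicative weights update
    w = p * np.exp(-eta * grad)
    return w / w.sum()

def sp_mp_matrix_game(A, num_iters=1000):
    # SP-MP for min_{x in Delta_n} max_{y in Delta_m} x^T A y
    n, m = A.shape
    eta = 1.0 / (2 * np.abs(A).max() * np.sqrt(np.log(n) * np.log(m)))
    eta_x, eta_y = eta * np.log(n), eta * np.log(m)  # block step sizes with a, b folded in
    x, y = np.ones(n) / n, np.ones(m) / m            # uniform points minimize the entropy map
    u_avg, v_avg = np.zeros(n), np.zeros(m)
    for _ in range(num_iters):
        u = md_simplex_step(x, A @ y, eta_x)         # extrapolation step w_{t+1}
        v = md_simplex_step(y, -A.T @ x, eta_y)
        x = md_simplex_step(x, A @ v, eta_x)         # update step z_{t+1}
        y = md_simplex_step(y, -A.T @ u, eta_y)
        u_avg += u / num_iters
        v_avg += v / num_iters
    gap = (u_avg @ A).max() - (A @ v_avg).min()      # duality gap of the averaged pair
    return u_avg, v_avg, gap
```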

Linear classification

Let [math](\ell_i, A_i) \in \{-1,1\} \times \R^n[/math], [math]i \in [m][/math], be a data set that one wishes to separate with a linear classifier. That is one is looking for [math]x \in \mB_{2,n}[/math] such that for all [math]i \in [m][/math], [math]\mathrm{sign}(x^{\top} A_i) = \mathrm{sign}(\ell_i)[/math], or equivalently [math]\ell_i x^{\top} A_i \gt 0[/math]. Clearly without loss of generality one can assume [math]\ell_i = 1[/math] for all [math]i \in [m][/math] (simply replace [math]A_i[/math] by [math]\ell_i A_i[/math]). Let [math]A \in \R^{n \times m}[/math] be the matrix where the [math]i^{th}[/math] column is [math]A_i[/math]. The problem of finding [math]x[/math] with maximal margin can be written as

[[math]] \begin{equation} \label{eq:linearclassif} \max_{x \in \mB_{2,n}} \min_{1 \leq i \leq m} A_i^{\top} x = \max_{x \in \mB_{2,n}} \min_{y \in \Delta_m} x^{\top} A y . \end{equation} [[/math]]

Assuming that [math]\|A_i\|_2 \leq B[/math], and using the calculations we did in Section Minimizing a maximum of smooth functions, it is clear that [math]\phi(x,y) = x^{\top} A y[/math] is [math](0, B, 0, B)[/math]-smooth with respect to [math]\|\cdot\|_2[/math] on [math]\mB_{2,n}[/math] and [math]\|\cdot\|_1[/math] on [math]\Delta_m[/math]. This implies in particular that SP-MP with the Euclidean norm squared on [math]\mB_{2,n}[/math] and the negentropy on [math]\Delta_m[/math] will solve \eqref{eq:linearclassif} in [math]O(B \sqrt{\log(m)} / \epsilon)[/math] iterations. Again the cost of an iteration is dominated by the matrix-vector multiplications, which results in an overall complexity of [math]O(B n m \sqrt{\log(m)} / \epsilon)[/math] to find an [math]\epsilon[/math]-optimal solution to \eqref{eq:linearclassif}.

Interior point methods

We describe here interior point methods (IPM), a class of algorithms fundamentally different from what we have seen so far. The first algorithm of this type was described in [13], but the theory we shall present was developed in [14]. We follow closely the presentation given in [Chapter 4, [15]]. Other useful references (in particular for the primal-dual IPM, which are the ones used in practice) include [16][17][18].

IPM are designed to solve convex optimization problems of the form

[[math]] \begin{align*} & \mathrm{min.} \; c^{\top} x \\ & \text{s.t.} \; x \in \cX , \end{align*} [[/math]]

with [math]c \in \R^n[/math], and [math]\cX \subset \R^n[/math] convex and compact. Note that, at this point, the linearity of the objective is without loss of generality as minimizing a convex function [math]f[/math] over [math]\cX[/math] is equivalent to minimizing a linear objective over the epigraph of [math]f[/math] (which is also a convex set). The structural assumption on [math]\cX[/math] that one makes in IPM is that there exists a {\em self-concordant barrier} for [math]\cX[/math] with an easily computable gradient and Hessian. The meaning of the previous sentence will be made precise in the next subsections. The importance of IPM stems from the fact that LPs and SDPs (see Section Structured optimization) satisfy this structural assumption.

The barrier method

We say that [math]F : \inte(\cX) \rightarrow \R[/math] is a {\em barrier} for [math]\cX[/math] if

[[math]] F(x) \xrightarrow[x \to \partial \cX]{} +\infty . [[/math]]

We will only consider strictly convex barriers. We extend the domain of definition of [math]F[/math] to [math]\R^n[/math] with [math]F(x) = +\infty[/math] for [math]x \not\in \inte(\cX)[/math]. For [math]t \in \R_+[/math] let

[[math]] x^*(t) \in \argmin_{x \in \R^n} t c^{\top} x + F(x) . [[/math]]

In the following we denote [math]F_t(x) := t c^{\top} x + F(x)[/math]. In IPM the path [math](x^*(t))_{t \in \R_+}[/math] is referred to as the {\em central path}. It seems clear that the central path eventually leads to the minimum [math]x^*[/math] of the objective function [math]c^{\top} x[/math] on [math]\cX[/math]; precisely, we will have

[[math]] x^*(t) \xrightarrow[t \to +\infty]{} x^* . [[/math]]

The idea of the {\em barrier method} is to move along the central path by "boosting" a fast locally convergent algorithm, which we denote for the moment by [math]\cA[/math], using the following scheme: assume that one has computed [math]x^*(t)[/math]; then one uses [math]\cA[/math] initialized at [math]x^*(t)[/math] to compute [math]x^*(t')[/math] for some [math]t' \gt t[/math]. There is a clear tension in the choice of [math]t'[/math]: on the one hand [math]t'[/math] should be large in order to make as much progress as possible on the central path, but on the other hand [math]x^*(t)[/math] needs to be close enough to [math]x^*(t')[/math] so that it is in the basin of fast convergence for [math]\cA[/math] when run on [math]F_{t'}[/math]. IPM follows the above methodology with [math]\cA[/math] being {\em Newton's method}. Indeed as we will see in the next subsection, Newton's method has a quadratic convergence rate, in the sense that if initialized close enough to the optimum it attains an [math]\epsilon[/math]-optimal point in [math]\log\log(1/\epsilon)[/math] iterations! Thus we now have a clear plan to make these ideas formal and analyze the iteration complexity of IPM:

  1. First we need to describe precisely the region of fast convergence for Newton's method. This will lead us to define self-concordant functions, which are "natural" functions for Newton's method.
  2. Then we need to evaluate precisely how much larger [math]t'[/math] can be compared to [math]t[/math], so that [math]x^*(t)[/math] is still in the region of fast convergence of Newton's method when optimizing the function [math]F_{t'}[/math] with [math]t' \gt t[/math]. This will lead us to define [math]\nu[/math]-self-concordant barriers.
  3. How do we get close to the central path in the first place? Is it possible to compute [math]x^*(0) = \argmin_{x \in \R^n} F(x)[/math] (the so-called analytical center of [math]\mathcal{X}[/math])?

Traditional analysis of Newton's method

We start by describing Newton's method together with its standard analysis showing the quadratic convergence rate when initialized close enough to the optimum. In this subsection we denote [math]\|\cdot\|[/math] for both the Euclidean norm on [math]\R^n[/math] and the operator norm on matrices (in particular [math]\|A x\| \leq \|A\| \cdot \|x\|[/math]). Let [math]f: \R^n \rightarrow \R[/math] be a [math]C^2[/math] function. Using a Taylor expansion of [math]f[/math] around [math]x[/math] one obtains

[[math]] f(x+h) = f(x) + h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h + o(\|h\|^2) . [[/math]]

Thus, starting at [math]x[/math], in order to minimize [math]f[/math] it seems natural to move in the direction [math]h[/math] that minimizes

[[math]] h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h . [[/math]]

If [math]\nabla^2 f(x)[/math] is positive definite then the solution to this problem is given by [math]h = - [\nabla^2 f(x)]^{-1} \nabla f(x)[/math]. Newton's method simply iterates this idea: starting at some point [math]x_0 \in \R^n[/math], it iterates for [math]k \geq 0[/math] the following equation:

[[math]] x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) . [[/math]]

While this method can have an arbitrarily bad behavior in general, if started close enough to a strict local minimum of [math]f[/math], it can have a very fast convergence:

Theorem

Assume that [math]f[/math] has a Lipschitz Hessian, that is [math]\| \nabla^2 f(x) - \nabla^2 f(y) \| \leq M \|x - y\|[/math]. Let [math]x^*[/math] be a local minimum of [math]f[/math] with strictly positive Hessian, that is [math]\nabla^2 f(x^*) \succeq \mu \mI_n[/math], [math]\mu \gt 0[/math]. Suppose that the initial starting point [math]x_0[/math] of Newton's method is such that

[[math]] \|x_0 - x^*\| \leq \frac{\mu}{2 M} . [[/math]]
Then Newton's method is well-defined and converges to [math]x^*[/math] at a quadratic rate:

[[math]] \|x_{k+1} - x^*\| \leq \frac{M}{\mu} \|x_k - x^*\|^2. [[/math]]


Proof

We use the following simple formula, for [math]x, h \in \R^n[/math],

[[math]] \int_0^1 \nabla^2 f(x + s h) \ h \ ds = \nabla f(x+h) - \nabla f(x) . [[/math]]
Now note that [math]\nabla f(x^*) = 0[/math], and thus with the above formula one obtains

[[math]] \nabla f(x_k) = \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds , [[/math]]
which allows us to write:

[[math]] \begin{align*} & x_{k+1} - x^* \\ & = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) \\ & = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds \\ & = [\nabla^2 f(x_k)]^{-1} \int_0^1 [\nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) ] \ (x_k - x^*) \ ds . \end{align*} [[/math]]
In particular one has

[[math]] \begin{align*} & \|x_{k+1} - x^*\| \\ & \leq \|[\nabla^2 f(x_k)]^{-1}\| \\ & \times \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \|x_k - x^* \|. \end{align*} [[/math]]
Using the Lipschitz property of the Hessian one immediately obtains that

[[math]] \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \leq \frac{M}{2} \|x_k - x^*\| . [[/math]]
Using again the Lipschitz property of the Hessian (note that [math]\|A - B\| \leq s \Leftrightarrow s \mI_n \succeq A - B \succeq - s \mI_n[/math]), the hypothesis on [math]x^*[/math], and an induction hypothesis that [math]\|x_k - x^*\| \leq \frac{\mu}{2M}[/math], one has

[[math]] \nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M \|x_k - x^*\| \mI_n \succeq (\mu - M \|x_k - x^*\|) \mI_n \succeq \frac{\mu}{2} \mI_n , [[/math]]
which concludes the proof.
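A bare-bones implementation of the iteration is given below; the test function [math]f(x) = \sum_{i} (x_i - \log x_i)[/math] is a hypothetical choice, whose minimum at [math](1,\dots,1)[/math] has identity Hessian, so the quadratic convergence of the theorem is visible after a handful of steps.

```python
import numpy as np

def newton(grad, hess, x0, num_iters=10):
    # Newton's method: x_{k+1} = x_k - [Hessian f(x_k)]^{-1} grad f(x_k)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - np.linalg.solve(hess(x), grad(x))  # solve the linear system, never invert
    return x

# Hypothetical test: f(x) = sum_i (x_i - log x_i), minimized at x = (1, ..., 1)
grad = lambda x: 1.0 - 1.0 / x
hess = lambda x: np.diag(1.0 / x ** 2)
x_star = newton(grad, hess, x0=np.array([0.6, 1.7]))  # converges rapidly to [1., 1.]
```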

Self-concordant functions

Before giving the definition of self-concordant functions let us try to get some insight into the "geometry" of Newton's method. Let [math]A[/math] be an [math]n \times n[/math] non-singular matrix. We look at a Newton step on the functions [math]f: x \mapsto f(x)[/math] and [math]\phi: y \mapsto f(A^{-1} y)[/math], starting respectively from [math]x[/math] and [math]y = A x[/math], that is:

[[math]] x^+ = x - [\nabla^2 f(x)]^{-1} \nabla f(x) , \; \text{and} \; y^+ = y - [\nabla^2 \phi(y)]^{-1} \nabla \phi(y) . [[/math]]

By using the following simple formulas

[[math]] \nabla (x \mapsto f(A x) ) = A^{\top} \nabla f(A x) , \; \text{and} \; \nabla^2 (x \mapsto f(A x) ) = A^{\top} \nabla^2 f(A x) A , [[/math]]

it is easy to show that

[[math]] y^+ = A x^+ . [[/math]]

In other words Newton's method will follow the same trajectory in the "[math]x[/math]-space" and in the "[math]y[/math]-space" (the image through [math]A[/math] of the [math]x[/math]-space), that is Newton's method is {\em affine invariant}. Observe that this property is not shared by the methods described in Chapter (except for the conditional gradient descent). The affine invariance of Newton's method casts some concerns on the assumptions of the analysis in Section Traditional analysis of Newton's method. Indeed the assumptions are all in terms of the canonical inner product in [math]\R^n[/math]. However we just showed that the method itself does not depend on the choice of the inner product (again this is not true for first order methods). Thus one would like to derive a result similar to Theorem without any reference to a prespecified inner product. The idea of self-concordance is to modify the Lipschitz assumption on the Hessian to achieve this goal. Assume from now on that [math]f[/math] is [math]C^3[/math], and let [math]\nabla^3 f(x) : \R^n \times \R^n \times \R^n \rightarrow \R[/math] be the third order differential operator. The Lipschitz assumption on the Hessian in Theorem can be written as:

[[math]] \nabla^3 f(x) [h,h,h] \leq M \|h\|_2^3 . [[/math]]

The issue is that this inequality depends on the choice of an inner product. More importantly it is easy to see that a convex function which goes to infinity on a compact set simply cannot satisfy the above inequality. A natural idea to try to fix these issues is to replace the Euclidean metric on the right hand side by the metric given by the function [math]f[/math] itself at [math]x[/math], that is:

[[math]] \|h\|_x = \sqrt{ h^{\top} \nabla^2 f(x) h }. [[/math]]

Observe that to be clear one should rather use the notation [math]\|\cdot\|_{x, f}[/math], but since [math]f[/math] will always be clear from the context we stick to [math]\|\cdot\|_x[/math].

Definition

Let [math]\mathcal{X}[/math] be a convex set with non-empty interior, and [math]f[/math] a [math]C^3[/math] convex function defined on [math]\inte(\mathcal{X})[/math]. Then [math]f[/math] is self-concordant (with constant [math]M[/math]) if for all [math]x \in \inte(\mathcal{X}), h \in \R^n[/math],

[[math]] \nabla^3 f(x) [h,h,h] \leq M \|h\|_x^3 . [[/math]]
We say that [math]f[/math] is standard self-concordant if [math]f[/math] is self-concordant with constant [math]M=2[/math].

An easy consequence of the definition is that a self-concordant function is a barrier for the set [math]\mathcal{X}[/math], see [Theorem 4.1.4, [15]]. The main example to keep in mind of a standard self-concordant function is [math]f(x) = - \log x[/math] for [math]x \gt 0[/math].
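Indeed, for this example a direct computation gives, for any [math]h \in \R[/math],

[[math]] f''(x) = \frac{1}{x^2} , \quad f'''(x) = - \frac{2}{x^3} , \quad \text{so} \quad \nabla^3 f(x) [h,h,h] = - \frac{2 h^3}{x^3} \leq \frac{2 |h|^3}{x^3} = 2 \left( \frac{h^2}{x^2} \right)^{3/2} = 2 \|h\|_x^3 , [[/math]]

which is precisely self-concordance with constant [math]M = 2[/math]. The next definition will be key in order to describe the region of quadratic convergence for Newton's method on self-concordant functions.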

Definition

Let [math]f[/math] be a standard self-concordant function on [math]\mathcal{X}[/math]. For [math]x \in \mathrm{int}(\mathcal{X})[/math], we say that [math]\lambda_f(x) = \|\nabla f(x)\|_x^*[/math] is the {\em Newton decrement} of [math]f[/math] at [math]x[/math].

An important inequality is that for [math]x[/math] such that [math]\lambda_f(x) \lt 1[/math], and [math]x^* = \argmin f(x)[/math], one has

[[math]] \begin{equation} \label{eq:trucipm3} \|x - x^*\|_x \leq \frac{\lambda_f(x)}{1 - \lambda_f(x)} , \end{equation} [[/math]]

see [Equation 4.1.18, [15]]. We state the next theorem without a proof, see also [Theorem 4.1.14, [15]].

Theorem

Let [math]f[/math] be a standard self-concordant function on [math]\mathcal{X}[/math], and [math]x \in \mathrm{int}(\mathcal{X})[/math] such that [math]\lambda_f(x) \leq 1/4[/math], then

[[math]] \lambda_f\Big(x - [\nabla^2 f(x)]^{-1} \nabla f(x)\Big) \leq 2 \lambda_f(x)^2 . [[/math]]

In other words the above theorem states that, if initialized at a point [math]x_0[/math] such that [math]\lambda_f(x_0) \leq 1/4[/math], then Newton's iterates satisfy [math]\lambda_f(x_{k+1}) \leq 2 \lambda_f(x_k)^2[/math]. Thus, Newton's region of quadratic convergence for self-concordant functions can be described as a "Newton decrement ball" [math]\{x : \lambda_f(x) \leq 1/4\}[/math]. In particular by taking the barrier to be a self-concordant function we have now resolved Step (1) of the plan described in Section The barrier method.
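The squaring of the Newton decrement is easy to observe numerically. In the sketch below, the standard self-concordant function [math]f(x) = c^{\top} x - \sum_{i} \log x_i[/math] and the starting point are arbitrary illustrative choices.

```python
import numpy as np

def newton_decrement(g, H):
    # lambda_f(x) = ||grad f(x)||_x^* = sqrt(grad^T [Hessian]^{-1} grad)
    return np.sqrt(g @ np.linalg.solve(H, g))

# Illustrative instance: f(x) = c^T x - sum_i log(x_i) is standard self-concordant
c = np.array([1.0, 2.0])
grad = lambda x: c - 1.0 / x
hess = lambda x: np.diag(1.0 / x ** 2)

x = np.array([0.9, 0.45])    # here lambda_f(x) is about 0.14 <= 1/4
for _ in range(4):
    lam = newton_decrement(grad(x), hess(x))
    print(lam)                # observe lambda_f roughly squaring at each step
    x = x - np.linalg.solve(hess(x), grad(x))
```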

[math]\nu[/math]-self-concordant barriers

We deal here with Step (2) of the plan described in Section The barrier method. Given Theorem we want [math]t'[/math] to be as large as possible and such that

[[math]] \begin{equation} \label{eq:trucipm1} \lambda_{F_{t'}}(x^*(t) ) \leq 1/4 . \end{equation} [[/math]]

Since the Hessian of [math]F_{t'}[/math] is the Hessian of [math]F[/math], one has

[[math]] \lambda_{F_{t'}}(x^*(t) ) = \|t' c + \nabla F(x^*(t)) \|_{x^*(t)}^* . [[/math]]

Observe that, by first order optimality, one has [math]t c + \nabla F(x^*(t)) = 0,[/math] which yields

[[math]] \begin{equation} \label{eq:trucipm11} \lambda_{F_{t'}}(x^*(t) ) = (t'-t) \|c\|^*_{x^*(t)} . \end{equation} [[/math]]

Thus taking

[[math]] \begin{equation} \label{eq:trucipm2} t' = t + \frac{1}{4 \|c\|^*_{x^*(t)}} \end{equation} [[/math]]

immediately yields \eqref{eq:trucipm1}. In particular with the value of [math]t'[/math] given in \eqref{eq:trucipm2} the Newton's method on [math]F_{t'}[/math] initialized at [math]x^*(t)[/math] will converge quadratically fast to [math]x^*(t')[/math]. It remains to verify that by iterating \eqref{eq:trucipm2} one obtains a sequence diverging to infinity, and to estimate the rate of growth. Thus one needs to control [math]\|c\|^*_{x^*(t)} = \frac1{t} \|\nabla F(x^*(t))\|_{x^*(t)}^*[/math]. Luckily there is a natural class of functions for which one can control [math]\|\nabla F(x)\|_x^*[/math] uniformly over [math]x[/math]. This is the set of functions such that

[[math]] \begin{equation} \label{eq:nu} \nabla^2 F(x) \succeq \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} . \end{equation} [[/math]]

Indeed in that case one has:

[[math]] \begin{eqnarray*} \|\nabla F(x)\|_x^* & = & \sup_{h : h^{\top} \nabla^2 F(x) h \leq 1} \nabla F(x)^{\top} h \\ & \leq & \sup_{h : h^{\top} \left( \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} \right) h \leq 1} \nabla F(x)^{\top} h \\ & = & \sqrt{\nu} . \end{eqnarray*} [[/math]]

Thus a safe choice to increase the penalization parameter is [math]t' = \left(1 + \frac1{4\sqrt{\nu}}\right) t[/math]. Note that the condition \eqref{eq:nu} can also be written as the fact that the function [math]F[/math] is [math]\frac1{\nu}[/math]-exp-concave, that is [math]x \mapsto \exp(- \frac1{\nu} F(x))[/math] is concave. We arrive at the following definition.

Definition

[math]F[/math] is a [math]\nu[/math]-self-concordant barrier if it is a standard self-concordant function, and it is [math]\frac1{\nu}[/math]-exp-concave.

Again the canonical example is the logarithmic function, [math]x \mapsto - \log x[/math], which is a [math]1[/math]-self-concordant barrier for the set [math]\R_{+}[/math].
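Indeed, in this case \eqref{eq:nu} holds with equality for [math]\nu = 1[/math]:

[[math]] \nabla F(x) = - \frac{1}{x} , \quad \nabla^2 F(x) = \frac{1}{x^2} = \nabla F(x) [\nabla F(x)]^{\top} . [[/math]]

We state the next theorem without a proof (see [19] for more on this result).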

Theorem

Let [math]\mathcal{X} \subset \R^n[/math] be a closed convex set with non-empty interior. There exists [math]F[/math] which is a [math](c \ n)[/math]-self-concordant barrier for [math]\mathcal{X}[/math] (where [math]c[/math] is some universal constant).

A key property of [math]\nu[/math]-self-concordant barriers is the following inequality:

[[math]] \begin{equation} \label{eq:key} c^{\top} x^*(t) - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu}{t} , \end{equation} [[/math]]

see [Equation (4.2.17), [15]]. More generally using \eqref{eq:key} together with \eqref{eq:trucipm3} one obtains

[[math]] \begin{eqnarray} c^{\top} y - \min_{x \in \mathcal{X}} c^{\top} x & \leq & \frac{\nu}{t} + c^{\top} (y - x^*(t)) \notag \\ & = & \frac{\nu}{t} + \frac{1}{t} (\nabla F_t(y) - \nabla F(y))^{\top} (y - x^*(t)) \notag \\ & \leq & \frac{\nu}{t} + \frac{1}{t} \|\nabla F_t(y) - \nabla F(y)\|_y^* \cdot \|y - x^*(t)\|_y \notag \\ & \leq & \frac{\nu}{t} + \frac{1}{t} (\lambda_{F_t}(y) + \sqrt{\nu})\frac{\lambda_{F_t} (y)}{1 - \lambda_{F_t}(y)} . \label{eq:trucipm4} \end{eqnarray} [[/math]]

In the next section we describe a precise algorithm based on the ideas we developed above. As we will see one cannot ensure to be exactly on the central path, and thus it is useful to generalize the identity \eqref{eq:trucipm11} for a point [math]x[/math] close to the central path. We do this as follows:

[[math]] \begin{eqnarray} \lambda_{F_{t'}}(x) & = & \|t' c + \nabla F(x)\|_x^* \notag \\ & = & \|(t' / t) (t c + \nabla F(x)) + (1- t'/t) \nabla F(x)\|_x^* \notag \\ & \leq & \frac{t'}{t} \lambda_{F_t}(x) + \left(\frac{t'}{t} - 1\right) \sqrt{\nu} .\label{eq:trucipm12} \end{eqnarray} [[/math]]


Path-following scheme

We can now formally describe and analyze the most basic IPM, called the {\em path-following scheme}. Let [math]F[/math] be a [math]\nu[/math]-self-concordant barrier for [math]\cX[/math]. Assume that one can find [math]x_0[/math] such that [math]\lambda_{F_{t_0}}(x_0) \leq 1/4[/math] for some small value [math]t_0 \gt 0[/math] (we describe a method to find [math]x_0[/math] at the end of this subsection). Then for [math]k \geq 0[/math], let

[[math]] \begin{eqnarray*} & & t_{k+1} = \left(1 + \frac1{13\sqrt{\nu}}\right) t_k ,\\ & & x_{k+1} = x_k - [\nabla^2 F(x_k)]^{-1} (t_{k+1} c + \nabla F(x_k) ) . \end{eqnarray*} [[/math]]

The next theorem shows that after [math]O\left( \sqrt{\nu} \log \frac{\nu}{t_0 \epsilon} \right)[/math] iterations of the path-following scheme one obtains an [math]\epsilon[/math]-optimal point.

Theorem

The path-following scheme described above satisfies

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \exp\left( - \frac{k}{1+13\sqrt{\nu}} \right) . [[/math]]


Proof

We show that the iterates [math](x_k)_{k \geq 0}[/math] remain close to the central path [math](x^*(t_k))_{k \geq 0}[/math]. Precisely one can easily prove by induction that

[[math]] \lambda_{F_{t_k}}(x_k) \leq 1/4 . [[/math]]
Indeed using Theorem and equation \eqref{eq:trucipm12} one immediately obtains

[[math]] \begin{eqnarray*} \lambda_{F_{t_{k+1}}}(x_{k+1}) & \leq & 2 \lambda_{F_{t_{k+1}}}(x_k)^2 \\ & \leq & 2 \left(\frac{t_{k+1}}{t_k} \lambda_{F_{t_k}}(x_k) + \left(\frac{t_{k+1}}{t_k} - 1\right) \sqrt{\nu}\right)^2 \\ & \leq & 1/4 , \end{eqnarray*} [[/math]]
where we used in the last inequality that [math]t_{k+1} / t_k = 1 + \frac1{13\sqrt{\nu}}[/math] and [math]\nu \geq 1[/math]. Thus using \eqref{eq:trucipm4} one obtains

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu + \sqrt{\nu} / 3 + 1/12}{t_k} \leq \frac{2 \nu}{t_k} . [[/math]]
Observe that [math]t_{k} = \left(1 + \frac1{13\sqrt{\nu}}\right)^{k} t_0[/math], which finally yields

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \left(1 + \frac1{13\sqrt{\nu}}\right)^{- k}. [[/math]]

At this point we still need to explain how one can get close to an initial point [math]x^*(t_0)[/math] of the central path. This can be done with the following rather clever trick. Assume that one has some point [math]y_0 \in \cX[/math]. The observation is that [math]y_0[/math] is on the central path at [math]t=1[/math] for the problem where [math]c[/math] is replaced by [math]- \nabla F(y_0)[/math]. Now instead of following this central path as [math]t \to +\infty[/math], one follows it as [math]t \to 0[/math]. Indeed for [math]t[/math] small enough the central paths for [math]c[/math] and for [math]- \nabla F(y_0)[/math] will be very close. Thus we iterate the following equations, starting with [math]t_0' = 1[/math],

[[math]] \begin{eqnarray*} & & t_{k+1}' = \left(1 - \frac1{13\sqrt{\nu}}\right) t_k' ,\\ & & y_{k+1} = y_k - [\nabla^2 F(y_k)]^{-1} (- t_{k+1}' \nabla F(y_0) + \nabla F(y_k) ) . \end{eqnarray*} [[/math]]

A straightforward analysis shows that for [math]k = O(\sqrt{\nu} \log \nu)[/math], which corresponds to [math]t_k'=1/\nu^{O(1)}[/math], one obtains a point [math]y_k[/math] such that [math]\lambda_{F_{t_k'}}(y_k) \leq 1/4[/math]. In other words one can initialize the path-following scheme with [math]t_0 = t_k'[/math] and [math]x_0 = y_k[/math].
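To fix ideas, here is a minimal sketch of the path-following scheme for an LP, [math]\min \ c^{\top} x[/math] subject to [math]A x \leq b[/math], with the logarithmic barrier [math]F(x) = - \sum_{i} \log(b_i - a_i^{\top} x)[/math] (so [math]\nu[/math] equals the number of constraints, see the next subsection). It assumes a strictly feasible x0 with [math]\lambda_{F_{t_0}}(x_0) \leq 1/4[/math], i.e. that the initialization trick just described has already been run.

```python
import numpy as np

def path_following_lp(c, A, b, x0, t0=1.0, num_iters=500):
    # Path-following IPM for min c^T x s.t. A x <= b, with the log barrier
    # F(x) = -sum_i log(b_i - a_i^T x); assumes lambda_{F_{t0}}(x0) <= 1/4.
    nu = A.shape[0]                               # barrier parameter of the log barrier
    t, x = t0, np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        t *= 1 + 1 / (13 * np.sqrt(nu))           # t_{k+1} = (1 + 1/(13 sqrt(nu))) t_k
        s = 1.0 / (b - A @ x)                     # inverse slacks
        grad_Ft = t * c + A.T @ s                 # gradient of F_t at x_k
        hess_F = A.T @ (s[:, None] ** 2 * A)      # Hessian of F (= Hessian of F_t)
        x = x - np.linalg.solve(hess_F, grad_Ft)  # one Newton step per increase of t
    return x                                      # c^T x is within about 2 nu / t of optimal
```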

IPMs for LPs and SDPs

We have seen that, roughly, the complexity of interior point methods with a [math]\nu[/math]-self-concordant barrier is [math]O\left(M \sqrt{\nu} \log \frac{\nu}{\epsilon} \right)[/math], where [math]M[/math] is the complexity of computing a Newton direction (which can be done by computing and inverting the Hessian of the barrier). Thus the efficiency of the method is directly related to the {\em form} of the self-concordant barrier that one can construct for [math]\mathcal{X}[/math]. It turns out that for LPs and SDPs one has particularly nice self-concordant barriers. Indeed one can show that [math]F(x) = - \sum_{i=1}^n \log x_i[/math] is an [math]n[/math]-self-concordant barrier on [math]\R_{+}^n[/math], and [math]F(X) = - \log \mathrm{det}(X)[/math] is an [math]n[/math]-self-concordant barrier on [math]\mathbb{S}_{+}^n[/math]. See also [20] for a recent improvement of the basic logarithmic barrier for LPs. There is one important issue that we overlooked so far. In most interesting cases LPs and SDPs come with {\em equality constraints}, resulting in a set of constraints [math]\cX[/math] with empty interior. From a theoretical point of view there is an easy fix, which is to reparametrize the problem so as to enforce the variables to live in the subspace spanned by [math]\cX[/math]. This modification also has algorithmic consequences, as the evaluation of the Newton direction will now be different. In fact, rather than doing a reparametrization, one can simply search for Newton directions such that the updated point will stay in [math]\cX[/math]. In other words one now has to solve a convex quadratic optimization problem under linear equality constraints. Luckily using Lagrange multipliers one can find a closed form solution to this problem, and we refer to previous references for more details.
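For concreteness, the closed form alluded to above can be sketched as follows (the notation here is ours): the Newton direction under linear equality constraints is obtained by solving a KKT linear system.

```python
import numpy as np

def constrained_newton_direction(H, g, G):
    # Direction dx minimizing (1/2) dx^T H dx + g^T dx subject to G dx = 0,
    # from the KKT system [[H, G^T], [G, 0]] [dx; lam] = [-g; 0].
    n, p = H.shape[0], G.shape[0]
    K = np.block([[H, G.T], [G, np.zeros((p, p))]])
    rhs = np.concatenate([-g, np.zeros(p)])
    return np.linalg.solve(K, rhs)[:n]   # the Lagrange multipliers are discarded
```

Starting from a point satisfying the equality constraints, every Newton update then stays in the corresponding affine subspace, which is exactly the reparametrization-free approach described above.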

General references

Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].

References

  1. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183--202, 2009.
  2. Y. Nesterov. Gradient methods for minimizing composite objective function. Core discussion papers, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.
  3. S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479--2493, 2009.
  4. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123--231, 2013.
  5. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1--106, 2012.
  6. J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
  7. L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543--2596, 2010.
  8. Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127--152, 2005.
  9. A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229--251, 2004.
  10. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120--145, 2011.
  11. A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 121--147. MIT Press, 2011.
  12. A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problem's structure. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 149--183. MIT Press, 2011.
  13. N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4:373--395, 1984.
  14. Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics (SIAM), 1994.
  15. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
  16. J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization, volume 3. SIAM, 2001.
  17. A. Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 2004.
  18. J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
  19. S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. arXiv preprint arXiv:1412.1587, 2014.
  20. Y.-T. Lee and A. Sidford. Path finding I: Solving linear programs with Õ(sqrt(rank)) linear system solves. arXiv preprint arXiv:1312.6677, 2013.

Notes

  1. We restrict to unconstrained minimization for the sake of simplicity. One can extend the discussion to constrained minimization by using ideas from Section Gradient descent for smooth functions.
  2. Observe that the duality gap is the sum of the primal gap [math]\max_{y \in \mathcal{Y}} \phi(\tx,y) - \phi(x^*,y^*)[/math] and the dual gap [math]\phi(x^*,y^*) - \min_{x \in \mathcal{X}} \phi(x, \ty)[/math].