guide:8940e50025

From Stochiki
<div class="d-none"><math>
\newcommand{\LS}{\mathtt{line\_search}}
\newcommand{\mI}{\mathrm{I}}
\newcommand{\mB}{\mathrm{B}}
\newcommand{\conv}{\mathrm{conv}}
\newcommand{\inte}{\mathrm{int}}
\newcommand{\tg}{\tilde{g}}
\newcommand{\tx}{\tilde{x}}
\newcommand{\ty}{\tilde{y}}
\newcommand{\tz}{\tilde{z}}
\newcommand{\id}{\mathrm{id}}
\newcommand{\K}{\mathrm{KL}}
\newcommand{\kl}{\mathrm{kl}}
\newcommand{\bp}{\boldsymbol{p}}
\newcommand{\tr}{\mathrm{Tr}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\KL}{\mathrm{KL}}
\newcommand{\LG}{\overline{\log}(K)}
\newcommand{\ocP}{\overline{\mathcal{P}}}
\newcommand{\cZ}{\mathcal{Z}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cR}{\mathcal{R}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cM}{\mathcal{M}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\cL}{\mathcal{L}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cJ}{\mathcal{J}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cK}{\mathcal{K}}
\newcommand{\cD}{\mathcal{D}}
\newcommand{\oD}{\overline{\mathcal{D}}}
\newcommand{\oR}{\overline{R}}
\newcommand{\wh}{\widehat}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\todo}{{\bf TO DO } }
\newcommand{\Ber}{\mathop{\mathrm{Ber}}}
\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\newcommand{\beqa}{\begin{eqnarray}}
\newcommand{\eeqa}{\end{eqnarray}}
\newcommand{\beqan}{\begin{eqnarray*}}
\newcommand{\eeqan}{\end{eqnarray*}}
\newcommand{\qed}{\hfill\BlackBox}
\newcommand{\charfct}{\mathds{1}}
\newcommand{\Fcal}{\mathcal{F}}
\newcommand{\Xcal}{\mathcal{X}}
\newcommand{\Hcal}{\mathcal{H}}
\newcommand{\Gcal}{\mathcal{G}}
\newcommand{\Nat}{\mathbb{N}}
\newcommand{\mathds}{\mathbb}</math></div>
\label{beyond}
In the black-box model non-smoothness dramatically deteriorates the rate of convergence of first order methods from <math>1/t^2</math> to <math>1/\sqrt{t}</math>. However, as we already pointed out in Section [[guide:72a2ed5454#sec:structured |Structured optimization]], we (almost) always know the function to be optimized {\em globally}. In particular the ``source" of non-smoothness can often be identified. For instance the LASSO objective (see Section [[guide:72a2ed5454#sec:mlapps |Some convex optimization problems in machine learning]]) is non-smooth, but it is a sum of a smooth part (the least squares fit) and a {\em simple} non-smooth part (the <math>\ell_1</math>-norm). Using this specific structure we will propose in Section [[#sec:simplenonsmooth |Sum of a smooth and a simple non-smooth term]] a first order method with a <math>1/t^2</math> convergence rate, despite the non-smoothness. In Section [[#sec:sprepresentation |Smooth saddle-point representation of a non-smooth function]] we consider another type of non-smoothness that can effectively be overcome, where the function is the maximum of smooth functions. Finally we conclude this chapter with a concise description of interior point methods, for which the structural assumption is made on the constraint set rather than on the objective function.
==<span id="sec:simplenonsmooth"></span>Sum of a smooth and a simple non-smooth term==
We consider here the following problem<ref group="Notes" >We restrict to unconstrained minimization for sake of simplicity. One can extend the discussion to constrained minimization by using ideas from Section [[guide:49d0e37682#sec:gdsmooth |Gradient descent for smooth functions]].</ref>:


<math display="block">
\min_{x \in \R^n} f(x) + g(x) ,
</math>
where <math>f</math> is convex and <math>\beta</math>-smooth, and <math>g</math> is convex. We assume that <math>f</math> can be accessed through a first order oracle, and that <math>g</math> is known and ``simple". What we mean by simplicity will be clear from the description of the algorithm. For instance a separable function, that is <math>g(x) = \sum_{i=1}^n g_i(x(i))</math>, will be considered as simple. The prime example is <math>g(x) = \|x\|_1</math>. This section is inspired from <ref name="BT09"></ref> (see also <ref name="Nes07"></ref><ref name=" WNF09"></ref>).
===ISTA (Iterative Shrinkage-Thresholding Algorithm)===
Recall that gradient descent on the smooth function <math>f</math> can be written as (see [[guide:E3fb0c985a#eq:MDproxview |eq:MDproxview]])
<math display="block">
x_{t+1} = \argmin_{x \in \mathbb{R}^n} \eta \nabla f(x_t)^{\top} x + \frac{1}{2} \|x - x_t\|^2_2 .
</math>
Here one wants to minimize <math>f+g</math>, and <math>g</math> is assumed to be known and ``simple". Thus it seems quite natural to consider the following update rule, where only <math>f</math> is locally approximated with a first order oracle:
<span id{{=}}"eq:proxoperator"/>
<math display="block">
\begin{eqnarray}
x_{t+1} & = & \argmin_{x \in \mathbb{R}^n} \eta (g(x) + \nabla f(x_t)^{\top} x) + \frac{1}{2} \|x - x_t\|^2_2 \notag \\
& = & \argmin_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \label{eq:proxoperator}
\end{eqnarray}
</math>
The algorithm described by the above iteration is known as ISTA (Iterative Shrinkage-Thresholding Algorithm). In terms of convergence rate it is easy to show that ISTA has the same convergence rate on <math>f+g</math> as gradient descent on <math>f</math>. More precisely with <math>\eta=\frac{1}{\beta}</math> one has
<math display="block">
f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} .
</math>
This improved convergence rate over a subgradient descent directly on <math>f+g</math> comes at a price: in general \eqref{eq:proxoperator} may be a difficult optimization problem by itself, and this is why one needs to assume that <math>g</math> is simple. For instance if <math>g</math> can be written as <math>g(x) = \sum_{i=1}^n g_i(x(i))</math> then one can compute <math>x_{t+1}</math> by solving <math>n</math> convex problems in dimension <math>1</math>. In the case where <math>g(x) = \lambda \|x\|_1</math> this one-dimensional problem is given by:
<math display="block">
\min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} .
</math>
Elementary computations show that this problem has an analytical solution given by <math>\tau_{\lambda \eta}(x_0)</math>,
where <math>\tau</math> is the shrinkage operator (hence the name ISTA), defined by
<math display="block">
\tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) .
</math>
Much more is known about \eqref{eq:proxoperator} (which is called the {\em proximal operator} of <math>g</math>), and in fact entire monographs have been written about this equation, see e.g. <ref name="PB13"></ref><ref name=" BJMO12"></ref>.
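To make the discussion concrete, here is a minimal numerical sketch of ISTA on a small LASSO instance. The data, dimensions, and regularization level are our own arbitrary choices; the step size <math>\eta = 1/\beta</math> is computed from the largest eigenvalue of <math>A^{\top}A</math>, the smoothness constant of the least squares part.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Shrinkage operator tau_alpha(x) = (|x| - alpha)_+ sign(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, lam, T=500):
    """ISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1, with eta = 1/beta,
    beta being the largest eigenvalue of A^T A."""
    beta = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = A.T @ (A @ x - b)                         # gradient of the smooth part
        x = soft_threshold(x - grad / beta, lam / beta)  # prox step on the l1 part
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]    # sparse ground truth
b = A @ x_true                   # noiseless observations
x_hat = ista(A, b, lam=0.1)
```

On this noiseless instance the iterates approach a solution close to the sparse ground truth, up to the usual <math>\ell_1</math> shrinkage bias.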
===FISTA (Fast ISTA)===
An obvious idea is to combine Nesterov’s accelerated gradient descent (which results in a <math>1/t^2</math> rate to optimize <math>f</math>) with ISTA. This results in FISTA (Fast ISTA) which is described as follows. Let
<math display="block">
\lambda_0 = 0, \ \lambda_{t} = \frac{1 + \sqrt{1+ 4 \lambda_{t-1}^2}}{2}, \ \text{and} \  \gamma_t = \frac{1 - \lambda_t}{\lambda_{t+1}}.
</math>
Let <math>x_1 = y_1</math> be an arbitrary initial point, and
<math display="block">
\begin{eqnarray*}
y_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \|x - (x_t - \frac1{\beta} \nabla f(x_t)) \|_2^2 , \\
x_{t+1} & = & (1 - \gamma_t) y_{t+1} + \gamma_t y_t .
\end{eqnarray*}
</math>
Again it is easy to show that the rate of convergence of FISTA on <math>f+g</math> is similar to the one of Nesterov's accelerated gradient descent on <math>f</math>; more precisely:
<math display="block">
f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|_2^2}{t^2} .
</math>
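The <math>\lambda_t</math> and <math>\gamma_t</math> recursion can be sketched in a few lines. The following snippet (a toy LASSO instance of our own choosing, not part of the original text) implements the recursion exactly as stated, with <math>\eta = 1/\beta</math>:

```python
import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fista(A, b, lam, T=200):
    """FISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    beta = np.linalg.norm(A, 2) ** 2     # smoothness of the quadratic part
    x = np.zeros(A.shape[1])
    y = x.copy()
    lam_prev = 0.0                       # lambda_0 = 0
    for _ in range(T):
        grad = A.T @ (A @ x - b)
        y_new = soft_threshold(x - grad / beta, lam / beta)  # prox-gradient step
        lam_t = (1 + np.sqrt(1 + 4 * lam_prev ** 2)) / 2
        lam_next = (1 + np.sqrt(1 + 4 * lam_t ** 2)) / 2
        gamma = (1 - lam_t) / lam_next                       # gamma_t <= 0
        x = (1 - gamma) * y_new + gamma * y                  # momentum step
        y, lam_prev = y_new, lam_t
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = fista(A, b, lam=0.1)
```

Note that <math>x_{t+1} = (1-\gamma_t) y_{t+1} + \gamma_t y_t</math> is the usual momentum extrapolation written in Bubeck's convention: it equals <math>y_{t+1} + \frac{\lambda_t - 1}{\lambda_{t+1}}(y_{t+1} - y_t)</math>.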
===CMD and RDA===
ISTA and FISTA assume smoothness in the Euclidean metric. Quite naturally one can also use these ideas in a non-Euclidean setting. Starting from [[guide:E3fb0c985a#eq:MDproxview |eq:MDproxview]] one obtains the CMD (Composite Mirror Descent) algorithm of <ref name="DSSST10"></ref>, while with [[guide:E3fb0c985a#eq:DA0 |eq:DA0]] one obtains the RDA (Regularized Dual Averaging) of <ref name="Xia10"></ref>. We refer to these papers for more details.
==<span id="sec:sprepresentation"></span>Smooth saddle-point representation of a non-smooth function==
Quite often the non-smoothness of a function <math>f</math> comes from a <math>\max</math> operation. More precisely non-smooth functions can often be represented as
<span id{{=}}"eq:sprepresentation"/>
<math display="block">
\begin{equation} \label{eq:sprepresentation}
f(x) = \max_{1 \leq i \leq m} f_i(x) ,
\end{equation}
</math>
where the functions <math>f_i</math> are smooth. This was the case for instance with the function we used to prove the black-box lower bound <math>1/\sqrt{t}</math> for non-smooth optimization in [[guide:49d0e37682#th:lb1 |Theorem]]. We will see now that by using this structural representation one can in fact attain a rate of <math>1/t</math>. This was first observed in <ref name="Nes04b"></ref>, which proposed Nesterov's smoothing technique. Here we will present the alternative method of <ref name="Nem04"></ref>, which we find more transparent (yet another version is the Chambolle-Pock algorithm, see <ref name="CP11"></ref>). Most of what is described in this section can be found in <ref name="JN11a"></ref><ref name=" JN11b"></ref>.
In the next subsection we introduce the more general problem of saddle point computation. We then proceed to apply a modified version of mirror descent to this problem, which will be useful both in [[guide:474e0143f8#rand |Chapter]] and also as a warm-up for the more powerful modified mirror prox that we introduce next.
===<span id="sec:sp"></span>Saddle point computation===
Let <math>\cX \subset \R^n</math>, <math>\cY \subset \R^m</math> be compact and convex sets. Let <math>\phi : \cX \times \cY \rightarrow \mathbb{R}</math> be a continuous function, such that <math>\phi(\cdot, y)</math> is convex and <math>\phi(x, \cdot)</math> is concave. We write <math>g_{\cX}(x,y)</math> (respectively <math>g_{\cY}(x,y)</math>) for an element of <math>\partial_x \phi(x,y)</math> (respectively <math>\partial_y (-\phi(x,y))</math>). We are interested in computing
<math display="block">
\min_{x \in \cX} \max_{y \in \cY} \phi(x,y) .
</math>
By Sion's minimax theorem there exists a pair <math>(x^*, y^*) \in \cX \times \cY</math> such that
<math display="block">
\phi(x^*,y^*) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(x,y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x,y) .
</math>
We will explore algorithms that produce a candidate pair of solutions <math>(\tx, \ty) \in \cX \times \cY</math>. The quality of <math>(\tx, \ty)</math> is evaluated through the so-called duality gap<ref group="Notes" >Observe that the duality gap is the sum of the primal gap <math>\max_{y \in \mathcal{Y}} \phi(\tx,y) - \phi(x^*,y^*)</math> and the dual gap <math>\phi(x^*,y^*) - \min_{x \in \mathcal{X}} \phi(x, \ty)</math>.</ref>
<math display="block">
\max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) .
</math>
The key observation is that the duality gap can be controlled similarly to the suboptimality gap <math>f(x) - f(x^*)</math> in a simple convex optimization problem. Indeed for any <math>(x, y) \in \cX \times \cY</math>,
<math display="block">
\phi(\tx,\ty) - \phi(x,\ty) \leq g_{\cX}(\tx,\ty)^{\top} (\tx-x),
</math>
and
<math display="block">
- \phi(\tx,\ty) - (- \phi(\tx,y)) \leq g_{\cY}(\tx,\ty)^{\top} (\ty-y) .
</math>
In particular, using the notation <math>z = (x,y) \in \cZ := \cX \times \cY</math> and <math>g(z) = (g_{\cX}(x,y), g_{\cY}(x,y))</math> we just proved
<span id{{=}}"eq:keysp"/>
<math display="block">
\begin{equation} \label{eq:keysp}
\max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) \leq g(\tz)^{\top} (\tz - z) ,
\end{equation}
</math>
for some <math>z \in \mathcal{Z}.</math> In view of the vector field point of view developed in Section [[guide:E3fb0c985a#sec:vectorfield |The vector field point of view on MD, DA, and MP]] this suggests performing mirror descent in the <math>\cZ</math>-space with the vector field <math>g : \cZ \rightarrow \R^n \times \R^m</math>.
We will assume in the next subsections that <math>\cX</math> is equipped with a mirror map <math>\Phi_{\cX}</math> (defined on <math>\cD_{\cX}</math>) which is <math>1</math>-strongly convex w.r.t. a norm <math>\|\cdot\|_{\cX}</math> on <math>\cX \cap \cD_{\cX}</math>. We denote <math>R^2_{\cX} = \sup_{x \in \cX} \Phi_{\cX}(x) - \min_{x \in \cX} \Phi_{\cX}(x)</math>. We define similar quantities for the space <math>\cY</math>.
===<span id="sec:spmd"></span>Saddle Point Mirror Descent (SP-MD)===
We consider here mirror descent on the space <math>\cZ = \cX \times \cY</math> with the mirror map <math>\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)</math> (defined on <math>\cD = \cD_{\cX} \times \cD_{\cY}</math>), where <math>a, b \in \R_+</math> are to be defined later, and with the vector field <math>g : \cZ \rightarrow \R^n \times \R^m</math> defined in the previous subsection. We call the resulting algorithm SP-MD (Saddle Point Mirror Descent). It can be described succinctly as follows.
Let <math>z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)</math>. Then for <math>t \geq 1</math>, let
<math display="block">
z_{t+1} \in \argmin_{z \in \cZ \cap \cD} \ \eta g_t^{\top} z + D_{\Phi}(z,z_t) ,
</math>
where <math>g_t = (g_{\cX,t}, g_{\cY,t})</math> with <math>g_{\cX,t} \in \partial_x \phi(x_t,y_t)</math> and <math>g_{\cY,t} \in \partial_y (- \phi(x_t,y_t))</math>.
{{proofcard|Theorem|th:spmd|Assume that <math>\phi(\cdot, y)</math> is <math>L_{\cX}</math>-Lipschitz w.r.t. <math>\|\cdot\|_{\cX}</math>, that is <math>\|g_{\cX}(x,y)\|_{\cX}^* \leq L_{\cX}, \forall (x, y) \in \cX \times \cY</math>. Similarly assume that <math>\phi(x, \cdot)</math> is <math>L_{\cY}</math>-Lipschitz w.r.t. <math>\|\cdot\|_{\cY}</math>. Then SP-MD with <math>a= \frac{L_{\cX}}{R_{\cX}}</math>, <math>b=\frac{L_{\cY}}{R_{\cY}}</math>, and <math>\eta=\sqrt{\frac{2}{t}}</math> satisfies
<math display="block">
\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq (R_{\cX} L_{\cX} + R_{\cY} L_{\cY}) \sqrt{\frac{2}{t}}.
</math>
|First we endow <math>\mathcal{Z}</math> with the norm <math>\|\cdot\|_{\cZ}</math> defined by
<math display="block">
\|z\|_{\cZ} = \sqrt{a \|x\|_{\mathcal{X}}^2 + b \|y\|_{\mathcal{Y}}^2} .
</math>
It is immediate that <math>\Phi</math> is <math>1</math>-strongly convex with respect to <math>\|\cdot\|_{\mathcal{Z}}</math> on <math>\mathcal{Z} \cap \mathcal{D}</math>. Furthermore one can easily check that
<math display="block">
\|z\|_{\mathcal{Z}}^* = \sqrt{\frac1{a} \left(\|x\|_{\mathcal{X}}^*\right)^2 + \frac1{b} \left(\|y\|_{\mathcal{Y}}^*\right)^2} ,
</math>
and thus the vector field <math>(g_t)</math> used in the SP-MD satisfies:
<math display="block">
\|g_t\|_{\mathcal{Z}}^* \leq \sqrt{\frac{L_{\mathcal{X}}^2}{a} + \frac{L_{\mathcal{Y}}^2}{b}} .
</math>
Using [[guide:E3fb0c985a#eq:vfMD |eq:vfMD]] together with \eqref{eq:keysp} and the values of <math>a, b</math> and <math>\eta</math> concludes the proof.}}
===Saddle Point Mirror Prox (SP-MP)===
We now consider the most interesting situation in the context of this chapter, where the function <math>\phi</math> is smooth. Precisely we say that <math>\phi</math> is <math>(\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})</math>-smooth if for any <math>x, x' \in \cX, y, y' \in \cY</math>,
<math display="block">
\begin{align*}
& \|\nabla_x \phi(x,y) - \nabla_x \phi(x',y) \|_{\mathcal{X}}^* \leq \beta_{11} \|x-x'\|_{\mathcal{X}} , \\
& \|\nabla_x \phi(x,y) - \nabla_x \phi(x,y') \|_{\mathcal{X}}^* \leq \beta_{12} \|y-y'\|_{\mathcal{Y}} , \\
& \|\nabla_y \phi(x,y) - \nabla_y \phi(x,y') \|_{\mathcal{Y}}^* \leq \beta_{22} \|y-y'\|_{\mathcal{Y}} , \\
& \|\nabla_y \phi(x,y) - \nabla_y \phi(x',y) \|_{\mathcal{Y}}^* \leq \beta_{21} \|x-x'\|_{\mathcal{X}} .
\end{align*}
</math>
This will imply the Lipschitzness of the vector field <math>g : \cZ \rightarrow \R^n \times \R^m</math> under the appropriate norm. Thus we use here mirror prox on the space <math>\cZ</math> with the mirror map <math>\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)</math> and the vector field <math>g</math>. The resulting algorithm is called SP-MP (Saddle Point Mirror Prox) and we can describe it succinctly as follows.
Let <math>z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)</math>. Then for <math>t \geq 1</math>, let <math>z_t=(x_t,y_t)</math> and <math>w_t=(u_t, v_t)</math> be defined by
<math display="block">
\begin{eqnarray*}
w_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(x_t, y_t), - \nabla_y \phi(x_t,y_t))^{\top} z + D_{\Phi}(z,z_t) \\
z_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(u_{t+1}, v_{t+1}), - \nabla_y \phi(u_{t+1},v_{t+1}))^{\top} z + D_{\Phi}(z,z_t) .
\end{eqnarray*}
</math>
{{proofcard|Theorem|th:spmp|Assume that <math>\phi</math> is <math>(\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})</math>-smooth. Then SP-MP with <math>a= \frac{1}{R_{\cX}^2}</math>, <math>b=\frac{1}{R_{\cY}^2}</math>, and
<math>\eta= 1 / \left(2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \right)</math>
satisfies
<math display="block">
\begin{align*}
& \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t u_{s+1},y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t v_{s+1} \right) \\
& \leq \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \frac{4}{t} .
\end{align*}
</math>
|In light of the proof of [[#th:spmd |Theorem]] and [[guide:E3fb0c985a#eq:vfMP |eq:vfMP]] it clearly suffices to show that the vector field <math>g(z) = (\nabla_x \phi(x,y), - \nabla_y \phi(x,y))</math> is <math>\beta</math>-Lipschitz w.r.t. <math>\|z\|_{\cZ} = \sqrt{\frac{1}{R_{\cX}^2} \|x\|_{\mathcal{X}}^2 + \frac{1}{R_{\cY}^2} \|y\|_{\mathcal{Y}}^2}</math> with <math>\beta = 2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right)</math>. In other words one needs to show that
<math display="block">
\|g(z) - g(z')\|_{\cZ}^* \leq \beta \|z - z'\|_{\cZ} ,
</math>
which can be done with straightforward calculations (by introducing <math>g(x',y)</math> and using the definition of smoothness for <math>\phi</math>).}}
===<span id="sec:spex"></span>Applications===
We investigate briefly three applications for SP-MD and SP-MP.
====<span id="sec:spex1"></span>Minimizing a maximum of smooth functions====
The problem \eqref{eq:sprepresentation} (when <math>f</math> has to be minimized over <math>\cX</math>) can be rewritten as
<math display="block">
\min_{x \in \cX} \max_{y \in \Delta_m} \vec{f}(x)^{\top} y ,
</math>
where <math>\vec{f}(x) = (f_1(x), \hdots, f_m(x)) \in \R^m</math>. We assume that the functions <math>f_i</math> are <math>L</math>-Lipschitz and <math>\beta</math>-smooth w.r.t. some norm <math>\|\cdot\|_{\cX}</math>. Let us study the smoothness of <math>\phi(x,y) = \vec{f}(x)^{\top} y</math> when <math>\cX</math> is equipped with <math>\|\cdot\|_{\cX}</math> and <math>\Delta_m</math> is equipped with <math>\|\cdot\|_1</math>. On the one hand <math>\nabla_y \phi(x,y) = \vec{f}(x)</math>, in particular one immediately has <math>\beta_{22}=0</math>, and furthermore
<math display="block">
\|\vec{f}(x)  - \vec{f}(x') \|_{\infty} \leq L \|x-x'\|_{\mathcal{X}} ,
</math>
that is <math>\beta_{21}=L</math>. On the other hand <math>\nabla_x \phi(x,y) = \sum_{i=1}^m y_i \nabla f_i(x)</math>, and thus
<math display="block">
\begin{align*}
& \|\sum_{i=1}^m y(i) (\nabla f_i(x) - \nabla f_i(x')) \|_{\cX}^* \leq \beta \|x-x'\|_{\cX} , \\
& \|\sum_{i=1}^m (y(i)-y'(i)) \nabla f_i(x) \|_{\cX}^* \leq L\|y-y'\|_1 ,
\end{align*}
</math>
that is <math>\beta_{11} = \beta</math> and <math>\beta_{12} = L</math>. Thus using SP-MP with some mirror map on <math>\cX</math> and the negentropy on <math>\Delta_m</math> (see the ``simplex setup" in Section [[guide:E3fb0c985a#sec:mdsetups |Standard setups for mirror descent]]), one obtains an <math>\epsilon</math>-optimal point of <math>f(x) = \max_{1 \leq i \leq m} f_i(x)</math> in <math>O\left(\frac{\beta R_{\cX}^2 + L R_{\cX} \sqrt{\log(m)}}{\epsilon} \right)</math> iterations. Furthermore an iteration of SP-MP has a computational complexity of order of a step of mirror descent in <math>\cX</math> on the function <math>x \mapsto \sum_{i=1}^m y(i) f_i(x)</math> (plus <math>O(m)</math> for the update in the <math>\cY</math>-space).
Thus by using the structure of <math>f</math> we were able to obtain a much better rate than black-box procedures (which would have required <math>\Omega(1/\epsilon^2)</math> iterations as <math>f</math> is potentially non-smooth).
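As a sanity check, SP-MP can be run on a tiny instance of this problem. The sketch below is our own toy example (the centers, radius, and hand-picked step sizes replace the theorem's constants): it takes <math>f_i(x) = \frac12\|x - c_i\|_2^2</math> over a Euclidean ball, with projected-gradient steps on <math>\cX</math> and multiplicative (negentropy) steps on <math>\Delta_m</math>. For this instance the optimal value of <math>\min_x \max_i f_i(x)</math> is <math>1/2</math>, attained at <math>x = 0</math>.

```python
import numpy as np

# Toy instance: f_i(x) = 0.5*||x - c_i||^2, X = Euclidean ball of radius R.
C = np.array([[1.0, 0.0], [-1.0, 0.0]])   # centers c_1, c_2
R = 2.0                                    # radius of the ball X

def proj_ball(x, R):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def fvec(x):                   # (f_1(x), ..., f_m(x))
    return 0.5 * np.sum((x - C) ** 2, axis=1)

def grad_x(x, y):              # nabla_x phi(x, y) = sum_i y_i (x - c_i)
    return y @ (x - C)

eta_x, eta_y, T = 0.1, 0.1, 3000           # hand-picked step sizes
x, y = np.array([1.5, 0.5]), np.full(2, 0.5)
u_avg = np.zeros(2)
for _ in range(T):
    # extrapolation step (w_{t+1}): gradients taken at (x_t, y_t)
    u = proj_ball(x - eta_x * grad_x(x, y), R)
    v = y * np.exp(eta_y * fvec(x)); v /= v.sum()
    # update step (z_{t+1}): gradients taken at (u_{t+1}, v_{t+1})
    x = proj_ball(x - eta_x * grad_x(u, v), R)
    y = y * np.exp(eta_y * fvec(u)); y /= y.sum()
    u_avg += u / T
# max_i f_i at the averaged iterate u_avg is close to the optimal value 0.5
```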
====<span id="sec:spex2"></span>Matrix games====
Let <math>A \in \R^{n \times m}</math>; we write <math>\|A\|_{\mathrm{max}}</math> for the maximal entry (in absolute value) of <math>A</math>, and <math>A_i \in \R^n</math> for the <math>i^{th}</math> column of <math>A</math>. We consider the problem of computing a Nash equilibrium for the zero-sum game corresponding to the loss matrix <math>A</math>, that is we want to solve
<math display="block">
\min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y .
</math>
Here we equip both <math>\Delta_n</math> and <math>\Delta_m</math> with <math>\|\cdot\|_1</math>. Let <math>\phi(x,y) = x^{\top} A y</math>. Using that <math>\nabla_x \phi(x,y) = Ay</math> and <math>\nabla_y \phi(x,y) = A^{\top} x</math> one immediately obtains <math>\beta_{11} = \beta_{22} = 0</math>. Furthermore since
<math display="block">
\|A(y - y') \|_{\infty} = \|\sum_{i=1}^m (y(i) - y'(i)) A_i \|_{\infty} \leq \|A\|_{\mathrm{max}} \|y - y'\|_1 ,
</math>
one also has <math>\beta_{12} = \beta_{21} = \|A\|_{\mathrm{max}}</math>. Thus SP-MP with the negentropy on both <math>\Delta_n</math> and <math>\Delta_m</math> attains an <math>\epsilon</math>-optimal pair of mixed strategies with <math>O\left(\|A\|_{\mathrm{max}} \sqrt{\log(n) \log(m)} / \epsilon \right)</math> iterations. Furthermore the computational complexity of a step of SP-MP is dominated by the matrix-vector multiplications which are <math>O(n m)</math>. Thus overall the complexity of getting an <math>\epsilon</math>-optimal Nash equilibrium with SP-MP is <math>O\left(\|A\|_{\mathrm{max}} n m \sqrt{\log(n) \log(m)} / \epsilon  \right)</math>.
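The entropic SP-MP updates on both simplices are multiplicative and take only a few lines. Here is a minimal sketch on rock-paper-scissors (our own choice of game; the step size is hand-picked rather than set from the bound above), monitoring the duality gap of the averaged iterates:

```python
import numpy as np

# Zero-sum game min_{x in Delta_n} max_{y in Delta_m} x^T A y.
# Rock-paper-scissors: value 0, uniform equilibrium.
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
n, m = A.shape

eta, T = 0.25, 2000                 # hand-picked entropic step size
x = np.full(n, 1.0 / n)
y = np.full(m, 1.0 / m)
x_avg, y_avg = np.zeros(n), np.zeros(m)
for _ in range(T):
    # extrapolation step: gradients A y and A^T x taken at (x, y)
    u = x * np.exp(-eta * (A @ y)); u /= u.sum()
    v = y * np.exp(eta * (A.T @ x)); v /= v.sum()
    # update step: gradients taken at (u, v)
    x = x * np.exp(-eta * (A @ v)); x /= x.sum()
    y = y * np.exp(eta * (A.T @ u)); y /= y.sum()
    x_avg += u / T
    y_avg += v / T

# duality gap of the averaged strategies:
gap = np.max(A.T @ x_avg) - np.min(A @ y_avg)
```

The duality gap <math>\max_{y} x̄^{\top} A y - \min_{x} x^{\top} A ȳ</math> of the averaged pair shrinks at the <math>1/t</math> rate of mirror prox.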
====<span id="sec:spex3"></span>Linear classification====
Let <math>(\ell_i, A_i) \in \{-1,1\} \times \R^n</math>, <math>i \in [m]</math>, be a data set that one wishes to separate with a linear classifier. That is one is looking for <math>x \in \mB_{2,n}</math> such that for all <math>i \in [m]</math>, <math>\mathrm{sign}(x^{\top} A_i) = \mathrm{sign}(\ell_i)</math>, or equivalently <math>\ell_i x^{\top} A_i  >  0</math>. Clearly without loss of generality one can assume <math>\ell_i = 1</math> for all <math>i \in [m]</math> (simply replace <math>A_i</math> by <math>\ell_i A_i</math>). Let <math>A \in \R^{n \times m}</math> be the matrix where the <math>i^{th}</math> column is <math>A_i</math>. The problem of finding <math>x</math> with maximal margin can be written as
<span id{{=}}"eq:linearclassif"/>
<math display="block">
\begin{equation} \label{eq:linearclassif}
\max_{x \in \mB_{2,n}} \min_{1 \leq i \leq m} A_i^{\top} x = \max_{x \in \mB_{2,n}} \min_{y \in \Delta_m} x^{\top} A y .
\end{equation}
</math>
Assuming that <math>\|A_i\|_2 \leq B</math>, and using the calculations we did in Section [[#sec:spex1 |Minimizing a maximum of smooth functions]], it is clear that <math>\phi(x,y) = x^{\top} A y</math> is <math>(0, B, 0, B)</math>-smooth with respect to <math>\|\cdot\|_2</math> on <math>\mB_{2,n}</math> and <math>\|\cdot\|_1</math> on <math>\Delta_m</math>. This implies in particular that SP-MP with the Euclidean norm squared on <math>\mB_{2,n}</math> and the negentropy on <math>\Delta_m</math> will solve \eqref{eq:linearclassif} in <math>O(B \sqrt{\log(m)} / \epsilon)</math> iterations. Again the cost of an iteration is dominated by the matrix-vector multiplications, which results in an overall complexity of <math>O(B n m \sqrt{\log(m)} / \epsilon)</math> to find an <math>\epsilon</math>-optimal solution to \eqref{eq:linearclassif}.
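A toy run of SP-MP on the max-margin problem above looks as follows; the three data points and the step size are our own choices (for this configuration the optimal direction is <math>(1,0)</math> with margin <math>1</math>). The <math>x</math>-player ascends with Euclidean projections onto the unit ball while the <math>y</math>-player descends multiplicatively on <math>\Delta_m</math>:

```python
import numpy as np

# Columns A_i of A are the (label-multiplied) data points; we solve
# max_{||x||_2 <= 1} min_{y in Delta_m} x^T A y with SP-MP.
A = np.array([[1.0, 1.0, 2.0],
              [0.5, -0.5, 0.0]])   # optimal direction (1, 0), margin 1
n, m = A.shape

def proj_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

eta, T = 0.1, 2000                 # hand-picked step size
x, y = np.zeros(n), np.full(m, 1.0 / m)
x_avg = np.zeros(n)
for _ in range(T):
    # extrapolation: gradient ascent in x, entropic descent in y, at (x, y)
    u = proj_ball(x + eta * (A @ y))
    v = y * np.exp(-eta * (A.T @ x)); v /= v.sum()
    # update using the gradients at (u, v)
    x = proj_ball(x + eta * (A @ v))
    y = y * np.exp(-eta * (A.T @ u)); y /= y.sum()
    x_avg += u / T

margin = np.min(A.T @ x_avg)       # worst-case margin of the averaged direction
```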
==<span id="sec:IPM"></span>Interior point methods==
We describe here interior point methods (IPM), a class of algorithms fundamentally different from what we have seen so far. The first algorithm of this type was described in <ref name="Kar84"></ref>, but the theory we shall present was developed in <ref name="NN94"></ref>. We follow closely the presentation given in [Chapter 4, <ref name="Nes04"></ref>]. Other useful references (in particular for the primal-dual IPM, which are the ones used in practice) include <ref name="Ren01"></ref><ref name=" Nem04b"></ref><ref name=" NW06"></ref>.
IPM are designed to solve convex optimization problems of the form
<math display="block">
\begin{align*}
& \mathrm{min.} \; c^{\top} x \\
& \text{s.t.} \; x \in \cX ,
\end{align*}
</math>
with <math>c \in \R^n</math>, and <math>\cX \subset \R^n</math> convex and compact.
Note that, at this point, the linearity of the objective is without loss of generality as minimizing a convex function <math>f</math> over <math>\cX</math> is equivalent to minimizing a linear objective over the epigraph of <math>f</math> (which is also a convex set). The structural assumption on <math>\cX</math> that one makes in IPM is that there exists a {\em self-concordant barrier} for <math>\cX</math> with an easily computable gradient and Hessian. The meaning of the previous sentence will be made precise in the next subsections. The importance of IPM stems from the fact that LPs and SDPs (see Section [[guide:72a2ed5454#sec:structured |Structured optimization]]) satisfy this structural assumption.
===<span id="sec:barriermethod"></span>The barrier method===
We say that <math>F : \inte(\cX) \rightarrow \R</math> is a {\em barrier} for <math>\cX</math> if
<math display="block">
F(x) \xrightarrow[x \to \partial \cX]{} +\infty .
</math>
We will only consider strictly convex barriers. We extend the domain of definition of <math>F</math> to <math>\R^n</math> with <math>F(x) = +\infty</math> for <math>x \not\in \inte(\cX)</math>. For <math>t \in \R_+</math> let
<math display="block">
x^*(t) \in \argmin_{x \in \R^n} t c^{\top} x + F(x) .
</math>
In the following we denote <math>F_t(x) := t c^{\top} x + F(x)</math>.
In IPM the path <math>(x^*(t))_{t \in \R_+}</math> is referred to as the {\em central path}. It seems clear that the central path eventually leads to the minimum <math>x^*</math> of the objective function <math>c^{\top} x</math> on <math>\cX</math>; precisely, we will have
<math display="block">
x^*(t) \xrightarrow[t \to +\infty]{} x^* .
</math>
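The central path can be computed explicitly in a one-dimensional example of our own choosing: minimize <math>x</math> over <math>\cX = [-1,1]</math> with the logarithmic barrier <math>F(x) = -\log(1-x) - \log(1+x)</math>. Then <math>F_t'(x) = t + \frac{2x}{1-x^2}</math>, so <math>x^*(t) = \frac{1 - \sqrt{1+t^2}}{t}</math>, which indeed tends to the minimizer <math>-1</math>. The sketch below follows the path by warm-starting Newton's method at the previous central point, as the barrier method prescribes, and can be checked against the closed form:

```python
import math

def newton_Ft(t, x0, iters=20):
    """Newton's method on F_t(x) = t*x - log(1-x) - log(1+x)."""
    x = x0
    for _ in range(iters):
        g = t + 1.0 / (1.0 - x) - 1.0 / (1.0 + x)        # F_t'(x)
        h = 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2  # F_t''(x) > 0
        x -= g / h
    return x

x = 0.0                    # analytic center x*(0) = argmin F
t = 1.0
path = []
for _ in range(30):
    x = newton_Ft(t, x)    # warm-start Newton at the previous central point
    path.append((t, x))
    t *= 1.5               # increase t geometrically, as in the barrier method
```

The geometric growth factor <math>1.5</math> is an arbitrary practical choice here; the theory below quantifies how aggressively <math>t</math> may grow while keeping the warm start in Newton's basin of fast convergence.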
The idea of the {\em barrier method} is to move along the central path by ``boosting" a fast locally convergent algorithm, which we denote for the moment by <math>\cA</math>, using the following scheme: Assume that one has computed <math>x^*(t)</math>, then one uses <math>\cA</math> initialized at <math>x^*(t)</math> to compute <math>x^*(t')</math> for some <math>t' > t</math>. There is a clear tension for the choice of <math>t'</math>: on the one hand <math>t'</math> should be large in order to make as much progress as possible on the central path, but on the other hand <math>x^*(t)</math> needs to be close enough to <math>x^*(t')</math> so that it is in the basin of fast convergence for <math>\cA</math> when run on <math>F_{t'}</math>.
IPM follows the above methodology with <math>\cA</math> being {\em Newton's method}. Indeed as we will see in the next subsection, Newton's method has a quadratic convergence rate, in the sense that if initialized close enough to the optimum it attains an <math>\epsilon</math>-optimal point in <math>\log\log(1/\epsilon)</math> iterations! Thus we now have a clear plan to make these ideas formal and analyze the iteration complexity of IPM:
<ul><li> First we need to describe precisely the region of fast convergence for Newton's method. This will lead us to define self-concordant functions, which are ``natural" functions for Newton's method.
</li>
<li> Then we need to evaluate precisely how much larger <math>t'</math> can be compared to <math>t</math>, so that <math>x^*(t)</math> is still in the region of fast convergence of Newton's method when optimizing the function <math>F_{t'}</math> with <math>t' > t</math>. This will lead us to define <math>\nu</math>-self concordant barriers.
</li>
<li> How do we get close to the central path in the first place? Is it possible to compute <math>x^*(0) = \argmin_{x \in \R^n} F(x)</math> (the so-called analytical center of <math>\mathcal{X}</math>)?
</li>
</ul>
===<span id="sec:tradanalysisNM"></span>Traditional analysis of Newton's method===
We start by describing Newton's method together with its standard analysis showing the quadratic convergence rate when initialized close enough to the optimum. In this subsection we denote <math>\|\cdot\|</math> for both the Euclidean norm on <math>\R^n</math> and the operator norm on matrices (in particular <math>\|A x\| \leq \|A\| \cdot \|x\|</math>).
Let <math>f: \R^n \rightarrow \R</math> be a <math>C^2</math> function.
Using a Taylor expansion of <math>f</math> around <math>x</math> one obtains
<math display="block">
f(x+h) = f(x) + h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h + o(\|h\|^2) .
</math>
Thus, starting at <math>x</math>, in order to minimize <math>f</math> it seems natural to move in the direction <math>h</math> that minimizes
<math display="block">
h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h .
</math>
If <math>\nabla^2 f(x)</math> is positive definite then the solution to this problem is given by <math>h = - [\nabla^2 f(x)]^{-1} \nabla f(x)</math>. Newton's method simply iterates this idea: starting at some point <math>x_0 \in \R^n</math>, it iterates for <math>k \geq 0</math> the following equation:
<math display="block">
x_{k+1} = x_k  - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) .
</math>
While this method can behave arbitrarily badly in general, if started close enough to a strict local minimum of <math>f</math> it converges very quickly:
{{proofcard|Theorem|th:NM|Assume that <math>f</math> has a Lipschitz Hessian, that is <math>\| \nabla^2 f(x) - \nabla^2 f(y) \| \leq M \|x - y\|</math>. Let <math>x^*</math> be a local minimum of <math>f</math> with strictly positive Hessian, that is <math>\nabla^2 f(x^*) \succeq \mu \mI_n</math>, <math>\mu  >  0</math>. Suppose that the initial starting point <math>x_0</math> of Newton's method is such that
<math display="block">
\|x_0 - x^*\| \leq \frac{\mu}{2 M} .
</math>
Then Newton's method is well-defined and converges to <math>x^*</math> at a quadratic rate:
<math display="block">
\|x_{k+1} - x^*\| \leq \frac{M}{\mu} \|x_k - x^*\|^2.
</math>
|We use the following simple formula, for <math>x, h \in \R^n</math>,
<math display="block">
\int_0^1 \nabla^2 f(x + s h) \ h \ ds = \nabla f(x+h) - \nabla f(x) .
</math>
Now note that <math>\nabla f(x^*) = 0</math>, and thus with the above formula one obtains
<math display="block">
\nabla f(x_k) = \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds ,
</math>
which allows us to write:
<math display="block">
\begin{align*}
& x_{k+1} - x^* \\
& = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) \\
& = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds \\
& = [\nabla^2 f(x_k)]^{-1} \int_0^1 [\nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) ] \ (x_k - x^*) \ ds .
\end{align*}
</math>
In particular one has
<math display="block">
\begin{align*}
& \|x_{k+1} - x^*\| \\
& \leq \|[\nabla^2 f(x_k)]^{-1}\| \\
& \times \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \|x_k - x^* \|.
\end{align*}
</math>
Using the Lipschitz property of the Hessian one immediately obtains that
<math display="block">
\left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \leq \frac{M}{2} \|x_k - x^*\| .
</math>
Using again the Lipschitz property of the Hessian (note that <math>\|A - B\| \leq s \Leftrightarrow s \mI_n \succeq A - B \succeq - s \mI_n</math>), the hypothesis on <math>x^*</math>, and an induction hypothesis that <math>\|x_k - x^*\| \leq \frac{\mu}{2M}</math>, one has
<math display="block">
\nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M \|x_k - x^*\| \mI_n \succeq (\mu - M \|x_k - x^*\|) \mI_n \succeq \frac{\mu}{2} \mI_n ,
</math>
Combining the above three displays yields <math>\|x_{k+1} - x^*\| \leq \frac{2}{\mu} \cdot \frac{M}{2} \|x_k - x^*\|^2 = \frac{M}{\mu} \|x_k - x^*\|^2</math>, which concludes the proof (this also propagates the induction hypothesis, since <math>\frac{M}{\mu} \|x_k - x^*\|^2 \leq \frac12 \|x_k - x^*\| \leq \frac{\mu}{2M}</math>).}}
===Self-concordant functions===
Before giving the definition of self-concordant functions let us try to get some insight into the "geometry" of Newton's method. Let <math>A</math> be an <math>n \times n</math> non-singular matrix. We look at a Newton step on the functions <math>f: x \mapsto f(x)</math> and <math>\phi: y \mapsto f(A^{-1} y)</math>, starting respectively from <math>x</math> and <math>y= A x</math>, that is:
<math display="block">
x^+ = x  - [\nabla^2 f(x)]^{-1} \nabla f(x) , \; \text{and} \; y^+ = y  - [\nabla^2 \phi(y)]^{-1} \nabla \phi(y) .
</math>
By using the following simple formulas
<math display="block">
\nabla (x \mapsto f(A x) ) = A^{\top} \nabla f(A x) , \; \text{and} \; \nabla^2 (x \mapsto f(A x) ) = A^{\top} \nabla^2 f(A x) A ,
</math>
it is easy to show that
<math display="block">
y^+ = A x^+ .
</math>
In other words Newton's method will follow the same trajectory in the "<math>x</math>-space" and in the "<math>y</math>-space" (the image through <math>A</math> of the <math>x</math>-space), that is Newton's method is {\em affine invariant}. Observe that this property is not shared by the methods described in [[guide:49d0e37682#dimfree |Chapter]] (except for conditional gradient descent).
The affine invariance of Newton's method raises some concerns about the assumptions of the analysis in Section [[#sec:tradanalysisNM |Traditional analysis of Newton's method]]. Indeed the assumptions are all in terms of the canonical inner product in <math>\R^n</math>. However we just showed that the method itself does not depend on the choice of the inner product (again this is not true for first order methods). Thus one would like to derive a result similar to [[#th:NM |Theorem]] without any reference to a prespecified inner product. The idea of self-concordance is to modify the Lipschitz assumption on the Hessian to achieve this goal.
Assume from now on that <math>f</math> is <math>C^3</math>, and let <math>\nabla^3 f(x) : \R^n \times \R^n \times \R^n \rightarrow \R</math> be the third order differential operator. The Lipschitz assumption on the Hessian in [[#th:NM |Theorem]] can be written as:
<math display="block">
\nabla^3 f(x) [h,h,h] \leq M \|h\|_2^3 .
</math>
The issue is that this inequality depends on the choice of an inner product. More importantly it is easy to see that a convex function which goes to infinity on a compact set simply cannot satisfy the above inequality. A natural idea to try to fix these issues is to replace the Euclidean metric on the right hand side by the metric given by the function <math>f</math> itself at <math>x</math>, that is:
<math display="block">
\|h\|_x = \sqrt{ h^{\top} \nabla^2 f(x) h }.
</math>
Observe that to be clear one should rather use the notation <math>\|\cdot\|_{x, f}</math>, but since <math>f</math> will always be clear from the context we stick to <math>\|\cdot\|_x</math>.
{{defncard|label=|id=|Let <math>\mathcal{X}</math> be a convex set with non-empty interior, and <math>f</math> a <math>C^3</math> convex function defined on <math>\inte(\mathcal{X})</math>. Then <math>f</math> is self-concordant (with constant <math>M</math>) if for all <math>x \in \inte(\mathcal{X}), h \in \R^n</math>,
<math display="block">
\nabla^3 f(x) [h,h,h] \leq M \|h\|_x^3 .
</math>
We say that <math>f</math> is standard self-concordant if <math>f</math> is self-concordant with constant <math>M=2</math>.}}
An easy consequence of the definition is that a self-concordant function is a barrier for the set <math>\mathcal{X}</math>, see [Theorem 4.1.4, <ref name="Nes04"></ref>]. The main example to keep in mind of a standard self-concordant function is <math>f(x) = - \log x</math> for <math>x  >  0</math>. The next definition will be key in order to describe the region of quadratic convergence for Newton's method on self-concordant functions.
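For this canonical example one can check the definition directly: <math>\nabla^2 f(x) = 1/x^2</math> and <math>\nabla^3 f(x)[h,h,h] = - 2 h^3 / x^3</math>, so that
<math display="block">
\nabla^3 f(x) [h,h,h] = - \frac{2 h^3}{x^3} \leq \frac{2 |h|^3}{x^3} = 2 \left( \frac{h^2}{x^2} \right)^{3/2} = 2 \|h\|_x^3 ,
</math>
which shows that <math>- \log x</math> is indeed standard self-concordant.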
{{defncard|label=|id=|Let <math>f</math> be a standard self-concordant function on <math>\mathcal{X}</math>. For <math>x \in \mathrm{int}(\mathcal{X})</math>, we say that <math>\lambda_f(x) = \|\nabla f(x)\|_x^*</math> is the {\em Newton decrement} of <math>f</math> at <math>x</math>. Here <math>\|\cdot\|_x^*</math> denotes the dual norm of <math>\|\cdot\|_x</math>, that is <math>\|g\|_x^* = \sqrt{g^{\top} [\nabla^2 f(x)]^{-1} g}</math> when <math>\nabla^2 f(x)</math> is invertible.}}
An important inequality is that for <math>x</math> such that <math>\lambda_f(x)  <  1</math>, and <math>x^* = \argmin f(x)</math>, one has
<span id{{=}}"eq:trucipm3"/>
<math display="block">
\begin{equation} \label{eq:trucipm3}
\|x - x^*\|_x \leq \frac{\lambda_f(x)}{1 - \lambda_f(x)} ,
\end{equation}
</math>
see [Equation 4.1.18, <ref name="Nes04"></ref>]. We state the next theorem without a proof, see also [Theorem 4.1.14, <ref name="Nes04"></ref>].
{{proofcard|Theorem|th:NMsc|Let <math>f</math> be a standard self-concordant function on <math>\mathcal{X}</math>, and <math>x \in \mathrm{int}(\mathcal{X})</math> such that <math>\lambda_f(x) \leq 1/4</math>, then
<math display="block">
\lambda_f\Big(x - [\nabla^2 f(x)]^{-1} \nabla f(x)\Big) \leq 2 \lambda_f(x)^2 .
</math>|}}
In other words the above theorem states that, if initialized at a point <math>x_0</math> such that <math>\lambda_f(x_0) \leq 1/4</math>, then Newton's iterates satisfy <math>\lambda_f(x_{k+1}) \leq 2 \lambda_f(x_k)^2</math>. Thus, Newton's region of quadratic convergence for self-concordant functions can be described as a "Newton decrement ball" <math>\{x : \lambda_f(x) \leq 1/4\}</math>. In particular by taking the barrier to be a self-concordant function we have now resolved Step (1) of the plan described in Section [[#sec:barriermethod |The barrier method]].
===<math>\nu</math>-self-concordant barriers===
We deal here with Step (2) of the plan described in Section [[#sec:barriermethod |The barrier method]]. Given [[#th:NMsc |Theorem]] we want <math>t'</math> to be as large as possible and such that
<span id{{=}}"eq:trucipm1"/>
<math display="block">
\begin{equation} \label{eq:trucipm1}
\lambda_{F_{t'}}(x^*(t) ) \leq 1/4 .
\end{equation}
</math>
Since the Hessian of <math>F_{t'}</math> is the Hessian of <math>F</math>, one has
<math display="block">
\lambda_{F_{t'}}(x^*(t) ) = \|t' c + \nabla F(x^*(t)) \|_{x^*(t)}^* .
</math>
Observe that, by first order optimality, one has
<math>t c + \nabla F(x^*(t))  = 0,</math>
which yields
<span id{{=}}"eq:trucipm11"/>
<math display="block">
\begin{equation} \label{eq:trucipm11}
\lambda_{F_{t'}}(x^*(t) ) = (t'-t) \|c\|^*_{x^*(t)} .
\end{equation}
</math>
Thus taking
<span id{{=}}"eq:trucipm2"/>
<math display="block">
\begin{equation} \label{eq:trucipm2}
t' = t + \frac{1}{4 \|c\|^*_{x^*(t)}}
\end{equation}
</math>
immediately yields \eqref{eq:trucipm1}. In particular with the value of <math>t'</math> given in \eqref{eq:trucipm2} the Newton's method on <math>F_{t'}</math> initialized at <math>x^*(t)</math> will converge quadratically fast to <math>x^*(t')</math>.
It remains to verify that by iterating \eqref{eq:trucipm2} one obtains a sequence diverging to infinity, and to estimate the rate of growth. Thus one needs to control <math>\|c\|^*_{x^*(t)} = \frac1{t} \|\nabla F(x^*(t))\|_{x^*(t)}^*</math>. Luckily there is a natural class of functions for which one can control <math>\|\nabla F(x)\|_x^*</math> uniformly over <math>x</math>. This is the set of functions such that
<span id{{=}}"eq:nu"/>
<math display="block">
\begin{equation} \label{eq:nu}
\nabla^2 F(x) \succeq \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} .
\end{equation}
</math>
Indeed in that case one has:
<math display="block">
\begin{eqnarray*}
\|\nabla F(x)\|_x^* & = & \sup_{h : h^{\top} \nabla^2 F(x) h \leq 1} \nabla F(x)^{\top} h \\
& \leq & \sup_{h : h^{\top} \left( \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} \right) h \leq 1} \nabla F(x)^{\top} h \\
& = & \sqrt{\nu} .
\end{eqnarray*}
</math>
Thus a safe choice to increase the penalization parameter is <math>t' = \left(1 + \frac1{4\sqrt{\nu}}\right) t</math>. Note that the condition \eqref{eq:nu} can also be written as the fact that the function <math>F</math> is <math>\frac1{\nu}</math>-exp-concave, that is <math>x \mapsto \exp(- \frac1{\nu} F(x))</math> is concave. We arrive at the following definition.
{{defncard|label=|id=|<math>F</math> is a <math>\nu</math>-self-concordant barrier if it is a standard self-concordant function, and it is <math>\frac1{\nu}</math>-exp-concave.}}
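The equivalence between \eqref{eq:nu} and <math>\frac1{\nu}</math>-exp-concavity follows from a direct computation: the Hessian of <math>x \mapsto \exp(- \frac1{\nu} F(x))</math> is
<math display="block">
\nabla^2 \exp\Big(- \tfrac1{\nu} F(x)\Big) = \frac1{\nu} \exp\Big(- \tfrac1{\nu} F(x)\Big) \left( \frac1{\nu} \nabla F(x) [\nabla F(x)]^{\top} - \nabla^2 F(x) \right) ,
</math>
which is negative semi-definite for all <math>x</math> exactly when \eqref{eq:nu} holds.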
Again the canonical example is the logarithmic function, <math>x \mapsto - \log x</math>, which is a <math>1</math>-self-concordant barrier for the set <math>\R_{+}</math>. We state the next theorem without a proof (see <ref name="BE14"></ref> for more on this result).
{{proofcard|Theorem|theorem-1|Let <math>\mathcal{X} \subset \R^n</math> be a closed convex set with non-empty interior. There exists <math>F</math> which is a <math>(c \ n)</math>-self-concordant barrier for <math>\mathcal{X}</math> (where <math>c</math> is some universal constant).|}}
A key property of <math>\nu</math>-self-concordant barriers is the following inequality:
<span id{{=}}"eq:key"/>
<math display="block">
\begin{equation} \label{eq:key}
c^{\top} x^*(t) - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu}{t} ,
\end{equation}
</math>
see [Equation (4.2.17), <ref name="Nes04"></ref>]. More generally using \eqref{eq:key} together with \eqref{eq:trucipm3} one obtains
<span id{{=}}"eq:trucipm4"/>
<math display="block">
\begin{eqnarray}
c^{\top} y- \min_{x \in \mathcal{X}} c^{\top} x & \leq & \frac{\nu}{t} + c^{\top} (y - x^*(t)) \notag \\
& = & \frac{\nu}{t} + \frac{1}{t} (\nabla F_t(y) - \nabla F(y))^{\top} (y - x^*(t)) \notag \\
& \leq & \frac{\nu}{t} + \frac{1}{t} \|\nabla F_t(y) - \nabla F(y)\|_y^* \cdot \|y - x^*(t)\|_y \notag \\
& \leq & \frac{\nu}{t} + \frac{1}{t} (\lambda_{F_t}(y) + \sqrt{\nu})\frac{\lambda_{F_t} (y)}{1 - \lambda_{F_t}(y)} \label{eq:trucipm4}
\end{eqnarray}
</math>
In the next section we describe a precise algorithm based on the ideas we developed above. As we will see one cannot ensure to be exactly on the central path, and thus it is useful to generalize the identity \eqref{eq:trucipm11} for a point <math>x</math> close to the central path. We do this as follows:
<span id{{=}}"eq:trucipm12"/>
<math display="block">
\begin{eqnarray}
\lambda_{F_{t'}}(x) & = & \|t' c + \nabla F(x)\|_x^* \notag \\
& = &  \|(t' / t) (t c + \nabla F(x)) + (1- t'/t) \nabla F(x)\|_x^* \notag \\
& \leq & \frac{t'}{t} \lambda_{F_t}(x) + \left(\frac{t'}{t} - 1\right) \sqrt{\nu} .\label{eq:trucipm12}
\end{eqnarray}
</math>
===Path-following scheme===
We can now formally describe and analyze the most basic IPM, called the {\em path-following scheme}. Let <math>F</math> be a <math>\nu</math>-self-concordant barrier for <math>\cX</math>. Assume that one can find <math>x_0</math> such that <math>\lambda_{F_{t_0}}(x_0) \leq 1/4</math> for some small value <math>t_0  > 0</math> (we describe a method to find <math>x_0</math> at the end of this subsection).
Then for <math>k \geq 0</math>, let
<math display="block">
\begin{eqnarray*}
& & t_{k+1} = \left(1 + \frac1{13\sqrt{\nu}}\right) t_k ,\\
& & x_{k+1} = x_k - [\nabla^2 F(x_k)]^{-1} (t_{k+1} c + \nabla F(x_k) ) .
\end{eqnarray*}
</math>
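To make the scheme concrete, here is a minimal Python sketch on a toy instance of our own choosing: minimizing <math>c^{\top} x</math> over the box <math>[0,1]^n</math> with the logarithmic barrier <math>F(x) = - \sum_i \log x_i - \sum_i \log(1 - x_i)</math>, which is a <math>(2n)</math>-self-concordant barrier:

```python
import numpy as np

def path_following(c, x0, t0, n_iter=500):
    """Basic path-following scheme on the box [0,1]^n with the log barrier
    F(x) = -sum_i log(x_i) - sum_i log(1 - x_i), which is (2n)-self-concordant;
    one Newton step per increase of t."""
    c = np.asarray(c, dtype=float)
    nu = 2 * len(c)
    grad_F = lambda x: -1.0 / x + 1.0 / (1.0 - x)
    hess_F = lambda x: np.diag(1.0 / x**2 + 1.0 / (1.0 - x)**2)
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_iter):
        t *= 1.0 + 1.0 / (13.0 * np.sqrt(nu))
        x = x - np.linalg.solve(hess_F(x), t * c + grad_F(x))
    return x

# min c^T x over [0,1]^2 is attained at the vertex (0, 1) for c = (1, -1).
# x0 = (0.5, 0.5) is the analytic center (grad_F(x0) = 0), and one can check
# that t0 = 0.4 gives lambda_{F_{t0}}(x0) = 0.2 <= 1/4, as required.
x = path_following(c=[1.0, -1.0], x0=[0.5, 0.5], t0=0.4)
```

Here the barrier's Hessian is diagonal so the Newton direction is trivial to compute; in general one solves a linear system, as in the code.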
The next theorem shows that after <math>O\left( \sqrt{\nu} \log \frac{\nu}{t_0 \epsilon} \right)</math> iterations of the path-following scheme one obtains an <math>\epsilon</math>-optimal point.
{{proofcard|Theorem|theorem-2|The path-following scheme described above satisfies
<math display="block">
c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \exp\left( - \frac{k}{1+13\sqrt{\nu}} \right) .
</math>
|We show that the iterates <math>(x_k)_{k \geq 0}</math> remain close to the central path <math>(x^*(t_k))_{k \geq 0}</math>. Precisely one can easily prove by induction that
<math display="block">
\lambda_{F_{t_k}}(x_k) \leq 1/4 .
</math>
Indeed using [[#th:NMsc |Theorem]] and equation \eqref{eq:trucipm12} one immediately obtains
<math display="block">
\begin{eqnarray*}
\lambda_{F_{t_{k+1}}}(x_{k+1}) & \leq & 2 \lambda_{F_{t_{k+1}}}(x_k)^2 \\
& \leq & 2 \left(\frac{t_{k+1}}{t_k} \lambda_{F_{t_k}}(x_k) + \left(\frac{t_{k+1}}{t_k} - 1\right) \sqrt{\nu}\right)^2  \\
& \leq & 1/4 ,
\end{eqnarray*}
</math>
where we used in the last inequality that <math>t_{k+1} / t_k = 1 + \frac1{13\sqrt{\nu}}</math> and <math>\nu \geq 1</math>.
Thus using \eqref{eq:trucipm4} one obtains
<math display="block">
c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu + \sqrt{\nu} / 3 + 1/12}{t_k} \leq \frac{2 \nu}{t_k} .
</math>
Observe that <math>t_{k} = \left(1 + \frac1{13\sqrt{\nu}}\right)^{k} t_0</math>, which finally yields
<math display="block">
c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \left(1 + \frac1{13\sqrt{\nu}}\right)^{- k}.
</math>}}
At this point we still need to explain how one can get close to an initial point <math>x^*(t_0)</math> of the central path. This can be done with the following rather clever trick. Assume that one has some point <math>y_0 \in \cX</math>. The observation is that <math>y_0</math> is on the central path at <math>t=1</math> for the problem where <math>c</math> is replaced by <math>- \nabla F(y_0)</math>. Now instead of following this central path as <math>t \to +\infty</math>, one follows it as <math>t \to 0</math>. Indeed for <math>t</math> small enough the central paths for <math>c</math> and for <math>- \nabla F(y_0)</math> will be very close. Thus we iterate the following equations, starting with <math>t_0' = 1</math>,
<math display="block">
\begin{eqnarray*}
& & t_{k+1}' = \left(1 - \frac1{13\sqrt{\nu}}\right) t_k' ,\\
& & y_{k+1} = y_k - [\nabla^2 F(y_k)]^{-1} (- t_{k+1}' \nabla F(y_0) + \nabla F(y_k) ) .
\end{eqnarray*}
</math>
A straightforward analysis shows that for <math>k = O(\sqrt{\nu} \log \nu)</math>, which corresponds to <math>t_k'=1/\nu^{O(1)}</math>, one obtains a point <math>y_k</math> such that <math>\lambda_{F_{t_k'}}(y_k) \leq 1/4</math>. In other words one can initialize the path-following scheme with <math>t_0 = t_k'</math> and <math>x_0 = y_k</math>.
===IPMs for LPs and SDPs===
We have seen that, roughly, the complexity of interior point methods with a <math>\nu</math>-self-concordant barrier is <math>O\left(M \sqrt{\nu} \log \frac{\nu}{\epsilon} \right)</math>, where <math>M</math> is the complexity of computing a Newton direction (which can be done by computing and inverting the Hessian of the barrier). Thus the efficiency of the method is directly related to the {\em form} of the self-concordant barrier that one can construct for <math>\mathcal{X}</math>. It turns out that for LPs and SDPs one has particularly nice self-concordant barriers. Indeed one can show that <math>F(x) = - \sum_{i=1}^n \log x_i</math> is an <math>n</math>-self-concordant barrier on <math>\R_{+}^n</math>, and <math>F(x) = - \log \mathrm{det}(X)</math> is an <math>n</math>-self-concordant barrier on <math>\mathbb{S}_{+}^n</math>. See also <ref name="LS13"></ref> for a recent improvement of the basic logarithmic barrier for LPs.
There is one important issue that we overlooked so far. In most interesting cases LPs and SDPs come with {\em equality constraints}, resulting in a set of constraints <math>\cX</math> with empty interior. From a theoretical point of view there is an easy fix, which is to reparametrize the problem as to enforce the variables to live in the subspace spanned by <math>\cX</math>. This modification also has algorithmic consequences, as the evaluation of the Newton direction will now be different. In fact, rather than doing a reparametrization, one can simply search for Newton directions such that the updated point will stay in <math>\cX</math>. In other words one has now to solve a convex quadratic optimization problem under linear equality constraints. Luckily using Lagrange multipliers one can find a closed form solution to this problem, and we refer to previous references for more details.
==General references==
{{cite arXiv|last1=Bubeck|first1=Sébastien|year=2015|title=Convex Optimization: Algorithms and Complexity|eprint=1405.4980|class=math.OC}}
==References==
{{reflist}}
==Notes==
{{Reflist|group=Notes}}

Revision as of 18:47, 3 June 2024


In the black-box model non-smoothness dramatically deteriorates the rate of convergence of first order methods from [math]1/t^2[/math] to [math]1/\sqrt{t}[/math]. However, as we already pointed out in Section Structured optimization, we (almost) always know the function to be optimized {\em globally}. In particular the "source" of non-smoothness can often be identified. For instance the LASSO objective (see Section Some convex optimization problems in machine learning) is non-smooth, but it is a sum of a smooth part (the least squares fit) and a {\em simple} non-smooth part (the [math]\ell_1[/math]-norm). Using this specific structure we will propose in Section Sum of a smooth and a simple non-smooth term a first order method with a [math]1/t^2[/math] convergence rate, despite the non-smoothness. In Section Smooth saddle-point representation of a non-smooth function we consider another type of non-smoothness that can effectively be overcome, where the function is a maximum of smooth functions. Finally we conclude this chapter with a concise description of interior point methods, for which the structural assumption is made on the constraint set rather than on the objective function.

===Sum of a smooth and a simple non-smooth term===

We consider here the following problem[Notes 1]:

[[math]] \min_{x \in \R^n} f(x) + g(x) , [[/math]]

where [math]f[/math] is convex and [math]\beta[/math]-smooth, and [math]g[/math] is convex. We assume that [math]f[/math] can be accessed through a first order oracle, and that [math]g[/math] is known and "simple". What we mean by simplicity will be clear from the description of the algorithm. For instance a separable function, that is [math]g(x) = \sum_{i=1}^n g_i(x(i))[/math], will be considered as simple. The prime example is [math]g(x) = \|x\|_1[/math]. This section is inspired by [1] (see also [2][3]).

====ISTA (Iterative Shrinkage-Thresholding Algorithm)====

Recall that gradient descent on the smooth function [math]f[/math] can be written as (see eq:MDproxview)

[[math]] x_{t+1} = \argmin_{x \in \mathbb{R}^n} \eta \nabla f(x_t)^{\top} x + \frac{1}{2} \|x - x_t\|^2_2 . [[/math]]

Here one wants to minimize [math]f+g[/math], and [math]g[/math] is assumed to be known and "simple". Thus it seems quite natural to consider the following update rule, where only [math]f[/math] is locally approximated with a first order oracle:

[[math]] \begin{eqnarray} x_{t+1} & = & \argmin_{x \in \mathbb{R}^n} \eta (g(x) + \nabla f(x_t)^{\top} x) + \frac{1}{2} \|x - x_t\|^2_2 \notag \\ & = & \argmin_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \label{eq:proxoperator} \end{eqnarray} [[/math]]

The algorithm described by the above iteration is known as ISTA (Iterative Shrinkage-Thresholding Algorithm). In terms of convergence rate it is easy to show that ISTA has the same convergence rate on [math]f+g[/math] as gradient descent on [math]f[/math]. More precisely with [math]\eta=\frac{1}{\beta}[/math] one has

[[math]] f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} . [[/math]]

This improved convergence rate over a subgradient descent directly on [math]f+g[/math] comes at a price: in general \eqref{eq:proxoperator} may be a difficult optimization problem by itself, and this is why one needs to assume that [math]g[/math] is simple. For instance if [math]g[/math] can be written as [math]g(x) = \sum_{i=1}^n g_i(x(i))[/math] then one can compute [math]x_{t+1}[/math] by solving [math]n[/math] convex problems in dimension [math]1[/math]. In the case where [math]g(x) = \lambda \|x\|_1[/math] this one-dimensional problem is given by:

[[math]] \min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} . [[/math]]

Elementary computations show that this problem has an analytical solution given by [math]\tau_{\lambda \eta}(x_0)[/math], where [math]\tau[/math] is the shrinkage operator (hence the name ISTA), defined by

[[math]] \tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) . [[/math]]
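A minimal implementation of the shrinkage operator, checked against a brute-force grid minimization of the one-dimensional problem above (the numerical values of [math]\lambda, \eta, x_0[/math] are our own):

```python
import numpy as np

def shrink(x, alpha):
    """Soft-thresholding: tau_alpha(x) = (|x| - alpha)_+ * sign(x)."""
    return np.maximum(np.abs(x) - alpha, 0.0) * np.sign(x)

# Check the closed form against a brute-force minimization of
#   lam*|x| + (1/(2*eta))*(x - x0)^2  on a fine grid.
lam, eta, x0 = 0.5, 0.2, 1.3
grid = np.linspace(-3, 3, 600001)
brute = grid[np.argmin(lam * np.abs(grid) + (grid - x0) ** 2 / (2 * eta))]
closed_form = shrink(x0, lam * eta)
```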

Much more is known about \eqref{eq:proxoperator} (which is called the {\em proximal operator} of [math]g[/math]), and in fact entire monographs have been written about this operator, see e.g. [4][5].

====FISTA (Fast ISTA)====

An obvious idea is to combine Nesterov's accelerated gradient descent (which results in a [math]1/t^2[/math] rate to optimize [math]f[/math]) with ISTA. This results in FISTA (Fast ISTA), which is described as follows. Let

[[math]] \lambda_0 = 0, \ \lambda_{t} = \frac{1 + \sqrt{1+ 4 \lambda_{t-1}^2}}{2}, \ \text{and} \ \gamma_t = \frac{1 - \lambda_t}{\lambda_{t+1}}. [[/math]]

Let [math]x_1 = y_1[/math] be an arbitrary initial point, and

[[math]] \begin{eqnarray*} y_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \|x - (x_t - \frac1{\beta} \nabla f(x_t)) \|_2^2 , \\ x_{t+1} & = & (1 - \gamma_t) y_{t+1} + \gamma_t y_t . \end{eqnarray*} [[/math]]

Again it is easy to show that the rate of convergence of FISTA on [math]f+g[/math] is similar to that of Nesterov's accelerated gradient descent on [math]f[/math]; more precisely:

[[math]] f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t^2} . [[/math]]
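Putting the pieces together, here is a minimal FISTA sketch for the LASSO objective [math]\frac12 \|Ax - b\|_2^2 + \lambda \|x\|_1[/math]; the random instance is our own, and we take [math]\beta[/math] to be the largest eigenvalue of [math]A^{\top} A[/math]:

```python
import numpy as np

def fista(A, b, lam, n_iter=300):
    """FISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1, with beta the
    smoothness constant of the quadratic part."""
    beta = np.linalg.norm(A, 2) ** 2            # largest eigenvalue of A^T A
    shrink = lambda v, a: np.maximum(np.abs(v) - a, 0.0) * np.sign(v)
    x = y = np.zeros(A.shape[1])
    lam_cur = 1.0                               # lambda_1 in the recursion above
    for _ in range(n_iter):
        # Proximal gradient step on f from x_t, then the momentum combination.
        y_new = shrink(x - A.T @ (A @ x - b) / beta, lam / beta)
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam_cur**2)) / 2.0
        gamma = (1.0 - lam_cur) / lam_next
        x = (1.0 - gamma) * y_new + gamma * y
        y, lam_cur = y_new, lam_next
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10); x_true[:3] = 1.0         # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = fista(A, b, lam=0.1)
```

On this well-conditioned instance the recovered vector is close to the sparse ground truth, up to the usual small [math]\ell_1[/math] shrinkage bias.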

====CMD and RDA====

ISTA and FISTA assume smoothness in the Euclidean metric. Quite naturally one can also use these ideas in a non-Euclidean setting. Starting from eq:MDproxview one obtains the CMD (Composite Mirror Descent) algorithm of [6], while with eq:DA0 one obtains the RDA (Regularized Dual Averaging) algorithm of [7]. We refer to these papers for more details.

===Smooth saddle-point representation of a non-smooth function===

Quite often the non-smoothness of a function [math]f[/math] comes from a [math]\max[/math] operation. More precisely non-smooth functions can often be represented as

[[math]] \begin{equation} \label{eq:sprepresentation} f(x) = \max_{1 \leq i \leq m} f_i(x) , \end{equation} [[/math]]

where the functions [math]f_i[/math] are smooth. This was the case for instance with the function we used to prove the black-box lower bound [math]1/\sqrt{t}[/math] for non-smooth optimization in Theorem. We will see now that by using this structural representation one can in fact attain a rate of [math]1/t[/math]. This was first observed in [8], who proposed Nesterov's smoothing technique. Here we will present the alternative method of [9], which we find more transparent (yet another version is the Chambolle-Pock algorithm, see [10]). Most of what is described in this section can be found in [11][12]. In the next subsection we introduce the more general problem of saddle point computation. We then proceed to apply a modified version of mirror descent to this problem, which will be useful both in Chapter and also as a warm-up for the more powerful modified mirror prox that we introduce next.

====Saddle point computation====

Let [math]\cX \subset \R^n[/math], [math]\cY \subset \R^m[/math] be compact and convex sets. Let [math]\phi : \cX \times \cY \rightarrow \mathbb{R}[/math] be a continuous function, such that [math]\phi(\cdot, y)[/math] is convex and [math]\phi(x, \cdot)[/math] is concave. We write [math]g_{\cX}(x,y)[/math] (respectively [math]g_{\cY}(x,y)[/math]) for an element of [math]\partial_x \phi(x,y)[/math] (respectively [math]\partial_y (-\phi(x,y))[/math]). We are interested in computing

[[math]] \min_{x \in \cX} \max_{y \in \cY} \phi(x,y) . [[/math]]

By Sion's minimax theorem there exists a pair [math](x^*, y^*) \in \cX \times \cY[/math] such that

[[math]] \phi(x^*,y^*) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(x,y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x,y) . [[/math]]

We will explore algorithms that produce a candidate pair of solutions [math](\tx, \ty) \in \cX \times \cY[/math]. The quality of [math](\tx, \ty)[/math] is evaluated through the so-called duality gap[Notes 2]

[[math]] \max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) . [[/math]]

The key observation is that the duality gap can be controlled similarly to the suboptimality gap [math]f(x) - f(x^*)[/math] in a simple convex optimization problem. Indeed for any [math](x, y) \in \cX \times \cY[/math],

[[math]] \phi(\tx,\ty) - \phi(x,\ty) \leq g_{\cX}(\tx,\ty)^{\top} (\tx-x), [[/math]]

and

[[math]] - \phi(\tx,\ty) - (- \phi(\tx,y)) \leq g_{\cY}(\tx,\ty)^{\top} (\ty-y) . [[/math]]

In particular, using the notation [math]z = (x,y) \in \cZ := \cX \times \cY[/math] and [math]g(z) = (g_{\cX}(x,y), g_{\cY}(x,y))[/math] we just proved

[[math]] \begin{equation} \label{eq:keysp} \max_{y \in \mathcal{Y}} \phi(\tx,y) - \min_{x \in \mathcal{X}} \phi(x,\ty) \leq g(\tz)^{\top} (\tz - z) , \end{equation} [[/math]]

for some [math]z \in \mathcal{Z}[/math]. In view of the vector field point of view developed in Section The vector field point of view on MD, DA, and MP this suggests performing mirror descent in the [math]\cZ[/math]-space with the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math]. We will assume in the next subsections that [math]\cX[/math] is equipped with a mirror map [math]\Phi_{\cX}[/math] (defined on [math]\cD_{\cX}[/math]) which is [math]1[/math]-strongly convex w.r.t. a norm [math]\|\cdot\|_{\cX}[/math] on [math]\cX \cap \cD_{\cX}[/math]. We denote [math]R^2_{\cX} = \sup_{x \in \cX} \Phi_{\cX}(x) - \min_{x \in \cX} \Phi_{\cX}(x)[/math]. We define similar quantities for the space [math]\cY[/math].

====Saddle Point Mirror Descent (SP-MD)====

We consider here mirror descent on the space [math]\cZ = \cX \times \cY[/math] with the mirror map [math]\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)[/math] (defined on [math]\cD = \cD_{\cX} \times \cD_{\cY}[/math]), where [math]a, b \in \R_+[/math] are to be defined later, and with the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math] defined in the previous subsection. We call the resulting algorithm SP-MD (Saddle Point Mirror Descent). It can be described succinctly as follows. Let [math]z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)[/math]. Then for [math]t \geq 1[/math], let

[[math]] z_{t+1} \in \argmin_{z \in \cZ \cap \cD} \ \eta g_t^{\top} z + D_{\Phi}(z,z_t) , [[/math]]

where [math]g_t = (g_{\cX,t}, g_{\cY,t})[/math] with [math]g_{\cX,t} \in \partial_x \phi(x_t,y_t)[/math] and [math]g_{\cY,t} \in \partial_y (- \phi(x_t,y_t))[/math].

Theorem

Assume that [math]\phi(\cdot, y)[/math] is [math]L_{\cX}[/math]-Lipschitz w.r.t. [math]\|\cdot\|_{\cX}[/math], that is [math]\|g_{\cX}(x,y)\|_{\cX}^* \leq L_{\cX}, \forall (x, y) \in \cX \times \cY[/math]. Similarly assume that [math]\phi(x, \cdot)[/math] is [math]L_{\cY}[/math]-Lipschitz w.r.t. [math]\|\cdot\|_{\cY}[/math]. Then SP-MD with [math]a= \frac{L_{\cX}}{R_{\cX}}[/math], [math]b=\frac{L_{\cY}}{R_{\cY}}[/math], and [math]\eta=\sqrt{\frac{2}{t}}[/math] satisfies

[[math]] \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq (R_{\cX} L_{\cX} + R_{\cY} L_{\cY}) \sqrt{\frac{2}{t}}. [[/math]]


Proof.

First we endow [math]\mathcal{Z}[/math] with the norm [math]\|\cdot\|_{\cZ}[/math] defined by

[[math]] \|z\|_{\cZ} = \sqrt{a \|x\|_{\mathcal{X}}^2 + b \|y\|_{\mathcal{Y}}^2} . [[/math]]
It is immediate that [math]\Phi[/math] is [math]1[/math]-strongly convex with respect to [math]\|\cdot\|_{\mathcal{Z}}[/math] on [math]\mathcal{Z} \cap \mathcal{D}[/math]. Furthermore one can easily check that

[[math]] \|z\|_{\mathcal{Z}}^* = \sqrt{\frac1{a} \left(\|x\|_{\mathcal{X}}^*\right)^2 + \frac1{b} \left(\|y\|_{\mathcal{Y}}^*\right)^2} , [[/math]]
and thus the vector field [math](g_t)[/math] used in the SP-MD satisfies:

[[math]] \|g_t\|_{\mathcal{Z}}^* \leq \sqrt{\frac{L_{\mathcal{X}}^2}{a} + \frac{L_{\mathcal{Y}}^2}{b}} . [[/math]]
Using eq:vfMD together with \eqref{eq:keysp} and the values of [math]a, b[/math] and [math]\eta[/math] concludes the proof.
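As an illustration, here is a minimal Python sketch of SP-MD for the bilinear game [math]\phi(x,y) = x^{\top} A y[/math] on two probability simplices, with the negative-entropy mirror map on each factor (so each mirror step is a multiplicative-weights update); the game instance and horizon are our own choices:

```python
import numpy as np

def sp_md(A, n_steps):
    """SP-MD for phi(x, y) = x^T A y on two probability simplices, with the
    negative-entropy mirror map on each factor and the stepsizes a, b, eta
    from the theorem above."""
    n, m = A.shape
    L = np.abs(A).max()                       # L_X = L_Y <= max_ij |A_ij|
    a, b = L / np.sqrt(np.log(n)), L / np.sqrt(np.log(m))
    eta = np.sqrt(2.0 / n_steps)
    x, y = np.ones(n) / n, np.ones(m) / m     # z_1 minimizes the entropy map
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(n_steps):
        x_sum += x; y_sum += y                # average the gradient points
        gx, gy = A @ y, -A.T @ x              # g_X and g_Y as defined above
        x = x * np.exp(-(eta / a) * gx); x /= x.sum()
        y = y * np.exp(-(eta / b) * gy); y /= y.sum()
    return x_sum / n_steps, y_sum / n_steps

# Zero-sum game with mixed equilibrium x* = y* = (0.4, 0.6) and value 0.2.
A = np.array([[2.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = sp_md(A, n_steps=5000)
gap = np.max(x_bar @ A) - np.min(A @ y_bar)   # duality gap of the averages
```

With these entropy mirror maps one has [math]R_{\cX}^2 = \log n[/math] and [math]L_{\cX} \leq \max_{i,j} |A_{i,j}|[/math], which is what the code uses to set [math]a[/math] and [math]b[/math]; the theorem then bounds the duality gap of the averaged iterates by [math](R_{\cX} L_{\cX} + R_{\cY} L_{\cY}) \sqrt{2/t}[/math].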

====Saddle Point Mirror Prox (SP-MP)====

We now consider the most interesting situation in the context of this chapter, where the function [math]\phi[/math] is smooth. Precisely we say that [math]\phi[/math] is [math](\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})[/math]-smooth if for any [math]x, x' \in \cX, y, y' \in \cY[/math],

[[math]] \begin{align*} & \|\nabla_x \phi(x,y) - \nabla_x \phi(x',y) \|_{\mathcal{X}}^* \leq \beta_{11} \|x-x'\|_{\mathcal{X}} , \\ & \|\nabla_x \phi(x,y) - \nabla_x \phi(x,y') \|_{\mathcal{X}}^* \leq \beta_{12} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x,y') \|_{\mathcal{Y}}^* \leq \beta_{22} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x',y) \|_{\mathcal{Y}}^* \leq \beta_{21} \|x-x'\|_{\mathcal{X}} , \end{align*} [[/math]]

This will imply the Lipschitzness of the vector field [math]g : \cZ \rightarrow \R^n \times \R^m[/math] under the appropriate norm. Thus we use here mirror prox on the space [math]\cZ[/math] with the mirror map [math]\Phi(z) = a \Phi_{\cX}(x) + b \Phi_{\cY}(y)[/math] and the vector field [math]g[/math]. The resulting algorithm is called SP-MP (Saddle Point Mirror Prox) and we can describe it succinctly as follows. Let [math]z_1 \in \argmin_{z \in \cZ \cap \cD} \Phi(z)[/math]. Then for [math]t \geq 1[/math], let [math]z_t=(x_t,y_t)[/math] and [math]w_t=(u_t, v_t)[/math] be defined by

[[math]] \begin{eqnarray*} w_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(x_t, y_t), - \nabla_y \phi(x_t,y_t))^{\top} z + D_{\Phi}(z,z_t) \\ z_{t+1} & = & \argmin_{z \in \cZ \cap \cD} \ \eta (\nabla_x \phi(u_{t+1}, v_{t+1}), - \nabla_y \phi(u_{t+1},v_{t+1}))^{\top} z + D_{\Phi}(z,z_t) . \end{eqnarray*} [[/math]]


Theorem

Assume that [math]\phi[/math] is [math](\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})[/math]-smooth. Then SP-MP with [math]a= \frac{1}{R_{\cX}^2}[/math], [math]b=\frac{1}{R_{\cY}^2}[/math], and [math]\eta= 1 / \left(2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \right)[/math] satisfies

[[math]] \begin{align*} & \max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t u_{s+1},y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t v_{s+1} \right) \\ & \leq \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right) \frac{4}{t} . \end{align*} [[/math]]


Show Proof

In light of the proof of Theorem and \eqref{eq:vfMP} it clearly suffices to show that the vector field [math]g(z) = (\nabla_x \phi(x,y), - \nabla_y \phi(x,y))[/math] is [math]\beta[/math]-Lipschitz w.r.t. [math]\|z\|_{\cZ} = \sqrt{\frac{1}{R_{\cX}^2} \|x\|_{\mathcal{X}}^2 + \frac{1}{R_{\cY}^2} \|y\|_{\mathcal{Y}}^2}[/math] with [math]\beta = 2 \max \left(\beta_{11} R^2_{\cX}, \beta_{22} R^2_{\cY}, \beta_{12} R_{\cX} R_{\cY}, \beta_{21} R_{\cX} R_{\cY}\right)[/math]. In other words one needs to show that

[[math]] \|g(z) - g(z')\|_{\cZ}^* \leq \beta \|z - z'\|_{\cZ} , [[/math]]
which can be done with straightforward calculations (by introducing [math]g(x',y)[/math] and using the definition of smoothness for [math]\phi[/math]).

Applications

We investigate briefly three applications for SP-MD and SP-MP.

Minimizing a maximum of smooth functions

The problem \eqref{eq:sprepresentation} (when [math]f[/math] has to be minimized over [math]\cX[/math]) can be rewritten as

[[math]] \min_{x \in \cX} \max_{y \in \Delta_m} \vec{f}(x)^{\top} y , [[/math]]

where [math]\vec{f}(x) = (f_1(x), \hdots, f_m(x)) \in \R^m[/math]. We assume that the functions [math]f_i[/math] are [math]L[/math]-Lipschitz and [math]\beta[/math]-smooth w.r.t. some norm [math]\|\cdot\|_{\cX}[/math]. Let us study the smoothness of [math]\phi(x,y) = \vec{f}(x)^{\top} y[/math] when [math]\cX[/math] is equipped with [math]\|\cdot\|_{\cX}[/math] and [math]\Delta_m[/math] is equipped with [math]\|\cdot\|_1[/math]. On the one hand [math]\nabla_y \phi(x,y) = \vec{f}(x)[/math], in particular one immediately has [math]\beta_{22}=0[/math], and furthermore

[[math]] \|\vec{f}(x) - \vec{f}(x') \|_{\infty} \leq L \|x-x'\|_{\mathcal{X}} , [[/math]]

that is [math]\beta_{21}=L[/math]. On the other hand [math]\nabla_x \phi(x,y) = \sum_{i=1}^m y_i \nabla f_i(x)[/math], and thus

[[math]] \begin{align*} & \|\sum_{i=1}^m y(i) (\nabla f_i(x) - \nabla f_i(x')) \|_{\cX}^* \leq \beta \|x-x'\|_{\cX} , \\ & \|\sum_{i=1}^m (y(i)-y'(i)) \nabla f_i(x) \|_{\cX}^* \leq L\|y-y'\|_1 , \end{align*} [[/math]]

that is [math]\beta_{11} = \beta[/math] and [math]\beta_{12} = L[/math]. Thus using SP-MP with some mirror map on [math]\cX[/math] and the negentropy on [math]\Delta_m[/math] (see the "simplex setup" in Section Standard setups for mirror descent), one obtains an [math]\epsilon[/math]-optimal point of [math]f(x) = \max_{1 \leq i \leq m} f_i(x)[/math] in [math]O\left(\frac{\beta R_{\cX}^2 + L R_{\cX} \sqrt{\log(m)}}{\epsilon} \right)[/math] iterations. Furthermore an iteration of SP-MP has a computational complexity of the order of a step of mirror descent in [math]\cX[/math] on the function [math]x \mapsto \sum_{i=1}^m y(i) f_i(x)[/math] (plus [math]O(m)[/math] for the update in the [math]\cY[/math]-space). Thus by using the structure of [math]f[/math] we were able to obtain a much better rate than black-box procedures (which would have required [math]\Omega(1/\epsilon^2)[/math] iterations as [math]f[/math] is potentially non-smooth).

Matrix games

Let [math]A \in \R^{n \times m}[/math]. We denote by [math]\|A\|_{\mathrm{max}}[/math] the maximal entry (in absolute value) of [math]A[/math], and by [math]A_i \in \R^n[/math] the [math]i^{th}[/math] column of [math]A[/math]. We consider the problem of computing a Nash equilibrium for the zero-sum game corresponding to the loss matrix [math]A[/math], that is we want to solve

[[math]] \min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y . [[/math]]

Here we equip both [math]\Delta_n[/math] and [math]\Delta_m[/math] with [math]\|\cdot\|_1[/math]. Let [math]\phi(x,y) = x^{\top} A y[/math]. Using that [math]\nabla_x \phi(x,y) = Ay[/math] and [math]\nabla_y \phi(x,y) = A^{\top} x[/math] one immediately obtains [math]\beta_{11} = \beta_{22} = 0[/math]. Furthermore since

[[math]] \|A(y - y') \|_{\infty} = \|\sum_{i=1}^m (y(i) - y'(i)) A_i \|_{\infty} \leq \|A\|_{\mathrm{max}} \|y - y'\|_1 , [[/math]]

one also has [math]\beta_{12} = \beta_{21} = \|A\|_{\mathrm{max}}[/math]. Thus SP-MP with the negentropy on both [math]\Delta_n[/math] and [math]\Delta_m[/math] attains an [math]\epsilon[/math]-optimal pair of mixed strategies with [math]O\left(\|A\|_{\mathrm{max}} \sqrt{\log(n) \log(m)} / \epsilon \right)[/math] iterations. Furthermore the computational complexity of a step of SP-MP is dominated by the matrix-vector multiplications which are [math]O(n m)[/math]. Thus overall the complexity of getting an [math]\epsilon[/math]-optimal Nash equilibrium with SP-MP is [math]O\left(\|A\|_{\mathrm{max}} n m \sqrt{\log(n) \log(m)} / \epsilon \right)[/math].
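A minimal numerical sketch of this scheme (assuming NumPy; the step size [math]0.5/\|A\|_{\mathrm{max}}[/math] is an illustrative choice in the spirit of the theorem's [math]\eta[/math], not its exact constant):

```python
import numpy as np

def sp_mp_matrix_game(A, T):
    """Saddle-point mirror prox (extragradient) for min_x max_y x^T A y
    on two simplices, with negentropy mirror maps on both."""
    n, m = A.shape
    x = np.ones(n) / n
    y = np.ones(m) / m
    us, vs = np.zeros(n), np.zeros(m)
    eta = 0.5 / np.abs(A).max()   # illustrative step size
    for _ in range(T):
        # extrapolation step: gradients evaluated at (x_t, y_t)
        u = x * np.exp(-eta * (A @ y));   u /= u.sum()
        v = y * np.exp(+eta * (A.T @ x)); v /= v.sum()
        # update step: gradients evaluated at (u_{t+1}, v_{t+1})
        x = x * np.exp(-eta * (A @ v));   x /= x.sum()
        y = y * np.exp(+eta * (A.T @ u)); y /= y.sum()
        us += u
        vs += v
    return us / T, vs / T
```

On a small game such as rock-paper-scissors one can verify that the duality gap of the averaged extrapolation points decays at the much faster [math]O(1/t)[/math] rate, as the theorem predicts.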

Linear classification

Let [math](\ell_i, A_i) \in \{-1,1\} \times \R^n[/math], [math]i \in [m][/math], be a data set that one wishes to separate with a linear classifier. That is one is looking for [math]x \in \mB_{2,n}[/math] such that for all [math]i \in [m][/math], [math]\mathrm{sign}(x^{\top} A_i) = \mathrm{sign}(\ell_i)[/math], or equivalently [math]\ell_i x^{\top} A_i \gt 0[/math]. Clearly without loss of generality one can assume [math]\ell_i = 1[/math] for all [math]i \in [m][/math] (simply replace [math]A_i[/math] by [math]\ell_i A_i[/math]). Let [math]A \in \R^{n \times m}[/math] be the matrix where the [math]i^{th}[/math] column is [math]A_i[/math]. The problem of finding [math]x[/math] with maximal margin can be written as

[[math]] \begin{equation} \label{eq:linearclassif} \max_{x \in \mB_{2,n}} \min_{1 \leq i \leq m} A_i^{\top} x = \max_{x \in \mB_{2,n}} \min_{y \in \Delta_m} x^{\top} A y . \end{equation} [[/math]]

Assuming that [math]\|A_i\|_2 \leq B[/math], and using the calculations we did in Section Minimizing a maximum of smooth functions, it is clear that [math]\phi(x,y) = x^{\top} A y[/math] is [math](0, B, 0, B)[/math]-smooth with respect to [math]\|\cdot\|_2[/math] on [math]\mB_{2,n}[/math] and [math]\|\cdot\|_1[/math] on [math]\Delta_m[/math]. This implies in particular that SP-MP with the Euclidean norm squared on [math]\mB_{2,n}[/math] and the negentropy on [math]\Delta_m[/math] will solve \eqref{eq:linearclassif} in [math]O(B \sqrt{\log(m)} / \epsilon)[/math] iterations. Again the cost of an iteration is dominated by the matrix-vector multiplications, which results in an overall complexity of [math]O(B n m \sqrt{\log(m)} / \epsilon)[/math] to find an [math]\epsilon[/math]-optimal solution to \eqref{eq:linearclassif}.

Interior point methods

We describe here interior point methods (IPM), a class of algorithms fundamentally different from what we have seen so far. The first algorithm of this type was described in [13], but the theory we shall present was developed in [14]. We follow closely the presentation given in [Chapter 4, [15]]. Other useful references (in particular for the primal-dual IPM, which are the ones used in practice) include [16][17][18]. IPM are designed to solve convex optimization problems of the form

[[math]] \begin{align*} & \mathrm{min.} \; c^{\top} x \\ & \text{s.t.} \; x \in \cX , \end{align*} [[/math]]

with [math]c \in \R^n[/math], and [math]\cX \subset \R^n[/math] convex and compact. Note that, at this point, the linearity of the objective is without loss of generality as minimizing a convex function [math]f[/math] over [math]\cX[/math] is equivalent to minimizing a linear objective over the epigraph of [math]f[/math] (which is also a convex set). The structural assumption on [math]\cX[/math] that one makes in IPM is that there exists a {\em self-concordant barrier} for [math]\cX[/math] with an easily computable gradient and Hessian. The meaning of the previous sentence will be made precise in the next subsections. The importance of IPM stems from the fact that LPs and SDPs (see Section Structured optimization) satisfy this structural assumption.

The barrier method

We say that [math]F : \inte(\cX) \rightarrow \R[/math] is a {\em barrier} for [math]\cX[/math] if

[[math]] F(x) \xrightarrow[x \to \partial \cX]{} +\infty . [[/math]]

We will only consider strictly convex barriers. We extend the domain of definition of [math]F[/math] to [math]\R^n[/math] with [math]F(x) = +\infty[/math] for [math]x \not\in \inte(\cX)[/math]. For [math]t \in \R_+[/math] let

[[math]] x^*(t) \in \argmin_{x \in \R^n} t c^{\top} x + F(x) . [[/math]]

In the following we denote [math]F_t(x) := t c^{\top} x + F(x)[/math]. In IPM the path [math](x^*(t))_{t \in \R_+}[/math] is referred to as the {\em central path}. It seems clear that the central path eventually leads to the minimum [math]x^*[/math] of the objective function [math]c^{\top} x[/math] on [math]\cX[/math]; precisely, we will have

[[math]] x^*(t) \xrightarrow[t \to +\infty]{} x^* . [[/math]]

The idea of the {\em barrier method} is to move along the central path by "boosting" a fast locally convergent algorithm, which we denote for the moment by [math]\cA[/math], using the following scheme: assume that one has computed [math]x^*(t)[/math]; then one uses [math]\cA[/math] initialized at [math]x^*(t)[/math] to compute [math]x^*(t')[/math] for some [math]t' \gt t[/math]. There is a clear tension in the choice of [math]t'[/math]: on the one hand [math]t'[/math] should be large in order to make as much progress as possible along the central path, but on the other hand [math]x^*(t)[/math] needs to be close enough to [math]x^*(t')[/math] to lie in the basin of fast convergence of [math]\cA[/math] when run on [math]F_{t'}[/math]. IPM follow the above methodology with [math]\cA[/math] being {\em Newton's method}. Indeed, as we will see in the next subsection, Newton's method has a quadratic convergence rate, in the sense that if initialized close enough to the optimum it attains an [math]\epsilon[/math]-optimal point in [math]\log\log(1/\epsilon)[/math] iterations! Thus we now have a clear plan to make these ideas formal and analyze the iteration complexity of IPM:

  • First we need to describe precisely the region of fast convergence for Newton's method. This will lead us to define self-concordant functions, which are "natural" functions for Newton's method.
  • Then we need to evaluate precisely how much larger [math]t'[/math] can be compared to [math]t[/math], so that [math]x^*(t)[/math] is still in the region of fast convergence of Newton's method when optimizing the function [math]F_{t'}[/math] with [math]t' \gt t[/math]. This will lead us to define [math]\nu[/math]-self-concordant barriers.
  • How do we get close to the central path in the first place? Is it possible to compute [math]x^*(0) = \argmin_{x \in \R^n} F(x)[/math] (the so-called analytical center of [math]\mathcal{X}[/math])?

Traditional analysis of Newton's method

We start by describing Newton's method together with its standard analysis showing the quadratic convergence rate when initialized close enough to the optimum. In this subsection we denote [math]\|\cdot\|[/math] for both the Euclidean norm on [math]\R^n[/math] and the operator norm on matrices (in particular [math]\|A x\| \leq \|A\| \cdot \|x\|[/math]). Let [math]f: \R^n \rightarrow \R[/math] be a [math]C^2[/math] function. Using a Taylor's expansion of [math]f[/math] around [math]x[/math] one obtains

[[math]] f(x+h) = f(x) + h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h + o(\|h\|^2) . [[/math]]

Thus, starting at [math]x[/math], in order to minimize [math]f[/math] it seems natural to move in the direction [math]h[/math] that minimizes

[[math]] h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h . [[/math]]

If [math]\nabla^2 f(x)[/math] is positive definite then the solution to this problem is given by [math]h = - [\nabla^2 f(x)]^{-1} \nabla f(x)[/math]. Newton's method simply iterates this idea: starting at some point [math]x_0 \in \R^n[/math], it iterates for [math]k \geq 0[/math] the following equation:

[[math]] x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) . [[/math]]
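The iteration is a few lines of code. Here is a minimal sketch (assuming NumPy); the objective [math]f(x) = \sum_i (x_i - \log x_i)[/math] is a hypothetical example chosen because its minimizer is the all-ones vector, so the convergence is easy to track:

```python
import numpy as np

def newton(grad, hess, x0, iters):
    """Newton's method: x_{k+1} = x_k - [grad^2 f(x_k)]^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Example objective f(x) = sum_i (x_i - log x_i), minimized at x = (1, ..., 1).
grad = lambda x: 1.0 - 1.0 / x            # gradient, coordinate-wise
hess = lambda x: np.diag(1.0 / x ** 2)    # Hessian, diagonal here
```

For this particular [math]f[/math] the coordinate-wise iteration reduces to [math]x^+ = 2x - x^2[/math], so the error satisfies [math]|1 - x^+| = |1-x|^2[/math] exactly, and one observes the quadratic rate directly: starting at [math]0.8[/math] the errors are [math]0.2, 0.04, 1.6 \cdot 10^{-3}, 2.56 \cdot 10^{-6}, \hdots[/math]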

While this method can behave arbitrarily badly in general, if started close enough to a strict local minimum of [math]f[/math] it converges very fast:

Theorem

Assume that [math]f[/math] has a Lipschitz Hessian, that is [math]\| \nabla^2 f(x) - \nabla^2 f(y) \| \leq M \|x - y\|[/math]. Let [math]x^*[/math] be a local minimum of [math]f[/math] with positive definite Hessian, that is [math]\nabla^2 f(x^*) \succeq \mu \mI_n[/math], [math]\mu \gt 0[/math]. Suppose that the initial starting point [math]x_0[/math] of Newton's method is such that

[[math]] \|x_0 - x^*\| \leq \frac{\mu}{2 M} . [[/math]]
Then Newton's method is well-defined and converges to [math]x^*[/math] at a quadratic rate:

[[math]] \|x_{k+1} - x^*\| \leq \frac{M}{\mu} \|x_k - x^*\|^2. [[/math]]


Show Proof

We use the following simple formula, for [math]x, h \in \R^n[/math],

[[math]] \int_0^1 \nabla^2 f(x + s h) \ h \ ds = \nabla f(x+h) - \nabla f(x) . [[/math]]
Now note that [math]\nabla f(x^*) = 0[/math], and thus with the above formula one obtains

[[math]] \nabla f(x_k) = \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds , [[/math]]
which allows us to write:

[[math]] \begin{align*} & x_{k+1} - x^* \\ & = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) \\ & = x_k - x^* - [\nabla^2 f(x_k)]^{-1} \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds \\ & = [\nabla^2 f(x_k)]^{-1} \int_0^1 [\nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) ] \ (x_k - x^*) \ ds . \end{align*} [[/math]]
In particular one has

[[math]] \begin{align*} & \|x_{k+1} - x^*\| \\ & \leq \|[\nabla^2 f(x_k)]^{-1}\| \\ & \times \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \|x_k - x^* \|. \end{align*} [[/math]]
Using the Lipschitz property of the Hessian one immediately obtains that

[[math]] \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \leq \frac{M}{2} \|x_k - x^*\| . [[/math]]
Using again the Lipschitz property of the Hessian (note that [math]\|A - B\| \leq s \Leftrightarrow s \mI_n \succeq A - B \succeq - s \mI_n[/math]), the hypothesis on [math]x^*[/math], and an induction hypothesis that [math]\|x_k - x^*\| \leq \frac{\mu}{2M}[/math], one has

[[math]] \nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M \|x_k - x^*\| \mI_n \succeq (\mu - M \|x_k - x^*\|) \mI_n \succeq \frac{\mu}{2} \mI_n , [[/math]]
which concludes the proof.

Self-concordant functions

Before giving the definition of self-concordant functions let us try to get some insight into the "geometry" of Newton's method. Let [math]A[/math] be an [math]n \times n[/math] non-singular matrix. We look at a Newton step on the functions [math]f: x \mapsto f(x)[/math] and [math]\phi: y \mapsto f(A^{-1} y)[/math], starting respectively from [math]x[/math] and [math]y = A x[/math], that is:

[[math]] x^+ = x - [\nabla^2 f(x)]^{-1} \nabla f(x) , \; \text{and} \; y^+ = y - [\nabla^2 \phi(y)]^{-1} \nabla \phi(y) . [[/math]]

By using the following simple formulas

[[math]] \nabla (x \mapsto f(A x) ) =A^{\top} \nabla f(A x) , \; \text{and} \; \nabla^2 (x \mapsto f(A x) ) =A^{\top} \nabla^2 f(A x) A , [[/math]]

it is easy to show that

[[math]] y^+ = A x^+ . [[/math]]

In other words Newton's method follows the same trajectory in the "[math]x[/math]-space" and in the "[math]y[/math]-space" (the image through [math]A[/math] of the [math]x[/math]-space); that is, Newton's method is {\em affine invariant}. Observe that this property is not shared by the methods described in Chapter (except for the conditional gradient descent). The affine invariance of Newton's method raises some concerns about the assumptions of the analysis in Section Traditional analysis of Newton's method. Indeed the assumptions are all in terms of the canonical inner product in [math]\R^n[/math]. However we just showed that the method itself does not depend on the choice of the inner product (again this is not true for first order methods). Thus one would like to derive a result similar to Theorem without any reference to a prespecified inner product. The idea of self-concordance is to modify the Lipschitz assumption on the Hessian to achieve this goal. Assume from now on that [math]f[/math] is [math]C^3[/math], and let [math]\nabla^3 f(x) : \R^n \times \R^n \times \R^n \rightarrow \R[/math] be the third order differential operator. The Lipschitz assumption on the Hessian in Theorem can be written as:

[[math]] \nabla^3 f(x) [h,h,h] \leq M \|h\|_2^3 . [[/math]]

The issue is that this inequality depends on the choice of an inner product. More importantly it is easy to see that a convex function which tends to infinity on a compact set simply cannot satisfy the above inequality. A natural idea to try to fix these issues is to replace the Euclidean metric on the right hand side by the metric given by the function [math]f[/math] itself at [math]x[/math], that is:

[[math]] \|h\|_x = \sqrt{ h^{\top} \nabla^2 f(x) h }. [[/math]]

Observe that to be clear one should rather use the notation [math]\|\cdot\|_{x, f}[/math], but since [math]f[/math] will always be clear from the context we stick to [math]\|\cdot\|_x[/math].

Definition

Let [math]\mathcal{X}[/math] be a convex set with non-empty interior, and [math]f[/math] a [math]C^3[/math] convex function defined on [math]\inte(\mathcal{X})[/math]. Then [math]f[/math] is self-concordant (with constant [math]M[/math]) if for all [math]x \in \inte(\mathcal{X}), h \in \R^n[/math],

[[math]] \nabla^3 f(x) [h,h,h] \leq M \|h\|_x^3 . [[/math]]
We say that [math]f[/math] is standard self-concordant if [math]f[/math] is self-concordant with constant [math]M=2[/math].

An easy consequence of the definition is that a self-concordant function is a barrier for the set [math]\mathcal{X}[/math], see [Theorem 4.1.4, [15]]. The main example of a standard self-concordant function to keep in mind is [math]f(x) = - \log x[/math] for [math]x \gt 0[/math]. The next definition will be key in order to describe the region of quadratic convergence of Newton's method on self-concordant functions.
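Standard self-concordance of [math]f(x) = -\log x[/math] can be verified in one line: for [math]x \gt 0[/math] and [math]h \in \R[/math],

[[math]] f''(x) = \frac{1}{x^2} , \qquad \nabla^3 f(x)[h,h,h] = - \frac{2 h^3}{x^3} \leq \frac{2 |h|^3}{x^3} = 2 \left( \frac{h^2}{x^2} \right)^{3/2} = 2 \|h\|_x^3 . [[/math]]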

Definition

Let [math]f[/math] be a standard self-concordant function on [math]\mathcal{X}[/math]. For [math]x \in \mathrm{int}(\mathcal{X})[/math], we say that [math]\lambda_f(x) = \|\nabla f(x)\|_x^*[/math] is the {\em Newton decrement} of [math]f[/math] at [math]x[/math].

An important inequality is that for [math]x[/math] such that [math]\lambda_f(x) \lt 1[/math], and [math]x^* = \argmin f(x)[/math], one has

[[math]] \begin{equation} \label{eq:trucipm3} \|x - x^*\|_x \leq \frac{\lambda_f(x)}{1 - \lambda_f(x)} , \end{equation} [[/math]]

see [Equation 4.1.18, [15]]. We state the next theorem without a proof, see also [Theorem 4.1.14, [15]].

Theorem

Let [math]f[/math] be a standard self-concordant function on [math]\mathcal{X}[/math], and [math]x \in \mathrm{int}(\mathcal{X})[/math] such that [math]\lambda_f(x) \leq 1/4[/math], then

[[math]] \lambda_f\Big(x - [\nabla^2 f(x)]^{-1} \nabla f(x)\Big) \leq 2 \lambda_f(x)^2 . [[/math]]

In other words the above theorem states that, if initialized at a point [math]x_0[/math] such that [math]\lambda_f(x_0) \leq 1/4[/math], then Newton's iterates satisfy [math]\lambda_f(x_{k+1}) \leq 2 \lambda_f(x_k)^2[/math]. Thus, Newton's region of quadratic convergence for self-concordant functions can be described as a "Newton decrement ball" [math]\{x : \lambda_f(x) \leq 1/4\}[/math]. In particular by taking the barrier to be a self-concordant function we have now resolved Step (1) of the plan described in Section The barrier method.

[math]\nu[/math]-self-concordant barriers

We deal here with Step (2) of the plan described in Section The barrier method. Given Theorem we want [math]t'[/math] to be as large as possible and such that

[[math]] \begin{equation} \label{eq:trucipm1} \lambda_{F_{t'}}(x^*(t) ) \leq 1/4 . \end{equation} [[/math]]

Since the Hessian of [math]F_{t'}[/math] is the Hessian of [math]F[/math], one has

[[math]] \lambda_{F_{t'}}(x^*(t) ) = \|t' c + \nabla F(x^*(t)) \|_{x^*(t)}^* . [[/math]]

Observe that, by first order optimality, one has [math]t c + \nabla F(x^*(t)) = 0,[/math] which yields

[[math]] \begin{equation} \label{eq:trucipm11} \lambda_{F_{t'}}(x^*(t) ) = (t'-t) \|c\|^*_{x^*(t)} . \end{equation} [[/math]]

Thus taking

[[math]] \begin{equation} \label{eq:trucipm2} t' = t + \frac{1}{4 \|c\|^*_{x^*(t)}} \end{equation} [[/math]]

immediately yields \eqref{eq:trucipm1}. In particular with the value of [math]t'[/math] given in \eqref{eq:trucipm2} the Newton's method on [math]F_{t'}[/math] initialized at [math]x^*(t)[/math] will converge quadratically fast to [math]x^*(t')[/math]. It remains to verify that by iterating \eqref{eq:trucipm2} one obtains a sequence diverging to infinity, and to estimate the rate of growth. Thus one needs to control [math]\|c\|^*_{x^*(t)} = \frac1{t} \|\nabla F(x^*(t))\|_{x^*(t)}^*[/math]. Luckily there is a natural class of functions for which one can control [math]\|\nabla F(x)\|_x^*[/math] uniformly over [math]x[/math]. This is the set of functions such that

[[math]] \begin{equation} \label{eq:nu} \nabla^2 F(x) \succeq \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} . \end{equation} [[/math]]

Indeed in that case one has:

[[math]] \begin{eqnarray*} \|\nabla F(x)\|_x^* & = & \sup_{h : h^{\top} \nabla^2 F(x) h \leq 1} \nabla F(x)^{\top} h \\ & \leq & \sup_{h : h^{\top} \left( \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} \right) h \leq 1} \nabla F(x)^{\top} h \\ & = & \sqrt{\nu} . \end{eqnarray*} [[/math]]

Thus a safe choice to increase the penalization parameter is [math]t' = \left(1 + \frac1{4\sqrt{\nu}}\right) t[/math]. Note that the condition \eqref{eq:nu} can also be written as the fact that the function [math]F[/math] is [math]\frac1{\nu}[/math]-exp-concave, that is [math]x \mapsto \exp(- \frac1{\nu} F(x))[/math] is concave. We arrive at the following definition.

Definition

[math]F[/math] is a [math]\nu[/math]-self-concordant barrier if it is a standard self-concordant function, and it is [math]\frac1{\nu}[/math]-exp-concave.
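The exp-concavity condition is also easy to check directly for the logarithmic barrier: for [math]F(x) = -\log x[/math] on [math]\R_+[/math],

[[math]] \nabla^2 F(x) = \frac{1}{x^2} = \left( - \frac{1}{x} \right)^2 = \nabla F(x) [\nabla F(x)]^{\top} , [[/math]]

so \eqref{eq:nu} holds with equality for [math]\nu = 1[/math]; equivalently, [math]\exp(-F(x)) = x[/math] is affine, hence concave.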

Again the canonical example is the logarithmic function, [math]x \mapsto - \log x[/math], which is a [math]1[/math]-self-concordant barrier for the set [math]\R_{+}[/math]. We state the next theorem without a proof (see [19] for more on this result).

Theorem

Let [math]\mathcal{X} \subset \R^n[/math] be a closed convex set with non-empty interior. There exists [math]F[/math] which is a [math](c \ n)[/math]-self-concordant barrier for [math]\mathcal{X}[/math] (where [math]c[/math] is some universal constant).

A key property of [math]\nu[/math]-self-concordant barriers is the following inequality:

[[math]] \begin{equation} \label{eq:key} c^{\top} x^*(t) - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu}{t} , \end{equation} [[/math]]

see [Equation (4.2.17), [15]]. More generally using \eqref{eq:key} together with \eqref{eq:trucipm3} one obtains

[[math]] \begin{eqnarray} c^{\top} y- \min_{x \in \mathcal{X}} c^{\top} x & \leq & \frac{\nu}{t} + c^{\top} (y - x^*(t)) \notag \\ & = & \frac{\nu}{t} + \frac{1}{t} (\nabla F_t(y) - \nabla F(y))^{\top} (y - x^*(t)) \notag \\ & \leq & \frac{\nu}{t} + \frac{1}{t} \|\nabla F_t(y) - \nabla F(y)\|_y^* \cdot \|y - x^*(t)\|_y \notag \\ & \leq & \frac{\nu}{t} + \frac{1}{t} (\lambda_{F_t}(y) + \sqrt{\nu})\frac{\lambda_{F_t} (y)}{1 - \lambda_{F_t}(y)} \label{eq:trucipm4} \end{eqnarray} [[/math]]

In the next section we describe a precise algorithm based on the ideas we developed above. As we will see one cannot ensure to be exactly on the central path, and thus it is useful to generalize the identity \eqref{eq:trucipm11} for a point [math]x[/math] close to the central path. We do this as follows:

[[math]] \begin{eqnarray} \lambda_{F_{t'}}(x) & = & \|t' c + \nabla F(x)\|_x^* \notag \\ & = & \|(t' / t) (t c + \nabla F(x)) + (1- t'/t) \nabla F(x)\|_x^* \notag \\ & \leq & \frac{t'}{t} \lambda_{F_t}(x) + \left(\frac{t'}{t} - 1\right) \sqrt{\nu} .\label{eq:trucipm12} \end{eqnarray} [[/math]]


Path-following scheme

We can now formally describe and analyze the most basic IPM called the {\em path-following scheme}. Let [math]F[/math] be a [math]\nu[/math]-self-concordant barrier for [math]\cX[/math]. Assume that one can find [math]x_0[/math] such that [math]\lambda_{F_{t_0}}(x_0) \leq 1/4[/math] for some small value [math]t_0 \gt 0[/math] (we describe a method to find [math]x_0[/math] at the end of this subsection). Then for [math]k \geq 0[/math], let

[[math]] \begin{eqnarray*} & & t_{k+1} = \left(1 + \frac1{13\sqrt{\nu}}\right) t_k ,\\ & & x_{k+1} = x_k - [\nabla^2 F(x_k)]^{-1} (t_{k+1} c + \nabla F(x_k) ) . \end{eqnarray*} [[/math]]

The next theorem shows that after [math]O\left( \sqrt{\nu} \log \frac{\nu}{t_0 \epsilon} \right)[/math] iterations of the path-following scheme one obtains an [math]\epsilon[/math]-optimal point.

Theorem

The path-following scheme described above satisfies

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \exp\left( - \frac{k}{1+13\sqrt{\nu}} \right) . [[/math]]


Show Proof

We show that the iterates [math](x_k)_{k \geq 0}[/math] remain close to the central path [math](x^*(t_k))_{k \geq 0}[/math]. Precisely one can easily prove by induction that

[[math]] \lambda_{F_{t_k}}(x_k) \leq 1/4 . [[/math]]
Indeed using Theorem and equation \eqref{eq:trucipm12} one immediately obtains

[[math]] \begin{eqnarray*} \lambda_{F_{t_{k+1}}}(x_{k+1}) & \leq & 2 \lambda_{F_{t_{k+1}}}(x_k)^2 \\ & \leq & 2 \left(\frac{t_{k+1}}{t_k} \lambda_{F_{t_k}}(x_k) + \left(\frac{t_{k+1}}{t_k} - 1\right) \sqrt{\nu}\right)^2 \\ & \leq & 1/4 , \end{eqnarray*} [[/math]]
where we used in the last inequality that [math]t_{k+1} / t_k = 1 + \frac1{13\sqrt{\nu}}[/math] and [math]\nu \geq 1[/math]. Thus using \eqref{eq:trucipm4} one obtains

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu + \sqrt{\nu} / 3 + 1/12}{t_k} \leq \frac{2 \nu}{t_k} . [[/math]]
Observe that [math]t_{k} = \left(1 + \frac1{13\sqrt{\nu}}\right)^{k} t_0[/math], which finally yields

[[math]] c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{2 \nu}{t_0} \left(1 + \frac1{13\sqrt{\nu}}\right)^{- k}. [[/math]]

At this point we still need to explain how one can get close to an initial point [math]x^*(t_0)[/math] of the central path. This can be done with the following rather clever trick. Assume that one has some point [math]y_0 \in \cX[/math]. The observation is that [math]y_0[/math] is on the central path at [math]t=1[/math] for the problem where [math]c[/math] is replaced by [math]- \nabla F(y_0)[/math]. Now instead of following this central path as [math]t \to +\infty[/math], one follows it as [math]t \to 0[/math]. Indeed for [math]t[/math] small enough the central paths for [math]c[/math] and for [math]- \nabla F(y_0)[/math] will be very close. Thus we iterate the following equations, starting with [math]t_0' = 1[/math],

[[math]] \begin{eqnarray*} & & t_{k+1}' = \left(1 - \frac1{13\sqrt{\nu}}\right) t_k' ,\\ & & y_{k+1} = y_k - [\nabla^2 F(y_k)]^{-1} (- t_{k+1}' \nabla F(y_0) + \nabla F(y_k) ) . \end{eqnarray*} [[/math]]

A straightforward analysis shows that for [math]k = O(\sqrt{\nu} \log \nu)[/math], which corresponds to [math]t_k'=1/\nu^{O(1)}[/math], one obtains a point [math]y_k[/math] such that [math]\lambda_{F_{t_k'}}(y_k) \leq 1/4[/math]. In other words one can initialize the path-following scheme with [math]t_0 = t_k'[/math] and [math]x_0 = y_k[/math].
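To make the scheme concrete, here is a minimal sketch (assuming NumPy) for the toy problem [math]\min c^{\top} x[/math] over the box [math][0,1]^n[/math], using the logarithmic barrier [math]F(x) = -\sum_i (\log x_i + \log(1-x_i))[/math], a [math]2n[/math]-self-concordant barrier. The initialization trick above is not needed here since the analytic center of the box is known explicitly ([math]x_0 = (1/2, \hdots, 1/2)[/math]), so it suffices to take [math]t_0[/math] small enough that [math]\lambda_{F_{t_0}}(x_0) \leq 1/4[/math]; the choice of [math]t_0[/math] below is one such illustrative choice.

```python
import numpy as np

def path_following_lp_box(c, iters):
    """Basic path-following IPM for min c^T x over the box [0,1]^n,
    with barrier F(x) = -sum_i (log x_i + log(1 - x_i)), nu = 2n."""
    c = np.asarray(c, dtype=float)
    n = len(c)
    nu = 2.0 * n
    x = 0.5 * np.ones(n)                   # analytic center: argmin F
    t = 0.25 / (1.0 + np.linalg.norm(c))   # small t0 so lambda_{F_{t0}}(x0) <= 1/4
    for _ in range(iters):
        t *= 1.0 + 1.0 / (13.0 * np.sqrt(nu))
        grad = t * c - 1.0 / x + 1.0 / (1.0 - x)       # nabla F_t(x)
        hess_diag = 1.0 / x ** 2 + 1.0 / (1.0 - x) ** 2  # nabla^2 F(x), diagonal
        x = x - grad / hess_diag                       # one Newton step
    return x, t
```

Per the theorem, the iterates stay in the interior of the box and the optimality gap is bounded by [math]2\nu/t_k[/math], so a few hundred (cheap) iterations suffice on small instances.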

IPMs for LPs and SDPs

We have seen that, roughly, the complexity of interior point methods with a [math]\nu[/math]-self-concordant barrier is [math]O\left(M \sqrt{\nu} \log \frac{\nu}{\epsilon} \right)[/math], where [math]M[/math] is the complexity of computing a Newton direction (which can be done by computing and inverting the Hessian of the barrier). Thus the efficiency of the method is directly related to the {\em form} of the self-concordant barrier that one can construct for [math]\mathcal{X}[/math]. It turns out that for LPs and SDPs one has particularly nice self-concordant barriers. Indeed one can show that [math]F(x) = - \sum_{i=1}^n \log x_i[/math] is an [math]n[/math]-self-concordant barrier on [math]\R_{+}^n[/math], and [math]F(x) = - \log \mathrm{det}(X)[/math] is an [math]n[/math]-self-concordant barrier on [math]\mathbb{S}_{+}^n[/math]. See also [20] for a recent improvement of the basic logarithmic barrier for LPs. There is one important issue that we overlooked so far. In most interesting cases LPs and SDPs come with {\em equality constraints}, resulting in a set of constraints [math]\cX[/math] with empty interior. From a theoretical point of view there is an easy fix, which is to reparametrize the problem as to enforce the variables to live in the subspace spanned by [math]\cX[/math]. This modification also has algorithmic consequences, as the evaluation of the Newton direction will now be different. In fact, rather than doing a reparametrization, one can simply search for Newton directions such that the updated point will stay in [math]\cX[/math]. In other words one has now to solve a convex quadratic optimization problem under linear equality constraints. Luckily using Lagrange multipliers one can find a closed form solution to this problem, and we refer to previous references for more details.

General references

Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].

References

  1. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  2. Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
  3. S. Wright, R. Nowak and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
  4. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
  5. F. Bach, R. Jenatton, J. Mairal and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
  6. J. Duchi, S. Shalev-Shwartz, Y. Singer and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
  7. L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
  8. Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
  9. A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  10. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
  11. A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In Optimization for Machine Learning, MIT Press, 2011.
  12. A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization, II: Utilizing problem's structure. In Optimization for Machine Learning, MIT Press, 2011.
  13. N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4):373–395, 1984.
  14. Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
  15. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
  16. J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. SIAM, 2001.
  17. A. Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 2004.
  18. J. Nocedal and S. Wright. Numerical Optimization. Springer, 2006.
  19. S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. arXiv:1412.1587, 2014.
  20. Y.-T. Lee and A. Sidford. Path finding I: Solving linear programs with Õ(√rank) linear system solves. arXiv:1312.6677, 2013.

Notes

  1. We restrict to unconstrained minimization for the sake of simplicity. One can extend the discussion to constrained minimization by using ideas from Section Gradient descent for smooth functions.
  2. Observe that the duality gap is the sum of the primal gap [math]\max_{y \in \mathcal{Y}} \phi(\tx,y) - \phi(x^*,y^*)[/math] and the dual gap [math]\phi(x^*,y^*) - \min_{x \in \mathcal{X}} \phi(x, \ty)[/math].