Convex optimization in finite dimension

[math] \newcommand{\LS}{\mathtt{line\_search}} \newcommand{\mI}{\mathrm{I}} \newcommand{\mB}{\mathrm{B}} \newcommand{\conv}{\mathrm{conv}} \newcommand{\inte}{\mathrm{int}} \newcommand{\tg}{\tilde{g}} \newcommand{\tx}{\tilde{x}} \newcommand{\ty}{\tilde{y}} \newcommand{\tz}{\tilde{z}} \newcommand{\id}{\mathrm{id}} \newcommand{\K}{\mathrm{KL}} \newcommand{\kl}{\mathrm{kl}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\tr}{\mathrm{Tr}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\KL}{\mathrm{KL}} \newcommand{\LG}{\overline{\log}(K)} \newcommand{\ocP}{\overline{\mathcal{P}}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cD}{\mathcal{D}} \newcommand{\oD}{\overline{\mathcal{D}}} \newcommand{\oR}{\overline{R}} \newcommand{\wh}{\widehat} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\todo}{{\bf TO DO } } \newcommand{\Ber}{\mathop{\mathrm{Ber}}} \newcommand{\beq}{\begin{equation}} \newcommand{\eeq}{\end{equation}} \newcommand{\beqa}{\begin{eqnarray}} \newcommand{\eeqa}{\end{eqnarray}} \newcommand{\beqan}{\begin{eqnarray*}} \newcommand{\eeqan}{\end{eqnarray*}} \newcommand{\qed}{\hfill\BlackBox} \newcommand{\charfct}{\ds1} % \newcommand{\Fcal}{\mathcal{F}} \newcommand{\Xcal}{\mathcal{X}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Gcal}{\mathcal{G}} \newcommand{\Nat}{\mathbb{N}} \newcommand{\mathds}{\mathbb}[/math]

Let [math]\mathcal{X} \subset \R^n[/math] be a convex body (that is, a compact convex set with non-empty interior), and [math]f : \mathcal{X} \rightarrow [-B,B][/math] be a continuous and convex function. Let [math]r, R \gt 0[/math] be such that [math]\mathcal{X}[/math] is contained in a Euclidean ball of radius [math]R[/math] and contains a Euclidean ball of radius [math]r[/math]. In this chapter we give several black-box algorithms to solve

[[math]] \begin{align*} & \mathrm{min.} \; f(x) \\ & \text{s.t.} \; x \in \cX . \end{align*} [[/math]]

As we will see these algorithms have an oracle complexity which is linear (or quadratic) in the dimension, hence the title of the chapter (in the next chapter the oracle complexity will be {\em independent} of the dimension). An interesting feature of the methods discussed here is that they only need a separation oracle for the constraint set [math]\cX[/math]. In the literature such algorithms are often referred to as {\em cutting plane methods}. In particular these methods can be used to {\em find} a point [math]x \in \cX[/math] given only a separation oracle for [math]\cX[/math] (this is also known as the {\em feasibility problem}).

The center of gravity method

We consider the following simple iterative algorithm[Notes 1]: let [math]\cS_1= \cX[/math], and for [math]t \geq 1[/math] do the following:

  • Compute
    [[math]] \begin{equation} c_t = \frac{1}{\mathrm{vol}(\cS_t)} \int_{x \in \cS_t} x dx . \end{equation} [[/math]]
  • Query the first order oracle at [math]c_t[/math] and obtain [math]w_t \in \partial f (c_t)[/math]. Let
    [[math]] \cS_{t+1} = \cS_t \cap \{x \in \R^n : (x-c_t)^{\top} w_t \leq 0\} . [[/math]]

If stopped after [math]t[/math] queries to the first order oracle then we use [math]t[/math] queries to a zeroth order oracle to output

[[math]] x_t\in \argmin_{1 \leq r \leq t} f(c_r) . [[/math]]

This procedure is known as the {\em center of gravity method}; it was discovered independently on both sides of the Wall by [1] and [2].
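Since the exact centroid [math]c_t[/math] is in general out of computational reach (see the discussion below), the following Python sketch only illustrates the scheme: it maintains [math]\cS_t[/math] implicitly as [math]\cX[/math] intersected with the accumulated half-spaces and approximates [math]c_t[/math] by rejection sampling from a bounding box. The oracles f, grad_f, in_X and the box bounds are user-supplied placeholders, and rejection sampling becomes very inefficient as [math]\mathrm{vol}(\cS_t)[/math] shrinks; the random walk techniques of Section Random walk based methods are the principled way to approximate the centroid.

```python
import numpy as np

def center_of_gravity(f, grad_f, in_X, box_lo, box_hi, T, n_samples=20000, seed=0):
    """Monte Carlo sketch of the center of gravity method (illustration only).

    f, grad_f : zeroth / first order oracles (grad_f returns a subgradient).
    in_X      : membership oracle for the convex body X (a stand-in for knowing X).
    box_lo, box_hi : coordinate-wise bounds of a box containing X.
    """
    rng = np.random.default_rng(seed)
    cuts = []                                  # (w_r, c_r): S_{t+1} = S_t ∩ {x : (x - c_r)^T w_r <= 0}
    best_x, best_val = None, np.inf
    for _ in range(T):
        # approximate the centroid of S_t by rejection sampling from the box
        pts = []
        while len(pts) < n_samples:
            x = rng.uniform(box_lo, box_hi)
            if in_X(x) and all((x - c) @ w <= 0 for (w, c) in cuts):
                pts.append(x)
        c_t = np.mean(pts, axis=0)
        if f(c_t) < best_val:
            best_x, best_val = c_t, f(c_t)
        w_t = grad_f(c_t)
        if np.allclose(w_t, 0):                # c_t is already a minimizer
            return c_t
        cuts.append((w_t, c_t))                # cut away {x : (x - c_t)^T w_t > 0}
    return best_x
```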

Theorem

The center of gravity method satisfies

[[math]] f(x_t) - \min_{x \in \mathcal{X}} f(x) \leq 2 B \left(1 - \frac1{e} \right)^{t/n} . [[/math]]

Before proving this result a few comments are in order. To attain an [math]\epsilon[/math]-optimal point the center of gravity method requires [math]O( n \log (2 B / \epsilon))[/math] queries to both the first and zeroth order oracles. It can be shown that this is the best one can hope for, in the sense that for [math]\epsilon[/math] small enough one needs [math]\Omega(n \log(1/ \epsilon))[/math] calls to the oracle in order to find an [math]\epsilon[/math]-optimal point, see [3] for a formal proof. The rate of convergence given by Theorem is exponentially fast. In the optimization literature this is called a {\em linear rate} as the (estimated) error at iteration [math]t+1[/math] is linearly related to the error at iteration [math]t[/math]. The last and most important comment concerns the computational complexity of the method. It turns out that finding the center of gravity [math]c_t[/math] is a very difficult problem by itself, and we do not have a computationally efficient procedure to carry out this computation in general. In Section Random walk based methods we will discuss a relatively recent (compared to the 50-year-old center of gravity method!) randomized algorithm to approximately compute the center of gravity. This will in turn give a randomized center of gravity method which we will describe in detail. We now turn to the proof of Theorem. We will use the following elementary result from convex geometry:

Lemma ([4])

Let [math]\cK[/math] be a centered convex set, i.e., [math]\int_{x \in \cK} x dx = 0[/math], then for any [math]w \in \R^n, w \neq 0[/math], one has

[[math]] \mathrm{Vol} \left( \cK \cap \{x \in \R^n : x^{\top} w \geq 0\} \right) \geq \frac{1}{e} \mathrm{Vol} (\cK) . [[/math]]
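As a quick illustration of the lemma (not a proof), the Monte Carlo check below uses a planar triangle whose centroid is at the origin: for any direction [math]w[/math] the fraction of the area kept by the cut stays above [math]1/e \approx 0.368[/math] (for a planar simplex the minimum over [math]w[/math] is in fact [math](2/3)^2 = 4/9[/math]). Only numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
# a triangle whose centroid (the mean of its vertices) is the origin
V = np.array([[3.0, 0.0], [-1.0, 2.0], [-2.0, -2.0]])
assert np.allclose(V.mean(axis=0), 0)

# sample points uniformly from the triangle (standard barycentric trick)
r1, r2 = rng.uniform(size=(2, 200000))
s = np.sqrt(r1)
pts = (1 - s)[:, None] * V[0] + (s * (1 - r2))[:, None] * V[1] + (s * r2)[:, None] * V[2]

# every halfspace through the centroid keeps at least a 1/e fraction of the area
for _ in range(100):
    w = rng.standard_normal(2)
    assert np.mean(pts @ w >= 0) >= 1 / np.e - 0.01   # small slack for Monte Carlo error
```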
We now prove Theorem.


Show Proof

Let [math]x^*[/math] be such that [math]f(x^*) = \min_{x \in \mathcal{X}} f(x)[/math]. Since [math]w_t \in \partial f(c_t)[/math] one has

[[math]] f(c_t) - f(x) \leq w_t^{\top} (c_t - x) . [[/math]]
and thus

[[math]] \begin{equation} \label{eq:centerofgravity1} \cS_{t} \setminus \cS_{t+1} \subset \{x \in \cX : (x-c_t)^{\top} w_t \gt 0\} \subset \{x \in \cX : f(x) \gt f(c_t)\} , \end{equation} [[/math]]
which clearly implies that one can never remove the optimal point from our sets in consideration, that is [math]x^* \in \cS_t[/math] for any [math]t[/math]. Without loss of generality we can assume that we always have [math]w_t \neq 0[/math], for otherwise one would have [math]f(c_t) = f(x^*)[/math] which immediately concludes the proof. Now using that [math]w_t \neq 0[/math] for any [math]t[/math] and Lemma one clearly obtains

[[math]] \mathrm{vol}(\cS_{t+1}) \leq \left(1 - \frac1{e} \right)^t \mathrm{vol}(\cX) . [[/math]]
For [math]\epsilon \in [0,1][/math], let [math]\mathcal{X}_{\epsilon} = \{(1-\epsilon) x^* + \epsilon x, x \in \mathcal{X}\}[/math]. Note that [math]\mathrm{vol}(\mathcal{X}_{\epsilon}) = \epsilon^n \mathrm{vol}(\mathcal{X})[/math]. These volume computations show that for [math]\epsilon \gt \left(1 - \frac1{e} \right)^{t/n}[/math] one has [math]\mathrm{vol}(\mathcal{X}_{\epsilon}) \gt \mathrm{vol}(\cS_{t+1})[/math]. In particular this implies that for [math]\epsilon \gt \left(1 - \frac1{e} \right)^{t/n}[/math], there must exist a time [math]r \in \{1,\dots, t\}[/math], and [math]x_{\epsilon} \in \mathcal{X}_{\epsilon}[/math], such that [math]x_{\epsilon} \in \cS_{r}[/math] and [math]x_{\epsilon} \not\in \cS_{r+1}[/math]. In particular by \eqref{eq:centerofgravity1} one has [math]f(c_r) \lt f(x_{\epsilon})[/math]. On the other hand by convexity of [math]f[/math] one clearly has [math]f(x_{\epsilon}) \leq f(x^*) + 2 \epsilon B[/math]. This concludes the proof.

The ellipsoid method

Recall that an ellipsoid is a convex set of the form

[[math]] \mathcal{E} = \{x \in \R^n : (x - c)^{\top} H^{-1} (x-c) \leq 1 \} , [[/math]]

where [math]c \in \R^n[/math], and [math]H[/math] is a symmetric positive definite matrix. Geometrically [math]c[/math] is the center of the ellipsoid, and the semi-axes of [math]\cE[/math] are given by the eigenvectors of [math]H[/math], with lengths given by the square roots of the corresponding eigenvalues. We give now a simple geometric lemma, which is at the heart of the ellipsoid method.

Lemma

Let [math]\mathcal{E}_0 = \{x \in \R^n : (x - c_0)^{\top} H_0^{-1} (x-c_0) \leq 1 \}[/math]. For any [math]w \in \R^n[/math], [math]w \neq 0[/math], there exists an ellipsoid [math]\mathcal{E}[/math] such that

[[math]] \begin{equation} \mathcal{E} \supset \{x \in \mathcal{E}_0 : w^{\top} (x-c_0) \leq 0\} , \label{eq:ellipsoidlemma1} \end{equation} [[/math]]
and

[[math]] \begin{equation} \mathrm{vol}(\mathcal{E}) \leq \exp \left(- \frac{1}{2 n} \right) \mathrm{vol}(\mathcal{E}_0) . \label{eq:ellipsoidlemma2} \end{equation} [[/math]]
Furthermore for [math]n \geq 2[/math] one can take [math]\cE = \{x \in \R^n : (x - c)^{\top} H^{-1} (x-c) \leq 1 \}[/math] where

[[math]] \begin{align} & c = c_0 - \frac{1}{n+1} \frac{H_0 w}{\sqrt{w^{\top} H_0 w}} , \label{eq:ellipsoidlemma3}\\ & H = \frac{n^2}{n^2-1} \left(H_0 - \frac{2}{n+1} \frac{H_0 w w^{\top} H_0}{w^{\top} H_0 w} \right) . \label{eq:ellipsoidlemma4} \end{align} [[/math]]


Show Proof

For [math]n=1[/math] the result is obvious, in fact we even have [math]\mathrm{vol}(\mathcal{E}) \leq \frac12 \mathrm{vol}(\mathcal{E}_0) .[/math] For [math]n \geq 2[/math] one can simply verify that the ellipsoid given by \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4} satisfies the required properties \eqref{eq:ellipsoidlemma1} and \eqref{eq:ellipsoidlemma2}. Rather than bluntly doing these computations we will show how to derive \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4}. As a by-product this will also show that the ellipsoid defined by \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4} is the unique ellipsoid of minimal volume that satisfies \eqref{eq:ellipsoidlemma1}. Let us first focus on the case where [math]\mathcal{E}_0[/math] is the Euclidean ball [math]\cB = \{x \in \R^n : x^{\top} x \leq 1\}[/math]. We momentarily assume that [math]w[/math] is a unit norm vector. By drawing a quick picture, one can see that it makes sense to look for an ellipsoid [math]\mathcal{E}[/math] that would be centered at [math]c= - t w[/math], with [math]t \in [0,1][/math] (presumably [math]t[/math] will be small), and such that one principal direction is [math]w[/math] (with inverse squared semi-axis [math]a \gt 0[/math]), and the other principal directions are all orthogonal to [math]w[/math] (with the same inverse squared semi-axes [math]b \gt 0[/math]). In other words we are looking for [math]\mathcal{E}= \{x: (x - c)^{\top} H^{-1} (x-c) \leq 1 \}[/math] with

[[math]] c = - t w, \; \text{and} \; H^{-1} = a w w^{\top} + b (\mI_n - w w^{\top} ) . [[/math]]
Now we have to express our constraints on the fact that [math]\mathcal{E}[/math] should contain the half Euclidean ball [math]\{x \in \cB : x^{\top} w \leq 0\}[/math]. Since we are also looking for [math]\mathcal{E}[/math] to be as small as possible, it makes sense to ask for [math]\mathcal{E}[/math] to "touch" the Euclidean ball, both at [math]x = - w[/math], and at the equator [math]\partial \cB \cap w^{\perp}[/math]. The former condition can be written as:

[[math]] (- w - c)^{\top} H^{-1} (- w - c) = 1 \Leftrightarrow (t-1)^2 a = 1 , [[/math]]
while the latter is expressed as:

[[math]] \forall y \in \partial \cB \cap w^{\perp}, (y - c)^{\top} H^{-1} (y - c) = 1 \Leftrightarrow b + t^2 a = 1 . [[/math]]
As one can see from the above two equations, we are still free to choose any value for [math]t \in [0,1/2)[/math] (the fact that we need [math]t \lt 1/2[/math] comes from [math]b=1 - \left(\frac{t}{t-1}\right)^2 \gt 0[/math]). Quite naturally we take the value that minimizes the volume of the resulting ellipsoid. Note that

[[math]] \frac{\mathrm{vol}(\mathcal{E})}{\mathrm{vol}(\cB)} = \frac{1}{\sqrt{a}} \left(\frac{1}{\sqrt{b}}\right)^{n-1} = \frac{1}{\sqrt{\frac{1}{(1-t)^2}\left(1 - \left(\frac{t}{1-t}\right)^2\right)^{n-1}}} = \frac{1}{\sqrt{f\left(\frac{1}{1-t}\right)}} , [[/math]]
where [math]f(h) = h^2 (2 h - h^2)^{n-1}[/math]. Elementary computations show that the maximum of [math]f[/math] (on [math][1,2][/math]) is attained at [math]h = 1+ \frac{1}{n}[/math] (which corresponds to [math]t=\frac{1}{n+1}[/math]), and the value is

[[math]] \left(1+\frac{1}{n}\right)^2 \left(1 - \frac{1}{n^2} \right)^{n-1} \geq \exp \left(\frac{1}{n} \right), [[/math]]
where the lower bound follows again from elementary computations. Thus we showed that, for [math]\cE_0 = \cB[/math], \eqref{eq:ellipsoidlemma1} and \eqref{eq:ellipsoidlemma2} are satisfied with the ellipsoid given by the set of points [math]x[/math] satisfying:

[[math]] \begin{equation} \label{eq:ellipsoidlemma5} \left(x + \frac{w/\|w\|_2}{n+1}\right)^{\top} \left(\frac{n^2-1}{n^2} \mI_n + \frac{2(n+1)}{n^2} \frac{w w^{\top}}{\|w\|_2^2} \right) \left(x + \frac{w/\|w\|_2}{n+1} \right) \leq 1 . \end{equation} [[/math]]


We consider now an arbitrary ellipsoid [math]\cE_0 = \{x \in \R^n : (x - c_0)^{\top} H_0^{-1} (x-c_0) \leq 1 \}[/math]. Let [math]\Phi(x) = c_0 + H_0^{1/2} x[/math], then clearly [math]\cE_0 = \Phi(\cB)[/math] and [math]\{x : w^{\top}(x - c_0) \leq 0\} = \Phi(\{x : (H_0^{1/2} w)^{\top} x \leq 0\})[/math]. Thus in this case the image by [math]\Phi[/math] of the ellipsoid given in \eqref{eq:ellipsoidlemma5} with [math]w[/math] replaced by [math]H_0^{1/2} w[/math] will satisfy \eqref{eq:ellipsoidlemma1} and \eqref{eq:ellipsoidlemma2}. It is easy to see that this corresponds to an ellipsoid defined by

[[math]] \begin{align} & c = c_0 - \frac{1}{n+1} \frac{H_0 w}{\sqrt{w^{\top} H_0 w}} , \notag \\ & H^{-1} = \left(1 - \frac{1}{n^2}\right) H_0^{-1} + \frac{2(n+1)}{n^2} \frac{w w^{\top}}{w^{\top} H_0 w} . \label{eq:ellipsoidlemma6} \end{align} [[/math]]
Applying the Sherman-Morrison formula to \eqref{eq:ellipsoidlemma6} one can recover \eqref{eq:ellipsoidlemma4}, which concludes the proof.
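The closed-form update \eqref{eq:ellipsoidlemma3}-\eqref{eq:ellipsoidlemma4} is also easy to check numerically. The numpy sketch below, on random data, verifies the volume bound \eqref{eq:ellipsoidlemma2} through [math]\mathrm{vol}(\cE)/\mathrm{vol}(\cE_0) = \sqrt{\det H / \det H_0}[/math], and the containment \eqref{eq:ellipsoidlemma1} on points sampled from the half-ellipsoid.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
H0 = M @ M.T + n * np.eye(n)                  # a positive definite matrix
c0 = rng.standard_normal(n)
w = rng.standard_normal(n)

# minimal-volume ellipsoid containing {x in E0 : w^T (x - c0) <= 0}
Hw = H0 @ w
wHw = w @ Hw
c = c0 - Hw / ((n + 1) * np.sqrt(wHw))
H = (n**2 / (n**2 - 1)) * (H0 - (2 / (n + 1)) * np.outer(Hw, Hw) / wHw)

# volume ratio vol(E)/vol(E0) = sqrt(det H / det H0) <= exp(-1/(2n))
assert np.sqrt(np.linalg.det(H) / np.linalg.det(H0)) <= np.exp(-1 / (2 * n))

# containment check on random points of the half-ellipsoid
L = np.linalg.cholesky(H0)
Hinv = np.linalg.inv(H)
for _ in range(10000):
    u = rng.standard_normal(n)
    u *= rng.uniform() ** (1 / n) / np.linalg.norm(u)   # uniform in the unit ball
    x = c0 + L @ u                                       # uniform in E0
    if w @ (x - c0) <= 0:                                # keep the half-ellipsoid
        assert (x - c) @ Hinv @ (x - c) <= 1 + 1e-9
```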

We describe now the ellipsoid method, which only assumes a separation oracle for the constraint set [math]\cX[/math] (in particular it can be used to solve the feasibility problem mentioned at the beginning of the chapter). Let [math]\cE_0[/math] be the Euclidean ball of radius [math]R[/math] that contains [math]\cX[/math], and let [math]c_0[/math] be its center. Denote also [math]H_0=R^2 \mI_n[/math]. For [math]t \geq 0[/math] do the following:

  • If [math]c_t \not\in \cX[/math] then call the separation oracle to obtain a separating hyperplane [math]w_t \in \R^n[/math] such that [math]\cX \subset \{x : (x- c_t)^{\top} w_t \leq 0\}[/math], otherwise call the first order oracle at [math]c_t[/math] to obtain [math]w_t \in \partial f (c_t)[/math].
  • Let [math]\cE_{t+1} = \{x : (x - c_{t+1})^{\top} H_{t+1}^{-1} (x-c_{t+1}) \leq 1 \}[/math] be the ellipsoid given in Lemma that contains [math]\{x \in \mathcal{E}_t : (x- c_t)^{\top} w_t \leq 0\}[/math], that is
    [[math]] \begin{align*} & c_{t+1} = c_{t} - \frac{1}{n+1} \frac{H_t w_t}{\sqrt{w_t^{\top} H_t w_t}} ,\\ & H_{t+1} = \frac{n^2}{n^2-1} \left(H_t - \frac{2}{n+1} \frac{H_t w_t w_t^{\top} H_t}{w_t^{\top} H_t w_t} \right) . \end{align*} [[/math]]

If stopped after [math]t[/math] iterations and if [math]\{c_1, \dots, c_t\} \cap \cX \neq \emptyset[/math], then we use the zeroth order oracle to output

[[math]] x_t\in \argmin_{c \in \{c_1, \dots, c_t\} \cap \cX} f(c) . [[/math]]
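A compact numpy sketch of the iteration above might look as follows. The oracles f, grad_f and the separation oracle sep_X (returning None for feasible points and a separating vector otherwise) are user-supplied assumptions; this is an illustration of the scheme for [math]n \geq 2[/math], not a robust solver (no care is taken about the numerical conditioning of [math]H_t[/math]).

```python
import numpy as np

def ellipsoid_method(f, grad_f, sep_X, c0, R, T):
    """Sketch of the ellipsoid method for min f(x) s.t. x in X.

    sep_X(x) : None if x is in X, otherwise w with X ⊂ {y : (y - x)^T w <= 0}.
    c0, R    : center and radius of a Euclidean ball containing X.
    """
    n = len(c0)                                    # requires n >= 2
    c, H = np.array(c0, dtype=float), R**2 * np.eye(n)
    best_x, best_val = None, np.inf
    for _ in range(T):
        w = sep_X(c)
        if w is None:                              # c is feasible: use a subgradient cut
            if f(c) < best_val:
                best_x, best_val = c.copy(), f(c)
            w = grad_f(c)
            if np.allclose(w, 0):
                return c
        # minimal-volume ellipsoid containing the half-ellipsoid (cf. the lemma)
        Hw = H @ w
        wHw = w @ Hw
        c = c - Hw / ((n + 1) * np.sqrt(wHw))
        H = (n**2 / (n**2 - 1)) * (H - (2 / (n + 1)) * np.outer(Hw, Hw) / wHw)
    return best_x

# toy usage: minimize ||x - (2,2)||^2 over the unit Euclidean ball
f = lambda x: np.sum((x - 2.0) ** 2)
grad_f = lambda x: 2 * (x - 2.0)
sep_X = lambda x: (x if np.linalg.norm(x) > 1 else None)
x_hat = ellipsoid_method(f, grad_f, sep_X, c0=np.zeros(2), R=2.0, T=100)
# x_hat is close to (1/sqrt(2), 1/sqrt(2))
```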

The following rate of convergence can be proved with the exact same argument as for Theorem (observe that at step [math]t[/math] one can remove a point in [math]\cX[/math] from the current ellipsoid only if [math]c_t \in \cX[/math]).

Theorem

For [math]t \geq 2n^2 \log(R/r)[/math] the ellipsoid method satisfies [math]\{c_1, \dots, c_t\} \cap \cX \neq \emptyset[/math] and

[[math]] f(x_t) - \min_{x \in \mathcal{X}} f(x) \leq \frac{2 B R}{r} \exp\left( - \frac{t}{2 n^2}\right) . [[/math]]

We observe that the oracle complexity of the ellipsoid method is much worse than that of the center of gravity method: the former needs [math]O(n^2 \log(1/\epsilon))[/math] calls to the oracles while the latter requires only [math]O(n \log(1/\epsilon))[/math] calls. However from a computational point of view the situation is much better: in many cases one can derive an efficient separation oracle, while the center of gravity method is basically always intractable. This is for instance the case in the context of LPs and SDPs: with the notation of Section Structured optimization the computational complexity of the separation oracle for LPs is [math]O(m n)[/math] while for SDPs it is [math]O(\max(m,n) n^2)[/math] (we use the fact that the spectral decomposition of a matrix can be done in [math]O(n^3)[/math] operations). This gives an overall complexity of [math]O(\max(m,n) n^3 \log(1/\epsilon))[/math] for LPs and [math]O(\max(m,n^2) n^6 \log(1/\epsilon))[/math] for SDPs. We note however that the ellipsoid method is almost never used in practice, essentially because the method is too rigid to exploit the potential easiness of real problems (e.g., the volume decrease given by \eqref{eq:ellipsoidlemma2} is essentially always tight).

Vaidya's cutting plane method

We focus here on the feasibility problem (it should be clear from the previous sections how to adapt the argument for optimization). We have seen that for the feasibility problem the center of gravity method has an [math]O(n)[/math] oracle complexity and unclear computational complexity (see Section Random walk based methods for more on this), while the ellipsoid method has oracle complexity [math]O(n^2)[/math] and computational complexity [math]O(n^4)[/math]. We describe here the beautiful algorithm of [5][6] which has oracle complexity [math]O(n \log(n))[/math] and computational complexity [math]O(n^4)[/math], thus getting the best of both the center of gravity and the ellipsoid method. In fact the computational complexity can even be improved further, and the recent breakthrough [7] shows that it can essentially (up to logarithmic factors) be brought down to [math]O(n^3)[/math]. This section, while giving a fundamental algorithm, should probably be skipped on a first reading. In particular we use several concepts from the theory of interior point methods which are described in Section Interior point methods.

The volumetric barrier

Let [math]A \in \mathbb{R}^{m \times n}[/math] where the [math]i^{th}[/math] row is [math]a_i \in \mathbb{R}^n[/math], and let [math]b \in \mathbb{R}^m[/math]. We consider the logarithmic barrier [math]F[/math] for the polytope [math]\{x \in \mathbb{R}^n : A x \gt b\}[/math] defined by

[[math]] F(x) = - \sum_{i=1}^m \log(a_i^{\top} x - b_i) . [[/math]]

We also consider the volumetric barrier [math]v[/math] defined by

[[math]] v(x) = \frac{1}{2} \mathrm{logdet}(\nabla^2 F(x) ) . [[/math]]

The intuition is clear: [math]v(x)[/math] is equal to the logarithm of the inverse volume of the Dikin ellipsoid (for the logarithmic barrier) at [math]x[/math]. It will be useful to spell out the Hessian of the logarithmic barrier:

[[math]] \nabla^2 F(x) = \sum_{i=1}^m \frac{a_i a_i^{\top}}{(a_i^{\top} x - b_i)^2} . [[/math]]

Introducing the leverage score

[[math]] \sigma_i(x) = \frac{(\nabla^2 F(x) )^{-1}[a_i, a_i]}{(a_i^{\top} x - b_i)^2} , [[/math]]

one can easily verify that

[[math]] \begin{equation} \label{eq:gradvol} \nabla v(x) = - \sum_{i=1}^m \sigma_i(x) \frac{a_i}{a_i^{\top} x - b_i} , \end{equation} [[/math]]

and

[[math]] \begin{equation} \label{eq:hessianvol} \nabla^2 v(x) \succeq \sum_{i=1}^m \sigma_i(x) \frac{a_i a_i^{\top}}{(a_i^{\top} x - b_i)^2} =: Q(x) . \end{equation} [[/math]]
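For concreteness, these quantities can be computed in a few lines of numpy. The sketch below returns [math]F(x)[/math], [math]\nabla^2 F(x)[/math], the leverage scores [math]\sigma_i(x)[/math], the gradient \eqref{eq:gradvol} of the volumetric barrier and the matrix [math]Q(x)[/math] of \eqref{eq:hessianvol}, for a strictly feasible [math]x[/math] (the polytope [math](A,b)[/math] is whatever the caller supplies).

```python
import numpy as np

def barrier_quantities(A, b, x):
    """Log-barrier and volumetric-barrier quantities for {y : A y > b} at a
    strictly feasible point x."""
    s = A @ x - b                               # slacks a_i^T x - b_i > 0
    F = -np.sum(np.log(s))
    As = A / s[:, None]                         # rows a_i / (a_i^T x - b_i)
    hess_F = As.T @ As                          # sum_i a_i a_i^T / s_i^2
    hess_inv = np.linalg.inv(hess_F)
    sigma = np.einsum('ij,jk,ik->i', As, hess_inv, As)   # leverage scores
    grad_v = -As.T @ sigma                      # gradient of the volumetric barrier
    Q = (As * sigma[:, None]).T @ As            # the matrix Q(x)
    return F, hess_F, sigma, grad_v, Q
```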


Vaidya's algorithm

We fix a small constant [math]\epsilon \leq 0.006[/math] whose role will be made precise in the analysis below. Vaidya's algorithm produces a sequence of pairs [math](A^{(t)}, b^{(t)}) \in \mathbb{R}^{m_t \times n} \times \mathbb{R}^{m_t}[/math] such that the corresponding polytope contains the convex set of interest. The initial polytope defined by [math](A^{(0)},b^{(0)})[/math] is a simplex (in particular [math]m_0=n+1[/math]). For [math]t\geq0[/math] we let [math]x_t[/math] be the minimizer of the volumetric barrier [math]v_t[/math] of the polytope given by [math](A^{(t)}, b^{(t)})[/math], and [math](\sigma_i^{(t)})_{i \in [m_t]}[/math] the leverage scores (associated to [math]v_t[/math]) at the point [math]x_t[/math]. We also denote [math]F_t[/math] for the logarithmic barrier given by [math](A^{(t)}, b^{(t)})[/math]. The next polytope [math](A^{(t+1)}, b^{(t+1)})[/math] is defined by either adding a constraint to or removing a constraint from the current polytope:

  • If for some [math]i \in [m_t][/math] one has [math]\sigma_i^{(t)} = \min_{j \in [m_t]} \sigma_j^{(t)} \lt \epsilon[/math], then [math](A^{(t+1)}, b^{(t+1)})[/math] is defined by removing the [math]i^{th}[/math] row in [math](A^{(t)}, b^{(t)})[/math] (in particular [math]m_{t+1} = m_t - 1[/math]).
  • Otherwise let [math]c^{(t)}[/math] be the vector given by the separation oracle queried at [math]x_t[/math], and [math]\beta^{(t)} \in \mathbb{R}[/math] be chosen so that
    [[math]] \frac{(\nabla^2 F_t(x_t) )^{-1}[c^{(t)}, c^{(t)}]}{(x_t^{\top} c^{(t)} - \beta^{(t)})^2} = \frac{1}{5} \sqrt{\epsilon} . [[/math]]
    Then we define [math](A^{(t+1)}, b^{(t+1)})[/math] by adding to [math](A^{(t)}, b^{(t)})[/math] the row given by [math](c^{(t)}, \beta^{(t)})[/math] (in particular [math]m_{t+1} = m_t + 1[/math]).

It can be shown that the volumetric barrier is a self-concordant barrier, and thus it can be efficiently minimized with Newton's method. In fact it is enough to do {\em one step} of Newton's method on [math]v_t[/math] initialized at [math]x_{t-1}[/math], see [5][6] for more details on this.
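The numpy sketch below assembles one outer iteration (case 1 or case 2, followed by a single Newton-like re-centering step using [math]Q[/math] as a surrogate Hessian of the volumetric barrier, as suggested above). The separation oracle convention, the undamped Newton step and the missing simplex initialization are simplifying assumptions; this is not Vaidya's exact implementation.

```python
import numpy as np

def _log_barrier_pieces(A, b, x):
    # slacks, Hessian of the log-barrier, leverage scores at x
    s = A @ x - b
    As = A / s[:, None]
    H = As.T @ As
    sigma = np.einsum('ij,jk,ik->i', As, np.linalg.inv(H), As)
    return As, H, sigma

def vaidya_step(A, b, x, sep_oracle, eps=0.006):
    """One simplified outer iteration of Vaidya's method on {y : A y > b}.

    sep_oracle(x) returns c with the target set contained in {y : c^T y >= c^T x}.
    """
    _, H, sigma = _log_barrier_pieces(A, b, x)
    i = int(np.argmin(sigma))
    if sigma[i] < eps:
        # case 1: remove the constraint with the smallest leverage score
        A, b = np.delete(A, i, axis=0), np.delete(b, i)
    else:
        # case 2: add the cut from the separation oracle; beta is chosen so that
        # the new constraint has leverage score sqrt(eps)/5 at x
        c = sep_oracle(x)
        gap = np.sqrt(5 * (c @ np.linalg.solve(H, c)) / np.sqrt(eps))
        A, b = np.vstack([A, c]), np.append(b, c @ x - gap)
    # re-centering: one Newton-like step on the volumetric barrier, with Q as a
    # surrogate Hessian (a damped step would be safer in practice)
    As, H, sigma = _log_barrier_pieces(A, b, x)
    grad_v = -As.T @ sigma
    Q = (As * sigma[:, None]).T @ As
    x = x - np.linalg.solve(Q, grad_v)
    return A, b, x
```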

Analysis of Vaidya's method

The construction of Vaidya's method is based on a precise understanding of how the volumetric barrier changes when one adds or removes a constraint to the polytope. This understanding is derived in Section Constraints and the volumetric barrier. In particular we obtain the following two key inequalities: If case 1 happens at iteration [math]t[/math] then

[[math]] \begin{equation} \label{eq:analysis1} v_{t+1}(x_{t+1}) - v_t(x_t) \geq - \epsilon , \end{equation} [[/math]]

while if case 2 happens then

[[math]] \begin{equation} \label{eq:analysis2} v_{t+1}(x_{t+1}) - v_t(x_t) \geq \frac{1}{20} \sqrt{\epsilon} . \end{equation} [[/math]]

We show now how these inequalities imply that Vaidya's method stops after [math]O(n \log(n R/r))[/math] steps. First we claim that after [math]2t[/math] iterations, case 2 must have happened at least [math]t-1[/math] times. Indeed suppose that at iteration [math]2t-1[/math], case 2 has happened only [math]t-2[/math] times; then case 1 has happened [math]t+1[/math] times, so the current polytope has [math]m_{2t-1} = (n+1) + (t-2) - (t+1) = n-2 \lt n[/math] constraints, hence [math]\nabla^2 F(x)[/math] is singular and the leverage scores are infinite, so case 2 must happen at iteration [math]2t[/math]. Combining this claim with the two inequalities above we obtain:

[[math]] v_{2t}(x_{2t}) \geq v_0(x_0) + \frac{t-1}{20} \sqrt{\epsilon} - (t+1) \epsilon \geq \frac{t}{50} \epsilon - 1 +v_0(x_0) . [[/math]]

The key point now is to recall that by definition one has [math]v(x) = - \log \mathrm{vol}(\cE(x,1))[/math] where [math]\cE(x,r) = \{y : \nabla^2 F(x)[y-x,y-x] \leq r^2\}[/math] is the Dikin ellipsoid centered at [math]x[/math] and of radius [math]r[/math]. Moreover the logarithmic barrier [math]F[/math] of a polytope with [math]m[/math] constraints is [math]m[/math]-self-concordant, which implies that the polytope is included in the Dikin ellipsoid [math]\cE(z, 2m)[/math] where [math]z[/math] is the minimizer of [math]F[/math] (see [Theorem 4.2.6., [8]]). The volume of [math]\cE(z, 2m)[/math] is equal to [math](2m)^n \exp(-v(z))[/math], which is thus always an upper bound on the volume of the polytope. Combining this with the above display we just proved that at iteration [math]2t[/math] the volume of the current polytope is at most

[[math]] \exp \left(n \log(2m_{2t}) + 1 - v_0(x_0) - \frac{t}{50} \epsilon \right) . [[/math]]

Since [math]\cE(x,1)[/math] is always included in the polytope we have that [math]- v_0(x_0)[/math] is at most the logarithm of the volume of the initial polytope which is [math]O(n \log(R))[/math]. This clearly concludes the proof as the procedure will necessarily stop when the volume is below [math]\exp(n \log(r))[/math] (we also used the trivial bound [math]m_t \leq n+1+t[/math]).

Constraints and the volumetric barrier

We want to understand the effect on the volumetric barrier of addition/deletion of constraints to the polytope. Let [math]c \in \mathbb{R}^n[/math], [math]\beta \in \mathbb{R}[/math], and consider the logarithmic barrier [math]\tilde{F}[/math] and the volumetric barrier [math]\tilde{v}[/math] corresponding to the matrix [math]\tilde{A}\in \mathbb{R}^{(m+1) \times n}[/math] and the vector [math]\tilde{b} \in \mathbb{R}^{m+1}[/math] which are respectively the concatenation of [math]A[/math] and [math]c[/math], and the concatenation of [math]b[/math] and [math]\beta[/math]. Let [math]x^*[/math] and [math]\tilde{x}^*[/math] be the minimizers of [math]v[/math] and [math]\tilde{v}[/math] respectively. We recall the definition of leverage scores, for [math]i \in [m+1][/math], where [math]a_{m+1}=c[/math] and [math]b_{m+1}=\beta[/math],

[[math]] \sigma_i(x) = \frac{(\nabla^2 F(x) )^{-1}[a_i, a_i]}{(a_i^{\top} x - b_i)^2}, \ \text{and} \ \tilde{\sigma}_i(x) = \frac{(\nabla^2 \tilde{F}(x) )^{-1}[a_i, a_i]}{(a_i^{\top} x - b_i)^2}. [[/math]]

The leverage scores [math]\sigma_i[/math] and [math]\tilde{\sigma}_i[/math] are closely related:

Lemma

One has for any [math]i \in [m+1][/math],

[[math]] \frac{\tilde{\sigma}_{m+1}(x)}{1 - \tilde{\sigma}_{m+1}(x)} \geq \sigma_i(x) \geq \tilde{\sigma}_i(x) \geq (1-\sigma_{m+1}(x)) \sigma_i(x) . [[/math]]


Show Proof

First we observe that by Sherman-Morrison's formula [math](A+uv^{\top})^{-1} = A^{-1} - \frac{A^{-1} u v^{\top} A^{-1}}{1+A^{-1}[u,v]}[/math] one has

[[math]] \begin{equation} \label{eq:SM} (\nabla^2 \tilde{F}(x))^{-1} = (\nabla^2 F(x))^{-1} - \frac{(\nabla^2 F(x))^{-1} c c^{\top} (\nabla^2 F(x))^{-1}}{(c^{\top} x - \beta)^2 + (\nabla^2 F(x))^{-1}[c,c]} , \end{equation} [[/math]]
This immediately proves [math]\tilde{\sigma}_i(x) \leq \sigma_i(x)[/math]. It also implies the inequality [math]\tilde{\sigma}_i(x) \geq (1-\sigma_{m+1}(x)) \sigma_i(x)[/math] thanks to the following fact: [math]A - \frac{A u u^{\top} A}{1+A[u,u]} \succeq (1-A[u,u]) A[/math]. For the last inequality we use that [math]A + \frac{A u u^{\top} A}{1+A[u,u]} \preceq \frac{1}{1-A[u,u]} A[/math] together with

[[math]] (\nabla^2 {F}(x))^{-1} = (\nabla^2 \tilde{F}(x))^{-1} + \frac{(\nabla^2 \tilde{F}(x))^{-1} c c^{\top} (\nabla^2 \tilde{F}(x))^{-1}}{(c^{\top} x - \beta)^2 - (\nabla^2 \tilde{F}(x))^{-1}[c,c]} . [[/math]]
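The relations between [math]\sigma_i[/math] and [math]\tilde{\sigma}_i[/math] that are used below, namely [math]\sigma_i \geq \tilde{\sigma}_i \geq (1-\sigma_{m+1})\sigma_i[/math] and [math]\sigma_{m+1} \leq \tilde{\sigma}_{m+1}/(1-\tilde{\sigma}_{m+1})[/math], are easy to sanity-check numerically. In the sketch below the polytope, the point [math]x[/math] and the extra row [math](c,\beta)[/math] are random placeholders; only numpy is assumed.

```python
import numpy as np

def leverage_scores(A_hess, b_hess, A_eval, b_eval, x):
    """sigma_i = (hess F)^{-1}[a_i, a_i] / (a_i^T x - b_i)^2, where F is the
    log-barrier of (A_hess, b_hess) and the rows a_i range over (A_eval, b_eval)."""
    s = A_hess @ x - b_hess
    As = A_hess / s[:, None]
    Hinv = np.linalg.inv(As.T @ As)
    Ae = A_eval / (A_eval @ x - b_eval)[:, None]
    return np.einsum('ij,jk,ik->i', Ae, Hinv, Ae)

rng = np.random.default_rng(1)
m, n = 12, 4
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = A @ x - rng.uniform(0.5, 2.0, size=m)        # slacks a_i^T x - b_i > 0
c = rng.standard_normal(n)
beta = c @ x - rng.uniform(0.5, 2.0)             # extra constraint, feasible at x
At, bt = np.vstack([A, c]), np.append(b, beta)

sigma = leverage_scores(A, b, At, bt, x)         # sigma_i for i in [m+1]
tsigma = leverage_scores(At, bt, At, bt, x)      # tilde sigma_i for i in [m+1]

assert np.all(sigma >= tsigma - 1e-10)                       # sigma_i >= tilde sigma_i
assert np.all(tsigma >= (1 - sigma[-1]) * sigma - 1e-10)     # tilde sigma_i >= (1 - sigma_{m+1}) sigma_i
assert sigma[-1] <= tsigma[-1] / (1 - tsigma[-1]) + 1e-10    # bound used for sigma_{m+1}
```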

We now state without proof the following key result, which was first proven by Vaidya. To put the statement in context recall that for a self-concordant barrier [math]f[/math] the suboptimality gap [math]f(x) - \min f[/math] is intimately related to the Newton decrement [math]\|\nabla f(x) \|_{(\nabla^2 f(x))^{-1}}[/math]. Vaidya's inequality gives a similar claim for the volumetric barrier. We use the version given in [Theorem 2.6, [9]] which has slightly better numerical constants than the original bound. Recall also the definition of [math]Q[/math] from \eqref{eq:hessianvol}.

Theorem

Let [math]\lambda(x) = \|\nabla v(x) \|_{Q(x)^{-1}}[/math] be an approximate Newton decrement, [math]\epsilon = \min_{i \in [m]} \sigma_i(x)[/math], and assume that [math]\lambda(x)^2 \leq \frac{2 \sqrt{\epsilon} - \epsilon}{36}[/math]. Then

[[math]] v(x) - v(x^*) \leq 2 \lambda(x)^2 . [[/math]]

We also denote [math]\tilde{\lambda}[/math] for the approximate Newton decrement of [math]\tilde{v}[/math]. The goal for the rest of the section is to prove the following theorem which gives the precise understanding of the volumetric barrier we were looking for.

Theorem

Let [math]\epsilon := \min_{i \in [m]} \sigma_i(x^*)[/math], [math]\delta := \sigma_{m+1}(x^*) / \sqrt{\epsilon}[/math] and assume that [math]\frac{\left(\delta \sqrt{\epsilon} + \sqrt{\delta^{3} \sqrt{\epsilon}}\right)^2}{1- \delta \sqrt{\epsilon}} \lt \frac{2 \sqrt{\epsilon} - \epsilon}{36}[/math]. Then one has

[[math]] \begin{equation} \label{eq:thV11} \tilde{v}(\tilde{x}^*) - v(x^*) \geq \frac{1}{2} \log(1+\delta \sqrt{\epsilon}) - 2 \frac{\left(\delta \sqrt{\epsilon} + \sqrt{\delta^{3} \sqrt{\epsilon}}\right)^2}{1- \delta \sqrt{\epsilon}} . \end{equation} [[/math]]
On the other hand assuming that [math]\tilde{\sigma}_{m+1}(\tilde{x}^*) = \min_{i \in [m+1]} \tilde{\sigma}_{i}(\tilde{x}^*) =: \epsilon[/math] and that [math]\epsilon \leq 1/4[/math], one has

[[math]] \begin{equation} \label{eq:thV12} \tilde{v}(\tilde{x}^*) - v(x^*) \leq - \frac{1}{2} \log(1 - \epsilon) + \frac{8 \epsilon^2}{(1-\epsilon)^2}. \end{equation} [[/math]]
Before going into the proof let us see briefly how Theorem gives the two inequalities stated at the beginning of Section Analysis of Vaidya's method. To prove \eqref{eq:analysis2} we use \eqref{eq:thV11} with [math]\delta=1/5[/math] and [math]\epsilon \leq 0.006[/math], and we observe that in this case the right hand side of \eqref{eq:thV11} is lower bounded by [math]\frac{1}{20} \sqrt{\epsilon}[/math]. On the other hand to prove \eqref{eq:analysis1} we use \eqref{eq:thV12}, and we observe that for [math]\epsilon \leq 0.006[/math] the right hand side of \eqref{eq:thV12} is upper bounded by [math]\epsilon[/math].


Show Proof

We start with the proof of \eqref{eq:thV11}. First observe that by factoring [math](\nabla^2 F(x))^{1/2}[/math] on the left and on the right of [math]\nabla^2 \tilde{F}(x)[/math] one obtains

[[math]] \begin{align*} & \mathrm{det}(\nabla^2 \tilde{F}(x)) \\ & = \mathrm{det}\left(\nabla^2 {F}(x) + \frac{cc^{\top}}{(c^{\top} x- \beta)^2} \right) \\ & = \mathrm{det}(\nabla^2 {F}(x)) \mathrm{det}\left(\mathrm{I}_n + \frac{(\nabla^2 {F}(x))^{-1/2} c c^{\top} (\nabla^2 {F}(x))^{-1/2}}{(c^{\top} x- \beta)^2}\right) \\ & = \mathrm{det}(\nabla^2 {F}(x)) (1+\sigma_{m+1}(x)) , \end{align*} [[/math]]
and thus

[[math]] \tilde{v}(x) = v(x) + \frac{1}{2} \log(1+ \sigma_{m+1}(x)) . [[/math]]
In particular we have

[[math]] \tilde{v}(\tilde{x}^*) - v(x^*) = \frac{1}{2} \log(1+ \sigma_{m+1}(x^*)) - (\tilde{v}(x^*) - \tilde{v}(\tilde{x}^*)) . [[/math]]
To bound the suboptimality gap of [math]x^*[/math] in [math]\tilde{v}[/math] we will invoke Theorem and thus we have to upper bound the approximate Newton decrement [math]\tilde{\lambda}[/math]. Using [\eqref{eq:V21}, Lemma] below one has

[[math]] \tilde{\lambda} (x^*)^2 \leq \frac{\left(\sigma_{m+1}(x^*) + \sqrt{\frac{\sigma_{m+1}^3(x^*)}{\min_{i \in [m]} \sigma_i(x^*)}}\right)^2}{1-\sigma_{m+1}(x^*)} = \frac{\left(\delta \sqrt{\epsilon} + \sqrt{\delta^{3} \sqrt{\epsilon}}\right)^2}{1- \delta \sqrt{\epsilon}} . [[/math]]
This concludes the proof of \eqref{eq:thV11}.


We now turn to the proof of \eqref{eq:thV12}. Following the same steps as above we immediately obtain

[[math]] \begin{eqnarray*} \tilde{v}(\tilde{x}^*) - v(x^*) & = & \tilde{v}(\tilde{x}^*) - v(\tilde{x}^*)+v(\tilde{x}^*)- v(x^*) \\ & = & - \frac{1}{2} \log(1 - \tilde{\sigma}_{m+1}(\tilde{x}^*)) + v(\tilde{x}^*)- v(x^*). \end{eqnarray*} [[/math]]
To invoke Theorem it remains to upper bound [math]\lambda(\tilde{x}^*)[/math]. Using [\eqref{eq:V22}, Lemma] below one has

[[math]] \lambda(\tilde{x}^*) \leq \frac{2 \ \tilde{\sigma}_{m+1}(\tilde{x}^*)}{1 - \tilde{\sigma}_{m+1}(\tilde{x}^*)} . [[/math]]
We can apply Theorem since the assumption [math]\epsilon \leq 1/4[/math] implies that [math]\left(\frac{2 \epsilon}{1-\epsilon}\right)^2 \leq \frac{2 \sqrt{\epsilon} - \epsilon}{36}[/math]. This concludes the proof of \eqref{eq:thV12}.

Lemma

One has

[[math]] \begin{equation} \label{eq:V21} \sqrt{1- \sigma_{m+1}(x)} \ \tilde{\lambda} (x) \leq \|\nabla {v}(x)\|_{Q(x)^{-1}} + \sigma_{m+1}(x) + \sqrt{\frac{\sigma_{m+1}^3(x)}{\min_{i \in [m]} \sigma_i(x)}} . \end{equation} [[/math]]
Furthermore if [math]\tilde{\sigma}_{m+1}(x) = \min_{i \in [m+1]} \tilde{\sigma}_{i}(x)[/math] then one also has

[[math]] \begin{equation} \label{eq:V22} \lambda(x) \leq \|\nabla \tilde{v}(x)\|_{Q(x)^{-1}} + \frac{2 \ \tilde{\sigma}_{m+1}(x)}{1 - \tilde{\sigma}_{m+1}(x)} . \end{equation} [[/math]]


Show Proof

We start with the proof of \eqref{eq:V21}. First observe that by Lemma one has [math]\tilde{Q}(x) \succeq (1-\sigma_{m+1}(x)) Q(x)[/math] and thus by definition of the Newton decrement

[[math]] \tilde{\lambda} (x) = \|\nabla \tilde{v}(x)\|_{\tilde{Q}(x)^{-1}} \leq \frac{\|\nabla \tilde{v}(x)\|_{Q(x)^{-1}}}{\sqrt{1-\sigma_{m+1}(x)}} . [[/math]]
Next observe that (recall \eqref{eq:gradvol})

[[math]] \nabla \tilde{v}(x) = \nabla v(x) + \sum_{i=1}^m ({\sigma}_i(x) - \tilde{\sigma}_i(x)) \frac{a_i}{a_i^{\top} x - b_i} - \tilde{\sigma}_{m+1}(x) \frac{c}{c^{\top} x - \beta} . [[/math]]
We now use that [math]Q(x) \succeq (\min_{i \in [m]} \sigma_i(x)) \nabla^2 F(x)[/math] to obtain

[[math]] \left \| \tilde{\sigma}_{m+1}(x) \frac{c}{c^{\top} x - \beta} \right\|_{Q(x)^{-1}}^2 \leq \frac{\tilde{\sigma}_{m+1}^2(x) \sigma_{m+1}(x)}{\min_{i \in [m]} \sigma_i(x)} . [[/math]]
By Lemma one has [math]\tilde{\sigma}_{m+1}(x) \leq {\sigma}_{m+1}(x)[/math] and thus we see that it only remains to prove

[[math]] \left\|\sum_{i=1}^m ({\sigma}_i(x) - \tilde{\sigma}_i(x)) \frac{a_i}{a_i^{\top}x - b_i} \right\|_{Q(x)^{-1}}^2 \leq \sigma_{m+1}^2(x) . [[/math]]
The above inequality follows from a beautiful calculation of Vaidya (see [Lemma 12, [6]]), starting from the identity

[[math]] \sigma_i(x) - \tilde{\sigma}_i(x) = \frac{((\nabla^2 F(x))^{-1}[a_i,c])^2}{((c^{\top} x - \beta)^2 + (\nabla^2 F(x))^{-1}[c,c])(a_i^{\top} x - b_i)^2} , [[/math]]
which itself follows from \eqref{eq:SM}. We now turn to the proof of \eqref{eq:V22}. Following the same steps as above we immediately obtain

[[math]] \lambda(x) = \|\nabla v(x)\|_{Q(x)^{-1}} \leq \|\nabla \tilde{v}(x)\|_{Q(x)^{-1}} + \sigma_{m+1}(x) + \sqrt{\frac{\tilde{\sigma}_{m+1}^2(x) \sigma_{m+1}(x)}{\min_{i \in [m]} \sigma_i(x)}} . [[/math]]
Using Lemma together with the assumption [math]\tilde{\sigma}_{m+1}(x) = \min_{i \in [m+1]} \tilde{\sigma}_{i}(x)[/math] yields \eqref{eq:V22}, thus concluding the proof.

Conjugate gradient

We conclude this chapter with the special case of unconstrained optimization of a convex quadratic function [math]f(x) = \frac12 x^{\top} A x - b^{\top} x[/math], where [math]A \in \R^{n \times n}[/math] is a positive definite matrix and [math]b \in \R^n[/math]. This problem, of paramount importance in practice (it is equivalent to solving the linear system [math]Ax = b[/math]), admits a simple first-order black-box procedure which attains the {\em exact} optimum [math]x^*[/math] in at most [math]n[/math] steps. This method, called the {\em conjugate gradient}, is described and analyzed below. What is written below is taken from [Chapter 5, [10]]. Let [math]\langle \cdot , \cdot\rangle_A[/math] be the inner product on [math]\R^n[/math] defined by the positive definite matrix [math]A[/math], that is [math]\langle x, y\rangle_A = x^{\top} A y[/math] (we also denote by [math]\|\cdot\|_A[/math] the corresponding norm). For the sake of clarity we denote here by [math]\langle \cdot , \cdot\rangle[/math] the standard inner product in [math]\R^n[/math]. Given an orthogonal set [math]\{p_0, \dots, p_{n-1}\}[/math] for [math]\langle \cdot , \cdot \rangle_A[/math] we will minimize [math]f[/math] by sequentially minimizing it along the directions given by this orthogonal set. That is, given [math]x_0 \in \R^n[/math], for [math]t \geq 0[/math] let

[[math]] \begin{equation} \label{eq:CG1} x_{t+1} := \argmin_{x \in \{x_t + \lambda p_t, \ \lambda \in \R\}} f(x) . \end{equation} [[/math]]

Equivalently one can write

[[math]] \begin{equation} \label{eq:CG2} x_{t+1} = x_t - \langle \nabla f(x_t) , p_t \rangle \frac{p_t}{\|p_t\|_A^2} . \end{equation} [[/math]]

The latter identity follows by differentiating [math]\lambda \mapsto f(x + \lambda p_t)[/math], and using that [math]\nabla f(x) = A x - b[/math]. We also make an observation that will be useful later, namely that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{p_0, \dots, p_t\}[/math], or equivalently

[[math]] \begin{equation} \label{eq:CG3prime} \langle \nabla f(x_{t+1}), p_i \rangle = 0, \forall \ 0 \leq i \leq t. \end{equation} [[/math]]

Equation \eqref{eq:CG3prime} is true by construction for [math]i=t[/math], and for [math]i \leq t-1[/math] it follows by induction, assuming \eqref{eq:CG3prime} at [math]t-1[/math] and using the following formula:

[[math]] \begin{equation} \label{eq:CG3} \nabla f(x_{t+1}) = \nabla f(x_{t}) - \langle \nabla f(x_{t}) , p_{t} \rangle \frac{A p_{t}}{\|p_t\|_A^2} . \end{equation} [[/math]]


We now claim that [math]x_n = x^* = \argmin_{x \in \R^n} f(x)[/math]. It suffices to show that [math]\langle x_n -x_0 , p_t \rangle_A = \langle x^*-x_0 , p_t \rangle_A[/math] for any [math]t\in \{0,\dots,n-1\}[/math]. Note that [math]x_n - x_0 = - \sum_{t=0}^{n-1} \langle \nabla f(x_t), p_t \rangle \frac{p_t}{\|p_t\|_A^2}[/math], and thus using that [math]x^* = A^{-1} b[/math],

[[math]] \begin{eqnarray*} \langle x_n -x_0 , p_t \rangle_A = - \langle \nabla f(x_t) , p_t \rangle = \langle b - A x_t , p_t \rangle & = & \langle x^* - x_t, p_t \rangle_A \\ & = & \langle x^* - x_0, p_t \rangle_A , \end{eqnarray*} [[/math]]

which concludes the proof of [math]x_n = x^*[/math]. In order to have a proper black-box method it remains to describe how to build iteratively the orthogonal set [math]\{p_0, \dots, p_{n-1}\}[/math] based only on gradient evaluations of [math]f[/math]. A natural guess to obtain a set of orthogonal directions (w.r.t. [math]\langle \cdot , \cdot \rangle_A[/math]) is to take [math]p_0 = \nabla f(x_0)[/math] and for [math]t \geq 1[/math],

[[math]] \begin{equation} \label{eq:CG4} p_t = \nabla f(x_t) - \langle \nabla f(x_t), p_{t-1} \rangle_A \ \frac{p_{t-1}}{\|p_{t-1}\|^2_A} . \end{equation} [[/math]]

Let us first verify by induction on [math]t \in [n-1][/math] that for any [math]i \in \{0,\dots,t-2\}[/math], [math]\langle p_t, p_{i}\rangle_A = 0[/math] (observe that for [math]i=t-1[/math] this is true by construction of [math]p_t[/math]). Using the induction hypothesis one can see that it is enough to show [math]\langle \nabla f(x_t), p_i \rangle_A = 0[/math] for any [math]i \in \{0, \dots, t-2\}[/math], which we prove now. First observe that by induction one easily obtains [math]A p_i \in \mathrm{span}\{p_0, \dots, p_{i+1}\}[/math] from \eqref{eq:CG3} and \eqref{eq:CG4}. Using this fact together with [math]\langle \nabla f(x_t), p_i \rangle_A = \langle \nabla f(x_t), A p_i \rangle[/math] and \eqref{eq:CG3prime} thus concludes the proof of orthogonality of the set [math]\{p_0, \dots, p_{n-1}\}[/math]. We still have to show that \eqref{eq:CG4} can be written by making only reference to the gradients of [math]f[/math] at previous points. Recall that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{p_0, \dots, p_t\}[/math], and thus given the form of [math]p_t[/math] we also have that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{\nabla f(x_0), \dots, \nabla f(x_t)\}[/math] (in some sense the conjugate gradient is the {\em optimal} first order method for convex quadratic functions). In particular one has [math]\langle \nabla f(x_{t+1}) , \nabla f(x_t) \rangle = 0[/math]. This fact, together with the orthogonality of the set [math]\{p_t\}[/math] and \eqref{eq:CG3}, imply that

[[math]] \frac{\langle \nabla f(x_{t+1}) , p_{t} \rangle_A}{\|p_t\|_A^2} = \langle \nabla f(x_{t+1}) , \frac{A p_{t}}{\|p_t\|_A^2} \rangle = - \frac{\langle \nabla f(x_{t+1}) , \nabla f(x_{t+1}) \rangle}{\langle \nabla f(x_{t}) , p_t \rangle} . [[/math]]

Furthermore using the definition \eqref{eq:CG4} and [math]\langle \nabla f(x_t) , p_{t-1} \rangle = 0[/math] one also has

[[math]] \langle \nabla f(x_t), p_t \rangle = \langle \nabla f(x_t) , \nabla f(x_t) \rangle . [[/math]]

Thus we arrive at the following rewriting of the (linear) conjugate gradient algorithm, where we recall that [math]x_0[/math] is some fixed starting point and [math]p_0 = \nabla f(x_0)[/math],

[[math]] \begin{eqnarray} x_{t+1} & = & \argmin_{x \in \left\{x_t + \lambda p_t, \ \lambda \in \R \right\}} f(x) , \label{eq:CG5} \\ p_{t+1} & = & \nabla f(x_{t+1}) + \frac{\langle \nabla f(x_{t+1}) , \nabla f(x_{t+1}) \rangle}{\langle \nabla f(x_{t}) , \nabla f(x_t) \rangle} p_t . \label{eq:CG6} \end{eqnarray} [[/math]]
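In code, the linear conjugate gradient defined by \eqref{eq:CG5} and \eqref{eq:CG6} takes only a few lines. The numpy sketch below uses the exact line search formula \eqref{eq:CG2} and the sign conventions of this section; the random data in the usage example is a placeholder.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Linear conjugate gradient for f(x) = 1/2 x^T A x - b^T x, A positive definite."""
    x = np.array(x0, dtype=float)
    g = A @ x - b                           # gradient of f at x
    p = g.copy()                            # p_0 = grad f(x_0)
    max_iter = max_iter or len(b)           # exact convergence in at most n steps
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        Ap = A @ p
        lam = -(g @ p) / (p @ Ap)           # exact line search along p (cf. (CG2))
        x = x + lam * p
        g_new = A @ x - b
        p = g_new + (g_new @ g_new) / (g @ g) * p   # direction update (CG6)
        g = g_new
    return x

# usage: solve A x = b for a random positive definite matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b, np.zeros(50))
assert np.allclose(A @ x, b, atol=1e-6)
```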

Observe that the algorithm defined by \eqref{eq:CG5} and \eqref{eq:CG6} makes sense for an arbitrary convex function, in which case it is called the {\em non-linear conjugate gradient}. There are many variants of the non-linear conjugate gradient, and the above form is known as the Fletcher–Reeves method. Another popular version in practice is the Polak–Ribière method which is based on the fact that for the general non-quadratic case one does not necessarily have [math]\langle \nabla f(x_{t+1}) , \nabla f(x_t) \rangle = 0[/math], and thus one replaces \eqref{eq:CG6} by

[[math]] p_{t+1} = \nabla f(x_{t+1}) + \frac{\langle \nabla f(x_{t+1}) - \nabla f(x_t), \nabla f(x_{t+1}) \rangle}{\langle \nabla f(x_{t}) , \nabla f(x_t) \rangle} p_t . [[/math]]
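A hedged sketch of the non-linear variants is given below: the exact line search of \eqref{eq:CG5} is replaced by simple backtracking (a serious implementation would use a strong Wolfe line search, see [10]), the direction is written with the conventional sign [math]p_{t+1} = - \nabla f(x_{t+1}) + \beta_t p_t[/math] (equivalent to \eqref{eq:CG6} under exact line search), and a restart safeguard is added in case the Polak–Ribière update fails to produce a descent direction.

```python
import numpy as np

def nonlinear_cg(f, grad_f, x0, variant="PR", max_iter=500, tol=1e-8):
    """Sketch of non-linear conjugate gradient (Fletcher–Reeves or Polak–Ribière)."""
    x = np.array(x0, dtype=float)
    g = grad_f(x)
    p = -g                                         # start along the negative gradient
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        # backtracking (Armijo) line search on lambda -> f(x + lambda p)
        lam, fx = 1.0, f(x)
        while f(x + lam * p) > fx + 1e-4 * lam * (g @ p) and lam > 1e-16:
            lam *= 0.5
        x = x + lam * p
        g_new = grad_f(x)
        if variant == "FR":                        # Fletcher–Reeves
            beta = (g_new @ g_new) / (g @ g)
        else:                                      # Polak–Ribière
            beta = (g_new @ (g_new - g)) / (g @ g)
        p = -g_new + beta * p
        if g_new @ p >= 0:                         # safeguard: restart with steepest descent
            p = -g_new
        g = g_new
    return x
```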

We refer to [10] for more details about these algorithms, as well as for advice on how to deal with the line search in \eqref{eq:CG5}. Finally we also note that the linear conjugate gradient method can often attain an approximate solution in far fewer than [math]n[/math] steps. More precisely, denoting by [math]\kappa[/math] the condition number of [math]A[/math] (that is the ratio of the largest eigenvalue to the smallest eigenvalue of [math]A[/math]), one can show that linear conjugate gradient attains an [math]\epsilon[/math]-optimal point in a number of iterations of order [math]\sqrt{\kappa} \log(1/\epsilon)[/math]. The next chapter will demystify this convergence rate, and in particular we will see that (i) this is the optimal rate among first order methods, and (ii) there is a way to generalize this rate to non-quadratic convex functions (though the algorithm will have to be modified).

General references

Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].

References

  1. A. Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244--1247, 1965.
  2. D. Newman. Location of the maximum on unimodal surfaces. Journal of the ACM, 12 (3): 395--398, 1965.
  3. A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
  4. B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific J. Math, 10 (4): 1257--1261, 1960.
  5. P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338--343, 1989.
  6. P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. Mathematical Programming, 73 (3): 291--341, 1996.
  7. Y.-T. Lee, A. Sidford, and S. C.-W. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. arXiv:1508.04874, 2015.
  8. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
  9. K. M. Anstreicher. Towards a practical volumetric cutting plane method for convex programming. SIAM Journal on Optimization, 9 (1): 190--206, 1998.
  10. J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.

Notes

  1. As a warm-up we assume in this section that [math]\cX[/math] is known. It should be clear from the arguments in the next section that in fact the same algorithm would work if initialized with [math]\cS_1 \supset \cX[/math].