Convex optimization in finite dimension
Let [math]\mathcal{X} \subset \R^n[/math] be a convex body (that is, a compact convex set with non-empty interior), and let [math]f : \mathcal{X} \rightarrow [-B,B][/math] be a continuous and convex function. Let [math]r, R \gt 0[/math] be such that [math]\mathcal{X}[/math] is contained in a Euclidean ball of radius [math]R[/math] (respectively it contains a Euclidean ball of radius [math]r[/math]). In this chapter we give several black-box algorithms to solve
As we will see, these algorithms have an oracle complexity which is linear (or quadratic) in the dimension, hence the title of the chapter (in the next chapter the oracle complexity will be {\em independent} of the dimension). An interesting feature of the methods discussed here is that they only need a separation oracle for the constraint set [math]\cX[/math]. In the literature such algorithms are often referred to as {\em cutting plane methods}. In particular these methods can be used to {\em find} a point [math]x \in \cX[/math] given only a separation oracle for [math]\cX[/math] (this is also known as the {\em feasibility problem}).
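To make the notion of a separation oracle concrete, here is a minimal Python (NumPy) sketch for the illustrative case of a polytope [math]\{x \in \R^n : Cx \leq d\}[/math]; the polytope and the function name are illustrative choices, not taken from the text. If the query point violates some constraint, the corresponding row of [math]C[/math] defines a valid separating hyperplane.
```python
import numpy as np

def polytope_separation_oracle(C, d, x):
    """Separation oracle for the polytope {y : C y <= d}.

    Returns None if x belongs to the polytope; otherwise returns a vector w
    such that the polytope is contained in {y : (y - x)^T w <= 0}.
    """
    violations = C @ x - d
    i = int(np.argmax(violations))
    if violations[i] <= 0:
        return None   # x satisfies every constraint
    # For any feasible y: C[i] @ y <= d[i] < C[i] @ x, hence (y - x)^T C[i] < 0.
    return C[i]
```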
The center of gravity method
We consider the following simple iterative algorithm[Notes 1]: let [math]\cS_1= \cX[/math], and for [math]t \geq 1[/math] do the following:
- Compute
[[math]] \begin{equation} c_t = \frac{1}{\mathrm{vol}(\cS_t)} \int_{x \in \cS_t} x dx . \end{equation} [[/math]]
- Query the first order oracle at [math]c_t[/math] and obtain [math]w_t \in \partial f (c_t)[/math]. Let
[[math]] \cS_{t+1} = \cS_t \cap \{x \in \R^n : (x-c_t)^{\top} w_t \leq 0\} . [[/math]]
If stopped after [math]t[/math] queries to the first order oracle then we use [math]t[/math] queries to a zeroth order oracle to output
This procedure is known as the {\em center of gravity method}; it was discovered independently on both sides of the Wall by [1] and [2].
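As an illustration, here is a minimal Python (NumPy) sketch of the scheme above. Since computing the exact center of gravity is intractable in general (see the discussion below), the centroid of [math]\cS_t[/math] is approximated here by a crude Monte Carlo average: the routine sample_X, assumed to be provided, returns uniform samples from [math]\cX[/math], and a sample is kept only if it satisfies all the cuts collected so far. The oracles f and subgrad are also assumptions of the sketch.
```python
import numpy as np

def center_of_gravity(f, subgrad, sample_X, n_iters=50, n_samples=2000):
    """Sketch of the center of gravity method with an approximate centroid.

    f, subgrad : zeroth and first order oracles for the objective.
    sample_X   : returns one uniform sample from X (stand-in for the
                 intractable exact centroid computation of S_t).
    The set S_t is represented implicitly by the accumulated half-space cuts.
    """
    cuts = []       # list of pairs (c_t, w_t) defining the cuts
    queried = []    # centers at which the first order oracle was queried

    def in_S(x):    # x belongs to S_t iff it satisfies every cut so far
        return all((x - c) @ w <= 0 for (c, w) in cuts)

    for _ in range(n_iters):
        # approximate centroid of S_t: average the samples of X lying in S_t
        pts = [x for x in (sample_X() for _ in range(n_samples)) if in_S(x)]
        if not pts:
            break            # rejection sampling failed; stop the sketch here
        c_t = np.mean(pts, axis=0)
        w_t = subgrad(c_t)   # first order oracle
        cuts.append((c_t, w_t))
        queried.append(c_t)

    # zeroth order oracle to return the best queried center
    return min(queried, key=f) if queried else None
```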
The center of gravity method satisfies
Before proving this result a few comments are in order. To attain an [math]\epsilon[/math]-optimal point the center of gravity method requires [math]O( n \log (2 B / \epsilon))[/math] queries to both the first and zeroth order oracles. It can be shown that this is the best one can hope for, in the sense that for [math]\epsilon[/math] small enough one needs [math]\Omega(n \log(1/ \epsilon))[/math] calls to the oracle in order to find an [math]\epsilon[/math]-optimal point; see [3] for a formal proof. The rate of convergence given by Theorem is exponentially fast. In the optimization literature this is called a {\em linear rate} as the (estimated) error at iteration [math]t+1[/math] is linearly related to the error at iteration [math]t[/math]. The last and most important comment concerns the computational complexity of the method. It turns out that finding the center of gravity [math]c_t[/math] is a very difficult problem in itself, and in general we do not have a computationally efficient procedure to carry out this computation. In Section Random walk based methods we will discuss a relatively recent (compared to the 50-year-old center of gravity method!) randomized algorithm to approximately compute the center of gravity. This will in turn give a randomized center of gravity method which we will describe in detail. We now turn to the proof of Theorem. We will use the following elementary result from convex geometry:
Let [math]\cK[/math] be a centered convex set, i.e., [math]\int_{x \in \cK} x dx = 0[/math], then for any [math]w \in \R^n, w \neq 0[/math], one has
Let [math]x^*[/math] be such that [math]f(x^*) = \min_{x \in \mathcal{X}} f(x)[/math]. Since [math]w_t \in \partial f(c_t)[/math] one has
The ellipsoid method
Recall that an ellipsoid is a convex set of the form
where [math]c \in \R^n[/math], and [math]H[/math] is a symmetric positive definite matrix. Geometrically [math]c[/math] is the center of the ellipsoid, and the semi-axes of [math]\cE[/math] are given by the eigenvectors of [math]H[/math], with lengths given by the square root of the corresponding eigenvalues. We now give a simple geometric lemma, which is at the heart of the ellipsoid method.
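Before stating the lemma, here is a quick numerical sanity check (in Python with NumPy, using arbitrary illustrative values) of the above description of the semi-axes in terms of the eigendecomposition of [math]H[/math].
```python
import numpy as np

# Arbitrary center c and positive definite matrix H (illustrative values only).
c = np.array([1.0, -2.0])
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(H)
# The i-th semi-axis of E = {x : (x-c)^T H^{-1} (x-c) <= 1} points along the
# eigenvector eigvecs[:, i] and has length sqrt(eigvals[i]): the boundary point
# c + sqrt(lambda_i) v_i satisfies the defining inequality with equality.
for lam, v in zip(eigvals, eigvecs.T):
    x = c + np.sqrt(lam) * v
    assert np.isclose((x - c) @ np.linalg.solve(H, x - c), 1.0)
```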
Let [math]\mathcal{E}_0 = \{x \in \R^n : (x - c_0)^{\top} H_0^{-1} (x-c_0) \leq 1 \}[/math]. For any [math]w \in \R^n[/math], [math]w \neq 0[/math], there exists an ellipsoid [math]\mathcal{E}[/math] such that
For [math]n=1[/math] the result is obvious; in fact we even have [math]\mathrm{vol}(\mathcal{E}) \leq \frac12 \mathrm{vol}(\mathcal{E}_0) .[/math] For [math]n \geq 2[/math] one can simply verify that the ellipsoid given by \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4} satisfies the required properties \eqref{eq:ellipsoidlemma1} and \eqref{eq:ellipsoidlemma2}. Rather than bluntly doing these computations we will show how to derive \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4}. As a by-product this will also show that the ellipsoid defined by \eqref{eq:ellipsoidlemma3} and \eqref{eq:ellipsoidlemma4} is the unique ellipsoid of minimal volume that satisfies \eqref{eq:ellipsoidlemma1}. Let us first focus on the case where [math]\mathcal{E}_0[/math] is the Euclidean ball [math]\cB = \{x \in \R^n : x^{\top} x \leq 1\}[/math]. We momentarily assume that [math]w[/math] is a unit norm vector. By drawing a quick picture, one can see that it makes sense to look for an ellipsoid [math]\mathcal{E}[/math] that would be centered at [math]c= - t w[/math], with [math]t \in [0,1][/math] (presumably [math]t[/math] will be small), and such that one principal direction is [math]w[/math] (with inverse squared semi-axis [math]a \gt 0[/math]), and the other principal directions are all orthogonal to [math]w[/math] (with the same inverse squared semi-axes [math]b \gt 0[/math]). In other words we are looking for [math]\mathcal{E}= \{x: (x - c)^{\top} H^{-1} (x-c) \leq 1 \}[/math] with
We now consider an arbitrary ellipsoid [math]\cE_0 = \{x \in \R^n : (x - c_0)^{\top} H_0^{-1} (x-c_0) \leq 1 \}[/math]. Let [math]\Phi(x) = c_0 + H_0^{1/2} x[/math]; then clearly [math]\cE_0 = \Phi(\cB)[/math] and [math]\{x : w^{\top}(x - c_0) \leq 0\} = \Phi(\{x : (H_0^{1/2} w)^{\top} x \leq 0\})[/math]. Thus in this case the image by [math]\Phi[/math] of the ellipsoid given in \eqref{eq:ellipsoidlemma5} with [math]w[/math] replaced by [math]H_0^{1/2} w[/math] will satisfy \eqref{eq:ellipsoidlemma1} and \eqref{eq:ellipsoidlemma2}. It is easy to see that this corresponds to an ellipsoid defined by
We now describe the ellipsoid method, which only assumes a separation oracle for the constraint set [math]\cX[/math] (in particular it can be used to solve the feasibility problem mentioned at the beginning of the chapter). Let [math]\cE_0[/math] be the Euclidean ball of radius [math]R[/math] that contains [math]\cX[/math], and let [math]c_0[/math] be its center. Denote also [math]H_0=R^2 \mI_n[/math]. For [math]t \geq 0[/math] do the following:
- If [math]c_t \not\in \cX[/math] then call the separation oracle to obtain a separating hyperplane [math]w_t \in \R^n[/math] such that [math]\cX \subset \{x : (x- c_t)^{\top} w_t \leq 0\}[/math]; otherwise call the first order oracle at [math]c_t[/math] to obtain [math]w_t \in \partial f (c_t)[/math].
- Let [math]\cE_{t+1} = \{x : (x - c_{t+1})^{\top} H_{t+1}^{-1} (x-c_{t+1}) \leq 1 \}[/math] be the ellipsoid given in Lemma that contains [math]\{x \in \mathcal{E}_t : (x- c_t)^{\top} w_t \leq 0\}[/math], that is
[[math]] \begin{align*} & c_{t+1} = c_{t} - \frac{1}{n+1} \frac{H_t w_t}{\sqrt{w_t^{\top} H_t w_t}} ,\\ & H_{t+1} = \frac{n^2}{n^2-1} \left(H_t - \frac{2}{n+1} \frac{H_t w_t w_t^{\top} H_t}{w_t^{\top} H_t w_t} \right) . \end{align*} [[/math]]
If stopped after [math]t[/math] iterations and if [math]\{c_1, \dots, c_t\} \cap \cX \neq \emptyset[/math], then we use the zeroth order oracle to output
The following rate of convergence can be proved with the exact same argument as for Theorem (observe that at step [math]t[/math] one can remove a point in [math]\cX[/math] from the current ellipsoid only if [math]c_t \in \cX[/math]).
For [math]t \geq 2n^2 \log(R/r)[/math] the ellipsoid method satisfies [math]\{c_1, \dots, c_t\} \cap \cX \neq \emptyset[/math] and
We observe that the oracle complexity of the ellipsoid method is much worse than that of the center of gravity method: indeed the former needs [math]O(n^2 \log(1/\epsilon))[/math] calls to the oracles while the latter requires only [math]O(n \log(1/\epsilon))[/math] calls. However from a computational point of view the situation is much better: in many cases one can derive an efficient separation oracle, while the center of gravity method is basically always intractable. This is for instance the case in the context of LPs and SDPs: with the notation of Section Structured optimization the computational complexity of the separation oracle for LPs is [math]O(m n)[/math] while for SDPs it is [math]O(\max(m,n) n^2)[/math] (we use the fact that the spectral decomposition of a matrix can be done in [math]O(n^3)[/math] operations). This gives an overall complexity of [math]O(\max(m,n) n^3 \log(1/\epsilon))[/math] for LPs and [math]O(\max(m,n^2) n^6 \log(1/\epsilon))[/math] for SDPs. We note however that the ellipsoid method is almost never used in practice, essentially because the method is too rigid to exploit the potential easiness of real problems (e.g., the volume decrease given by \eqref{eq:ellipsoidlemma2} is essentially always tight).
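Nevertheless the iteration itself is short enough to write down explicitly. Here is a minimal Python (NumPy) sketch of the method described above, using the update formulas for [math]c_{t+1}[/math] and [math]H_{t+1}[/math]; the oracles f, subgrad and separation are assumed to be supplied by the user, and no numerical safeguards are included.
```python
import numpy as np

def ellipsoid_method(f, subgrad, separation, R, x0, n_iters=1000):
    """Sketch of the ellipsoid method (assumes dimension n >= 2).

    separation(c) returns None if c is in X, and otherwise a vector w with
    X contained in {x : (x - c)^T w <= 0}; f and subgrad are the zeroth and
    first order oracles; R is the radius of a Euclidean ball containing X.
    """
    n = len(x0)
    c = np.array(x0, dtype=float)    # center c_0
    H = R ** 2 * np.eye(n)           # H_0 = R^2 I_n
    feasible_centers = []

    for _ in range(n_iters):
        w = separation(c)
        if w is None:                # c_t is feasible: cut with a subgradient
            feasible_centers.append(c.copy())
            w = subgrad(c)
        Hw = H @ w
        wHw = w @ Hw
        c = c - Hw / ((n + 1) * np.sqrt(wHw))
        H = (n ** 2 / (n ** 2 - 1)) * (H - (2.0 / (n + 1)) * np.outer(Hw, Hw) / wHw)

    # zeroth order oracle on the feasible centers that were visited
    return min(feasible_centers, key=f) if feasible_centers else None
```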
Vaidya's cutting plane method
We focus here on the feasibility problem (it should be clear from the previous sections how to adapt the argument for optimization). We have seen that for the feasibility problem the center of gravity method has an [math]O(n)[/math] oracle complexity and unclear computational complexity (see Section Random walk based methods for more on this), while the ellipsoid method has oracle complexity [math]O(n^2)[/math] and computational complexity [math]O(n^4)[/math]. We describe here the beautiful algorithm of [5][6] which has oracle complexity [math]O(n \log(n))[/math] and computational complexity [math]O(n^4)[/math], thus getting the best of both the center of gravity and the ellipsoid method. In fact the computational complexity can even be improved further, and the recent breakthrough [7] shows that it can essentially (up to logarithmic factors) be brought down to [math]O(n^3)[/math]. This section, while giving a fundamental algorithm, should probably be skipped on a first reading. In particular we use several concepts from the theory of interior point methods which are described in Section Interior point methods.
The volumetric barrier
Let [math]A \in \mathbb{R}^{m \times n}[/math] where the [math]i^{th}[/math] row is [math]a_i \in \mathbb{R}^n[/math], and let [math]b \in \mathbb{R}^m[/math]. We consider the logarithmic barrier [math]F[/math] for the polytope [math]\{x \in \mathbb{R}^n : A x \gt b\}[/math] defined by
We also consider the volumetric barrier [math]v[/math] defined by
The intuition is clear: [math]v(x)[/math] is equal to the logarithm of the inverse volume of the Dikin ellipsoid (for the logarithmic barrier) at [math]x[/math]. It will be useful to spell out the hessian of the logarithmic barrier:
Introducing the leverage score
one can easily verify that
and
Vaidya's algorithm
We fix a small constant [math]\epsilon \leq 0.006[/math] whose precise value will be specified later. Vaidya's algorithm produces a sequence of pairs [math](A^{(t)}, b^{(t)}) \in \mathbb{R}^{m_t \times n} \times \mathbb{R}^{m_t}[/math] such that the corresponding polytope contains the convex set of interest. The initial polytope defined by [math](A^{(0)},b^{(0)})[/math] is a simplex (in particular [math]m_0=n+1[/math]). For [math]t\geq0[/math] we let [math]x_t[/math] be the minimizer of the volumetric barrier [math]v_t[/math] of the polytope given by [math](A^{(t)}, b^{(t)})[/math], and [math](\sigma_i^{(t)})_{i \in [m_t]}[/math] the leverage scores (associated with [math]v_t[/math]) at the point [math]x_t[/math]. We also write [math]F_t[/math] for the logarithmic barrier given by [math](A^{(t)}, b^{(t)})[/math]. The next polytope [math](A^{(t+1)}, b^{(t+1)})[/math] is defined by either adding a constraint to, or removing a constraint from, the current polytope:
- If for some [math]i \in [m_t][/math] one has [math]\sigma_i^{(t)} = \min_{j \in [m_t]} \sigma_j^{(t)} \lt \epsilon[/math], then [math](A^{(t+1)}, b^{(t+1)})[/math] is defined by removing the [math]i^{th}[/math] row in [math](A^{(t)}, b^{(t)})[/math] (in particular [math]m_{t+1} = m_t - 1[/math]).
- Otherwise let [math]c^{(t)}[/math] be the vector given by the separation oracle queried at [math]x_t[/math], and [math]\beta^{(t)} \in \mathbb{R}[/math] be chosen so that
[[math]] \frac{(\nabla^2 F_t(x_t) )^{-1}[c^{(t)}, c^{(t)}]}{(x_t^{\top} c^{(t)} - \beta^{(t)})^2} = \frac{1}{5} \sqrt{\epsilon} . [[/math]]Then we define [math](A^{(t+1)}, b^{(t+1)})[/math] by adding to [math](A^{(t)}, b^{(t)})[/math] the row given by [math](c^{(t)}, \beta^{(t)})[/math] (in particular [math]m_{t+1} = m_t + 1[/math]).
It can be shown that the volumetric barrier is a self-concordant barrier, and thus it can be efficiently minimized with Newton's method. In fact it is enough to do {\em one step} of Newton's method on [math]v_t[/math] initialized at [math]x_{t-1}[/math]; see [5][6] for more details on this.
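To make the quantities appearing above concrete, here is a small Python (NumPy) sketch computing the Hessian of the logarithmic barrier, the leverage scores, and the volumetric barrier of a polytope [math]\{x : Ax \gt b\}[/math] at a strictly feasible point [math]x[/math]. It assumes the standard formulas [math]F(x) = -\sum_{i=1}^m \log(a_i^{\top}x - b_i)[/math], [math]\sigma_i(x) = (\nabla^2 F(x))^{-1}[a_i,a_i] / (a_i^{\top}x - b_i)^2[/math] and [math]v(x) = \frac12 \log \det \nabla^2 F(x)[/math], consistent with the definitions of this section; it is an illustration only (in particular a careful implementation would avoid forming the explicit inverse of the Hessian).
```python
import numpy as np

def log_barrier_hessian(A, b, x):
    """Hessian of F(x) = -sum_i log(a_i^T x - b_i), i.e. sum_i a_i a_i^T / s_i^2."""
    s = A @ x - b                        # slacks, positive for strictly feasible x
    return (A / s[:, None] ** 2).T @ A

def leverage_scores(A, b, x):
    """sigma_i(x) = a_i^T (hess F(x))^{-1} a_i / (a_i^T x - b_i)^2."""
    s = A @ x - b
    Hinv = np.linalg.inv(log_barrier_hessian(A, b, x))
    return np.einsum('ij,jk,ik->i', A, Hinv, A) / s ** 2

def volumetric_barrier(A, b, x):
    """v(x) = 0.5 * log det hess F(x): up to an additive constant, the log of the
    inverse volume of the Dikin ellipsoid of the logarithmic barrier at x."""
    _, logdet = np.linalg.slogdet(log_barrier_hessian(A, b, x))
    return 0.5 * logdet
```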
Analysis of Vaidya's method
The construction of Vaidya's method is based on a precise understanding of how the volumetric barrier changes when one adds or removes a constraint to the polytope. This understanding is derived in Section Constraints and the volumetric barrier. In particular we obtain the following two key inequalities: If case 1 happens at iteration [math]t[/math] then
while if case 2 happens then
We now show how these inequalities imply that Vaidya's method stops after [math]O(n \log(n R/r))[/math] steps. First we claim that after [math]2t[/math] iterations, case 2 must have happened at least [math]t-1[/math] times. Indeed suppose that at iteration [math]2t-1[/math], case 2 has happened only [math]t-2[/math] times; then case 1 has happened [math]t+1[/math] times, so the current polytope has [math]m_{2t-1} = (n+1) + (t-2) - (t+1) = n-2 \lt n[/math] constraints, hence [math]\nabla^2 F(x)[/math] (a sum of fewer than [math]n[/math] rank-one matrices) is singular and the leverage scores are infinite, so case 2 must happen at iteration [math]2t[/math]. Combining this claim with the two inequalities above we obtain:
The key point now is to recall that by definition one has [math]v(x) = - \log \mathrm{vol}(\cE(x,1))[/math] where [math]\cE(x,r) = \{y : \nabla^2 F(x)[y-x,y-x] \leq r^2\}[/math] is the Dikin ellipsoid centered at [math]x[/math] and of radius [math]r[/math]. Moreover the logarithmic barrier [math]F[/math] of a polytope with [math]m[/math] constraints is [math]m[/math]-self-concordant, which implies that the polytope is included in the Dikin ellipsoid [math]\cE(z, 2m)[/math] where [math]z[/math] is the minimizer of [math]F[/math] (see [Theorem 4.2.6., [8]]). The volume of [math]\cE(z, 2m)[/math] is equal to [math](2m)^n \exp(-v(z))[/math], which is thus always an upper bound on the volume of the polytope. Combining this with the above display we just proved that at iteration [math]2k[/math] the volume of the current polytope is at most
Since [math]\cE(x,1)[/math] is always included in the polytope, [math]- v_0(x_0)[/math] is at most the logarithm of the volume of the initial polytope, which is [math]O(n \log(R))[/math]. This clearly concludes the proof, as the procedure necessarily stops when the volume drops below [math]\exp(n \log(r))[/math] (we also used the trivial bound [math]m_t \leq n+1+t[/math]).
Constraints and the volumetric barrier
We want to understand the effect on the volumetric barrier of the addition or deletion of a constraint to the polytope. Let [math]c \in \mathbb{R}^n[/math], [math]\beta \in \mathbb{R}[/math], and consider the logarithmic barrier [math]\tilde{F}[/math] and the volumetric barrier [math]\tilde{v}[/math] corresponding to the matrix [math]\tilde{A}\in \mathbb{R}^{(m+1) \times n}[/math] and the vector [math]\tilde{b} \in \mathbb{R}^{m+1}[/math] which are respectively the concatenation of [math]A[/math] and [math]c[/math], and the concatenation of [math]b[/math] and [math]\beta[/math]. Let [math]x^*[/math] and [math]\tilde{x}^*[/math] be the minimizers of [math]v[/math] and [math]\tilde{v}[/math] respectively. We recall the definition of leverage scores, for [math]i \in [m+1][/math], where [math]a_{m+1}=c[/math] and [math]b_{m+1}=\beta[/math],
The leverage scores [math]\sigma_i[/math] and [math]\tilde{\sigma}_i[/math] are closely related:
One has for any [math]i \in [m+1][/math],
First we observe that by the Sherman-Morrison formula [math](A+uv^{\top})^{-1} = A^{-1} - \frac{A^{-1} u v^{\top} A^{-1}}{1+A^{-1}[u,v]}[/math] one has
We now assume the following key result, which was first proven by Vaidya. To put the statement in context recall that for a self-concordant barrier [math]f[/math] the suboptimality gap [math]f(x) - \min f[/math] is intimately related to the Newton decrement [math]\|\nabla f(x) \|_{(\nabla^2 f(x))^{-1}}[/math]. Vaidya's inequality gives a similar claim for the volumetric barrier. We use the version given in [Theorem 2.6, [9]] which has slightly better numerical constants than the original bound. Recall also the definition of [math]Q[/math] from \eqref{eq:hessianvol}.
Let [math]\lambda(x) = \|\nabla v(x) \|_{Q(x)^{-1}}[/math] be an approximate Newton decrement, [math]\epsilon = \min_{i \in [m]} \sigma_i(x)[/math], and assume that [math]\lambda(x)^2 \leq \frac{2 \sqrt{\epsilon} - \epsilon}{36}[/math]. Then
We also write [math]\tilde{\lambda}[/math] for the approximate Newton decrement of [math]\tilde{v}[/math]. The goal for the rest of the section is to prove the following theorem, which gives the precise understanding of the volumetric barrier we were looking for.
Let [math]\epsilon := \min_{i \in [m]} \sigma_i(x^*)[/math], [math]\delta := \sigma_{m+1}(x^*) / \sqrt{\epsilon}[/math] and assume that [math]\frac{\left(\delta \sqrt{\epsilon} + \sqrt{\delta^{3} \sqrt{\epsilon}}\right)^2}{1- \delta \sqrt{\epsilon}} \lt \frac{2 \sqrt{\epsilon} - \epsilon}{36}[/math]. Then one has
We start with the proof of \eqref{eq:thV11}. First observe that by factoring [math](\nabla^2 F(x))^{1/2}[/math] on the left and on the right of [math]\nabla^2 \tilde{F}(x)[/math] one obtains
We now turn to the proof of \eqref{eq:thV12}. Following the same steps as above we immediately obtain
One has
We start with the proof of \eqref{eq:V21}. First observe that by Lemma one has [math]\tilde{Q}(x) \succeq (1-\sigma_{m+1}(x)) Q(x)[/math] and thus by definition of the Newton decrement
Conjugate gradient
We conclude this chapter with the special case of unconstrained optimization of a convex quadratic function [math]f(x) = \frac12 x^{\top} A x - b^{\top} x[/math], where [math]A \in \R^{n \times n}[/math] is a positive definite matrix and [math]b \in \R^n[/math]. This problem, of paramount importance in practice (it is equivalent to solving the linear system [math]Ax = b[/math]), admits a simple first-order black-box procedure which attains the {\em exact} optimum [math]x^*[/math] in at most [math]n[/math] steps. This method, called the {\em conjugate gradient}, is described and analyzed below. What is written below is taken from [Chapter 5, [10]]. Let [math]\langle \cdot , \cdot\rangle_A[/math] be the inner product on [math]\R^n[/math] defined by the positive definite matrix [math]A[/math], that is [math]\langle x, y\rangle_A = x^{\top} A y[/math] (we also denote by [math]\|\cdot\|_A[/math] the corresponding norm). For the sake of clarity we write here [math]\langle \cdot , \cdot\rangle[/math] for the standard inner product in [math]\R^n[/math]. Given an orthogonal set [math]\{p_0, \dots, p_{n-1}\}[/math] for [math]\langle \cdot , \cdot \rangle_A[/math] we will minimize [math]f[/math] by sequentially minimizing it along the directions given by this orthogonal set. That is, given [math]x_0 \in \R^n[/math], for [math]t \geq 0[/math] let
Equivalently one can write
The latter identity follows by differentiating [math]\lambda \mapsto f(x + \lambda p_t)[/math], and using that [math]\nabla f(x) = A x - b[/math]. We also make an observation that will be useful later, namely that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{p_0, \dots, p_t\}[/math], or equivalently
Equation \eqref{eq:CG3prime} is true by construction for [math]i=t[/math], and for [math]i \leq t-1[/math] it follows by induction, assuming \eqref{eq:CG3prime} at [math]t-1[/math] and using the following formula:
We now claim that [math]x_n = x^* = \argmin_{x \in \R^n} f(x)[/math]. It suffices to show that [math]\langle x_n -x_0 , p_t \rangle_A = \langle x^*-x_0 , p_t \rangle_A[/math] for any [math]t\in \{0,\dots,n-1\}[/math]. Note that [math]x_n - x_0 = - \sum_{t=0}^{n-1} \langle \nabla f(x_t), p_t \rangle \frac{p_t}{\|p_t\|_A^2}[/math], and thus using that [math]x^* = A^{-1} b[/math],
which concludes the proof of [math]x_n = x^*[/math].
In order to have a proper black-box method it remains to describe how to build iteratively the orthogonal set [math]\{p_0, \dots, p_{n-1}\}[/math] based only on gradient evaluations of [math]f[/math]. A natural guess to obtain a set of orthogonal directions (w.r.t. [math]\langle \cdot , \cdot \rangle_A[/math]) is to take [math]p_0 = \nabla f(x_0)[/math] and for [math]t \geq 1[/math],
Let us first verify by induction on [math]t \in [n-1][/math] that for any [math]i \in \{0,\dots,t-2\}[/math], [math]\langle p_t, p_{i}\rangle_A = 0[/math] (observe that for [math]i=t-1[/math] this is true by construction of [math]p_t[/math]). Using the induction hypothesis one can see that it is enough to show [math]\langle \nabla f(x_t), p_i \rangle_A = 0[/math] for any [math]i \in \{0, \dots, t-2\}[/math], which we prove now. First observe that by induction one easily obtains [math]A p_i \in \mathrm{span}\{p_0, \dots, p_{i+1}\}[/math] from \eqref{eq:CG3} and \eqref{eq:CG4}. Using this fact together with [math]\langle \nabla f(x_t), p_i \rangle_A = \langle \nabla f(x_t), A p_i \rangle[/math] and \eqref{eq:CG3prime} thus concludes the proof of orthogonality of the set [math]\{p_0, \dots, p_{n-1}\}[/math]. We still have to show that \eqref{eq:CG4} can be written by making reference only to the gradients of [math]f[/math] at previous points. Recall that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{p_0, \dots, p_t\}[/math], and thus given the form of [math]p_t[/math] we also have that [math]x_{t+1}[/math] is the minimizer of [math]f[/math] on [math]x_0 + \mathrm{span}\{\nabla f(x_0), \dots, \nabla f(x_t)\}[/math] (in some sense the conjugate gradient is the {\em optimal} first order method for convex quadratic functions). In particular one has [math]\langle \nabla f(x_{t+1}) , \nabla f(x_t) \rangle = 0[/math]. This fact, together with the orthogonality of the set [math]\{p_t\}[/math] and \eqref{eq:CG3}, implies that
Furthermore using the definition \eqref{eq:CG4} and [math]\langle \nabla f(x_t) , p_{t-1} \rangle = 0[/math] one also has
Thus we arrive at the following rewriting of the (linear) conjugate gradient algorithm, where we recall that [math]x_0[/math] is some fixed starting point and [math]p_0 = \nabla f(x_0)[/math],
Observe that the algorithm defined by \eqref{eq:CG5} and \eqref{eq:CG6} makes sense for an arbitrary convex function, in which case it is called the {\em non-linear conjugate gradient}. There are many variants of the non-linear conjugate gradient, and the above form is known as the Fletcher-Reeves method. Another popular version in practice is the Polak-Ribi{\`e}re method, which is based on the fact that for the general non-quadratic case one does not necessarily have [math]\langle \nabla f(x_{t+1}) , \nabla f(x_t) \rangle = 0[/math], and thus one replaces \eqref{eq:CG6} by
We refer to [10] for more details about these algorithms, as well as for advice on how to deal with the line search in \eqref{eq:CG5}. Finally we also note that the linear conjugate gradient method can often attain an approximate solution in far fewer than [math]n[/math] steps. More precisely, denoting [math]\kappa[/math] for the condition number of [math]A[/math] (that is, the ratio of the largest eigenvalue to the smallest eigenvalue of [math]A[/math]), one can show that the linear conjugate gradient attains an [math]\epsilon[/math]-optimal point in a number of iterations of order [math]\sqrt{\kappa} \log(1/\epsilon)[/math]. The next chapter will demystify this convergence rate, and in particular we will see that (i) this is the optimal rate among first order methods, and (ii) there is a way to generalize this rate to non-quadratic convex functions (though the algorithm will have to be modified).
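For concreteness, here is a minimal Python (NumPy) implementation of the linear conjugate gradient in the gradient-only form \eqref{eq:CG5}-\eqref{eq:CG6} derived above. The closed-form step size is the exact line search for the quadratic [math]f[/math], and the stopping tolerance is an added convenience; this is a sketch rather than a robust solver.
```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Linear conjugate gradient for f(x) = 0.5 x^T A x - b^T x, A positive definite.

    Directions follow the Fletcher-Reeves form: p_0 = grad f(x_0) and
    p_t = grad f(x_t) + (||grad f(x_t)||^2 / ||grad f(x_{t-1})||^2) p_{t-1}.
    In exact arithmetic the minimizer x^* = A^{-1} b is reached in at most n steps.
    """
    x = np.array(x0, dtype=float)
    g = A @ x - b                          # gradient of f at x
    p = g.copy()                           # p_0 = grad f(x_0)

    for _ in range(len(b)):
        if np.linalg.norm(g) <= tol:
            break
        Ap = A @ p
        alpha = (g @ p) / (p @ Ap)         # exact line search along -p
        x = x - alpha * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        p = g_new + beta * p
        g = g_new

    return x
```
For instance, on a random positive definite system one can check that the output agrees with np.linalg.solve(A, b) up to numerical precision.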
General references
Bubeck, Sébastien (2015). "Convex Optimization: Algorithms and Complexity". arXiv:1405.4980 [math.OC].
References
- A. Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244--1247, 1965.
- D. Newman. Location of the maximum on unimodal surfaces. Journal of the ACM, 12(3): 395--398, 1965.
- A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
- B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific J. Math, 10(4): 1257--1261, 1960.
- P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338--343, 1989.
- P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. Mathematical Programming, 73(3): 291--341, 1996.
- Y.-T. Lee, A. Sidford, and S. C.-W. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. arXiv:1508.04874, 2015.
- Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
- K. M. Anstreicher. Towards a practical volumetric cutting plane method for convex programming. SIAM Journal on Optimization, 9(1): 190--206, 1998.
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.