
Matrix estimation

[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} \newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} 
\newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} \newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommandi.i.d{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} \newcommand{\Thetasvt}{\hat \Theta^{\textsc{svt}}} \newcommand{\Thetarank}{\hat \Theta^{\textsc{rk}}} [/math]

Over the past decade or so, matrices have entered the picture of high-dimensional statistics for several reasons. Perhaps the simplest explanation is that they are the most natural extension of vectors. While this is true, and we will see examples where the extension from vectors to matrices is straightforward, matrices have a much richer structure than vectors, allowing interactions between their rows and columns. In particular, while we have been describing vectors in terms of their sparsity, here we can measure the complexity of a matrix by its rank. This feature has been successfully employed in a variety of applications ranging from multi-task learning to collaborative filtering, the latter made popular in particular by the NETFLIX prize. In this chapter, we study several statistical problems where the parameter of interest [math]\theta[/math] is a matrix rather than a vector. These problems include multivariate regression, covariance matrix estimation, and principal component analysis. Before getting to these topics, we begin with a quick reminder on matrices and linear algebra.

Basic facts about matrices

Matrices are much more complicated objects than vectors. In particular, while vectors can be identified with linear operators from [math]\R^d[/math] to [math]\R[/math], matrices can be identified with linear operators from [math]\R^d[/math] to [math]\R^n[/math] for [math]n \ge 1[/math]. This seemingly simple fact gives rise to a profusion of notions and properties as illustrated by Bernstein's book [1] that contains facts about matrices over more than a thousand pages. Fortunately, we will need only a small number of such properties, which can be found in the excellent book [2], which has become a standard reference on matrices and numerical linear algebra.

Singular value decomposition

Let [math]A=\{a_{ij}, 1\le i \le m, 1\le j \le n\}[/math] be an [math]m \times n[/math] real matrix of rank [math]r \le \min(m,n)[/math]. The Singular Value Decomposition (SVD) of [math]A[/math] is given by

[[math]] A=UDV^\top=\sum_{j=1}^r\lambda_j u_jv_j^\top\,, [[/math]]

where [math]D[/math] is an [math]r\times r[/math] diagonal matrix with positive diagonal entries [math]\{\lambda_1, \ldots, \lambda_r\}[/math], [math]U[/math] is a matrix with columns [math]\{u_1, \ldots, u_r\} \in \R^m[/math] that are orthonormal, and [math]V[/math] is a matrix with columns [math]\{v_1, \ldots, v_r\} \in \R^n[/math] that are also orthonormal. Moreover, it holds that

[[math]] AA^\top u_j=\lambda_j^2 u_j\,, \qquad \text{and} \qquad A^\top A v_j=\lambda_j^2v_j [[/math]]

for [math]j =1, \ldots, r[/math]. The values [math]\lambda_j \gt 0[/math] are called singular values of [math]A[/math] and are uniquely defined. If the rank [math]r \lt \min(n,m)[/math], then the singular values of [math]A[/math] are given by [math]\lambda=(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0)^\top \in \R^{\min(n,m)}[/math] where there are [math]\min(n,m)-r[/math] zeros. This way, the vector [math]\lambda[/math] of singular values of an [math]n\times m[/math] matrix is a vector in [math]\R^{\min(n,m)}[/math]. In particular, if [math]A[/math] is an [math]n \times n[/math] symmetric positive semidefinite (PSD) matrix, i.e., [math]A^\top=A[/math] and [math]u^\top A u\ge 0[/math] for all [math]u \in \R^n[/math], then the singular values of [math]A[/math] are equal to its eigenvalues. The largest singular value of [math]A[/math], denoted by [math]\lambda_{\max{}}(A)[/math], also satisfies the following variational formulation:

[[math]] \lambda_{\max{}}(A)=\max_{x \in \R^n}\frac{|Ax|_2}{|x|_2}=\max_{\substack{x \in \R^n\\ y \in \R^m}}\frac{y^\top A x}{|y|_2|x|_2}=\max_{\substack{x \in \cS^{n-1}\\ y \in \cS^{m-1}}}y^\top A x\,. [[/math]]

In the case of a [math]n \times n[/math] PSD matrix [math]A[/math], we have

[[math]] \lambda_{\max{}}(A)=\max_{x \in \cS^{n-1}}x^\top A x\,. [[/math]]

In these notes, we always assume that eigenvectors and singular vectors have unit norm.
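As a quick numerical sanity check (a minimal numpy sketch; the matrix and its dimensions are arbitrary illustrations, not taken from the text), the facts above can be verified directly: the SVD reconstructs [math]A[/math], the squared singular values are the eigenvalues of [math]A^\top A[/math], and the largest singular value dominates [math]|Ax|_2/|x|_2[/math].

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

# Thin SVD: A = U diag(lambda) V^T with orthonormal columns in U and V
U, lam, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(lam) @ Vt)
assert np.allclose(U.T @ U, np.eye(n)) and np.allclose(Vt @ Vt.T, np.eye(n))

# Squared singular values are the eigenvalues of A^T A (and of A A^T)
assert np.allclose(np.sort(lam**2), np.sort(np.linalg.eigvalsh(A.T @ A)))

# The largest singular value is the operator norm: |Ax|_2 <= lambda_max |x|_2
x = rng.standard_normal(n)
assert np.linalg.norm(A @ x) <= lam[0] * np.linalg.norm(x) + 1e-12
</syntaxhighlight>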

Norms and inner product

Let [math]A=\{a_{ij}\}[/math] and [math]B=\{b_{ij}\}[/math] be two real matrices. Their size will be implicit in the following notation.

Vector norms

The simplest way to treat a matrix is to deal with it as if it were a vector. In particular, we can extend [math]\ell_q[/math] norms to matrices:

[[math]] |A|_q=\Big(\sum_{ij}|a_{ij}|^q\Big)^{1/q}\,, \quad q \gt 0\,. [[/math]]

The cases where [math]q \in \{0,\infty\}[/math] can also be extended to matrices:

[[math]] |A|_0=\sum_{ij}\1(a_{ij}\neq 0)\,, \qquad |A|_\infty=\max_{ij}|a_{ij}|\,. [[/math]]

The case [math]q=2[/math] plays a particular role for matrices and [math]|A|_2[/math] is called the Frobenius norm of [math]A[/math] and is often denoted by [math]\|A\|_F[/math]. It is also the Hilbert-Schmidt norm associated to the inner product:

[[math]] \langle A,B\rangle=\tr(A^\top B)=\tr(B^\top A)\,. [[/math]]
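For concreteness, here is a minimal numpy sketch (with arbitrary matrices) computing these entrywise norms and the trace inner product.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

ell1   = np.abs(A).sum()            # |A|_1
frob   = np.sqrt((A**2).sum())      # |A|_2, the Frobenius norm ||A||_F
ell0   = np.count_nonzero(A)        # |A|_0
ellinf = np.abs(A).max()            # |A|_inf

# Hilbert-Schmidt / trace inner product <A, B> = Tr(A^T B) = Tr(B^T A)
inner = np.trace(A.T @ B)
assert np.isclose(inner, (A * B).sum())
assert np.isclose(frob, np.linalg.norm(A, 'fro'))
</syntaxhighlight>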

Spectral norms

Let [math]\lambda=(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0)[/math] be the singular values of a matrix [math]A[/math]. We can define spectral norms on [math]A[/math] as vector norms on the vector [math]\lambda[/math]. In particular, for any [math]q\in [1, \infty][/math],

[[math]] \|A\|_q=|\lambda|_q\,, [[/math]]

is called the Schatten [math]q[/math]-norm of [math]A[/math]. Here again, special cases have special names (a short numerical illustration follows the list):

  • [math]q=2[/math]: [math]\|A\|_2=\|A\|_F[/math] is the Frobenius norm defined above.
  • [math]q=1[/math]: [math]\|A\|_1=\|A\|_*[/math] is called the Nuclear norm (or trace norm) of [math]A[/math].
  • [math]q=\infty[/math]: [math]\|A\|_\infty=\lambda_{\max{}}(A)=\|A\|_{\mathrm{op}}[/math] is called the operator norm (or spectral norm) of [math]A[/math].
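In numpy terms (a small sketch with an arbitrary matrix), these Schatten norms are obtained by applying the corresponding vector norms to the vector of singular values.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
lam = np.linalg.svd(A, compute_uv=False)   # singular values, nonincreasing

nuclear   = lam.sum()                 # ||A||_1, the nuclear (trace) norm
frobenius = np.sqrt((lam**2).sum())   # ||A||_2 = ||A||_F
operator  = lam.max()                 # ||A||_inf, the operator (spectral) norm

assert np.isclose(nuclear,   np.linalg.norm(A, 'nuc'))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(operator,  np.linalg.norm(A, 2))
</syntaxhighlight>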

We are going to employ these norms to assess the proximity of an estimator to the matrix of interest. While the interpretation of vector norms is clear by extension from the vector case, the meaning of "[math]\|A-B\|_{\mathrm{op}}[/math] is small" is not as transparent. The following subsection provides some inequalities (without proofs) that allow a better reading.

Useful matrix inequalities

Let [math]A[/math] and [math]B[/math] be two [math]m\times n[/math] matrices with singular values [math] \lambda_1(A) \ge \lambda_2(A) \ge \ldots \ge \lambda_{\min(m,n)}(A)[/math] and [math] \lambda_1(B) \ge \ldots \ge \lambda_{\min(m,n)}(B)[/math] respectively. Then the following inequalities hold:

[[math]] \begin{array}{ll} \DS \max_k\big|\lambda_k(A)-\lambda_k(B)\big|\le \|A-B\|_{\mathrm{op}}\,, & \text{Weyl (1912)}\\ \DS \sum_{k}\big|\lambda_k(A)-\lambda_k(B)\big|^2\le \|A-B\|_F^2\,,& \text{Hoffman-Wielandt (1953)}\\ \DS \langle A, B\rangle \le \|A\|_p \|B\|_q\,, \ \frac{1}{p}+\frac{1}{q}=1, p,q \in [1, \infty]\,,& \text{H\"older} \end{array} [[/math]]

The singular value decomposition is associated to the following useful lemma, known as the Eckart-Young (or Eckart-Young-Mirsky) theorem. It states that for any given [math]k[/math], the closest matrix of rank [math]k[/math] to a given matrix [math]A[/math] in Frobenius norm is given by its truncated SVD.

Lemma

Let [math]A[/math] be a rank-[math]r[/math] matrix with singular value decomposition

[[math]] A = \sum_{i=1}^r \lambda_i u_i v_i^\top\,, [[/math]]
where [math]\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \gt 0[/math] are the ordered singular values of [math]A[/math]. For any [math]k \lt r[/math], let [math]A_k[/math] be the truncated singular value decomposition of [math]A[/math] of order [math]k[/math], given by

[[math]] A_k= \sum_{i=1}^k \lambda_i u_i v_i^\top\,. [[/math]]
Then for any matrix [math]B[/math] such that [math]\rank(B) \le k[/math], it holds

[[math]] \|A-A_k\|_F\le \|A-B\|_F\,. [[/math]]
Moreover,

[[math]] \|A-A_k\|_F^2=\sum_{j=k+1}^r \lambda_j^2\,. [[/math]]


Show Proof

Note that the last equality of the lemma is obvious since

[[math]] A-A_k=\sum_{j=k+1}^r \lambda_ju_jv_j^\top [[/math]]
and the matrices in the sum above are orthogonal to each other. Thus, it is sufficient to prove that for any matrix [math]B[/math] such that [math]\rank(B) \le k[/math], it holds

[[math]] \|A-B\|_F^2 \ge \sum_{j=k+1}^r \lambda_j^2\,. [[/math]]
To that end, denote by [math]\sigma_1 \ge \sigma_2 \ge\dots \ge \sigma_k \ge 0[/math] the ordered singular values of [math]B[/math]---some may be equal to zero if [math]\rank(B) \lt k[/math]---and observe that it follows from the Hoffman-Wielandt inequality that

[[math]] \begin{equation*} \|A-B\|_F^2 \ge \sum_{j=1}^r\big(\lambda_j -\sigma_j\big)^2=\sum_{j=1}^k\big(\lambda_j -\sigma_j\big)^2+ \sum_{j=k+1}^r \lambda_j^2\ge \sum_{j=k+1}^r \lambda_j^2\,. \end{equation*} [[/math]]
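The lemma is easy to check numerically. The sketch below (arbitrary test matrix; the truncation order [math]k=2[/math] is an illustrative choice) compares the truncated SVD with randomly generated rank-[math]k[/math] competitors.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, lam, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * lam[:k]) @ Vt[:k, :]        # truncated SVD of order k

# ||A - A_k||_F^2 is the sum of the discarded squared singular values
assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, (lam[k:]**2).sum())

# Spot check: random rank-k matrices never do better than A_k
for _ in range(200):
    B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
    assert np.linalg.norm(A - B, 'fro') >= np.linalg.norm(A - A_k, 'fro') - 1e-9
</syntaxhighlight>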

Multivariate regression

In the traditional regression setup, the response variable [math]Y[/math] is a scalar. In several applications, the goal is not to predict a single variable but rather a whole vector [math]Y \in \R^T[/math], still from a covariate [math]X \in \R^d[/math]. A standard example arises in genomics data where [math]Y[/math] contains [math]T[/math] physical measurements of a patient and [math]X[/math] contains the expression levels for [math]d[/math] genes. As a result, the regression function in this case, [math]f(x)=\E[Y|X=x][/math], is a function from [math]\R^d[/math] to [math]\R^T[/math]. Clearly, [math]f[/math] can be estimated independently for each coordinate, using the tools that we have developed in the previous chapter. However, we will see that in several interesting scenarios, some structure is shared across coordinates and this information can be leveraged to yield better prediction bounds.

The model

Throughout this section, we consider the following multivariate linear regression model:

[[math]] \begin{equation} \label{EQ:MVRmodel} \Y=\X\Theta^* + E\,, \end{equation} [[/math]]

where [math]\Y \in \R^{n\times T}[/math] is the matrix of observed responses, [math]\X[/math] is the [math]n \times d[/math] observed design matrix (as before), [math]\Theta^* \in \R^{d\times T}[/math] is the matrix of unknown parameters and [math]E\sim \sg_{n\times T}(\sigma^2)[/math] is the noise matrix. In this chapter, we will focus on the prediction task, which consists in estimating [math]\X\Theta^*[/math]. As mentioned in the foreword of this chapter, we can view this problem as [math]T[/math] (univariate) linear regression problems [math]Y^{(j)}=\X\theta^{*,(j)}+ \eps^{(j)}, j=1, \ldots, T[/math], where [math]Y^{(j)}, \theta^{*,(j)}[/math] and [math]\eps^{(j)}[/math] are the [math]j[/math]th column of [math]\Y, \Theta^*[/math] and [math]E[/math] respectively. In particular, an estimator for [math]\X\Theta^*[/math] can be obtained by concatenating the estimators for each of the [math]T[/math] problems. This approach is the subject of Problem.

The columns of [math]\Theta^*[/math] correspond to [math]T[/math] different regression tasks. Consider the following example as a motivation. Assume that the Subway headquarters want to evaluate the effect of [math]d[/math] variables (promotions, day of the week, TV ads, ...) on their sales. To that end, they ask each of their [math]T=40,000[/math] restaurants to report their sales numbers for the past [math]n=200[/math] days. As a result, franchise [math]j[/math] returns to headquarters a vector [math]\Y^{(j)} \in \R^n[/math]. The [math]d[/math] variables for each of the [math]n[/math] days are already known to headquarters and are stored in a matrix [math]\X \in \R^{n\times d}[/math]. In this case, it may be reasonable to assume that the same subset of variables has an impact on the sales for each of the franchises, though the magnitude of this impact may differ from franchise to franchise. As a result, one may assume that each of the [math]T[/math] columns of [math]\Theta^*[/math] is sparse and that they all share the same sparsity pattern, i.e., [math]\Theta^*[/math] is of the form:

[[math]] \Theta^*=\left(\begin{array}{cccc}0 & 0 & 0 & 0 \\\bullet & \bullet & \bullet & \bullet \\\bullet & \bullet & \bullet & \bullet \\0 & 0 & 0 & 0 \\\vdots & \vdots & \vdots & \vdots \\0 & 0 & 0 & 0 \\\bullet & \bullet & \bullet & \bullet\end{array}\right)\,, [[/math]]

where [math]\bullet[/math] indicates a potentially nonzero entry. It follows from the result of ProblemMatrix estimation that if each task is performed individually, one may find an estimator [math]\hat \Theta[/math] such that

[[math]] \frac{1}{n}\E\|\X\hat \Theta -\X \Theta^*\|_F^2 \lesssim \sigma^2\frac{kT\log(ed)}{n}\,, [[/math]]

where [math]k[/math] is the number of nonzero coordinates in each column of [math]\Theta^*[/math]. Recall that the term [math]\log(ed)[/math] corresponds to the additional price to pay for not knowing where the nonzero components are. However, when the number of tasks grows, this problem should become easier. This fact was proved in [3]. We will see that we can recover a similar phenomenon when the number of tasks becomes large, though it needs to be larger than in [3]. Indeed, rather than exploiting sparsity, observe that such a matrix [math]\Theta^*[/math] has rank [math]k[/math]. This is the kind of structure that we will be predominantly using in this chapter. Rather than assuming that the columns of [math]\Theta^*[/math] share the same sparsity pattern, it may be more appropriate to assume that the matrix [math]\Theta^*[/math] is low rank or approximately so. As a result, while the matrix may not be sparse at all, the fact that it is low rank still materializes the idea that some structure is shared across different tasks. In this more general setup, it is assumed that the columns of [math]\Theta^*[/math] live in a lower dimensional space. Going back to the Subway example, this amounts to assuming that while there are 40,000 franchises, there are only a few canonical profiles for these franchises and that all franchises are linear combinations of these profiles.

Sub-Gaussian matrix model

Recall that under the assumption [math]\textsf{ORT}[/math] for the design matrix, i.e., [math]\X^\top \X=nI_d[/math], the univariate regression model can be reduced to the sub-Gaussian sequence model. Here we investigate the effect of this assumption on the multivariate regression model \eqref{EQ:MVRmodel}. Observe that under assumption [math]\textsf{ORT}[/math],

[[math]] \frac{1}{n}\X^\top \Y=\Theta^* + \frac{1}{n}\X^\top E\,. [[/math]]

This can be written as an equation in [math]\R^{d\times T}[/math] called the sub-Gaussian matrix model (sGMM):

[[math]] \begin{equation} \label{EQ:sGMM} y=\Theta^* + F\,, \end{equation} [[/math]]

where [math]y=\frac{1}{n}\X^\top \Y[/math] and [math]F=\frac{1}{n}\X^\top E\sim \sg_{d\times T}(\sigma^2/n)[/math]. Indeed, for any [math]u \in \cS^{d-1}, v \in \cS^{T-1}[/math], it holds

[[math]] u^\top Fv=\frac{1}{n}(\X u)^\top E v=\frac{1}{\sqrt{n}} w^\top E v \sim \sg(\sigma^2/n)\,, [[/math]]

where [math]w=\X u/\sqrt{n}[/math] has unit norm: [math]|w|_2^2=u^\top \frac{\X^\top \X}{n}u=|u|_2^2=1[/math]. Akin to the sub-Gaussian sequence model, we have a direct observation model where we observe the parameter of interest with additive noise. This enables us to use thresholding methods for estimating [math]\Theta^*[/math] when [math]|\Theta^*|_0[/math] is small. However, this also follows from Problem. The reduction to the vector case in the sGMM is just as straightforward. The interesting analysis begins when [math]\Theta^*[/math] is low-rank, which is equivalent to sparsity in its unknown eigenbasis. Consider the SVD of [math]\Theta^*[/math]:

[[math]] \Theta^*=\sum_{j} \lambda_j u_j v_j^\top\,. [[/math]]

and recall that [math]\|\Theta^*\|_0=|\lambda|_0[/math]. Therefore, if we knew [math]u_j[/math] and [math]v_j[/math], we could simply estimate the [math]\lambda_j[/math]s by hard thresholding. It turns out that estimating these singular vectors by those of [math]y[/math] is sufficient. Consider the SVD of the observed matrix [math]y[/math]:

[[math]] y=\sum_{j} \hat \lambda_j \hat u_j \hat v_j^\top\,. [[/math]]

Definition

The singular value thresholding estimator with threshold [math]2\tau\ge 0[/math] is defined by

[[math]] \Thetasvt=\sum_{j} \hat \lambda_j\1(|\hat \lambda_j| \gt 2\tau) \hat u_j \hat v_j^\top\,. [[/math]]
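Computationally, the estimator amounts to one SVD of the observation followed by hard thresholding of the singular values. Below is a minimal numpy sketch; the synthetic sizes, the noise level and the particular threshold are illustrative assumptions, with the threshold taken of the same order as the operator norm of the noise, in the spirit of the theory that follows.

<syntaxhighlight lang="python">
import numpy as np

def svt(y, threshold):
    """Singular value thresholding: keep singular values strictly above `threshold`."""
    U, lam, Vt = np.linalg.svd(y, full_matrices=False)
    keep = lam > threshold
    return (U[:, keep] * lam[keep]) @ Vt[keep, :]

# Synthetic direct-observation model y = Theta^* + F with i.i.d. Gaussian noise
rng = np.random.default_rng(4)
d, T, r, sigma = 50, 40, 3, 0.1
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
y = Theta_star + sigma * rng.standard_normal((d, T))

Theta_hat = svt(y, threshold=2 * sigma * (np.sqrt(d) + np.sqrt(T)))
print(np.linalg.matrix_rank(Theta_hat))   # typically recovers the true rank, 3
</syntaxhighlight>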

Recall that the threshold for the hard thresholding estimator was chosen to be the level of the noise with high probability. The singular value thresholding estimator obeys the same rule, except that the norm in which the magnitude of the noise is measured is adapted to the matrix case. Specifically, the following lemma will allow us to control the operator norm of the matrix [math]F[/math].

Lemma

Let [math]A[/math] be a [math]d\times T[/math] random matrix such that [math]A \sim \sg_{d\times T}(\sigma^2)[/math]. Then

[[math]] \|A\|_{\mathrm{op}} \le 4\sigma\sqrt{\log(12)(d\vee T)}+2\sigma\sqrt{2\log (1/\delta)} [[/math]]
with probability [math]1-\delta[/math].


Show Proof

This proof follows the same steps as ProblemSub-Gaussian Random Variables. Let [math]\cN_1[/math] be a [math]1/4[/math]-net for [math]\cS^{d-1}[/math] and [math]\cN_2[/math] be a [math]1/4[/math]-net for [math]\cS^{T-1}[/math]. It follows from Lemma that we can always choose [math]|\cN_1|\le 12^d[/math] and [math]|\cN_2|\le 12^T[/math]. Moreover, for any [math]u \in \cS^{d-1}, v \in \cS^{T-1}[/math], it holds

[[math]] \begin{align*} u^\top A v &\le \max_{x \in \cN_1}x^\top A v + \frac{1}{4} \max_{u \in \cS^{d-1}}u^\top A v\\ & \le \max_{x \in \cN_1}\max_{y \in \cN_2}x^\top A y+ \frac{1}{4} \max_{x \in \cN_1}\max_{v \in \cS^{T-1}}x^\top A v + \frac{1}{4} \max_{u \in \cS^{d-1}}u^\top A v\\ &\le \max_{x \in \cN_1}\max_{y \in \cN_2}x^\top A y +\frac{1}{2} \max_{u \in \cS^{d-1}}\max_{v \in \cS^{T-1}}u^\top A v \end{align*} [[/math]]
It yields

[[math]] \|A\|_{\textrm{op}} \le 2\max_{x \in \cN_1}\max_{y \in \cN_2}x^\top A y [[/math]]
So that for any [math]t \ge 0[/math], by a union bound,

[[math]] \p\big(\|A\|_{\textrm{op}} \gt t\big) \le \sum_{\substack{x \in \cN_1\\y \in \cN_2}}\p\big(x^\top A y \gt t/2\big) [[/math]]
Next, since [math]A \sim \sg_{d\times T}(\sigma^2)[/math], it holds that [math]x^\top A y\sim \sg(\sigma^2)[/math] for any [math]x \in \cN_1, y \in \cN_2[/math]. Together with the above display, it yields

[[math]] \p\big(\|A\|_{\textrm{op}} \gt t\big) \le 12^{d+T}\exp\big(-\frac{t^2}{8\sigma^2}\big) \le \delta [[/math]]
for

[[math]] t\ge 4\sigma\sqrt{\log(12)(d\vee T)}+2\sigma\sqrt{2\log (1/\delta)}\,. [[/math]]

The following theorem holds.

Theorem

Consider the multivariate linear regression model \eqref{EQ:MVRmodel} under the assumption [math]\textsf{ORT}[/math] or, equivalently, the sub-Gaussian matrix model \eqref{EQ:sGMM}. Then, the singular value thresholding estimator [math]\Thetasvt[/math] with threshold

[[math]] \begin{equation} \label{EQ:threshSVT} 2\tau=8\sigma\sqrt{\frac{\log(12)(d\vee T)}{n}}+4\sigma\sqrt{\frac{2\log (1/\delta)}{n}}\,, \end{equation} [[/math]]
satisfies

[[math]] \begin{align*} \frac{1}{n}\|\X\Thetasvt-\X\Theta^*\|_F^2=\|\Thetasvt-\Theta^*\|_F^2& \le 144\rank(\Theta^*)\tau^2\\ &\lesssim \frac{\sigma^2\rank(\Theta^*)}{n}\Big(d\vee T + \log(1/\delta)\Big)\,. \end{align*} [[/math]]
with probability [math]1-\delta[/math].


Show Proof

Assume without loss of generality that the singular values of [math]\Theta^*[/math] and [math]y[/math] are arranged in nonincreasing order: [math]\lambda_1 \ge \lambda_2 \ge \dots[/math] and [math]\hat \lambda_1 \ge \hat \lambda_2\ge \dots[/math]. Define the set [math]S=\{j\,:\, |\hat \lambda_j| \gt 2 \tau\}[/math]. Observe first that it follows from Lemma that [math]\|F\|_{\mathrm{op}} \le \tau[/math] for [math]\tau[/math] chosen as in \eqref{EQ:threshSVT} on an event [math]\cA[/math] such that [math]\p(\cA)\ge 1-\delta[/math]. The rest of the proof assumes that the event [math]\cA[/math] occurred. Note that it follows from Weyl's inequality that [math]|\hat \lambda_j - \lambda_j|\le \|F\|_{\mathrm{op}}\le \tau[/math]. It implies that [math]S \subset \{j\,:\, |\lambda_j| \gt \tau\}[/math] and [math]S^c \subset \{j\,:\, |\lambda_j| \le 3 \tau\}[/math]. Next define the oracle [math]\bar \Theta=\sum_{j \in S} \lambda_j u_jv_j^\top[/math] and note that

[[math]] \begin{equation} \label{EQ:prSVT1} \|\Thetasvt -\Theta^*\|_F^2 \le 2\|\Thetasvt -\bar \Theta\|_F^2 +2\|\bar \Theta -\Theta^*\|_F^2 \end{equation} [[/math]]
Using the H\"older inequality, we control the first term as follows

[[math]] \|\Thetasvt -\bar \Theta\|_F^2\le \rank(\Thetasvt - \bar \Theta)\|\Thetasvt -\bar \Theta\|_{\mathrm{op}}^2 \le 2|S|\|\Thetasvt -\bar \Theta\|_{\mathrm{op}}^2 [[/math]]
Moreover,

[[math]] \begin{align*} \|\Thetasvt -\bar \Theta\|_{\mathrm{op}}&\le \|\Thetasvt -y\|_{\mathrm{op}}+\|y - \Theta^*\|_{\mathrm{op}}+\|\Theta^* -\bar \Theta\|_{\mathrm{op}}\\ &\le \max_{j \in S^c}|\hat \lambda_j|+\tau + \max_{j \in S^c}|\lambda_j| \le 6\tau\,. \end{align*} [[/math]]
Therefore,

[[math]] \|\Thetasvt -\bar \Theta\|_F^2\le 72|S|\tau^2=72 \sum_{j\in S} \tau^2\,. [[/math]]
The second term in \eqref{EQ:prSVT1} can be written as

[[math]] \|\bar \Theta -\Theta^*\|_F^2 =\sum_{j \in S^c}|\lambda_j|^2\,. [[/math]]
Plugging the above two displays in \eqref{EQ:prSVT1}, we get

[[math]] \|\Thetasvt -\Theta^*\|_F^2 \le 144\sum_{j \in S}\tau^2+\sum_{j \in S^c}|\lambda_j|^2 [[/math]]
Since on [math]S[/math], [math]\tau^2=\min(\tau^2, |\lambda_j|^2)[/math] and on [math]S^c[/math], [math]|\lambda_j|^2\le 9\min(\tau^2, |\lambda_j|^2)[/math], it yields,

[[math]] \begin{align*} \|\Thetasvt -\Theta^*\|_F^2 &\le 144\sum_{j}\min(\tau^2,|\lambda_j|^2)\\ &\le 144\sum_{j=1}^{\rank(\Theta^*)}\tau^2\\ &=144\rank(\Theta^*)\tau^2\,. \end{align*} [[/math]]

In the next subsection, we extend our analysis to the case where [math]\X[/math] does not necessarily satisfy the assumption [math]\textsf{ORT}[/math].

Penalization by rank

The estimator from this section is the counterpart of the BIC estimator in the spectral domain. However, we will see that unlike BIC, it can be computed efficiently.

Let [math]\Thetarank[/math] be any solution to the following minimization problem:

[[math]] \min_{\Theta \in \R^{d \times T}}\Big\{ \frac{1}{n}\|\Y -\X \Theta\|_F^2 + 2\tau^2\rank(\Theta)\Big\}\,. [[/math]]

This estimator is called estimator by rank penalization with regularization parameter [math]\tau^2[/math]. It enjoys the following property.

Theorem

Consider the multivariate linear regression model \eqref{EQ:MVRmodel}. Then, the estimator by rank penalization [math]\Thetarank[/math] with regularization parameter [math]\tau^2[/math], where [math]\tau[/math] is defined in \eqref{EQ:threshSVT}, satisfies

[[math]] \frac{1}{n}\|\X\Thetarank-\X\Theta^*\|_F^2 \le 8\rank(\Theta^*)\tau^2\lesssim \frac{\sigma^2\rank(\Theta^*)}{n}\Big(d\vee T + \log(1/\delta)\Big)\,. [[/math]]
with probability [math]1-\delta[/math].


Show Proof

We begin as usual by noting that

[[math]] \|\Y -\X \Thetarank\|_F^2 + 2n\tau^2\rank(\Thetarank)\le \|\Y -\X \Theta^*\|_F^2 + 2n\tau^2\rank(\Theta^*)\,, [[/math]]
which is equivalent to

[[math]] \|\X\Thetarank -\X\Theta^*\|_F^2 \le 2\langle E, \X\Thetarank -\X\Theta^*\rangle -2n\tau^2\rank(\Thetarank)+2n\tau^2\rank(\Theta^*)\,. [[/math]]
Next, by Young's inequality, we have

[[math]] 2\langle E, \X\Thetarank -\X\Theta^*\rangle\le 2\langle E, U\rangle^2 + \frac{1}{2}\|\X\Thetarank -\X\Theta^*\|_F^2\,, [[/math]]
where

[[math]] U=\frac{\X\Thetarank -\X\Theta^*}{\|\X\Thetarank -\X\Theta^*\|_F}\,. [[/math]]
Write

[[math]] \X\Thetarank -\X\Theta^*=\Phi N\,, [[/math]]
where [math]\Phi[/math] is an [math]n \times r[/math] matrix, [math]r \le d[/math], whose columns form an orthonormal basis of the column span of [math]\X[/math]. The matrix [math]\Phi[/math] can come from the SVD of [math]\X[/math] for example: [math]\X=\Phi \Lambda \Psi^\top[/math]. It yields

[[math]] U=\frac{\Phi N}{\|N\|_F} [[/math]]
and

[[math]] \begin{equation} \label{EQ:prmatrixBIC1} \|\X\Thetarank -\X\Theta^*\|_F^2 \le 4\langle \Phi^\top E, N/\|N\|_F\rangle^2 -4n\tau^2\rank(\Thetarank)+4n\tau^2\rank(\Theta^*)\,. \end{equation} [[/math]]


Note that [math]\rank(N)\le \rank(\Thetarank)+\rank(\Theta^*)[/math]. Therefore, by Hölder's inequality, we get

[[math]] \begin{align*} \langle E, U\rangle^2&=\langle \Phi^\top E, N/\|N\|_F\rangle^2\\ &\le \|\Phi^\top E\|_{\mathrm{op}}^2\frac{\|N\|_1^2}{\|N\|_F^2} \\ &\le \rank(N)\|\Phi^\top E\|_{\mathrm{op}}^2\\ &\le \|\Phi^\top E\|_{\mathrm{op}}^2\big[\rank(\Thetarank)+\rank(\Theta^*)\big]\,. \end{align*} [[/math]]
Next, note that Lemma yields [math]\|\Phi^\top E\|_{\mathrm{op}}^2 \le n\tau^2[/math] so that

[[math]] \langle E, U\rangle^2\le n\tau^2 \big[\rank(\Thetarank)+\rank(\Theta^*)\big]\,. [[/math]]
Together with \eqref{EQ:prmatrixBIC1}, this completes the proof.

It follows from Theorem that the estimator by rank penalization enjoys the same properties as the singular value thresholding estimator even when [math]\X[/math] does not satisfy the [math]\textsf{ORT}[/math] condition. This is reminiscent of the BIC estimator, which enjoys the same properties as the hard thresholding estimator. However, this analogy does not extend to computational questions. Indeed, while the rank penalty, just like the sparsity penalty, is not convex, it turns out that [math]\X \Thetarank[/math] can be computed efficiently. Note first that

[[math]] \min_{\Theta \in \R^{d \times T}} \frac{1}{n}\|\Y -\X \Theta\|_F^2 + 2\tau^2\rank(\Theta)= \min_k\Big\{ \frac{1}{n}\min_{\substack{\Theta \in \R^{d \times T}\\ \rank(\Theta)\le k}} \|\Y -\X \Theta\|_F^2 + 2\tau^2k\Big\}\,. [[/math]]

Therefore, it remains to show that

[[math]] \min_{\substack{\Theta \in \R^{d \times T}\\ \rank(\Theta)\le k}} \|\Y -\X \Theta\|_F^2 [[/math]]

can be solved efficiently. To that end, let [math]\bar \Y=\X (\X^\top \X)^\dagger\X^\top\Y[/math] denote the orthogonal projection of [math]\Y[/math] onto the column space of [math]\X[/math]; this projection is a linear operator from [math]\R^{n \times T}[/math] into [math]\R^{n \times T}[/math]. By the Pythagorean theorem, we get for any [math]\Theta \in \R^{d\times T}[/math],

[[math]] \|\Y -\X\Theta\|_F^2 = \|\Y-\bar \Y\|_F^2 + \|\bar \Y -\X\Theta\|_F^2\,. [[/math]]

Next consider the SVD of [math]\bar \Y[/math]:

[[math]] \bar \Y=\sum_j \lambda_j u_j v_j^\top [[/math]]

where [math]\lambda_1 \ge \lambda_2 \ge \ldots[/math] and define [math]\tilde \Y[/math] by

[[math]] \tilde \Y=\sum_{j=1}^k \lambda_j u_j v_j^\top\,. [[/math]]

By Lemma, it holds

[[math]] \|\bar \Y -\tilde \Y\|_F^2 =\min_{Z:\rank(Z) \le k}\|\bar \Y - Z\|_F^2\,. [[/math]]

Therefore, any minimizer of [math]\X\Theta\mapsto \|\Y -\X \Theta\|_F^2[/math] over matrices of rank at most [math]k[/math] can be obtained by truncating the SVD of [math]\bar \Y[/math] at order [math]k[/math]. Once [math]\X\Thetarank[/math] has been found, one may obtain a corresponding [math]\Thetarank[/math] by least squares but this is not necessary for our results.
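These observations translate into a simple algorithm: project [math]\Y[/math] onto the column space of [math]\X[/math], truncate the SVD of the projection at every order [math]k[/math], and keep the truncation that minimizes the penalized criterion. The numpy sketch below is a minimal implementation along these lines (the function name and the choice of [math]\tau[/math] are the reader's; it returns the fitted matrix [math]\X\Thetarank[/math] rather than [math]\Thetarank[/math] itself).

<syntaxhighlight lang="python">
import numpy as np

def rank_penalized_fit(Y, X, tau):
    """Return the fitted matrix X @ Theta_hat of the rank-penalized estimator."""
    n = X.shape[0]
    # Orthogonal projection of Y onto the column space of X
    Y_bar = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y
    U, lam, Vt = np.linalg.svd(Y_bar, full_matrices=False)
    best_fit, best_crit = np.zeros_like(Y_bar), np.inf
    for k in range(len(lam) + 1):
        fit_k = (U[:, :k] * lam[:k]) @ Vt[:k, :] if k > 0 else np.zeros_like(Y_bar)
        crit = np.linalg.norm(Y - fit_k, 'fro')**2 / n + 2 * tau**2 * k
        if crit < best_crit:
            best_fit, best_crit = fit_k, crit
    return best_fit

# Hypothetical usage: Y is n x T, X is n x d, tau chosen as in the theory above
# fit = rank_penalized_fit(Y, X, tau=0.5)
</syntaxhighlight>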

While the rank penalized estimator can be computed efficiently, it is worth pointing out that a convex relaxation for the rank penalty can also be used. The estimator by nuclear norm penalization [math]\hat \Theta[/math] is defined to be any solution to the minimization problem

[[math]] \min_{\Theta \in \R^{d \times T}} \Big\{ \frac{1}{n}\|\Y-\X\Theta\|_F^2 + \tau \|\Theta\|_1\Big\} [[/math]]
Clearly, this criterion is convex and it can actually be implemented efficiently using semi-definite programming. It has been popularized by matrix completion problems. Let [math]\X[/math] have the following SVD:

[[math]] \X=\sum_{j=1}^r \lambda_j u_jv_j^\top\,, [[/math]]
with [math]\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r \gt 0[/math]. It can be shown that for some appropriate choice of [math]\tau[/math], it holds

[[math]] \frac{1}{n}\|\X\hat \Theta-\X\Theta^*\|_F^2 \lesssim \frac{\lambda_1}{\lambda_r}\frac{\sigma^2\rank(\Theta^*)}{n}d\vee T [[/math]]
with probability .99. However, the proof of this result is far more involved than a simple adaptation of the proof for the Lasso estimator to the matrix case (the readers are invited to see that for themselves). For one thing, there is no assumption on the design matrix (such as [math]\textsf{INC}[/math] for example). This result can be found in [4].
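For completeness, here is one standard way to compute a nuclear-norm penalized estimator: proximal gradient descent, whose proximal step soft-thresholds the singular values. This is only a sketch of a generic solver (the step size, iteration count and function names are ad hoc choices), not the method analyzed in [4].

<syntaxhighlight lang="python">
import numpy as np

def svd_soft_threshold(M, level):
    """Proximal operator of the nuclear norm: soft-threshold the singular values."""
    U, lam, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(lam - level, 0.0)) @ Vt

def nuclear_norm_estimator(Y, X, tau, n_iter=500):
    """Proximal gradient descent on (1/n)||Y - X Theta||_F^2 + tau ||Theta||_*."""
    n, d = X.shape
    Theta = np.zeros((d, Y.shape[1]))
    step = n / (2 * np.linalg.norm(X, 2) ** 2)      # 1/L for the smooth part
    for _ in range(n_iter):
        grad = (2 / n) * X.T @ (X @ Theta - Y)
        Theta = svd_soft_threshold(Theta - step * grad, step * tau)
    return Theta
</syntaxhighlight>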

Covariance matrix estimation

Empirical covariance matrix

Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] i.i.d. copies of a random vector [math]X \in \R^d[/math] such that [math]\E\big[X X^\top\big] =\Sigma[/math] for some unknown matrix [math]\Sigma \succ 0[/math] called the covariance matrix. This matrix contains information about the moments of order 2 of the random vector [math]X[/math]. A natural candidate to estimate [math]\Sigma[/math] is the empirical covariance matrix [math]\hat \Sigma[/math] defined by

[[math]] \hat \Sigma=\frac{1}{n} \sum_{i=1}^n X_i X_i^\top\,. [[/math]]

Using the tools of Chapter Sub-Gaussian Random Variables, we can prove the following result.

Theorem

Let [math]Y \in \R^d[/math] be a random vector such that [math]\E[Y]=0, \E[YY^\top]=I_d[/math] and [math]Y \sim \sg_d(1)[/math]. Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent copies of sub-Gaussian random vector [math]X =\Sigma^{1/2}Y[/math]. Then [math]\E[X]=0, \E[XX^\top]=\Sigma[/math] and [math]X \sim \sg_d(\|\Sigma\|_{\mathrm{op}})[/math]. Moreover,

[[math]] \|\hat \Sigma -\Sigma \|_{\mathrm{op}} \lesssim \|\Sigma\|_{\mathrm{op}}\Big( \sqrt{\frac{d+\log(1/\delta)}{n}} \vee \frac{d+\log(1/\delta)}{n}\Big)\,, [[/math]]
with probability [math]1-\delta[/math].


Show Proof

It follows from elementary computations that [math]\E[X]=0, \E[XX^\top]=\Sigma[/math] and [math]X \sim \sg_d(\|\Sigma\|_{\mathrm{op}})[/math]. To prove the deviation inequality, observe first that without loss of generality, we can assume that [math]\Sigma=I_d[/math]. Indeed,

[[math]] \begin{align*} \frac{\|\hat \Sigma -\Sigma \|_{\mathrm{op}}}{ \|\Sigma\|_{\mathrm{op}}}&=\frac{\|\frac1n\sum_{i=1}^nX_iX_i^\top -\Sigma \|_{\mathrm{op}} }{ \|\Sigma\|_{\mathrm{op}}}\\ &\le \frac{\|\Sigma^{1/2}\|_{\mathrm{op}}\|\frac1n\sum_{i=1}^nY_iY_i^\top -I_d \|_{\mathrm{op}}\|\Sigma^{1/2}\|_{\mathrm{op}} }{ \|\Sigma\|_{\mathrm{op}}}\\ &=\|\frac1n\sum_{i=1}^nY_iY_i^\top -I_d \|_{\mathrm{op}}\,. \end{align*} [[/math]]


Thus, in the rest of the proof, we assume that [math]\Sigma=I_d[/math]. Let [math]\cN[/math] be a [math]1/4[/math]-net for [math]\cS^{d-1}[/math] such that [math]|\cN|\le 12^d[/math]. It follows from the proof of Lemma that

[[math]] \|\hat \Sigma -I_d\|_{\textrm{op}} \le 2\max_{x,y \in \cN}x^\top (\hat \Sigma -I_d) y. [[/math]]
Hence, for any [math]t \ge 0[/math], by a union bound,

[[math]] \begin{equation} \label{EQ:pr:opnormcov1} \p\big(\|\hat \Sigma -I_d\|_{\textrm{op}} \gt t\big) \le \sum_{x,y \in \cN}\p\big(x^\top (\hat \Sigma -I_d) y \gt t/2\big)\,. \end{equation} [[/math]]
It holds,

[[math]] x^\top (\hat \Sigma -I_d) y=\frac{1}{n}\sum_{i=1}^n\big\{(X_i^\top x)(X_i^\top y)-\E\big[(X_i^\top x)(X_i^\top y)\big]\big\}\,. [[/math]]
Using polarization, we also have

[[math]] (X_i^\top x)(X_i^\top y)=\frac{Z_{+}^2-Z_{-}^2}{4}\,, [[/math]]
where [math]Z_{+}=X_i^\top (x+y)[/math] and [math]Z_{-}=X_i^\top (x-y)[/math]. It yields

[[math]] \begin{align*} \E&\big[\exp\big(s\big((X_i^\top x)(X_i^\top y)-\E\big[(X_i^\top x)(X_i^\top y)\big]\big)\big)\Big]\\ &= \E\big[\exp\big(\frac{s}{4}\big(Z_{+}^2-\E[Z_{+}^2]\big)-\frac{s}{4}\big(Z_{-}^2-\E[Z_{-}^2])\big)\big)\Big]\\ &\le \Big(\E\big[\exp\big(\frac{s}{2}\big(Z_{+}^2-\E[Z_{+}^2]\big)\big)\big]\E\big[\exp\big(-\frac{s}{2}\big(Z_{-}^2-\E[Z_{-}^2]\big)\big)\big]\Big)^{1/2}\,, \end{align*} [[/math]]
where in the last inequality, we used Cauchy-Schwarz. Next, since [math]X \sim \sg_d(1)[/math], we have [math]Z_{+},Z_{-}\sim \sg(2)[/math], and it follows from Lemma that

[[math]] Z_{+}^2-\E[Z_{+}^2]\sim \subE(32)\,, \qquad \text{and} \qquad Z_{-}^2-\E[Z_{-}^2]\sim \subE(32). [[/math]]
Therefore, for any [math]s \le 1/16[/math], we have for any [math]Z \in \{Z_+, Z_-\}[/math] that

[[math]] \E\big[\exp\big(\frac{s}{2}\big(Z^2-\E[Z^2]\big)\big)\big]\le e^{128s^2}\,. [[/math]]
It yields that

[[math]] \begin{align*} (X_i^\top x)(X_i^\top y)-\E\big[(X_i^\top x)(X_i^\top y)\big]\sim \subE(16)\,. \end{align*} [[/math]]
Applying now Bernstein's inequality (Theorem), we get

[[math]] \begin{equation} \label{eq:chi2-control} \p\big(x^\top (\hat \Sigma -I_d) y \gt t/2\big)\le \exp\Big[-\frac{n}{2}(\Big(\frac{t}{32}\Big)^2\wedge \frac{t}{32})\Big]\,. \end{equation} [[/math]]
Together with \eqref{EQ:pr:opnormcov1}, this yields

[[math]] \begin{equation} \label{EQ:pr:opnormcov2} \p\big(\|\hat \Sigma -I_d\|_{\textrm{op}} \gt t\big)\le 144^d\exp\Big[-\frac{n}{2}(\Big(\frac{t}{32}\Big)^2\wedge \frac{t}{32})\Big]\,. \end{equation} [[/math]]
In particular, the right hand side of the above inequality is at most [math]\delta \in (0,1)[/math] if

[[math]] \frac{t}{32} \ge \Big(\frac{2d}{n}\log(144)+\frac{2}{n}\log(1/\delta)\Big) \vee \Big(\frac{2d}{n}\log(144)+\frac{2}{n}\log(1/\delta)\Big)^{1/2}. [[/math]]
This concludes our proof.

Theorem indicates that for fixed [math]d[/math], the empirical covariance matrix is a consistent estimator of [math]\Sigma[/math] (in any norm as they are all equivalent in finite dimension). However, the bound that we got is not satisfactory in high dimensions when [math]d \gg n[/math]. To overcome this limitation, we can introduce sparsity as we have done in the case of regression. The most obvious way to do so is to assume that few of the entries of [math]\Sigma[/math] are nonzero, and it turns out that in this case thresholding is optimal. There is a long line of work on this subject (see for example [5] and [6]). Once we have a good estimator of [math]\Sigma[/math], what can we do with it? The key insight is that [math]\Sigma[/math] contains information about the projection of the vector [math]X[/math] onto any direction [math]u \in \cS^{d-1}[/math]. Indeed, we have that [math]\var(X^\top u)=u^\top \Sigma u[/math], which can be readily estimated by [math]\widehat \Var(X^\top u)=u^\top \hat \Sigma u[/math]. Observe that it follows from Theorem that

[[math]] \begin{align*} \big|\widehat \Var(X^\top u)- \Var(X^\top u)\big|&=\big|u^\top(\hat \Sigma-\Sigma)u\big|\\ &\le \|\hat \Sigma -\Sigma \|_{\mathrm{op}}\\ & \lesssim \|\Sigma\|_{\mathrm{op}}\Big( \sqrt{\frac{d+\log(1/\delta)}{n}} \vee \frac{d+\log(1/\delta)}{n}\Big) \end{align*} [[/math]]

with probability [math]1-\delta[/math]. The above fact is useful in the Markowitz theory of portfolio selection for example [7], where a portfolio of assets is a vector [math]u \in \R^d[/math] such that [math]|u|_1=1[/math] and the risk of a portfolio is given by the variance [math] \Var(X^\top u)[/math]. The goal is then to maximize reward subject to risk constraints. In most instances, the empirical covariance matrix is plugged into the formula in place of [math]\Sigma[/math]. (See Problem).
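As an illustration (a minimal sketch; the covariance matrix, direction and sample size are arbitrary choices), the empirical covariance matrix and the plug-in variance estimate along a direction can be computed as follows.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 10
Sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d)) / d   # an arbitrary PSD covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Empirical covariance matrix (the model here is centered, so no mean subtraction)
Sigma_hat = X.T @ X / n

# Plug-in estimate of Var(X^T u) along a chosen unit direction u
u = np.ones(d) / np.sqrt(d)
var_hat, var_true = u @ Sigma_hat @ u, u @ Sigma @ u
print(abs(var_hat - var_true), np.linalg.norm(Sigma_hat - Sigma, 2))
</syntaxhighlight>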

Principal component analysis

Spiked covariance model

Estimating the variance in all directions is also useful for dimension reduction. In Principal Component Analysis (PCA), the goal is to find one (or more) directions onto which the data [math]X_1, \ldots, X_n[/math] can be projected without losing much of its properties. There are several goals for doing this but perhaps the most prominent ones are data visualization (in a few dimensions, one can plot and visualize the cloud of [math]n[/math] points) and clustering (clustering is a hard computational problem and it is therefore preferable to carry it out in lower dimensions). An example of the output of a principal component analysis is given in Figure. In this figure, the data has been projected onto two orthogonal directions [math]\textsf{PC1}[/math] and [math]\textsf{PC2}[/math], that were estimated to have the most variance (among all such orthogonal pairs). The idea is that when projected onto such directions, points will remain far apart and a clustering pattern will still emerge. This is the case in Figure where the original data is given by [math]d=500,000[/math] gene expression levels measured on [math]n \simeq 1,387[/math] people. Depicted are the projections of these [math]1,387[/math] points in two dimensions. This image has become quite popular as it shows that gene expression levels can recover the structure induced by geographic clustering.

Projection onto two dimensions of [math]1,387[/math] points from gene expression data. Source: Gene expression blog.

How is it possible to "compress" half a million dimensions into only two? The answer is that the data is intrinsically low dimensional. In this case, a plausible assumption is that all the [math]1,387[/math] points live close to a two-dimensional linear subspace. To see how this assumption (in one dimension instead of two for simplicity) translates into the structure of the covariance matrix [math]\Sigma[/math], assume that [math]X_1, \ldots, X_n[/math] are Gaussian random vectors generated as follows. Fix a direction [math]v \in \cS^{d-1}[/math] and let [math]Y_1, \ldots, Y_n \sim \cN_d(0,I_d)[/math] so that [math]v^\top Y_i[/math] are i.i.d. [math]\cN(0,1)[/math]. In particular, the vectors [math](v^\top Y_1)v, \ldots, (v^\top Y_n)v[/math] live in the one-dimensional space spanned by [math]v[/math]. If one were to observe such data, the problem would be easy as only two observations would suffice to recover [math]v[/math]. Instead, we observe [math]X_1, \ldots, X_n \in \R^d[/math] where [math]X_i=(v^\top Y_i)v+Z_i[/math], and [math]Z_i \sim \cN_d(0, \sigma^2I_d)[/math] are i.i.d. and independent of the [math]Y_i[/math]s, that is, we add isotropic noise to every point. If [math]\sigma[/math] is small enough, we can hope to recover the direction [math]v[/math] (see Figure). The covariance matrix of [math]X_i[/math] generated as such is given by

[[math]] \Sigma=\E\big[XX^\top\big]=\E\big[((v^\top Y)v+Z)((v^\top Y)v+Z)^\top\big]=vv^\top + \sigma^2I_d\,. [[/math]]

This model is often called the spiked covariance model. By a simple rescaling, it is equivalent to the following definition.

Definition

A covariance matrix [math]\Sigma \in \R^{d\times d}[/math] is said to satisfy the spiked covariance model if it is of the form

[[math]] \Sigma=\theta vv^\top +I_d\,, [[/math]]
where [math]\theta \gt 0[/math] and [math]v \in \cS^{d-1}[/math]. The vector [math]v[/math] is called the spike.

This model can be extended to more than one spike but this extension is beyond the scope of these notes.

Points are close to a one-dimensional space spanned by [math]v[/math].

Clearly, under the spiked covariance model, [math]v[/math] is the eigenvector of the matrix [math]\Sigma[/math] that is associated to its largest eigenvalue [math]1+\theta[/math]. We will refer to this vector simply as the largest eigenvector. To estimate it, a natural candidate is the largest eigenvector [math]\hat v[/math] of [math]\tilde \Sigma[/math], where [math]\tilde \Sigma[/math] is any estimator of [math]\Sigma[/math]. There is a caveat: by symmetry, if [math]u[/math] is an eigenvector of a symmetric matrix, then [math]-u[/math] is also an eigenvector associated to the same eigenvalue. Therefore, we may only estimate [math]v[/math] up to a sign flip. To overcome this limitation, it is often useful to describe proximity between two vectors [math]u[/math] and [math]v[/math] in terms of the principal angle between their linear spans. Let us recall that for two unit vectors the principal angle between their linear spans is denoted by [math]\angle(u,v)[/math] and defined as

[[math]] \angle(u,v)=\arccos(|u^\top v|)\,. [[/math]]

The following result from perturbation theory is known as the Davis-Kahan [math]\sin(\theta)[/math] theorem as it bounds the sine of the principal angle between eigenspaces. This theorem exists in much more general versions that extend beyond one-dimensional eigenspaces.

Theorem (Davis-Kahan [math]\sin(\theta)[/math] theorem)

Let [math]A, B[/math] be two [math]d \times d[/math] PSD matrices. Let [math](\lambda_1, u_1), \ldots, (\lambda_d, u_d)[/math] (resp. [math](\mu_1, v_1), \ldots, (\mu_d, v_d)[/math]) denote the pairs of eigenvalues--eigenvectors of [math]A[/math] (resp. B) ordered such that [math]\lambda_1 \ge \lambda_2 \ge \dots[/math] (resp. [math]\mu_1 \ge \mu_2 \ge \dots[/math]). Then

[[math]] \sin\big(\angle(u_1, v_1)\big)\le \frac{2}{\max(\lambda_1-\lambda_2, \mu_1-\mu_2)} \|A-B\|_{\mathrm{op}}\,. [[/math]]
Moreover,

[[math]] \min_{\eps \in \{\pm1\}}|\eps u_1 -v_1|_2^2 \le 2 \sin^2\big(\angle(u_1, v_1)\big)\,. [[/math]]


Show Proof

Note that [math]u_1^\top A u_1=\lambda_1[/math] and for any [math]x \in \cS^{d-1}[/math], it holds

[[math]] \begin{align*} x^\top A x &=\sum_{j=1}^d\lambda_j(u_j^\top x)^2\le \lambda_1(u_1^\top x)^2 + \lambda_2(1-(u_1^\top x)^2)\\ &=\lambda_1\cos^2\big(\angle(u_1, x)\big) + \lambda_2 \sin^2\big(\angle(u_1, x)\big)\,. \end{align*} [[/math]]
Therefore, taking [math]x=v_1[/math], we get that on the one hand,

[[math]] \begin{align*} u_1^\top A u_1-v_1^\top A v_1&\ge \lambda_1 -\lambda_1\cos^2\big(\angle(u_1, x)\big) - \lambda_2 \sin^2\big(\angle(u_1, x)\big)\\ &=(\lambda_1-\lambda_2)\sin^2\big(\angle(u_1, x)\big)\,. \end{align*} [[/math]]
On the other hand,

[[math]] \begin{align*} u_1^\top A u_1-v_1^\top A v_1&=u_1^\top B u_1-v_1^\top A v_1+u_1^\top(A-B)u_1&\\ &\le v_1^\top B v_1 -v_1^\top A v_1+u_1^\top(A-B)u_1& \text{($v_1$ is leading eigenvector of $B$)}\\ &=\langle A-B, u_1 u_1^\top -v_1 v_1^\top\rangle&\\ &\le \|A-B\|_\mathrm{op}\|u_1 u_1^\top -v_1 v_1^\top\|_1 & \text{(H\"older)}\\ & \le \|A-B\|_\mathrm{op}\sqrt{2}\|u_1 u_1^\top -v_1 v_1^\top\|_2 & \text{(Cauchy-Schwarz)}\\ \end{align*} [[/math]]
where in the last inequality, we used the fact that [math]\rank(u_1 u_1^\top -v_1 v_1^\top)=2[/math]. It is straightforward to check that

[[math]] \|u_1 u_1^\top -v_1 v_1^\top\|_2^2=\|u_1 u_1^\top -v_1 v_1^\top\|_F^2=2-2(u_1^\top v_1)^2=2\sin^2\big(\angle(u_1, v_1)\big)\,. [[/math]]
We have proved that

[[math]] (\lambda_1-\lambda_2)\sin^2\big(\angle(u_1, x)\big) \le 2 \|A-B\|_\mathrm{op}\sin\big(\angle(u_1, x)\big)\,, [[/math]]
which concludes the first part of the theorem. Note that we can replace [math]\lambda_1-\lambda_2[/math] with [math]\mu_1 -\mu_2[/math] since the result is completely symmetric in [math]A[/math] and [math]B[/math]. It remains to show the second part of the theorem. To that end, observe that

[[math]] \begin{equation*} \min_{\eps \in \{\pm1\}}|\eps u_1 -v_1|_2^2=2-2|u_1^\top v_1|\le 2-2(u_1^\top v_1)^2=2\sin^2\big(\angle(u_1, v_1)\big)\,. \end{equation*} [[/math]]

Combined with Theorem, we immediately get the following corollary.

Corollary

Let [math]Y \in \R^d[/math] be a random vector such that [math]\E[Y]=0, \E[YY^\top]=I_d[/math] and [math]Y \sim \sg_d(1)[/math]. Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent copies of sub-Gaussian random vector [math]X =\Sigma^{1/2}Y[/math] so that [math]\E[X]=0, \E[XX^\top]=\Sigma[/math] and [math]X \sim \sg_d(\|\Sigma\|_{\mathrm{op}})[/math]. Assume further that [math]\Sigma=\theta vv^\top + I_d[/math] satisfies the spiked covariance model. Then, the largest eigenvector [math]\hat v[/math] of the empirical covariance matrix [math]\hat \Sigma[/math] satisfies,

[[math]] \min_{\eps \in \{\pm1\}}|\eps\hat v -v|_2\lesssim \frac{1+\theta}{\theta}\Big( \sqrt{\frac{d+\log(1/\delta)}{n}} \vee \frac{d+\log(1/\delta)}{n}\Big) [[/math]]
with probability [math]1-\delta[/math].

This result justifies the use of the empirical covariance matrix [math]\hat \Sigma[/math] as a replacement for the true covariance matrix [math]\Sigma[/math] when performing PCA in low dimensions, that is when [math]d \ll n[/math]. In the high-dimensional case, where [math]d \gg n[/math], the above result is uninformative. As before, we resort to sparsity to overcome this limitation.
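The corollary is easy to illustrate by simulation (a minimal sketch; the spike, the signal strength [math]\theta[/math] and the sample size are arbitrary choices): draw data from a spiked covariance model, take the leading eigenvector of the empirical covariance matrix, and measure the error up to a sign flip.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(6)
n, d, theta = 2000, 20, 1.0
v = np.zeros(d); v[0] = 1.0                           # the spike, |v|_2 = 1
Sigma = theta * np.outer(v, v) + np.eye(d)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)          # ascending eigenvalues
v_hat = eigvecs[:, -1]                                # leading eigenvector

# Error up to the unavoidable sign ambiguity, and the principal angle
err = min(np.linalg.norm(v_hat - v), np.linalg.norm(v_hat + v))
angle = np.arccos(min(abs(v_hat @ v), 1.0))
print(err, angle)
</syntaxhighlight>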

Sparse PCA

In the example of Figure, it may be desirable to interpret the meaning of the two directions denoted by [math]\textsf{PC1}[/math] and [math]\textsf{PC2}[/math]. We know that they are linear combinations of the original 500,000 gene expression levels. A natural question to ask is whether only a subset of these genes could suffice to obtain similar results. Such a discovery could have potentially interesting scientific applications as it would point to a few genes responsible for disparities between European populations. In the case of the spiked covariance model, this amounts to having a sparse [math]v[/math]. Beyond the interpretability just discussed, sparsity should also lead to statistical stability as in the case of sparse linear regression for example. To enforce sparsity, we will assume that [math]v[/math] in the spiked covariance model is [math]k[/math]-sparse: [math]|v|_0=k[/math]. Therefore, a natural candidate to estimate [math]v[/math] is given by [math]\hat v[/math] defined by

[[math]] \hat v^\top \hat \Sigma \hat v =\max_{\substack{u \in \cS^{d-1}\\|u|_0=k}}u^\top \hat \Sigma u\,. [[/math]]

It is easy to check that [math]\lambda^k_{\max}(\hat \Sigma)=\hat v^\top \hat \Sigma \hat v[/math] is the largest of all leading eigenvalues among all [math]k\times k[/math] sub-matrices of [math]\hat \Sigma[/math] so that the maximum is indeed attained, though there may be several maximizers. We call [math]\lambda^k_{\max}(\hat \Sigma)[/math] the [math]k[/math]-sparse leading eigenvalue of [math]\hat \Sigma[/math] and [math]\hat v[/math] a [math]k[/math]-sparse leading eigenvector.
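Because the maximization runs over all [math]\binom{d}{k}[/math] supports, it can only be carried out exactly for small [math]d[/math]; the brute-force numpy sketch below (a hypothetical helper, small sizes only) makes the definition concrete.

<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def k_sparse_leading_eigenvector(Sigma_hat, k):
    """Brute force over supports: leading eigenpair among all k x k principal submatrices."""
    d = Sigma_hat.shape[0]
    best_val, best_vec = -np.inf, None
    for S in combinations(range(d), k):
        vals, vecs = np.linalg.eigh(Sigma_hat[np.ix_(S, S)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_vec = np.zeros(d)
            best_vec[list(S)] = vecs[:, -1]      # embed the k-sparse eigenvector
    return best_val, best_vec
</syntaxhighlight>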

Theorem

Let [math]Y \in \R^d[/math] be a random vector such that [math]\E[Y]=0, \E[YY^\top]=I_d[/math] and [math]Y \sim \sg_d(1)[/math]. Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent copies of sub-Gaussian random vector [math]X =\Sigma^{1/2}Y[/math] so that [math]\E[X]=0, \E[XX^\top]=\Sigma[/math] and [math]X \sim \sg_d(\|\Sigma\|_{\mathrm{op}})[/math]. Assume further that [math]\Sigma=\theta vv^\top + I_d[/math] satisfies the spiked covariance model for [math]v[/math] such that [math]|v|_0=k \le d/2[/math]. Then, the [math]k[/math]-sparse largest eigenvector [math]\hat v[/math] of the empirical covariance matrix satisfies,

[[math]] \min_{\eps \in \{\pm1\}}|\eps\hat v -v|_2\lesssim \frac{1+\theta}{\theta}\Big( \sqrt{\frac{k\log(ed/k)+\log(1/\delta)}{n}} \vee \frac{k\log(ed/k)+\log(1/\delta)}{n}\Big)\,. [[/math]]
with probability [math]1-\delta[/math].


Show Proof

We begin with an intermediate result from the proof of the Davis-Kahan [math]\sin(\theta)[/math] theorem. Using the same steps as in the proof of Theorem, we get

[[math]] v^\top \Sigma v -\hat v^\top \Sigma \hat v\le \langle \hat \Sigma - \Sigma, \hat v \hat v^\top -vv^\top\rangle [[/math]]
Since both [math]\hat v[/math] and [math]v[/math] are [math]k[/math]-sparse, there exists a (random) set [math]S \subset \{1, \ldots, d\}[/math] such that [math]|S|\le 2k[/math] and [math]\{\hat v \hat v^\top -vv^\top\}_{ij} =0[/math] if [math](i,j) \notin S^2[/math]. It yields

[[math]] \langle \hat \Sigma - \Sigma, \hat v \hat v^\top -vv^\top\rangle=\langle \hat \Sigma(S) - \Sigma(S), \hat v(S) \hat v(S)^\top -v(S)v(S)^\top\rangle [[/math]]
where for any [math]d\times d[/math] matrix [math]M[/math], we defined the matrix [math]M(S)[/math] to be the [math]|S|\times |S|[/math] sub-matrix of [math]M[/math] with rows and columns indexed by [math]S[/math] and for any vector [math]x \in \R^d[/math], [math]x(S)\in \R^{|S|}[/math] denotes the sub-vector of [math]x[/math] with coordinates indexed by [math]S[/math]. It yields by Hölder's inequality that

[[math]] v^\top \Sigma v -\hat v^\top \Sigma \hat v\le \|\hat \Sigma(S) - \Sigma(S)\|_{\mathrm{op}}\| \hat v(S) \hat v(S)^\top -v(S)v(S)^\top\|_1\,. [[/math]]
Following the same steps as in the proof of Theorem, we get now that

[[math]] \sin\big(\angle(\hat v, v)\big)\le \frac{2}{\theta}\sup_{S\,:\,|S|=2k} \|\hat \Sigma(S) - \Sigma(S)\|_{\mathrm{op}}\,. [[/math]]
To conclude the proof, it remains to control [math]\sup_{S\,:\,|S|=2k} \|\hat \Sigma(S) - \Sigma(S)\|_{\mathrm{op}}[/math]. To that end, observe that

[[math]] \begin{align*} \p\Big[\sup_{S\,:\,|S|=2k}& \|\hat \Sigma(S) - \Sigma(S)\|_{\mathrm{op}} \gt t\|\Sigma\|_{\mathrm{op}}\Big]\\ &\le \sum_{{S\,:\,|S|=2k} }\p\Big[ \|\hat \Sigma(S) - \Sigma(S)\|_{\mathrm{op}} \gt t\|\Sigma(S)\|_{\mathrm{op}}\Big]\\ &\le \binom{d}{2k} 144^{2k}\exp\Big[-\frac{n}{2}(\Big(\frac{t}{32}\Big)^2\wedge \frac{t}{32})\Big]\,. \end{align*} [[/math]]
where we used \eqref{EQ:pr:opnormcov2} in the second inequality. Using Lemma, we get that the right-hand side above is further bounded by

[[math]] \exp\Big[-\frac{n}{2}(\Big(\frac{t}{32}\Big)^2\wedge \frac{t}{32})+2k\log(144)+2k\log\big(\frac{ed}{2k}\big)\Big]\, [[/math]]
Choosing now [math]t[/math] such that

[[math]] t\ge C\sqrt{\frac{k\log(ed/k)+\log(1/\delta)}{n}} \vee \frac{k\log(ed/k)+\log(1/\delta)}{n}\,, [[/math]]
for large enough [math]C[/math] ensures that the desired bound holds with probability at least [math]1-\delta[/math].

Minimax lower bound

The last section has established that the spike [math]v[/math] in the spiked covariance model [math]\Sigma =\theta v v^\top + I_d[/math] may be estimated at the rate [math]\theta^{-1} \sqrt{k\log (d/k)/n}[/math], assuming that [math]\theta \le 1[/math] and [math]k\log (d/k)\le n[/math], which corresponds to the interesting regime. Using the tools from Chapter Minimax Lower Bounds, we can show that this rate is in fact optimal already in the Gaussian case.

Theorem

Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent copies of the Gaussian random vector [math]X \sim\cN_d(0, \Sigma)[/math] where [math]\Sigma=\theta vv^\top + I_d[/math] satisfies the spiked covariance model for [math]v[/math] such that [math]|v|_0=k[/math], and denote by [math]\p_v[/math] the distribution of [math]X[/math] under this assumption and by [math]\E_v[/math] the associated expectation. Then, there exist constants [math]C,c \gt 0[/math] such that for any [math]n\ge 2[/math], [math]d \ge 9[/math], [math]2 \le k \le (d+7)/8[/math], it holds

[[math]] \inf_{\hat v} \sup_{\substack{v \in \cS^{d-1}\\ |v|_0 =k}} \p_v^n\Big[ \min_{\eps \in \{\pm 1\}} |\eps \hat v -v|_2\ge C \frac{1}{\theta} \sqrt{\frac{k}{n} \log (ed/k)}\Big] \ge c\,. [[/math]]


Show Proof

We use the general technique developed in Chapter Minimax Lower Bounds. Using this technique, we need to provide a set [math]v_1, \ldots, v_M \in \cS^{d-1}[/math] of [math]k[/math]-sparse vectors such that

  • (i) [math]\DS \min_{\eps \in \{-1,1\}} |\eps v_j -v_k|_2 \gt 2\psi[/math] for all [math]1\le j \lt k \le M[/math]\,,
  • (ii) [math]\DS \KL(\p_{v_j}, \p_{v_k}) \le c\log(M)/n[/math]\,,

where [math]\psi=C\theta^{-1} \sqrt{k\log (d/k)/n}[/math]. Akin to the Gaussian sequence model, our goal is to employ the sparse Varshamov-Gilbert lemma to construct the [math]v_j[/math]'s. However, the [math]v_j[/math]'s have the extra constraint that they need to be of unit norm so we cannot simply rescale the sparse binary vectors output by the sparse Varshamov-Gilbert lemma. Instead, we use the last coordinate to make sure that they have unit norm. Recall that it follows from the sparse Varshamov-Gilbert LemmaMinimax Lower Bounds that there exist [math]\omega_1, \ldots, \omega_M \in \{0, 1\}^{d-1}[/math] such that [math]|\omega_j|_0=k-1[/math], [math]\rho(\omega_i, \omega_j) \ge (k-1)/2[/math] for all [math]i \neq j[/math] and [math]\log(M) \ge \frac{k-1}{8}\log(1+ \frac{d-1}{2k-2})[/math]. Define [math]v_j=(\gamma\omega_j, \ell)[/math] where [math]\gamma \lt 1/\sqrt{k-1}[/math] is to be defined later and [math]\ell =\sqrt{1-\gamma^2(k-1)}[/math]. It is straightforward to check that [math]|v_j|_0=k[/math], [math]|v_j|_2=1[/math] and [math]v_j[/math] has only nonnegative coordinates. Moreover, for any [math]i \neq j[/math],

[[math]] \begin{equation} \label{EQ:prlbspca1} \min_{\eps \in \{\pm 1\}} |\eps v_i -v_j|^2_2= |v_i -v_j|^2_2=\gamma^2\rho(\omega_i, \omega_j)\ge \gamma^2 \frac{k-1}2\,. \end{equation} [[/math]]
We choose

[[math]] \gamma^2=\frac{C_\gamma}{\theta^2 n} \log\big(\frac{d}{k}\big)\,, [[/math]]
so that [math]\gamma^2(k-1)\ge 8\psi^2[/math] for [math]C[/math] small enough, and (i) is verified. It remains to check (ii). To that end, note that since [math]\det(I_d+\theta uu^\top)=\det(I_d+\theta vv^\top)=1+\theta[/math] for all [math]u,v \in \cS^{d-1}[/math], the log-determinant terms in the KL divergence cancel and

[[math]] \begin{align*} \KL(\p_u, \p_v)&=\frac{1}{2}\E_u\Big(X^\top \big[(I_d+\theta vv^\top)^{-1}-(I_d+\theta uu^\top)^{-1} \big]X\Big)\\ &=\frac{1}{2}\E_u\tr\Big(\big[(I_d+\theta vv^\top)^{-1}-(I_d+\theta uu^\top)^{-1} \big]XX^\top\Big)\\ &=\frac{1}{2}\tr\Big(\big[(I_d+\theta vv^\top)^{-1}-(I_d+\theta uu^\top)^{-1} \big](I_d+\theta uu^\top)\Big) \end{align*} [[/math]]

Next, we use the Sherman-Morrison formula[Notes 1]:

[[math]] (I_d+\theta u u^\top)^{-1}=I_d-\frac{\theta}{1+\theta} u u^\top\,, \qquad \forall\, u \in \cS^{d-1}\,. [[/math]]
It yields

[[math]] \begin{align*} \KL(\p_u, \p_v)&=\frac{\theta}{2(1+\theta)}\tr\Big((uu^\top-vv^\top)(I_d+\theta uu^\top)\Big)\\ &=\frac{\theta^2}{2(1+\theta)}(1-(u^\top v)^2)\\ &=\frac{\theta^2}{2(1+\theta)}\sin^2\big(\angle(u,v)\big) \end{align*} [[/math]]


In particular,

[[math]] \KL(\p_{v_i}, \p_{v_j})=\frac{\theta^2}{2(1+\theta)}\sin^2\big(\angle(v_i,v_j)\big)\,. [[/math]]
Next, note that

[[math]] \begin{align*} \sin^2\big(\angle(v_i,v_j)\big)&=1-(v_i^\top v_j)^2\le 2\big(1-\gamma^2 \omega_i^\top \omega_j -\ell^2\big)\le 2\gamma^2(k-1)\le 2C_\gamma\frac{k}{\theta^2 n} \log\big(\frac{d}{k}\big)\le \frac{\log M}{\theta^2 n}\,, \end{align*} [[/math]]
for [math]C_\gamma[/math] small enough. We conclude that (ii) holds, which completes our proof.

Together with the upper bound of Theorem, we get the following corollary.

Corollary

Assume that [math]\theta \le 1[/math] and [math]k\log (d/k)\le n[/math]. Then [math]\theta^{-1} \sqrt{k\log (d/k)/n}[/math] is the minimax rate of estimation over [math]\cB_0(k)[/math] in the spiked covariance model.
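To get a sense of scale (the numbers here are ours, purely for illustration and not part of the original text): with [math]d=1000[/math], [math]k=10[/math], [math]n=10^4[/math] and [math]\theta=1/2[/math], the assumptions of the corollary hold since [math]k\log(d/k)\approx 46\le n[/math], and the minimax rate evaluates to

[[math]] \frac{1}{\theta}\sqrt{\frac{k\log(d/k)}{n}}=2\sqrt{\frac{10\log(100)}{10^4}}\approx 0.14\,. [[/math]]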

Graphical models

Gaussian graphical models

In Section Covariance matrix estimation, we noted that thresholding is an appropriate way to gain an advantage from sparsity in the covariance matrix when trying to estimate it. However, in some cases, it is more natural to assume sparsity in the inverse of the covariance matrix, [math] \Theta = \Sigma^{-1} [/math], which is called the concentration or precision matrix. In this chapter, we will develop an approach that allows us to get guarantees similar to the lasso for the error between the estimate and the ground truth matrix in Frobenius norm. We can understand why sparsity in [math] \Theta [/math] can be appropriate in the context of undirected graphical models [8]. These are models that encode the dependence structure of a high-dimensional distribution by means of a graph.

Definition

Let [math] G = (V, E) [/math] be a graph on [math] d [/math] nodes and, to each node [math] v [/math], associate a random variable [math] X_v [/math]. An undirected graphical model or Markov random field is a collection of probability distributions over [math] (X_v)_{v \in V} [/math] that factorize according to

[[math]] \begin{equation} \label{eq:factor-graph} p(x_1, \dots, x_d) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \end{equation} [[/math]]
where [math] \mathcal{C} [/math] is the collection of all cliques (completely connected subsets) in [math] G [/math], [math] x_C [/math] is the restriction of [math] (x_1, \dots, x_d) [/math] to the subset [math] C [/math], and [math] \psi_C [/math] are non-negative potential functions.
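As a concrete illustration of the factorization \eqref{eq:factor-graph}, here is a minimal numerical sketch in Python (the graph, the potentials and all numerical values are our own choices, not part of the text). It builds a Markov random field on the path graph [math]1-2-3[/math] with binary variables, computes the normalization constant [math]Z[/math] by brute force, and checks the conditional independence [math]X_1 \indep X_3 \mid X_2[/math] that the global Markov property established below predicts.

```python
import itertools
import numpy as np

# Tiny Markov random field on the path graph 1 - 2 - 3 with {-1, +1} variables.
# The cliques are the edges {1,2} and {2,3}; psi_C are positive potential functions.
psi_12 = {(a, b): np.exp(0.8 * a * b) for a in (-1, 1) for b in (-1, 1)}
psi_23 = {(b, c): np.exp(-0.5 * b * c) for b in (-1, 1) for c in (-1, 1)}

# Joint distribution from the factorization, with normalization constant Z.
states = list(itertools.product((-1, 1), repeat=3))
unnorm = {s: psi_12[(s[0], s[1])] * psi_23[(s[1], s[2])] for s in states}
Z = sum(unnorm.values())
p = {s: u / Z for s, u in unnorm.items()}

# Check numerically that X1 and X3 are independent conditionally on X2.
for x2 in (-1, 1):
    p2 = sum(p[s] for s in states if s[1] == x2)
    for x1, x3 in itertools.product((-1, 1), repeat=2):
        joint = p[(x1, x2, x3)] / p2
        marg1 = sum(p[s] for s in states if s[0] == x1 and s[1] == x2) / p2
        marg3 = sum(p[s] for s in states if s[2] == x3 and s[1] == x2) / p2
        assert abs(joint - marg1 * marg3) < 1e-12
print("X1 is conditionally independent of X3 given X2, Z =", round(Z, 4))
```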

This definition implies certain conditional independence relations, such as the following.

Proposition

A graphical model as in \eqref{eq:factor-graph} fulfills the global Markov property. That is, for any triple [math] (A, B, S) [/math] of disjoint subsets such that [math] S [/math] separates [math] A [/math] and [math] B [/math],

[[math]] \begin{equation*} A \perp\!\!\!\perp B \mid S. \end{equation*} [[/math]]


Show Proof

Let [math] (A, B, S) [/math] be a triple as in the proposition statement and define [math] \tilde{A} [/math] to be the union of the connected components of the induced subgraph [math] G[V \setminus S] [/math] that intersect [math] A [/math], and set [math] \tilde{B} = V \setminus (\tilde{A} \cup S) [/math]. By assumption, [math] A [/math] and [math] B [/math] lie in different connected components of [math] G[V \setminus S] [/math], so [math] B \subseteq \tilde{B} [/math] and there is no edge between [math] \tilde{A} [/math] and [math] \tilde{B} [/math]; hence any clique in [math] G [/math] has to be a subset of [math] \tilde{A} \cup S [/math] or of [math] \tilde{B} \cup S [/math]. Denoting the former cliques by [math] \mathcal{C}_A [/math], we get

[[math]] \begin{equation*} f(x) = \prod_{C \in \mathcal{C}} \psi_C(x_C) = \prod_{C \in \mathcal{C}_A} \psi_C(x_C) \prod_{C \in \mathcal{C} \setminus \mathcal{C}_A} \psi_C(x_C) = h(x_{\tilde{A} \cup S}) k(x_{\tilde{B} \cup S}), \end{equation*} [[/math]]
which implies the desired conditional independence.

A weaker form of the Markov property is the so-called pairwise Markov property, which says that the above holds only for singleton sets of the form [math] A = \{a\} [/math], [math] B = \{ b \} [/math] and [math] S = V \setminus \{a, b\} [/math] if [math] (a, b) \notin E [/math]. In the following, we will focus on Gaussian graphical models for [math] n [/math] i.i.d samples [math] X_i \sim \cN(0, (\Theta^\ast)^{-1}) [/math]. By contrast, there is a large body of work on discrete graphical models such as the Ising model (see for example [9]), which we are not going to discuss here, although links to the following can be established as well [10]. For a Gaussian model with mean zero, by considering the density in terms of the concentration matrix [math] \Theta [/math],

[[math]] \begin{equation*} f_\Theta(x) \propto \exp(-x^\top \Theta x/2), \end{equation*} [[/math]]

it is immediate that the pairs [math] a, b [/math] for which the pairwise Markov property holds are exactly those for which [math] \Theta_{a, b} = 0 [/math]. Conversely, it can be shown that for a family of distributions with positive density, this actually implies the factorization \eqref{eq:factor-graph} according to the graph given by the edges [math] \{ (i, j) : \Theta_{i, j} \neq 0 \} [/math]. This is the content of the Hammersley-Clifford theorem.
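Before turning to the Hammersley-Clifford theorem, here is a small numerical sketch of the correspondence just described (the AR(1)-type covariance and all constants are our own illustrative choices, using only numpy). For a Gaussian vector with covariance [math]\Sigma_{ij}=\rho^{|i-j|}[/math], the precision matrix [math]\Theta=\Sigma^{-1}[/math] is tridiagonal, so the associated graph is the path [math]1 - 2 - \cdots - d[/math].

```python
import numpy as np

# Gaussian Markov chain: Sigma_ij = rho^|i-j| has a tridiagonal precision matrix,
# i.e. the graph {(i, j) : Theta_ij != 0} is the path 1 - 2 - ... - d.
d, rho = 6, 0.6
idx = np.arange(d)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
Theta = np.linalg.inv(Sigma)

with np.printoptions(precision=3, suppress=True):
    print(Theta)

# Entries Theta[i, j] with |i - j| >= 2 vanish (up to rounding): X_i and X_j are
# conditionally independent given the remaining coordinates (pairwise Markov property).
off_band = np.abs(Theta)[np.abs(np.subtract.outer(idx, idx)) >= 2]
assert np.all(off_band < 1e-10)
```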

Theorem (Hammersley-Clifford)

Let [math] P [/math] be a probability distribution with positive density [math] f [/math] with respect to a product measure over [math] V [/math]. Then [math] P [/math] factorizes over [math] G [/math] as in \eqref{eq:factor-graph} if and only if it fulfills the pairwise Markov property.

Show Proof

We only show the 'if' direction; the converse follows from the Proposition above, since the global Markov property implies the pairwise one. The positivity of [math] f [/math] allows us to take logarithms. Pick a fixed element [math] x^\ast [/math] and define

[[math]] \begin{equation*} H_A(x) = \log f(x_A, x^\ast_{A^c}) \end{equation*} [[/math]]
and

[[math]] \begin{equation*} \Phi_A(x) = \sum_{B \subseteq A} (-1)^{|A \setminus B|} H_B(x). \end{equation*} [[/math]]
Note that both [math] H_A [/math] and [math] \Phi_A [/math] depend on [math] x [/math] only through [math] x_A [/math]. By the Moebius inversion lemma below, we get

[[math]] \begin{equation*} \log f(x) = H_V(x) = \sum_{A \subseteq V} \Phi_A(x). \end{equation*} [[/math]]


It remains to show that [math] \Phi_A = 0 [/math] if [math] A [/math] is not a clique. Assume [math] A \subseteq V [/math] with [math] v, w \in A [/math], [math] (v, w) \notin E [/math], and set [math] C = A \setminus \{v, w\} [/math]. By adding all possible subsets of [math] \{v, w\} [/math] to every subset of [math] C [/math], we get every subset of [math] A [/math], and hence

[[math]] \begin{equation*} \Phi_A(x) = \sum_{B \subseteq C} (-1)^{|C \setminus B|} ( H_B - H_{B \cup \{v\}} - H_{B \cup \{w\}} + H_{B \cup \{v, w\}}). \end{equation*} [[/math]]
Writing [math] D = V \setminus \{v, w\} [/math], by the pairwise Markov property, we have

[[math]] \begin{align*} H_{B \cup \{v, w\}}(x) - H_{B \cup \{v\}}(x) = {} & \log \frac{f(x_v, x_w, x_B, x^\ast_{D \setminus B})}{f(x_v, x_w^\ast, x_B, x^\ast_{D \setminus B})}\\ = {} & \log \frac{f(x_v | x_{B}, x^\ast_{D \setminus B}) f(x_w, x_B, x^\ast_{D \setminus B})}{f(x_v | x_{B}, x^\ast_{D \setminus B}) f(x^\ast_w, x_B, x^\ast_{D \setminus B})}\\ = {} & \log \frac{f(x^\ast_v | x_{B}, x^\ast_{D \setminus B}) f(x_w, x_B, x^\ast_{D \setminus B})}{f(x^\ast_v | x_{B}, x^\ast_{D \setminus B}) f(x^\ast_w, x_B, x^\ast_{D \setminus B})}\\ = {} & \log \frac{f(x^\ast_v, x_w, x_B, x^\ast_{D \setminus B})}{f(x^\ast_v, x_w^\ast, x_B, x^\ast_{D \setminus B})}\\ = {} & H_{B \cup \{ w \}}(x) - H_B(x). \end{align*} [[/math]]
Thus [math] \Phi_A [/math] vanishes if [math] A [/math] is not a clique.

The key idea of the proof was to rewrite the density in terms of the fixed element [math] x^\ast [/math] in a way that allows us to make use of the pairwise Markov property. The decomposition of [math] \log f [/math] used above rests on the following combinatorial lemma.

Lemma (Moebius Inversion)

Let [math] \Psi, \Phi : \cP(V) \to \R [/math] be functions defined for all subsets of a set [math] V [/math]. Then, the following are equivalent:

  1. For all [math] A \subseteq V [/math]: [math] \Psi(A) = \sum_{B \subseteq A} \Phi(B) [/math];
  2. For all [math] A \subseteq V [/math]: [math] \Phi(A) = \sum_{B \subseteq A} (-1)^{|A \setminus B|}\Psi(B) [/math].


Show Proof

[math] 2. \implies 1. [/math] (the proof of the converse direction is analogous):

[[math]] \begin{align*} \sum_{B \subseteq A} \Phi(B) = {} & \sum_{B \subseteq A} \sum_{C \subseteq B} (-1)^{|B \setminus C|} \Psi(C)\\ = {} & \sum_{C \subseteq A} \Psi(C) \left( \sum_{C \subseteq B \subseteq A} (-1)^{|B \setminus C|}\right) \\ = {} & \sum_{C \subseteq A} \Psi(C) \left( \sum_{H \subseteq A \setminus C} (-1)^{|H|} \right). \end{align*} [[/math]]
Now, [math] \sum_{H \subseteq A \setminus C} (-1)^{|H|} = 0 [/math] unless [math] A \setminus C = \emptyset [/math] because any non-empty set has the same number of subsets with odd and even cardinality, which can be seen by induction. Hence only the term [math] C = A [/math] survives and the sum equals [math] \Psi(A) [/math], as claimed.
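The lemma is also easy to check numerically; here is a brute-force verification of ours on the subsets of a small set, with randomly chosen values of [math]\Psi[/math].

```python
import itertools
import random

# Brute-force check of the Moebius inversion lemma on the subsets of a small set V.
V = (0, 1, 2, 3)
subsets = [frozenset(c) for r in range(len(V) + 1)
           for c in itertools.combinations(V, r)]

random.seed(1)
Psi = {A: random.random() for A in subsets}

# Statement 2: Phi(A) = sum_{B subset of A} (-1)^{|A \ B|} Psi(B).
Phi = {A: sum((-1) ** len(A - B) * Psi[B] for B in subsets if B <= A)
       for A in subsets}

# Statement 1 is recovered: Psi(A) = sum_{B subset of A} Phi(B).
for A in subsets:
    assert abs(Psi[A] - sum(Phi[B] for B in subsets if B <= A)) < 1e-9
print("Moebius inversion verified on all", len(subsets), "subsets of V")
```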

In the following, we will show how to exploit this kind of sparsity when trying to estimate [math] \Theta [/math]. The key insight is that we can set up an optimization problem that has a similar error sensitivity as the lasso, but with respect to [math] \Theta [/math]. In fact, it is the penalized version of the maximum likelihood estimator for the Gaussian distribution. We set

[[math]] \begin{equation} \label{eq:af} \hat{\Theta}_\lambda = \argmin_{\Theta \succ 0} \tr(\hat{\Sigma} \Theta) - \log \det (\Theta) + \lambda \| \Theta_{D^c} \|_1, \end{equation} [[/math]]

where we wrote [math] \Theta_{D^c} [/math] for the off-diagonal restriction of [math] \Theta [/math], [math] \| A \|_1 = \sum_{i, j} | A_{i, j} | [/math] for the element-wise [math] \ell_1 [/math] norm, and [math] \hat{\Sigma} = \frac{1}{n} \sum_{i} X_i X_i^\top [/math] for the empirical covariance matrix of [math] n [/math] i.i.d observations. We will also make use of the notation

[[math]] \begin{equation*} \| A \|_\infty = \max_{i, j \in [d]} | A_{i, j} | \end{equation*} [[/math]]

for the element-wise [math] \infty [/math]-norm of a matrix [math] A \in \R^{d \times d} [/math] and

[[math]] \begin{equation*} \| A \|_{\infty, \infty} = \max_{i} \sum_{j} | A_{i, j} | \end{equation*} [[/math]]

for the [math] \infty [/math] operator norm. We note that the function [math] \Theta \mapsto - \log \operatorname{det}(\Theta) [/math] has first derivative [math] - \Theta^{-1} [/math] and second derivative [math] \Theta^{-1} \otimes \Theta^{-1} [/math], which means it is convex. Derivations for this can be found in [11](Appendix A.4). This means that in the population case and for [math] \lambda = 0 [/math], the minimizer in \eqref{eq:af} would coincide with [math] \Theta^\ast [/math]; a minimal numerical sketch of the estimator is given below.
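The sketch below is our own illustration, not part of the original text. It relies on the external scikit-learn library, whose GraphicalLasso fits a penalized Gaussian likelihood of the form \eqref{eq:af} by coordinate descent (its penalty normalization may differ from \eqref{eq:af} by constants); the sparse precision matrix, the sample size and the choice of [math] \lambda \asymp \sqrt{\log(d)/n} [/math] are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 20, 2000

# A sparse, well-conditioned ground-truth precision matrix (tridiagonal here).
Theta_star = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
Sigma_star = np.linalg.inv(Theta_star)

# n i.i.d. observations X_i ~ N(0, Sigma*).
X = rng.multivariate_normal(np.zeros(d), Sigma_star, size=n)

# Penalized maximum likelihood with lambda of order sqrt(log(d)/n).
lam = 0.5 * np.sqrt(np.log(d) / n)
Theta_hat = GraphicalLasso(alpha=lam, max_iter=200).fit(X).precision_

print("Frobenius error:", np.linalg.norm(Theta_hat - Theta_star, "fro"))
print("off-diagonal entries estimated as nonzero:",
      int((np.abs(Theta_hat - np.diag(np.diag(Theta_hat))) > 1e-3).sum()))
```

First, let us derive a bound on the error in [math] \hat{\Sigma} [/math] that is similar to Theorem, with the difference that we are dealing with sub-Exponential distributions.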

Lemma

If [math] X_i \sim \cN(0, \Sigma) [/math] are i.i.d and [math] \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top [/math] denotes the empirical covariance matrix, then

[[math]] \begin{equation*} \| \hat{\Sigma} - \Sigma \|_{\infty} \lesssim \| \Sigma \|_{\mathrm{op}}\sqrt{\frac{\log(d/\delta)}{n}}, \end{equation*} [[/math]]
with probability at least [math] 1 - \delta [/math], if [math] n \gtrsim \log(d/\delta) [/math].


Show Proof

Start as in the proof of Theorem by writing [math] Y = \Sigma^{-1/2} X \sim \cN(0, I_d) [/math] and notice that

[[math]] \begin{align*} | e_i^\top (\hat{\Sigma} - \Sigma) e_j | = {} & \Big| ( \Sigma^{1/2} e_i)^\top \Big( \frac{1}{n} \sum_{k=1}^{n} Y_k Y_k^\top - I_d \Big) (\Sigma^{1/2} e_j) \Big|\\ \leq {} & \| \Sigma^{1/2} \|_{\mathrm{op}}^2 \Big| v^\top \Big(\frac{1}{n} \sum_{k=1}^{n} Y_k Y_k^\top - I_d \Big) w \Big| \end{align*} [[/math]]
with [math] v, w \in \cS^{d-1} [/math]. Hence, without loss of generality, we can assume [math] \Sigma = I_d [/math]. The remaining term is sub-Exponential and can be controlled as in \eqref{eq:chi2-control}

[[math]] \begin{equation*} \p\big(e_i^\top (\hat \Sigma -I_d) e_j \gt t\big)\le \exp\Big[-\frac{n}{2}(\Big(\frac{t}{16}\Big)^2\wedge \frac{t}{16})\Big]\,. \end{equation*} [[/math]]
By a union bound, we get

[[math]] \begin{equation*} \p\big( \| \hat \Sigma -I_d \|_\infty \gt t\big)\le d^2 \exp\Big[-\frac{n}{2}(\Big(\frac{t}{16}\Big)^2\wedge \frac{t}{16})\Big]\,. \end{equation*} [[/math]]
Hence, if [math] n \gtrsim \log(d/\delta) [/math], then for any [math] \delta \in (0, 1) [/math] the right-hand side above is at most [math] \delta [/math] if

[[math]] \begin{equation*} \frac{t}{16} \geq \left( \frac{2 \log(d/\delta)}{n}\right)^{1/2}. \end{equation*} [[/math]]

Theorem

Let [math] X_i \sim \cN(0, \Sigma) [/math] be i.i.d and [math] \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top [/math] denote the empirical covariance matrix. Assume that [math] | S | = k [/math], where [math] S = \{(i,j) : i \neq j, \, \Theta^\ast_{i, j} \neq 0 \} [/math]. There exists a constant [math] c = c(\| \Sigma \|_{\mathrm{op}}) [/math] such that if

[[math]] \begin{equation*} \lambda = c \sqrt{\frac{\log(d/\delta)}{n}}, \end{equation*} [[/math]]
and [math] n \gtrsim \log(d/\delta) [/math], then the minimizer in \eqref{eq:af} fulfills

[[math]] \begin{equation*} \| \hat{\Theta} - \Theta^\ast \|_F^2 \lesssim \frac{(d+k) \log(d/\delta)}{n} \end{equation*} [[/math]]
with probability at least [math] 1 - \delta [/math].


Show Proof

We start by considering the likelihood

[[math]] \begin{equation*} l(\Theta, \Sigma) = \tr(\Sigma \Theta) - \log \det (\Theta). \end{equation*} [[/math]]
A Taylor expansion combined with the mean value theorem yields

[[math]] \begin{align*} l(\Theta, \Sigma_n) - l(\Theta^\ast, \Sigma_n) = {} & \tr(\Sigma_n (\Theta - \Theta^\ast)) - \tr(\Sigma^\ast (\Theta - \Theta^\ast))\\ {} &+ \frac{1}{2} \tr(\tilde{\Theta}^{-1} (\Theta - \Theta^\ast) \tilde{\Theta}^{-1} (\Theta - \Theta^\ast)), \end{align*} [[/math]]
for some [math] \tilde{\Theta} = \Theta^\ast + t (\Theta - \Theta^\ast) [/math], [math] t \in [0, 1] [/math]. Note that, since [math] \tilde{\Theta} \succ 0 [/math], we have

[[math]] \begin{equation*} \tr(\tilde{\Theta}^{-1} (\Theta - \Theta^\ast) \tilde{\Theta}^{-1} (\Theta - \Theta^\ast)) = \| \tilde{\Theta}^{-1/2} (\Theta - \Theta^\ast) \tilde{\Theta}^{-1/2} \|_F^2 \geq \lambda_{\mathrm{min}}(\tilde{\Theta}^{-1})^2 \| \Theta - \Theta^\ast \|_F^2 \end{equation*} [[/math]]
and

[[math]] \begin{equation*} \lambda_{\mathrm{min}}(\tilde{\Theta}^{-1}) = (\lambda_{\mathrm{max}}(\tilde{\Theta}))^{-1}. \end{equation*} [[/math]]
If we write [math] \Delta = \Theta - \Theta^\ast [/math], then for [math] \| \Delta \|_F \leq 1 [/math],
[[math]] \begin{equation*} \lambda_{\mathrm{max}}(\tilde{\Theta}) = \| \tilde{\Theta} \|_\mathrm{op} = \| \Theta^\ast + \Delta \|_{\mathrm{op}} \leq \| \Theta^\ast \|_{\mathrm{op}} + \| \Delta \|_{\mathrm{op}} \leq \| \Theta^\ast \|_{\mathrm{op}} + \| \Delta \|_{F} \leq \| \Theta^\ast \|_\mathrm{op} + 1, \end{equation*} [[/math]]
and therefore

[[math]] \begin{equation*} l(\Theta, \Sigma_n) - l(\Theta^\ast, \Sigma_n) \geq \tr((\Sigma_n - \Sigma^\ast) (\Theta - \Theta^\ast)) + c \| \Theta - \Theta^\ast \|_F^2, \end{equation*} [[/math]]
for [math] c = (\| \Theta^\ast \|_\mathrm{op} + 1)^{-2}/2 [/math], if [math] \| \Delta \|_F \leq 1 [/math]. This takes care of the case where [math] \Delta [/math] is small. To handle the case where it is large, define [math] g(t) = l(\Theta^\ast + t \Delta, \Sigma_n) - l(\Theta^\ast, \Sigma_n) [/math]. By the convexity of [math] l [/math] in [math] \Theta [/math],

[[math]] \begin{equation*} \frac{g(1) - g(0)}{1} \geq \frac{g(t) - g(0)}{t}, \end{equation*} [[/math]]
so that plugging in [math] t = \| \Delta \|_F^{-1} [/math] gives

[[math]] \begin{align*} l(\Theta, \Sigma_n) - l(\Theta^\ast, \Sigma_n) \ge {} & \| \Delta \|_F \left( l(\Theta^\ast +\frac{1}{\| \Delta \|_F} \Delta , \Sigma_n) - l(\Theta^\ast, \Sigma_n) \right)\\ \geq {} & \| \Delta \|_F \left( \tr((\Sigma_n - \Sigma^\ast) \frac{1}{\| \Delta \|_F} \Delta) + c \right)\\ = {} & \tr((\Sigma_n - \Sigma^\ast) \Delta) + c \| \Delta \|_F, \end{align*} [[/math]]
for [math] \| \Delta \|_F \geq 1 [/math].

If we now write [math] \Delta = \hat{\Theta} - \Theta^\ast [/math] and assume [math] \| \Delta \|_F \geq 1 [/math], then by optimality of [math] \hat{\Theta} [/math],

[[math]] \begin{align*} \| \Delta \|_F \leq {} & C \left[ \tr((\Sigma^\ast - \Sigma_n) \Delta) + \lambda (\| \Theta^\ast_{D^c} \|_1 - \| \hat{\Theta}_{D^c} \|_1) \right]\\ \leq {} & C \left[ \tr((\Sigma^\ast - \Sigma_n) \Delta) + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1) \right], \end{align*} [[/math]]
where [math] S = \{ (i, j) \in D^c : \Theta^\ast_{i, j} \neq 0 \} [/math] and the second inequality follows from the triangle inequality. Now, split the error contributions of the diagonal and off-diagonal elements,

[[math]] \begin{align*} \leadeq{\tr((\Sigma^\ast - \Sigma_n) \Delta) + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1)}\\ \leq {} & \| (\Sigma^\ast - \Sigma_n)_D \|_F \| \Delta_D \|_F + \| (\Sigma^\ast - \Sigma_n)_{D^c} \|_\infty \| \Delta_{D^c} \|_1\\ {} & + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1). \end{align*} [[/math]]
By H\"older inequality, [math] \| (\Sigma^\ast - \Sigma_n)_D \|_F \leq \sqrt{d} \| \Sigma^\ast - \Sigma_n \|_\infty [/math], and by Lemma, [math] \| \Sigma^\ast - \Sigma_n \|_\infty \leq \sqrt{\log(d/\delta)}/\sqrt{n} [/math] for [math] n \gtrsim \log (d/\delta) [/math]. Combining these two estimates,

[[math]] \begin{align*} \leadeq{\tr((\Sigma^\ast - \Sigma_n) \Delta) + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1)}\\ \lesssim {} & \sqrt{\frac{d \log (d/\delta)}{n}} \| \Delta_D \|_F + \sqrt{\frac{\log (d/\delta)}{n}} \| \Delta_{D^c} \|_1 + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1) \end{align*} [[/math]]
Setting [math] \lambda = C \sqrt{\log(d/\delta)/n} [/math] and splitting [math] \| \Delta_{D^c} \|_1 = \| \Delta_S \|_1 + \| \Delta_{S^c} \|_1 [/math] yields

[[math]] \begin{align*} \leadeq{\sqrt{\frac{d \log (d/\delta)}{n}} \| \Delta_D \|_F + \sqrt{\frac{\log (d/\delta)}{n}} \| \Delta_{D^c} \|_1 + \lambda (\| \Delta_S \|_1 - \| \Delta_{S^c} \|_1)}\\ \leq {} & \sqrt{\frac{d \log (d/\delta)}{n}} \| \Delta_D \|_F + \sqrt{\frac{\log (d/\delta)}{n}} \| \Delta_{S} \|_1\\ \leq {} & \sqrt{\frac{d \log (d/\delta)}{n}} \| \Delta_D \|_F + \sqrt{\frac{k \log (d/\delta)}{n}} \| \Delta_{S} \|_F\\ \leq {} & \sqrt{\frac{(d + k)\log (d/\delta)}{n}} \| \Delta \|_F \end{align*} [[/math]]
Combining this with the left-hand side [math] c\| \Delta \|_F [/math] and dividing by [math] \| \Delta \|_F [/math] yields [math] 1 \lesssim \sqrt{(d+k)\log(d/\delta)/n} [/math], a contradiction once [math] n \gtrsim (d + k) \log(d/\delta) [/math]; hence [math] \| \Delta \|_F \leq 1 [/math] in that regime. Combining it with the left-hand side [math] c\| \Delta \|_F^2 [/math], which is valid when [math] \| \Delta \|_F \leq 1 [/math], gives us

[[math]] \begin{equation*} \| \Delta \|_F^2 \lesssim \frac{(d+k) \log (d/\delta)}{n}, \end{equation*} [[/math]]
as desired.

Write [math] \Sigma^\ast = W^\ast \Gamma^\ast W^\ast [/math], where [math] (W^\ast)^2 = \Sigma^\ast_D [/math] is the diagonal part of [math] \Sigma^\ast [/math] and [math] \Gamma^\ast [/math] is the associated correlation matrix, and similarly define [math] \hat{W} [/math] by [math] \hat{W}^2 = \hat{\Sigma}_D [/math] and [math] \hat{\Gamma} = \hat{W}^{-1} \hat{\Sigma} \hat{W}^{-1} [/math]. Define a slightly modified estimator by replacing [math] \hat{\Sigma} [/math] in \eqref{eq:af} by [math] \hat{\Gamma} [/math] to get an estimator [math] \hat{K} [/math] for [math] K = (\Gamma^\ast)^{-1} [/math]. Note that

[[math]] \begin{align} \| \hat{\Gamma} - \Gamma \|_\infty = {} & \| \hat{W}^{-1} \hat{\Sigma} \hat{W}^{-1} - (W^\ast)^{-1} \Sigma^\ast (W^\ast)^{-1} \|_\infty\\ \leq {} & \| \hat{W}^{-1} - (W^\ast)^{-1} \|_{\infty, \infty} \| \hat{\Sigma} - \Sigma^\ast \|_\infty \| \hat{W}^{-1} - (W^\ast)^{-1} \|_{\infty, \infty} \nonumber \\ {} & + \| \hat{W}^{-1} - (W^\ast)^{-1} \|_{\infty, \infty} \left( \| \hat{\Sigma} \|_\infty \| (W^\ast)^{-1} \|_\infty + \| \Sigma^\ast \|_\infty \| \hat{W}^{-1} \|_{\infty, \infty} \right) \nonumber \\ {} & + \| \hat{W}^{-1} \|_{\infty, \infty} \| \hat{\Sigma} - \Sigma^\ast \|_\infty \| (W^\ast)^{-1} \|_{\infty, \infty} \label{eq:aa} \end{align} [[/math]]

Note that [math] \| A \|_{\infty} \leq \| A \|_{\mathrm{op}} [/math], so that if [math] n \gtrsim \log (d/\delta) [/math], then [math] \| \hat{\Gamma} - \Gamma \|_\infty \lesssim \sqrt{\log (d/\delta)}/\sqrt{n} [/math]. From the above arguments, conclude [math] \| \hat{K} - K \|_F \lesssim \sqrt{k \log(d/\delta)/n} [/math] and, with a calculation similar to \eqref{eq:aa}, that the estimator [math] \hat{\Theta}' = \hat{W}^{-1} \hat{K} \hat{W}^{-1} [/math] satisfies

[[math]] \begin{equation*} \| \hat{\Theta}' - \Theta^\ast \|_\mathrm{op} \lesssim \sqrt{\frac{k \log(d/\delta)}{n}}. \end{equation*} [[/math]]

This argument does not yield an improvement in Frobenius norm since there, we only have [math] \| \hat{W}^2 - (W^\ast)^2 \|_F^2 \lesssim d \log(d/\delta)/n [/math].

Lower bounds

Here, we will show that the bounds in Theorem are optimal up to log factors. It will again be based on applying Fano's inequality, Theorem Minimax Lower Bounds. In order to calculate the KL divergence between two Gaussian distributions with densities [math] f_{\Sigma_1} [/math] and [math] f_{\Sigma_2} [/math], we first observe that

[[math]] \begin{align*} \log \frac{f_{\Sigma_1}(x)}{f_{\Sigma_2}(x)} = {} & -\frac{1}{2} x^\top \Theta_1 x + \frac{1}{2}\log \operatorname{det}(\Theta_1) + \frac{1}{2} x^\top \Theta_2 x - \frac{1}{2} \log \operatorname{det}(\Theta_2) \end{align*} [[/math]]

so that

[[math]] \begin{align*} \KL(\p_{\Sigma_1}, \p_{\Sigma_2}) = {} & \E_{\Sigma_1} \log \left( \frac{f_{\Sigma_1}}{f_{\Sigma_2}}(X) \right)\\ = {} & \frac{1}{2} \left[\log \operatorname{det} (\Theta_1) - \log \operatorname{det} (\Theta_2) + \E[ \tr((\Theta_2 - \Theta_1) X X^\top)] \right]\\ = {} & \frac{1}{2} \left[ \log \operatorname{det} (\Theta_1) - \log \operatorname{det} (\Theta_2) + \tr((\Theta_2 - \Theta_1) \Sigma_1) \right] \\ = {} & \frac{1}{2} \left[\log \operatorname{det} (\Theta_1) - \log \operatorname{det} (\Theta_2) + \tr(\Theta_2 \Sigma_1) - d\right]. \end{align*} [[/math]]

Writing [math] \Delta = \Theta_2 - \Theta_1 [/math] and [math] \tilde{\Theta} = \Theta_1 + t \Delta [/math], [math] t \in [0,1] [/math], a Taylor expansion of [math] \log \operatorname{det} [/math] then yields

[[math]] \begin{align*} \KL(\p_{\Sigma_1}, \p_{\Sigma_2}) = {} & \frac{1}{2} \left[ - \tr(\Delta \Sigma_1) + \frac{1}{2} \tr(\tilde{\Theta}^{-1} \Delta \tilde{\Theta}^{-1}\Delta) + \tr(\Delta \Sigma_1) \right]\\ \leq {} & \frac{1}{4} \lambda_{\mathrm{max}}(\tilde{\Theta}^{-1})^2 \| \Delta \|_F^2\\ = {} & \frac{1}{4} \lambda_\mathrm{min}^{-2}(\tilde{\Theta}) \| \Delta \|_F^2. \end{align*} [[/math]]


This means we are in good shape to apply our standard tricks if we can make sure that [math] \lambda_\mathrm{min}(\tilde{\Theta}) [/math] stays large enough among our hypotheses. To this end, assume [math] k \leq d^2/16 [/math] and define the collection of matrices [math] (B_{j})_{j \in [M_1]} [/math] with entries in [math] \{0, 1\} [/math] such that they are symmetric, have zero diagonal and only have non-zero entries in a band of size [math] \sqrt{k} [/math] around the diagonal, and whose entries are given by a flattening of the vectors obtained by the sparse Varshamov-Gilbert Lemma Minimax Lower Bounds. Moreover, let [math] (C_j)_{j \in [M_2]} [/math] be diagonal matrices with entries in [math] \{0, 1\} [/math] given by the regular Varshamov-Gilbert Lemma Minimax Lower Bounds. Then, define

[[math]] \begin{equation*} \Theta_j = \Theta_{j_1, j_2} = I + \frac{\beta}{\sqrt{n}} (\sqrt{\log(1+d/(2k))} B_{j_1} + C_{j_2}), \quad j_1 \in [M_1], \, j_2 \in [M_2]. \end{equation*} [[/math]]

We can ensure that [math] \lambda_\mathrm{min}(\tilde{\Theta}) [/math] stays bounded away from zero by the Gershgorin circle theorem: for each row [math] i [/math] of [math] \tilde{\Theta} - I [/math],

[[math]] \begin{align*} \sum_{l} |(\tilde{\Theta}-I)_{il}| \leq \frac{2\beta (\sqrt{k} + 1)}{\sqrt{n}} \sqrt{\log(1+d/(2k))} \lt \frac{1}{2} \end{align*} [[/math]]

if [math] n \gtrsim \beta^2 k \log(1 + d/(2k)) [/math]. Hence,

[[math]] \begin{align*} \| \Theta_j - \Theta_l \|_F^2 = {} & \frac{\beta^2}{n} (\rho(B_{j_1}, B_{l_1}) \log (1+d/(2k)) + \rho(C_{j_2}, C_{l_2}))\\ \gtrsim {} & \frac{\beta^2}{n} (k \log(1+d/(2k)) + d), \end{align*} [[/math]]

and

[[math]] \begin{align*} \| \Theta_j - \Theta_l \|_F^2 \lesssim {} & \frac{\beta^2}{n} (k \log(1+d/(2k)) + d)\\ \lesssim {} & \frac{\beta^2}{n} \log(M_1 M_2), \end{align*} [[/math]]

which yields the following theorem.

Theorem

Denote by [math] \mathcal{B}_k [/math] the set of positive definite matrices with at most [math] k [/math] non-zero off-diagonal entries and assume [math] n \gtrsim k \log(d) [/math]. Then, if [math] \p_{\Theta} [/math] denotes a Gaussian distribution with inverse covariance matrix [math] \Theta [/math] and mean [math] 0 [/math], there exists a constant [math] c \gt 0 [/math] such that

[[math]] \begin{equation*} \inf_{\hat{\Theta}} \sup_{\Theta^\ast \in \mathcal{B}_k} \p^{\otimes n}_{\Theta^\ast} \left( \| \hat{\Theta} - \Theta^\ast \|_F^2 \geq c \frac{\alpha^2}{n} (k \log(1+d/(2k)) + d) \right) \geq \frac{1}{2} - 2 \alpha. \end{equation*} [[/math]]

Ising model

Like the Gaussian graphical model, the Ising model is also a model of pairwise interactions but for random variables that take values in [math]\{-1,1\}^d[/math]. Before developing our general approach, we describe the main ideas in the simplest Markov random field: an Ising model without[Notes 2] external field[12]. Such models specify the distribution of a random vector [math]Z=(Z^{(1)}, \ldots, Z^{(d)})\in \dHyp[/math] as follows

[[math]] \begin{equation} \label{EQ:defIsing} \p(Z=z)=\exp\big(z^\top W z -\Phi(W)\big)\,, \end{equation} [[/math]]

where [math]W \in \R^{d \times d}[/math] is an unknown matrix of interest with null diagonal elements and

[[math]] \Phi(W)=\log \sum_{z \in \dHyp}\exp\big(z^\top W z \big)\,, [[/math]]

is a normalization term known as the log-partition function. Fix [math]\beta,\lambda \gt 0[/math]. Our goal in this section is to estimate the parameter [math]W[/math] subject to the constraint that each row [math]e_j^\top W[/math] of [math]W[/math] satisfies [math]|e_j^\top W|_1\le \lambda[/math]. To that end, we observe [math]n[/math] independent copies [math]Z_1, \ldots, Z_n[/math] of [math]Z\sim \p[/math] and estimate the matrix [math]W[/math] row by row using constrained likelihood maximization. Recall from [13](Eq.(4)) that for any [math]j \in [d][/math], it holds for any [math]z^{(j)}\in \{-1, 1\}[/math], that

[[math]] \begin{equation} \label{EQ:logistic} \p(Z^{(j)}=z^{(j)} | Z^{(\lnot j)}=z^{(\lnot j)})=\frac{\exp(2z^{(j)} e_j^\top W z)}{1+\exp(2z^{(j)} e_j^\top W z)}\,, \end{equation} [[/math]]

where we used the fact that the diagonal elements of [math]W[/math] are equal to zero. Therefore, the [math]j[/math]th row [math]e_j^\top W[/math] of [math]W[/math] may be estimated by performing a logistic regression of [math]Z^{(j)}[/math] onto [math]Z^{(\lnot j)}[/math] subject to the constraint that [math]|e_j^\top W|_1 \le \lambda[/math] and [math]\|e_j^\top W\|_\infty \le \beta[/math]. To simplify notation, fix [math]j[/math] and define [math]Y=Z^{(j)}, X=Z^{(\lnot j)}[/math] and [math]w^*=(e_j^\top W)^{(\lnot j)}[/math]. Let [math](X_1, Y_1), \ldots, (X_n, Y_n)[/math] be [math]n[/math] independent copies of [math](X,Y)[/math]. Then the (rescaled) log-likelihood of a candidate parameter [math]w \in \R^{d-1}[/math] is denoted by [math]\bar \ell_n(w)[/math] and defined by

[[math]] \begin{align*} \bar \ell_n(w)&=\frac{1}{n}\sum_{i=1}^n\log\Big( \frac{\exp(2Y_i X_i^\top w)}{1+\exp(2Y_i X_i^\top w)}\Big)\\ &=\frac{1}{n}\sum_{i=1}^nY_iX_i^\top w - \frac{1}{n}\sum_{i=1}^n\log\big(1+ \exp(2Y_i X_i^\top w)\big) \end{align*} [[/math]]

With this representation, it is easy to see that [math]w \mapsto \bar \ell_n(w)[/math] is a concave function. The (constrained) maximum likelihood estimator (MLE) [math]\hat w \in \R^{d-1}[/math] is defined to be any solution of the following convex optimization problem:

[[math]] \begin{equation} \label{EQ:MLE-Ising} \hat w \in \argmax_{w \in \cB_1(\lambda)} \bar \ell_n(w) \end{equation} [[/math]]

This problem can be solved very efficiently using a variety of methods similar to the Lasso; a minimal numerical sketch of one such method is given below.
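The sketch is our own illustration (one option among many, not necessarily the method of the cited references): projected gradient ascent on the concave function [math]\bar \ell_n[/math] over the [math]\ell_1[/math]-ball [math]\cB_1(\lambda)[/math], applied to synthetic data generated from the logistic model \eqref{EQ:logistic} conditionally on i.i.d. sign vectors [math]X_i[/math]. The sparse row [math]w^*[/math], the step size and the number of iterations are arbitrary illustrative choices; the sup-norm error printed at the end is the quantity controlled by the results below.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_l1(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Synthetic data matching the logistic reduction: X_i has +/-1 entries and,
# conditionally on X_i = x, P(Y_i = 1) = exp(2 w*.x) / (1 + exp(2 w*.x)).
d, n, lam = 30, 5000, 1.0
w_star = np.zeros(d - 1)
w_star[:4] = lam / 4                      # a sparse row with |w*|_1 <= lambda
X = rng.choice([-1.0, 1.0], size=(n, d - 1))
Y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * X @ w_star)), 1.0, -1.0)

def grad_loglik(w):
    """Gradient of the rescaled log-likelihood at w."""
    m = Y * (X @ w)
    return (X * (Y * (1.0 - 2.0 / (1.0 + np.exp(-2.0 * m))))[:, None]).mean(axis=0)

# Projected gradient ascent over the l1-ball B_1(lambda).
w_hat = np.zeros(d - 1)
for _ in range(2000):
    w_hat = project_l1(w_hat + 0.5 * grad_loglik(w_hat), lam)

print("sup-norm error |w_hat - w*|_inf =", np.max(np.abs(w_hat - w_star)))
```

The following lemma follows from [14].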

Lemma

Fix [math]\delta \in (0,1)[/math]. Conditionally on [math](X_1, \ldots, X_n)[/math], the constrained MLE [math]\hat w[/math] defined in\eqref{EQ:MLE-Ising} satisfies with probability at least [math]1-\delta[/math] that

[[math]] \frac{1}{n}\sum_{i=1}^n (X_i^\top (\hat w - w^*))^2 \le 2\lambda e^\lambda\sqrt{\frac{\log (2(d-1)/\delta)}{2n}} [[/math]]


Show Proof

We begin with some standard manipulations of the log-likelihood. Note that

[[math]] \begin{align*} \log\Big( \frac{\exp(2Y_i X_i^\top w)}{1+\exp(2Y_i X_i^\top w)}\Big)&=\log\Big( \frac{\exp(Y_i X_i^\top w)}{\exp(-Y_i X_i^\top w)+\exp(Y_i X_i^\top w)}\Big)\\ &=\log\Big( \frac{\exp(Y_i X_i^\top w)}{\exp(-X_i^\top w)+\exp(X_i^\top w)}\Big)\\ &=(Y_i+1) X_i^\top w - \log\big(1+\exp(2X_i^\top w) \big) \end{align*} [[/math]]
Therefore, writing [math]\tilde Y_i =(Y_i+1)/2 \in \{0,1\}[/math], we get that [math]\hat w[/math] is the solution to the following minimization problem:

[[math]] \hat w = \argmin_{w \in \cB_1(\lambda)} \bar \kappa_n(w) [[/math]]
where

[[math]] \bar \kappa_n(w)=\frac{1}{n}\sum_{i=1}^n\big\{ -2\tilde Y_i X_i^\top w + \log\big(1+\exp(2X_i^\top w) \big)\big\} [[/math]]
For any [math]w \in \R^{d-1}[/math], write [math]\kappa(w)=\E[\bar \kappa_n(w)][/math] where here and throughout this proof, all expectations are implicitly taken conditionally on [math]X_1, \ldots, X_n[/math]. Observe that

[[math]] \kappa(w)=\frac{1}{n}\sum_{i=1}^n\big\{ -2\E[\tilde Y_i] X_i^\top w + \log\big(1+\exp(2X_i^\top w) \big)\big\}\,. [[/math]]

Next, we get from the basic inequality [math]\bar \kappa_n(\hat w) \le \bar \kappa_n(w)[/math] for any [math]w \in \cB_1(\lambda)[/math] that

[[math]] \begin{align*} \kappa(\hat w) -\kappa(w) &\le \frac{1}{n} \sum_{i=1}^n (\tilde Y_i -\E[\tilde Y_i]) X_i^\top(\hat w - w) \\ &\le \frac{2\lambda}{n} \max_{1\le j \le d-1}\Big|\sum_{i=1}^n (\tilde Y_i -\E[\tilde Y_i]) X_i^\top e_j\Big|\,, \end{align*} [[/math]]
where in the second inequality, we used H\"older's inequality together with the facts that [math]|\hat w - w|_1 \le 2\lambda[/math] and [math]|X_i|_\infty \le 1[/math] for all [math]i \in [n][/math]. Together with Hoeffding's inequality and a union bound, it yields

[[math]] \kappa(\hat w) -\kappa(w) \le 2\lambda \sqrt{\frac{\log (4(d-1)/\delta)}{2n}}\,, \quad \text{with probability $1-\frac{\delta}{2}$.} [[/math]]

Note that the Hessian of [math]\kappa[/math] is given by

[[math]] \nabla^2\kappa(w)=\frac{1}{n} \sum_{i=1}^n \frac{4\exp(2X_i^\top w)}{(1+\exp(2X_i^\top w))^2} X_i X_i^\top\,. [[/math]]
Moreover, for any [math]w \in \cB_1(\lambda)[/math], since [math]\|X_i\|_\infty \le 1[/math], it holds

[[math]] \frac{4\exp(2X_i^\top w)}{(1+\exp(2X_i^\top w))^2}\ge \frac{4e^{2\lambda}}{(1+e^{2\lambda})^2} \ge e^{-2\lambda} [[/math]]
Next, since [math]w^*[/math] is a minimizer of [math]w \mapsto \kappa(w)[/math], we get from a second order Taylor expansion that

[[math]] \kappa(\hat w) -\kappa(w^*) \ge \frac{e^{-2\lambda}}{2} \frac{1}{n}\sum_{i=1}^n (X_i^\top (\hat w - w^*))^2\,. [[/math]]
Combining this with the previous display, applied at [math]w=w^*[/math], completes our proof.

Next, we bound from below the quantity

[[math]] \frac{1}{n}\sum_{i=1}^n (X_i^\top (\hat w - w))^2\,. [[/math]]

To that end, we must exploit the covariance structure of the Ising model. The following lemma is similar to the combination of Lemmas 6 and 7 in [12].

Lemma

Fix [math]\delta\in (0,1)[/math] and let [math]Z_1, \ldots, Z_n \in \{-1,1\}^{d}[/math] be [math]n[/math] independent copies of [math]Z[/math] distributed according to\eqref{EQ:defIsing}. The following holds with probability [math]1-\delta/2[/math], uniformly in [math]u \in \cB_1(2\lambda)[/math]:

[[math]] \frac{1}{n}\sum_{i=1}^n (Z_i^\top u)^2 \ge \frac12|u|_\infty^2 \exp(-2\lambda)- 4\lambda\sqrt{\frac{\log (2d(d-1)/\delta)}{2n}} [[/math]]


Show Proof

Note that

[[math]] \frac{1}{n}\sum_{i=1}^n (Z_i^\top u)^2 =u^\top S_n u\,, [[/math]]
where [math]S_n \in \R^{d\times d}[/math] denotes the sample covariance matrix of [math]Z_1, \ldots, Z_n[/math], defined by

[[math]] S_n=\frac{1}{n}\sum_{i=1}^n Z_iZ_i^\top\,. [[/math]]
Observe that

[[math]] \begin{equation} \label{EQ:LBempcov:pr1} u^\top S_n u \ge u^\top S u - 2\lambda |S_n -S|_\infty\,, \end{equation} [[/math]]
where [math]S=\E[S_n][/math]. Using Hoeffding's inequality together with a union bound, we get that

[[math]] |S_n -S|_\infty \le 2\sqrt{\frac{\log (2d(d-1)/\delta)}{2n}} [[/math]]
with probability [math]1-\delta/2[/math]. Moreover, [math]u^\top S u= \var(Z^\top u)[/math]. Assume without loss of generality that [math]|u^{(1)}|=|u|_\infty[/math] and observe that

[[math]] \var(Z^\top u)=\E[(Z^\top u)^2]=(u^{(1)})^2 + \E\big[\big(\sum_{j=2}^{d}u^{(j)}Z^{(j)}\big)^2\big]+2\E\big[u^{(1)}Z^{(1)}\sum_{j=2}^{d}u^{(j)}Z^{(j)}\big]\,. [[/math]]
To control the cross term, let us condition on the neighborhood [math]Z^{(\lnot 1)}[/math] to get

[[math]] \begin{align*} 2\Big|\E\big[u^{(1)}Z^{(1)}\sum_{j=2}^{d}u^{(j)}Z^{(j)}\big]\Big|&=2\Big|\E\Big(\E\big[u^{(1)}Z^{(1)}|Z^{(\lnot 1)}\big]\sum_{j=2}^{d}u^{(j)}Z^{(j)}\Big)\Big|\\ &\le \E\Big(\big(\E\big[u^{(1)}Z^{(1)}|Z^{(\lnot 1)}\big]\big)^2\Big)+ \E\Big(\big(\sum_{j=2}^{d}u^{(j)}Z^{(j)}\big)^2\Big)\,, \end{align*} [[/math]]
where we used the fact that [math]2|ab|\le a^2 + b^2[/math] for all [math]a,b \in \R[/math]. The above two displays together yield

[[math]] \var(Z^\top u)\ge |u|_\infty^2 \Big(1- \sup_{z \in \{-1,1\}^{d-1}}\big( \E[Z^{(1)}|Z^{(\lnot 1)}=z]\big)^2\Big) [[/math]]
To control the above supremum, recall that from\eqref{EQ:logistic}, we have

[[math]] \begin{align*} \sup_{z \in \{-1,1\}^{d-1}}\E[Z^{(1)}|Z^{(\lnot 1)}=z]&=\sup_{z \in \{-1,1\}^{d}}\frac{\exp(2 e_1^\top W z)-1}{\exp(2e_1^\top W z)+1}\le\frac{\exp(2 \lambda)-1}{\exp(2 \lambda)+1}\,. \end{align*} [[/math]]
Therefore,

[[math]] \var(Z^\top u)\ge \frac{|u|_\infty^2 }{1+\exp(2\lambda)}\ge \frac12 |u|_\infty^2\exp(-2\lambda)\,. [[/math]]
Together with\eqref{EQ:LBempcov:pr1}, it yields the desired result.

Combining the above two lemmas immediately yields the following theorem.

Theorem

Let [math]\hat w[/math] be the constrained maximum likelihood estimator\eqref{EQ:MLE-Ising} and let [math]w^*=(e_j^\top W)^{(\lnot j)}[/math] be the [math]j[/math]th row of [math]W[/math] with the [math]j[/math]th entry removed. Then with probability [math]1-\delta[/math] we have

[[math]] |\hat w - w^*|_\infty^2 \le 9\lambda \exp(3\lambda)\sqrt{\frac{\log (2d^2/\delta)}{n}}\,. [[/math]]

General references

Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].

Notes

  1. To see this, check that [math](I_d+\theta u u^\top)(I_d-\frac{\theta}{1+\theta}uu^\top)=I_d[/math] and that the two matrices on the left-hand side commute.
  2. The presence of an external field does not change our method. It merely introduces an intercept in the logistic regression problem and comes at the cost of more cumbersome notation. All arguments below follow, potentially after minor modifications and explicit computations are left to the reader.

References

  1. Flaxman, Abraham (2003). A spectral technique for random satisfiable 3CNF formulas. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms (Report). Society for Industrial and Applied Mathematics.
  2. Flaxman, Abraham (2003). A spectral technique for random satisfiable 3CNF formulas. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms (Report). Society for Industrial and Applied Mathematics.
  3. 3.0 3.1 "Oracle inequalities and optimal inference under group sparsity" . Ann. Statist. 39. 
  4. "Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion" (2011). Ann. Statist. 39. 
  5. "Optimal rates of convergence for covariance matrix estimation" (2010). Ann. Statist. 38. 
  6. "Minimax estimation of large covariance matrices under {$\ell_1$}-norm" (2012). Statist. Sinica 22. 
  7. Markowitz, Harry. "Portfolio selection". The journal of finance 7. Wiley Online Library. 
  8. Besag, Julian (1986). "On the statistical analysis of dirty pictures". J. Roy. Statist. Soc. Ser. B 48. 
  9. Bresler, Guy (2015). "Efficiently learning Ising models on arbitrary graphs". STOC'15---Proceedings of the 2015 ACM Symposium on Theory of Computing. 
  10. Loh, Po-Ling; Wainwright, Martin J. (2012). "Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses". Advances in Neural Information Processing Systems (NIPS).
  11. "Covariance regularization by thresholding" (2008). Annals of Statistics 36. 
  12. 12.0 12.1 "Minimax Rates in Permutation Estimation for Feature Matching" (2016). Journal of Machine Learning Research 17. 
  13. "High-dimensional Ising model selection using $\ell_1$-regularized logistic regression" (2010). Ann. Statist. 38. 
  14. Rigollet, Philippe (2012). "Kullback-Leibler aggregation and misspecified generalized linear models". Ann. Statist. 40.