Sub-Gaussian Random Variables

Gaussian tails and MGF

Recall that a random variable [math]X \in \R[/math] has Gaussian distribution iff it has a density [math]p[/math] with respect to the Lebesgue measure on [math]\R[/math] given by

[[math]] p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)\,, \quad x \in \R\,, [[/math]]

where [math]\mu=\E(X) \in \R[/math] and [math]\sigma^2=\var(X) \gt 0[/math] are the mean and variance of [math]X[/math]. We write [math]X \sim \cN(\mu, \sigma^2)[/math]. Note that [math]X=\sigma Z+\mu[/math] for [math]Z \sim \cN(0,1)[/math] (called standard Gaussian) and where the equality holds in distribution. Clearly, this distribution has unbounded support but it is well known that it has almost bounded support in the following sense: [math]\p(|X-\mu|\le 3\sigma)\simeq 0.997[/math]. This is due to the fast decay of the tails of [math]p[/math] as [math]|x|\to \infty[/math] (see [[#fig:gaussdens |Figure]). This decay can be quantified using the following proposition (Mill's inequality).

Probabilities of falling within 1, 2, and 3 standard deviations close to the mean in a Gaussian distribution. Source [1]

Proposition

Let [math]X[/math] be a Gaussian random variable with mean [math]\mu[/math] and variance [math]\sigma^2[/math] then for any [math]t \gt 0[/math], it holds

[[math]] \p(X-\mu \gt t)\le \frac{\sigma}{\sqrt{2\pi}}\frac{\e^{-\frac{t^2}{2\sigma^2}}}{t}\,. [[/math]]

By symmetry we also have

[[math]] \p(X-\mu \lt -t)\le \frac{\sigma}{\sqrt{2\pi}}\frac{\e^{-\frac{t^2}{2\sigma^2}}}{t}\, [[/math]]

and

[[math]] \p(|X-\mu| \gt t)\le \sigma \sqrt{\frac{2}{\pi}}\frac{\e^{-\frac{t^2}{2\sigma^2}}}{t}\,. [[/math]]

Show Proof

Note that it is sufficient to prove the theorem for [math]\mu=0[/math] and [math]\sigma^2=1[/math] by simple translation and rescaling. We get for [math]Z\sim \cN(0,1)[/math],

[[math]] \begin{align*} \p(Z \gt t)&= \frac{1}{\sqrt{2\pi}}\int_t^\infty \exp\Big(-\frac{x^2}{2}\Big) \ud x\\ &\le\frac{1}{\sqrt{2\pi}}\int_t^\infty \frac{x}{t}\exp\Big(-\frac{x^2}{2}\Big)\ud x\\ &=\frac{1}{t\sqrt{2\pi}}\int_t^\infty -\frac{\partial}{\partial x}\exp\Big(-\frac{x^2}{2}\Big) \ud x\\ &=\frac{1}{t\sqrt{2\pi}}\exp(-t^2/2)\,. \end{align*} [[/math]]

The second inequality follows from symmetry and the last one using the union bound:

[[math]] \p(|Z| \gt t)=\p(\{Z \gt t\}\cup\{Z \lt -t\})\le \p(Z \gt t)+\p(Z \lt -t)=2\p(Z \gt t)\,. [[/math]]

■

The fact that a Gaussian random variable [math]Z[/math] has tails that decay to zero exponentially fast can also be seen in the moment generating function (MGF)

[[math]] M\,:\,s \mapsto M(s)=\E[\exp(sZ)]\,. [[/math]]

Indeed, in the case of a standard Gaussian random variable, we have

[[math]] \begin{align*} M(s)=\E[\exp(sZ)]&=\frac{1}{\sqrt{2\pi}}\int e^{sz}e^{-\frac{z^2}{2}}\ud z\\ &=\frac{1}{\sqrt{2\pi}}\int e^{-\frac{(z-s)^2}{2}+\frac{s^2}{2}}\ud z\\ &=e^{\frac{s^2}{2}}\,. \end{align*} [[/math]]

It follows that if [math]X\sim \cN(\mu,\sigma^2)[/math], then [math]\E[\exp(sX)]=\exp\big(s\mu+\frac{\sigma^2s^2}{2}\big)[/math].

Sub-Gaussian random variables and Chernoff bounds

Definition and first properties

Gaussian tails are practical when controlling the tail of an average of independent random variables. Indeed, recall that if [math]X_1, \ldots, X_n[/math] are \\iid [math]\cN(\mu,\sigma^2)[/math], then [math]\bar X=\frac{1}{n}\sum_{i=1}^nX_i\sim\cN(\mu,\sigma^2/n)[/math]. Using Lemma below for example, we get

[[math]] \p(|\bar X -\mu| \gt t)\le 2\exp\big(-\frac{nt^2}{2\sigma^2}\big)\,. [[/math]]

Equating the right-hand side with some confidence level [math]\delta \gt 0[/math], we find that with probability at least^{[Notes 1]} [math]1-\delta[/math],

[[math]] \begin{equation} \label{EQ:subG:CI} \mu \in \Big[\bar X - \sigma\sqrt{\frac{2 \log (2/\delta)}{n}}, \bar X + \sigma\sqrt{\frac{2 \log (2/\delta)}{n}}\Big]\,, \end{equation} [[/math]]

This is almost the confidence interval that you used in introductory statistics. The only difference is that we used an approximation for the Gaussian tail whereas statistical tables or software use a much more accurate computation. Figure shows the ratio of the width of the confidence interval to that of the confidence interval computed by the software R. It turns out that intervals of the same form can be also derived for non-Gaussian random variables as long as they have sub-Gaussian tails.

Width of confidence intervals from exact computation in R (red dashed) and from the approximation \eqref{EQ:subG:CI} (solid black). Here [math]x=\delta[/math] and [math]y[/math] is the width of the confidence intervals.

Definition

A random variable [math]X \in \R[/math] is said to be ’'sub-Gaussian with variance proxy [math]\sigma^2[/math] if [math]\E[X]=0[/math] and its moment generating function satisfies

[[math]] \begin{equation} \label{EQ:defsubgauss} \E[\exp(sX)]\le \exp\Big(\frac{\sigma^2s^2}{2}\Big)\,, \quad \forall\, s \in \R\,. \end{equation} [[/math]]

In this case we write [math]X \sim \sg(\sigma^2)[/math]. Note that [math]\sg(\sigma^2)[/math] denotes a class of distributions rather than a distribution. Therefore, we abuse notation when writing [math]X \sim \sg(\sigma^2)[/math]. More generally, we can talk about sub-Gaussian random vectors and matrices. A random vector [math]X \in \R^d[/math] is said to be sub-Gaussian with variance proxy [math]\sigma^2[/math] if [math]\E[X]=0[/math] and [math]u^\top X[/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math] for any vector [math]u \in \cS^{d-1}[/math]. In this case we write [math]X \sim \sg_d(\sigma^2)[/math]. Note that if [math]X \sim \sg_d(\sigma^2)[/math], then for any [math]v[/math] such that [math]|v|_2\le 1[/math], we have [math]v^\top X\sim \sg(\sigma^2)[/math]. Indeed, denoting [math]u=v/|v|_2 \in \cS^{d-1}[/math], we have

[[math]] \E[e^{sv^\top X}]=\E[e^{s|v|_2u^\top X}]\le e^{\frac{\sigma^2 s^2|v|_2^2}{2}}\le e^{\frac{\sigma^2 s^2}{2}}\,. [[/math]]

A random matrix [math]X \in \R^{d\times T}[/math] is said to be sub-Gaussian with variance proxy [math]\sigma^2[/math] if [math]\E[X]=0[/math] and [math]u^\top X v[/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math] for any unit vectors [math]u \in \cS^{d-1}, v \in \cS^{T-1}[/math]. In this case we write [math]X \sim \sg_{d\times T}(\sigma^2)[/math].

This property can equivalently be expressed in terms of bounds on the tail of the random variable [math]X[/math].

Lemma

Let [math]X \sim \sg( \sigma^2)[/math]. Then for any [math]t \gt 0[/math], it holds

[[math]] \begin{equation} \label{EQ:LEM:subgausstails} \p[X \gt t]\le \exp\Big(-\frac{t^2}{2\sigma^2}\Big)\,, \quad \text{and} \quad \p[X \lt -t]\le \exp\Big(-\frac{t^2}{2\sigma^2}\Big)\,. \end{equation} [[/math]]

Show Proof

Assume first that [math]X \sim \sg( \sigma^2)[/math]. We will employ a very useful technique called Chernoff bound that allows to translate a bound on the moment generating function into a tail bound. Using Markov's inequality, we have for any [math]s \gt 0[/math],

[[math]] \p(X \gt t)\le \p\big(e^{sX} \gt e^{st}\big)\le \frac{\E\big[e^{sX}\big]}{e^{st}}\,. [[/math]]

Next we use the fact that [math]X[/math] is sub-Gaussian to get

[[math]] \p(X \gt t)\le e^{\frac{\sigma^2 s^2}{2}-st}\,. [[/math]]

The above inequality holds for any [math]s \gt 0[/math] so to make it the tightest possible, we minimize with respect to [math]s \gt 0[/math]. Solving [math]\phi'(s)=0[/math], where [math]\phi(s)=\frac{\sigma^2 s^2}{2}-st[/math], we find that [math]\inf_{s \gt 0}\phi(s)=-\frac{t^2}{2\sigma^2}[/math]. This proves the first part of\eqref{EQ:LEM:subgausstails}. The second inequality in this equation follows in the same manner (recall that\eqref{EQ:defsubgauss} holds for any [math]s \in \R[/math]).

■

Moments

Recall that the absolute moments of [math]Z \sim \cN(0, \sigma^2)[/math] are given by

[[math]] \E[|Z|^k]=\frac{1}{\sqrt\pi}(2\sigma^2)^{k/2}\Gamma\Big(\frac{k+1}{2}\Big) [[/math]]

where [math]\Gamma(\cdot)[/math] denotes the Gamma function defined by

[[math]] \Gamma(t)=\int_0^\infty x^{t-1}e^{-x}\ud x \,, \quad t \gt 0\,. [[/math]]

The next lemma shows that the tail bounds of Lemma are sufficient to show that the absolute moments of [math]X \sim \sg( \sigma^2)[/math] can be bounded by those of [math]Z\sim \cN(0,\sigma^2)[/math] up to multiplicative constants.

Lemma

Let [math]X[/math] be a random variable such that

[[math]] \begin{equation}\label{LEM:subgauss_moments}\end{equation} \p[|X| \gt t]\le 2\exp\Big(-\frac{t^2}{2\sigma^2}\Big)\,, [[/math]]

then for any positive integer [math]k\ge 1[/math],

[[math]] \E [|X|^k] \le (2\sigma^2)^{k/2}k\Gamma(k/2)\,. [[/math]]

In particular,

[[math]] \big(\E [|X|^k])^{1/k} \le \sigma e^{1/e} \sqrt{k}\,, \quad k\ge 2\,. [[/math]]

and [math]\E[|X|] \le \sigma\sqrt{2\pi}[/math]\,.

Show Proof

[[math]] \begin{align*} \E [|X|^k]&=\int_0^\infty \p(|X|^k \gt t)\ud t&\\ &=\int_0^\infty \p(|X| \gt t^{1/k})\ud t&\\ &\le 2\int_0^\infty e^{-\frac{t^{2/k}}{2\sigma^2}}\ud t&\\ &=(2\sigma^2)^{k/2}k\int_0^\infty e^{-u}u^{k/2-1}\ud u\,,& u=\frac{t^{2/k}}{2\sigma^2}\\ &=(2\sigma^2)^{k/2}k\Gamma(k/2)& \end{align*} [[/math]]

The second statement follows from [math]\Gamma(k/2)\le (k/2)^{k/2}[/math] and [math]k^{1/k} \le e^{1/e}[/math] for any [math]k \ge 2[/math]. It yields

[[math]] \big((2\sigma^2)^{k/2}k\Gamma(k/2)\big)^{1/k}\le k^{1/k}\sqrt{\frac{2\sigma^2k}{2}}\le e^{1/e}\sigma\sqrt{k}\,. [[/math]]

Moreover, for [math]k=1[/math], we have [math]\sqrt{2}\Gamma(1/2)=\sqrt{2\pi}[/math].

■

Using moments, we can prove the following reciprocal to Lemma.

Lemma

If\eqref{EQ:LEM:subgausstails} holds and [math] \E[X] = 0 [/math], then for any [math]s \gt 0[/math], it holds

[[math]] \E[\exp(sX)]\le e^{4\sigma^2 s^2}\,. [[/math]]

As a result, we will sometimes write [math]X\sim \sg(\sigma^2)[/math] when it satisfies\eqref{EQ:LEM:subgausstails} and [math] \E[X] = 0 [/math].

Show Proof

We use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem

[[math]] \begin{align*} \E\big[e^{sX}\big]&\le 1+\sum_{k=2}^\infty \frac{s^k\E[|X|^k]}{k!}&\\ &\le 1+\sum_{k=2}^\infty \frac{(2\sigma^2s^2)^{k/2}k\Gamma(k/2)}{k!}&\\ &= 1+\sum_{k=1}^\infty \frac{(2\sigma^2s^2)^{k}2k\Gamma(k)}{(2k)!}+\sum_{k=1}^\infty \frac{(2\sigma^2s^2)^{k+1/2}(2k+1)\Gamma(k+1/2)}{(2k+1)!}&\\ &\le1+\big(2+\sqrt{2\sigma^2s^2}\big)\sum_{k=1}^\infty \frac{(2\sigma^2s^2)^{k}k!}{(2k)!}&\\ &\le1+\big(1+\sqrt{\frac{\sigma^2s^2}{2}}\big)\sum_{k=1}^\infty \frac{(2\sigma^2s^2)^{k}}{k!}&\hspace{-2cm} 2(k!)^2\le(2k)!\\ &=e^{2\sigma^2s^2}+\sqrt{\frac{\sigma^2s^2}{2}}(e^{2\sigma^2s^2}-1)\\ &\le e^{4\sigma^2 s^2}\,.& \end{align*} [[/math]]

■

From the above Lemma, we see that sub-Gaussian random variables can be equivalently defined from their tail bounds and their moment generating functions, up to constants.

Sums of independent sub-Gaussian random variables

Recall that if [math]X_1, \ldots, X_n[/math] are \\iid [math]\cN(0,\sigma^2)[/math], then for any [math]a \in \R^n[/math],

[[math]] \sum_{i=1}^na_iX_i \sim \cN(0,|a|_2^2 \sigma^2). [[/math]]

If we only care about the tails, this property is preserved for sub-Gaussian random variables.

Theorem

Let [math]X=(X_1, \ldots, X_n)[/math] be a vector of independent sub-Gaussian random variables that have variance proxy [math]\sigma^2[/math]. Then, the random vector [math]X[/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math].

Show Proof

Let [math]u \in \cS^{n-1}[/math] be a unit vector, then

[[math]] \E[e^{su^\top X}]=\prod_{i=1}^n \E[e^{su_i X_i}]\le \prod_{i=1}^ne^{\frac{\sigma^2s^2u_i^2}{2}}=e^{\frac{\sigma^2s^2|u|_2^2}{2}}=e^{\frac{\sigma^2s^2}{2}}\,. [[/math]]

■

Using a Chernoff bound, we immediately get the following corollary

Corollary

Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent random variables such that [math]X_i \sim \sg( \sigma^2)[/math]. Then for any [math]a \in \R^n[/math], we have

[[math]] \p\Big[\sum_{i=1}^n a_i X_i \gt t\Big]\le \exp\Big(-\frac{t^2}{2\sigma^2|a|_2^2}\Big)\,, [[/math]]

and

[[math]] \p\Big[\sum_{i=1}^n a_i X_i \lt -t\Big]\le \exp\Big(-\frac{t^2}{2\sigma^2|a|_2^2}\Big) [[/math]]

Of special interest is the case where [math]a_i=1/n[/math] for all [math]i[/math]. Then, we get that the average [math]\bar X=\frac{1}{n}\sum_{i=1}^n X_i[/math], satisfies

[[math]] \p(\bar X \gt t) \le e^{-\frac{nt^2}{2\sigma^2}} \quad \text{and} \quad \p(\bar X \lt -t) \le e^{-\frac{nt^2}{2\sigma^2}} [[/math]]

just like for the Gaussian average.

Hoeffding's inequality

The class of sub-Gaussian random variables is actually quite large. Indeed, Hoeffding's lemma below implies that all random variables that are bounded uniformly are actually sub-Gaussian with a variance proxy that depends on the size of their support.

Lemma (Hoeffding's lemma (1963))

Let [math]X[/math] be a random variable such that [math]\E(X)=0[/math] and [math]X \in [a,b][/math] almost surely. Then, for any [math]s \in \R[/math], it holds

[[math]] \E[e^{sX}]\le e^{\frac{s^2(b-a)^2}{8}}\,. [[/math]]

In particular, [math]X \sim \sg(\frac{(b-a)^2}{4})[/math]\,.

Show Proof

Define [math]\psi(s)=\log \E[e^{s X}][/math], and observe that and we can readily compute

[[math]] \psi'(s)=\frac{ \E[Xe^{s X}] }{\E[e^{s X}]},\qquad \psi''(s)= \frac{ \E[X^2e^{s X}] }{\E[e^{s X}]}- \bigg[ \frac{ \E[Xe^{s X}] }{\E[e^{s X}]} \bigg]^2. [[/math]]

Thus [math]\psi''(s)[/math] can be interpreted as the variance of the random variable [math]X[/math] under the probability measure [math]d\mathbb{Q}=\frac{e^{s X}}{\E[e^{s X}]}d\p[/math]. But since [math]X \in [a,b][/math] almost surely, we have, under any probability,

[[math]] \var(X)=\var\big(X-\frac{a+b}{2}\big)\le \E\Big[\Big(X-\frac{a+b}{2}\Big)^2\Big] \le \frac{(b-a)^2}{4}\,. [[/math]]

The fundamental theorem of calculus yields

[[math]] \psi(s) = \int_0^s\int_0^\mu \psi''(\rho)\,d\rho\,d\mu \le \frac{s^2(b-a)^2}{8} [[/math]]

using [math]\psi(0)=\log 1=0[/math] and [math]\psi'(0)=\E X=0[/math].

■

Using a Chernoff bound, we get the following (extremely useful) result.

Theorem (Hoeffding's inequality)

Let [math]X_1, \ldots, X_n[/math] be [math]n[/math] independent random variables such that almost surely,

[[math]] X_i \in [a_i, b_i]\,, \qquad \forall\, i. [[/math]]

Let [math]\bar X=\frac{1}{n}\sum_{i=1}^n X_i[/math], then for any [math]t \gt 0[/math],

[[math]] \p (\bar X - \E(\bar X) \gt t) \leq \exp \Big(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big), [[/math]]

and

[[math]] \p(\bar X - \E(\bar X) \lt -t) \leq \exp \Big(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big)\,. [[/math]]

Note that Hoeffding's lemma holds for any bounded random variable. For example, if one knows that [math]X[/math] is a Rademacher random variable,

[[math]] \E(e^{sX})=\frac{e^s+e^{-s}}{2}=\cosh(s)\le e^{\frac{s^2}{2}}. [[/math]]

Note that 2 is the best possible constant in the above approximation. For such variables, [math]a=-1, b=1[/math], and [math]\E(X)=0[/math], so Hoeffding's lemma yields

[[math]] \E(e^{sX})\le e^{\frac{s^2}{2}}\,. [[/math]]

Hoeffding's inequality is very general but there is a price to pay for this generality. Indeed, if the random variables have small variance, we would like to see it reflected in the exponential tail bound (as for the Gaussian case) but the variance does not appear in Hoeffding's inequality. We need a more refined inequality.

Sub-exponential random variables

What can we say when a centered random variable is not sub-Gaussian? A typical example is the double exponential (or Laplace) distribution with parameter 1, denoted by [math]\Lap(1)[/math]. Let [math]X \sim \Lap(1)[/math] and observe that

[[math]] \p(|X| \gt t)=e^{-t}, \quad t \ge 0\,. [[/math]]

In particular, the tails of this distribution do not decay as fast as the Gaussian ones (that decays as [math]e^{-t^2/2}[/math]). Such tails are said to be heavier than Gaussian. This tail behavior is also captured by the moment generating function of [math]X[/math]. Indeed, we have

[[math]] \E\big[e^{sX}\big]=\frac{1}{1-s^2} \quad \text{if} \ |s| \lt 1\,, [[/math]]

and it is not defined for [math]s \ge 1[/math]. It turns out that a rather weak condition on the moment generating function is enough to partially reproduce some of the bounds that we have proved for sub-Gaussian random variables. Observe that for [math]X \sim \Lap(1)[/math]

[[math]] \E\big[e^{sX}\big]\le e^{2s^2}\quad \text{if} \ |s| \lt 1/2\,, [[/math]]

In particular, the moment generating function of the Laplace distribution is bounded by that of a Gaussian in a neighborhood of 0 but does not even exist away from zero. It turns out that all distributions that have tails at most as heavy as that of a Laplace distribution satisfy such a property.

Lemma

Let [math]X[/math] be a centered random variable such that [math]\p(|X| \gt t)\le 2e^{-2t/\lambda}[/math] for some [math]\lambda \gt 0[/math]. Then, for any positive integer [math]k\ge 1[/math],

[[math]] \E [|X|^k] \le \lambda^{k}k!\,. [[/math]]

Moreover,

[[math]] \big(\E [|X|^k])^{1/k} \le 2\lambda k\,, [[/math]]

and the moment generating function of [math]X[/math] satisfies

[[math]] \E\big[e^{sX}\big]\le e^{2s^2\lambda^2}\,, \qquad \forall |s| \le \frac{1}{2\lambda}\,. [[/math]]

Show Proof

[[math]] \begin{align*} \E [|X|^k]&=\int_0^\infty \p(|X|^k \gt t)\ud t&\\ &=\int_0^\infty \p(|X| \gt t^{1/k})\ud t&\\ &\le \int_0^\infty 2e^{-\frac{2t^{1/k}}{\lambda}}\ud t&\\ &=2(\lambda/2)^{k}k\int_0^\infty e^{-u}u^{k-1}\ud u\,,& u=\frac{2t^{1/k}}{\lambda}\\ &\le \lambda^{k}k\Gamma(k)=\lambda^{k}k!& \end{align*} [[/math]]

The second statement follows from [math]\Gamma(k)\le k^k[/math] and [math]k^{1/k} \le e^{1/e}\le 2[/math] for any [math]k \ge 1[/math]. It yields

[[math]] \big(\lambda^{k}k\Gamma(k)\big)^{1/k}\le 2\lambda k\,. [[/math]]

To control the MGF of [math]X[/math], we use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem, for any [math]s[/math] such that [math]|s|\le 1/2\lambda[/math]

[[math]] \begin{align*} \E\big[e^{sX}\big]&\le 1+\sum_{k=2}^\infty \frac{|s|^k\E[|X|^k]}{k!}&\\ &\le 1+\sum_{k=2}^\infty(|s|\lambda)^{k}&\\ &= 1+s^2\lambda^2\sum_{k=0}^\infty(|s|\lambda)^{k}&\\ &\le 1+ 2s^2\lambda^2& |s|\le \frac{1}{2\lambda}\\ &\le e^{2s^2\lambda^2} \end{align*} [[/math]]

■

This leads to the following definition

Definition

A random variable [math]X[/math] is said to be sub-exponential with parameter [math]\lambda[/math] (denoted [math]X \sim \subE(\lambda)[/math]) if [math]\E[X]=0[/math] and its moment generating function satisfies

[[math]] \E\big[e^{sX}\big]\le e^{s^2\lambda^2/2}\,, \qquad \forall |s| \le \frac{1}{\lambda}\,. [[/math]]

A simple and useful example of of a sub-exponential random variable is given in the next lemma.

Lemma

Let [math]X\sim\sg(\sigma^2)[/math] then the random variable [math]Z=X^2-\E[X^2][/math] is sub-exponential: [math]Z\sim \subE(16\sigma^2)[/math].

Show Proof

We have, by the dominated convergence theorem,

[[math]] \begin{align*} \E[e^{sZ}]&=1+\sum_{k=2}^\infty\frac{s^k\E\big[X^2-\E[X^2]\big]^k}{k!}\\ &\le 1+\sum_{k=2}^\infty\frac{s^k2^{k-1}\big(\E[X^{2k}]+(\E[X^2])^{k}\big)}{k!}&\text{(Jensen)}\\ &\le 1+\sum_{k=2}^\infty\frac{s^k4^{k}\E[X^{2k}]}{2(k!)}&\text{(Jensen again)}\\ &\le 1+\sum_{k=2}^\infty\frac{s^k4^{k}2(2\sigma^2)^kk!}{2(k!)}& \text{(Lemma \ref{LEM:subgauss_moments})}\\ &=1+(8s\sigma^2)^2\sum_{k=0}^\infty(8s\sigma^2)^k&\\ &=1+128s^2\sigma^4&\text{for} \quad |s|\le \frac{1}{16\sigma^2}\\ &\le e^{128s^2\sigma^4}\,. \end{align*} [[/math]]

■

Sub-exponential random variables also give rise to exponential deviation inequalities such as Corollary (Chernoff bound) or Theorem (Hoeffding's inequality) for weighted sums of independent sub-exponential random variables. The significant difference here is that the larger deviations are controlled in by a weaker bound.

Bernstein's inequality

Theorem (Bernstein's inequality)

Let [math]X_1,\ldots,X_n[/math] be independent random variables such that [math]\E(X_i) = 0[/math] and [math]X_i \sim \subE(\lambda)[/math]. Define

[[math]] \bar X = \frac{1}{n}\sum_{i=1}^n X_i\,, [[/math]]

Then for any [math]t \gt 0[/math] we have

[[math]] \p(\bar X \gt t) \vee \p(\bar X \lt -t)\leq \exp \left[- \frac{n}{2}(\frac{t^2}{\lambda^2}\wedge \frac{t}{\lambda})\right]. [[/math]]

Show Proof

Without loss of generality, assume that [math]\lambda=1[/math] (we can always replace [math]X_i[/math] by [math]X_i/\lambda[/math] and [math]t[/math] by [math]t/\lambda[/math]). Next, using a Chernoff bound, we get for any [math]s \gt 0[/math]

[[math]] \p(\bar X \gt t)\le \prod_{i=1}^n\E\big[e^{sX_i}\big]e^{-snt}\,. [[/math]]

Next, if [math]|s|\le 1[/math], then [math]\E\big[e^{sX_i}\big] \le e^{s^2/2}[/math] by definition of sub-exponential distributions. It yields

[[math]] \p(\bar X \gt t)\le e^{\frac{ns^2}{2}-snt} [[/math]]

Choosing [math]s=1\wedge t[/math] yields

[[math]] \p(\bar X \gt t)\le e^{-\frac{n}{2}(t^2\wedge t)} [[/math]]

We obtain the same bound for [math]\p(\bar X \lt -t)[/math] which concludes the proof.

■

Note that usually, Bernstein's inequality refers to a slightly more precise result that is qualitatively the same as the one above: it exhibits a Gaussian tail [math]e^{-nt^2/(2\lambda^2)}[/math] and an exponential tail [math]e^{-nt/(2\lambda)}[/math]. See for example Theorem2.10 in ^[1].

Maximal inequalities

The exponential inequalities of the previous section are valid for linear combinations of independent random variables, and, in particular, for the average [math]\bar X[/math]. In many instances, we will be interested in controlling the maximum over the parameters of such linear combinations (this is because of empirical risk minimization). The purpose of this section is to present such results.

Maximum over a finite set

We begin by the simplest case possible: the maximum over a finite set.

Theorem

Let [math]X_1, \ldots, X_N[/math] be [math]N[/math] random variables such that

[[math]] X_i \sim \sg( \sigma^2). [[/math]]

Then,

[[math]] \E[\max_{1\le i \le N}X_i]\le \sigma\sqrt{2\log (N)}\,, \qquad \text{and}\qquad \E[\max_{1\le i \le N}|X_i|]\le \sigma\sqrt{2\log (2N)} [[/math]]

Moreover, for any [math]t \gt 0[/math],

[[math]] \p\big(\max_{1\le i \le N}X_i \gt t\big)\le Ne^{-\frac{t^2}{2\sigma^2 }}\,, \qquad \text{and}\qquad \p\big(\max_{1\le i \le N}|X_i| \gt t\big)\le 2Ne^{-\frac{ t^2}{2\sigma^2}} [[/math]]

Note that the random variables in this theorem need not be independent.

Show Proof

For any [math]s \gt 0[/math],

[[math]] \begin{align*} \E[\max_{1\le i \le N}X_i]&=\frac{1}{s}\E\big[\log e^{s\max_{1\le i \le N}X_i}\big]&\\ &\le \frac{1}{s}\log \E\big[e^{s\max_{1\le i \le N}X_i}\big]&\text{(by Jensen)}\\ &= \frac{1}{s}\log\E\big[\max_{1\le i \le N}e^{sX_i}\big]&\\ &\le \frac{1}{s}\log\sum_{1\le i \le N}\E\big[e^{sX_i}\big]&\\ &\le \frac{1}{s}\log\sum_{1\le i \le N}e^{\frac{\sigma^2s^2}{2}}&\\ &=\frac{\log N}{s}+ \frac{\sigma^2s}{2}.\\ \end{align*} [[/math]]

Taking [math]s=\sqrt{2(\log N)/\sigma^2}[/math] yields the first inequality in expectation. The first inequality in probability is obtained by a simple union bound:

[[math]] \begin{align*} \p\big(\max_{1\le i \le N}X_i \gt t\big)&=\p\Big(\bigcup_{1\le i\le N} \{X_i \gt t\}\Big)\\ &\le \sum_{1\le i \le N}\p(X_i \gt t)\\ & \le Ne^{-\frac{t^2}{2\sigma^2 }}\,, \end{align*} [[/math]]

where we used Lemma in the last inequality. The remaining two inequalities follow trivially by noting that

[[math]] \max_{1\le i \le N}|X_i|=\max_{1\le i \le 2N}X_i\,, [[/math]]

where [math]X_{N+i}=-X_i[/math] for [math]i=1, \ldots, N[/math].

■

Extending these results to a maximum over an infinite set may be impossible. For example, if one is given an infinite sequence of \\iid [math]\cN(0, \sigma^2)[/math] random variables [math]X_1, X_2, \ldots[/math], then for any [math]N \ge 1[/math], we have for any [math]t \gt 0[/math],

[[math]] \p(\max_{1\le i \le N} X_i \lt t)=[\p(X_1 \lt t)]^N\to 0\,, \quad N \to \infty\,. [[/math]]

On the opposite side of the picture, if all the [math]X_i[/math]s are equal to the same random variable [math]X[/math], we have for any [math]t \gt 0[/math],

[[math]] \p(\max_{1\le i \le N} X_i \lt t)=\p(X_1 \lt t) \gt 0 \quad \forall\: N \ge 1\,. [[/math]]

In the Gaussian case, lower bounds are also available. They illustrate the effect of the correlation between the [math]X_i[/math]s. Examples from statistics have structure and we encounter many examples where a maximum of random variables over an infinite set is in fact finite. This is due to the fact that the random variables that we are considering are not independent of each other. In the rest of this section, we review some of these examples.

Maximum over a convex polytope

We use the definition of a polytope from ^[2]: a convex polytope [math]\mathsf{P}[/math] is a compact set with a finite number of vertices [math]\cV(\mathsf{P})[/math] called extreme points. It satisfies [math]\mathsf{P}=\conv(\cV(\mathsf{P}))[/math], where [math]\conv(\cV(\mathsf{P}))[/math] denotes the convex hull of the vertices of [math]\mathsf{P}[/math]. Let [math]X \in \R^d[/math] be a random vector and consider the (infinite) family of random variables

[[math]] \cF=\{\theta^\top X\,:\,\theta \in \mathsf{P}\}\,, [[/math]]

where [math]\mathsf{P} \subset \R^d[/math] is a polytope with [math]N[/math] vertices. While the family [math]\cF[/math] is infinite, the maximum over [math]\cF[/math] can be reduced to the a finite maximum using the following useful lemma.

Lemma

Consider a linear form [math]x \mapsto c^\top x[/math], [math]x, c \in \R^d[/math]. Then for any convex polytope [math]\mathsf{P} \subset \R^d[/math],

[[math]] \max_{x \in \mathsf{P}}c^\top x=\max_{x \in \cV(\mathsf{P})}c^\top x [[/math]]

where [math]\cV(\mathsf{P})[/math] denotes the set of vertices of [math]\mathsf{P}[/math].

Show Proof

Assume that [math]\cV(\mathsf{P})=\{v_1, \ldots, v_N\}[/math]. For any [math]x \in \mathsf{P}=\conv(\cV(\mathsf{P}))[/math], there exist nonnegative numbers [math]\lambda_1, \ldots \lambda_N[/math] that sum up to 1 and such that [math]x=\lambda_1 v_1 + \cdots + \lambda_N v_N[/math]. Thus

[[math]] c^\top x=c^\top\Big(\sum_{i=1}^N \lambda_i v_i\Big)=\sum_{i=1}^N \lambda_ic^\top v_i\le \sum_{i=1}^N \lambda_i\max_{x \in \cV(\mathsf{P})}c^\top x=\max_{x \in \cV(\mathsf{P})}c^\top x\,. [[/math]]

It yields

[[math]] \max_{x \in \mathsf{P}}c^\top x\le \max_{x \in \cV(\mathsf{P})}c^\top x\le \max_{x \in \mathsf{P}}c^\top x [[/math]]

so the two quantities are equal.

■

It immediately yields the following theorem

Theorem

Let [math]\mathsf{P}[/math] be a polytope with [math]N[/math] vertices [math]v^{(1)}, \ldots, v^{(N)} \in \R^d[/math] and let [math]X \in \R^d[/math] be a random vector such that [math][v^{(i)}]^\top X[/math], [math]i=1, \ldots, N[/math], are sub-Gaussian random variables with variance proxy [math]\sigma^2[/math]. Then

[[math]] \E[\max_{\theta \in \mathsf{P}} \theta^\top X]\le \sigma \sqrt{2\log (N)}\,, \qquad \text{and}\qquad \E[\max_{\theta \in \mathsf{P}} |\theta^\top X|]\le \sigma \sqrt{2\log (2N)}\,. [[/math]]

Moreover, for any [math]t \gt 0[/math],

[[math]] \p\big(\max_{\theta \in \mathsf{P}} \theta^\top X \gt t\big)\le Ne^{-\frac{ t^2}{2\sigma^2}}\,, \qquad \text{and}\qquad \p\big(\max_{\theta \in \mathsf{P}}| \theta^\top X| \gt t\big)\le 2Ne^{-\frac{t^2}{2\sigma^2}} [[/math]]

Of particular interest are polytopes that have a small number of vertices. A primary example is the [math]\ell_1[/math] ball of [math]\R^d[/math] defined by

[[math]] \cB_1=\Big\{x \in \R^d\,:\, \sum_{i=1}^d|x_i|\le 1\Big\}\,. [[/math]]

Indeed, it has exactly [math]2d[/math] vertices.

Maximum over the [math]\ell_2[/math] ball

Recall that the unit [math]\ell_2[/math] ball of [math]\R^d[/math] is defined by the set of vectors [math]u[/math] that have Euclidean norm [math]|u|_2[/math] at most 1. Formally, it is defined by

[[math]] \cB_2=\Big\{x \in \R^d\,:\, \sum_{i=1}^d x_i^2\le 1\Big\}\,. [[/math]]

Clearly, this ball is not a polytope and yet, we can control the maximum of random variables indexed by [math]\cB_2[/math]. This is due to the fact that there exists a finite subset of [math]\cB_2[/math] such that the maximum over this finite set is of the same order as the maximum over the entire ball.

Definition

Fix [math]K\subset \R^d[/math] and [math]\eps \gt 0[/math]. A set [math]\cN[/math] is called an [math]\eps[/math]-net of [math]K[/math] with respect to a distance [math]d(\cdot, \cdot)[/math] on [math]\R^d[/math], if [math]\cN \subset K[/math] and for any [math]z \in K[/math], there exists [math]x \in \cN[/math] such that [math]d(x,z)\le \eps[/math].

Therefore, if [math]\cN[/math] is an [math]\eps[/math]-net of [math]K[/math] with respect to a norm [math]\|\cdot\|[/math], then every point of [math]K[/math] is at distance at most [math]\eps[/math] from a point in [math]\cN[/math]. Clearly, every compact set admits a finite [math]\eps[/math]-net. The following lemma gives an upper bound on the size of the smallest [math]\eps[/math]-net of [math]\cB_2[/math].

Lemma

Fix [math]\eps\in(0,1)[/math]. Then the unit Euclidean ball [math]\cB_2[/math] has an [math]\eps[/math]-net [math]\cN[/math] with respect to the Euclidean distance of cardinality [math]|\cN|\le (3/\eps)^d[/math]

Show Proof

Consider the following iterative construction if the [math]\eps[/math]-net. Choose [math]x_1=0[/math]. For any [math]i\ge 2[/math], take [math]x_i[/math] to be any [math]x \in \cB_2[/math] such that [math]|x -x_j|_2 \gt \eps[/math] for all [math]j \lt i[/math]. If no such [math]x[/math] exists, stop the procedure. Clearly, this will create an [math]\eps[/math]-net. We now control its size. Observe that since [math]|x -y|_2 \gt \eps[/math] for all [math]x,y \in \cN[/math], the Euclidean balls centered at [math]x \in \cN[/math] and with radius [math]\eps/2[/math] are disjoint. Moreover,

[[math]] \bigcup_{z \in \cN} \{z+\frac{\eps}{2} \cB_2\} \subset (1+\frac{\eps}{2}) \cB_2 [[/math]]

where [math]\{z+\eps\cB_2\}=\{z+\eps x\,, x \in \cB_2\}[/math]. Thus, measuring volumes, we get

[[math]] \vol\big((1+\frac{\eps}{2}) \cB_2\big) \ge \vol\Big( \bigcup_{z \in \cN} \{z+\frac{\eps}{2} \cB_2\}\Big)=\sum_{z \in \cN}\vol\big(\{z+\frac{\eps}{2} \cB_2\}\big) [[/math]]

This is equivalent to

[[math]] \Big(1+\frac{\eps}{2}\Big)^d \ge |\cN|\Big(\frac{\eps}{2}\Big)^d\,. [[/math]]

Therefore, we get the following bound

[[math]] |\cN|\le \Big(1+\frac{2}{\eps}\Big)^d\le \Big(\frac{3}{\eps}\Big)^d\,. [[/math]]

■

Theorem

Let [math]X \in \R^d[/math] be a sub-Gaussian random vector with variance proxy [math]\sigma^2[/math]. Then

[[math]] \E[\max_{\theta \in \cB_2} \theta^\top X]=\E[\max_{\theta \in \cB_2} |\theta^\top X|]\le 4\sigma\sqrt{d}\,. [[/math]]

Moreover, for any [math]\delta \gt 0[/math], with probability [math]1-\delta[/math], it holds

[[math]] \max_{\theta \in \cB_2} \theta^\top X=\max_{\theta \in \cB_2}| \theta^\top X|\le 4\sigma\sqrt{d}+2\sigma\sqrt{2\log(1/\delta)}\,. [[/math]]

Show Proof

Let [math]\cN[/math] be a [math]1/2[/math]-net of [math]\cB_2[/math] with respect to the Euclidean norm that satisfies [math]|\cN|\le 6^d[/math]. Next, observe that for every [math]\theta \in \cB_2[/math], there exists [math]z \in \cN[/math] and [math]x[/math] such that [math]|x|_2\le 1/2[/math] and [math]\theta=z+x[/math]. Therefore,

[[math]] \max_{\theta \in \cB_2} \theta^\top X\le \max_{z \in \cN}z^\top X+ \max_{x \in \frac12 \cB_2} x^\top X [[/math]]

But

[[math]] \max_{x \in \frac12 \cB_2} x^\top X=\frac{1}{2}\max_{x \in \cB_2} x^\top X [[/math]]

Therefore, using Theorem, we get

[[math]] \E[\max_{\theta \in \cB_2} \theta^\top X]\le 2\E[\max_{z \in \cN}z^\top X] \le 2\sigma \sqrt{2\log (|\cN|)}\le 2\sigma \sqrt{2(\log 6)d}\le 4\sigma\sqrt{d}\,. [[/math]]

\medskip The bound with high probability then follows because

[[math]] \p\big(\max_{\theta \in \cB_2} \theta^\top X \gt t\big)\le \p\big(2\max_{z \in \cN}z^\top X \gt t\big)\le |\cN|e^{-\frac{t^2}{8\sigma^2 }}\le 6^de^{-\frac{t^2}{8\sigma^2 }}\,. [[/math]]

To conclude the proof, we find [math]t[/math] such that

[[math]] e^{-\frac{t^2}{8\sigma^2 }+d\log(6)}\le \delta \ \Leftrightarrow \ t^2 \ge 8\log(6)\sigma^2d+8\sigma^2\log(1/\delta)\,. [[/math]]

Therefore, it is sufficient to take [math]t=\sqrt{8\log(6)}\sigma\sqrt{d}+2\sigma\sqrt{2\log(1/\delta)}[/math]\,.

■

Sums of independent random matrices

In this section, we are going to explore how concentration statements can be extended to sums of matrices. As an example, we are going to show a version of Bernstein's inequality, for sums of independent matrices, closely following the presentation in ^[3], which builds on previous work in ^[4]. Results of this type have been crucial in providing guarantees for low rank recovery, see for example ^[5]. In particular, we want to control the maximum eigenvalue of a sum of independent random symmetric matrices [math] \p(\lambda_{\mathrm{max}}(\sum_i \bX_i) \gt t)[/math]. The tools involved will closely mimic those used for scalar random variables, but with the caveat that care must be taken when handling exponentials of matrices because [math] e^{A + B} \neq e^{A} e^{B} [/math] if [math] A, B \in \R^d [/math] and [math] AB \neq BA [/math].

Preliminaries

We denote the set of symmetric, positive semi-definite, and positive definite matrices in [math] \R^{d \times d} [/math] by [math] \cS_d, \cS_d^{+} [/math], and [math] \cS_d^{++} [/math], respectively, and will omit the subscript when the dimensionality is clear. Here, positive semi-definite means that for a matrix [math] \bX \in \cS^+ [/math], [math] v^\top \bX v \geq 0 [/math] for all [math] v \in \R^d [/math], [math] |v|_2 = 1 [/math], and positive definite means that the equality holds strictly. This is equivalent to all eigenvalues of [math] \bX [/math] being larger or equal (strictly larger, respectively) than 0, [math] \lambda_j(\bX) \geq 0 [/math] for all [math] j = 1, \dots, d [/math]. The cone of positive definite matrices induces an order on matrices by setting [math] \bA \preceq \bB [/math] if [math] \bB - \bA \in \cS^+ [/math]. Since we want to extend the notion of exponentiating random variables that was essential to our derivation of the Chernoff bound to matrices, we will make use of the matrix exponential and the matrix logarithm. For a symmetric matrix [math] \bX [/math], we can define a functional calculus by applying a function [math] f : \R \to \R [/math] to the diagonal elements of its spectral decomposition, \ie, if [math] \bX = \bQ \Lambda \bQ^\top [/math], then

[[math]] \begin{equation} \label{eq:funccalc} f(\bX) := \bQ f(\Lambda) \bQ^\top, \end{equation} [[/math]]

where

[[math]] \begin{equation} \label{eq:al} [f(\Lambda)]_{i, j} = f(\Lambda_{i, j}), \quad i, j \in [d], \end{equation} [[/math]]

is only non-zero on the diagonal and [math] f(\bX) [/math] is well-defined because the spectral decomposition is unique up to the ordering of the eigenvalues and the basis of the eigenspaces, with respect to both of which \eqref{eq:funccalc} is invariant, as well. From the definition, it is clear that for the spectrum of the resulting matrix,

[[math]] \begin{equation} \label{eq:specthm} \sigma(f(\bX)) = f(\sigma(\bX)). \end{equation} [[/math]]

While inequalities between functions do not generally carry over to matrices, we have the following transfer rule:

[[math]] \begin{equation} \label{eq:g} f(a) \leq g(a) \text{ for } a \in \sigma(\bX) \implies f(\bX) \preceq g(\bX). \end{equation} [[/math]]

We can now define the matrix exponential by \eqref{eq:funccalc} which is equivalent to defining it via a power series,

[[math]] \begin{equation*} \exp(\bX) = \sum_{k=1}^{\infty} \frac{1}{k!} \bX^k. \end{equation*} [[/math]]

Similarly, we can define the matrix logarithm as the inverse function of [math] \exp [/math] on [math] \cS [/math], [math] \log(e^{A}) = A [/math], which defines it on [math] \cS^+ [/math]. Some nice properties of these functions on matrices are their monotonicity: For [math] \bA \preceq \bB [/math],

[[math]] \begin{equation} \label{eq:h} \tr \exp \bA \leq \tr \exp \bB, \end{equation} [[/math]]

and for [math] 0 \prec \bA \preceq \bB [/math],

[[math]] \begin{equation} \label{eq:i} \log \bA \preceq \log \bB. \end{equation} [[/math]]

Note that the analog of \eqref{eq:i} is in general not true for the matrix exponential.

The Laplace transform method

For the remainder of this section, let [math] \bX_1, \dots, \bX_n \in \R^{d \times d} [/math] be independent random symmetric matrices. As a first step that can lead to different types of matrix concentration inequalities, we give a generalization of the Chernoff bound for the maximum eigenvalue of a matrix.

Proposition

Let [math] \bY [/math] be a random symmetric matrix. Then, for all [math] t \in \R [/math],

[[math]] \begin{equation*} \p ( \lambda_{\mathrm{max}}(\bY) \geq t) \leq \inf_{\theta \gt 0} \{ e^{- \theta t} \E[ \tr \, e^{\theta \bY} ]\} \end{equation*} [[/math]]

Show Proof

We multiply both sides of the inequality [math] \lambda_{\mathrm{max}}(\bY) \geq t [/math] by [math] \theta [/math], take exponentials, apply the spectral theorem \eqref{eq:specthm} and then estimate the maximum eigenvalue by the sum over all eigenvalues, the trace.

[[math]] \begin{align*} \label{eq:f} \p( \lambda_{\mathrm{max}}(\bY) \geq t ) = {} & \p( \lambda_{\mathrm{max}}(\theta \bY) \geq \theta t )\\ = {} & \p( e^{\lambda_{\mathrm{max}}(\theta \bY)} \geq e^{\theta t} )\\ \leq {} & e^{-\theta t} \E[ e^{\lambda_{\mathrm{max}}(\theta \bY)} ]\\ = {} & e^{-\theta t} \E[ \lambda_{\mathrm{max}}(e^{\theta \bY}) ]\\ \leq {} & e^{-\theta t} \E[ \tr (e^{\theta \bY}) ]. \end{align*} [[/math]]

■

Recall that the crucial step in proving Bernstein's and Hoeffding's inequality was to exploit the independence of the summands by the fact that the exponential function turns products into sums.

[[math]] \begin{equation*} \E[ e^{ \sum_{i} \theta X_i } ] = \E[ \prod_{i} e^{\theta X_i} ] = \prod_i \E [ e^{\theta X_i}]. \end{equation*} [[/math]]

This property, [math] e^{A + B} = e^{A}e^{B} [/math], no longer holds true for matrices, unless they commute. We could try to replace it with a similar property, the Golden-Thompson inequality,

[[math]] \begin{equation*} \tr [e^{\theta (\bX_1 + \bX_2)}] \leq \tr [e^{\theta \bX_1} e^{\theta \bX_2}]. \end{equation*} [[/math]]

Unfortunately, this does not generalize to more than two matrices, and when trying to peel off factors, we would have to pull a maximum eigenvalue out of the trace,

[[math]] \begin{equation*} \tr [e^{\theta \bX_1} e^{\theta \bX_2}] \leq \lambda_{\mathrm{max}}(e^{\theta \bX_2}) \tr [e^{\theta \bX_1}]. \end{equation*} [[/math]]

This is the approach followed by Ahlswede-Winter ^[4], which leads to worse constants for concentration than the ones we obtain below. Instead, we are going to use the following deep theorem due to Lieb ^[6]. A sketch of the proof can be found in the appendix of ^[7].

Theorem (Lieb)

Let [math] \bH \in \R^{d \times d} [/math] be symmetric. Then,

[[math]] \begin{equation*} \cS^{+}_d \to \R, \quad \bA \mapsto \tr e^{\bH + \log \bA} \end{equation*} [[/math]]

is a concave function.

Corollary

Let [math] \bX, \bH \in \mathcal{S}_d [/math] be two symmetric matrices such that [math] \bX [/math] is random and [math] \bH [/math] is fixed. Then,

[[math]] \begin{equation*} \E [\tr e^{\bH + \bX}] \leq \tr e^{\bH + \log \E[e^{\bX}]}. \end{equation*} [[/math]]

Show Proof

Write [math] \bY = e^{\bX} [/math] and use Jensen's inequality:

[[math]] \begin{align*} \label{eq:e} \E[\tr e^{\bH+\bX}] = {} & \E[ \tr e^{\bH + \log \bY}]\\ \leq {} & \tr e^{\bH + \log(\E[\bY])}\\ = {} & \tr e^{\bH + \log \E [e^\bX]} \end{align*} [[/math]]

■

With this, we can establish a better bound on the moment generating function of sums of independent matrices.

Lemma

Let [math] \bX_1, \dots, \bX_n [/math] be [math] n [/math] independent, random symmetric matrices. Then,

[[math]] \begin{equation*} \E[ \tr \exp(\sum_i \theta \bX_i)] \leq \tr \exp(\sum_i \log \E e^{\theta \bX_i}) \end{equation*} [[/math]]

Show Proof

Without loss of generality, we can assume [math] \theta = 1 [/math]. Write [math] \E_k [/math] for the expectation conditioned on [math] \bX_1, \dots, \bX_k [/math], [math] \E_k[\,.\,] = \E[\,.\,|\bX_1, \dots, \bX_k] [/math]. By the tower property,

[[math]] \begin{align*} \E[\tr \exp (\sum_{i = 1}^{n} \bX_i)] = {} & \E_0 \dots \E_{n-1} \tr \exp (\sum_{i = 1}^{n} \bX_i)\\ \leq {} & \E_0 \dots \E_{n-2} \tr \exp \big( \sum_{i = 1}^{n-1} \bX_i + \log \underbrace{\E_{n-1} e^{\bX_n}}_{= \E e^{\bX_n}} \big)\\ \vdots\\ \leq {} & \tr \exp ( \sum_{i = 1}^{n}\log \E[e^{\bX_i}]), \end{align*} [[/math]]

where we applied Lieb's theorem, Theorem on the conditional expectation and then used the independence of [math] \bX_1, \dots, \bX_n [/math].

■

Tail bounds for sums of independent matrices

Theorem (Master tail bound)

For all [math] t \in \R [/math],

[[math]] \begin{equation*} \p ( \lambda_{\mathrm{max}}( \sum_{i} \bX_i ) \geq t ) \leq \inf_{\theta \gt 0}\{ e^{-\theta t} \tr \exp ( \sum_{i} \log \E e^{\theta \bX_i})\} \end{equation*} [[/math]]

Show Proof

Combine Lemma with Proposition.

■

Corollary

Assume that there is a function [math] g : (0, \infty) \to [0, \infty] [/math] and fixed symmetric matrices [math] \bA_i [/math] such that [math] \E[e^{\theta \bX_i}] \preceq e^{g(\theta) \bA_i} [/math] for all [math] i [/math]. Set

[[math]] \begin{equation*} \rho = \lambda_{\mathrm{max}} ( \sum_{i} \bA_k ). \end{equation*} [[/math]]

Then, for all [math] t \in \R [/math],

[[math]] \begin{equation*} \p ( \lambda_{\mathrm{max}}( \sum_{i} \bX_i ) \geq t) \leq d \inf_{\theta \gt 0}\{ e^{- \theta t + g(\theta) \rho}\}. \end{equation*} [[/math]]

Show Proof

Using the operator monoticity of [math] \log [/math], \eqref{eq:i}, and the monotonicity of [math] \tr \exp [/math], \eqref{eq:h}, we can plug the estimates for the matrix mgfs into the master inequality, Theorem Theorem, to obtain

[[math]] \begin{align*} \p( \lambda_{\mathrm{max}}( \sum_{i} \bX_i ) \geq t ) \leq {} & e^{- \theta t} \tr \exp ( g(\theta) \sum_{i} \bA_i)\\ \leq {} & d e^{-\theta t} \lambda_{\mathrm{max}} (\exp(g(\theta) \sum_{i} \bA_i))\\ = {} & d e^{-\theta t} \exp(g(\theta) \lambda_{\mathrm{max}} (\sum_{i} \bA_i)), \end{align*} [[/math]]

where we estimated [math] \tr(\bX) = \sum_{j=1}^{d} \lambda_j \leq d \lambda_{\mathrm{max}} [/math] and used the spectral theorem.

■

Now we are ready to prove Bernstein's inequality We just need to come up with a dominating function for Corollary Corollary.

Lemma

If [math] \E[\bX] = 0 [/math] and [math] \lambda_{\mathrm{max}}(\bX) \leq 1 [/math], then

[[math]] \begin{equation*} \E[e^{\theta \bX}] \preceq \exp((e^\theta - \theta - 1) \E[\bX^2]). \end{equation*} [[/math]]

Show Proof

Define

[[math]] \begin{equation*} f(x) = \left\{ \begin{aligned} \frac{e^{\theta x}-\theta x - 1}{x^2}, {} \quad & x \neq 0,\\ \frac{\theta^2}{2}, \quad & x = 0, \end{aligned} \right. \end{equation*} [[/math]]

and verify that this is a smooth and increasing function on [math] \R [/math]. Hence, [math] f(x) \leq f(1) [/math] for [math] x \leq 1 [/math]. By the transfer rule, \eqref{eq:g}, [math] f(\bX) \preceq f(I) = f(1) I [/math]. Therefore,

[[math]] \begin{equation*} e^{\theta \bX} = I + \theta \bX + \bX f(\bX) \bX \preceq I + \theta \bX + f(1) \bX^2, \end{equation*} [[/math]]

and by taking expectations,

[[math]] \begin{equation*} \E [e^{\theta \bX}] \preceq I + f(1) \E[\bX^2] \preceq \exp( f(1) \E[\bX^2]) = \exp((e^{\theta} - \theta - 1) \E[\bX^2]). \end{equation*} [[/math]]

■

Theorem (Matrix Bernstein)

Assume [math] \E \bX_i = 0 [/math] and [math] \lambda_{\mathrm{max}}(\bX_i) \leq R [/math] almost surely for all [math] i [/math] and set

[[math]] \begin{equation*} \sigma^2 = \| \sum_{i} \E[\bX_i^2] \|_{\mathrm{op}}. \end{equation*} [[/math]]

Then, for all [math] t \geq 0 [/math],

[[math]] \begin{align*} \p ( \lambda_{\mathrm{max}}( \sum_{i} \bX_i) \geq t ) \leq {} & d \exp(- \frac{\sigma^2}{R^2} h(Rt/\sigma^2))\\ \leq {} & d \exp \left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right)\\ \leq {} & \left\{ \begin{aligned} d \exp\left(-\frac{3t^2}{8 \sigma^2}\right), \quad & t \leq \frac{\sigma^2}{R},\\ d \exp\left(-\frac{3t}{8R}\right), \quad & t \geq \frac{\sigma^2}{R}, \end{aligned} \right. \end{align*} [[/math]]

where [math] h(u) = (1+u) \log (1+u) - u [/math], [math] u \gt 0 [/math].

Show Proof

Without loss of generality, take [math] R = 1 [/math]. By Lemma,

[[math]] \begin{equation*} \E[e^{\theta \bX_i}] \preceq \exp(g(\theta) \E[\bX_i^2]), \quad \text{with } g(\theta) = e^\theta - \theta - 1, \quad \theta \gt 0. \end{equation*} [[/math]]

By Corollary,

[[math]] \begin{equation*} \p( \lambda_{\mathrm{max}}( \sum_{i} \bX_i) \geq t) \leq d \exp(-\theta t + g(\theta) \sigma^2). \end{equation*} [[/math]]

We can verify that the minimal value is attained at [math] \theta = \log(1+t/\sigma^2) [/math]. The second inequality then follows from the fact that [math] h(u) \geq h_2(u) = \frac{u^2/2}{1+u/3} [/math] for [math] u \geq 0 [/math] which can be verified by comparing derivatives. The third inequality follows from exploiting the properties of [math] h_2 [/math].

■

With similar techniques, one can also prove a version of Hoeffding's inequality for matrices, see ^[3]^{(Theorem 1.3)}.

General references

Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].

Notes

We will often commit the statement ‘`at least" for brevity

References

Ramon, van (2017). "Probability in High Dimension". Constructive Approximation 13. Springer-Verlag. doi:10.1214/aos/1034276635.
Wikipedia (2012). "Moore--Penrose pseudoinverse --- Wikipedia, The Free Encyclopedia". ArXiv:1212.1247 8. Princeton University Press. doi:10.1214/aos/1034276635.
^3.0 ^3.1 Tropp, Joel A. (2012). "User-Friendly Tail Bounds for Sums of Random Matrices". Foundations of computational mathematics 12. Springer-Verlag. doi:10.1214/aos/1034276635.
^4.0 ^4.1 "Strong Converse for Identification via Quantum Channels" (2002). IEEE Transactions on Information Theory 48. Springer-Verlag. doi:10.1214/aos/1034276635.
Gross, David (2011). "Recovering Low-Rank Matrices from Few Coefficients in Any Basis". IEEE Transactions on Information Theory 57. Springer-Verlag. doi:10.1214/aos/1034276635.
Lieb, Elliott H. (1973). "Convex Trace Functions and the Wigner-Yanase-Dyson Conjecture". Advances in Mathematics 11. Springer-Verlag. doi:10.1214/aos/1034276635.
Ruskai, Mary Beth (2002). "Inequalities for Quantum Entropy: A Review with Conditions for Equality". Journal of Mathematical Physics 43. Springer-Verlag. doi:10.1214/aos/1034276635.

[1] We will often commit the statement ‘`at least" for brevity

[BouLugMas13-2] Ramon, van (2017). "Probability in High Dimension". Constructive Approximation 13. Springer-Verlag. doi:10.1214/aos/1034276635.

[Gru03-3] Wikipedia (2012). "Moore--Penrose pseudoinverse --- Wikipedia, The Free Encyclopedia". ArXiv:1212.1247 8. Princeton University Press. doi:10.1214/aos/1034276635.

[Tro12-4] 3.0 ^3.1 Tropp, Joel A. (2012). "User-Friendly Tail Bounds for Sums of Random Matrices". Foundations of computational mathematics 12. Springer-Verlag. doi:10.1214/aos/1034276635.

[AhlWin02-5] 4.0 ^4.1 "Strong Converse for Identification via Quantum Channels" (2002). IEEE Transactions on Information Theory 48. Springer-Verlag. doi:10.1214/aos/1034276635.

[Gro11-6] Gross, David (2011). "Recovering Low-Rank Matrices from Few Coefficients in Any Basis". IEEE Transactions on Information Theory 57. Springer-Verlag. doi:10.1214/aos/1034276635.

[Lie73-7] Lieb, Elliott H. (1973). "Convex Trace Functions and the Wigner-Yanase-Dyson Conjecture". Advances in Mathematics 11. Springer-Verlag. doi:10.1214/aos/1034276635.

[Rus02-8] Ruskai, Mary Beth (2002). "Inequalities for Quantum Entropy: A Review with Conditions for Equality". Journal of Mathematical Physics 43. Springer-Verlag. doi:10.1214/aos/1034276635.

[Notes 1]

[1]

[2]

[3]

[4]

[5]

[6]

[7]