Minimax Lower Bounds
[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} \newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} \newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} \newcommand{\KL}{\mathsf{KL}} \newcommand{\TV}{\mathsf{TV}} [/math]
In the previous chapters, we have proved several upper bounds and the goal of this chapter is to assess their optimality. Specifically, our goal is to answer the following questions:
- Can our analysis be improved? In other words: do the estimators that we have studied actually satisfy better bounds?
- Can any estimator improve upon these bounds?
Both questions ask about some form of optimality. The first one is about optimality of an estimator, whereas the second one is about optimality of a bound. The difficulty of these questions varies depending on whether we are looking for a positive or a negative answer. Indeed, a positive answer to these questions simply consists in finding a better proof for the estimator we have studied (question 1.) or simply finding a better estimator, together with a proof that it performs better (question 2.). A negative answer is much more arduous. For example, in question 2., it is a statement about all estimators. How can this be done? The answer lies in information theory (see[1] for a nice introduction). In this chapter, we will see how to give a negative answer to question 2. It will imply a negative answer to question 1.
Optimality in a minimax sense
Consider the Gaussian Sequence Model (GSM) where we observe [math]\bY=(Y_1, \ldots, Y_d)^\top[/math], defined by
where [math]\varepsilon=(\eps_1, \ldots, \eps_d)^\top \sim \cN_d(0,\frac{\sigma^2}{n}I_d)[/math], [math]\theta^*=(\theta^*_1, \ldots, \theta^*_d)^\top \in \Theta [/math] is the parameter of interest and [math]\Theta \subset \R^d[/math] is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by [math]\p_{\theta^*}[/math] and [math]\E_{\theta^*}[/math] the probability measure and corresponding expectation that are associated to the distribution of [math]\bY[/math] from the GSM\eqref{EQ:GSMminimax}. Recall that GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (upper bounds) for various choices of [math]\Theta[/math] that can be expressed either in the form
or the form
For some constant [math]C[/math]. The rates [math]\phi(\Theta)[/math] for different choices of [math]\Theta[/math] that we have obtained are gathered in Table together with the estimator (and the corresponding result from Chapter Linear Regression Model) that was employed to obtain this rate.
[math]\Theta[/math] | [math]\phi(\Theta)[/math] | Estimator | Result |
[math]\R^d[/math] | [math]\DS \frac{\sigma^2 d}{n}[/math] | [math]\thetals[/math] | Theorem |
[math]\cB_1[/math] | [math]\DS\sigma \sqrt{\frac{ \log d}{n}}[/math] | [math]\thetals_{\cB_1}[/math] | Theorem |
[math]\cB_0(k)[/math] | [math]\DS\frac{ \sigma^2 k}{n}\log (ed/k)[/math] | [math]\thetals_{\cB_0(k)}[/math] | Corollary-Corollary |
Can any of these results be improved? In other words, does there exists another estimator [math]\tilde \theta[/math] such that [math]\sup_{\theta^* \in \Theta}\E|\tilde \theta -\theta^*|_2^2\ll\phi(\Theta)[/math]? A first step in this direction is the Cram\'er-Rao lower bound [2] that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of minimax optimality that characterizes how fast [math]\theta^*[/math] can be estimated uniformly over [math]\Theta[/math].
We say that an estimator [math]\hat \theta_n[/math] is minimax optimal over [math]\Theta[/math] if it satisfies \eqref{EQ:minimaxUB_E} and there exists [math]C' \gt 0[/math] such that
Note that minimax rates of convergence [math]\phi[/math] are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as [math]\sigma^2d/n[/math] as opposed to [math]7\sigma^2d/n[/math] for example. This definition can be adapted to rates that hold with high probability. As we saw in Chapter Linear Regression Model (Cf. Table), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any [math]A \gt 0[/math],
Therefore,\eqref{EQ:minimaxLB1} follows if we prove that
for some positive constants [math]A[/math] and [math]C"[/math]. The above inequality also implies a lower bound with high probability. We can therefore employ the following alternate definition for minimax optimality.
We say that an estimator [math]\hat \theta[/math] is minimax optimal over [math]\Theta[/math] if it satisfies either \eqref{EQ:minimaxUB_E} or \eqref{EQ:minimaxUB_P} and there exists [math]C' \gt 0[/math] such that
Reduction to finite hypothesis testing
Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distinguish between two probability distributions that are close to each other. For example, given [math]n[/math] i.i.d. observations, it is impossible to reliably decide whether they are drawn from [math]\cN(0,1)[/math] or [math]\cN(\frac1n,1)[/math]. This simple argument can be made precise using the formalism of statistical hypothesis testing. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps.
- Reduction to a finite number of parameters. In this step the goal is to find the largest possible number of parameters [math]\theta_1, \ldots, \theta_M \in \Theta[/math] under the constraint that
[[math]] \begin{equation} \label{EQ:dist_constraint} |\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)\,. \end{equation} [[/math]]This problem boils down to a packing of the set [math]\Theta[/math]. Then we can use the following trivial observations:[[math]] \inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\big[|\hat \theta- \theta|_2^2 \gt \phi(\Theta)\big] \ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]\,. [[/math]]
- Reduction to a hypothesis testing problem. In this second step, the necessity of the constraint\eqref{EQ:dist_constraint} becomes apparent.
For any estimator [math]\hat \theta[/math], define the minimum distance test [math]\psi(\hat \theta)[/math] that is associated to it by
[[math]] \psi(\hat \theta)=\argmin_{1\le j \le M} |\hat \theta -\theta_j|_2\,, [[/math]]with ties broken arbitrarily. Next observe that if, for some [math]j=1, \ldots, M[/math], [math]\psi(\hat \theta)\neq j[/math], then there exists [math]k \neq j[/math] such that [math]|\hat \theta -\theta_k|_2 \le |\hat \theta -\theta_j|_2[/math]. Together with the reverse triangle inequality it yields[[math]] |\hat \theta - \theta_j|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_k|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_j|_2 [[/math]]so that[[math]] |\hat \theta - \theta_j|_2 \ge \frac12 |\theta_j -\theta_k|_2 [[/math]]Together with constraint\eqref{EQ:dist_constraint}, it yields[[math]] |\hat \theta - \theta_j|_2^2 \ge \phi(\Theta) [[/math]]As a result,[[math]] \begin{align*} \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]&\ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[\psi(\hat \theta)\neq j\big]\\ &\ge \inf_{\psi}\max_{1\le j \le M}\p_{\theta_j}\big[\psi\neq j\big] \end{align*} [[/math]]where the infimum is taken over all tests [math]\psi[/math] based on [math]\bY[/math] and that take values in [math]\{1, \ldots, M\}[/math].
Conclusion: it is sufficient for proving lower bounds to find [math]\theta_1, \ldots, \theta_M \in \Theta[/math] such that [math]|\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)[/math] and
The above quantity is called minimax probability of error. In the next sections, we show how it can be bounded from below using arguments from information theory. For the purpose of illustration, we begin with the simple case where [math]M=2[/math] in the next section.
Lower bounds based on two hypotheses
The Neyman-Pearson Lemma and the total variation distance
Consider two probability measures [math]\p_0[/math] and [math]\p_1[/math] and observations [math]X[/math] drawn from either [math]\p_0[/math] or [math]\p_1[/math]. We want to know which distribution [math]X[/math] comes from. It corresponds to the following statistical hypothesis problem:
A test [math]\psi=\psi(Z) \in \{0,1\}[/math] indicates which hypothesis should be true. Any test [math]\psi[/math] can make two types of errors. It can commit either an error of type I ([math]\psi=1[/math] whereas [math]Z \sim \p_0[/math]) or an error of type II ([math]\psi=0[/math] whereas [math]Z \sim \p_1[/math]). Of course, the test may also be correct. The following fundamental result called the Neyman Pearson Lemma indicates that any test [math]\psi[/math] is bound to commit one of these two types of error with positive probability unless [math]\p_0[/math] and [math]\p_1[/math] have essentially disjoint support. Let [math]\nu[/math] be a sigma finite measure satisfying [math]\p_0 \ll \nu[/math] and [math]\p_1\ll \nu[/math]. For example, we can take [math]\nu = \p_0+\p_1[/math]. It follows from the Radon-Nikodym theorem [3] that both [math]\p_0[/math] and [math]\p_1[/math] admit probability densities with respect to [math]\nu[/math]. We denote them by [math]p_0[/math] and [math]p_1[/math] respectively. For any function [math]f[/math], we write for simplicity
Let [math]\p_0[/math] and [math]\p_1[/math] be two probability measures. Then for any test [math]\psi[/math], it holds
Observe first that
The lower bound in the Neyman-Pearson lemma is related to a well-known quantity: the total variation distance.
The total variation distance between two probability measures [math]\p_0[/math] and [math]\p_1[/math] on a measurable space [math](\cX, \cA)[/math] is defined by
Clearly [math](i)=(ii)[/math] and the Neyman-Pearson Lemma gives [math](iv)=(v)[/math]. Moreover, by identifying a test [math]\psi[/math] to its rejection region, it is not hard to see that [math](i)=(v)[/math]. Therefore it remains only to show that [math](iii)[/math] is equal to any of the other expressions. Hereafter, we show that [math](iii)=(iv)[/math]. To that end, observe that
In view of the Neyman-Pearson lemma, it is clear that if we want to prove large lower bounds, we need to find probability distributions that are close in total variation. Yet, this conflicts with constraint\eqref{EQ:dist_constraint} and a tradeoff needs to be achieved. To that end, in the Gaussian sequence model, we need to be able to compute the total variation distance between [math]\cN(\theta_0, \frac{\sigma^2}{n}I_d)[/math] and [math]\cN(\theta_1, \frac{\sigma^2}{n}I_d)[/math]. None of the expression in Definition-Proposition gives an easy way to do so. The Kullback-Leibler divergence is much more convenient.
The Kullback-Leibler divergence
The Kullback-Leibler divergence between probability measures [math]\p_1[/math] and [math]\p_0[/math] is given by
It can be shown [4] that the integral is always well defined when [math]\p_1 \ll \p_0[/math] (though it can be equal to [math]\infty[/math] even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.
Let [math]\p[/math] and [math]\q[/math] be two probability measures. Then
- [math]\KL(\p, \q)\ge 0[/math].
- The function [math](\p, \q) \mapsto \KL(\p,\q)[/math] is convex.
- If [math]\p[/math] and [math]\q[/math] are product measures, i.e.,
[[math]] \p=\bigotimes_{i=1}^n \p_i\quad \text{and} \quad \q=\bigotimes_{i=1}^n \q_i [[/math]]then[[math]] \KL(\p,\q)=\sum_{i=1}^n \KL(\p_i,\q_i)\,. [[/math]]
If [math]\p[/math] is not absolutely continuous then the result is trivial. Next, assume that [math]\p \ll \q[/math] and let [math]X \sim \p[/math]. 1. Observe that by Jensen's inequality,
2. Consider the function [math]f:(x,y) \mapsto x\log(x/y)[/math] and compute its Hessian:
Point 2. in Proposition is particularly useful in statistics where observations typically consist of [math]n[/math] independent random variables.
Example
For any [math]\theta \in \R^d[/math], let [math]P_\theta[/math] denote the distribution of [math]\bY \sim \cN(\theta, \sigma^2I_d)[/math]. Then
The proof is left as an exercise (see Problem).
The Kullback-Leibler divergence is easier to manipulate than the total variation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker's inequality. We prove here a slightly weaker version of Pinsker's inequality that will be sufficient for our purpose. For a stronger statement, see [4], Lemma2.5.
Let [math]\p[/math] and [math]\q[/math] be two probability measures such that [math]\p \ll \q[/math]. Then
Note that
Pinsker's inequality yields the following theorem for the GSM.
Assume that [math]\Theta[/math] contains two hypotheses [math]\theta_0[/math] and [math]\theta_1[/math] such that [math]|\theta_0-\theta_1|_2^2 = 8\alpha^2\sigma^2/n[/math] for some [math]\alpha \in (0,1/2)[/math]. Then
Write for simplicity [math]\p_j=\p_{\theta_j}[/math], [math]j=0,1[/math]. Recall that it follows from the reduction to hypothesis testing that
Clearly, the result of Theorem matches the upper bound for [math]\Theta=\R^d[/math] only for [math]d=1[/math]. How about larger [math]d[/math]? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of [math]\theta^*[/math]. With this knowledge, one can obtain upper bounds that do not depend on [math]d[/math] by simply projecting [math]Y[/math] onto the linear span of [math]\theta_0, \theta_1[/math] and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to [math]d[/math]. In principle, we should need at least order [math]d[/math] hypotheses but we will actually need much more.
Lower bounds based on many hypotheses
The reduction to hypothesis testing from Section Reduction to finite hypothesis testing allows us to use more than two hypotheses. Specifically, we should find [math]\theta_1, \ldots, \theta_M[/math] such that
for some positive constant [math]C'[/math]. Unfortunately, the Neyman-Pearson Lemma no longer exists for more than two hypotheses. Nevertheless, it is possible to relate the minimax probability of error directly to the Kullback-Leibler divergence, without involving the total variation distance. This is possible using a well known result from information theory called Fano's inequality.
Let [math]P_1, \ldots, P_M, M \ge 2[/math] be probability distributions such that [math]P_j \ll P_k[/math], [math]\forall \, j ,k[/math]. Then
Define
Fano's inequality leads to the following useful theorem.
Assume that [math]\Theta[/math] contains [math]M \ge 5[/math] hypotheses [math]\theta_1, \ldots, \theta_M[/math] such that for some constant [math]0 \lt \alpha \lt 1/4[/math], it holds
- [math]\DS|\theta_j-\theta_k|_2^2 \ge 4\phi[/math]
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2\alpha\sigma^2 }{n}\log(M)[/math]
Then
in view of (i), it follows from the reduction to hypothesis testing that it is sufficient to prove that
Theorem indicates that we must take [math]\phi\le \frac{\alpha\sigma^2 }{2n}\log(M)[/math]. Therefore, the larger the [math]M[/math], the larger the lower bound can be. However, [math]M[/math] cannot be arbitrary larger because of the constraint [math](i)[/math]. We are therefore facing a packing problem where the goal is to ‘`pack" as many Euclidean balls of radius proportional to [math]\sigma\sqrt{\log(M)/n}[/math] in [math]\Theta[/math] under the constraint that their centers remain close together (constraint [math](ii)[/math]). If [math]\Theta=\R^d[/math], this the goal is to pack the Euclidean ball of radius [math]R=\sigma\sqrt{2\alpha\log(M)/n}[/math] with Euclidean balls of radius [math]R\sqrt{2\alpha/\gamma}[/math]. This can be done using a volume argument (see Problem). However, we will use the more versatile lemma below. It gives a a lower bound on the size of a packing of the discrete hypercube [math]\{0,1\}^d[/math] with respect to the ’'Hamming distance defined by
For any [math]\gamma \in (0,1/2)[/math], there exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{j}, \omega_{k})\ge \big(\frac12-\gamma\big)d[/math] for all [math]j \neq k[/math]\,,
- [math]\DS M =\lfloor e^{\gamma^2d}\rfloor\ge e^{\frac{\gamma^2d}{2}}[/math]\,.
Let [math]\omega_{j,i}[/math], [math]1\le i \le d, 1\le j \le M[/math] be \iid Bernoulli random variables with parameter [math]1/2[/math] and observe that
Application to the Gaussian sequence model
We are now in a position to apply Theorem by choosing [math]\theta_1, \ldots, \theta_M[/math] based on [math]\omega_{1}, \ldots, \omega_{M}[/math] from the Varshamov-Gilbert Lemma.
Lower bounds for estimation
Take [math]\gamma=1/4[/math] and apply the Varshamov-Gilbert Lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]M=\lfloor e^{d/16}\rfloor\ge e^{d/32} [/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge d/4[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
for some [math]\beta \gt 0[/math] to be chosen later. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \ge 4\frac{\beta^2\sigma^2 d}{16n}[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \le \frac{\beta^2\sigma^2 d}{n}\le \frac{32\beta^2\sigma^2 }{n}\log(M)=\frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\frac{\sqrt{\alpha}}{4}[/math]. Applying now Theorem yields
It implies the following corollary.
The minimax rate of estimation of over [math]\R^d[/math] in the Gaussian sequence model is [math]\phi(\R^d)=\sigma^2d/n[/math]. Moreover, it is attained by the least squares estimator [math]\thetals=\bY[/math].
Note that this rate is minimax over sets [math]\Theta[/math] that are strictly smaller than [math]\R^d[/math] (see ProblemMinimax Lower Bounds). Indeed, it is minimax over any subset of [math]\R^d[/math] that contains [math]\theta_1, \ldots, \theta_M[/math].
Lower bounds for sparse estimation
It appears from Table that when estimating sparse vectors, we have to pay for an extra logarithmic term [math]\log(ed/k)[/math] for not knowing the sparsity pattern of the unknown [math]\theta^*[/math]. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors. Note that the vectors [math]\theta_1, \ldots, \theta_M[/math] employed in the previous subsection are not guaranteed to be sparse because the vectors [math]\omega_1, \ldots, \omega_M[/math] obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varhsamov-Gilbert lemma.
There exist positive constants [math]C_1[/math] and [math]C_2[/math] such that the following holds for any two integers [math]k[/math] and [math]d[/math] such that [math]1\le k \le d/8[/math]. There exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{i}, \omega_{j})\ge \frac{k}2[/math] for all [math]i \neq j[/math]\,,
- [math]\DS \log(M) \ge \frac{k}{8}\log(1+\frac{d}{2k})[/math]\,.
- [math]\DS |\omega_j|_0=k[/math] for all [math]j[/math]\,.
Take [math]\omega_1, \ldots, \omega_M[/math] independently and uniformly at random from the set
Apply the sparse Varshamov-Gilbert lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]\log(M)\ge \frac{k}{8}\log(1+\frac{d}{2k})[/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge k/2[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
for some [math]\beta \gt 0[/math] to be chosen later. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \ge 4\frac{\beta^2\sigma^2 }{8n}k\log(1+\frac{d}{2k})[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \le \frac{2k\beta^2\sigma^2 }{n}\log(1+\frac{d}{2k})\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\sqrt{\frac{\alpha}{8}}[/math]. Applying now Theorem yields
It implies the following corollary.
Recall that [math]\cB_{0}(k) \subset \R^d[/math] denotes the set of all [math]k[/math]-sparse vectors of [math]\R^d[/math]. The minimax rate of estimation over [math]\cB_{0}(k)[/math] in the Gaussian sequence model is [math]\phi(\cB_{0}(k))=\frac{\sigma^2k}{n}\log(ed/k)[/math]. Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_0(k)}[/math].
Note that the modified BIC estimator of Problem and the Slope eq:ad are also minimax optimal over [math]\cB_{0}(k)[/math]. However, unlike [math] \thetals_{\cB_0(k)} [/math] and the modified BIC, the Slope is also adaptive to [math] k [/math]. For any [math]\eps \gt 0[/math], the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that [math]k\le d^{1-\eps}[/math].
Lower bounds for estimating vectors in [math]\ell_1[/math] balls
Recall that in Maurey's argument, we approximated a vector [math]\theta[/math] such that [math]|\theta|_1=R[/math] by a vector [math]\theta'[/math] such that [math]|\theta'|_0=\frac{R}{\sigma}\sqrt{\frac{n}{\log d}}[/math]\,. We can essentially do the same for the lower bound. Assume that [math]d \ge \sqrt{n}[/math] and let [math]\beta \in (0,1)[/math] be a parameter to be chosen later and define [math]k[/math] to be the smallest integer such that
Let [math]\omega_1, \ldots, \omega_M[/math] be obtained from the sparse Varshamov-Gilbert Lemma with this choice of [math]k[/math] and define
Observe that [math]|\theta_j|_1=R[/math] for [math]j=1,\ldots, M[/math]. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{R^2}{k^2}\rho(\omega_{j}, \omega_{k})\ge \frac{R^2}{2k}\ge 4R\min\big(\frac{R}8, \beta^2\sigma\frac{\log (ed/\sqrt{n})}{8n}\big)[/math]\,.
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2R^2}{k} \le 4R\beta\sigma\sqrt{\frac{\log(ed/\sqrt{n})}{n}}\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta[/math] small enough if [math]d \ge Ck[/math] for some constant [math]C \gt 0[/math] chosen large enough. Applying now Theorem yields
It implies the following corollary.
Recall that [math]\cB_{1}(R) \subset \R^d[/math] denotes the set vectors [math]\theta \in \R^d[/math] such that [math]|\theta|_1 \le R[/math]. Then there exist a constant [math]C \gt 0[/math] such that if [math]d \ge n^{1/2+\eps}[/math], [math]\eps \gt 0[/math], the minimax rate of estimation over [math]\cB_{1}(R)[/math] in the Gaussian sequence model is
Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_1(R)}[/math] if [math]R\ge \sigma\frac{\log d}{n}[/math] and by the trivial estimator [math]\hat \theta =0[/math] otherwise.
To complete the proof of the statement, we need to study the risk of the trivial estimator equal to zero for small [math]R[/math]. Note that if [math]|\theta^*|_1\le R[/math], we have
Note that the inequality [math]|\theta^*|_2^2\le |\theta^*|_1^2[/math] appears to be quite loose. Nevertheless, it is tight up to a multiplicative constant for the vectors of the form [math]\theta_j=\omega_j\frac{R}{k}[/math] that are employed in the lower bound. Indeed, if [math]R\le \sigma\frac{\log d}{n}[/math], we have [math]k\le 2/\beta[/math]
Lower bounds for sparse estimation via [math] \chi^2 [/math]divergence
In this section, we will show how to derive lower bounds by directly controlling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estimation problems. Define the [math] \chi^2 [/math] divergence between two probability distributions [math] \p, \q [/math] as
When we compare the expression in the case where [math] \p \ll \q [/math], then we see that both the KL divergence and the [math] \chi^2 [/math] divergence can be written as [math] \int f \left( \frac{d\p}{d\q} \right) d\q [/math] with [math] f(x) = x \log x [/math] and [math] f(x) = (x-1)^2 [/math], respectively. Also note that both of these functions are convex in [math] x [/math] and fulfill [math] f(1) = 0 [/math], which by Jensen’s inequality shows us that they are non-negative and equal to [math] 0 [/math] if and only if [math] \p = \q [/math]. Firstly, let us derive another useful expression for the [math] \chi^2 [/math] divergence. By expanding the square, we have
Secondly, we note that we can bound the KL divergence from above by the [math] \chi^2 [/math] divergence, via Jensen’s inequality, as
where we used the concavity inequality [math] \log(1+x) \leq x [/math] for [math] x \gt -1 [/math]. Note that this estimate is a priori very rough and that we are transitioning from something at [math] \log [/math] scale to something on a regular scale. However, since we are mostly interested in showing that these distances are bounded by a small constant after adjusting the parameters of our models appropriately, it will suffice for our purposes. In particular, we can combine \eqref{eq:ag} with the Neyman-Pearson Lemma and Pinsker’s inequality, Lemma, to get
for distinguishing between two hypothesis with a decision rule [math] \psi [/math] if [math] \chi^2(\p_0, \p_1) \leq \varepsilon [/math]. In the following, we want to apply this observation to derive lower rates for distinguishing between
where [math] \p_u [/math] denotes the probability distribution of a Gaussian sequence model [math] Y = u + \frac{\sigma}{n} \xi [/math], with [math] \xi \sim \cN(0, I_d) [/math].
Consider the detection problem
We introduce the mixture hypothesis
Now, we need to take the expectation over two uniform draws of support sets [math] S [/math] and [math] T [/math], which reduces via conditioning and exploiting the independence of the two draws to
If instead of the [math] \chi^2 [/math] divergence, we try to compare the [math] \KL [/math] divergence between [math] \p_0 [/math] and [math] \bar{\p} [/math], we are tempted to exploit its convexity properties to get
This rate for detection implies lower rates for estimation of both [math] \theta^\ast [/math] and [math] \| \theta^\ast \|_2 [/math]: If given an estimator [math] \widehat{T}(X) [/math] for [math] T(\theta) = \| \theta \|_2 [/math] that achieves an error less than [math] \mu/2 [/math] for some [math] X [/math], then we would get a correct decision rule for those [math] X [/math] by setting
Lower bounds for estimating the [math] \ell_1 [/math] norm via moment matching
So far, we have seen many examples of rates that scale with [math] n^{-1} [/math] for the squared error. In this section, we are going to see an example for estimating a functional that can only be estimated at a much slower rate of [math] 1/\log n [/math], following [5]. Consider the normalized [math] \ell_1 [/math] norm,
as the target functional to be estimated, and measurements
We are going to show lower bounds for the estimation if restricted to an [math] \ell_\infty [/math] ball, [math] \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} [/math], and for a general [math] \theta \in \R^n [/math].
A constrained risk inequality
We start with a general principle that is in the same spirit as the Neyman-Pearson lemma, (Lemma), but deals with expectations rather than probabilities and uses the [math] \chi^2 [/math] distance to estimate the closeness of distributions. Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors. In the following, we write [math] X [/math] for a random observation coming from a distribution indexed by [math] \theta \in \Theta [/math], [math] \widehat{T} = \widehat{T}(X) [/math] an estimator for a function [math] T(\theta) [/math], and write [math] B(\theta) = \E_\theta \widehat{T} - T(\theta) [/math] for the bias of [math] \widehat{T} [/math]. Let [math] \mu_0 [/math] and [math] \mu_1 [/math] be two prior distributions over [math] \Theta [/math] and denote by [math] m_i [/math], [math] v_i^2 [/math] the means and variance of [math] T(\theta) [/math] under the priors [math] \mu_i [/math], [math] i \in \{0, 1\} [/math],
Moreover, write [math] F_i [/math] for the marginal distributions over the priors and denote their density with respect to the average of the two probability distributions by [math] f_i [/math]. With this, we write the [math] \chi^2 [/math] distance between [math] f_0 [/math] and [math] f_1 [/math] as
(i) If [math] \int \E_\theta (\widehat{T}(X) - T(\theta))^2 d\mu_0(\theta) \leq \varepsilon^2 [/math],
(ii) If [math] | m_1 - m_0 | \gt v_0 I[/math] and [math] 0 \leq \lambda \leq 1 [/math],
Without loss of generality, assume [math] m_1 \geq m_0 [/math]. We start by considering the term
Finally, since the minimax risk is bigger than any mixture risk, we can bound it from below by the maximum value of this bound, which is obtained at [math] \lambda = (I+1)/(I + 2) [/math] to get \eqref{eq:minimax_risk_mixture}, and by the same argument \eqref{eq:minimax_risk}.
Unfavorable priors via polynomial approximation
In order to construct unfavorable priors for the [math] \ell_1 [/math] norm \eqref{eq:aw}, we resort to polynomial approximation theory. For a continuous function [math] f [/math] on [math] [-1, 1] [/math], denote the maximal approximation error by polynomials of degree [math] k [/math], written as [math] \mathcal{P}_k [/math], by
and the polynomial attaining this error by [math] G_k^\ast [/math]. For [math] f(t) = |t| [/math], it is known [6] that
which is a very slow rate of convergence of the approximation, compared to smooth functions for which we would expect geometric convergence. We can now construct our priors using some abstract measure theoretical results.
For an integer [math] k \gt 0 [/math], there are two probability measures [math] \nu_0 [/math] and [math] \nu_1 [/math] on [math] [-1, 1] [/math] that fulfill the following:
- [math] \nu_0 [/math] and [math] \nu_1 [/math] are symmetric about 0;
- [math] \int t^l d \nu_1 (t) = \int t^l d \nu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k \} [/math];
- [math]\int |t| d \nu_1(t) - \int | t | d \nu_0(t) = 2 \delta_k [/math].
The idea is to construct the measures via the Hahn-Banach theorem and the Riesz representation theorem. First, consider [math] f(t) = |t| [/math] as an element of the space [math] C([-1, 1]) [/math] equipped with the supremum norm and define [math] \mathcal{P}_k [/math] as the space of polynomials of degree up to [math] k [/math], and [math] \mathcal{F}_k := \operatorname{span}(\mathcal{P}_k \cup \{ f \}) [/math]. On [math] \mathcal{F}_k [/math], define the functional [math] T(c f + p_k) = c \delta_k [/math], which is well-defined because [math] f [/math] is not a polynomial. Claim: [math] \| T \| = \sup \{ T(g) : g \in \mathcal{F}_k, \| g \|_\infty \leq 1\} = 1 [/math]. [math] \| T \| \geq 1 [/math]: Let [math] G_k^\ast [/math] be the best-approximating polynomial to [math] f [/math] in [math] \mathcal{P}_k [/math]. Then, [math] \| f - G_k^\ast \|_\infty = \delta_k [/math], [math] \| (f - G^\ast_k)/\delta_k \|_\infty = 1 [/math], and [math] T((f - G^\ast_k)/\delta_k) = 1 [/math]. [math] \| T \| \leq 1 [/math]: Suppose [math] g = cf + p_k [/math] with [math] p_k \in \mathcal{P}_k [/math], [math] \| g \|_\infty = 1 [/math] and [math] T(g) \gt 1 [/math]. Then [math] c \gt 1/\delta_k [/math] and
Minimax lower bounds
We have the following bounds on the minimax rate for [math] T(\theta) = \frac{1}{n} \sum_{i = 1}^{n} | \theta_i| [/math]: For [math] \theta \in \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} [/math],
We want to use Theorem. To construct the prior measures, we scale the measures from Lemma appropriately: Let [math] k_n [/math] be an even integer that is to be determined, and [math] \nu_0 [/math], [math] \nu_1 [/math] the two measures given by Lemma. Define [math] g(x) = Mx [/math] and define the measures [math] \mu_i [/math] by dilating [math] \nu_i [/math], [math] \mu_i(A) = \nu_i(g^{-1}(A)) [/math] for every Borel set [math] A \subseteq [-M, M] [/math] and [math] i \in \{0, 1\} [/math]. Hence,
- [math] \mu_0 [/math] and [math] \mu_1 [/math] are symmetric about 0;
- [math] \int t^l d \mu_1 (t) = \int t^l d \mu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k_n \} [/math];
- [math]\int |t| d \mu_1(t) - \int | t | d \mu_0(t) = 2 M \delta_{k_n} [/math].
To get priors for [math] n [/math] \iid samples, consider the product priors [math] \mu_i^n = \prod_{j=1}^n \mu_i [/math]. With this, we have
To prove the lower bound over [math] \R^n [/math], take [math] M = \sqrt{\log n} [/math] and [math] k_n [/math] to be the smallest integer such that [math] k_n \geq 2 \e \log n [/math] and plug this into \eqref{eq:ax}, yielding
Note that [5] also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating [math] T(\theta) [/math].
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
References
- Cover, Thomas M.; Thomas, Joy A. (2006). Elements of information theory. 65. Hoboken, NJ: Wiley-Interscience [John Wiley & Sons]. pp. xxiv+748. ISBN 978-0-471-24195-9.
- Shao, Jun (2003). Mathematical statistics. 65. New York: Springer-Verlag. pp. xvi+591. ISBN 0-387-95382-5.
- Billingsley, Patrick (1995). Probability and measure. 65. New York: John Wiley & Sons Inc. pp. xiv+593. ISBN 0-471-00710-2.
- 4.0 4.1 Tsybakov, Alexandre B. (2009). Introduction to nonparametric estimation. 1. New York: Springer. pp. xii+214. ISBN 978-0-387-79051-0.
- 5.0 5.1 "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional" (2011). Ann. Statist. 39: 1012--1041. The Institute of Mathematical Statistics. doi: .
- Rivlin, Theodore J. (1990). Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory. 39. Berlin: Wiley. pp. 1012--1041. ISBN 978-0-471-62896-5.