Minimax Lower Bounds
[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} \newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} \newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} [/math]
\label{chap:minimax} \newcommand{\KL}{\mathsf{KL}} \newcommand{\TV}{\mathsf{TV}} In the previous chapters, we have proved several upper bounds and the goal of this chapter is to assess their optimality. Specifically, our goal is to answer the following questions:
- Can our analysis be improved? In other words: do the estimators that we have studied actually satisfy better bounds?
- Can any estimator improve upon these bounds?
Both questions ask about some form of optimality. The first one is about optimality of an estimator, whereas the second one is about optimality of a bound. The difficulty of these questions varies depending on whether we are looking for a positive or a negative answer. Indeed, a positive answer to these questions simply consists in finding a better proof for the estimator we have studied (question 1.) or simply finding a better estimator, together with a proof that it performs better (question 2.). A negative answer is much more arduous. For example, in question 2., it is a statement about all estimators. How can this be done? The answer lies in information theory (see~[1] for a nice introduction). In this chapter, we will see how to give a negative answer to question 2. It will imply a negative answer to question 1.
Optimality in a minimax sense
Consider the Gaussian Sequence Model (GSM) where we observe [math]\bY=(Y_1, \ldots, Y_d)^\top[/math], defined by
where [math]\varepsilon=(\eps_1, \ldots, \eps_d)^\top \sim \cN_d(0,\frac{\sigma^2}{n}I_d)[/math], [math]\theta^*=(\theta^*_1, \ldots, \theta^*_d)^\top \in \Theta [/math] is the parameter of interest and [math]\Theta \subset \R^d[/math] is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by [math]\p_{\theta^*}[/math] and [math]\E_{\theta^*}[/math] the probability measure and corresponding expectation that are associated to the distribution of [math]\bY[/math] from the GSM~\eqref{EQ:GSMminimax}. Recall that GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (upper bounds) for various choices of [math]\Theta[/math] that can be expressed either in the form
or the form
For some constant [math]C[/math]. The rates [math]\phi(\Theta)[/math] for different choices of [math]\Theta[/math] that we have obtained are gathered in Table~Minimax Lower Bounds together with the estimator (and the corresponding result from Chapter~Linear Regression Model) that was employed to obtain this rate. \renewcommand{\arraystretch}{2.5}
[math]\Theta[/math] | [math]\phi(\Theta)[/math] | Estimator | Result |
[math]\R^d[/math] | [math]\DS \frac{\sigma^2 d}{n}[/math] | [math]\thetals[/math] | Theorem~Linear Regression Model |
[math]\cB_1[/math] | [math]\DS\sigma \sqrt{\frac{ \log d}{n}}[/math] | [math]\thetals_{\cB_1}[/math] | Theorem~Linear Regression Model |
[math]\cB_0(k)[/math] | [math]\DS\frac{ \sigma^2 k}{n}\log (ed/k)[/math] | [math]\thetals_{\cB_0(k)}[/math] | Corollaries~Linear Regression Model-Linear Regression Model |
\renewcommand{\arraystretch}{1} Can any of these results be improved? In other words, does there exists another estimator [math]\tilde \theta[/math] such that [math]\sup_{\theta^* \in \Theta}\E|\tilde \theta -\theta^*|_2^2\ll\phi(\Theta)[/math]? A first step in this direction is the Cram\'er-Rao lower bound [2] that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of minimax optimality that characterizes how fast [math]\theta^*[/math] can be estimated uniformly over [math]\Theta[/math].
We say that an estimator [math]\hat \theta_n[/math] is minimax optimal over [math]\Theta[/math] if it satisfies \eqref{EQ:minimaxUB_E} and there exists [math]C' \gt 0[/math] such that
Note that minimax rates of convergence [math]\phi[/math] are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as [math]\sigma^2d/n[/math] as opposed to [math]7\sigma^2d/n[/math] for example. This definition can be adapted to rates that hold with high probability. As we saw in Chapter~Linear Regression Model (Cf. Table~Minimax Lower Bounds), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any [math]A \gt 0[/math],
We say that an estimator [math]\hat \theta[/math] is minimax optimal over [math]\Theta[/math] if it satisfies either \eqref{EQ:minimaxUB_E} or \eqref{EQ:minimaxUB_P} and there exists [math]C' \gt 0[/math] such that
Reduction to finite hypothesis testing
Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distinguish between two probability distributions that are close to each other. For example, given [math]n[/math] i.i.d. observations, it is impossible to reliably decide whether they are drawn from [math]\cN(0,1)[/math] or [math]\cN(\frac1n,1)[/math]. This simple argument can be made precise using the formalism of statistical hypothesis testing. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps.
- Reduction to a finite number of parameters. In this step the goal is to find the largest possible number of parameters [math]\theta_1, \ldots, \theta_M \in \Theta[/math] under the constraint that
[[math]] \begin{equation} \label{EQ:dist_constraint} |\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)\,. \end{equation} [[/math]]This problem boils down to a packing of the set [math]\Theta[/math]. Then we can use the following trivial observations:[[math]] \inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\big[|\hat \theta- \theta|_2^2 \gt \phi(\Theta)\big] \ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]\,. [[/math]]
- Reduction to a hypothesis testing problem. In this second step, the necessity of the constraint~\eqref{EQ:dist_constraint} becomes apparent.
For any estimator [math]\hat \theta[/math], define the minimum distance test [math]\psi(\hat \theta)[/math] that is associated to it by
[[math]] \psi(\hat \theta)=\argmin_{1\le j \le M} |\hat \theta -\theta_j|_2\,, [[/math]]with ties broken arbitrarily. Next observe that if, for some [math]j=1, \ldots, M[/math], [math]\psi(\hat \theta)\neq j[/math], then there exists [math]k \neq j[/math] such that [math]|\hat \theta -\theta_k|_2 \le |\hat \theta -\theta_j|_2[/math]. Together with the reverse triangle inequality it yields[[math]] |\hat \theta - \theta_j|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_k|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_j|_2 [[/math]]so that[[math]] |\hat \theta - \theta_j|_2 \ge \frac12 |\theta_j -\theta_k|_2 [[/math]]Together with constraint~\eqref{EQ:dist_constraint}, it yields[[math]] |\hat \theta - \theta_j|_2^2 \ge \phi(\Theta) [[/math]]As a result,[[math]] \begin{align*} \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]&\ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[\psi(\hat \theta)\neq j\big]\\ &\ge \inf_{\psi}\max_{1\le j \le M}\p_{\theta_j}\big[\psi\neq j\big] \end{align*} [[/math]]where the infimum is taken over all tests [math]\psi[/math] based on [math]\bY[/math] and that take values in [math]\{1, \ldots, M\}[/math].
\textsc{Conclusion:} it is sufficient for proving lower bounds to find [math]\theta_1, \ldots, \theta_M \in \Theta[/math] such that [math]|\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)[/math] and
Lower bounds based on two hypotheses
The Neyman-Pearson Lemma and the total variation distance
Consider two probability measures [math]\p_0[/math] and [math]\p_1[/math] and observations [math]X[/math] drawn from either [math]\p_0[/math] or [math]\p_1[/math]. We want to know which distribution [math]X[/math] comes from. It corresponds to the following statistical hypothesis problem:
Let [math]\p_0[/math] and [math]\p_1[/math] be two probability measures. Then for any test [math]\psi[/math], it holds
Observe first that
The lower bound in the Neyman-Pearson lemma is related to a well-known quantity: the total variation distance. \begin{propdef} \label{PROP:TV} The total variation distance between two probability measures [math]\p_0[/math] and [math]\p_1[/math] on a measurable space [math](\cX, \cA)[/math] is defined by
The Kullback-Leibler divergence
The Kullback-Leibler divergence between probability measures [math]\p_1[/math] and [math]\p_0[/math] is given by
It can be shown [4] that the integral is always well defined when [math]\p_1 \ll \p_0[/math] (though it can be equal to [math]\infty[/math] even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.
Let [math]\p[/math] and [math]\q[/math] be two probability measures. Then
- [math]\KL(\p, \q)\ge 0[/math].
- The function [math](\p, \q) \mapsto \KL(\p,\q)[/math] is convex.
- If [math]\p[/math] and [math]\q[/math] are product measures, i.e.,
[[math]] \p=\bigotimes_{i=1}^n \p_i\quad \text{and} \quad \q=\bigotimes_{i=1}^n \q_i [[/math]]then[[math]] \KL(\p,\q)=\sum_{i=1}^n \KL(\p_i,\q_i)\,. [[/math]]
If [math]\p[/math] is not absolutely continuous then the result is trivial. Next, assume that [math]\p \ll \q[/math] and let [math]X \sim \p[/math]. 1. Observe that by Jensen's inequality,
2. Consider the function [math]f:(x,y) \mapsto x\log(x/y)[/math] and compute its Hessian:
Point 2. in Proposition~Minimax Lower Bounds is particularly useful in statistics where observations typically consist of [math]n[/math] independent random variables.Example \label{ex:KL_Gauss} For any [math]\theta \in \R^d[/math], let [math]P_\theta[/math] denote the distribution of [math]\bY \sim \cN(\theta, \sigma^2I_d)[/math]. Then
The Kullback-Leibler divergence is easier to manipulate than the total variation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker's inequality. We prove here a slightly weaker version of Pinsker's inequality that will be sufficient for our purpose. For a stronger statement, see [4], Lemma~2.5.
Let [math]\p[/math] and [math]\q[/math] be two probability measures such that [math]\p \ll \q[/math]. Then
Note that
Pinsker's inequality yields the following theorem for the GSM.
Assume that [math]\Theta[/math] contains two hypotheses [math]\theta_0[/math] and [math]\theta_1[/math] such that [math]|\theta_0-\theta_1|_2^2 = 8\alpha^2\sigma^2/n[/math] for some [math]\alpha \in (0,1/2)[/math]. Then
Write for simplicity [math]\p_j=\p_{\theta_j}[/math], [math]j=0,1[/math]. Recall that it follows from the reduction to hypothesis testing that
Clearly, the result of Theorem~Minimax Lower Bounds matches the upper bound for [math]\Theta=\R^d[/math] only for [math]d=1[/math]. How about larger [math]d[/math]? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of [math]\theta^*[/math]. With this knowledge, one can obtain upper bounds that do not depend on [math]d[/math] by simply projecting [math]Y[/math] onto the linear span of [math]\theta_0, \theta_1[/math] and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to [math]d[/math]. In principle, we should need at least order [math]d[/math] hypotheses but we will actually need much more.
Lower bounds based on many hypotheses
The reduction to hypothesis testing from Section~Reduction to finite hypothesis testing allows us to use more than two hypotheses. Specifically, we should find [math]\theta_1, \ldots, \theta_M[/math] such that
Let [math]P_1, \ldots, P_M, M \ge 2[/math] be probability distributions such that [math]P_j \ll P_k[/math], [math]\forall \, j ,k[/math]. Then
Define
Fano's inequality leads to the following useful theorem.
Assume that [math]\Theta[/math] contains [math]M \ge 5[/math] hypotheses [math]\theta_1, \ldots, \theta_M[/math] such that for some constant [math]0 \lt \alpha \lt 1/4[/math], it holds
- [math]\DS|\theta_j-\theta_k|_2^2 \ge 4\phi[/math]
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2\alpha\sigma^2 }{n}\log(M)[/math]
Then
in view of (i), it follows from the reduction to hypothesis testing that it is sufficient to prove that
Theorem~Minimax Lower Bounds indicates that we must take [math]\phi\le \frac{\alpha\sigma^2 }{2n}\log(M)[/math]. Therefore, the larger the [math]M[/math], the larger the lower bound can be. However, [math]M[/math] cannot be arbitrary larger because of the constraint [math](i)[/math]. We are therefore facing a packing problem where the goal is to ‘`pack" as many Euclidean balls of radius proportional to [math]\sigma\sqrt{\log(M)/n}[/math] in [math]\Theta[/math] under the constraint that their centers remain close together (constraint [math](ii)[/math]). If [math]\Theta=\R^d[/math], this the goal is to pack the Euclidean ball of radius [math]R=\sigma\sqrt{2\alpha\log(M)/n}[/math] with Euclidean balls of radius [math]R\sqrt{2\alpha/\gamma}[/math]. This can be done using a volume argument (see Problem~Minimax Lower Bounds). However, we will use the more versatile lemma below. It gives a a lower bound on the size of a packing of the discrete hypercube [math]\{0,1\}^d[/math] with respect to the ’'Hamming distance defined by
For any [math]\gamma \in (0,1/2)[/math], there exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{j}, \omega_{k})\ge \big(\frac12-\gamma\big)d[/math] for all [math]j \neq k[/math]\,,
- [math]\DS M =\lfloor e^{\gamma^2d}\rfloor\ge e^{\frac{\gamma^2d}{2}}[/math]\,.
Let [math]\omega_{j,i}[/math], [math]1\le i \le d, 1\le j \le M[/math] be \iid Bernoulli random variables with parameter [math]1/2[/math] and observe that
Application to the Gaussian sequence model
We are now in a position to apply Theorem~Minimax Lower Bounds by choosing [math]\theta_1, \ldots, \theta_M[/math] based on [math]\omega_{1}, \ldots, \omega_{M}[/math] from the Varshamov-Gilbert Lemma.
Lower bounds for estimation
Take [math]\gamma=1/4[/math] and apply the Varshamov-Gilbert Lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]M=\lfloor e^{d/16}\rfloor\ge e^{d/32} [/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge d/4[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \ge 4\frac{\beta^2\sigma^2 d}{16n}[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \le \frac{\beta^2\sigma^2 d}{n}\le \frac{32\beta^2\sigma^2 }{n}\log(M)=\frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\frac{\sqrt{\alpha}}{4}[/math]. Applying now Theorem~Minimax Lower Bounds yields
The minimax rate of estimation of over [math]\R^d[/math] in the Gaussian sequence model is [math]\phi(\R^d)=\sigma^2d/n[/math]. Moreover, it is attained by the least squares estimator [math]\thetals=\bY[/math].
Note that this rate is minimax over sets [math]\Theta[/math] that are strictly smaller than [math]\R^d[/math] (see Problem~Minimax Lower Bounds). Indeed, it is minimax over any subset of [math]\R^d[/math] that contains [math]\theta_1, \ldots, \theta_M[/math].
Lower bounds for sparse estimation
It appears from Table~Minimax Lower Bounds that when estimating sparse vectors, we have to pay for an extra logarithmic term [math]\log(ed/k)[/math] for not knowing the sparsity pattern of the unknown [math]\theta^*[/math]. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors. Note that the vectors [math]\theta_1, \ldots, \theta_M[/math] employed in the previous subsection are not guaranteed to be sparse because the vectors [math]\omega_1, \ldots, \omega_M[/math] obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varhsamov-Gilbert lemma.
There exist positive constants [math]C_1[/math] and [math]C_2[/math] such that the following holds for any two integers [math]k[/math] and [math]d[/math] such that [math]1\le k \le d/8[/math]. There exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{i}, \omega_{j})\ge \frac{k}2[/math] for all [math]i \neq j[/math]\,,
- [math]\DS \log(M) \ge \frac{k}{8}\log(1+\frac{d}{2k})[/math]\,.
- [math]\DS |\omega_j|_0=k[/math] for all [math]j[/math]\,.
Take [math]\omega_1, \ldots, \omega_M[/math] independently and uniformly at random from the set
Apply the sparse Varshamov-Gilbert lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]\log(M)\ge \frac{k}{8}\log(1+\frac{d}{2k})[/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge k/2[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \ge 4\frac{\beta^2\sigma^2 }{8n}k\log(1+\frac{d}{2k})[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \le \frac{2k\beta^2\sigma^2 }{n}\log(1+\frac{d}{2k})\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\sqrt{\frac{\alpha}{8}}[/math]. Applying now Theorem~Minimax Lower Bounds yields
Recall that [math]\cB_{0}(k) \subset \R^d[/math] denotes the set of all [math]k[/math]-sparse vectors of [math]\R^d[/math]. The minimax rate of estimation over [math]\cB_{0}(k)[/math] in the Gaussian sequence model is [math]\phi(\cB_{0}(k))=\frac{\sigma^2k}{n}\log(ed/k)[/math]. Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_0(k)}[/math].
Note that the modified BIC estimator of Problem~Linear Regression Model and the Slope eq:ad are also minimax optimal over [math]\cB_{0}(k)[/math]. However, unlike [math] \thetals_{\cB_0(k)} [/math] and the modified BIC, the Slope is also adaptive to [math] k [/math]. For any [math]\eps \gt 0[/math], the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that [math]k\le d^{1-\eps}[/math].
Lower bounds for estimating vectors in [math]\ell_1[/math] balls
Recall that in Maurey's argument, we approximated a vector [math]\theta[/math] such that [math]|\theta|_1=R[/math] by a vector [math]\theta'[/math] such that [math]|\theta'|_0=\frac{R}{\sigma}\sqrt{\frac{n}{\log d}}[/math]\,. We can essentially do the same for the lower bound. Assume that [math]d \ge \sqrt{n}[/math] and let [math]\beta \in (0,1)[/math] be a parameter to be chosen later and define [math]k[/math] to be the smallest integer such that
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{R^2}{k^2}\rho(\omega_{j}, \omega_{k})\ge \frac{R^2}{2k}\ge 4R\min\big(\frac{R}8, \beta^2\sigma\frac{\log (ed/\sqrt{n})}{8n}\big)[/math]\,.
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2R^2}{k} \le 4R\beta\sigma\sqrt{\frac{\log(ed/\sqrt{n})}{n}}\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta[/math] small enough if [math]d \ge Ck[/math] for some constant [math]C \gt 0[/math] chosen large enough. Applying now Theorem~Minimax Lower Bounds yields
Recall that [math]\cB_{1}(R) \subset \R^d[/math] denotes the set vectors [math]\theta \in \R^d[/math] such that [math]|\theta|_1 \le R[/math]. Then there exist a constant [math]C \gt 0[/math] such that if [math]d \ge n^{1/2+\eps}[/math], [math]\eps \gt 0[/math], the minimax rate of estimation over [math]\cB_{1}(R)[/math] in the Gaussian sequence model is
Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_1(R)}[/math] if [math]R\ge \sigma\frac{\log d}{n}[/math] and by the trivial estimator [math]\hat \theta =0[/math] otherwise.
To complete the proof of the statement, we need to study the risk of the trivial estimator equal to zero for small [math]R[/math]. Note that if [math]|\theta^*|_1\le R[/math], we have
Note that the inequality [math]|\theta^*|_2^2\le |\theta^*|_1^2[/math] appears to be quite loose. Nevertheless, it is tight up to a multiplicative constant for the vectors of the form [math]\theta_j=\omega_j\frac{R}{k}[/math] that are employed in the lower bound. Indeed, if [math]R\le \sigma\frac{\log d}{n}[/math], we have [math]k\le 2/\beta[/math]
Lower bounds for sparse estimation via \texorpdfstring{[math] \chi^2 [/math]}{χ2
divergence} In this section, we will show how to derive lower bounds by directly controlling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estimation problems. Define the [math] \chi^2 [/math] divergence between two probability distributions [math] \p, \q [/math] as
Secondly, we note that we can bound the KL divergence from above by the [math] \chi^2 [/math] divergence, via Jensen’s inequality, as
Consider the detection problem
We introduce the mixture hypothesis
Now, we need to take the expectation over two uniform draws of support sets [math] S [/math] and [math] T [/math], which reduces via conditioning and exploiting the independence of the two draws to
If instead of the [math] \chi^2 [/math] divergence, we try to compare the [math] \KL [/math] divergence between [math] \p_0 [/math] and [math] \bar{\p} [/math], we are tempted to exploit its convexity properties to get
This rate for detection implies lower rates for estimation of both [math] \theta^\ast [/math] and [math] \| \theta^\ast \|_2 [/math]: If given an estimator [math] \widehat{T}(X) [/math] for [math] T(\theta) = \| \theta \|_2 [/math] that achieves an error less than [math] \mu/2 [/math] for some [math] X [/math], then we would get a correct decision rule for those [math] X [/math] by setting
Lower bounds for estimating the \texorpdfstring{[math] \ell_1 [/math]}{ℓ1
norm via moment matching} So far, we have seen many examples of rates that scale with [math] n^{-1} [/math] for the squared error. In this section, we are going to see an example for estimating a functional that can only be estimated at a much slower rate of [math] 1/\log n [/math], following [5]. Consider the normalized [math] \ell_1 [/math] norm,
A constrained risk inequality
We start with a general principle that is in the same spirit as the Neyman-Pearson lemma, (Lemma Minimax Lower Bounds), but deals with expectations rather than probabilities and uses the [math] \chi^2 [/math] distance to estimate the closeness of distributions. Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors. In the following, we write [math] X [/math] for a random observation coming from a distribution indexed by [math] \theta \in \Theta [/math], [math] \widehat{T} = \widehat{T}(X) [/math] an estimator for a function [math] T(\theta) [/math], and write [math] B(\theta) = \E_\theta \widehat{T} - T(\theta) [/math] for the bias of [math] \widehat{T} [/math]. Let [math] \mu_0 [/math] and [math] \mu_1 [/math] be two prior distributions over [math] \Theta [/math] and denote by [math] m_i [/math], [math] v_i^2 [/math] the means and variance of [math] T(\theta) [/math] under the priors [math] \mu_i [/math], [math] i \in \{0, 1\} [/math],
(i) If [math] \int \E_\theta (\widehat{T}(X) - T(\theta))^2 d\mu_0(\theta) \leq \varepsilon^2 [/math],
(ii) If [math] | m_1 - m_0 | \gt v_0 I[/math] and [math] 0 \leq \lambda \leq 1 [/math],
Without loss of generality, assume [math] m_1 \geq m_0 [/math]. We start by considering the term
Finally, since the minimax risk is bigger than any mixture risk, we can bound it from below by the maximum value of this bound, which is obtained at [math] \lambda = (I+1)/(I + 2) [/math] to get \eqref{eq:minimax_risk_mixture}, and by the same argument \eqref{eq:minimax_risk}.
Unfavorable priors via polynomial approximation
In order to construct unfavorable priors for the [math] \ell_1 [/math] norm \eqref{eq:aw}, we resort to polynomial approximation theory. For a continuous function [math] f [/math] on [math] [-1, 1] [/math], denote the maximal approximation error by polynomials of degree [math] k [/math], written as [math] \mathcal{P}_k [/math], by
For an integer [math] k \gt 0 [/math], there are two probability measures [math] \nu_0 [/math] and [math] \nu_1 [/math] on [math] [-1, 1] [/math] that fulfill the following:
- [math] \nu_0 [/math] and [math] \nu_1 [/math] are symmetric about 0;
- [math] \int t^l d \nu_1 (t) = \int t^l d \nu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k \} [/math];
- [math]\int |t| d \nu_1(t) - \int | t | d \nu_0(t) = 2 \delta_k [/math].
The idea is to construct the measures via the Hahn-Banach theorem and the Riesz representation theorem. First, consider [math] f(t) = |t| [/math] as an element of the space [math] C([-1, 1]) [/math] equipped with the supremum norm and define [math] \mathcal{P}_k [/math] as the space of polynomials of degree up to [math] k [/math], and [math] \mathcal{F}_k := \operatorname{span}(\mathcal{P}_k \cup \{ f \}) [/math]. On [math] \mathcal{F}_k [/math], define the functional [math] T(c f + p_k) = c \delta_k [/math], which is well-defined because [math] f [/math] is not a polynomial. Claim: [math] \| T \| = \sup \{ T(g) : g \in \mathcal{F}_k, \| g \|_\infty \leq 1\} = 1 [/math]. [math] \| T \| \geq 1 [/math]: Let [math] G_k^\ast [/math] be the best-approximating polynomial to [math] f [/math] in [math] \mathcal{P}_k [/math]. Then, [math] \| f - G_k^\ast \|_\infty = \delta_k [/math], [math] \| (f - G^\ast_k)/\delta_k \|_\infty = 1 [/math], and [math] T((f - G^\ast_k)/\delta_k) = 1 [/math]. [math] \| T \| \leq 1 [/math]: Suppose [math] g = cf + p_k [/math] with [math] p_k \in \mathcal{P}_k [/math], [math] \| g \|_\infty = 1 [/math] and [math] T(g) \gt 1 [/math]. Then [math] c \gt 1/\delta_k [/math] and
Minimax lower bounds
We have the following bounds on the minimax rate for [math] T(\theta) = \frac{1}{n} \sum_{i = 1}^{n} | \theta_i| [/math]: For [math] \theta \in \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} [/math],
We want to use Theorem Minimax Lower Bounds. To construct the prior measures, we scale the measures from Lemma Minimax Lower Bounds appropriately: Let [math] k_n [/math] be an even integer that is to be determined, and [math] \nu_0 [/math], [math] \nu_1 [/math] the two measures given by Lemma Minimax Lower Bounds. Define [math] g(x) = Mx [/math] and define the measures [math] \mu_i [/math] by dilating [math] \nu_i [/math], [math] \mu_i(A) = \nu_i(g^{-1}(A)) [/math] for every Borel set [math] A \subseteq [-M, M] [/math] and [math] i \in \{0, 1\} [/math]. Hence,
- [math] \mu_0 [/math] and [math] \mu_1 [/math] are symmetric about 0;
- [math] \int t^l d \mu_1 (t) = \int t^l d \mu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k_n \} [/math];
- [math]\int |t| d \mu_1(t) - \int | t | d \mu_0(t) = 2 M \delta_{k_n} [/math].
To get priors for [math] n [/math] \iid samples, consider the product priors [math] \mu_i^n = \prod_{j=1}^n \mu_i [/math]. With this, we have
To prove the lower bound over [math] \R^n [/math], take [math] M = \sqrt{\log n} [/math] and [math] k_n [/math] to be the smallest integer such that [math] k_n \geq 2 \e \log n [/math] and plug this into \eqref{eq:ax}, yielding
Note that [5] also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating [math] T(\theta) [/math].
Problem Set
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
References
- BirgTemplate:\'e, Lucien (1983). "Approximation dans les espaces m{\'e}triques et th{\'e}orie de l'estimation". Z. Wahrsch. Verw. Gebiete 65.
- BirgTemplate:\'e, Lucien (1983). "Approximation dans les espaces m{\'e}triques et th{\'e}orie de l'estimation". Z. Wahrsch. Verw. Gebiete 65.
- BirgTemplate:\'e, Lucien (1983). "Approximation dans les espaces m{\'e}triques et th{\'e}orie de l'estimation". Z. Wahrsch. Verw. Gebiete 65.
- 4.0 4.1 LeCam, L. (1973). "Convergence of estimates under dimensionality restrictions". Ann. Statist. 1.
- 5.0 5.1 "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional" (2011). Ann. Statist. 39. The Institute of Mathematical Statistics.
- "Testing Composite Hypotheses, Template:Hermite Polynomials and Optimal Estimation of a Nonsmooth Functional" (2011). The Annals of Statistics 39.