\newcommand{\thetalsK}{\hat \theta^{ls}_K}
\newcommand{\muls}{\hat \mu^{ls}}
\newcommand{\KL}{\mathsf{KL}}
\newcommand{\TV}{\mathsf{TV}}
</math>
</div>
In the previous chapters, we have proved several upper bounds and the goal of this chapter is to assess their optimality. Specifically, our goal is to answer the following questions:
<ul><li> Can our analysis be improved? In other words: do the estimators that we have studied actually satisfy better bounds?
</li>
<li> Can our bounds be improved? In other words: is there another estimator that satisfies a better bound?
</li>
</ul>
Both questions ask about some form of ''optimality''. The first one is about optimality of an estimator, whereas the second one is about optimality of a bound.
The difficulty of these questions varies depending on whether we are looking for a positive or a negative answer. Indeed, a positive answer to these questions simply consists in finding a better proof for the estimator we have studied (question 1.) or simply finding a better estimator, together with a proof that it performs better (question 2.). A negative answer is much more arduous. For example, in question 2., it is a statement about ''all estimators''. How can this be done? The answer lies in information theory (see <ref name="CovTho06">{{cite book|last1=Cover|first1=Thomas M.|last2=Thomas|first2=Joy A.|year=2006|title=Elements of information theory|publisher=Wiley-Interscience [John Wiley & Sons]|isbn=978-0-471-24195-9|location=Hoboken, NJ|pages=xxiv+748}}</ref> for a nice introduction).
In this chapter, we will see how to give a negative answer to question 2. It will imply a negative answer to question 1.
==Optimality in a minimax sense==
Consider the Gaussian Sequence Model (GSM) where we observe <math>\bY=(Y_1, \ldots, Y_d)^\top</math>, defined by
<span id{{=}}"EQ:GSMminimax"/>
<math display="block">
\begin{equation}
\label{EQ:GSMminimax}
\bY=\theta^*+\varepsilon\,,
\end{equation}
</math>
where <math>\varepsilon=(\eps_1, \ldots, \eps_d)^\top \sim \cN_d(0,\frac{\sigma^2}{n}I_d)</math>, <math>\theta^*=(\theta^*_1, \ldots, \theta^*_d)^\top \in \Theta </math> is the parameter of interest and <math>\Theta \subset \R^d</math> is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by <math>\p_{\theta^*}</math> and <math>\E_{\theta^*}</math> the probability measure and corresponding expectation that are associated to the distribution of <math>\bY</math> from the GSM \eqref{EQ:GSMminimax}.
Recall that the GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (''upper bounds'') for various choices of <math>\Theta</math> that can be expressed either in the form
<span id{{=}}"EQ:minimaxUB_E"/>
<math display="block">
\begin{equation}
\label{EQ:minimaxUB_E}
\E\big[|\hat \theta_n-\theta^*|_2^2\big] \le C\phi(\Theta)
\end{equation}
</math>
or the form
<span id{{=}}"EQ:minimaxUB_P"/>
<math display="block">
\begin{equation}
\label{EQ:minimaxUB_P}
|\hat \theta_n-\theta^*|_2^2 \le C\phi(\Theta)\,, \quad \text{with high probability}
\end{equation}
</math>
for some constant <math>C</math>. The rates <math>\phi(\Theta)</math> for different choices of <math>\Theta</math> that we have obtained are gathered in [[#TAB:minimaxUB |Table]] together with the estimator (and the corresponding result from Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]]) that was employed to obtain this rate.
<span id{{=}}"TAB:minimaxUB"/>
{|class="table"
|+ Rate <math>\phi(\Theta)</math> obtained for different choices of <math>\Theta</math>.
|-
|<math>\Theta</math> || <math>\phi(\Theta)</math> || Estimator || Result
|-
|<math>\R^d</math> || <math>\DS \frac{\sigma^2 d}{n}</math> || <math>\thetals</math> || [[guide:090050c501#TH:lsOI |Theorem]]
|-
|<math>\cB_1</math> || <math>\DS\sigma \sqrt{\frac{ \log d}{n}}</math> || <math>\thetals_{\cB_1}</math> || [[guide:090050c501#TH:ell1const |Theorem]]
|-
|<math>\cB_0(k)</math> || <math>\DS\frac{ \sigma^2 k}{n}\log (ed/k)</math> || <math>\thetals_{\cB_0(k)}</math> || [[guide:090050c501#COR:bss1 |Corollary]]-[[guide:090050c501#COR:bss2 |Corollary]]
|}
Can any of these results be improved? In other words, does there exist another estimator <math>\tilde \theta</math> such that <math>\sup_{\theta^* \in \Theta}\E|\tilde \theta -\theta^*|_2^2\ll\phi(\Theta)</math>?
A first step in this direction is the Cramér-Rao lower bound <ref name="Sha03">{{cite book|last=Shao|first=Jun|year=2003|title=Mathematical statistics|publisher=Springer-Verlag|isbn=0-387-95382-5|location=New York|pages=xvi+591}}</ref> that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of ''minimax optimality'' that characterizes how fast <math>\theta^*</math> can be estimated ''uniformly'' over <math>\Theta</math>.
{{defncard|label=|id=|We say that an estimator <math>\hat \theta_n</math> is ''minimax optimal over <math>\Theta</math>'' if it satisfies \eqref{EQ:minimaxUB_E} and there exists <math>C' > 0</math> such that
<span id{{=}}"EQ:minimaxLB1"/>
<math display="block">
\begin{equation}
\label{EQ:minimaxLB1}
\inf_{\hat \theta}\sup_{\theta \in \Theta}\E_\theta\big[|\hat \theta-\theta|_2^2\big] \ge C'\phi(\Theta)\,,
\end{equation}
</math>
where the infimum is taken over all estimators (i.e., measurable functions of <math>\bY</math>). Moreover, <math>\phi(\Theta)</math> is called ''minimax rate of estimation over <math>\Theta</math>''.}}
Note that minimax rates of convergence <math>\phi</math> are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as <math>\sigma^2d/n</math> as opposed to <math>7\sigma^2d/n</math> for example.
This definition can be adapted to rates that hold with high probability. As we saw in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] (cf. [[#TAB:minimaxUB |Table]]), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any <math>A > 0</math>,
<span id{{=}}"EQ:markov_minimax"/>
<math display="block">
\begin{equation}
\label{EQ:markov_minimax}
\inf_{\hat \theta}\sup_{\theta \in \Theta}\E_\theta\big[|\hat \theta-\theta|_2^2\big] \ge A\phi(\Theta)\inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\Big(|\hat \theta-\theta|_2^2 > A\phi(\Theta)\Big)\,.
\end{equation}
</math>
Therefore, \eqref{EQ:minimaxLB1} follows if we prove that
<math display="block">
\inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\Big(|\hat \theta-\theta|_2^2 > A\phi(\Theta)\Big) \ge C''
</math>
for some positive constants <math>A</math> and <math>C''</math>. The above inequality also implies a lower bound with high probability. We can therefore employ the following alternate definition for minimax optimality.
{{defncard|label=|id=|We say that an estimator <math>\hat \theta</math> is ''minimax optimal over <math>\Theta</math>'' if it satisfies either \eqref{EQ:minimaxUB_E} or \eqref{EQ:minimaxUB_P} and there exist positive constants <math>C'</math> and <math>C''</math> such that
<span id{{=}}"EQ:minimaxLB"/> | |||
<math display="block">
\begin{equation}
\label{EQ:minimaxLB}
\inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\Big(|\hat \theta-\theta|_2^2 > C'\phi(\Theta)\Big) \ge C''\,,
\end{equation}
</math>
where the infimum is taken over all estimators (i.e., measurable functions of <math>\bY</math>). Moreover, <math>\phi(\Theta)</math> is called ''minimax rate of estimation over <math>\Theta</math>''.}}
==<span id{{=}}"SEC:reduc"></span>Reduction to finite hypothesis testing==
Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distinguish between two probability distributions that are close to each other. For example, given <math>n</math> i.i.d. observations, it is impossible to reliably decide whether they are drawn from <math>\cN(0,1)</math> or <math>\cN(\frac1n,1)</math>. This simple argument can be made precise using the formalism of ''statistical hypothesis testing''. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps.
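The indistinguishability claim above can be illustrated numerically. The sketch below (illustrative only; the helper names are ours) uses the fact that the sample mean is a sufficient statistic for a Gaussian mean with known variance, so the total variation distance between the two <math>n</math>-fold product measures equals the distance between the laws of the mean, two Gaussians at closed-form distance <math>2\Phi(|\Delta|/(2\sigma))-1</math>:

```python
from math import erf, sqrt

def gauss_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tv_two_gaussians(delta, sigma):
    # TV(N(0, sigma^2), N(delta, sigma^2)) = 2 * Phi(|delta| / (2 sigma)) - 1.
    return 2.0 * gauss_cdf(abs(delta) / (2.0 * sigma)) - 1.0

def tv_n_samples(n):
    # The sample mean is sufficient, so the TV between the n-fold products of
    # N(0, 1) and N(1/n, 1) equals the TV between N(0, 1/n) and N(1/n, 1/n).
    return tv_two_gaussians(1.0 / n, 1.0 / sqrt(n))

for n in (10, 100, 10_000):
    print(n, tv_n_samples(n))
```

As <math>n</math> grows, the distance tends to <math>0</math>: no test can reliably tell the two hypotheses apart.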
<ul><li> '''Reduction to a finite number of parameters.''' In this step the goal is to find the largest possible number of parameters <math>\theta_1, \ldots, \theta_M \in \Theta</math> under the constraint that
<span id{{=}}"EQ:dist_constraint"/>
<math display="block">
\begin{equation}
\label{EQ:dist_constraint}
\min_{j \neq k}|\theta_j-\theta_k|_2^2 \ge 4\phi(\Theta)
\end{equation}
</math>
</li>
<li> '''Reduction to a hypothesis testing problem.''' In this second step, the necessity of the constraint \eqref{EQ:dist_constraint} becomes apparent.
For any estimator <math>\hat \theta</math>, define the minimum distance test <math>\psi(\hat \theta)</math> that is associated to it by
<math display="block">
\psi(\hat \theta)=\operatorname{argmin}_{1 \le j \le M} |\hat \theta - \theta_j|_2\,,
</math>
with ties broken arbitrarily. Next, observe that if, for some <math>j</math>, we have <math>\psi(\hat \theta) \neq j</math>, then there exists <math>k \neq j</math> such that <math>|\hat \theta - \theta_k|_2 \le |\hat \theta - \theta_j|_2</math>. Together with the triangle inequality, it yields
<math display="block">
|\hat \theta - \theta_j|_2 \ge \frac12 |\theta_j -\theta_k|_2\,.
</math>
Together with constraint \eqref{EQ:dist_constraint}, it yields
<math display="block">
|\hat \theta - \theta_j|_2^2 \ge \frac14|\theta_j-\theta_k|_2^2 \ge \phi(\Theta)\,,
</math>
so that for any <math>j</math>,
<math display="block">
\p_{\theta_j}\big(|\hat \theta - \theta_j|_2^2 \ge \phi(\Theta)\big) \ge \p_{\theta_j}\big(\psi(\hat \theta) \neq j\big)\,.
</math>
</li>
</ul>
Conclusion: it is sufficient for proving lower bounds to find <math>\theta_1, \ldots, \theta_M \in \Theta</math> such that <math>|\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)</math> and
<math display="block">
\inf_{\psi}\max_{1\le j \le M}\p_{\theta_j}\big(\psi \neq j\big) \ge C''\,,
</math>
where the infimum is taken over all tests <math>\psi</math> based on <math>\bY</math> and taking values in <math>\{1, \ldots, M\}</math>.
==Lower bounds based on two hypotheses==
===The Neyman-Pearson Lemma===
Consider two probability measures <math>\p_0</math> and <math>\p_1</math> and assume that we observe <math>Z \sim \p</math>, where <math>\p \in \{\p_0, \p_1\}</math>. The goal of hypothesis testing is to decide, based on <math>Z</math>, between the two hypotheses
<math display="block">
H_0: \p=\p_0 \qquad \text{and} \qquad H_1: \p=\p_1\,.
</math>
A test <math>\psi=\psi(Z) \in \{0,1\}</math> indicates which hypothesis should be true. Any test <math>\psi</math> can make two types of errors. It can commit either an error of type I (<math>\psi=1</math> whereas <math>Z \sim \p_0</math>) or an error of type II (<math>\psi=0</math> whereas <math>Z \sim \p_1</math>). Of course, the test may also be correct. The following fundamental result, called the ''Neyman-Pearson Lemma'', indicates that any test <math>\psi</math> is bound to commit one of these two types of error with positive probability unless <math>\p_0</math> and <math>\p_1</math> have essentially disjoint support.
Let <math>\nu</math> be a sigma-finite measure satisfying <math>\p_0 \ll \nu</math> and <math>\p_1\ll \nu</math>. For example, we can take <math>\nu = \p_0+\p_1</math>. It follows from the Radon-Nikodym theorem <ref name="Bil95">{{cite book|last=Billingsley|first=Patrick|year=1995|title=Probability and measure|publisher=John Wiley & Sons Inc.|isbn=0-471-00710-2|location=New York|pages=xiv+593}}</ref> that both <math>\p_0</math> and <math>\p_1</math> admit probability densities with respect to <math>\nu</math>. We denote them by <math>p_0</math> and <math>p_1</math> respectively. For any function <math>f</math>, we write for simplicity
<math display="block">
\int f = \int f(x)\,\nu(\mathrm{d}x)\,.
</math>
{{proofcard|Lemma (Neyman-Pearson)|LEM:NP|Let <math>\p_0</math> and <math>\p_1</math> be two probability measures. Then for any test <math>\psi</math>, it holds
<math display="block">
\p_0(\psi=1)+\p_1(\psi=0) \ge \int \min(p_0, p_1)\,,
</math>
where equality holds for the ''likelihood ratio test'' <math>\psi^\star=\mathbf{1}(p_1\ge p_0)</math>.|Let <math>R=\{\psi=1\}</math> denote the rejection region of <math>\psi</math> and define <math>R^\star=\{p_1 \ge p_0\}</math>, the rejection region of <math>\psi^\star</math>. It holds
<math display="block">
\begin{align*}
\p_0(\psi=1)+\p_1(\psi=0)&=\int_R p_0 + \int_{R^c} p_1\\
&=1+\int_R (p_0 - p_1)\\
&\ge 1+\int_{R^\star}(p_0-p_1)=\int \min(p_0,p_1)\,.
\end{align*}
</math>
The above quantity is clearly minimized for <math>R=R^\star</math>.}}
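For two Gaussian location hypotheses, both sides of the Neyman-Pearson bound admit closed forms: the likelihood ratio test rejects beyond the midpoint <math>\Delta/2</math>, and <math>\int\min(p_0,p_1)=2(1-\Phi(\Delta/(2\sigma)))</math>. The sketch below (our own illustration, assuming <math>\p_0=\cN(0,\sigma^2)</math> and <math>\p_1=\cN(\Delta,\sigma^2)</math>) checks that the two error probabilities of the likelihood ratio test sum exactly to the lower bound:

```python
from math import erf, sqrt

def gauss_cdf(x):
    # Standard normal CDF.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def lrt_error_sum(delta, sigma=1.0):
    # P0(psi* = 1) + P1(psi* = 0), where psi* rejects when the observation
    # exceeds the midpoint delta / 2 (the likelihood ratio test).
    z = delta / (2.0 * sigma)
    return (1.0 - gauss_cdf(z)) + gauss_cdf(-z)

def integral_min_densities(delta, sigma=1.0):
    # Closed form of the integral of min(p0, p1) for this Gaussian pair.
    return 2.0 * (1.0 - gauss_cdf(delta / (2.0 * sigma)))

for delta in (0.5, 1.0, 2.0):
    print(delta, lrt_error_sum(delta), integral_min_densities(delta))
```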
The lower bound in the Neyman-Pearson lemma is related to a well-known quantity: the total variation distance.
{{proofcard|Definition-Proposition|PROP:TV|The ''total variation distance'' between two probability measures <math>\p_0</math> and <math>\p_1</math> on a measurable space <math>(\cX, \cA)</math> is defined by | |||
<math display="block">
\begin{align*}
\TV(\p_0, \p_1)&=\sup_{R \in \cA}|\p_0(R)-\p_1(R)| & (i)\\
&=\sup_{R \in \cA}\Big|\int_R p_0 - \int_R p_1\Big| & (ii)\\
&=\frac12 \int |p_0-p_1| & (iii)\\
&=1-\int \min(p_0, p_1) & (iv)\\
&=\inf_{\psi}\big[\p_0(\psi=1)+\p_1(\psi=0)\big] & (v)
\end{align*}
</math>
where the infimum above is taken over all tests.|Clearly <math>(i)=(ii)</math> and the Neyman-Pearson Lemma gives <math>(iv)=(v)</math>. Moreover, by identifying a test <math>\psi</math> to its rejection region, it is not hard to see that <math>(i)=(v)</math>. Therefore it remains only to show that <math>(iii)</math> is equal to any of the other expressions. Hereafter, we show that <math>(iii)=(iv)</math>. To that end, observe that
<math display="block">
\begin{align*}
\int |p_0-p_1|&=\int \max(p_0, p_1)-\int \min(p_0, p_1)\\
&=\int \big(p_0+p_1-\min(p_0,p_1)\big)-\int \min(p_0, p_1)\\
&=2-2\int\min(p_0,p_1)
\end{align*}
</math>}}
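Expression <math>(iii)</math> gives a direct way to evaluate the total variation distance numerically. The sketch below (ours; a crude Riemann sum, not a chapter algorithm) compares <math>\frac12\int|p_0-p_1|</math> with the closed form <math>2\Phi(|\mu_1-\mu_0|/(2\sigma))-1</math>, valid for two Gaussians with a common variance:

```python
from math import erf, exp, pi, sqrt

def npdf(x, mu, sigma):
    # Gaussian density.
    return exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * sqrt(2.0 * pi))

def tv_numeric(mu0, mu1, sigma, lo=-15.0, hi=15.0, steps=20_000):
    # Expression (iii): TV = (1/2) * integral of |p0 - p1| (midpoint rule).
    h = (hi - lo) / steps
    return 0.5 * h * sum(
        abs(npdf(lo + (i + 0.5) * h, mu0, sigma) - npdf(lo + (i + 0.5) * h, mu1, sigma))
        for i in range(steps)
    )

def tv_closed_form(mu0, mu1, sigma):
    # TV between N(mu0, sigma^2) and N(mu1, sigma^2); erf(z/sqrt(2)) = 2*Phi(z)-1.
    return erf(abs(mu1 - mu0) / (2.0 * sigma * sqrt(2.0)))

print(tv_numeric(0.0, 1.0, 1.0), tv_closed_form(0.0, 1.0, 1.0))
```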
In view of the Neyman-Pearson lemma, it is clear that if we want to prove large lower bounds, we need to find probability distributions that are close in total variation. Yet, this conflicts with constraint \eqref{EQ:dist_constraint} and a tradeoff needs to be achieved. To that end, in the Gaussian sequence model, we need to be able to compute the total variation distance between <math>\cN(\theta_0, \frac{\sigma^2}{n}I_d)</math> and <math>\cN(\theta_1, \frac{\sigma^2}{n}I_d)</math>. None of the expressions in [[#PROP:TV |Definition-Proposition]] gives an easy way to do so. The Kullback-Leibler divergence is much more convenient.
===The Kullback-Leibler divergence===
{{defncard|label=|id=|The Kullback-Leibler divergence between probability measures <math>\p_1</math> and <math>\p_0</math> is given by
<math display="block">
\KL(\p_1, \p_0)=\left\{
\begin{array}{ll}
\displaystyle\int \log\Big(\frac{\mathrm{d}\p_1}{\mathrm{d}\p_0}\Big)\mathrm{d}\p_1 & \text{if } \p_1 \ll \p_0\,,\\
+\infty & \text{otherwise}\,.
\end{array}\right.
</math>
}}
It can be shown <ref name="Tsy09">{{cite book|last=Tsybakov|first=Alexandre B.|year=2009|title=Introduction to nonparametric estimation|publisher=Springer|isbn=978-0-387-79051-0|location=New York|pages=xii+214}}</ref> that the integral is always well defined when <math>\p_1 \ll \p_0</math> (though it can be equal to <math>\infty</math> even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.
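The lack of symmetry is easy to exhibit with two Gaussians of different variances, using the standard closed form <math>\KL(\cN(\mu_1,\sigma_1^2), \cN(\mu_2,\sigma_2^2))=\log(\sigma_2/\sigma_1)+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac12</math> (a classical formula, not derived in this chapter):

```python
from math import log

def kl_gauss(mu1, s1, mu2, s2):
    # KL(N(mu1, s1^2), N(mu2, s2^2)); standard univariate Gaussian formula.
    return log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

# Swapping the arguments changes the value: KL is not symmetric.
print(kl_gauss(0.0, 1.0, 0.0, 2.0), kl_gauss(0.0, 2.0, 0.0, 1.0))
```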
{{proofcard|Proposition|PROP:KL|Let <math>\p</math> and <math>\q</math> be two probability measures. Then
<ul><li> <math>\KL(\p, \q)\ge 0</math>.
</li>
<li> If <math>\p</math> and <math>\q</math> are product measures, i.e., <math>\p=\p_1 \otimes \cdots \otimes \p_n</math> and <math>\q=\q_1\otimes \cdots \otimes \q_n</math>, then
<math display="block">
\KL(\p, \q)=\sum_{i=1}^n \KL(\p_i, \q_i)\,.
</math>
</li>
</ul>
|Assume first that <math>\p \ll \q</math> (otherwise <math>\KL(\p,\q)=\infty \ge 0</math>) and let <math>p</math> and <math>q</math> denote densities of <math>\p</math> and <math>\q</math> with respect to a common dominating measure. Since <math>-\log(\cdot)</math> is convex, Jensen's inequality yields
<math display="block">
\KL(\p, \q)=-\int p\log\Big(\frac{q}{p}\Big) \ge -\log\Big(\int p\,\frac{q}{p}\Big) \ge -\log(1)=0\,.
</math>
For the second point, observe that the density of a product measure is the product of the individual densities, so that by Fubini's theorem,
<math display="block">
\begin{align*}
\KL(\p, \q)&=\int \sum_{i=1}^n \log\Big(\frac{p_i(x_i)}{q_i(x_i)}\Big)\prod_{j=1}^n p_j(x_j)\,\mathrm{d}x_j\\
&=\sum_{i=1}^n \int \log\Big(\frac{p_i(x_i)}{q_i(x_i)}\Big) p_i(x_i)\,\mathrm{d}x_i\\
&=\sum_{i=1}^n \KL(\p_i, \q_i)
\end{align*}
</math>}}
Point 2. in [[#PROP:KL |Proposition]] is particularly useful in statistics where observations typically consist of <math>n</math> independent random variables.
'''Example'''
<span id{{=}}"ex:KL_Gauss"/>
For any <math>\theta \in \R^d</math>, let <math>P_\theta</math> denote the distribution of <math>\bY \sim \cN(\theta, \sigma^2I_d)</math>. Then
<math display="block">
\KL(P_\theta, P_{\theta'})=\sum_{i=1}^d \frac{(\theta_i -\theta'_i)^2}{2\sigma^2}=\frac{|\theta-\theta'|_2^2}{2\sigma^2}\,.
</math>
The proof is left as an exercise (see [[exercise:59a2bb2f30 |Problem]]).
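The formula in the example can be checked numerically in dimension <math>d=1</math>; the sketch below (ours, illustrative only) integrates <math>p_\theta\log(p_\theta/p_{\theta'})</math> with a Riemann sum and compares the result to <math>(\theta-\theta')^2/(2\sigma^2)</math>:

```python
from math import exp, log, pi, sqrt

def npdf(x, mu, sigma):
    # Gaussian density.
    return exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * sqrt(2.0 * pi))

def kl_numeric(theta, theta_prime, sigma, lo=-15.0, hi=15.0, steps=20_000):
    # KL(P_theta, P_theta') = integral of p_theta * log(p_theta / p_theta').
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p = npdf(x, theta, sigma)
        q = npdf(x, theta_prime, sigma)
        total += p * log(p / q) * h
    return total

def kl_closed_form(theta, theta_prime, sigma):
    # The formula from the example with d = 1.
    return (theta - theta_prime) ** 2 / (2.0 * sigma ** 2)

print(kl_numeric(0.0, 1.0, 1.0), kl_closed_form(0.0, 1.0, 1.0))
```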
The Kullback-Leibler divergence is easier to manipulate than the total variation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker's inequality. We prove here a slightly weaker version of Pinsker's inequality that will be sufficient for our purpose. For a stronger statement, see <ref name="Tsy09"/>, Lemma 2.5.
{{proofcard|Lemma (Pinsker's inequality)|LEM:Pinsker|Let <math>\p</math> and <math>\q</math> be two probability measures such that <math>\p \ll \q</math>. Then
<math display="block">
\TV(\p, \q) \le \sqrt{\KL(\p, \q)}\,.
</math>
|Let <math>p</math> and <math>q</math> be densities of <math>\p</math> and <math>\q</math> with respect to a common dominating measure. On the one hand, by Jensen's inequality,
<math display="block">
-\frac{\KL(\p,\q)}{2}=\int p \log\Big(\sqrt{\frac{q}{p}}\Big) \le \log\Big(\int \sqrt{pq}\Big)\,,
</math>
so that <math>\int \sqrt{pq} \ge e^{-\KL(\p,\q)/2} \ge 1-\frac{\KL(\p,\q)}{2}</math>. On the other hand, by the Cauchy-Schwarz inequality and [[#PROP:TV |Definition-Proposition]],
<math display="block">
\int \sqrt{pq}=\int \sqrt{\min(p,q)\max(p,q)} \le \sqrt{\int \min(p,q)\int \max(p,q)}=\sqrt{\big(1-\TV(\p,\q)\big)\big(1+\TV(\p,\q)\big)}\,.
</math>
Combining the two bounds yields
<math display="block">
1-\frac{\KL(\p,\q)}{2} \le \sqrt{1-\TV(\p,\q)^2} \le 1-\frac{\TV(\p,\q)^2}{2}\,,
</math>
and therefore <math>\TV(\p,\q)^2 \le \KL(\p,\q)</math>,
where we used the fact that <math>0\le \TV(\p,\q) \le 1</math> and <math>\sqrt{1-x} \le 1-x/2</math> for <math>x \in [0,1]</math>.}}
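Pinsker's inequality can be sanity-checked on Gaussian location families using the closed forms established above (total variation) and in the example (Kullback-Leibler); the helper names below are ours:

```python
from math import erf, sqrt

def tv_gauss(delta, sigma=1.0):
    # TV(N(0, sigma^2), N(delta, sigma^2)) = 2 Phi(|delta|/(2 sigma)) - 1.
    return erf(abs(delta) / (2.0 * sigma * sqrt(2.0)))

def kl_gauss(delta, sigma=1.0):
    # KL between the same pair: delta^2 / (2 sigma^2).
    return delta ** 2 / (2.0 * sigma ** 2)

for delta in (0.01, 0.1, 1.0, 3.0, 10.0):
    assert tv_gauss(delta) <= sqrt(kl_gauss(delta))
print("TV <= sqrt(KL) on all tested pairs")
```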
Pinsker's inequality yields the following theorem for the GSM.
{{proofcard|Theorem|TH:LB2hyp|Assume that <math>\Theta</math> contains two hypotheses <math>\theta_0</math> and <math>\theta_1</math> such that <math>|\theta_0-\theta_1|_2^2 = 8\alpha^2\sigma^2/n</math> for some <math>\alpha \in (0,1/2)</math>. Then
<math display="block">
\inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\Big(|\hat \theta-\theta|_2^2 \ge \frac{2\alpha^2\sigma^2}{n}\Big) \ge \frac12-\alpha\,.
</math>
|Write <math>\p_j=\p_{\theta_j}</math>, <math>j=0,1</math>. Applying the reduction to hypothesis testing of Section [[#SEC:reduc |Reduction to finite hypothesis testing]] with <math>M=2</math> and <math>\phi=\frac14|\theta_0-\theta_1|_2^2=\frac{2\alpha^2\sigma^2}{n}</math>, we get
<math display="block">
\begin{align*}
\inf_{\hat \theta} \sup_{\theta \in \Theta}\p_\theta(|\hat \theta-\theta|_2^2 \ge \frac{2\alpha^2\sigma^2}{n})&\ge \inf_{\psi}\max_{j=0,1}\p_j(\psi\neq j)&\\
&\ge \frac12\inf_{\psi}\Big(\p_0(\psi=1)+\p_1(\psi=0)\Big)&\\
&=\frac12\Big[1-\TV(\p_0,\p_1)\Big]\\
&\ge \frac12\Big[1-\sqrt{\KL(\p_1, \p_0)}\Big]\\
&= \frac12\Big[1-\sqrt{\frac{n|\theta_1-\theta_0|_2^2}{2\sigma^2}}\Big]\\
&= \frac12\Big[1-2\alpha\Big]&
\end{align*}
</math>}}
Clearly, the result of [[#TH:LB2hyp |Theorem]] matches the upper bound for <math>\Theta=\R^d</math> only for <math>d=1</math>. How about larger <math>d</math>? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of <math>\theta^*</math>. With this knowledge, one can obtain upper bounds that do not depend on <math>d</math> by simply projecting <math>Y</math> onto the linear span of <math>\theta_0, \theta_1</math> and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to <math>d</math>. In principle, we should need at least order <math>d</math> hypotheses but we will actually need much more.
==Lower bounds based on many hypotheses==
The reduction to hypothesis testing from Section [[#SEC:reduc |Reduction to finite hypothesis testing]] allows us to use more than two hypotheses. Specifically, we should find <math>\theta_1, \ldots, \theta_M</math> such that
<math display="block">
(i)\quad |\theta_j-\theta_k|_2^2 \ge 4\phi(\Theta) \qquad \text{and} \qquad (ii)\quad |\theta_j-\theta_k|_2^2 \le \frac{2\alpha\sigma^2}{n}\log(M)\,, \qquad \text{for all } j \neq k\,,
</math>
for some <math>\alpha \in (0,1/4)</math>. The main tool to handle more than two hypotheses is ''Fano's inequality''. Fix a test <math>\psi</math> taking values in <math>\{1, \ldots, M\}</math> and denote by <math>\mathsf{kl}(p,q)</math> the Kullback-Leibler divergence between two Bernoulli distributions with parameters <math>p</math> and <math>q</math>. Define the average probability of correct decision <math>\bar p=\frac{1}{M}\sum_{j=1}^M \p_{\theta_j}(\psi=j)</math> and let <math>\bar q=\frac{1}{M}</math>. Then a direct computation with Bernoulli distributions yields
<math display="block">
\bar p \le \frac{\mathsf{kl}(\bar p, \bar q)+ \log 2}{\log M}\,.
</math>
Next, observe that by joint convexity of the Kullback-Leibler divergence, it holds
<math display="block">
\begin{align*}
\mathsf{kl}(\bar p, \bar q) &\le \frac{1}{M^2}\sum_{j,k=1}^M \mathsf{kl}\big(\p_{\theta_j}(\psi=j),\, \p_{\theta_k}(\psi=j)\big)\\
&\le \frac{1}{M^2}\sum_{j,k=1}^M \KL(\p_{\theta_j}, \p_{\theta_k})
\end{align*}
</math>
where the last inequality follows from the data processing inequality for the Kullback-Leibler divergence.
We have proved that
<math display="block">
\frac{1}{M}\sum_{j=1}^M \p_{\theta_j}(\psi \neq j) \ge 1-\frac{\frac{1}{M^2}\sum_{j,k=1}^M\KL(\p_{\theta_j}, \p_{\theta_k})+\log 2}{\log M}\,,
</math>
which is a form of Fano's inequality. Together with the reduction of Section [[#SEC:reduc |Reduction to finite hypothesis testing]], it yields the following theorem.
{{proofcard|Theorem|TH:mainLB|Assume that <math>\Theta</math> contains <math>M \ge 5</math> hypotheses <math>\theta_1, \ldots, \theta_M</math> that satisfy conditions <math>(i)</math> and <math>(ii)</math> above for some <math>\alpha \in (0,1/4)</math>. Then
<math display="block">
\inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\Big(|\hat \theta-\theta|_2^2 \ge \phi(\Theta)\Big) \ge \frac12-2\alpha\,.
</math>
|In view of the reduction to hypothesis testing, it is sufficient to prove that
<math display="block">
\inf_{\psi}\max_{1\le j \le M}\p_{\theta_j}\big[\psi\neq j\big] \ge \frac{1}{2}-2\alpha\,.
</math>
It follows from (''ii'') and [[#ex:KL_Gauss |Example]] that
<math display="block">
\frac{1}{M^2}\sum_{j,k=1}^M \KL(\p_{\theta_j}, \p_{\theta_k})=\frac{n}{2\sigma^2 M^2}\sum_{j,k=1}^M|\theta_j-\theta_k|_2^2 \le \frac{n}{2\sigma^2}\cdot \frac{2\alpha\sigma^2}{n}\log(M)=\alpha \log(M)\,.
</math>
The proof then follows from Fano's inequality.}}
[[#TH:mainLB |Theorem]] indicates that we must take <math>\phi\le \frac{\alpha\sigma^2 }{2n}\log(M)</math>. Therefore, the larger <math>M</math>, the larger the lower bound can be. However, <math>M</math> cannot be arbitrarily large because of constraint <math>(i)</math>. We are therefore facing a ''packing'' problem where the goal is to "pack" as many Euclidean balls of radius proportional to <math>\sigma\sqrt{\log(M)/n}</math> in <math>\Theta</math> under the constraint that their centers remain close together (constraint <math>(ii)</math>). If <math>\Theta=\R^d</math>, the goal is to pack the Euclidean ball of radius <math>R=\sigma\sqrt{2\alpha\log(M)/n}</math> with Euclidean balls of radius <math>R\sqrt{2\alpha/\gamma}</math>. This can be done using a volume argument (see [[exercise:83c01895fd |Problem]]). However, we will use the more versatile lemma below. It gives a lower bound on the size of a packing of the discrete hypercube <math>\{0,1\}^d</math> with respect to the ''Hamming distance'' defined by
<math display="block">
\rho(\omega, \omega')=\sum_{i=1}^d \mathbf{1}(\omega_i \neq \omega_i')\,, \qquad \omega, \omega' \in \{0,1\}^d\,.
</math>
{{proofcard|Lemma (Varshamov-Gilbert)|LEM:VG|For any <math>\gamma \in (0,1/2)</math>, there exist binary vectors <math>\omega_1, \ldots, \omega_M \in \{0,1\}^d</math> such that
<ul style{{=}}"list-style-type:lower-roman"><li> <math>\rho(\omega_j, \omega_k) \ge \big(\frac12-\gamma\big)d</math> for all <math>j \neq k</math>;</li>
<li> <math>M=\lfloor e^{\gamma^2 d}\rfloor \ge e^{\gamma^2 d/2}</math>.</li>
</ul>
|Let <math>\omega_{j,i}</math>, <math>1\le j \le M</math>, <math>1 \le i \le d</math>, be i.i.d. Bernoulli random variables with parameter <math>1/2</math>. For any fixed <math>j \neq k</math>, <math>\rho(\omega_j, \omega_k)</math> is a sum of <math>d</math> i.i.d. Bernoulli random variables with parameter <math>1/2</math>, so that Hoeffding's inequality yields
<math display="block">
\p\Big(\rho(\omega_j, \omega_k) < \big(\frac12-\gamma\big)d\Big) \le e^{-2\gamma^2 d}\,.
</math>
A union bound over the <math>\binom{M}{2}\le M^2/2</math> pairs gives
<math display="block">
\p\Big(\exists\, j \neq k\,:\, \rho(\omega_j, \omega_k) < \big(\frac12-\gamma\big)d\Big) \le \frac{M^2}{2}e^{-2\gamma^2 d} < 1 \quad \text{for } M=\lfloor e^{\gamma^2 d}\rfloor\,,
</math>
and by virtue of the probabilistic method, there exist <math>\omega_{1}, \ldots, \omega_{M} \in \{0,1\}^d</math> that satisfy <math>(i)</math> and <math>(ii)</math>.}}
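The Varshamov-Gilbert lemma is purely existential; for small <math>d</math> one can also build such a packing constructively. The greedy sketch below (ours, not the lemma's probabilistic argument) keeps every binary vector at Hamming distance at least <math>\lceil(\frac12-\gamma)d\rceil</math> from all previously kept ones, and typically produces far more than <math>\lfloor e^{\gamma^2 d}\rfloor</math> codewords:

```python
from itertools import product
from math import ceil, exp, floor

def hamming(a, b):
    # Hamming distance rho between two binary tuples.
    return sum(x != y for x, y in zip(a, b))

def greedy_packing(d, min_dist):
    # Greedily scan {0,1}^d and keep every vector whose distance to all
    # previously kept vectors is at least min_dist: a constructive
    # (generally suboptimal) packing of the hypercube.
    code = []
    for w in product((0, 1), repeat=d):
        if all(hamming(w, c) >= min_dist for c in code):
            code.append(w)
    return code

d, gamma = 12, 0.25
min_dist = ceil((0.5 - gamma) * d)   # (1/2 - gamma) d = 3
code = greedy_packing(d, min_dist)
print(len(code), "codewords at pairwise distance >=", min_dist)
```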
==<span id{{=}}"sec:lb-gsm"></span>Application to the Gaussian sequence model==
We are now in a position to apply [[#TH:mainLB |Theorem]] by choosing <math>\theta_1, \ldots, \theta_M</math> based on <math>\omega_{1}, \ldots, \omega_{M}</math> from the Varshamov-Gilbert Lemma.
===Lower bounds for estimation===
Take <math>\gamma=1/4</math> and apply the Varshamov-Gilbert Lemma to obtain <math>\omega_{1}, \ldots, \omega_{M}</math> with <math>M=\lfloor e^{d/16}\rfloor\ge e^{d/32} </math> and such that
<math display="block">
\rho(\omega_j, \omega_k) \ge \frac{d}{4} \quad \text{for all } j \neq k\,.
</math>
Define
<math display="block">
\theta_j=\omega_j\frac{\beta\sigma}{\sqrt{n}}\,, | \theta_j=\omega_j\frac{\beta\sigma}{\sqrt{n}}\,, | ||
</math> | </math> | ||
for some <math>\beta > 0</math> to be chosen later. We can check the conditions of [[#TH:mainLB |Theorem]]:
<ul style{{=}}"list-style-type:lower-alpha"><li><math>\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \ge 4\frac{\beta^2\sigma^2 d}{16n}</math>;</li>
<li><math>\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \le \frac{\beta^2\sigma^2 d}{n}\le \frac{32\beta^2\sigma^2 }{n}\log(M)=\frac{2\alpha \sigma^2}{n}\log(M)</math>,</li>
</ul>
for <math>\beta=\frac{\sqrt{\alpha}}{4}</math>. Applying now [[#TH:mainLB |Theorem]] yields a lower bound of order <math>\frac{\sigma^2 d}{n}</math> that holds with positive probability. It implies the following corollary.
{{proofcard|Corollary|cor-1|The minimax rate of estimation over <math>\R^d</math> in the Gaussian sequence model is <math>\phi(\R^d)=\sigma^2d/n</math>. Moreover, it is attained by the least squares estimator <math>\thetals=\bY</math>.|}}
Note that this rate is minimax over sets <math>\Theta</math> that are strictly smaller than <math>\R^d</math> (see Problem [[#EXO:minimax_other |Minimax Lower Bounds]]). Indeed, it is minimax over any subset of <math>\R^d</math> that contains <math>\theta_1, \ldots, \theta_M</math>.
===Lower bounds for sparse estimation===
It appears from [[#TAB:minimaxUB |Table]] that when estimating sparse vectors, we have to pay an extra logarithmic term <math>\log(ed/k)</math> for not knowing the sparsity pattern of the unknown <math>\theta^*</math>. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors.
Note that the vectors <math>\theta_1, \ldots, \theta_M</math> employed in the previous subsection are not guaranteed to be sparse because the vectors <math>\omega_1, \ldots, \omega_M</math> obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varshamov-Gilbert lemma.
{{proofcard|Lemma (Sparse Varshamov-Gilbert)|LEM:sVG|There exist positive constants <math>C_1</math> and <math>C_2</math> such that the following holds for any two integers <math>k</math> and <math>d</math> such that <math>1\le k \le d/8</math>. There exist binary vectors <math>\omega_1, \ldots, \omega_M \in \{0,1\}^d</math> such that <math>|\omega_j|_0 = k</math> for all <math>j</math>, <math>\rho(\omega_{j}, \omega_{l}) \ge k/2</math> for all <math>j \neq l</math>, and
<math display="block">
\begin{equation*}
\frac{k}{8}\log(1+\frac{d}{2k}) \le \log M < \frac{k}{4}\log(1+\frac{d}{2k}).
\end{equation*}
</math>}}
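A fact used in the proof of this lemma (and again further below) is that the overlap <math>|S \cap [k]|</math> of a uniformly random <math>k</math>-subset <math>S</math> of <math>[d]</math> with a fixed set of size <math>k</math> is controlled by a <math>\Bin(k, k/d)</math> variable; the consequence actually needed is a bound on exponential moments, via Hoeffding's reduction from sampling without replacement to sampling with replacement. The simulation below is an added illustration of that moment comparison, with arbitrary illustrative parameters:

```python
import random

def mgf_overlap(d, k, z, trials=20_000, seed=1):
    """Monte Carlo estimate of E[z^X], X = |S cap [k]| for a uniform k-subset S of [d]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = rng.sample(range(d), k)
        x = sum(1 for i in s if i < k)  # overlap with the fixed set {0, ..., k-1}
        total += z ** x
    return total / trials

def mgf_binom(k, p, z):
    """Exact E[z^Y] for Y ~ Bin(k, p): (1 - p + p z)^k."""
    return (1 - p + p * z) ** k

d, k = 200, 20
for z in [1.2, 1.5, 2.0]:
    # Hoeffding: sampling without replacement is smaller in convex order,
    # hence E[z^X] <= E[z^Y] for z >= 1 (5% slack absorbs Monte Carlo noise).
    assert mgf_overlap(d, k, z) <= mgf_binom(k, k / d, z) * 1.05
print("exponential-moment domination verified")
```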
Apply the sparse Varshamov-Gilbert Lemma to obtain <math>\omega_{1}, \ldots, \omega_{M}</math> and define
<math display="block">
\theta_j=\omega_j\frac{\beta\sigma}{\sqrt{n}}\sqrt{\log(1+\frac{d}{2k})}\,,
</math>
for some <math>\beta > 0</math> to be chosen later. We can check the conditions of [[#TH:mainLB |Theorem]]:
<ul style{{=}}"list-style-type:lower-alpha"><li><math>\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \ge 4\frac{\beta^2\sigma^2 }{8n}k\log(1+\frac{d}{2k})</math>;</li>
<li><math>\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \le \frac{2k\beta^2\sigma^2 }{n}\log(1+\frac{d}{2k})\le \frac{2\alpha \sigma^2}{n}\log(M)</math>,</li>
</ul>
for <math>\beta=\sqrt{\frac{\alpha}{8}}</math>. Applying now [[#TH:mainLB |Theorem]] yields a lower bound of order <math>\frac{\sigma^2 k}{n}\log\big(1+\frac{d}{2k}\big)</math> that holds with positive probability. It implies the following corollary.
{{proofcard|Corollary|cor:minimax-sparse|Recall that <math>\cB_{0}(k) \subset \R^d</math> denotes the set of all <math>k</math>-sparse vectors of <math>\R^d</math>.
The minimax rate of estimation over <math>\cB_{0}(k)</math> in the Gaussian sequence model is <math>\phi(\cB_{0}(k))=\frac{\sigma^2k}{n}\log(ed/k)</math>. Moreover, it is attained by the constrained least squares estimator <math>\thetals_{\cB_0(k)}</math>.|}}
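The extra logarithmic factor has a counting interpretation (an added remark): up to constants, <math>k\log(ed/k)</math> is the logarithm of the number <math>\binom{d}{k}</math> of possible sparsity patterns, by the standard bound <math>\binom{d}{k} \le (ed/k)^k</math>. A quick numerical check:

```python
import math

def log_binom(d, k):
    """log of the number of sparsity patterns, log C(d, k), via log-gamma."""
    return math.lgamma(d + 1) - math.lgamma(k + 1) - math.lgamma(d - k + 1)

d = 10_000
for k in [1, 10, 100, 1000]:
    lb = log_binom(d, k)
    ub = k * math.log(math.e * d / k)
    assert lb <= ub  # standard bound C(d, k) <= (e d / k)^k
    print(f"k={k}: log C(d,k) = {lb:.1f}, k log(ed/k) = {ub:.1f}")
```

The two quantities track each other closely, which is why not knowing which of the <math>\binom{d}{k}</math> supports is active costs exactly this logarithmic term.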
Note that the modified BIC estimator of [[exercise:D94bf7a88b |Problem]] and the Slope estimator [[guide:090050c501#eq:ad |eq:ad]] are also minimax optimal over <math>\cB_{0}(k)</math>.
However, unlike <math> \thetals_{\cB_0(k)} </math> and the modified BIC, the Slope is also adaptive to <math> k </math>.
For any <math>\eps > 0</math>, the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that <math>k\le d^{1-\eps}</math>.
===Lower bounds for estimation in <math>\ell_1</math> balls===
To prove a lower bound over the <math>\ell_1</math> ball <math>\cB_1(R)</math>, take <math>k</math> to be the smallest integer such that
<math display="block">
k\ge \frac{R}{\beta\sigma}\sqrt{\frac{n}{\log (ed/\sqrt{n})}}\,.
</math>
Let <math>\omega_1, \ldots, \omega_M</math> be obtained from the sparse Varshamov-Gilbert [[#LEM:sVG |Lemma]] with this choice of <math>k</math> and define
<math display="block">
\theta_j=\omega_j\frac{R}{k}\,.
</math>
Observe that <math>|\theta_j|_1=R</math> for <math>j=1,\ldots, M</math>.
We can check the conditions of [[#TH:mainLB |Theorem]]:
<ul style{{=}}"list-style-type:lower-alpha"><li><math>\DS|\theta_j-\theta_k|_2^2=\frac{R^2}{k^2}\rho(\omega_{j}, \omega_{k})\ge \frac{R^2}{2k}\ge 4R\min\big(\frac{R}8, \beta^2\sigma\frac{\log (ed/\sqrt{n})}{8n}\big)</math>;</li>
<li><math>\DS |\theta_j-\theta_k|_2^2 \le \frac{2R^2}{k} \le 4R\beta\sigma\sqrt{\frac{\log(ed/\sqrt{n})}{n}}\le \frac{2\alpha \sigma^2}{n}\log(M)</math>,</li>
</ul>
for <math>\beta</math> small enough if <math>d \ge Ck</math> for some constant <math>C > 0</math> chosen large enough. Applying now [[#TH:mainLB |Theorem]] yields a lower bound of order <math>\min\big(R^2, R\sigma\sqrt{\log(ed/\sqrt{n})/n}\big)</math> that holds with positive probability. Moreover, the rate cannot exceed <math>R^2</math>: the trivial estimator <math>\hat\theta = 0</math> satisfies, for any <math>\theta^* \in \cB_1(R)</math>,
<math display="block">
\begin{equation*}
|0 -\theta^*|_2^2=|\theta^*|_2^2\le |\theta^*|_1^2=R^2\,.
\end{equation*}
</math>
Line 822: | Line 832: | ||
}} | }} | ||
==Lower bounds for sparse estimation via <math> \chi^2 </math> divergence==
In this section, we will show how to derive lower bounds by directly controlling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estimation problems.
Define the <math> \chi^2 </math> divergence between two probability distributions <math> \p, \q </math> as
<math display="block">
\chi^2(\p, \q) = \begin{cases} \DS\int \Big( \frac{\ud \p}{\ud \q} - 1 \Big)^2 \ud \q & \text{if } \p \ll \q, \\ \infty & \text{otherwise.} \end{cases}
</math>
Moreover, we note that we can bound the KL divergence from above by the <math> \chi^2 </math> divergence, via Jensen’s inequality, as
<span id{{=}}"eq:ag"/>
<math display="block">
\begin{equation}
\KL(\p, \q) \leq \log\big(1 + \chi^2(\p, \q)\big).
\label{eq:ag}
\end{equation}
</math>
Note that this estimate is a priori very rough and that we are transitioning from something at <math> \log </math> scale to something on a regular scale.
However, since we are mostly interested in showing that these distances are bounded by a small constant after adjusting the parameters of our models appropriately, it will suffice for our purposes.
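For intuition, the bound \eqref{eq:ag} can be checked in closed form for Gaussian shifts (an added illustration; the formulas <math>\KL = \mu^2/2</math> and <math>\chi^2 = e^{\mu^2} - 1</math> for <math>\cN(\mu,1)</math> versus <math>\cN(0,1)</math> are the standard ones). Here <math>\log(1 + \chi^2) = \mu^2</math>, so the inequality holds with a factor of exactly <math>2</math>:

```python
import math

def kl_gauss(mu):
    """KL( N(mu,1) || N(0,1) ) = mu^2 / 2 (standard closed form)."""
    return mu**2 / 2

def chi2_gauss(mu):
    """chi^2( N(mu,1), N(0,1) ) = exp(mu^2) - 1 (standard closed form)."""
    return math.exp(mu**2) - 1

for mu in [0.1, 0.5, 1.0, 2.0]:
    kl, chi2 = kl_gauss(mu), chi2_gauss(mu)
    assert kl <= math.log1p(chi2) + 1e-12  # the bound KL <= log(1 + chi^2)
    print(f"mu={mu}: KL={kl:.3f}, log(1+chi2)={math.log1p(chi2):.3f}")
```

The example also shows how rough the estimate can be at <math>\log</math> scale: <math>\chi^2</math> grows exponentially in <math>\mu^2</math> while <math>\KL</math> grows only quadratically.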
In particular, we can combine \eqref{eq:ag} with the Neyman-Pearson [[#LEM:Neyman |Lemma]] and Pinsker’s inequality, [[#LEM:Pinsker |Lemma]], to get, for any test <math> \psi </math>,
<math display="block">
\begin{align*}
\p_0(\psi = 1) \vee \p_1(\psi = 0) \geq \frac{1}{2}\big( 1 - \TV(\p_0, \p_1) \big) \geq \frac{1}{2}\Big( 1 - \sqrt{\tfrac{1}{2}\log\big(1 + \chi^2(\p_0, \p_1)\big)} \Big).
\end{align*}
</math>
Similar to the proof of the sparse Varshamov-Gilbert bound, [[#LEM:sVG |Lemma]], the distribution of <math> | S \cap [k] | </math> is stochastically dominated by a binomial distribution <math> \Bin(k, k/d) </math>, so that
<math display="block">
\begin{align*}
& \p_0 (\psi = 1) \vee \max_{\substack{v \in \cB_0(k)\\\|v\|_2 = 1}} \p_v(\psi = 0) \\
\geq {} & \p_0 (\psi = 1) \vee \bar{\p}(\psi = 0) \\
\geq {} & \frac{1}{2}(1 - \sqrt{\varepsilon}).
\end{align*}
</math>
Conversely, the lower bound above means that such an error rate cannot be achieved with vanishing error probability for <math> \mu </math> smaller than the critical value.
Moreover, by the triangle inequality, we can derive the same lower bound for the estimation of <math> \theta^\ast </math> itself.
We note that the rate that we obtained is slightly worse in the scaling of the log factor (<math> d/k^2 </math> instead of <math> d/k </math>) than the one obtained in [[#cor:minimax-sparse |Corollary]].
This is because it applies to estimating the <math> \ell_2 </math> norm as well, which is a strictly easier problem and for which estimators can be constructed that achieve the faster rate of <math> \frac{k}{n} \log(d/k^2) </math>.
==Lower bounds for estimating the <math> \ell_1 </math> norm via moment matching==
So far, we have seen many examples of rates that scale with <math> n^{-1} </math> for the squared error.
In this section, we are going to see an example for estimating a functional that can only be estimated at a much slower rate of <math> 1/\log n </math>, following <ref name="CaiLow11">{{cite journal|last1=Cai|first1=T. Tony|last2=Low|first2=Mark G.|journal=Ann. Statist.|year=2011|title=Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional|volume=39|number=2|doi=10.1214/aos/1034276635|publisher=The Institute of Mathematical Statistics|url=http://dx.doi.org/10.1214/aos/1034276635|eprint=1407.0381|pages=1012--1041}}</ref>.
Consider the normalized <math> \ell_1 </math> norm,
<span id{{=}}"eq:aw"/>
<math display="block">
\begin{equation}
T(\theta) = \frac{1}{n} \sum_{i=1}^{n} | \theta_i |,
\label{eq:aw}
\end{equation}
</math>
as the target functional to be estimated, and measurements
<span id{{=}}"eq:at"/>
<math display="block">
\begin{equation}
Y_i = \theta_i + \xi_i, \qquad \xi_i \sim \cN(0, 1) \ \text{i.i.d.}, \quad i = 1, \ldots, n,
\label{eq:at}
\end{equation}
</math>
We are going to show lower bounds for the estimation when restricted to an <math> \ell_\infty </math> ball, <math> \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} </math>, and for a general <math> \theta \in \R^n </math>.
===A constrained risk inequality===
We start with a general principle that is in the same spirit as the Neyman-Pearson lemma ([[#LEM:Neyman |Lemma]]), but deals with expectations rather than probabilities and uses the <math> \chi^2 </math> distance to estimate the closeness of distributions.
Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors.
In the following, we write <math> X </math> for a random observation coming from a distribution indexed by <math> \theta \in \Theta </math>, <math> \widehat{T} = \widehat{T}(X) </math> for an estimator of a function <math> T(\theta) </math>, and <math> B(\theta) = \E_\theta \widehat{T} - T(\theta) </math> for the bias of <math> \widehat{T} </math>.
Let <math> \mu_0 </math> and <math> \mu_1 </math> be two prior distributions over <math> \Theta </math> and denote by <math> m_i </math>, <math> v_i^2 </math> the means and variances of <math> T(\theta) </math> under the priors <math> \mu_i </math>, <math> i \in \{0, 1\} </math>,
<span id{{=}}"eq:au"/>
<math display="block">
\begin{equation*}
m_i = \int T(\theta) \, \ud \mu_i(\theta), \qquad v_i^2 = \int \big( T(\theta) - m_i \big)^2 \, \ud \mu_i(\theta),
\end{equation*}
</math>
and denote by <math> f_i </math> the marginal density of <math> X </math> when <math> \theta </math> is drawn from <math> \mu_i </math>.
With this, we write the <math> \chi^2 </math> distance between <math> f_0 </math> and <math> f_1 </math> as
<span id{{=}}"eq:av"/>
<math display="block">
\begin{equation*}
I^2 = \chi^2(f_1, f_0) = \int \frac{(f_1 - f_0)^2}{f_0}\,.
\end{equation*}
</math>
{{proofcard|Theorem|thm:constr_risk|(i) If <math> \int \E_\theta (\widehat{T}(X) - T(\theta))^2 d\mu_0(\theta) \leq \varepsilon^2 </math>, then
<span id{{=}}"eq:constr_risk"/>
<math display="block">
\begin{equation}
\Big| \int \E_\theta \widehat{T} \, \ud \mu_1(\theta) - \int \E_\theta \widehat{T} \, \ud \mu_0(\theta) \Big| \leq (\varepsilon + v_0) I.
\label{eq:constr_risk}
\end{equation}
</math>
(ii) If <math> | m_1 - m_0 | > v_0 I</math> and <math> 0 \leq \lambda \leq 1 </math>, a similar argument applied to the mixture <math> (1-\lambda)\mu_0 + \lambda\mu_1 </math> bounds the corresponding <math> \lambda </math>-weighted risk from below; in its simplest form,
<span id{{=}}"eq:mixture_risk"/>
<math display="block">
\begin{equation}
\max_{i \in \{0,1\}} \int \E_\theta \big( \widehat{T}(X) - T(\theta) \big)^2 \ud \mu_i(\theta) \geq \frac{\big( | m_1 - m_0 | - v_0 I \big)^2}{(I+2)^2},
\label{eq:mixture_risk}
\end{equation}
</math>
in particular
<span id{{=}}"eq:minimax_risk_mixture"/>
<math display="block">
\begin{equation}
\inf_{\widehat{T}} \max_{i \in \{0,1\}} \int \E_\theta \big( \widehat{T}(X) - T(\theta) \big)^2 \ud \mu_i(\theta) \geq \frac{\big( | m_1 - m_0 | - v_0 I \big)^2}{(I+2)^2},
\label{eq:minimax_risk_mixture}
\end{equation}
</math>
and
<span id{{=}}"eq:minimax_risk"/>
<math display="block">
\begin{equation}
\inf_{\widehat{T}} \sup_{\theta \in \Theta} \E_\theta \big( \widehat{T}(X) - T(\theta) \big)^2 \geq \frac{\big( | m_1 - m_0 | - v_0 I \big)^2}{(I+2)^2}.
\label{eq:minimax_risk}
\end{equation}
</math>
|The proof is based on the Cauchy-Schwarz inequality; see <ref name="CaiLow11"/> for the details.}}
===Polynomial approximation===
For a function <math> f </math> on <math> [-1, 1] </math>, denote by <math> \delta_k(f) </math> the best uniform approximation error achievable by polynomials of degree at most <math> k </math>,
<math display="block">
\delta_k(f) = \inf_{\deg G \leq k} \, \max_{t \in [-1, 1]} | f(t) - G(t) |\,,
</math>
and the polynomial attaining this error by <math> G_k^\ast </math>.
For <math> f(t) = |t| </math>, it is known <ref name="Riv90">{{cite book|last=Rivlin|first=Theodore J.|year=1990|title=Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory|publisher=Wiley|isbn=978-0-471-62896-5}}</ref> that
<math display="block">
\delta_k(|t|) = \frac{\beta_\ast}{k}\,(1 + o(1)), \qquad k \to \infty,
</math>
where <math> \beta_\ast \approx 0.28 </math> is the Bernstein constant.
The key construction is summarized in the following lemma: two priors on <math> [-1, 1] </math> whose first <math> k </math> moments match while the means of <math> |t| </math> are maximally separated.
{{proofcard|Lemma|lem:minimax_priors|For any even integer <math> k \geq 2 </math>, there exist two probability measures <math> \nu_0 </math> and <math> \nu_1 </math> on <math> [-1, 1] </math> such that
<ul style{{=}}"list-style-type:lower-alpha"><li><math> \DS\int t^l \ud \nu_0(t) = \int t^l \ud \nu_1(t) </math> for all <math> l \in \{0, 1, \ldots, k\} </math>;</li>
<li><math> \DS \int |t| \ud \nu_1(t) - \int |t| \ud \nu_0(t) = 2\delta_k(|t|) </math>.</li>
</ul>|}}
These ingredients yield the following lower bounds.
For <math> \theta \in \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} </math>, the minimax risk of estimating <math> T(\theta) </math> is bounded from below by a constant multiple of
<span id{{=}}"eq:ai"/>
<math display="block">
\begin{equation}
M^2 \Big( \frac{\log \log n}{\log n} \Big)^2.
\label{eq:ai}
\end{equation}
</math>
For <math> \theta \in \R^n </math>, an analogous lower bound holds without the boundedness restriction.
<span id{{=}}"eq:aj"/>
The last remaining ingredient for the proof is the family of ''Hermite polynomials'', orthogonal polynomials with respect to the Gaussian density
<span id{{=}}"eq:ae"/>
<math display="block">
\begin{equation*}
\varphi(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2}\,.
\end{equation*}
</math>
We now turn to the proof of the lower bound \eqref{eq:ai}. We want to use [[#thm:constr_risk |Theorem]].
To construct the prior measures, we scale the measures from [[#lem:minimax_priors |Lemma]] appropriately:
Let <math> k_n </math> be an even integer that is to be determined, and <math> \nu_0 </math>, <math> \nu_1 </math> the two measures given by [[#lem:minimax_priors |Lemma]].
Define <math> g(x) = Mx </math> and define the measures <math> \mu_i </math> by dilating <math> \nu_i </math>, <math> \mu_i(A) = \nu_i(g^{-1}(A)) </math> for every Borel set <math> A \subseteq [-M, M] </math> and <math> i \in \{0, 1\} </math>.
Hence, the first <math> k_n </math> moments of <math> \mu_0 </math> and <math> \mu_1 </math> coincide, while <math> m_1 - m_0 = 2 M \delta_{k_n}(|t|) </math>.
The <math> \chi^2 </math> distance for <math> n </math> i.i.d. observations can now be easily bounded by
<span id{{=}}"eq:ax"/>
<math display="block">
\begin{align}
I_n^2 \leq \Big( 1 + \sum_{k > k_n} \frac{4 M^{2k}}{k!} \Big)^n - 1 \leq \Big( 1 + 8 \Big( \frac{e M^2}{k_n} \Big)^{k_n} \Big)^n - 1,
\label{eq:ax}
\end{align}
</math>
where the last step used a Stirling type estimate, <math> k! > (k/e)^k </math>, and holds once <math> k_n \geq 2 e M^2 </math>.
Note that if <math> x = o(1/n) </math>, then <math> (1+x)^n - 1 = o(n x) </math> by Taylor's formula, so we can choose <math> k_n \geq \log n / (\log \log n) </math> to guarantee that for <math> n </math> large enough, <math> I_n \to 0 </math>.
With [[#thm:constr_risk |Theorem]], the risk is bounded from below by a constant times <math> (m_1 - m_0)^2 = 4 M^2 \delta_{k_n}(|t|)^2 </math>, which is of order <math> M^2/k_n^2 </math>, up to a correction term involving <math> v_0 I_n </math>,
which goes to zero.
Hence, we can conclude just as before.
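Both elementary facts invoked at the end of the argument, the Stirling-type estimate <math> k! > (k/e)^k </math> and the Taylor approximation <math> (1+x)^n - 1 \approx nx </math> when <math> nx </math> is small, are easy to confirm numerically (an added illustration):

```python
import math

# Stirling-type bound: k! > (k/e)^k for all k >= 1.
for k in range(1, 50):
    assert math.factorial(k) > (k / math.e) ** k

# Taylor: for x = o(1/n), (1+x)^n - 1 is close to n*x.
n = 10_000
x = 1e-8  # here n*x = 1e-4, well inside the x = o(1/n) regime
lhs = (1 + x) ** n - 1
assert abs(lhs - n * x) / (n * x) < 1e-3
print("Stirling and Taylor checks passed")
```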
Note that <ref name="CaiLow11"/> also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating <math> T(\theta) </math>.
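The Hermite polynomials used in the moment computations above can be generated by the probabilists' three-term recurrence <math> He_{k+1}(x) = x\,He_k(x) - k\,He_{k-1}(x) </math> (this normalization is an assumption here), with orthogonality <math> \int He_j He_k \varphi = k!\,\1(j = k) </math>. The sketch below verifies this by direct numerical integration:

```python
import math

def hermite(k, x):
    """Probabilists' Hermite polynomial He_k(x) via the three-term recurrence."""
    h_prev, h = 1.0, x
    if k == 0:
        return h_prev
    for i in range(1, k):
        h_prev, h = h, x * h - i * h_prev
    return h

def gauss_inner(j, k, lo=-10.0, hi=10.0, steps=200_000):
    """Midpoint-rule approximation of  int He_j(x) He_k(x) phi(x) dx."""
    dx = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * dx
        phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += hermite(j, x) * hermite(k, x) * phi * dx
    return total

assert abs(gauss_inner(3, 3) - math.factorial(3)) < 1e-3  # equals 3! = 6
assert abs(gauss_inner(2, 4)) < 1e-3                       # orthogonality
print("Hermite orthogonality verified")
```

It is exactly this orthogonality that turns the matching of the first <math> k_n </math> moments of the two priors into cancellation of the first <math> k_n </math> terms of the <math> \chi^2 </math> expansion.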
==General references==
{{cite arXiv|last1=Rigollet|first1=Philippe|last2=Hütter|first2=Jan-Christian|year=2023|title=High-Dimensional Statistics|eprint=2310.19244|class=math.ST}}
==References==
{{reflist}}
Latest revision as of 00:37, 3 June 2024
[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} 
\newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} 
\newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} \newcommand{\KL}{\mathsf{KL}} \newcommand{\TV}{\mathsf{TV}} [/math]
In the previous chapters, we have proved several upper bounds and the goal of this chapter is to assess their optimality. Specifically, our goal is to answer the following questions:
- Can our analysis be improved? In other words: do the estimators that we have studied actually satisfy better bounds?
- Can any estimator improve upon these bounds?
Both questions ask about some form of optimality. The first one is about optimality of an estimator, whereas the second one is about optimality of a bound. The difficulty of these questions varies depending on whether we are looking for a positive or a negative answer. Indeed, a positive answer to these questions simply consists in finding a better proof for the estimator we have studied (question 1.) or simply finding a better estimator, together with a proof that it performs better (question 2.). A negative answer is much more arduous. For example, in question 2., it is a statement about all estimators. How can this be done? The answer lies in information theory (see [1] for a nice introduction). In this chapter, we will see how to give a negative answer to question 2. It will imply a negative answer to question 1.
==Optimality in a minimax sense==
Consider the Gaussian Sequence Model (GSM) where we observe [math]\bY=(Y_1, \ldots, Y_d)^\top[/math], defined by
where [math]\varepsilon=(\eps_1, \ldots, \eps_d)^\top \sim \cN_d(0,\frac{\sigma^2}{n}I_d)[/math], [math]\theta^*=(\theta^*_1, \ldots, \theta^*_d)^\top \in \Theta [/math] is the parameter of interest and [math]\Theta \subset \R^d[/math] is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by [math]\p_{\theta^*}[/math] and [math]\E_{\theta^*}[/math] the probability measure and corresponding expectation that are associated to the distribution of [math]\bY[/math] from the GSM\eqref{EQ:GSMminimax}. Recall that GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (upper bounds) for various choices of [math]\Theta[/math] that can be expressed either in the form
or the form
for some constant [math]C[/math]. The rates [math]\phi(\Theta)[/math] for different choices of [math]\Theta[/math] that we have obtained are gathered in Table together with the estimator (and the corresponding result from Chapter Linear Regression Model) that was employed to obtain this rate.
[math]\Theta[/math] | [math]\phi(\Theta)[/math] | Estimator | Result
[math]\R^d[/math] | [math]\DS \frac{\sigma^2 d}{n}[/math] | [math]\thetals[/math] | Theorem
[math]\cB_1[/math] | [math]\DS\sigma \sqrt{\frac{ \log d}{n}}[/math] | [math]\thetals_{\cB_1}[/math] | Theorem
[math]\cB_0(k)[/math] | [math]\DS\frac{ \sigma^2 k}{n}\log (ed/k)[/math] | [math]\thetals_{\cB_0(k)}[/math] | Corollary-Corollary
Can any of these results be improved? In other words, does there exist another estimator [math]\tilde \theta[/math] such that [math]\sup_{\theta^* \in \Theta}\E|\tilde \theta -\theta^*|_2^2\ll\phi(\Theta)[/math]? A first step in this direction is the Cramér-Rao lower bound [2] that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of minimax optimality, which characterizes how fast [math]\theta^*[/math] can be estimated uniformly over [math]\Theta[/math].
We say that an estimator [math]\hat \theta_n[/math] is minimax optimal over [math]\Theta[/math] if it satisfies \eqref{EQ:minimaxUB_E} and there exists [math]C' \gt 0[/math] such that
Note that minimax rates of convergence [math]\phi[/math] are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as [math]\sigma^2d/n[/math] as opposed to [math]7\sigma^2d/n[/math], for example. This definition can be adapted to rates that hold with high probability. As we saw in Chapter Linear Regression Model (cf. Table), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any [math]A \gt 0[/math],
Therefore, \eqref{EQ:minimaxLB1} follows if we prove that
for some positive constants [math]A[/math] and [math]C''[/math]. The above inequality also implies a lower bound with high probability. We can therefore employ the following alternate definition for minimax optimality.
We say that an estimator [math]\hat \theta[/math] is minimax optimal over [math]\Theta[/math] if it satisfies either \eqref{EQ:minimaxUB_E} or \eqref{EQ:minimaxUB_P} and there exists [math]C' \gt 0[/math] such that
Reduction to finite hypothesis testing
Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distinguish between two probability distributions that are close to each other. For example, given [math]n[/math] i.i.d. observations, it is impossible to reliably decide whether they are drawn from [math]\cN(0,1)[/math] or [math]\cN(\frac1n,1)[/math]. This simple argument can be made precise using the formalism of statistical hypothesis testing. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps.
- Reduction to a finite number of parameters. In this step the goal is to find the largest possible number of parameters [math]\theta_1, \ldots, \theta_M \in \Theta[/math] under the constraint that
[[math]] \begin{equation} \label{EQ:dist_constraint} |\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)\,. \end{equation} [[/math]]This problem boils down to a packing of the set [math]\Theta[/math]. Then we can use the following trivial observations:[[math]] \inf_{\hat \theta}\sup_{\theta \in \Theta}\p_\theta\big[|\hat \theta- \theta|_2^2 \gt \phi(\Theta)\big] \ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]\,. [[/math]]
- Reduction to a hypothesis testing problem. In this second step, the necessity of the constraint \eqref{EQ:dist_constraint} becomes apparent.
For any estimator [math]\hat \theta[/math], define the minimum distance test [math]\psi(\hat \theta)[/math] that is associated to it by
[[math]] \psi(\hat \theta)=\argmin_{1\le j \le M} |\hat \theta -\theta_j|_2\,, [[/math]]with ties broken arbitrarily. Next observe that if, for some [math]j=1, \ldots, M[/math], [math]\psi(\hat \theta)\neq j[/math], then there exists [math]k \neq j[/math] such that [math]|\hat \theta -\theta_k|_2 \le |\hat \theta -\theta_j|_2[/math]. Together with the reverse triangle inequality, it yields[[math]] |\hat \theta - \theta_j|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_k|_2 \ge |\theta_j -\theta_k|_2- |\hat \theta - \theta_j|_2\,, [[/math]]so that[[math]] |\hat \theta - \theta_j|_2 \ge \frac12 |\theta_j -\theta_k|_2\,. [[/math]]Together with constraint \eqref{EQ:dist_constraint}, it yields[[math]] |\hat \theta - \theta_j|_2^2 \ge \phi(\Theta)\,. [[/math]]As a result,[[math]] \begin{align*} \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[|\hat \theta- \theta_j|_2^2 \gt \phi(\Theta)\big]&\ge \inf_{\hat \theta}\max_{1\le j \le M}\p_{\theta_j}\big[\psi(\hat \theta)\neq j\big]\\ &\ge \inf_{\psi}\max_{1\le j \le M}\p_{\theta_j}\big[\psi\neq j\big]\,, \end{align*} [[/math]]where the infimum is taken over all tests [math]\psi[/math] based on [math]\bY[/math] that take values in [math]\{1, \ldots, M\}[/math].
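As a sanity check, the minimum distance test is straightforward to implement. The sketch below uses hypothetical two-dimensional hypotheses (illustrative values, not taken from the text):

```python
import numpy as np

# Hypothetical hypotheses theta_1, ..., theta_M in R^2 (illustrative values).
thetas = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def min_distance_test(theta_hat):
    """Return the index j minimizing |theta_hat - theta_j|_2 (ties broken by argmin)."""
    dists = np.linalg.norm(thetas - theta_hat, axis=1)
    return int(np.argmin(dists))

# An estimate close to the second hypothesis is mapped to index 1.
print(min_distance_test(np.array([0.9, 0.2])))
```

The argument above shows that any estimator whose error is below the packing radius induces, through this map, a test that identifies the correct hypothesis.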
Conclusion: it is sufficient for proving lower bounds to find [math]\theta_1, \ldots, \theta_M \in \Theta[/math] such that [math]|\theta_j -\theta_k|_2^2 \ge 4 \phi(\Theta)[/math] and
The above quantity is called minimax probability of error. In the next sections, we show how it can be bounded from below using arguments from information theory. For the purpose of illustration, we begin with the simple case where [math]M=2[/math] in the next section.
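The opening example of this section, distinguishing [math]\cN(0,1)[/math] from [math]\cN(\frac1n,1)[/math], can also be simulated directly. The sketch below runs the likelihood-ratio test on repeated draws (the sample size and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 4000
errors = 0
for _ in range(trials):
    h = int(rng.integers(2))                 # true hypothesis: 0 or 1
    x = rng.normal(h / n, 1.0, size=n)       # n i.i.d. draws from N(h/n, 1)
    # Likelihood-ratio test between N(0,1) and N(1/n,1): reject H0 iff
    # sum(x) exceeds the midpoint n * (1/(2n)) = 1/2.
    guess = int(x.sum() > 0.5)
    errors += int(guess != h)
error_rate = errors / trials
print(error_rate)  # close to 1/2: the two hypotheses are nearly indistinguishable
```

Even the optimal test is barely better than a coin flip here, which is exactly the phenomenon the lower bounds below quantify.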
Lower bounds based on two hypotheses
The Neyman-Pearson Lemma and the total variation distance
Consider two probability measures [math]\p_0[/math] and [math]\p_1[/math] and observations [math]Z[/math] drawn from either [math]\p_0[/math] or [math]\p_1[/math]. We want to know which distribution [math]Z[/math] comes from. This corresponds to the following statistical hypothesis problem:
A test [math]\psi=\psi(Z) \in \{0,1\}[/math] indicates which hypothesis is accepted. Any test [math]\psi[/math] can make two types of errors. It can commit either an error of type I ([math]\psi=1[/math] whereas [math]Z \sim \p_0[/math]) or an error of type II ([math]\psi=0[/math] whereas [math]Z \sim \p_1[/math]). Of course, the test may also be correct. The following fundamental result, called the Neyman-Pearson Lemma, indicates that any test [math]\psi[/math] is bound to commit one of these two types of error with positive probability unless [math]\p_0[/math] and [math]\p_1[/math] have essentially disjoint supports. Let [math]\nu[/math] be a sigma-finite measure satisfying [math]\p_0 \ll \nu[/math] and [math]\p_1\ll \nu[/math]. For example, we can take [math]\nu = \p_0+\p_1[/math]. It follows from the Radon-Nikodym theorem [3] that both [math]\p_0[/math] and [math]\p_1[/math] admit probability densities with respect to [math]\nu[/math]. We denote them by [math]p_0[/math] and [math]p_1[/math] respectively. For any function [math]f[/math], we write for simplicity
Let [math]\p_0[/math] and [math]\p_1[/math] be two probability measures. Then for any test [math]\psi[/math], it holds
Observe first that
The lower bound in the Neyman-Pearson lemma is related to a well-known quantity: the total variation distance.
The total variation distance between two probability measures [math]\p_0[/math] and [math]\p_1[/math] on a measurable space [math](\cX, \cA)[/math] is defined by
Clearly [math](i)=(ii)[/math] and the Neyman-Pearson Lemma gives [math](iv)=(v)[/math]. Moreover, by identifying a test [math]\psi[/math] to its rejection region, it is not hard to see that [math](i)=(v)[/math]. Therefore it remains only to show that [math](iii)[/math] is equal to any of the other expressions. Hereafter, we show that [math](iii)=(iv)[/math]. To that end, observe that
In view of the Neyman-Pearson lemma, it is clear that if we want to prove large lower bounds, we need to find probability distributions that are close in total variation. Yet, this conflicts with constraint \eqref{EQ:dist_constraint} and a tradeoff needs to be achieved. To that end, in the Gaussian sequence model, we need to be able to compute the total variation distance between [math]\cN(\theta_0, \frac{\sigma^2}{n}I_d)[/math] and [math]\cN(\theta_1, \frac{\sigma^2}{n}I_d)[/math]. None of the expressions in Definition-Proposition gives an easy way to do so. The Kullback-Leibler divergence is much more convenient.
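Even without a closed form, the equivalent expressions in the definition-proposition can be compared numerically. The sketch below evaluates [math]\frac12 \int |p_0-p_1|[/math] and [math]1-\int p_0 \wedge p_1[/math] on a grid for two univariate Gaussian densities (the mean shift is an arbitrary illustrative choice):

```python
import numpy as np

# Densities of N(0,1) and N(mu,1) on a fine grid (mu is an arbitrary choice).
mu = 0.7
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p1 = np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)

tv_abs = 0.5 * np.sum(np.abs(p0 - p1)) * dx     # (1/2) int |p0 - p1|
tv_min = 1.0 - np.sum(np.minimum(p0, p1)) * dx  # 1 - int p0 ^ p1
print(tv_abs, tv_min)  # the two expressions for TV agree
```

The agreement is no accident: pointwise, [math]|a-b| = a+b-2(a\wedge b)[/math], and both densities integrate to one.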
The Kullback-Leibler divergence
The Kullback-Leibler divergence between probability measures [math]\p_1[/math] and [math]\p_0[/math] is given by
It can be shown [4] that the integral is always well defined when [math]\p_1 \ll \p_0[/math] (though it can be equal to [math]\infty[/math] even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.
Let [math]\p[/math] and [math]\q[/math] be two probability measures. Then
- [math]\KL(\p, \q)\ge 0[/math].
- The function [math](\p, \q) \mapsto \KL(\p,\q)[/math] is convex.
- If [math]\p[/math] and [math]\q[/math] are product measures, i.e.,
[[math]] \p=\bigotimes_{i=1}^n \p_i\quad \text{and} \quad \q=\bigotimes_{i=1}^n \q_i [[/math]]then[[math]] \KL(\p,\q)=\sum_{i=1}^n \KL(\p_i,\q_i)\,. [[/math]]
If [math]\p[/math] is not absolutely continuous with respect to [math]\q[/math], then the result is trivial. Next, assume that [math]\p \ll \q[/math] and let [math]X \sim \p[/math]. 1. Observe that by Jensen's inequality,
2. Consider the function [math]f:(x,y) \mapsto x\log(x/y)[/math] and compute its Hessian:
Point 3. in Proposition is particularly useful in statistics where observations typically consist of [math]n[/math] independent random variables.
Example
For any [math]\theta \in \R^d[/math], let [math]P_\theta[/math] denote the distribution of [math]\bY \sim \cN(\theta, \sigma^2I_d)[/math]. Then
The proof is left as an exercise (see Problem).
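A quick Monte-Carlo check of both the product rule and the Gaussian formula [math]\KL(\cN(a,s^2),\cN(b,s^2)) = (a-b)^2/(2s^2)[/math] (the mean vectors below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_gauss(a, b, s2=1.0):
    """Closed-form KL(N(a, s2), N(b, s2)) = (a - b)^2 / (2 s2)."""
    return (a - b) ** 2 / (2 * s2)

# Product measures P = prod_i N(a_i, 1) and Q = prod_i N(b_i, 1).
a = np.array([0.0, 0.5, -0.3])
b = np.array([0.2, 0.0, 0.1])

# Monte-Carlo estimate of KL(P, Q) = E_P[log dP/dQ]; by the product rule it
# should match the sum of the coordinatewise divergences.
X = rng.normal(a, 1.0, size=(200000, 3))
log_ratio = (-(X - a) ** 2 / 2 + (X - b) ** 2 / 2).sum(axis=1)
kl_mc = float(log_ratio.mean())
kl_sum = float(sum(kl_gauss(ai, bi) for ai, bi in zip(a, b)))
print(kl_mc, kl_sum)  # the two agree up to Monte-Carlo error
```

This is the computation behind the example: for the GSM with variance [math]\sigma^2/n[/math] per coordinate, summing over coordinates gives [math]\KL(P_{\theta}, P_{\theta'}) = \frac{n}{2\sigma^2}|\theta-\theta'|_2^2[/math].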
The Kullback-Leibler divergence is easier to manipulate than the total variation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker's inequality. We prove here a slightly weaker version of Pinsker's inequality that will be sufficient for our purpose. For a stronger statement, see [4], Lemma 2.5.
Let [math]\p[/math] and [math]\q[/math] be two probability measures such that [math]\p \ll \q[/math]. Then
Note that
Pinsker's inequality yields the following theorem for the GSM.
Assume that [math]\Theta[/math] contains two hypotheses [math]\theta_0[/math] and [math]\theta_1[/math] such that [math]|\theta_0-\theta_1|_2^2 = 8\alpha^2\sigma^2/n[/math] for some [math]\alpha \in (0,1/2)[/math]. Then
Write for simplicity [math]\p_j=\p_{\theta_j}[/math], [math]j=0,1[/math]. Recall that it follows from the reduction to hypothesis testing that
Clearly, the result of Theorem matches the upper bound for [math]\Theta=\R^d[/math] only for [math]d=1[/math]. How about larger [math]d[/math]? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of [math]\theta^*[/math]. With this knowledge, one can obtain upper bounds that do not depend on [math]d[/math] by simply projecting [math]Y[/math] onto the linear span of [math]\theta_0, \theta_1[/math] and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to [math]d[/math]. In principle, we should need at least order [math]d[/math] hypotheses but we will actually need much more.
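Numerically, the two-point bound can be traced through Pinsker's inequality and the Neyman-Pearson Lemma. The sketch below uses the standard form of Pinsker's inequality, [math]\TV \le \sqrt{\KL/2}[/math], with arbitrary illustrative parameters:

```python
from math import sqrt

# Two hypotheses at squared distance 8 alpha^2 sigma^2 / n in the GSM.
n, sigma, alpha = 100, 1.0, 0.3
sep2 = 8 * alpha**2 * sigma**2 / n

kl = n * sep2 / (2 * sigma**2)   # KL(P_0, P_1) = n |theta_0 - theta_1|^2 / (2 sigma^2)
tv = sqrt(kl / 2)                # Pinsker: TV <= sqrt(KL / 2)
error_lb = (1 - tv) / 2          # Neyman-Pearson: min testing error >= (1 - TV) / 2
print(error_lb)                  # at least 1/2 - alpha, consistent with the theorem
```

Note that [math]\KL = 4\alpha^2[/math] here, so the resulting bound [math](1-\sqrt{2}\alpha)/2[/math] is never smaller than the [math]1/2-\alpha[/math] stated in the theorem.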
Lower bounds based on many hypotheses
The reduction to hypothesis testing from Section Reduction to finite hypothesis testing allows us to use more than two hypotheses. Specifically, we should find [math]\theta_1, \ldots, \theta_M[/math] such that
for some positive constant [math]C'[/math]. Unfortunately, there is no counterpart of the Neyman-Pearson Lemma for more than two hypotheses. Nevertheless, it is possible to relate the minimax probability of error directly to the Kullback-Leibler divergence, without involving the total variation distance. This is possible using a well-known result from information theory called Fano's inequality.
Let [math]P_1, \ldots, P_M, M \ge 2[/math] be probability distributions such that [math]P_j \ll P_k[/math], [math]\forall \, j ,k[/math]. Then
Define
Fano's inequality leads to the following useful theorem.
Assume that [math]\Theta[/math] contains [math]M \ge 5[/math] hypotheses [math]\theta_1, \ldots, \theta_M[/math] such that for some constant [math]0 \lt \alpha \lt 1/4[/math], it holds
- [math]\DS|\theta_j-\theta_k|_2^2 \ge 4\phi[/math]
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2\alpha\sigma^2 }{n}\log(M)[/math]
Then
In view of (i), it follows from the reduction to hypothesis testing that it is sufficient to prove that
Theorem indicates that we must take [math]\phi\le \frac{\alpha\sigma^2 }{2n}\log(M)[/math]. Therefore, the larger the [math]M[/math], the larger the lower bound can be. However, [math]M[/math] cannot be arbitrarily large because of the constraint [math](i)[/math]. We are therefore facing a packing problem where the goal is to "pack" as many Euclidean balls of radius proportional to [math]\sigma\sqrt{\log(M)/n}[/math] in [math]\Theta[/math] under the constraint that their centers remain close together (constraint [math](ii)[/math]). If [math]\Theta=\R^d[/math], the goal is to pack the Euclidean ball of radius [math]R=\sigma\sqrt{2\alpha\log(M)/n}[/math] with Euclidean balls of radius [math]R\sqrt{2\alpha/\gamma}[/math]. This can be done using a volume argument (see Problem). However, we will use the more versatile lemma below. It gives a lower bound on the size of a packing of the discrete hypercube [math]\{0,1\}^d[/math] with respect to the ''Hamming distance'' defined by
For any [math]\gamma \in (0,1/2)[/math], there exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{j}, \omega_{k})\ge \big(\frac12-\gamma\big)d[/math] for all [math]j \neq k[/math]\,,
- [math]\DS M =\lfloor e^{\gamma^2d}\rfloor\ge e^{\frac{\gamma^2d}{2}}[/math]\,.
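The probabilistic construction behind this lemma can be checked empirically; the dimension and the cap on [math]M[/math] below are arbitrary choices that keep the check fast:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

d, gamma = 64, 0.25
M = min(int(np.exp(gamma**2 * d)), 30)    # a few of the floor(e^{gamma^2 d}) vectors
omegas = rng.integers(0, 2, size=(M, d))  # i.i.d. Bernoulli(1/2) coordinates

min_dist = min(int(np.sum(w1 != w2)) for w1, w2 in combinations(omegas, 2))
print(min_dist, (0.5 - gamma) * d)  # pairwise Hamming distances concentrate near d/2
```

Since the pairwise Hamming distance is Binomial[math](d,1/2)[/math], it concentrates around [math]d/2[/math], which is what the proof below exploits via a union bound.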
Let [math]\omega_{j,i}[/math], [math]1\le i \le d, 1\le j \le M[/math] be i.i.d. Bernoulli random variables with parameter [math]1/2[/math] and observe that
Application to the Gaussian sequence model
We are now in a position to apply Theorem by choosing [math]\theta_1, \ldots, \theta_M[/math] based on [math]\omega_{1}, \ldots, \omega_{M}[/math] from the Varshamov-Gilbert Lemma.
Lower bounds for estimation
Take [math]\gamma=1/4[/math] and apply the Varshamov-Gilbert Lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]M=\lfloor e^{d/16}\rfloor\ge e^{d/32} [/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge d/4[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
for some [math]\beta \gt 0[/math] to be chosen later. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \ge 4\frac{\beta^2\sigma^2 d}{16n}[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k}) \le \frac{\beta^2\sigma^2 d}{n}\le \frac{32\beta^2\sigma^2 }{n}\log(M)=\frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\frac{\sqrt{\alpha}}{4}[/math]. Applying now Theorem yields
It implies the following corollary.
The minimax rate of estimation over [math]\R^d[/math] in the Gaussian sequence model is [math]\phi(\R^d)=\sigma^2d/n[/math]. Moreover, it is attained by the least squares estimator [math]\thetals=\bY[/math].
Note that this rate is minimax over sets [math]\Theta[/math] that are strictly smaller than [math]\R^d[/math] (see Problem). Indeed, it is minimax over any subset of [math]\R^d[/math] that contains [math]\theta_1, \ldots, \theta_M[/math].
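A Monte-Carlo sanity check that [math]\thetals=\bY[/math] attains the rate [math]\sigma^2 d/n[/math] (the parameters are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 50, 20, 1.0
theta_star = rng.normal(size=d)

# Draw many independent realizations of Y = theta* + (sigma/sqrt(n)) * noise
# and average the squared error of the estimator theta_hat = Y.
Y = theta_star + (sigma / np.sqrt(n)) * rng.normal(size=(100000, d))
risk = float(np.mean(np.sum((Y - theta_star) ** 2, axis=1)))
print(risk, sigma**2 * d / n)  # the empirical risk matches sigma^2 d / n
```

The risk is exactly [math]\sigma^2 d/n[/math] for every [math]\theta^*[/math], which matches the lower bound up to constants.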
Lower bounds for sparse estimation
It appears from Table that when estimating sparse vectors, we have to pay for an extra logarithmic term [math]\log(ed/k)[/math] for not knowing the sparsity pattern of the unknown [math]\theta^*[/math]. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors. Note that the vectors [math]\theta_1, \ldots, \theta_M[/math] employed in the previous subsection are not guaranteed to be sparse because the vectors [math]\omega_1, \ldots, \omega_M[/math] obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varshamov-Gilbert lemma.
There exist positive constants [math]C_1[/math] and [math]C_2[/math] such that the following holds for any two integers [math]k[/math] and [math]d[/math] such that [math]1\le k \le d/8[/math]. There exist binary vectors [math]\omega_{1}, \ldots \omega_{M} \in \{0,1\}^d[/math] such that
- [math]\DS\rho(\omega_{i}, \omega_{j})\ge \frac{k}2[/math] for all [math]i \neq j[/math]\,,
- [math]\DS \log(M) \ge \frac{k}{8}\log(1+\frac{d}{2k})[/math]\,.
- [math]\DS |\omega_j|_0=k[/math] for all [math]j[/math]\,.
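Again, the random construction can be checked empirically for small, arbitrary values of [math]d[/math], [math]k[/math], and [math]M[/math]:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

d, k, M = 80, 10, 20
omegas = np.zeros((M, d), dtype=int)
for j in range(M):
    support = rng.choice(d, size=k, replace=False)  # a uniformly random support
    omegas[j, support] = 1                          # so |omega_j|_0 = k

min_dist = min(int(np.sum(w1 != w2)) for w1, w2 in combinations(omegas, 2))
print(min_dist)  # typically well above the k/2 threshold of the lemma
```

Here the distance between two vectors is [math]2(k-|S\cap T|)[/math] for supports [math]S,T[/math], and the overlap [math]|S\cap T|[/math] is hypergeometric, hence small when [math]k \ll d[/math].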
Take [math]\omega_1, \ldots, \omega_M[/math] independently and uniformly at random from the set
Apply the sparse Varshamov-Gilbert lemma to obtain [math]\omega_{1}, \ldots, \omega_{M}[/math] with [math]\log(M)\ge \frac{k}{8}\log(1+\frac{d}{2k})[/math] and such that [math]\rho(\omega_{j}, \omega_{k})\ge k/2[/math] for all [math]j \neq k[/math]. Let [math]\theta_{1},\ldots, \theta_M[/math] be such that
for some [math]\beta \gt 0[/math] to be chosen later. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \ge 4\frac{\beta^2\sigma^2 }{8n}k\log(1+\frac{d}{2k})[/math];
- [math]\DS |\theta_j-\theta_k|_2^2 =\frac{\beta^2\sigma^2 }{n}\rho(\omega_{j}, \omega_{k})\log(1+\frac{d}{2k}) \le \frac{2k\beta^2\sigma^2 }{n}\log(1+\frac{d}{2k})\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta=\sqrt{\frac{\alpha}{8}}[/math]. Applying now Theorem yields
It implies the following corollary.
Recall that [math]\cB_{0}(k) \subset \R^d[/math] denotes the set of all [math]k[/math]-sparse vectors of [math]\R^d[/math]. The minimax rate of estimation over [math]\cB_{0}(k)[/math] in the Gaussian sequence model is [math]\phi(\cB_{0}(k))=\frac{\sigma^2k}{n}\log(ed/k)[/math]. Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_0(k)}[/math].
Note that the modified BIC estimator of Problem and the Slope estimator \eqref{eq:ad} are also minimax optimal over [math]\cB_{0}(k)[/math]. However, unlike [math] \thetals_{\cB_0(k)} [/math] and the modified BIC, the Slope is also adaptive to [math] k [/math]. For any [math]\eps \gt 0[/math], the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that [math]k\le d^{1-\eps}[/math].
Lower bounds for estimating vectors in [math]\ell_1[/math] balls
Recall that in Maurey's argument, we approximated a vector [math]\theta[/math] such that [math]|\theta|_1=R[/math] by a vector [math]\theta'[/math] such that [math]|\theta'|_0=\frac{R}{\sigma}\sqrt{\frac{n}{\log d}}[/math]\,. We can essentially do the same for the lower bound. Assume that [math]d \ge \sqrt{n}[/math] and let [math]\beta \in (0,1)[/math] be a parameter to be chosen later and define [math]k[/math] to be the smallest integer such that
Let [math]\omega_1, \ldots, \omega_M[/math] be obtained from the sparse Varshamov-Gilbert Lemma with this choice of [math]k[/math] and define
Observe that [math]|\theta_j|_1=R[/math] for [math]j=1,\ldots, M[/math]. We can check the conditions of Theorem:
- [math]\DS|\theta_j-\theta_k|_2^2=\frac{R^2}{k^2}\rho(\omega_{j}, \omega_{k})\ge \frac{R^2}{2k}\ge 4R\min\big(\frac{R}8, \beta^2\sigma\frac{\log (ed/\sqrt{n})}{8n}\big)[/math]\,.
- [math]\DS |\theta_j-\theta_k|_2^2 \le \frac{2R^2}{k} \le 4R\beta\sigma\sqrt{\frac{\log(ed/\sqrt{n})}{n}}\le \frac{2\alpha \sigma^2}{n}\log(M)[/math]\,,
for [math]\beta[/math] small enough if [math]d \ge Ck[/math] for some constant [math]C \gt 0[/math] chosen large enough. Applying now Theorem yields
It implies the following corollary.
Recall that [math]\cB_{1}(R) \subset \R^d[/math] denotes the set of vectors [math]\theta \in \R^d[/math] such that [math]|\theta|_1 \le R[/math]. Then there exists a constant [math]C \gt 0[/math] such that if [math]d \ge n^{1/2+\eps}[/math], [math]\eps \gt 0[/math], the minimax rate of estimation over [math]\cB_{1}(R)[/math] in the Gaussian sequence model is
Moreover, it is attained by the constrained least squares estimator [math]\thetals_{\cB_1(R)}[/math] if [math]R\ge \sigma\frac{\log d}{n}[/math] and by the trivial estimator [math]\hat \theta =0[/math] otherwise.
To complete the proof of the statement, we need to study the risk of the trivial estimator equal to zero for small [math]R[/math]. Note that if [math]|\theta^*|_1\le R[/math], we have
Note that the inequality [math]|\theta^*|_2^2\le |\theta^*|_1^2[/math] appears to be quite loose. Nevertheless, it is tight up to a multiplicative constant for the vectors of the form [math]\theta_j=\omega_j\frac{R}{k}[/math] that are employed in the lower bound. Indeed, if [math]R\le \sigma\frac{\log d}{n}[/math], we have [math]k\le 2/\beta[/math]
Lower bounds for sparse estimation via [math] \chi^2 [/math] divergence
In this section, we will show how to derive lower bounds by directly controlling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estimation problems. Define the [math] \chi^2 [/math] divergence between two probability distributions [math] \p, \q [/math] as
When we compare the expression in the case where [math] \p \ll \q [/math], then we see that both the KL divergence and the [math] \chi^2 [/math] divergence can be written as [math] \int f \left( \frac{d\p}{d\q} \right) d\q [/math] with [math] f(x) = x \log x [/math] and [math] f(x) = (x-1)^2 [/math], respectively. Also note that both of these functions are convex in [math] x [/math] and fulfill [math] f(1) = 0 [/math], which by Jensen’s inequality shows us that they are non-negative and equal to [math] 0 [/math] if and only if [math] \p = \q [/math]. Firstly, let us derive another useful expression for the [math] \chi^2 [/math] divergence. By expanding the square, we have
Secondly, we note that we can bound the KL divergence from above by the [math] \chi^2 [/math] divergence, via Jensen’s inequality, as
where we used the concavity inequality [math] \log(1+x) \leq x [/math] for [math] x \gt -1 [/math]. Note that this estimate is a priori very rough and that we are transitioning from something at [math] \log [/math] scale to something on a regular scale. However, since we are mostly interested in showing that these distances are bounded by a small constant after adjusting the parameters of our models appropriately, it will suffice for our purposes. In particular, we can combine \eqref{eq:ag} with the Neyman-Pearson Lemma and Pinsker’s inequality, Lemma, to get
for distinguishing between two hypotheses with a decision rule [math] \psi [/math] if [math] \chi^2(\p_0, \p_1) \leq \varepsilon [/math]. In the following, we want to apply this observation to derive lower rates for distinguishing between
where [math] \p_u [/math] denotes the probability distribution of a Gaussian sequence model [math] Y = u + \frac{\sigma}{\sqrt{n}} \xi [/math], with [math] \xi \sim \cN(0, I_d) [/math].
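The chain [math]\KL(\p,\q)\le \log(1+\chi^2(\p,\q)) \le \chi^2(\p,\q)[/math] can be checked on an arbitrary pair of discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two random probability vectors on 10 points (arbitrary illustrative choice).
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

chi2 = float(np.sum((p - q) ** 2 / q))  # chi^2 divergence chi^2(p, q)
kl = float(np.sum(p * np.log(p / q)))   # KL divergence KL(p, q)
print(kl, np.log1p(chi2), chi2)         # kl <= log(1 + chi2) <= chi2
```

The first inequality is the Jensen step in the display above; the second is [math]\log(1+x)\le x[/math].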
Consider the detection problem
We introduce the mixture hypothesis
Now, we need to take the expectation over two uniform draws of support sets [math] S [/math] and [math] T [/math], which reduces via conditioning and exploiting the independence of the two draws to
If instead of the [math] \chi^2 [/math] divergence, we try to compare the [math] \KL [/math] divergence between [math] \p_0 [/math] and [math] \bar{\p} [/math], we are tempted to exploit its convexity properties to get
This rate for detection implies lower rates for estimation of both [math] \theta^\ast [/math] and [math] \| \theta^\ast \|_2 [/math]: given an estimator [math] \widehat{T}(X) [/math] for [math] T(\theta) = \| \theta \|_2 [/math] that achieves an error less than [math] \mu/2 [/math] for some [math] X [/math], we would get a correct decision rule for those [math] X [/math] by setting
Lower bounds for estimating the [math] \ell_1 [/math] norm via moment matching
So far, we have seen many examples of rates that scale with [math] n^{-1} [/math] for the squared error. In this section, we are going to see an example for estimating a functional that can only be estimated at a much slower rate of [math] 1/\log n [/math], following [5]. Consider the normalized [math] \ell_1 [/math] norm,
as the target functional to be estimated, and measurements
We are going to show lower bounds for the estimation if restricted to an [math] \ell_\infty [/math] ball, [math] \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} [/math], and for a general [math] \theta \in \R^n [/math].
A constrained risk inequality
We start with a general principle that is in the same spirit as the Neyman-Pearson Lemma (Lemma), but deals with expectations rather than probabilities and uses the [math] \chi^2 [/math] distance to measure the closeness of distributions. Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors. In the following, we write [math] X [/math] for a random observation coming from a distribution indexed by [math] \theta \in \Theta [/math], [math] \widehat{T} = \widehat{T}(X) [/math] for an estimator of a function [math] T(\theta) [/math], and [math] B(\theta) = \E_\theta \widehat{T} - T(\theta) [/math] for the bias of [math] \widehat{T} [/math]. Let [math] \mu_0 [/math] and [math] \mu_1 [/math] be two prior distributions over [math] \Theta [/math] and denote by [math] m_i [/math], [math] v_i^2 [/math] the means and variances of [math] T(\theta) [/math] under the priors [math] \mu_i [/math], [math] i \in \{0, 1\} [/math],
Moreover, write [math] F_i [/math] for the marginal distributions over the priors and denote their density with respect to the average of the two probability distributions by [math] f_i [/math]. With this, we write the [math] \chi^2 [/math] distance between [math] f_0 [/math] and [math] f_1 [/math] as
(i) If [math] \int \E_\theta (\widehat{T}(X) - T(\theta))^2 d\mu_0(\theta) \leq \varepsilon^2 [/math],
(ii) If [math] | m_1 - m_0 | \gt v_0 I[/math] and [math] 0 \leq \lambda \leq 1 [/math],
Without loss of generality, assume [math] m_1 \geq m_0 [/math]. We start by considering the term
Finally, since the minimax risk is bigger than any mixture risk, we can bound it from below by the maximum value of this bound, which is obtained at [math] \lambda = (I+1)/(I + 2) [/math] to get \eqref{eq:minimax_risk_mixture}, and by the same argument \eqref{eq:minimax_risk}.
Unfavorable priors via polynomial approximation
In order to construct unfavorable priors for the [math] \ell_1 [/math] norm \eqref{eq:aw}, we resort to polynomial approximation theory. For a continuous function [math] f [/math] on [math] [-1, 1] [/math], denote the maximal approximation error by polynomials of degree [math] k [/math], written as [math] \mathcal{P}_k [/math], by
and the polynomial attaining this error by [math] G_k^\ast [/math]. For [math] f(t) = |t| [/math], it is known [6] that
which is a very slow rate of convergence of the approximation, compared to smooth functions for which we would expect geometric convergence. We can now construct our priors using some abstract measure theoretical results.
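The slow [math]O(1/k)[/math] decay can be observed numerically. The sketch below uses Chebyshev interpolation as a convenient computable stand-in for the true best approximation [math]G_k^\ast[/math] (it is near-best, so the observed rate is comparable):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

t = np.linspace(-1, 1, 10001)
for k in [4, 8, 16, 32, 64]:
    coef = C.chebinterpolate(np.abs, k)  # degree-k Chebyshev interpolant of |t|
    err = float(np.max(np.abs(np.abs(t) - C.chebval(t, coef))))
    print(k, err, k * err)  # k * err stays roughly constant: the 1/k rate
```

Compare this with a smooth function such as [math]e^t[/math], for which the same loop would show geometric decay of the error in [math]k[/math].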
For an integer [math] k \gt 0 [/math], there are two probability measures [math] \nu_0 [/math] and [math] \nu_1 [/math] on [math] [-1, 1] [/math] that fulfill the following:
- [math] \nu_0 [/math] and [math] \nu_1 [/math] are symmetric about 0;
- [math] \int t^l d \nu_1 (t) = \int t^l d \nu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k \} [/math];
- [math]\int |t| d \nu_1(t) - \int | t | d \nu_0(t) = 2 \delta_k [/math].
The idea is to construct the measures via the Hahn-Banach theorem and the Riesz representation theorem. First, consider [math] f(t) = |t| [/math] as an element of the space [math] C([-1, 1]) [/math] equipped with the supremum norm and define [math] \mathcal{P}_k [/math] as the space of polynomials of degree up to [math] k [/math], and [math] \mathcal{F}_k := \operatorname{span}(\mathcal{P}_k \cup \{ f \}) [/math]. On [math] \mathcal{F}_k [/math], define the functional [math] T(c f + p_k) = c \delta_k [/math], which is well-defined because [math] f [/math] is not a polynomial. Claim: [math] \| T \| = \sup \{ T(g) : g \in \mathcal{F}_k, \| g \|_\infty \leq 1\} = 1 [/math]. [math] \| T \| \geq 1 [/math]: Let [math] G_k^\ast [/math] be the best-approximating polynomial to [math] f [/math] in [math] \mathcal{P}_k [/math]. Then, [math] \| f - G_k^\ast \|_\infty = \delta_k [/math], [math] \| (f - G^\ast_k)/\delta_k \|_\infty = 1 [/math], and [math] T((f - G^\ast_k)/\delta_k) = 1 [/math]. [math] \| T \| \leq 1 [/math]: Suppose [math] g = cf + p_k [/math] with [math] p_k \in \mathcal{P}_k [/math], [math] \| g \|_\infty = 1 [/math] and [math] T(g) \gt 1 [/math]. Then [math] c \gt 1/\delta_k [/math] and
Minimax lower bounds
We have the following bounds on the minimax rate for [math] T(\theta) = \frac{1}{n} \sum_{i = 1}^{n} | \theta_i| [/math]: For [math] \theta \in \Theta_n(M) = \{ \theta \in \R^n : | \theta_i | \leq M \} [/math],
We want to use Theorem. To construct the prior measures, we scale the measures from Lemma appropriately: Let [math] k_n [/math] be an even integer that is to be determined, and [math] \nu_0 [/math], [math] \nu_1 [/math] the two measures given by Lemma. Define [math] g(x) = Mx [/math] and define the measures [math] \mu_i [/math] by dilating [math] \nu_i [/math], [math] \mu_i(A) = \nu_i(g^{-1}(A)) [/math] for every Borel set [math] A \subseteq [-M, M] [/math] and [math] i \in \{0, 1\} [/math]. Hence,
- [math] \mu_0 [/math] and [math] \mu_1 [/math] are symmetric about 0;
- [math] \int t^l d \mu_1 (t) = \int t^l d \mu_0(t) [/math] for all [math] l \in \{0, 1, \dots, k_n \} [/math];
- [math]\int |t| d \mu_1(t) - \int | t | d \mu_0(t) = 2 M \delta_{k_n} [/math].
To get priors for [math] n [/math] i.i.d. samples, consider the product priors [math] \mu_i^n = \prod_{j=1}^n \mu_i [/math]. With this, we have
To prove the lower bound over [math] \R^n [/math], take [math] M = \sqrt{\log n} [/math] and [math] k_n [/math] to be the smallest integer such that [math] k_n \geq 2 \e \log n [/math] and plug this into \eqref{eq:ax}, yielding
Note that [5] also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating [math] T(\theta) [/math].
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
References
- Cover, Thomas M.; Thomas, Joy A. (2006). Elements of information theory. Hoboken, NJ: Wiley-Interscience [John Wiley & Sons]. pp. xxiv+748. ISBN 978-0-471-24195-9.
- Shao, Jun (2003). Mathematical statistics. New York: Springer-Verlag. pp. xvi+591. ISBN 0-387-95382-5.
- Billingsley, Patrick (1995). Probability and measure. New York: John Wiley & Sons Inc. pp. xiv+593. ISBN 0-471-00710-2.
- Tsybakov, Alexandre B. (2009). Introduction to nonparametric estimation. New York: Springer. pp. xii+214. ISBN 978-0-387-79051-0.
- Cai, T. Tony; Low, Mark G. (2011). "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional". Ann. Statist. 39: 1012-1041.
- Rivlin, Theodore J. (1990). Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory. Wiley. ISBN 978-0-471-62896-5.