Arguably, the strongest assumption that we made in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] is that the regression function <math>f(x)</math> is of the form <math>f(x)=x^\top \theta^*</math>. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be ''robust'' to deviations from this model. This is the problem of model misspecification, and in particular of misspecified linear models.
Throughout this chapter, we assume the following model:

<span id="EQ:regmodgen"/>
<math display="block">
\begin{equation}
\label{EQ:regmodgen}
Y_i=f(X_i)+\eps_i,\quad i=1, \ldots, n\,,
\end{equation}
</math>
where <math>\eps=(\eps_1, \ldots, \eps_n)^\top</math> is sub-Gaussian with variance proxy <math>\sigma^2</math>. Here <math>X_i \in \R^d</math>. When dealing with fixed design, it will be convenient to consider the vector <math>g \in \R^n</math> defined for any function <math>g \,:\, \R^d \to \R</math> by <math>g=(g(X_1), \ldots, g(X_n))^\top</math>. In this case, we can write for any estimator <math>\hat f \in \R^n</math> of <math>f</math>,

<math display="block">
\MSE(\hat f)=\frac1n|\hat f - f|_2^2\,.
</math>

Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: <math>\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso</math>. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of <math>f</math> if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.

==Oracle inequalities==
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator <math>\thetals</math>, we constrain ourselves to estimating functions that are of the form <math>x\mapsto x^\top \theta</math>, even though <math>f</math> itself may not be of this form. Therefore, the oracle <math>\hat f</math> is the linear function that is the closest to <math>f</math>. Rather than trying to approximate <math>f</math> by a linear function <math>f(x)\approx \theta^\top x</math>, we make the model a bit more general and consider a dictionary <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> of functions where <math>\varphi_j\,:\, \R^d \to \R</math>. In this case, we can actually remove the assumption that <math>X \in \R^d</math>. Indeed, the goal is now to estimate <math>f</math> using a linear combination of the functions in the dictionary:

<math display="block">
f \approx \varphi_\theta:=\sum_{j=1}^M \theta_j \varphi_j\,.
</math>

If <math>M=d</math> and <math>\varphi_j(X)=X^{(j)}</math> returns the <math>j</math>th coordinate of <math>X \in \R^d</math>, then the goal is to approximate <math>f(x)</math> by <math>\theta^\top x</math>. Nevertheless, the use of a dictionary allows for a much more general framework.

Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
<ul><li> The least squares estimator:

<span id="EQ:defthetaLSmis"/>
<math display="block">
\begin{equation}
\label{EQ:defthetaLSmis}
\thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
\end{equation}
</math>
</li>
<li> The least squares estimator constrained to <math>K \subset \R^M</math>:

<math display="block">
\thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
</math>
</li>
<li> The BIC estimator:

<span id="EQ:defBICmis"/>
<math display="block">
\begin{equation}
\label{EQ:defBICmis}
\thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\}
\end{equation}
</math>
</li>
<li> The Lasso estimator:

<span id="EQ:defLassomis"/>
<math display="block">
\begin{equation}
\label{EQ:defLassomis}
\thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\}
\end{equation}
</math>
</li>
</ul>
{{defncard|label=|id=|Let <math>R(\cdot)</math> be a risk function and let <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> be a dictionary of functions from <math>\R^d</math> to <math>\R</math>. Let <math>K</math> be a subset of <math>\R^M</math>. The oracle on <math>K</math> with respect to <math>R</math> is defined by <math>\varphi_{\bar \theta}</math>, where <math>\bar \theta \in K</math> is such that

<math display="block">
R(\varphi_{\bar \theta}) \le R(\varphi_\theta) \,, \qquad \forall\, \theta \in K\,.
</math>
Moreover, <math>R_{K}=R(\varphi_{\bar \theta})</math> is called oracle risk on <math>K</math>. An estimator <math>\hat f</math> is said to satisfy an oracle inequality (over <math>K</math>) with remainder term <math>\phi</math> in expectation (resp. with high probability) if there exists a constant <math>C\ge 1</math> such that

<math display="block">
\E R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M}(K)\,,
</math>
or

<math display="block">
\p\big\{R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K) \big\} \ge 1-\delta \,, \qquad \forall \ \delta > 0
</math>
respectively. If <math>C=1</math>, the oracle inequality is sometimes called exact.}}
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
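To make the dictionary setup concrete, here is a small illustrative sketch in Python/NumPy (not part of the original text): it builds the design matrix <math>\Phi_{i,j}=\varphi_j(X_i)</math> for an arbitrary trigonometric dictionary, evaluates <math>\varphi_\theta</math> and computes <math>\MSE(\varphi_\theta)</math> for a regression function <math>f</math> that does not belong to the span of the dictionary. The particular choices of <math>f</math> and of the dictionary are assumptions made only for the example.
<syntaxhighlight lang="python">
import numpy as np

n, M = 200, 11
X = np.arange(n) / n            # fixed design on [0, 1)
f = np.abs(X - 0.3)             # true regression function, not in the span of the dictionary

def dictionary(x, M):
    """Design matrix Phi with Phi[i, j] = phi_j(x_i) for a small trigonometric dictionary."""
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < M:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) < M:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

Phi = dictionary(X, M)

def mse(theta):
    """MSE(phi_theta) = (1/n) |phi_theta - f|_2^2 with phi_theta = Phi @ theta."""
    return np.mean((Phi @ theta - f) ** 2)

theta = np.zeros(M); theta[0] = f.mean()   # a crude candidate: constant approximation
print("MSE of the constant approximation:", mse(theta))
</syntaxhighlight>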
===Oracle inequality for the least squares estimator===
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of misspecified models, the difficulty of the extension to this case for linear models is essentially already captured by the analysis for the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
{{proofcard|Theorem|TH:LS_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the least squares estimator <math>\thetals</math> satisfies for some numerical constant <math>C > 0</math>,

<math display="block">
\MSE(\varphi_{\thetals})\le \inf_{\theta \in \R^M}\MSE(\varphi_\theta)+C\frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability at least <math>1-\delta</math>.
|Note that by definition

<math display="block">
|Y-\varphi_{\thetals}|_2^2 \le |Y-\varphi_{\bar \theta}|_2^2\,,
</math>
where <math>\varphi_{\bar \theta}</math> denotes the orthogonal projection of <math>f</math> onto the linear span of <math>\varphi_1, \ldots, \varphi_M</math>. Since <math>Y=f+\eps</math>, we get

<math display="block">
|f-\varphi_{\thetals}|_2^2 \le |f-\varphi_{\bar \theta}|_2^2+2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Moreover, by Pythagoras's theorem, we have

<math display="block">
|f-\varphi_{\thetals}|_2^2 - |f-\varphi_{\bar \theta}|_2^2=|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\,.
</math>
It yields

<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\le 2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Using the same steps as the ones following equation [[guide:090050c501#EQ:bound_ls_1 |EQ:bound_ls_1]] for the well-specified case, we get

<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\lesssim \frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability <math>1-\delta</math>. The result of the lemma follows.}}
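The theorem can be checked numerically. The sketch below (an added illustration in Python/NumPy, with arbitrary choices of <math>f</math>, dictionary and <math>\sigma</math>) compares <math>\MSE(\varphi_{\thetals})</math> to the oracle risk <math>\inf_\theta \MSE(\varphi_\theta)</math>, obtained here by projecting <math>f</math> itself onto the span of the dictionary, and to the stochastic term <math>\sigma^2M/n</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 500, 10, 0.5
X = np.arange(n) / n
f = np.exp(-3 * X) * np.sin(7 * X)              # misspecified: f is not phi_theta for any theta

Phi = np.vander(X, M, increasing=True)          # polynomial dictionary phi_j(x) = x^(j-1)
Y = f + sigma * rng.standard_normal(n)

theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)      # least squares on the noisy data
theta_bar, *_ = np.linalg.lstsq(Phi, f, rcond=None)     # oracle: projection of f itself

print("MSE(LS)     =", np.mean((Phi @ theta_ls - f) ** 2))
print("oracle risk =", np.mean((Phi @ theta_bar - f) ** 2))
print("sigma^2*M/n =", sigma ** 2 * M / n)               # stochastic term in the oracle inequality
</syntaxhighlight>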
===Sparse oracle inequality for the BIC estimator===
The analysis for more complicated estimators such as the BIC in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] allows us to derive oracle inequalities for these estimators.
{{proofcard|Theorem|TH:BIC_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the BIC estimator <math>\thetabic</math> with regularization parameter

<span id="EQ:taubicmis"/>
<math display="block">
\begin{equation}
</math>
with probability at least <math>1-\delta</math>.
|Recall that the proof of [[guide:090050c501#TH:BIC |Theorem]] for the BIC estimator begins as follows:

<math display="block">
\frac1n|Y-\varphi_{\thetabic}|_2^2 + \tau^2|\thetabic|_0 \le \frac1n|Y-\varphi_{\theta}|_2^2 + \tau^2|\theta|_0\,.
</math>
This is true for any <math>\theta \in \R^M</math>. It implies

<math display="block">
|f-\varphi_{\thetabic}|_2^2 + n\tau^2|\thetabic|_0 \le |f-\varphi_{ \theta}|_2^2+2\eps^\top(\varphi_{\thetabic}-\varphi_{ \theta}) + n\tau^2|\theta|_0\,.
</math>
Note that if <math>\thetabic= \theta</math>, the result is trivial. Otherwise,

<math display="block">
\begin{align*}
2\eps^\top(\varphi_{\thetabic}-\varphi_\theta) &=2\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)|\varphi_{\thetabic}-\varphi_\theta|_2\\
&\le \frac2\alpha\Big[\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)\Big]^2 + \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\,,
\end{align*}
</math>
where we used Young's inequality <math>2ab\le \frac2\alpha a^2+\frac{\alpha}{2} b^2</math>, which is valid for <math>a,b \in \R</math> and <math>\alpha > 0</math>. Next, since

<math display="block">
\frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\le \alpha|\varphi_{\thetabic}-f|_2^2+\alpha|\varphi_\theta-f|_2^2\,,
</math>
we get for <math>\alpha=1/2</math>,

<math display="block">
\begin{align*}
\frac12|\varphi_{\thetabic}-f|_2^2 &\le \frac32|\varphi_{ \theta}-f|_2^2+n\tau^2|\theta|_0 +4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic|_0\\
&\le \frac32|\varphi_{ \theta}-f|_2^2+2n\tau^2|\theta|_0+4\big[\eps^\top\cU(\varphi_{\thetabic -\theta})\big]^2- n\tau^2|\thetabic-\theta|_0.
\end{align*}
</math>
We conclude as in the proof of [[guide:090050c501#TH:BIC |Theorem]].}}
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error <math>\MSE(\varphi_\theta)</math> and the complexity of <math>\theta</math> as measured by its sparsity. In particular, this result, which is sometimes called a ''sparse oracle inequality'', implies the following oracle inequality. Define the oracle <math>\bar \theta</math> to be such that

<math display="block">
\MSE(\varphi_{\bar \theta})=\min_{\theta \in \R^M}\MSE(\varphi_\theta)\,.
</math>
Then, with probability at least <math>1-\delta</math>,

<math display="block">
\MSE(\varphi_{\thetabic})\le 3\MSE(\varphi_{\bar \theta})+\frac{C\sigma^2}{n}\Big[|\bar \theta|_0\log (eM)+ \log(1/\delta)\Big]
</math>

If the linear model happens to be correct, then we simply have <math>\MSE(\varphi_{\bar \theta})=0</math>.

===Sparse oracle inequality for the Lasso===
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the <math>n\times M</math> matrix <math>\Phi</math> with elements <math>\Phi_{i,j}=\varphi_j(X_i)</math>.
{{proofcard|Theorem|TH:lasso_fast_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Moreover, assume that there exists an integer <math>k</math> such that the matrix <math>\Phi</math> satisfies assumption <math>\textsf{INC$(k)$}</math>. Then, the Lasso estimator <math>\thetalasso</math> with regularization parameter given by

<span id{{=}}"EQ:taulassomis"/>
<math display="block">
\begin{equation}
\label{EQ:taulassomis}
2\tau=8\sigma\sqrt{\frac{2\log(2M)}{n}}+8\sigma\sqrt{\frac{2\log(1/\delta)}{n}}
\end{equation}
</math>
satisfies for some numerical constant <math>C</math>,

<math display="block">
\begin{align*}
\MSE(\varphi_{\thetalasso})\le \inf_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\Big\{ \MSE(\varphi_\theta) +\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\}
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
Expanding the squares, adding <math>\tau|\thetalasso-\theta|_1</math> on each side and multiplying by <math>n</math>, we get

<span id{{=}}"EQ:pr:TH:lassofastmis1"/>
<math display="block">
\begin{align}
2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) \le \frac{n\tau}2|\thetalasso-\theta|_1.
</math>
Therefore, taking <math>S=\supp(\theta)</math> to be the support of <math>\theta</math>, we get that the right-hand side of \eqref{EQ:pr:TH:lassofastmis1} is bounded by

<span id{{=}}"EQ:pr:TH:lassofastmis2"/>
<math display="block">
\begin{align}
|\thetalasso_{S^c}-\theta_{S^c}|_1\le 3 |\thetalasso_{S}-\theta_{S}|_1\,.
</math>
so that <math>\thetalasso-\theta</math> satisfies the cone condition [[guide:090050c501#EQ:conecond |EQ:conecond]].
Using now the Cauchy-Schwarz inequality and [[guide:090050c501#lem:inc |Lemma]], respectively, assuming that <math>|\theta|_0\le k</math>, we get

<math display="block">
\end{align*}
</math>
Combining this result with \eqref{EQ:pr:TH:lassofastmis1} and \eqref{EQ:pr:TH:lassofastmis2}, we find

<math display="block">
</math>
To conclude the proof, it only remains to divide by <math>1-\alpha</math> on both sides of the above inequality and take <math>\alpha=1/2</math>.}}
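For a concrete feel for the estimator \eqref{EQ:defLassomis}, the following sketch (an added illustration, Python/NumPy) solves the Lasso problem over a dictionary with a generic proximal-gradient (ISTA) loop and uses the regularization level \eqref{EQ:taulassomis}. The solver and the data-generating choices are assumptions made for the example, not part of the original text.
<syntaxhighlight lang="python">
import numpy as np

def lasso_dictionary(Phi, Y, tau, n_iter=5000):
    """Minimize (1/n)|Y - Phi theta|_2^2 + 2*tau*|theta|_1 by proximal gradient (ISTA)."""
    n, M = Phi.shape
    L = 2 * np.linalg.norm(Phi, 2) ** 2 / n      # Lipschitz constant of the smooth part
    step = 1.0 / L
    theta = np.zeros(M)
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ theta - Y) / n
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * 2 * tau, 0.0)   # soft-thresholding step
    return theta

rng = np.random.default_rng(1)
n, M, sigma, delta = 400, 50, 0.5, 0.1
X = np.arange(n) / n
Phi = np.column_stack([np.sqrt(2) * np.cos(np.pi * (j + 1) * X) for j in range(M)])
f = np.minimum(X, 0.5)                            # misspecified target, not in the span
Y = f + sigma * rng.standard_normal(n)

two_tau = 8 * sigma * (np.sqrt(2 * np.log(2 * M) / n) + np.sqrt(2 * np.log(1 / delta) / n))
theta_lasso = lasso_dictionary(Phi, Y, two_tau / 2)
print("MSE(Lasso):", np.mean((Phi @ theta_lasso - f) ** 2))
print("nonzeros  :", int(np.sum(theta_lasso != 0)))
</syntaxhighlight>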
===Maurey's argument===
If there is no sparse <math>\theta</math> such that <math>\MSE(\varphi_{\theta})</math> is small, [[#TH:BIC_mis |Theorem]] is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors <math>\theta</math> with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if <math>\theta</math> has a small <math>\ell_1</math> norm but is not sparse. For such <math>\theta</math>, the Lasso estimator still enjoys slow rates as in [[guide:090050c501#TH:lasso_slow |Theorem]], which can be easily extended to the misspecified case (see [[exercise:165e02b2e2 |Problem]]). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector <math>\theta \in \R^M</math> such that <math>|\theta|_1\le 1</math>, there exists a vector <math>\theta'</math> that is sparse and for which <math>\MSE(\varphi_{\theta'})</math> is not much larger than <math>\MSE(\varphi_\theta)</math>. The following theorem quantifies exactly the tradeoff between sparsity and <math>\MSE</math>. It is often attributed to B. Maurey and was published by Pisier <ref name="Pis81">{{cite journal|last=Pisier|first=Gilles|year=1981|title=Remarques sur un résultat non publié de B. Maurey|journal=Séminaire d'Analyse Fonctionnelle, École Polytechnique}}</ref>. This is why it is referred to as ''Maurey's argument''.
 
{{proofcard|Theorem|thm-1|Let <math>\{\varphi_1, \dots, \varphi_M\}</math> be a dictionary normalized in such a way that

and assume without loss of generality that <math>|\bar \theta_1| \ge |\bar \theta_2|\ge \ldots \ge |\bar \theta_M|</math>.
Let now <math>U \in \R^n</math> be a random vector with values in <math>\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}</math> defined by
<math display="block">
\begin{alignat*}{2}
\p(U=R\,\mathrm{sign}(\bar \theta_j)\varphi_j)&=\frac{|\bar \theta_j|}{R}\,, &\quad &j=1, \ldots, M,\\
\p(U=0)&=1-\frac{|\bar \theta|_1}{R}\,.
\end{alignat*}
</math>
Note that <math>\E[U]=\varphi_{\bar \theta}</math> and <math>|U|_2 \le RD \sqrt{n}</math>. Let now <math>U_1, \ldots, U_k</math> be <math>k</math> independent copies of <math>U</math> and define their average
<math display="block">
</math>
and divide by <math>n</math>.}}
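The sampling construction in this proof is easy to simulate. The following Monte Carlo sketch (an added illustration, Python/NumPy) draws <math>k</math> independent copies of <math>U</math> for an arbitrary <math>\bar\theta</math> and dictionary, and compares the average squared error of <math>\bar U=\frac1k\sum_i U_i</math> to the bound <math>R^2D^2/k</math>; the particular dictionary, <math>\bar\theta</math> and <math>R</math> are assumptions made only for the example.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n, M, k = 300, 40, 8
X = np.arange(n) / n
Phi = np.column_stack([np.sqrt(2) * np.cos(np.pi * (j + 1) * X) for j in range(M)])
D = np.sqrt(2)                              # here max_j |phi_j|_2 <= D*sqrt(n)

theta_bar = 0.8 ** np.arange(M)             # approximately sparse coefficient vector
R = np.sum(np.abs(theta_bar))               # take R = |theta_bar|_1, so P(U=0) = 0
phi_bar = Phi @ theta_bar

probs = np.abs(theta_bar) / R               # P(U = R*sign(theta_j)*phi_j) = |theta_j|/R, E[U] = phi_bar
def sample_U():
    j = rng.choice(M, p=probs)
    return R * np.sign(theta_bar[j]) * Phi[:, j]

n_rep, errs = 500, []
for _ in range(n_rep):
    U_bar = np.mean([sample_U() for _ in range(k)], axis=0)   # average of k iid copies of U
    errs.append(np.sum((U_bar - phi_bar) ** 2) / n)
print("average (1/n)|U_bar - phi_bar|_2^2 ~", np.mean(errs))
print("Maurey bound R^2 D^2 / k           =", R ** 2 * D ** 2 / k)
</syntaxhighlight>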
Maurey's argument implies the following corollary.
{{proofcard|Corollary|cor-1|Assume that the assumptions of [[#TH:BIC_mis |Theorem]] hold and that the dictionary <math>\{\varphi_1, \dots, \varphi_M\}</math> is normalized in such a way that

<math display="block">
</math>
with probability at least <math>1-\delta</math>.
|Choosing <math>\alpha=1/3</math> in [[#TH:BIC_mis |Theorem]] yields

<math display="block">
<math display="block">
\begin{equation*}
\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n}  \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,.
\end{equation*}
</math>
</math>
which together with Theorem 3.4 yields the claim.}}
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the <math>\ell_0</math> norm as in [[#TH:BIC_mis |Theorem]]. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of <ref name="RigTsy11">{{cite journal|last1=Rigollet|first1=Philippe|last2=Tsybakov|first2=Alexandre|journal=Ann. Statist.|year=2011|title=Exponential screening and optimal rates of sparse estimation|volume=39|number=2}}</ref>.
Maurey's argument allows us to enjoy the best of both the <math>\ell_0</math> and the <math>\ell_1</math> world. The rate adapts to the sparsity of the problem and can even be generalized to <math>\ell_q</math>-sparsity (see [[exercise:Db743c87c5|Problem]]). However, it is clear from the proof that this argument is limited to squared <math>\ell_2</math> norms such as the one appearing in <math>\MSE</math>, and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces <ref name="Pis81"/><ref name="DonDarGur97">{{cite journal|last1=Donahue|first1=M. J.|last2=Gurvits|first2=L.|last3=Darken|first3=C.|last4=Sontag|first4=E.|journal=Constructive Approximation|year=1997|title=Rates of convex approximation in non-Hilbert spaces|volume=13|number=2}}</ref> using more sophisticated arguments.

==Nonparametric regression==


There exist many choices for the orthonormal system, and we give only two as examples.
'''Example''' 
<span id = "ex:trigbasis"/>
''Trigonometric basis''. This is an orthonormal basis of <math>L_2([0,1])</math>. It is defined by


for <math>k=1,2, \dots</math> and <math>x \in [0,1]</math>. The fact that it is indeed an orthonormal system can be easily checked using trigonometric identities.
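The orthonormality can also be verified numerically. The sketch below (an added illustration, Python/NumPy) assumes the usual convention <math>\varphi_1=1</math>, <math>\varphi_{2k}(x)=\sqrt2\cos(2\pi kx)</math>, <math>\varphi_{2k+1}(x)=\sqrt2\sin(2\pi kx)</math> and checks that the Gram matrix of the first few basis functions in <math>L_2([0,1])</math> is close to the identity.
<syntaxhighlight lang="python">
import numpy as np

def trig_basis(j, x):
    """phi_1 = 1, phi_2k = sqrt(2)cos(2*pi*k*x), phi_{2k+1} = sqrt(2)sin(2*pi*k*x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0 else np.sin(2 * np.pi * k * x))

# check <phi_j, phi_l>_{L_2([0,1])} ~ delta_{jl} by numerical integration
x = np.linspace(0, 1, 100001)
G = np.empty((6, 6))
for j in range(1, 7):
    for l in range(1, 7):
        G[j - 1, l - 1] = np.trapz(trig_basis(j, x) * trig_basis(l, x), x)
print(np.round(G, 6))   # approximately the 6 x 6 identity matrix
</syntaxhighlight>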


The next example has received a lot of attention in the signal (sound, image, ...) processing community.

'''Example''' 
''Wavelets''. 
</math>
The coefficients <math>\theta_{jk}</math> are called ''wavelet coefficients'' of <math>g</math>.
The simplest example is given by the ''Haar system'' obtained by taking <math>\psi</math> to be the following piecewise constant function (see [[#FIG:haar |Figure]]). We will not give more details about wavelets here but simply point the interested reader to <ref name="Mal09">{{cite book|last=Mallat|first=Stéphane|year=2009|title=A Wavelet Tour of Signal Processing: The Sparse Way|edition=3rd|publisher=Academic Press}}</ref>.


<math display="block">
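As an added illustration (Python/NumPy), the sketch below computes a few Haar wavelet coefficients <math>\theta_{jk}</math> of a test function, assuming the standard dyadic construction <math>\psi_{jk}(x)=2^{j/2}\psi(2^jx-k)</math> with <math>\psi=1</math> on <math>[0,1/2)</math> and <math>-1</math> on <math>[1/2,1)</math>; these conventions are assumptions of the example.
<syntaxhighlight lang="python">
import numpy as np

def haar_mother(x):
    """Standard Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((x >= 0) & (x < 0.5), 1.0, 0.0) - np.where((x >= 0.5) & (x < 1), 1.0, 0.0)

def haar_coeff(g, j, k, n_grid=200001):
    """theta_{jk} = int_0^1 g(x) 2^{j/2} psi(2^j x - k) dx, by numerical quadrature."""
    x = np.linspace(0, 1, n_grid)
    psi_jk = 2 ** (j / 2) * haar_mother(2 ** j * x - k)
    return np.trapz(g(x) * psi_jk, x)

g = lambda x: x ** 2            # an arbitrary test function on [0, 1]
for j in range(3):
    coeffs = [haar_coeff(g, j, k) for k in range(2 ** j)]
    print(f"level j={j}:", np.round(coeffs, 4))
</syntaxhighlight>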
===Sobolev classes and ellipsoids===
We begin by describing a class of smooth functions where smoothness is understood in terms of the number of derivatives. Recall that <math>f^{(k)}</math> denotes the <math>k</math>-th derivative of <math>f</math>.

{{defncard|label=|id=|Fix parameters <math>\beta \in \{1, 2, \dots\}</math> and <math>L > 0</math>. The Sobolev class of functions <math>W(\beta,L)</math> is defined by


<math display="block">
For any <math>\beta > 0</math>, define the coefficients

<span id="EQ:defaj"/>
<math display="block">
\begin{equation}
</math>
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.
{{proofcard|Theorem|thm-2|Fix <math>\beta\ge 1</math> and <math>L > 0</math> and let <math>\{\varphi_j\}_{j \ge 1}</math> denote the trigonometric basis of <math>L_2([0,1])</math>. Moreover, let <math>\{a_j\}_{j\ge 1}</math> be defined as in \eqref{EQ:defaj}. A function <math>f \in W(\beta,L)</math> can be represented as 


<math display="block">
s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big)
</math>
Next, it follows from the definition \eqref{EQ:defaj} of <math>a_j</math> that

<math display="block">
</math></li>
</ul>|}}
The proof is left as an exercise ([[exercise:143f8e5f8a |Problem]]).
It turns out that the first <math> n </math> functions in the trigonometric basis are not only orthonormal with respect to the inner product of <math>L_2</math>, but also with respect to the inner product associated with fixed design, <math>\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)</math>, when the design is chosen to be ''regular'', i.e., <math>X_i=(i-1)/n</math>, <math>i=1, \ldots, n</math>.
{{proofcard|Lemma|LEM:regdesign|Assume that <math>\{X_1, \ldots, X_n\}</math> is the regular design, i.e., <math>X_i=(i-1)/n</math>.  Then, for any <math>M\le n-1</math>, the design matrix <math>\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}</math> satisfies the <math>\textsf{ORT}</math> condition.
|Fix <math>j,j'\in \{1, \ldots, n-1\}</math>, <math>j\neq j'</math>, and consider the inner product <math>\varphi_j^\top \varphi_{j'}</math>. Write <math>k_{j}=\lfloor j/2\rfloor</math> for the integer part of <math>j/2</math> and define the vectors <math>a, b, a', b' \in \R^n</math> with coordinates such that <math>e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}</math> and <math>e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}</math> for <math> s \in \{0, \dots, n-1\} </math>. It holds that


<math display="block">
\begin{equation*}
\Phi^\top \Phi=nI_M\,.
\end{equation*}
</math>}}
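This identity is easy to confirm numerically. The following added sketch (Python/NumPy), assuming the usual normalization of the trigonometric basis, checks that <math>\Phi^\top\Phi=nI_M</math> for the regular design and <math>M\le n-1</math>.
<syntaxhighlight lang="python">
import numpy as np

def trig_basis(j, x):
    """phi_1 = 1, phi_2k = sqrt(2)cos(2*pi*k*x), phi_{2k+1} = sqrt(2)sin(2*pi*k*x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0 else np.sin(2 * np.pi * k * x))

n, M = 64, 63                                      # any M <= n - 1
X = np.arange(n) / n                               # regular design X_i = (i-1)/n
Phi = np.column_stack([trig_basis(j, X) for j in range(1, M + 1)])
print(np.allclose(Phi.T @ Phi, n * np.eye(M), atol=1e-8))   # True: Phi^T Phi = n I_M
</syntaxhighlight>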
{{proofcard|Lemma|TH:bias|For any integer <math>M \ge 1</math>, and <math>f \in \Theta(\beta, Q)</math>, <math>\beta > 1/2</math>, it holds

<span id="EQ:bias1"/>
<math display="block">
\begin{equation}
and for <math>M =n-1</math>, we have

<span id="EQ:bias2"/>
<math display="block">
\begin{equation}
<math display="block">
\begin{equation*}
\sum_{j\ge n} |\theta^*_j|=\sum_{j\ge n} a_j|\theta^*_j|\frac{1}{a_j}\le \sqrt{\sum_{j\ge n} a_j^2|\theta^*_j|^2}\sqrt{\sum_{j\ge n}\frac{1}{a_j^2}}\lesssim \sqrt{Q} n^{\frac{1}{2}-\beta}\,.
\end{equation*}
</math>}}
\thetals \in \argmin_{\theta \in \R^M} \sum_{i=1}^n \big(Y_i -\varphi_\theta(X_i)\big)^2\,,
</math>
which should be such that <math>\varphi_{\thetals}</math> is close to <math>\varphi_{\theta^*}</math>. For this estimator, we have proved ([[#TH:LS_mis |Theorem]]) an oracle inequality for the <math>\MSE</math> that is of the form

<math display="block">
\end{align*}
</math>
where we used [[#LEM:regdesign |Lemma]] in the last equality. Together with \eqref{EQ:bias2} and Young's inequality <math>2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0</math> for any <math>\alpha > 0</math>, we get


<math display="block">
for some positive constant <math>C</math> when <math>\theta^* \in \Theta(\beta, Q)</math>. As a result,

<span id="EQ:almostLSNP"/>
<math display="block">
\begin{equation}
\end{equation}
</math>
for any <math>\alpha \in (0,1)</math>. Since [[#LEM:regdesign |Lemma]] implies <math>|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2</math>, we have proved the following theorem.


{{proofcard|Theorem|TH:UB_ls_NP|Fix <math>\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math> and assume the general regression model \eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Moreover, let <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> and <math>n</math> be large enough so that <math>M \le n-1</math>. Then the least squares estimator <math>\thetals</math> defined in \eqref{EQ:defthetaLSmis} with <math>\{\varphi_j\}_{j=1}^M</math> being the trigonometric basis, satisfies with probability <math>1-\delta</math>, for <math>n</math> large enough,

<math display="block">
</math>
where the constant factors may depend on <math>\beta, Q</math> and <math>\sigma</math>.
|Choosing <math>\alpha=1/2</math> for example and absorbing <math>Q</math> in the constants, we get from \eqref{EQ:almostLSNP} and [[#LEM:regdesign |Lemma]] that
for <math>M \le n-1</math>,


\|\varphi^M_{\thetals}-\varphi^M_{\theta^*}\|^2_{L_2([0,1])}\lesssim n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,.
</math>
Using now [[#TH:bias |Lemma]] and <math>\sigma^2 \le 1</math>, we get

<math display="block">
To conclude the proof, simply note that for the prescribed <math>\beta</math>, we have <math>n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}</math>.}}
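A quick simulation (an added illustration, Python/NumPy) shows the rate in action: for a fixed, arbitrarily chosen target <math>f</math> and <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math>, the empirical <math>\MSE</math> of the projection estimator is compared to <math>n^{-\frac{2\beta}{2\beta+1}}</math> as <math>n</math> grows; the target and the value of <math>\beta</math> are assumptions of the example.
<syntaxhighlight lang="python">
import numpy as np

def trig_basis(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0 else np.sin(2 * np.pi * k * x))

def projection_estimator(Y, X, M):
    """Least squares over the first M trigonometric basis functions (truncated series estimator)."""
    Phi = np.column_stack([trig_basis(j, X) for j in range(1, M + 1)])
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return Phi @ theta

rng = np.random.default_rng(4)
sigma, beta = 0.5, 1.0
f_fun = lambda x: np.abs(x - 0.5)          # a Lipschitz target, roughly "beta = 1" smooth
for n in [128, 512, 2048, 8192]:
    X = np.arange(n) / n
    f = f_fun(X)
    Y = f + sigma * rng.standard_normal(n)
    M = int(np.ceil(n ** (1 / (2 * beta + 1))))
    fhat = projection_estimator(Y, X, M)
    rate = n ** (-2 * beta / (2 * beta + 1))
    print(f"n={n:5d}  M={M:3d}  MSE={np.mean((fhat - f) ** 2):.5f}  n^(-2b/(2b+1))={rate:.5f}")
</syntaxhighlight>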
===Adaptive estimation===
The rate attained by the projection estimator <math>\varphi_{\thetals}</math> with <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter <math>\beta</math>, which is typically unknown, to determine the level <math>M</math> of truncation. The purpose of ''adaptive estimation'' is precisely to adapt to the unknown <math>\beta</math>, that is, to build an estimator that does not depend on <math>\beta</math> and yet attains a rate of the order of <math>Cn^{-\frac{2\beta}{2\beta+1}}</math> (up to a logarithmic lowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in \eqref{EQ:defBICmis} and \eqref{EQ:defLassomis} respectively. In view of [[#LEM:regdesign |Lemma]], the design matrix <math>\Phi</math> actually satisfies the assumption <math>\textsf{ORT}</math> when we work with the trigonometric basis. This has two useful implications:
<ul><li> Both estimators are actually thresholding estimators and can therefore be implemented efficiently (see the sketch after this list).
</li>
<li> The condition <math>\textsf{INC($k$)}</math> is automatically satisfied for any <math>k \ge 1</math>.
</li>
</ul>
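Here is a minimal sketch (an added illustration, Python/NumPy) of the first point: under <math>\textsf{ORT}</math>, writing <math>\hat y=\Phi^\top Y/n</math>, the BIC criterion \eqref{EQ:defBICmis} decouples across coordinates and is minimized by hard thresholding <math>\hat y</math> at level <math>\tau</math>, while the Lasso \eqref{EQ:defLassomis} reduces to soft thresholding at level <math>\tau</math>. The numerical value of the threshold used below is an arbitrary illustrative choice, not the one prescribed by the theorems.
<syntaxhighlight lang="python">
import numpy as np

def trig_basis(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0 else np.sin(2 * np.pi * k * x))

rng = np.random.default_rng(3)
n, sigma, delta = 256, 0.5, 0.1
M = n - 1
X = np.arange(n) / n                                    # regular design, so Phi^T Phi = n I_M
Phi = np.column_stack([trig_basis(j, X) for j in range(1, M + 1)])
f = np.abs(np.sin(3 * np.pi * X)) ** 1.5                # an arbitrary target function
Y = f + sigma * rng.standard_normal(n)

y_hat = Phi.T @ Y / n                                   # empirical basis coefficients
tau = 2 * sigma * np.sqrt(2 * np.log(2 * M / delta) / n)   # illustrative threshold level

# BIC: per coordinate (theta_j - y_j)^2 + tau^2 * 1{theta_j != 0}  =>  hard threshold at tau
theta_bic = np.where(np.abs(y_hat) > tau, y_hat, 0.0)
# Lasso: per coordinate (theta_j - y_j)^2 + 2 * tau * |theta_j|    =>  soft threshold at tau
theta_lasso = np.sign(y_hat) * np.maximum(np.abs(y_hat) - tau, 0.0)

for name, th in [("BIC / hard", theta_bic), ("Lasso / soft", theta_lasso)]:
    print(name, " MSE =", round(np.mean((Phi @ th - f) ** 2), 5), " nonzeros =", int(np.sum(th != 0)))
</syntaxhighlight>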
These observations lead to the following corollary.

{{proofcard|Corollary|COR:adap|Fix <math>\beta >  (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math> and <math>n</math> large enough to ensure <math>n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil</math>, and assume the general regression model \eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Let <math>\{\varphi_j\}_{j=1}^{n-1}</math> be the trigonometric basis. Denote by <math>\varphi^{n-1}_{\thetabic}</math> (resp. <math>\varphi^{n-1}_{\thetalasso}</math>) the BIC (resp. Lasso) estimator defined in \eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over <math>\R^{n-1}</math> with regularization parameter given by \eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then <math>\varphi^{n-1}_{\hat \theta}</math>, where <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, satisfies with probability <math>1-\delta</math>,


<math display="block">
</math>
where the constants may depend on <math>\beta</math> and <math>Q</math>.
|For <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, adapting the proofs of [[#TH:BIC_mis |Theorem]] for the BIC estimator and [[#TH:lasso_fast_mis |Theorem]] for the Lasso estimator, for any <math>\theta \in \R^{n-1}</math>, with probability <math>1-\delta</math>


<math display="block">
|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}|_2^2\lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2+ R(M)\lesssim|\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2+|\varphi^{n-1}_{\theta^*}-f|_2^2+ R(M)
</math>
Next, it follows from \eqref{EQ:bias2} that <math>|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}</math>. Together with [[#LEM:regdesign |Lemma]], it yields


<math display="block">
\|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2\lesssim  \|\varphi^{n-1}_{\theta^*} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2+Qn^{1-2\beta}+ \frac{R(M)}{n}\,.
</math>
Moreover, using \eqref{EQ:bias1}, we find that


<math display="block">
</math>
To conclude the proof, choose <math>M=\lceil (n/\log n)^{\frac{1}{2\beta+1}}\rceil</math> and observe that the choice of <math>\beta</math> ensures that <math>n^{1-2\beta} \lesssim M^{-2\beta}</math>.}}
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see <ref name="Tsy09">{{cite book|last=Tsybakov|first=Alexandre B.|year=2009|title=Introduction to Nonparametric Estimation|publisher=Springer|location=New York}}</ref>{{rp|at=Chapter 3}}). The reason why we get this extra logarithmic factor here is that we use a hammer that's too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.
 
==Problem Set==
 
 
 


==General references==
{{cite arXiv|last1=Rigollet|first1=Philippe|last2=Hütter|first2=Jan-Christian|year=2023|title=High-Dimensional Statistics|eprint=2310.19244|class=math.ST}}

==Notes==

Latest revision as of 01:34, 3 June 2024

[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} \newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} 
\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} \newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} [/math]

Arguably, the strongest assumption that we made in Chapter Linear Regression Model is that the regression function [math]f(x)[/math] is of the form [math]f(x)=x^\top \theta^*[/math]. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models. Throughout this chapter, we assume the following model:

[[math]] \begin{equation} \label{EQ:regmodgen} Y_i=f(X_i)+\eps_i,\quad i=1, \ldots, n\,, \end{equation} [[/math]]

where [math]\eps=(\eps_1, \ldots, \eps_n)^\top [/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math]. Here [math]X_i \in \R^d[/math]. When dealing with fixed design, it will be convenient to consider the vector [math]g \in \R^n[/math] defined for any function [math]g \,:\, \R^d \to \R[/math] by [math]g=(g(X_1), \ldots, g(X_n))^\top[/math]. In this case, we can write for any estimator [math]\hat f \in \R^n[/math] of [math]f[/math],

[[math]] \MSE(\hat f)=\frac1n|\hat f - f|_2^2\,. [[/math]]

Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: [math]\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso[/math]. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of [math]f[/math] if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.

Oracle inequalities

Oracle inequalities

As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all matter of purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator [math]\thetals[/math], we constrain ourselves to estimating functions that are of the form [math]x\mapsto x^\top \theta[/math], even though [math]f[/math] itself may not be of this form. Therefore, the oracle [math]\hat f[/math] is the linear function that is the closest to [math]f[/math]. Rather than trying to approximate [math]f[/math] by a linear function [math]f(x)\approx \theta^\top x[/math], we make the model a bit more general and consider a dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] of functions where [math]\varphi_j\,:\, \R^d \to \R[/math]. In this case, we can actually remove the assumption that [math]X \in \R^d[/math]. Indeed, the goal is now to estimate [math]f[/math] using a linear combination of the functions in the dictionary:

[[math]] f \approx \varphi_\theta:=\sum_{j=1}^M \theta_j \varphi_j\,. [[/math]]

If [math]M=d[/math] and [math]\varphi_j(X)=X^{(j)}[/math] returns the [math]j[/math]th coordinate of [math]X \in \R^d[/math] then the goal is to approximate [math]f(x)[/math] by [math]\theta^\top x[/math]. Nevertheless, the use of a dictionary allows for a much more general framework.

Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define

  • The least squares estimator:
    [[math]] \begin{equation} \label{EQ:defthetaLSmis} \thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \end{equation} [[/math]]
  • The least squares estimator constrained to [math]K \subset \R^M[/math]:
    [[math]] \thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 [[/math]]
  • The BIC estimator:
    [[math]] \begin{equation} \label{EQ:defBICmis} \thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\} \end{equation} [[/math]]
  • The Lasso estimator:
    [[math]] \begin{equation} \label{EQ:defLassomis} \thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\} \end{equation} [[/math]]
Definition

Let [math]R(\cdot)[/math] be a risk function and let [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] be a dictionary of functions from [math]\R^d[/math] to [math]\R[/math]. Let [math]K[/math] be a subset of [math]\R^M[/math]. The oracle on [math]K[/math] with respect to [math]R[/math] is defined by [math]\varphi_{\bar \theta}[/math], where [math]\bar \theta \in K[/math] is such that

[[math]] R(\varphi_{\bar \theta}) \le R(\varphi_\theta) \,, \qquad \forall\, \theta \in K\,. [[/math]]
Moreover, [math]R_{K}=R(\varphi_{\bar \theta})[/math] is called oracle risk on [math]K[/math]. An estimator [math]\hat f[/math] is said to satisfy an oracle inequality (over [math]K[/math]) with remainder term [math]\phi[/math] in expectation (resp. with high probability) if there exists a constant [math]C\ge 1[/math] such that

[[math]] \E R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M}(K)\,, [[/math]]
or

[[math]] \p\big\{R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K) \big\} \ge 1-\delta \,, \qquad \forall \ \delta \gt 0 [[/math]]
respectively. If [math]C=1[/math], the oracle inequality is sometimes called exact.

Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.

Oracle inequality for the least squares estimator

While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimator in the case of misspecified model, the difficulty of the extension to this case for linear models is essentially already captured by the analysis for the least squares estimator. In this simple case, can even obtain an exact oracle inequality.

Theorem

Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the least squares estimator [math]\thetals[/math] satisfies for some numerical constant [math]C \gt 0[/math],

[[math]] \MSE(\varphi_{\thetals})\le \inf_{\theta \in \R^M}\MSE(\varphi_\theta)+C\frac{\sigma^2M}{n}\log(1/\delta) [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

Note that by definition

[[math]] |Y-\varphi_{\thetals}|_2^2 \le |Y-\varphi_{\bar \theta}|_2^2 [[/math]]
where [math]\varphi_{\bar \theta}[/math] denotes the orthogonal projection of [math]f[/math] onto the linear span of [math]\varphi_1, \ldots, \varphi_M[/math]. Since [math]Y=f+\eps[/math], we get

[[math]] |f-\varphi_{\thetals}|_2^2 \le |f-\varphi_{\bar \theta}|_2^2+2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta}) [[/math]]
Moreover, by Pythagoras's theorem, we have

[[math]] |f-\varphi_{\thetals}|_2^2 - |f-\varphi_{\bar \theta}|_2^2=|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\,. [[/math]]
It yields

[[math]] |\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\le 2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,. [[/math]]
Using the same steps as the ones following equation EQ:bound_ls_1 for the well specified case, we get

[[math]] |\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\lesssim \frac{\sigma^2M}{n}\log(1/\delta) [[/math]]
with probability [math]1-\delta[/math]. The result of the lemma follows.

Sparse oracle inequality for the BIC estimator

The analysis for more complicated estimators such as the BIC in Chapter Linear Regression Model allows us to derive oracle inequalities for these estimators.

Theorem

Assume the general regression model \eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the BIC estimator [math]\thetabic[/math] with regularization parameter given by \eqref{EQ:taubicmis} satisfies, for any [math]\alpha \in (0,1)[/math] and some numerical constant [math]C \gt 0[/math],

[[math]] \MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{\frac{1+\alpha}{1-\alpha}\MSE(\varphi_\theta)+\frac{C\sigma^2}{\alpha(1-\alpha)n}\big[|\theta|_0\log (eM)+ \log(1/\delta)\big]\Big\} [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

Recall that the proof of the corresponding theorem for the BIC estimator in the well-specified case begins as follows:

[[math]] \frac1n|Y-\varphi_{\thetabic}|_2^2 + \tau^2|\thetabic|_0 \le \frac1n|Y-\varphi_{\theta}|_2^2 + \tau^2|\theta|_0\,. [[/math]]
This is true for any [math]\theta \in \R^M[/math]. It implies

[[math]] |f-\varphi_{\thetabic}|_2^2 + n\tau^2|\thetabic|_0 \le |f-\varphi_{ \theta}|_2^2+2\eps^\top(\varphi_{\thetabic}-\varphi_{ \theta}) + n\tau^2|\theta|_0\,. [[/math]]
Note that if [math]\thetabic= \theta[/math], the result is trivial. Otherwise,

[[math]] \begin{align*} 2\eps^\top(\varphi_{\thetabic}-\varphi_\theta) &=2\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)|\varphi_{\thetabic}-\varphi_\theta|_2\\ &\le \frac2\alpha\Big[\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)\Big]^2 + \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\,, \end{align*} [[/math]]
where we used Young's inequality [math]2ab\le \frac2\alpha a^2+\frac{\alpha}{2} b^2[/math], which is valid for [math]a,b \in \R[/math] and [math]\alpha \gt 0[/math]. Next, since

[[math]] \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\le \alpha|\varphi_{\thetabic}-f|_2^2+\alpha|\varphi_\theta-f|_2^2\,, [[/math]]
we get for [math]\alpha=1/2[/math],

[[math]] \begin{align*} \frac12|\varphi_{\thetabic}-f|_2^2 &\le \frac32|\varphi_{ \theta}-f|_2^2+n\tau^2|\theta|_0 +4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic|_0\\ &\le \frac32|\varphi_{ \theta}-f|_2^2+2n\tau^2|\theta|_0+4\big[\eps^\top\cU(\varphi_{\thetabic -\theta})\big]^2- n\tau^2|\thetabic-\theta|_0. \end{align*} [[/math]]
We conclude as in the proof of the corresponding theorem for the well-specified case.


The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error [math]\MSE(\varphi_\theta)[/math] and the complexity of [math]\theta[/math] as measured by its sparsity. In particular, this result, which is sometimes called a sparse oracle inequality, implies the following oracle inequality. Define the oracle [math]\bar \theta[/math] to be such that

[[math]] \MSE(\varphi_{\bar \theta})=\min_{\theta \in \R^M}\MSE(\varphi_\theta). [[/math]]

Then, with probability at least [math]1-\delta[/math],

[[math]] \MSE(\varphi_{\thetabic})\le 3\MSE(\varphi_{\bar \theta})+\frac{C\sigma^2}{n}\Big[|\bar \theta|_0\log (eM)+ \log(1/\delta)\Big]\,. [[/math]]

If the linear model happens to be correct, then we simply have [math]\MSE(\varphi_{\bar \theta})=0[/math].

Sparse oracle inequality for the Lasso

To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the [math]n\times M[/math] matrix [math]\Phi[/math] with elements [math]\Phi_{i,j}=\varphi_j(X_i)[/math].

Theorem

Assume the general regression model \eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Moreover, assume that there exists an integer [math]k[/math] such that the matrix [math]\Phi[/math] satisfies assumption [math]\textsf{INC$(k)$}[/math]. Then, the Lasso estimator [math]\thetalasso[/math] with regularization parameter given by

[[math]] \begin{equation} \label{EQ:taulassomis} 2\tau=8\sigma\sqrt{\frac{2\log(2M)}{n}}+8\sigma\sqrt{\frac{2\log(1/\delta)}{n}} \end{equation} [[/math]]
satisfies for some numerical constant [math]C[/math],

[[math]] \begin{align*} \MSE(\varphi_{\thetalasso})\le \inf_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\Big\{ \MSE(\varphi_\theta) &+\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\} \end{align*} [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

From the definition of [math]\thetalasso[/math], for any [math]\theta \in \R^M[/math],

[[math]] \begin{align*} \frac1n|Y-\varphi_{\thetalasso}|_2^2 \le \frac1n|Y-\varphi_{\theta}|_2^2 + 2\tau|\theta|_1- 2\tau|\thetalasso|_1 \,. \end{align*} [[/math]]
Expanding the squares, adding [math]\tau|\thetalasso-\theta|_1[/math] on each side and multiplying by [math]n[/math], we get

[[math]] \begin{align} \leadeq{|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1}\\ \le {} & 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) +n\tau|\thetalasso-\theta|_1+ 2n\tau|\theta|_1- 2n\tau|\thetalasso|_1 \,. \label{EQ:pr:TH:lassofastmis1} \end{align} [[/math]]
Next, note that [math]\textsf{INC}(k)[/math] for any [math]k \ge 1[/math] implies that [math]|\varphi_j|_2 \le 2\sqrt{n}[/math] for all [math]j=1, \ldots, M[/math]. Applying Hölder's inequality, we get that with probability [math]1-\delta[/math], it holds that

[[math]] 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) \le \frac{n\tau}2|\thetalasso-\theta|_1. [[/math]]
Therefore, taking [math]S=\supp(\theta)[/math] to be the support of [math]\theta[/math], we can bound the right-hand side of \eqref{EQ:pr:TH:lassofastmis1} and obtain

[[math]] \begin{align} |\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1 &\le 2n\tau|\thetalasso-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso|_1\nonumber \\ &= 2n\tau|\thetalasso_S-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso_S|_1\nonumber\\ &\le 4n\tau|\thetalasso_S-\theta|_1\label{EQ:pr:TH:lassofastmis2} \end{align} [[/math]]
with probability [math]1-\delta[/math]. It implies that either [math]\MSE(\varphi_{\thetalasso})\le \MSE(\varphi_\theta)[/math] or that

[[math]] |\thetalasso_{S^c}-\theta_{S^c}|_1\le 3 |\thetalasso_{S}-\theta_{S}|_1\,. [[/math]]
so that [math]\thetalasso-\theta[/math] satisfies the cone condition \eqref{EQ:conecond}. Using now the Cauchy-Schwarz inequality and Lemma, respectively, and assuming that [math]|\theta|_0\le k[/math], we get

[[math]] 4n\tau|\thetalasso_S-\theta|_1 \le 4n\tau\sqrt{|S|}|\thetalasso_S-\theta|_2 \le 4\tau\sqrt{2n|\theta|_0}|\varphi_{\thetalasso}-\varphi_{\theta}|_2\,. [[/math]]
Using now the inequality [math]2ab\le \frac2{\alpha} a^2+\frac{\alpha}{2} b^2[/math], we get

[[math]] \begin{align*} 4n\tau|\thetalasso_S-\theta|_1& \le \frac{16\tau^2n|\theta|_0}{\alpha}+ \frac{\alpha}{2}|\varphi_{\thetalasso}-\varphi_{\theta}|_2^2\\ &\le \frac{16\tau^2n|\theta|_0}{\alpha}+ \alpha|\varphi_{\thetalasso}-f|_2^2+\alpha|\varphi_{\theta}-f|_2^2 \end{align*} [[/math]]
Combining this result with \eqref{EQ:pr:TH:lassofastmis1} and \eqref{EQ:pr:TH:lassofastmis2}, we find

[[math]] (1-\alpha)\MSE(\varphi_{\thetalasso}) \le (1+\alpha)\MSE(\varphi_{\theta})+ \frac{16\tau^2|\theta|_0}{\alpha}\,. [[/math]]
To conclude the proof, it only remains to divide by [math]1-\alpha[/math] on both sides of the above inequality and take [math]\alpha=1/2[/math].

Maurey's argument

If there is no sparse [math]\theta[/math] such that [math]\MSE(\varphi_{\theta})[/math] is small, Theorem is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors [math]\theta[/math] with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if [math]\theta[/math] has a small [math]\ell_1[/math] norm but is not sparse. For such [math]\theta[/math], the Lasso estimator still enjoys slow rates as in Theorem, which can be easily extended to the misspecified case (see Problem). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector [math]\theta \in \R^M[/math] such that [math]|\theta|_1\le 1[/math], there exists a vector [math]\theta'[/math] that is sparse and for which [math]\MSE(\varphi_{\theta'})[/math] is not much larger than [math]\MSE(\varphi_\theta)[/math]. The following theorem quantifies exactly the tradeoff between sparsity and [math]\MSE[/math]. It is often attributed to B. Maurey and was published by Pisier [1]. This is why it is referred to as Maurey's argument.

Theorem

Let [math]\{\varphi_1, \dots, \varphi_M\}[/math] be a dictionary normalized in such a way that

[[math]] \max_{1\le j \le M}|\varphi_j|_2\le D\sqrt{n}\,. [[/math]]
Then for any integer [math]k[/math] such that [math]1\le k \le M[/math] and any positive [math]R[/math], we have

[[math]] \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\MSE(\varphi_\theta) \le \min_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}\MSE(\varphi_\theta) + \frac{D^2 R^2}{k}\,. [[/math]]


Show Proof

Define

[[math]] \bar \theta \in \argmin_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}|\varphi_\theta -f|_2^2 [[/math]]
and assume without loss of generality that [math]|\bar \theta_1| \ge |\bar \theta_2|\ge \ldots \ge |\bar \theta_M|[/math]. Let now [math]U \in \R^n[/math] be a random vector with values in [math]\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}[/math] defined by

[[math]] \begin{alignat*}{2} \p(U=R\,\mathrm{sign}(\bar \theta_j)\varphi_j)&=\frac{|\bar \theta_j|}{R}\,, &\quad &j=1, \ldots, M,\\ \p(U=0)&=1-\frac{|\bar \theta|_1}{R}\,. \end{alignat*} [[/math]]
Note that [math]\E[U]=\varphi_{\bar \theta}[/math] and [math]|U|_2 \le RD \sqrt{n}[/math]. Let now [math]U_1, \ldots, U_k[/math] be [math]k[/math] independent copies of [math]U[/math] and define their average
[[math]] \bar U=\frac{1}{k}\sum_{i=1}^kU_i\,. [[/math]]
Note that [math]\bar U=\varphi_{\tilde \theta}[/math] for some [math]\tilde \theta \in\R^M[/math] such that [math]|\tilde \theta|_0\le k[/math]. Moreover, using the Pythagorean Theorem,

[[math]] \begin{align*} \E|f-\bar U|_2^2&=\E|f-\varphi_{\bar \theta}+\varphi_{\bar \theta}-\bar U|_2^2\\ &=|f-\varphi_{\bar \theta}|_2^2+\E|\varphi_{\bar \theta}-\bar U|_2^2\\ &= |f-\varphi_{\bar \theta}|_2^2 + \frac{\E|U-\E[U]|_2^2}{k}\\ &\le |f-\varphi_{\bar \theta}|_2^2 + \frac{(RD\sqrt{n})^2}{k}\,. \end{align*} [[/math]]
To conclude the proof, note that

[[math]] \E|f-\bar U|_2^2 =\E|f-\varphi_{\tilde \theta}|_2^2 \ge \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}|f-\varphi_{\theta}|_2^2 [[/math]]
and divide by [math]n[/math].
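The proof is constructive and translates into a simple randomized sparsification routine, sketched below in Python (the function name and toy data are ours). It draws [math]k[/math] i.i.d. atoms from [math]\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}[/math] with the probabilities used in the proof and averages them, producing a [math]k[/math]-sparse [math]\tilde \theta[/math] with [math]\E[\varphi_{\tilde \theta}]=\varphi_{\bar \theta}[/math].

<syntaxhighlight lang="python">
import numpy as np

def maurey_sparsify(theta_bar, k, R=None, seed=None):
    """Return a k-sparse theta_tilde such that Phi @ theta_tilde is an unbiased
    estimate of Phi @ theta_bar (randomized sparsification from Maurey's argument)."""
    rng = np.random.default_rng(seed)
    R = np.abs(theta_bar).sum() if R is None else R        # requires |theta_bar|_1 <= R
    M = theta_bar.size
    probs = np.abs(theta_bar) / R                          # P(U = R * sign(theta_j) * phi_j)
    probs = np.append(probs, max(0.0, 1.0 - probs.sum()))  # P(U = 0)
    draws = rng.choice(M + 1, size=k, p=probs)             # k independent copies of U
    theta_tilde = np.zeros(M)
    for j in draws:
        if j < M:                                          # the zero atom contributes nothing
            theta_tilde[j] += R * np.sign(theta_bar[j]) / k
    return theta_tilde

# toy check of the 1/k decay of the approximation error; columns of Phi have norm ~ sqrt(n), i.e. D ~ 1
rng = np.random.default_rng(1)
n, M = 500, 100
Phi = rng.standard_normal((n, M))
theta_bar = rng.laplace(scale=0.02, size=M)
for k in (5, 50, 500):
    err = np.mean([np.mean((Phi @ (maurey_sparsify(theta_bar, k, seed=s) - theta_bar)) ** 2)
                   for s in range(50)])
    print(k, err)   # decreases roughly like R^2 / k
</syntaxhighlight>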

Maurey's argument implies the following corollary.

Corollary

Assume that the assumptions of Theorem hold and that the dictionary [math]\{\varphi_1, \dots, \varphi_M\}[/math] is normalized in such a way that

[[math]] \max_{1\le j \le M}|\varphi_j|_2\le \sqrt{n}\,. [[/math]]
Then there exists a constant [math]C \gt 0[/math] such that the BIC estimator satisfies

[[math]] \begin{align*} \MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{2\MSE(\varphi_\theta)+C\Big[&\frac{\sigma^2|\theta|_0\log (eM)}{n}\wedge \sigma|\theta|_1\sqrt{\frac{\log (eM)}{n}}\Big]\Big\}\\ &+ C\frac{\sigma^2\log(1/\delta)}{n} \end{align*} [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

Choosing [math]\alpha=1/3[/math] in Theorem yields

[[math]] \MSE(\varphi_{\thetabic})\le 2\inf_{\theta \in \R^M}\Big\{\MSE(\varphi_\theta)+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\}+C\frac{\sigma^2\log(1/\delta)}{n} [[/math]]
Let [math]\theta' \in \R^M[/math]. It follows from Maurey's argument that for any [math] k \in [M] [/math], there exists [math]\theta = \theta(\theta', k) \in \R^M[/math] such that [math]|\theta|_0 \le k[/math] and

[[math]] \MSE(\varphi_{\theta}) \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k} [[/math]]
It implies that

[[math]] \MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} [[/math]]
and furthermore that

[[math]] \inf_{\theta \in \R^M}\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} [[/math]]
Since the above bound holds for any [math] \theta' \in \mathbb{R}^M [/math] and [math] k \in [M] [/math], we can take an infimum with respect to both [math]\theta'[/math] and [math]k[/math] on the right-hand side to get

[[math]] \begin{align*} \inf_{\theta \in \R^M}&\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \Big\}\\ &\le \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}\,. \end{align*} [[/math]]
To control the minimum over [math]k[/math], we need to consider three cases for the quantity

[[math]] \bar k=\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log (eM)}}. [[/math]]

  • If [math]1 \le \bar k \le M[/math], then we get
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} [[/math]]
  • If [math]\bar k \le 1[/math], then
    [[math]] |\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, [[/math]]
    which yields
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} [[/math]]
  • If [math]\bar k \ge M[/math], then
    [[math]] \frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. [[/math]]
    Therefore, on the one hand, if [math]M \ge \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], we get
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. [[/math]]
    On the other hand, if [math]M \le \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], then for any [math]\theta \in \R^M[/math], we have
    [[math]] \begin{equation*} \frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. \end{equation*} [[/math]]

Combined,

[[math]] \begin{align*} \leadeq{\inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}}\\ \le {} & \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'}) + C \sigma | \theta' |_1 \sqrt{\frac{\log(eM)}{n}} + C \frac{\sigma^2 \log(eM)}{n} \Big\}, \end{align*} [[/math]]
which together with the first display of this proof yields the claim.

Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the [math]\ell_0[/math] norm as in Theorem. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [2]. Maurey's argument allows us to enjoy the best of both the [math]\ell_0[/math] and the [math]\ell_1[/math] worlds. The rate adapts to the sparsity of the problem and can even be generalized to [math]\ell_q[/math]-sparsity (see Problem). However, it is clear from the proof that this argument is limited to squared [math]\ell_2[/math] norms such as the one appearing in [math]\MSE[/math], and extending it to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces [1][3] using more sophisticated arguments.

Nonparametric regression

So far, the oracle inequalities that we have derived do not deal with the approximation error [math]\MSE(\varphi_\theta)[/math]. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] is chosen appropriately. In what follows, we assume for simplicity that [math]d=1[/math] so that [math]f\,:\, \R \to \R[/math] and [math]\varphi_j\,:\, \R \to \R[/math].

Fourier decomposition

Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math]:

[[math]] \int_0^1\varphi_j^2(x)\ud x =1\,, \quad \int_0^1 \varphi_j(x)\varphi_k(x)\ud x=0, \ \forall\, j\neq k\,. [[/math]]

We will also deal with the case where [math]M=\infty[/math]. When [math]\cH[/math] is an orthonormal system, the coefficients [math]\theta_j^* \in \R[/math] defined by

[[math]] \theta_j^* =\int_0^1 f(x) \varphi_j(x)\ud x\,, [[/math]]

are called Fourier coefficients of [math]f[/math]. Assume now that the regression function [math]f[/math] admits the following decomposition

[[math]] f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,. [[/math]]

There exist many choices for the orthonormal system and we give only two as examples.

Example Trigonometric basis. This is an orthonormal basis of [math]L_2([0,1])[/math]. It is defined by

[[math]] \begin{eqnarray*} \varphi_1&\equiv&1\\ \varphi_{2k}(x)&=&\sqrt{2} \cos(2\pi k x)\,,\\ \varphi_{2k+1}(x)&=&\sqrt{2} \sin(2\pi k x)\,, \end{eqnarray*} [[/math]]

for [math]k=1,2, \dots[/math] and [math]x \in [0,1][/math]. The fact that it is indeed an orthonormal system can easily be checked using trigonometric identities.
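For concreteness, the following short sketch (our own illustration, not part of the text) implements the trigonometric basis and checks its orthonormality in [math]L_2([0,1])[/math] numerically with a midpoint rule.

<syntaxhighlight lang="python">
import numpy as np

def phi(j, x):
    """Trigonometric basis: phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    x = np.asarray(x, dtype=float)
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

# midpoint-rule approximation of the Gram matrix ( int_0^1 phi_j(x) phi_k(x) dx )_{j,k}
t = (np.arange(200_000) + 0.5) / 200_000
G = np.array([[np.mean(phi(j, t) * phi(k, t)) for k in range(1, 8)] for j in range(1, 8)])
print(np.round(G, 4))   # approximately the 7 x 7 identity matrix
</syntaxhighlight>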

The next example has received a lot of attention in the signal (sound, image, ...) processing community.

Example Wavelets. Let [math]\psi\,:\, \R \to \R[/math] be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions

[[math]] \psi_{jk}(x)=2^{j/2}\psi(2^jx-k)\,, \quad j,k \in \Z\,. [[/math]]

It can be shown that for a suitable [math]\psi[/math], the dictionary [math]\{\psi_{j,k}, j,k \in \Z\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math] and sometimes a basis. In the latter case, for any function [math]g \in L_2([0,1])[/math], it holds

[[math]] g=\sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty \theta_{jk}\psi_{jk}\,, \quad \theta_{jk}=\int_{0}^1 g(x)\psi_{jk}(x)\ud x\,. [[/math]]

The coefficients [math]\theta_{jk}[/math] are called the wavelet coefficients of [math]g[/math]. The simplest example is given by the Haar system, obtained by taking [math]\psi[/math] to be the following piecewise constant function (see Figure). We will not give more details about wavelets here but simply point the interested reader to [4].

[[math]] \psi(x)=\left\{ \begin{array}{ll} 1& 0\le x \lt 1/2\\ -1 & 1/2\le x \le 1\\ 0 & \text{otherwise} \end{array} \right. [[/math]]

[[File: | 400px | thumb | The Haar mother wavelet ]]
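As an illustration (the code and the choice of indices are ours), the Haar system above is easy to implement, and the orthonormality of a few of the [math]\psi_{jk}[/math] on [math][0,1][/math] can be checked numerically:

<syntaxhighlight lang="python">
import numpy as np

def psi(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1], 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x < 0.5), 1.0, np.where((x >= 0.5) & (x <= 1), -1.0, 0.0))

def psi_jk(j, k, x):
    """Rescaled and translated wavelet psi_{jk}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * psi(2.0 ** j * np.asarray(x) - k)

# crude Riemann-sum check of orthonormality in L_2([0,1]) for a few (j, k) pairs
t = (np.arange(100_000) + 0.5) / 100_000
pairs = [(0, 0), (1, 0), (1, 1), (2, 3)]
G = np.array([[np.mean(psi_jk(j1, k1, t) * psi_jk(j2, k2, t)) for (j2, k2) in pairs]
              for (j1, k1) in pairs])
print(np.round(G, 3))   # approximately the 4 x 4 identity matrix
</syntaxhighlight>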

Sobolev classes and ellipsoids

We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that [math]f^{(k)}[/math] denotes the [math]k[/math]-th derivative of [math]f[/math].

Definition

Fix parameters [math]\beta \in \{1, 2, \dots\}[/math] and [math]L \gt 0[/math]. The Sobolev class of functions [math]W(\beta,L)[/math] is defined by

[[math]] \begin{align*} W(\beta,L)=\Big\{f\,:\,& [0,1] \to \R\ :\ f\in L_2([0,1])\,, f^{(\beta-1)} \text{ is absolutely continuous and }\\ & \int_{0}^1[f^{(\beta)}]^2\le L^2\,, f^{(j)}(0)= f^{(j)}(1), j=0, \ldots, \beta-1\Big\} \end{align*} [[/math]]

Any function [math]f \in W(\beta, L)[/math] can be represented[Notes 1] as its Fourier expansion along the trigonometric basis:

[[math]] f(x)=\theta_1^*\varphi_1(x) + \sum_{k=1}^\infty\big(\theta_{2k}^*\varphi_{2k}(x)+\theta_{2k+1}^*\varphi_{2k+1}(x)\big)\,, \quad \forall \ x \in [0,1]\,, [[/math]]

where [math]\theta^*=\{\theta_j^*\}_{j \ge 1}[/math] is in the space of square-summable sequences [math]\ell_2(\N)[/math] defined by

[[math]] \ell_2(\N)=\Big\{\theta\,:\, \sum_{j=1}^\infty \theta_j^2 \lt \infty\Big\}\,. [[/math]]

For any [math]\beta \gt 0[/math], define the coefficients

[[math]] \begin{equation} \label{EQ:defaj} a_j=\left\{ \begin{array}{cl} j^\beta& \text{for $j$ even}\\ (j-1)^\beta & \text{for $j$ odd} \end{array} \right. \end{equation} [[/math]]

With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.

Theorem

Fix [math]\beta\ge 1[/math] and [math]L \gt 0[/math] and let [math]\{\varphi_j\}_{j \ge 1}[/math] denote the trigonometric basis of [math]L_2([0,1])[/math]. Moreover, let [math]\{a_j\}_{j\ge 1}[/math] be defined as in \eqref{EQ:defaj}. A function [math]f \in W(\beta,L)[/math] can be represented as

[[math]] f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,, [[/math]]
where the sequence [math]\{\theta_j^*\}_{j \ge 1}[/math] belongs to the Sobolev ellipsoid of [math]\ell_2(\N)[/math] defined by

[[math]] \Theta(\beta, Q)=\Big\{\theta \in \ell_2(\N)\,:\, \sum_{j=1}^\infty a_j^2\theta_j^2\le Q\Big\} [[/math]]
for [math]Q=L^2/\pi^{2\beta}[/math].


Show Proof

Let us first define the Fourier coefficients [math]\{s_k(j)\}_{k \ge 1}[/math] of the [math]j[/math]th derivative [math]f^{(j)}[/math] of [math]f[/math] for [math]j=1, \ldots, \beta[/math]:

[[math]] \begin{align*} s_1(j)&=\int_0^1f^{(j)}(t)\ud t=f^{(j-1)}(1)-f^{(j-1)}(0)=0\,,\\ s_{2k}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\cos(2\pi kt)\ud t\,,\\ s_{2k+1}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\sin(2\pi kt)\ud t\,,\\ \end{align*} [[/math]]
The Fourier coefficients of [math]f[/math] are given by [math]\theta_k=s_k(0)[/math]. Using integration by parts, we find that

[[math]] \begin{align*} s_{2k}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \cos(2\pi kt)\Big|_0^1 + (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ &= \sqrt{2}[f^{(\beta-1)}(1)-f^{(\beta-1)}(0)]+(2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ &= (2\pi k)s_{2k+1}(\beta-1)\,. \end{align*} [[/math]]
Moreover,

[[math]] \begin{align*} s_{2k+1}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \sin(2\pi kt)\Big|_0^1 - (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \cos(2\pi kt)\ud t\\ &= -(2\pi k)s_{2k}(\beta-1)\,. \end{align*} [[/math]]
In particular, it yields

[[math]] s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^2\big[s_{2k}(\beta-1)^2+s_{2k+1}(\beta-1)^2\big] [[/math]]
By induction, we find that for any [math]k \ge 1[/math],

[[math]] s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big) [[/math]]
Next, it follows from the definition \eqref{EQ:defaj} of [math]a_j[/math] that

[[math]] \begin{align*} \sum_{k=1}^\infty(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big)&=\pi^{2\beta}\sum_{k=1}^\infty a_{2k}^2\theta_{2k}^2 + \pi^{2\beta}\sum_{k=1}^\infty a_{2k+1}^2\theta_{2k+1}^2\\ &=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. \end{align*} [[/math]]
Together with the Parseval identity, it yields

[[math]] \int_0^1\big(f^{(\beta)}(t)\big)^2\ud t=\sum_{k=1}^\infty s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. [[/math]]
To conclude, observe that since [math]f \in W(\beta, L)[/math], we have

[[math]] \int_0^1\big(f^{(\beta)}(t)\big)^2\ud t \le L^2\,, [[/math]]
so that [math]\theta \in \Theta(\beta, L^2/\pi^{2\beta})[/math].

It can actually be shown that the converse is true, that is, any function with Fourier coefficients in [math]\Theta(\beta, Q)[/math] belongs to [math] W(\beta,L)[/math], but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write [math]f \in \Theta(\beta, Q)[/math] in this case and consider any real value for [math]\beta[/math].

Proposition

The Sobolev ellipsoids enjoy the following properties

  • For any [math]Q \gt 0[/math],
    [[math]] 0 \lt \beta' \lt \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). [[/math]]
  • For any [math]Q \gt 0[/math] and any [math]f \in \Theta(\beta, Q)[/math],
    [[math]] \beta \gt \frac12 \ \Rightarrow \ f\ \text{is continuous}. [[/math]]

The proof is left as an exercise (Problem). It turns out that the first [math] n-1 [/math] functions in the trigonometric basis are not only orthonormal with respect to the inner product of [math]L_2[/math], but also with respect to the inner product associated with the fixed design, [math]\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)[/math], when the design is chosen to be regular, i.e., [math]X_i=(i-1)/n[/math], [math]i=1, \ldots, n[/math].

Lemma

Assume that [math]\{X_1, \ldots, X_n\}[/math] is the regular design, i.e., [math]X_i=(i-1)/n[/math]. Then, for any [math]M\le n-1[/math], the design matrix [math]\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}[/math] satisfies the [math]\textsf{ORT}[/math] condition.


Show Proof

Fix [math]j,j'\in \{1, \ldots, n-1\}[/math] and consider the inner product [math]\varphi_j^\top \varphi_{j'}[/math]. Write [math]k_{j}=\lfloor j/2\rfloor[/math] for the integer part of [math]j/2[/math] and define the vectors [math]a, b, a', b' \in \R^n[/math] with coordinates such that [math]e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}[/math] and [math]e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}[/math] for [math] s \in \{0, \dots, n-1\} [/math]. It holds that

[[math]] \frac{1}{2}\varphi_j^\top \varphi_{j'}\in \{a^\top a'\,,\, b^\top b'\,,\, b^\top a'\,,\, a^\top b'\}\,, [[/math]]
depending on the parity of [math]j[/math] and [math]j'[/math]. On the one hand, observe that if [math]k_j \neq k_{j'}[/math], we have for any [math]\sigma \in \{-1, +1\}[/math],

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi (k_j+\sigma k_{j'}) s}{n}}=0\,. [[/math]]
On the other hand

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=(a+\mathsf{i} b)^\top( a'+\sigma \mathsf{i} b')=a^\top a'-\sigma b^\top b' + \mathsf{i} \big[b^\top a' + \sigma a^\top b' \big] [[/math]]
so that [math]a^\top a'=\pm b^\top b'=0[/math] and [math]b^\top a'=\pm a^\top b'=0[/math] whenever [math]k_j \neq k_{j'}[/math]. It yields [math]\varphi_j^\top \varphi_{j'}=0[/math]. Next, consider the case where [math]k_j=k_{j'}[/math] so that

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\left\{ \begin{array}{ll} 0& \text{if $\sigma=1$}\\ n& \text{if $\sigma=-1$}\\ \end{array}\right.\,. [[/math]]
On the one hand if [math]j \neq j'[/math], it can only be the case that [math]\varphi_j^\top \varphi_{j'} \in \{b^\top a'\,,\, a^\top b'\}[/math] but the same argument as above yields [math]b^\top a'=\pm a^\top b'=0[/math] since the imaginary part of the inner product is still [math] 0 [/math]. Hence, in that case, [math]\varphi_j^\top \varphi_{j'}=0[/math]. On the other hand, if [math]j=j'[/math], then [math]a=a'[/math] and [math]b=b'[/math], and comparing the real parts of the last two displays yields [math]a^\top a=|a|_2^2=n/2[/math] and [math]b^\top b=|b|_2^2=n/2[/math], which is equivalent to [math]\varphi_j^\top \varphi_j=|\varphi_j|_2^2=n[/math]. Therefore, the design matrix [math]\Phi[/math] is such that

[[math]] \begin{equation*} \Phi^\top \Phi=nI_M\,. \end{equation*} [[/math]]
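The conclusion of this lemma is easy to confirm numerically. The sketch below (illustrative code; the sample size is our choice) builds the design matrix of the trigonometric basis on the regular design [math]X_i=(i-1)/n[/math] and checks that [math]\Phi^\top \Phi=nI_M[/math] for [math]M=n-1[/math].

<syntaxhighlight lang="python">
import numpy as np

def trig_design(n, M):
    """Design matrix Phi[i, j-1] = phi_j(X_i) for the trigonometric basis on the
    regular design X_i = (i - 1)/n, i = 1, ..., n."""
    x = np.arange(n) / n
    Phi = np.empty((n, M))
    for j in range(1, M + 1):
        k = j // 2
        if j == 1:
            Phi[:, 0] = 1.0
        elif j % 2 == 0:
            Phi[:, j - 1] = np.sqrt(2) * np.cos(2 * np.pi * k * x)   # phi_{2k}
        else:
            Phi[:, j - 1] = np.sqrt(2) * np.sin(2 * np.pi * k * x)   # phi_{2k+1}
    return Phi

n = 64
Phi = trig_design(n, n - 1)
print(np.allclose(Phi.T @ Phi, n * np.eye(n - 1)))   # True: the ORT condition holds
</syntaxhighlight>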

Integrated squared error

As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if [math]\theta \in \Theta(\beta, Q)[/math], then [math]a_j^2\theta_j^2 \to 0[/math] as [math]j \to \infty[/math] so that [math]|\theta_j|=o(j^{-\beta})[/math]. Therefore, the [math]\theta_j[/math]s decay polynomially to zero and it makes sense to approximate [math]f[/math] by its truncated Fourier series

[[math]] \sum_{j=1}^M \theta_j^* \varphi_j=:\varphi_{\theta^*}^M [[/math]]

for any fixed [math]M[/math]. This truncation leads to a systematic error that vanishes as [math]M \to \infty[/math]. We are interested in understanding the rate at which this happens.

The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter [math]M[/math] and the smoothness [math]\beta[/math].

Lemma

Assume that [math]\theta^* \in \Theta(\beta, Q)[/math] for some [math]\beta \gt 1/2[/math] and [math]Q \gt 0[/math]. Then, for any integer [math]M \ge 1[/math],

[[math]] \begin{equation} \label{EQ:bias1} \min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2([0,1])}^2=\sum_{j \gt M} |\theta^*_j|^2 \le \frac{Q}{M^{2\beta}}\,. \end{equation} [[/math]]
Moreover, if [math]\{\varphi_j\}_{j\ge 1}[/math] is the trigonometric basis, then for any design [math]X_1, \ldots, X_n[/math],

[[math]] \begin{equation} \label{EQ:bias2} |\varphi_{\theta^*}^{n-1} -f|_2^2\lesssim Qn^{2-2\beta}\,. \end{equation} [[/math]]


Show Proof

Note that for any [math]\theta \in \Theta(\beta, Q)[/math], if [math]\beta \gt 1/2[/math], then

[[math]] \begin{align*} \sum_{j=2}^\infty|\theta_j|&=\sum_{j=2}^\infty a_j|\theta_j|\frac1{a_j}\\ &\le \sqrt{\sum_{j=2}^\infty a_j^2\theta_j^2\sum_{j=2}^\infty\frac1{a_j^2}}\quad \text{by Cauchy-Schwarz}\\ &\le \sqrt{Q\sum_{j=1}^\infty\frac1{j^{2\beta}}} \lt \infty \end{align*} [[/math]]
Since [math]\{\varphi_j\}_j[/math] forms an orthonormal system in [math]L_2([0,1])[/math], we have
[[math]] \min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2}^2= \|\varphi_{\theta^*}^M -f\|_{L_2}^2 =\sum_{j \gt M} |\theta^*_j|^2\,. [[/math]]
When [math]\theta^* \in \Theta(\beta,Q)[/math], we have

[[math]] \sum_{j \gt M} |\theta_j^*|^2=\sum_{j \gt M} a_j^2|\theta_j^*|^2\frac1{a_j^2}\le \frac1{a_{M+1}^2}Q\le \frac{Q}{M^{2\beta}}\,. [[/math]]

To prove the second part of the lemma, observe that

[[math]] |\varphi_{\theta^*}^{n-1} -f|_2=\big|\sum_{j\ge n}\theta_j^* \varphi_j\big|_2 \le2\sqrt{2n}\sum_{j\ge n}|\theta_j^*|\,, [[/math]]
where in the last inequality, we used the fact that for the trigonometric basis [math]|\varphi_j|_2 \le \sqrt{2n}, j \ge 1[/math] regardless of the choice of the design [math]X_1,\ldots, X_n[/math]. When [math]\theta^* \in \Theta(\beta,Q)[/math], we have

[[math]] \begin{equation*} \sum_{j\ge n} |\theta^*_j|=\sum_{j\ge n} a_j|\theta^*_j|\frac{1}{a_j}\le \sqrt{\sum_{j\ge n} a_j^2|\theta^*_j|^2}\sqrt{\sum_{j\ge n}\frac{1}{a_j^2}}\lesssim \sqrt{Q} n^{\frac{1}{2}-\beta}\,. \end{equation*} [[/math]]


Note that the truncated Fourier series [math]\varphi_{\theta^*}^M[/math] is an oracle: this is what we see when we view [math]f[/math] through the lens of functions with only low frequency harmonics. To estimate [math]\varphi_{\theta^*} = \varphi_{\theta^\ast}^M[/math], consider the estimator [math]\varphi_{\thetals}[/math] where

[[math]] \thetals \in \argmin_{\theta \in \R^M} \sum_{i=1}^n \big(Y_i -\varphi_\theta(X_i)\big)^2\,, [[/math]]

which should be such that [math]\varphi_{\thetals}[/math] is close to [math]\varphi_{\theta^*}[/math]. For this estimator, we have proved (Theorem) an oracle inequality for the [math]\MSE[/math] that is of the form

[[math]] |\varphi_{\thetals}^M-f|_2^2\le\inf_{\theta \in \R^M}|\varphi_\theta^M-f|_2^2+C\sigma^2M\log(1/\delta)\,, \qquad C \gt 0\,. [[/math]]

It yields

[[math]] \begin{align*} |\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2&\le 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(f-\varphi_{\theta^*}^M)+C\sigma^2M\log(1/\delta)\\ &= 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j \gt M}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\\ &=2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\,, \end{align*} [[/math]]

where we used Lemma in the last equality. Together with \eqref{EQ:bias2} and Young's inequality [math]2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0[/math] for any [math]\alpha \gt 0[/math], we get

[[math]] 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)\le \alpha|\varphi_{\thetals}^M-\varphi_{\theta^*}^M |_2^2 + \frac{C}{\alpha}Qn^{2-2\beta}\,, [[/math]]

for some positive constant [math]C[/math] when [math]\theta^* \in \Theta(\beta, Q)[/math]. As a result,

[[math]] \begin{equation} \label{EQ:almostLSNP} |\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2 \lesssim \frac{1}{\alpha(1-\alpha)}Qn^{2-2\beta}+ \frac{\sigma^2M}{1-\alpha}\log(1/\delta) \end{equation} [[/math]]

for any [math]\alpha \in (0,1)[/math]. Since Lemma implies that [math]|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2[/math], we have proved the following theorem.

Theorem

Fix [math]\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and assume the general regression model \eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Moreover, let [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] and [math]n[/math] be large enough so that [math]M \le n-1[/math]. Then the least squares estimator [math]\thetals[/math] defined in \eqref{EQ:defthetaLSmis} with [math]\{\varphi_j\}_{j=1}^M[/math] being the trigonometric basis satisfies, with probability [math]1-\delta[/math], for [math]n[/math] large enough,

[[math]] \|\varphi^M_{\thetals} -f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,. [[/math]]
where the constant factors may depend on [math]\beta, Q[/math] and [math]\sigma[/math].


Show Proof

Choosing [math]\alpha=1/2[/math] for example and absorbing [math]Q[/math] in the constants, we get from \eqref{EQ:almostLSNP} and Lemma that for [math]M \le n-1[/math],

[[math]] \|\varphi^M_{\thetals}-\varphi^M_{\theta^*}\|^2_{L_2([0,1])}\lesssim n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. [[/math]]
Using now Lemma and [math]\sigma^2 \le 1[/math], we get

[[math]] \|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim M^{-2\beta}+n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. [[/math]]
Taking [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil\le n-1[/math] for [math]n[/math] large enough yields

[[math]] \|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim n^{-\frac{2\beta}{2\beta+1}}\big(1+\sigma^2\log(1/\delta)\big)+n^{1-2\beta}\,. [[/math]]
To conclude the proof, simply note that for the prescribed [math]\beta[/math], we have [math]n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}[/math].
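Under the ORT condition of the previous lemma, the least squares estimator over the first [math]M[/math] basis functions has the closed form [math]\thetals=\frac1n\Phi^\top Y[/math] restricted to those [math]M[/math] columns, so the projection estimator of this theorem is straightforward to implement. The sketch below uses a smooth periodic test function and a noise level of our choosing, and truncates at [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math].

<syntaxhighlight lang="python">
import numpy as np

n, sigma, beta = 512, 0.5, 2
x = np.arange(n) / n                                  # regular design X_i = (i - 1)/n
rng = np.random.default_rng(0)
f = np.sin(2 * np.pi * x) ** 3                        # smooth periodic test function (illustrative)
Y = f + sigma * rng.standard_normal(n)

# trigonometric design matrix with the first n - 1 basis functions
cols = [np.ones(n)]
for k in range(1, n // 2):
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
    cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
Phi = np.column_stack(cols)[:, : n - 1]

# projection estimator: least squares over the first M columns, which under ORT is (1/n) Phi^T Y
M = int(np.ceil(n ** (1.0 / (2 * beta + 1))))
theta_hat = Phi[:, :M].T @ Y / n
f_hat = Phi[:, :M] @ theta_hat
print(M, np.mean((f_hat - f) ** 2))   # empirical error; the theorem gives a bound of order n^{-2 beta/(2 beta + 1)}
</syntaxhighlight>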

Adaptive estimation

The rate attained by the projection estimator [math]\varphi_{\thetals}[/math] with [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] is actually optimal, so in this sense it is a good estimator. Unfortunately, its implementation requires knowledge of the smoothness parameter [math]\beta[/math], which is typically unknown, in order to determine the truncation level [math]M[/math]. The purpose of adaptive estimation is precisely to adapt to the unknown [math]\beta[/math], that is, to build an estimator that does not depend on [math]\beta[/math] and yet attains a rate of the order of [math]Cn^{-\frac{2\beta}{2\beta+1}}[/math] (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in \eqref{EQ:defBICmis} and \eqref{EQ:defLassomis}, respectively. In view of Lemma, the design matrix [math]\Phi[/math] actually satisfies the assumption [math]\textsf{ORT}[/math] when we work with the trigonometric basis. This has two useful implications:

  • Both estimators are actually thresholding estimators and can therefore be implemented efficiently (see the sketch after this list)
  • The condition [math]\textsf{INC($k$)}[/math] is automatically satisfied for any [math]k \ge 1[/math].
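Here is a minimal sketch of the first point. The closed forms below are the standard hard- and soft-thresholding solutions of the BIC and Lasso problems when [math]\Phi^\top\Phi=nI[/math]; the toy design, noise level and threshold level are our own choices.

<syntaxhighlight lang="python">
import numpy as np

def bic_ort(Y, Phi, tau):
    """BIC estimator when Phi^T Phi = n I: hard thresholding of z = Phi^T Y / n at level tau."""
    z = Phi.T @ Y / len(Y)
    return np.where(np.abs(z) > tau, z, 0.0)

def lasso_ort(Y, Phi, tau):
    """Lasso estimator when Phi^T Phi = n I: soft thresholding of z = Phi^T Y / n at level tau."""
    z = Phi.T @ Y / len(Y)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# toy usage with an orthogonal design whose columns have norm sqrt(n)
rng = np.random.default_rng(0)
n = 256
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Phi = np.sqrt(n) * Q
Y = Phi[:, :5] @ np.ones(5) + 0.5 * rng.standard_normal(n)
tau = 0.5 * np.sqrt(2 * np.log(2 * n) / n)
print(np.count_nonzero(bic_ort(Y, Phi, tau)), np.count_nonzero(lasso_ort(Y, Phi, tau)))
</syntaxhighlight>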

These observations lead to the following corollary.

Corollary

Fix [math]\beta \gt (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math], let [math]n[/math] be large enough to ensure [math]n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil[/math], and assume the general regression model \eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Let [math]\{\varphi_j\}_{j=1}^{n-1}[/math] be the trigonometric basis. Denote by [math]\varphi^{n-1}_{\thetabic}[/math] (resp. [math]\varphi^{n-1}_{\thetalasso}[/math]) the BIC (resp. Lasso) estimator defined in \eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over [math]\R^{n-1}[/math] with regularization parameter given by \eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then [math]\varphi^{n-1}_{\hat \theta}[/math], where [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], satisfies with probability [math]1-\delta[/math],

[[math]] \|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2 \lesssim (n/\log n)^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,, [[/math]]
where the constants may depend on [math]\beta[/math] and [math]Q[/math].


Show Proof

For [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], adapting the proofs of Theorem for the BIC estimator and Theorem for the Lasso estimator, for any [math]\theta \in \R^{n-1}[/math], with probability [math]1-\delta[/math]

[[math]] |\varphi^{n-1}_{\hat \theta} -f|_2^2 \le \frac{1+\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + R(|\theta|_0)\,. [[/math]]
where

[[math]] R(|\theta|_0):=\frac{C\sigma^2}{\alpha(1-\alpha)}|\theta|_0\log(en/\delta)\,. [[/math]]
It yields

[[math]] \begin{align*} |\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2& \le \frac{2\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 +2(\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta)^\top (f - \varphi^{n-1}_\theta)+R(|\theta|_0)\\ &\le \Big( \frac{2\alpha}{1-\alpha}+\frac{1}{\alpha}\Big)|\varphi^{n-1}_\theta-f|_2^2+\alpha|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2 + R(|\theta|_0)\,, \end{align*} [[/math]]
where we used Young's inequality once again. Let [math] M \in \mathbb{N} [/math] be determined later and choose now [math]\alpha=1/2[/math] and [math]\theta=\theta^*_M[/math], where [math]\theta^*_M[/math] is equal to [math]\theta^*[/math] on its first [math]M[/math] coordinates and [math]0[/math] otherwise so that [math]\varphi^{n-1}_{\theta^*_M}=\varphi^M_{\theta^*}[/math]. It yields

[[math]] |\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}|_2^2\lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2+ R(M)\lesssim|\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2+|\varphi^{n-1}_{\theta^*}-f|_2^2+ R(M) [[/math]]
Next, it follows from \eqref{EQ:bias2} that [math]|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}[/math]. Together with Lemma, it yields

[[math]] \|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2\lesssim \|\varphi^{n-1}_{\theta^*} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2+Qn^{1-2\beta}+ \frac{R(M)}{n}\,. [[/math]]
Moreover, using \eqref{EQ:bias1}, we find that

[[math]] \|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2\lesssim M^{-2\beta}+Qn^{1-2\beta}+ \frac{M}{n}\sigma^2\log(en/\delta) \,. [[/math]]
To conclude the proof, choose [math]M=\lceil (n/\log n)^{\frac{1}{2\beta+1}}\rceil[/math] and observe that the choice of [math]\beta[/math] ensures that [math]n^{1-2\beta} \lesssim M^{-2\beta}[/math].

While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [5], Chapter 3). The reason why we get this extra logarithmic factor here is that we use a hammer that's too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.

General references

Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].

Notes

  1. In the sense that
    [[math]] \lim_{k \to \infty} \int_0^1|f(t)-\sum_{j=1}^k \theta_j \varphi_j(t)|^2\ud t=0 [[/math]]

References

  1. 1.0 1.1 Flaxman, Abraham (2003). "A spectral technique for random satisfiable 3CNF formulas". Ann. Statist. 40. Society for Industrial and Applied Mathematics. doi:10.1214/aos/1034276635. 
  2. "Exponential screening and optimal rates of sparse estimation" (2011). Ann. Statist. 39. John Wiley & Sons Inc.. doi:10.1214/aos/1034276635. 
  3. Ramon, van (2017). "Probability in High Dimension". Constructive Approximation 13. Springer-Verlag. doi:10.1214/aos/1034276635. 
  4. Ryan, O'Donnell (2005). "The PCP Theorem and Hardness of Approximation". SIAM J. Comput. 24. W. H. Freeman & Co.. doi:10.1214/aos/1034276635. 
  5. LeCam, L. (1973). "Convergence of estimates under dimensionality restrictions". Ann. Statist. 1. Omnipress. doi:10.1214/aos/1034276635.