<div class="d-none">
<math>
\newcommand{\DS}{\displaystyle}
\newcommand{\intbeta}{\lfloor \beta \rfloor}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cD}{\mathcal{D}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\cG}{\mathcal{G}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cJ}{\mathcal{J}}
\newcommand{\cK}{\mathcal{K}}
\newcommand{\cL}{\mathcal{L}}
\newcommand{\cM}{\mathcal{M}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cO}{\mathcal{O}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cQ}{\mathcal{Q}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\cU}{\mathcal{U}}
\newcommand{\cV}{\mathcal{V}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\cZ}{\mathcal{Z}}
\newcommand{\sg}{\mathsf{subG}}
\newcommand{\subE}{\mathsf{subE}}
\newcommand{\bA}{\mathbf{A}}
\newcommand{\bB}{\mathbf{B}}
\newcommand{\bC}{\mathbf{C}}
\newcommand{\bD}{\mathbf{D}}
\newcommand{\bE}{\mathbf{E}}
\newcommand{\bF}{\mathbf{F}}
\newcommand{\bG}{\mathbf{G}}
\newcommand{\bH}{\mathbf{H}}
\newcommand{\bI}{\mathbf{I}}
\newcommand{\bJ}{\mathbf{J}}
\newcommand{\bK}{\mathbf{K}}
\newcommand{\bM}{\mathbf{M}}
\newcommand{\bN}{\mathbf{N}}
\newcommand{\bO}{\mathbf{O}}
\newcommand{\bP}{\mathbf{P}}
\newcommand{\bp}{\mathbf{p}}
\newcommand{\bQ}{\mathbf{Q}}
\newcommand{\bS}{\mathbf{S}}
\newcommand{\bT}{\mathbf{T}}
\newcommand{\bU}{\mathbf{U}}
\newcommand{\bV}{\mathbf{V}}
\newcommand{\bX}{\mathbf{X}}
\newcommand{\bY}{\mathbf{Y}}
\newcommand{\bZ}{\mathbf{Z}}
\newcommand{\bflambda}{\boldsymbol{\lambda}}
\newcommand{\bftheta}{\boldsymbol{\theta}}
\newcommand{\bfg}{\boldsymbol{g}}
\newcommand{\bfy}{\boldsymbol{y}}
\def\thetaphat{\hat{\bftheta}_\bp}
\def\bflam{\boldsymbol{\lambda}}
\def\Lam{\Lambda}
\def\lam{\lambda}
\def\bfpi{\boldsymbol{\pi}}
\def\bfz{\boldsymbol{z}}
\def\bfw{\boldsymbol{w}}
\def\bfeta{\boldsymbol{\eta}}
\newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}}
\newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}}
\newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}}
\newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}}
\newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}}
%\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}}
\newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}}
\newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}}
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\q}{\field{Q}}
\newcommand{\Z}{\field{Z}}
\newcommand{\X}{\field{X}}
\newcommand{\Y}{\field{Y}}
\newcommand{\bbS}{\field{S}}
\newcommand{\n}{\mathcal{N}}
\newcommand{\x}{\mathcal{X}}
\newcommand{\pp}{{\sf p}}
\newcommand{\hh}{{\sf h}}
\newcommand{\ff}{{\sf f}}
\newcommand{\Bern}{\mathsf{Ber}}
\newcommand{\Bin}{\mathsf{Bin}}
\newcommand{\Lap}{\mathsf{Lap}}
\newcommand{\tr}{\mathsf{Tr}}
\newcommand{\phin}{\varphi_n}
\newcommand{\phinb}{\overline \varphi_n(t)}
\newcommand{\pn}{\p_{\kern-0.25em n}}
\newcommand{\pnm}{\p_{\kern-0.25em n,m}}
\newcommand{\psubm}{\p_{\kern-0.25em m}}
\newcommand{\psubp}{\p_{\kern-0.25em p}}
\newcommand{\cfi}{\cF_{\kern-0.25em \infty}}
\newcommand{\e}{\mathrm{e}}
\newcommand{\ic}{\mathrm{i}}
\newcommand{\Leb}{\mathrm{Leb}_d}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\ddelta}{d_{\symdiffsmall}}
\newcommand{\dsubh}{d_H}
\newcommand{\indep}{\perp\kern-0.95em{\perp}}
\newcommand{\supp}{\mathop{\mathrm{supp}}}
\newcommand{\rank}{\mathop{\mathrm{rank}}}
\newcommand{\vol}{\mathop{\mathrm{vol}}}
\newcommand{\conv}{\mathop{\mathrm{conv}}}
\newcommand{\card}{\mathop{\mathrm{card}}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand{\ud}{\mathrm{d}}
\newcommand{\var}{\mathrm{var}}
\newcommand{\re}{\mathrm{Re}}
\newcommand{\MSE}{\mathsf{MSE}}
\newcommand{\im}{\mathrm{Im}}
\newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt
height 6pt depth 1.5pt}\vspace{0.5cm}\par}
\newcommand{\bi}[1]{^{(#1)}}
\newcommand{\eps}{\varepsilon}
\newcommand{\Deq}{\stackrel{\mathcal{D}}{=}}
\newcommand{\ubar}{\underbar}
\newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}}
\newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}}
\newcommand{\hint}[1]{\texttt{[Hint:#1]}}
\newcommand{\vp}{\vspace{.25cm}}
\newcommand{\vm}{\vspace{.5cm}}
\newcommand{\vg}{\vspace{1cm}}
\newcommand{\vgneg}{\vspace{-1cm}}
\newcommand{\vneg}{\vspace{-.5cm}}
\newcommand{\vpneg}{\vspace{-.25cm}}
\newcommand{\tp}{\ptsize{10}}
\newcommand{\douzp}{\ptsize{12}}
\newcommand{\np}{\ptsize{9}}
\newcommand{\hp}{\ptsize{8}}
\newcommand{\red}{\color{red}}
\newcommand{\ndpr}[1]{{\textsf{\red[#1]}}}
\newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} }
\newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em}
\newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber}
\newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
\newcommand{\MIT}[1]{{\color{MITred} #1}}
\newcommand{\dHyp}{\{-1,1\}^d}
\newcommand{\thetahard}{\hat \theta^{hrd}}
\newcommand{\thetasoft}{\hat \theta^{sft}}
\newcommand{\thetabic}{\hat \theta^{bic}}
\newcommand{\thetalasso}{\hat \theta^{\cL}}
\newcommand{\thetaslope}{\hat \theta^{\cS}}
\newcommand{\thetals}{\hat \theta^{ls}}
\newcommand{\thetalsm}{\tilde \theta^{ls_X}}
\newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau}
\newcommand{\thetalsK}{\hat \theta^{ls}_K}
\newcommand{\muls}{\hat \mu^{ls}}
</math>
</div>
Arguably, the strongest assumption that we made in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] is that the regression function <math>f(x)</math> is of the form <math>f(x)=x^\top \theta^*</math>. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be ''robust'' to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models.
Throughout this chapter, we assume the following model:
<span id="EQ:regmodgen"/>
<math display="block">
\begin{equation}
\label{EQ:regmodgen}
Y_i=f(X_i)+\eps_i,\quad i=1, \ldots, n\,,
\end{equation}
</math>
where <math>\eps=(\eps_1, \ldots, \eps_n)^\top </math> is sub-Gaussian with variance proxy <math>\sigma^2</math>. Here <math>X_i \in \R^d</math>.
When dealing with fixed design, it will be convenient to consider the vector <math>g \in \R^n</math> defined for any function <math>g \,:\, \R^d \to \R</math> by <math>g=(g(X_1), \ldots, g(X_n))^\top</math>. In this case, we can write for any estimator <math>\hat f \in \R^n</math> of <math>f</math>,
<math display="block">
\MSE(\hat f)=\frac1n|\hat f - f|_2^2\,.
</math>
Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: <math>\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso</math>. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of <math>f</math> if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using ''oracle inequalities''.
==Oracle inequalities==
===Oracle inequalities===
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without knowledge of the quantity of interest, here the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models.
When employing the least squares estimator <math>\thetals</math>, we constrain ourselves to estimating functions that are of the form <math>x\mapsto x^\top \theta</math>, even though <math>f</math> itself may not be of this form. Therefore, the oracle <math>\hat f</math> is the linear function that is the closest to <math>f</math>.
Rather than trying to approximate <math>f</math> by a linear function <math>f(x)\approx \theta^\top x</math>, we make the model a bit more general and consider a dictionary <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> of functions where <math>\varphi_j\,:\, \R^d \to \R</math>. In this case, we can actually remove the assumption that <math>X \in \R^d</math>. Indeed, the goal is now to estimate <math>f</math> using a linear combination of the functions in the dictionary:
<math display="block">
f \approx \varphi_\theta:=\sum_{j=1}^M \theta_j \varphi_j\,.
</math>
{{alert-info |
If <math>M=d</math> and <math>\varphi_j(X)=X^{(j)}</math> returns the <math>j</math>th coordinate of <math>X \in \R^d</math> then the goal is to approximate <math>f(x)</math> by <math>\theta^\top x</math>. Nevertheless, the use of a dictionary allows for a much more general framework.
}}
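As a concrete, entirely hypothetical illustration of this framework, the following sketch builds the matrix with entries <math>\varphi_j(X_i)</math> for a small polynomial dictionary and computes the least squares fit over linear combinations <math>\varphi_\theta</math>; the regression function, the dictionary, and the sample sizes are choices made for the example only.

```python
import numpy as np

# Hypothetical example: a nonlinear regression function f(x) = cos(pi x)
# approximated over the polynomial dictionary {phi_1 = 1, phi_2 = x, phi_3 = x^2}.
rng = np.random.default_rng(0)
n, M, sigma = 200, 3, 0.1
X = rng.uniform(-1.0, 1.0, n)
f = np.cos(np.pi * X)                      # true regression function, not linear
Y = f + sigma * rng.standard_normal(n)     # Y_i = f(X_i) + eps_i

# Design matrix with entries Phi[i, j] = phi_j(X_i).
Phi = np.column_stack([np.ones(n), X, X ** 2])

# Least squares over dictionary coefficients: phi_theta = Phi @ theta.
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

def mse(theta):
    """(1/n) |phi_theta - f|_2^2, the fixed-design MSE used in the text."""
    return float(np.mean((Phi @ theta - f) ** 2))
```

Even though <math>f</math> is not in the span of the dictionary, the fit tracks the best quadratic approximation of <math>f</math>, which is precisely the behavior that the oracle inequalities of this chapter quantify.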
Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
<ul><li> The least squares estimator:
<span id="EQ:defthetaLSmis"/>
<math display="block">
\begin{equation}
\label{EQ:defthetaLSmis}
\thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
\end{equation}
</math>
</li>
<li> The least squares estimator constrained to <math>K \subset \R^M</math>:
<math display="block">
\thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
</math>
</li>
<li> The BIC estimator:
<span id="EQ:defBICmis"/>
<math display="block">
\begin{equation}
\label{EQ:defBICmis}
\thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\}
\end{equation}
</math>
</li>
<li> The Lasso estimator:
<span id="EQ:defLassomis"/>
<math display="block">
\begin{equation}
\label{EQ:defLassomis}
\thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\}
\end{equation}
</math>
</li>
</ul>
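Numerically, the Lasso objective above can be minimized by proximal gradient descent (ISTA); this is one standard algorithmic route among several, and the sketch below is a minimal version of it (the problem sizes, noise level, and <math>\tau</math> are hypothetical choices for illustration).

```python
import numpy as np

def soft_threshold(v, s):
    """Proximal operator of s * |.|_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def lasso_ista(Phi, Y, tau, n_iter=500):
    """ISTA sketch for argmin_theta (1/n)|Y - Phi theta|_2^2 + 2 tau |theta|_1."""
    n, M = Phi.shape
    step = n / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz const. of the gradient
    theta = np.zeros(M)
    for _ in range(n_iter):
        grad = (2.0 / n) * Phi.T @ (Phi @ theta - Y)
        theta = soft_threshold(theta - step * grad, 2.0 * tau * step)
    return theta

def lasso_obj(theta, Phi, Y, tau):
    """The penalized objective being minimized."""
    return float(np.mean((Y - Phi @ theta) ** 2) + 2.0 * tau * np.abs(theta).sum())

# Hypothetical demo: a sparse linear signal plus noise.
rng = np.random.default_rng(1)
n, M = 100, 50
Phi = rng.standard_normal((n, M))
theta_true = np.zeros(M)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = Phi @ theta_true + 0.1 * rng.standard_normal(n)
theta_hat = lasso_ista(Phi, Y, tau=0.05)
```

Each iteration takes a gradient step on the quadratic term and applies soft-thresholding, the proximal operator of the <math>\ell_1</math> penalty; the BIC objective, by contrast, is not amenable to this scheme because of the nonconvex <math>|\theta|_0</math> penalty.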
{{defncard|label=|id=def:oracle|
Let <math>R(\cdot)</math> be a risk function and let <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> be a dictionary of functions from <math>\R^d</math> to <math>\R</math>. Let <math>K</math> be a subset of <math>\R^M</math>. The ''oracle'' on <math>K</math> with respect to <math>R</math> is defined by <math>\varphi_{\bar \theta}</math>, where <math>\bar \theta \in K</math> is such that
<math display="block">
R(\varphi_{\bar \theta}) \le R(\varphi_\theta) \,, \qquad \forall\, \theta \in K\,.
</math>
Moreover, <math>R_{K}=R(\varphi_{\bar \theta})</math> is called the ''oracle risk'' on <math>K</math>. An estimator <math>\hat f</math> is said to satisfy an oracle inequality (over <math>K</math>) with remainder term <math>\phi</math> in expectation (resp. with high probability) if there exists a constant <math>C\ge 1</math> such that
<math display="block">
\E R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M}(K)\,,
</math>
or
<math display="block">
\p\big\{R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K) \big\} \ge 1-\delta \,, \qquad \forall \ \delta > 0\,,
</math>
respectively. If <math>C=1</math>, the oracle inequality is sometimes called ''exact''.}}
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
===Oracle inequality for the least squares estimator===
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of the extension to this case is essentially already captured by the analysis of the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
{{proofcard|Theorem|TH:LS_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the least squares estimator <math>\thetals</math> satisfies for some numerical constant <math>C > 0</math>,
<math display="block">
\MSE(\varphi_{\thetals})\le \inf_{\theta \in \R^M}\MSE(\varphi_\theta)+C\frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability at least <math>1-\delta</math>.
|Note that by definition
<math display="block">
|Y-\varphi_{\thetals}|_2^2 \le |Y-\varphi_{\bar \theta}|_2^2\,,
</math>
where <math>\varphi_{\bar \theta}</math> denotes the orthogonal projection of <math>f</math> onto the linear span of <math>\varphi_1, \ldots, \varphi_M</math>. Since <math>Y=f+\eps</math>, we get
<math display="block">
|f-\varphi_{\thetals}|_2^2 \le |f-\varphi_{\bar \theta}|_2^2+2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Moreover, by Pythagoras's theorem, we have
<math display="block">
|f-\varphi_{\thetals}|_2^2 - |f-\varphi_{\bar \theta}|_2^2=|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\,.
</math>
It yields
<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\le 2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Using the same steps as the ones following equation [[guide:090050c501#EQ:bound_ls_1 |EQ:bound_ls_1]] for the well-specified case, we get
<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\lesssim \frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability <math>1-\delta</math>. The result of the theorem follows.}}
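The rate in this theorem can be checked by a small simulation (an illustration under assumed parameter choices, not part of the proof): since <math>\varphi_{\thetals}</math> is the projection of <math>Y</math> onto the span of the dictionary, the excess risk over the oracle equals <math>|P\eps|_2^2/n</math> for the projection matrix <math>P</math>, which concentrates at order <math>\sigma^2M/n</math>.

```python
import numpy as np

# Illustrative Monte Carlo check (hypothetical parameter choices): the excess
# risk MSE(phi_thetals) - MSE(phi_thetabar) should be of order sigma^2 * M / n.
rng = np.random.default_rng(2)
n, M, sigma = 500, 10, 1.0
X = rng.uniform(-1.0, 1.0, n)
f = np.exp(X)                                      # misspecified: f is not in the span
Phi = np.column_stack([X ** j for j in range(M)])  # polynomial dictionary
P = Phi @ np.linalg.pinv(Phi)                      # projection onto span(phi_1, ..., phi_M)
oracle_mse = np.mean((f - P @ f) ** 2)             # MSE of the oracle phi_thetabar

excesses = []
for _ in range(200):
    Y = f + sigma * rng.standard_normal(n)
    f_hat = P @ Y                                  # phi_thetals is the projection of Y
    excesses.append(np.mean((f - f_hat) ** 2) - oracle_mse)
mean_excess = float(np.mean(excesses))             # concentrates around sigma^2 * M / n
```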
===Sparse oracle inequality for the BIC estimator===
The analysis for more complicated estimators such as the BIC in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] allows us to derive oracle inequalities for these estimators.
{{proofcard|Theorem|TH:BIC_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the BIC estimator <math>\thetabic</math> with regularization parameter
<span id="EQ:taubicmis"/>
<math display="block">
\begin{equation}
\label{EQ:taubicmis}
\tau^2=16\log(6)\frac{\sigma^2}{n} + 32\frac{\sigma^2 \log (eM)}{n}
\end{equation}
</math>
satisfies for some numerical constant <math>C > 0</math>,
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{3\MSE(\varphi_\theta)+&\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\}+ \frac{C\sigma^2}{n}\log(1/\delta)
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|Recall that the proof of [[guide:090050c501#TH:BIC |Theorem]] for the BIC estimator begins as follows:
<math display="block">
\frac1n|Y-\varphi_{\thetabic}|_2^2 + \tau^2|\thetabic|_0 \le \frac1n|Y-\varphi_{\theta}|_2^2 + \tau^2|\theta|_0\,.
</math>
This is true for any <math>\theta \in \R^M</math>.
It implies
<math display="block">
|f-\varphi_{\thetabic}|_2^2 + n\tau^2|\thetabic|_0 \le |f-\varphi_{ \theta}|_2^2+2\eps^\top(\varphi_{\thetabic}-\varphi_{ \theta}) + n\tau^2|\theta|_0\,.
</math>
Note that if <math>\thetabic= \theta</math>, the result is trivial. Otherwise,
<math display="block">
\begin{align*}
2\eps^\top(\varphi_{\thetabic}-\varphi_\theta) &=2\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)|\varphi_{\thetabic}-\varphi_\theta|_2\\
&\le \frac2\alpha\Big[\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)\Big]^2 + \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\,,
\end{align*}
</math>
where we used Young's inequality <math>2ab\le \frac2\alpha a^2+\frac{\alpha}{2} b^2</math>, which is valid for <math>a,b \in \R</math> and <math>\alpha > 0</math>.
Next, since
<math display="block">
\frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\le \alpha|\varphi_{\thetabic}-f|_2^2+\alpha|\varphi_\theta-f|_2^2\,,
</math>
we get for <math>\alpha=1/2</math>, writing <math>\cU(x)=x/|x|_2</math>,
<math display="block">
\begin{align*}
\frac12|\varphi_{\thetabic}-f|_2^2 &\le \frac32|\varphi_{ \theta}-f|_2^2+n\tau^2|\theta|_0 +4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic|_0\\
&\le \frac32|\varphi_{ \theta}-f|_2^2+2n\tau^2|\theta|_0+4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic-\theta|_0.
\end{align*}
</math>
We conclude as in the proof of [[guide:090050c501#TH:BIC |Theorem]].}}
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error <math>\MSE(\varphi_\theta)</math> and the complexity of <math>\theta</math> as measured by its sparsity. In particular, this result, which is sometimes called ''sparse oracle inequality'', implies the following oracle inequality. Define the oracle <math>\bar \theta</math> to be such that
<math display="block">
\MSE(\varphi_{\bar \theta})=\min_{\theta \in \R^M}\MSE(\varphi_\theta)\,.
</math>
Then, with probability at least <math>1-\delta</math>,
<math display="block">
\MSE(\varphi_{\thetabic})\le 3\MSE(\varphi_{\bar \theta})+\frac{C\sigma^2}{n}\Big[|\bar \theta|_0\log (eM)+ \log(1/\delta)\Big]\,.
</math>
If the linear model happens to be correct, then we simply have <math>\MSE(\varphi_{\bar \theta})=0</math>.
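The tradeoff that this bound captures can also be seen numerically. In the hypothetical sketch below (a triangle-wave <math>f</math>, a cosine dictionary, and a greedy support selection used as a cheap stand-in for the exact <math>\ell_0</math> oracle), the approximation error decreases in the sparsity <math>k</math> while the complexity term <math>\sigma^2 k\log(eM)/n</math> grows; their sum is what the BIC estimator mimics.

```python
import numpy as np

# Hypothetical illustration of the approximation/sparsity tradeoff.
n, M, sigma = 256, 64, 0.5
X = np.linspace(0.0, 1.0, n)
f = np.minimum(X, 1.0 - X)        # triangle wave: not sparse in the dictionary,
                                  # but well approximated by a few cosines
Phi = np.column_stack([np.cos(np.pi * j * X) for j in range(M)])

def best_fit_mse(S):
    """Approximation error (1/n)|phi_theta - f|_2^2 of the least squares
    fit of f over the atoms indexed by S."""
    coefs, *_ = np.linalg.lstsq(Phi[:, S], f, rcond=None)
    return float(np.mean((Phi[:, S] @ coefs - f) ** 2))

# Greedy stand-in for the l0 oracle: rank atoms by correlation with f.
order = np.argsort(-np.abs(Phi.T @ f))
ks = [1, 2, 4, 8, 16, 32]
approx = [best_fit_mse(order[:k]) for k in ks]               # decreasing in k
penalty = [sigma ** 2 * k * np.log(np.e * M) / n for k in ks]  # increasing in k
tradeoff = [a + p for a, p in zip(approx, penalty)]
```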
===Sparse oracle inequality for the Lasso===
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the <math>n\times M</math> matrix <math>\Phi</math> with elements <math>\Phi_{i,j}=\varphi_j(X_i)</math>.
{{proofcard|Theorem|TH:lasso_fast_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Moreover, assume that there exists an integer <math>k</math> such that the matrix <math>\Phi</math> satisfies assumption <math>\textsf{INC$(k)$}</math>. Then, the Lasso estimator <math>\thetalasso</math> with regularization parameter given by
<span id{{=}}"EQ:taulassomis"/>
<math display="block">
\begin{equation}
\label{EQ:taulassomis}
2\tau=8\sigma\sqrt{\frac{2\log(2M)}{n}}+8\sigma\sqrt{\frac{2\log(1/\delta)}{n}}
\end{equation}
</math>
satisfies for some numerical constant <math>C</math>,
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetalasso})\le \inf_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\Big\{ \MSE(\varphi_\theta) &+\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\}
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|From the definition of <math>\thetalasso</math>, for any <math>\theta \in \R^M</math>,
<math display="block">
\begin{align*}
\frac1n|Y-\varphi_{\thetalasso}|_2^2
\le \frac1n|Y-\varphi_{\theta}|_2^2 + 2\tau|\theta|_1- 2\tau|\thetalasso|_1 \,.
\end{align*}
</math>
Expanding the squares, adding <math>\tau|\thetalasso-\theta|_1</math> on each side and multiplying by <math>n</math>, we get
<span id{{=}}"EQ:pr:TH:lassofastmis1"/>
<math display="block">
\begin{align}
\leadeq{|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1}\\
\le {} & 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) +n\tau|\thetalasso-\theta|_1+ 2n\tau|\theta|_1- 2n\tau|\thetalasso|_1 \,.
\label{EQ:pr:TH:lassofastmis1}
\end{align}
</math>
Next, note that \textsf{INC}<math>(k)</math> for any <math>k \ge 1</math> implies that <math>|\varphi_j|_2 \le 2\sqrt{n}</math> for all <math>j=1, \ldots, M</math>. Applying Hölder's inequality, we get that with probability <math>1-\delta</math>, it holds that
<math display="block">
2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) \le \frac{n\tau}2|\thetalasso-\theta|_1\,.
</math>
Therefore, taking <math>S=\supp(\theta)</math> to be the support of <math>\theta</math>, we get that the right-hand side of \eqref{EQ:pr:TH:lassofastmis1} is bounded by
<span id{{=}}"EQ:pr:TH:lassofastmis2"/>
<math display="block">
\begin{align}
|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1
&\le 2n\tau|\thetalasso-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso|_1\nonumber \\
&= 2n\tau|\thetalasso_S-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso_S|_1\nonumber\\
&\le 4n\tau|\thetalasso_S-\theta|_1\label{EQ:pr:TH:lassofastmis2}
\end{align}
</math>
with probability <math>1-\delta</math>.
It implies that either <math>\MSE(\varphi_{\thetalasso})\le \MSE(\varphi_\theta)</math> or that
<math display="block">
|\thetalasso_{S^c}-\theta_{S^c}|_1\le 3 |\thetalasso_{S}-\theta_{S}|_1\,,
</math>
so that <math>\thetalasso-\theta</math> satisfies the cone condition [[guide:090050c501#EQ:conecond |EQ:conecond]].
Using now the Cauchy-Schwarz inequality and [[guide:090050c501#lem:inc |Lemma]], respectively, assuming that <math>|\theta|_0\le k</math>, we get
<math display="block">
4n\tau|\thetalasso_S-\theta|_1 \le 4n\tau\sqrt{|S|}|\thetalasso_S-\theta|_2 \le 4\tau\sqrt{2n|\theta|_0}|\varphi_{\thetalasso}-\varphi_{\theta}|_2\,.
</math>
Using now Young's inequality <math>2ab\le \frac2{\alpha} a^2+\frac{\alpha}{2} b^2</math>, we get
<math display="block">
\begin{align*}
4n\tau|\thetalasso_S-\theta|_1& \le \frac{16\tau^2n|\theta|_0}{\alpha}+ \frac{\alpha}{2}|\varphi_{\thetalasso}-\varphi_{\theta}|_2^2\\
&\le \frac{16\tau^2n|\theta|_0}{\alpha}+ \alpha|\varphi_{\thetalasso}-f|_2^2+\alpha|\varphi_{\theta}-f|_2^2\,.
\end{align*}
</math>
Combining this result with \eqref{EQ:pr:TH:lassofastmis1} and \eqref{EQ:pr:TH:lassofastmis2}, we find
<math display="block">
(1-\alpha)\MSE(\varphi_{\thetalasso}) \le (1+\alpha)\MSE(\varphi_{\theta})+ \frac{16\tau^2|\theta|_0}{\alpha}\,.
</math>
To conclude the proof, it only remains to divide both sides of the above inequality by <math>1-\alpha</math> and take <math>\alpha=1/2</math>.}}
===Maurey's argument===
If there is no sparse <math>\theta</math> such that <math>\MSE(\varphi_{\theta})</math> is small, [[#TH:BIC_mis |Theorem]] is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors <math>\theta</math> with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if <math>\theta</math> has a small <math>\ell_1</math> norm but is not sparse. For such <math>\theta</math>, the Lasso estimator still enjoys slow rates as in [[guide:090050c501#TH:lasso_slow |Theorem]], which can be easily extended to the misspecified case (see [[exercise:165e02b2e2 |Problem]]). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector <math>\theta \in \R^M</math> such that <math>|\theta|_1\le 1</math>, there exists a vector <math>\theta'</math> that is sparse and for which <math>\MSE(\varphi_{\theta'})</math> is not much larger than <math>\MSE(\varphi_\theta)</math>. The following theorem quantifies exactly this tradeoff between sparsity and <math>\MSE</math>. It is often attributed to B. Maurey and was published by Pisier <ref name="Pis81">{{cite journal|last=Pisier|first=Gilles|journal=Séminaire d'analyse fonctionnelle (Polytechnique)|year=1981|title=Remarques sur un résultat non publié de B. Maurey}}</ref>. This is why it is referred to as ''Maurey's argument''.
{{proofcard|Theorem|thm-1|Let <math>\{\varphi_1, \dots, \varphi_M\}</math> be a dictionary normalized in such a way that
<math display="block">
\max_{1\le j \le M}|\varphi_j|_2\le D\sqrt{n}\,.
</math>
Then for any integer <math>k</math> such that <math>1\le k \le M</math> and any positive <math>R</math>, we have
<math display="block">
\min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\MSE(\varphi_\theta) \le \min_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}\MSE(\varphi_\theta) + \frac{D^2 R^2}{k}\,.
</math>
|Define
<math display="block">
\bar \theta \in \argmin_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}|\varphi_\theta -f|_2^2
</math>
and assume without loss of generality that <math>|\bar \theta_1| \ge |\bar \theta_2|\ge \ldots \ge |\bar \theta_M|</math>.
Let now <math>U \in \R^n</math> be a random vector with values in <math>\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}</math> defined by
<math display = "block">
\begin{alignat*}{2}
\p(U=R\,\mathrm{sign}(\bar \theta_j)\varphi_j)&=\frac{|\bar \theta_j|}{R}\,, &\quad &j=1, \ldots, M,\\
\p(U=0)&=1-\frac{|\bar \theta|_1}{R}\,.
\end{alignat*}
</math>
Note that <math>\E[U]=\varphi_{\bar \theta}</math> and <math>|U|_2 \le RD \sqrt{n}</math>. Let now <math>U_1, \ldots, U_k</math> be <math>k</math> independent copies of <math>U</math> and define their average
<math display="block">
\bar U=\frac{1}{k}\sum_{i=1}^kU_i\,.
</math>
Note that <math>\bar U=\varphi_{\tilde \theta}</math> for some <math>\tilde \theta \in\R^M</math> such that <math>|\tilde \theta|_0\le k</math>. Moreover, using the Pythagorean theorem,
<math display="block">
\begin{align*}
\E|f-\bar U|_2^2&=\E|f-\varphi_{\bar \theta}+\varphi_{\bar \theta}-\bar U|_2^2\\
&=|f-\varphi_{\bar \theta}|_2^2+\E|\varphi_{\bar \theta}-\bar U|_2^2\\
&= |f-\varphi_{\bar \theta}|_2^2 + \frac{\E|U-\E[U]|_2^2}{k}\\
&\le |f-\varphi_{\bar \theta}|_2^2 + \frac{(RD\sqrt{n})^2}{k}\,.
\end{align*}
</math>
To conclude the proof, note that
<math display="block">
\E|f-\bar U|_2^2 =\E|f-\varphi_{\tilde \theta}|_2^2 \ge \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}|f-\varphi_{\theta}|_2^2
</math>
and divide by <math>n</math>.}}
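Maurey's sampling scheme can also be run as a simulation. The sketch below (dimensions and the vector <math>\theta</math> are hypothetical; it compares <math>\bar U</math> to <math>\varphi_\theta</math> directly, which isolates the variance step of the proof) draws <math>k</math> atoms i.i.d. with probabilities <math>|\theta_j|/R</math> and checks that the <math>k</math>-sparse average is within about <math>D^2R^2/k</math> of <math>\varphi_\theta</math> on average.

```python
import numpy as np

# Hypothetical simulation of Maurey's sampling argument.
rng = np.random.default_rng(4)
n, M, k = 400, 30, 10
Phi = rng.standard_normal((n, M))
Phi *= np.sqrt(n) / np.linalg.norm(Phi, axis=0)  # normalize so |phi_j|_2 = sqrt(n), i.e. D = 1
theta = rng.standard_normal(M)
theta /= np.abs(theta).sum()                     # |theta|_1 = 1, so we may take R = 1
D, R = 1.0, 1.0
target = Phi @ theta                             # the vector phi_theta
probs = np.abs(theta) / R                        # sums to 1 here, so P(U = 0) = 0

gaps = []
for _ in range(1000):
    draws = rng.choice(M, size=k, p=probs)       # indices of k i.i.d. copies of U
    U_bar = np.mean(R * np.sign(theta[draws]) * Phi[:, draws], axis=1)
    gaps.append(np.mean((U_bar - target) ** 2))  # (1/n)|U_bar - phi_theta|_2^2
mean_gap = float(np.mean(gaps))                  # its expectation is at most D^2 R^2 / k
```

Each realization of `U_bar` is supported on at most <math>k</math> atoms by construction, so the simulation exhibits a <math>k</math>-sparse combination whose average squared distance to <math>\varphi_\theta</math> matches the <math>D^2R^2/k</math> bound (up to sampling fluctuation in the Monte Carlo average).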
Maurey's argument implies the following corollary.
{{proofcard|Corollary|cor-1|Assume that the assumptions of [[#TH:BIC_mis |Theorem]] hold and that the dictionary <math>\{\varphi_1, \dots, \varphi_M\}</math> is normalized in such a way that
<math display="block">
\max_{1\le j \le M}|\varphi_j|_2\le \sqrt{n}\,.
</math>
Then there exists a constant <math>C > 0</math> such that the BIC estimator satisfies
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{2\MSE(\varphi_\theta)+C\Big[&\frac{\sigma^2|\theta|_0\log (eM)}{n}\wedge \sigma|\theta|_1\sqrt{\frac{\log (eM)}{n}}\Big]\Big\}\\
&+ C\frac{\sigma^2\log(1/\delta)}{n}
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|Choosing <math>\alpha=1/3</math> in [[#TH:BIC_mis |Theorem]] yields | |||
<math display="block"> | |||
\MSE(\varphi_{\thetabic})\le 2\inf_{\theta \in \R^M}\Big\{\MSE(\varphi_\theta)+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\}+C\frac{\sigma^2\log(1/\delta)}{n} | |||
</math> | |||
Let <math>\theta' \in \R^M</math>. | |||
It follows from Maurey's argument that for any <math> k \in [M] </math>, there exists <math>\theta = \theta(\theta', k) \in \R^M</math> such that <math>|\theta|_0 = k</math> and | |||
<math display="block"> | |||
\MSE(\varphi_{\theta}) \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k} | |||
</math> | |||
It implies that | |||
<math display="block"> | |||
\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} | |||
</math> | |||
and furthermore that | |||
<math display="block"> | |||
\inf_{\theta \in \R^M}\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} | |||
</math> | |||
Since the above bound holds for any <math> \theta' \in \mathbb{R}^M </math> and <math> k \in [M] </math>, we can take an infimum with respect to both <math>\theta'</math> and <math>k</math> on the right-hand side to get | |||
<math display="block"> | |||
\begin{align*} | |||
\inf_{\theta \in \R^M}&\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \Big\}\\ | |||
&\le \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}\,. | |||
\end{align*} | |||
</math> | |||
To control the minimum over <math>k</math>, we need to consider three cases for the quantity | |||
<math display="block"> | |||
\bar k=\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log (eM)}}. | |||
</math> | |||
<ul><li> If <math>1 \le \bar k \le M</math>, then we get | |||
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} | |||
</math> | |||
</li> | |||
<li> If <math>\bar k \le 1</math>, then | |||
<math display="block"> | |||
|\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, | |||
</math> | |||
which yields | |||
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} | |||
</math> | |||
</li> | |||
<li> If <math>\bar k \ge M</math>, then | |||
<math display="block"> | |||
\frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. | |||
</math> | |||
Therefore, on the one hand, if <math>M \ge \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}</math>, we get
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. | |||
</math> | |||
On the other hand, if <math>M \le \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}</math>, then for any <math>\theta \in \R^M</math>, we have
<math display="block"> | |||
\begin{equation*} | |||
\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. | |||
\end{equation*} | |||
</math> | |||
</li> | |||
</ul> | |||
Combined, | |||
<math display="block"> | |||
\begin{align*} | |||
\leadeq{\inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}}\\ | |||
\le {} & \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'}) + C \sigma | \theta' |_1 \sqrt{\frac{\log(eM)}{n}} + C \frac{\sigma^2 \log(eM)}{n} \Big\},
\end{align*} | |||
</math> | |||
which together with Theorem 3.4 yields the claim.}} | |||
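The balancing step in the proof above (the first case, <math>1\le \bar k \le M</math>) can be checked numerically. A minimal sketch, with illustrative values of <math>\sigma</math>, <math>n</math>, <math>M</math> and <math>|\theta'|_1</math>: taking <math>k=\lceil \bar k\rceil</math> shows the minimum over <math>k</math> is at most <math>2\sigma|\theta'|_1\sqrt{\log(eM)/n}</math> plus an additive <math>\sigma^2\log(eM)/n</math>.

```python
import numpy as np

# Numerical check of the balancing step (illustrative values): when
# 1 <= kbar <= M, taking k = ceil(kbar) shows that
#   min_k ( |theta'|_1^2/k + sigma^2 k log(eM)/n )
# is at most 2*sigma*|theta'|_1*sqrt(log(eM)/n) + sigma^2*log(eM)/n.
sigma, n, M = 1.0, 1000, 50
t1 = 3.0                                  # stands in for |theta'|_1
logeM = np.log(np.e * M)
kbar = (t1 / sigma) * np.sqrt(n / logeM)
assert 1 <= kbar <= M                     # we are in the first case

ks = np.arange(1, M + 1)
best = (t1**2 / ks + sigma**2 * ks * logeM / n).min()
target = sigma * t1 * np.sqrt(logeM / n)
assert best <= 2 * target + sigma**2 * logeM / n
```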
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the <math>\ell_0</math> norm as in [[#TH:BIC_mis |Theorem]]. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of<ref name="RigTsy11">{{cite journal|last1=Rigollet|first1=Philippe|last2=Tsybakov|first2=Alexandre|journal=Ann. Statist.|year=2011|title=Exponential screening and optimal rates of sparse estimation|volume=39|number=2|doi=10.1214/aos/1034276635|publisher=John Wiley & Sons Inc.|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1206.5349}}</ref>. | |||
Maurey's argument allows us to enjoy the best of both the <math>\ell_0</math> and the <math>\ell_1</math> worlds. The rate adapts to the sparsity of the problem and can even be generalized to <math>\ell_q</math>-sparsity (see [[exercise:Db743c87c5|Problem]]). However, it is clear from the proof that this argument is limited to squared <math>\ell_2</math> norms such as the one appearing in <math>\MSE</math>, and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces <ref name="Pis81"/><ref name=" DonDarGur97">{{cite journal|last=Ramon|first=van|journal=Constructive Approximation|year=2017|title=Probability in High Dimension|volume=13|number=2|doi=10.1214/aos/1034276635|publisher=Springer-Verlag|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=ArXiv:1502.00665}}</ref> using more sophisticated arguments.
==Nonparametric regression== | |||
So far, the oracle inequalities that we have derived do not deal with the approximation error <math>\MSE(\varphi_\theta)</math>. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> is chosen appropriately. In what follows, we assume for simplicity that <math>d=1</math> so that <math>f\,:\, \R \to \R</math> and <math>\varphi_j\,:\, \R \to \R</math>. | |||
===Fourier decomposition=== | |||
Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> forms an orthonormal system of <math>L_2([0,1])</math>: | |||
<math display="block"> | |||
\int_0^1\varphi_j^2(x)\ud x =1\,, \quad \int_0^1 \varphi_j(x)\varphi_k(x)\ud x=0, \ \forall\, j\neq k\,. | |||
</math> | |||
We will also deal with the case where <math>M=\infty</math>. | |||
When <math>\cH</math> is an orthonormal system, the coefficients <math>\theta_j^* \in \R</math> defined by | |||
<math display="block"> | |||
\theta_j^* =\int_0^1 f(x) \varphi_j(x)\ud x\,, | |||
</math> | |||
are called ''Fourier coefficients'' of <math>f</math>. | |||
Assume now that the regression function <math>f</math> admits the following decomposition | |||
<math display="block"> | |||
f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,. | |||
</math> | |||
There exist many choices for the orthonormal system; we give only two examples.
'''Example''' | |||
<span id = "ex:trigbasis"/> | |||
''Trigonometric basis''. This is an orthonormal basis of <math>L_2([0,1])</math>. It is defined by | |||
<math display="block"> | |||
\begin{eqnarray*} | |||
\varphi_1&\equiv&1\\ | |||
\varphi_{2k}(x)&=&\sqrt{2} \cos(2\pi k x)\,,\\ | |||
\varphi_{2k+1}(x)&=&\sqrt{2} \sin(2\pi k x)\,, | |||
\end{eqnarray*} | |||
</math> | |||
for <math>k=1,2, \dots</math> and <math>x \in [0,1]</math>. The fact that it is indeed an orthonormal system can be easily checked using trigonometric identities.
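The orthonormality relations can also be verified numerically; a small sketch (midpoint quadrature on a uniform grid, with `phi` indexing the basis as in the display above):

```python
import numpy as np

# Midpoint-rule check that the first five trigonometric basis functions
# are orthonormal in L_2([0,1]).  phi follows the indexing above:
# phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x).
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

N = 10000
x = (np.arange(N) + 0.5) / N              # midpoints of a uniform grid on [0, 1]
G = np.array([[np.mean(phi(j, x) * phi(l, x)) for l in range(1, 6)]
              for j in range(1, 6)])      # Gram matrix of L_2 inner products
assert np.allclose(G, np.eye(5), atol=1e-8)
```

The midpoint rule is exact up to floating-point error here because the integrands are trigonometric polynomials of low frequency.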
The next example has received a lot of attention in the signal (sound, image, ...) processing community. | |||
'''Example''' | |||
''Wavelets''. | |||
Let <math>\psi\,:\, \R \to \R</math> be a sufficiently smooth and compactly supported function, called the ''mother wavelet''. Define the system of functions
<math display="block"> | |||
\psi_{jk}(x)=2^{j/2}\psi(2^jx-k)\,, \quad j,k \in \Z\,. | |||
</math> | |||
It can be shown that for a suitable <math>\psi</math>, the dictionary <math>\{\psi_{jk}\,,\ j,k \in \Z\}</math> forms an orthonormal system of <math>L_2([0,1])</math> and sometimes a basis. In the latter case, for any function <math>g \in L_2([0,1])</math>, it holds
<math display="block"> | |||
g=\sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty \theta_{jk}\psi_{jk}\,, \quad \theta_{jk}=\int_{0}^1 g(x)\psi_{jk}(x)\ud x\,. | |||
</math> | |||
The coefficients <math>\theta_{jk}</math> are called ''wavelet coefficients'' of <math>g</math>. | |||
The simplest example is given by the ''Haar system'' obtained by taking <math>\psi</math> to be the following piecewise constant function (see [[#FIG:haar |Figure]]). We will not give more details about wavelets here but simply point the interested reader to<ref name="Mal09">{{cite journal|last=Ryan|first=O'Donnell|journal=SIAM J. Comput.|year=2005|title=The PCP Theorem and Hardness of Approximation|volume=24|number=2|doi=10.1214/aos/1034276635|publisher=W. H. Freeman & Co.|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1011.3396}}</ref>. | |||
<math display="block"> | |||
\psi(x)=\left\{ | |||
\begin{array}{ll} | |||
1& 0\le x < 1/2\\ | |||
-1 & 1/2\le x \le 1\\ | |||
0 & \text{otherwise} | |||
\end{array} | |||
\right. | |||
</math> | |||
<div id="FIG:haar" class="d-flex justify-content-center"> | |||
[[File: | 400px | thumb | The Haar mother wavelet ]] | |||
</div> | |||
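For the Haar mother wavelet above, orthonormality of the <math>\psi_{jk}</math> on <math>[0,1]</math> can be checked directly; a minimal sketch over a small finite family (the choice of levels <math>j=0,1,2</math> is illustrative):

```python
import numpy as np

# Haar mother wavelet as defined above, and a check that the functions
# psi_{jk}(x) = 2^(j/2) psi(2^j x - k) for j = 0, 1, 2 and k = 0, ..., 2^j - 1
# form an orthonormal family in L_2([0,1]).
def psi(x):
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def psi_jk(j, k, x):
    return 2**(j / 2) * psi(2**j * x - k)

N = 2**14
x = (np.arange(N) + 0.5) / N                       # midpoint grid on [0, 1]
fam = [(j, k) for j in range(3) for k in range(2**j)]
G = np.array([[np.mean(psi_jk(j1, k1, x) * psi_jk(j2, k2, x)) for j2, k2 in fam]
              for j1, k1 in fam])
assert np.allclose(G, np.eye(len(fam)), atol=1e-12)
```

Because the functions are piecewise constant with dyadic breakpoints, the grid averages above are exact integrals up to floating-point error.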
===Sobolev classes and ellipsoids=== | |||
We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that <math>f^{(k)}</math> denotes the <math>k</math>-th derivative of <math>f</math>. | |||
{{defncard|label=|id=|Fix parameters <math>\beta \in \{1, 2, \dots\}</math> and <math>L > 0</math>. The Sobolev class of functions <math>W(\beta,L)</math> is defined by | |||
<math display="block"> | |||
\begin{align*} | |||
W(\beta,L)=\Big\{f\,:\,& [0,1] \to \R\ :\ f\in L_2([0,1])\,, f^{(\beta-1)} \text{ is absolutely continuous and }\\ | |||
& \int_{0}^1[f^{(\beta)}]^2\le L^2\,, f^{(j)}(0)= f^{(j)}(1), j=0, \ldots, \beta-1\Big\} | |||
\end{align*} | |||
</math> | |||
}} | |||
Any function <math>f \in W(\beta, L)</math> can be represented<ref group="Notes" >In the sense that
<math display="block"> | |||
\lim_{k \to \infty} \int_0^1|f(t)-\sum_{j=1}^k \theta_j \varphi_j(t)|^2\ud t=0 | |||
</math> | |||
</ref> as its Fourier expansion along the trigonometric basis: | |||
<math display="block"> | |||
f(x)=\theta_1^*\varphi_1(x) + \sum_{k=1}^\infty\big(\theta_{2k}^*\varphi_{2k}(x)+\theta_{2k+1}^*\varphi_{2k+1}(x)\big)\,, \quad \forall \ x \in [0,1]\,, | |||
</math> | |||
where <math>\theta^*=\{\theta_j^*\}_{j \ge 1}</math> is in the space of square-summable sequences <math>\ell_2(\N)</math> defined by
<math display="block"> | |||
\ell_2(\N)=\Big\{\theta\,:\, \sum_{j=1}^\infty \theta_j^2 < \infty\Big\}\,. | |||
</math> | |||
For any <math>\beta > 0</math>, define the coefficients | |||
<span id="EQ:defaj"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:defaj} | |||
a_j=\left\{ | |||
\begin{array}{cl} | |||
j^\beta& \text{for $j$ even}\\ | |||
(j-1)^\beta & \text{for $j$ odd} | |||
\end{array} | |||
\right. | |||
\end{equation} | |||
</math> | |||
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients. | |||
{{proofcard|Theorem|thm-2|Fix <math>\beta\ge 1</math> and <math>L > 0</math> and let <math>\{\varphi_j\}_{j \ge 1}</math> denote the trigonometric basis of <math>L_2([0,1])</math>. Moreover, let <math>\{a_j\}_{j\ge 1}</math> be defined as in\eqref{EQ:defaj}. A function <math>f \in W(\beta,L)</math> can be represented as | |||
<math display="block"> | |||
f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,, | |||
</math> | |||
where the sequence <math>\{\theta_j^*\}_{j \ge 1}</math> belongs to Sobolev ellipsoid of <math>\ell_2(\N)</math> defined by | |||
<math display="block"> | |||
\Theta(\beta, Q)=\Big\{\theta \in \ell_2(\N)\,:\, \sum_{j=1}^\infty a_j^2\theta_j^2\le Q\Big\} | |||
</math> | |||
for <math>Q=L^2/\pi^{2\beta}</math>. | |||
|Let us first define the Fourier coefficients <math>\{s_k(j)\}_{k \ge 1}</math> of the <math>j</math>th derivative <math>f^{(j)}</math> of <math>f</math> for <math>j=1, \ldots, \beta</math>: | |||
<math display="block"> | |||
\begin{align*} | |||
s_1(j)&=\int_0^1f^{(j)}(t)\ud t=f^{(j-1)}(1)-f^{(j-1)}(0)=0\,,\\ | |||
s_{2k}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\cos(2\pi kt)\ud t\,,\\ | |||
s_{2k+1}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\sin(2\pi kt)\ud t\,,\\ | |||
\end{align*} | |||
</math> | |||
The Fourier coefficients of <math>f</math> are given by <math>\theta_k=s_k(0)</math>. | |||
Using integration by parts, we find that | |||
<math display="block"> | |||
\begin{align*} | |||
s_{2k}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \cos(2\pi kt)\Big|_0^1 + (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ | |||
&= \sqrt{2}[f^{(\beta-1)}(1)-f^{(\beta-1)}(0)]+(2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ | |||
&= (2\pi k)s_{2k+1}(\beta-1)\,. | |||
\end{align*} | |||
</math> | |||
Moreover, | |||
<math display="block"> | |||
\begin{align*} | |||
s_{2k+1}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \sin(2\pi kt)\Big|_0^1 - (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \cos(2\pi kt)\ud t\\ | |||
&= -(2\pi k)s_{2k}(\beta-1)\,. | |||
\end{align*} | |||
</math> | |||
In particular, it yields | |||
<math display="block"> | |||
s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^2\big[s_{2k}(\beta-1)^2+s_{2k+1}(\beta-1)^2\big] | |||
</math> | |||
By induction, we find that for any <math>k \ge 1</math>, | |||
<math display="block"> | |||
s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big) | |||
</math> | |||
Next, it follows from the definition\eqref{EQ:defaj} of <math>a_j</math> that
<math display="block"> | |||
\begin{align*} | |||
\sum_{k=1}^\infty(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big)&=\pi^{2\beta}\sum_{k=1}^\infty a_{2k}^2\theta_{2k}^2 + \pi^{2\beta}\sum_{k=1}^\infty a_{2k+1}^2\theta_{2k+1}^2\\ | |||
&=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. | |||
\end{align*} | |||
</math> | |||
Together with the Parseval identity, it yields | |||
<math display="block"> | |||
\int_0^1\big(f^{(\beta)}(t)\big)^2\ud t=\sum_{k=1}^\infty s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. | |||
</math> | |||
To conclude, observe that since <math>f \in W(\beta, L)</math>, we have | |||
<math display="block"> | |||
\int_0^1\big(f^{(\beta)}(t)\big)^2\ud t \le L^2\,, | |||
</math> | |||
so that <math>\theta \in \Theta(\beta, L^2/\pi^{2\beta})</math>.}}
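The identity at the end of the proof can be sanity-checked on a concrete function. For <math>f(x)=\sin(2\pi x)=\theta_3\varphi_3(x)</math> with <math>\theta_3=1/\sqrt{2}</math> and <math>\beta=1</math>, both sides equal <math>2\pi^2</math>; a numeric sketch:

```python
import numpy as np

# Check int_0^1 (f^(beta))^2 dt = pi^(2 beta) sum_j a_j^2 theta_j^2 for
# f(x) = sin(2 pi x) and beta = 1.  Here theta_3 = 1/sqrt(2) is the only
# nonzero Fourier coefficient (phi_3 = sqrt(2) sin(2 pi x)) and a_3 = (3-1)^beta = 2.
beta = 1
theta3 = 1 / np.sqrt(2)
a3 = (3 - 1)**beta
rhs = np.pi**(2 * beta) * a3**2 * theta3**2        # = 2 * pi^2

N = 100000
x = (np.arange(N) + 0.5) / N
fprime = 2 * np.pi * np.cos(2 * np.pi * x)         # f' for f = sin(2 pi x)
lhs = np.mean(fprime**2)                           # midpoint rule for the integral
assert abs(rhs - 2 * np.pi**2) < 1e-12
assert abs(lhs - rhs) < 1e-8
```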
It can actually be shown that the converse is true, that is, any function with Fourier coefficients in <math>\Theta(\beta, Q)</math> belongs to <math> W(\beta,L)</math>, but we will not need this.
In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write <math>f \in \Theta(\beta, Q)</math> in this case and consider any real value for <math>\beta</math>. | |||
{{proofcard|Proposition|prop-1|The Sobolev ellipsoids enjoy the following properties | |||
<ul style{{=}}"list-style-type:lower-roman"><li>For any <math>Q > 0</math>, | |||
<math display="block"> | |||
0 < \beta' < \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). | |||
</math></li> | |||
<li>For any <math>Q > 0</math> and any <math>f</math> with Fourier coefficients in <math>\Theta(\beta, Q)</math>,
<math display="block">
\beta > \frac12 \ \Rightarrow \ f\ \text{is continuous}.
</math></li>
</ul>|}} | |||
The proof is left as an exercise ([[exercise:143f8e5f8a |Problem]]). | |||
It turns out that the first <math> n </math> functions in the trigonometric basis are not only orthonormal with respect to the inner product of <math>L_2</math>, but also with respect to the inner product associated with the fixed design, <math>\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)</math>, when the design is chosen to be ''regular'', i.e., <math>X_i=(i-1)/n</math>, <math>i=1, \ldots, n</math>.
{{proofcard|Lemma|LEM:regdesign|Assume that <math>\{X_1, \ldots, X_n\}</math> is the regular design, i.e., <math>X_i=(i-1)/n</math>. Then, for any <math>M\le n-1</math>, the design matrix <math>\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}</math> satisfies the <math>\textsf{ORT}</math> condition. | |||
|Fix <math>j,j'\in \{1, \ldots, n-1\}</math>, <math>j\neq j'</math>, and consider the inner product <math>\varphi_j^\top \varphi_{j'}</math>. Write <math>k_{j}=\lfloor j/2\rfloor</math> for the integer part of <math>j/2</math> and define the vectors <math>a, b, a', b' \in \R^n</math> with coordinates such that <math>e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}</math> and <math>e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}</math> for <math> s \in \{0, \dots, n-1\} </math>. It holds that | |||
<math display="block"> | |||
\frac{1}{2}\varphi_j^\top \varphi_{j'}\in \{a^\top a'\,,\, b^\top b'\,,\, b^\top a'\,,\, a^\top b'\}\,, | |||
</math> | |||
depending on the parity of <math>j</math> and <math>j'</math>. | |||
On the one hand, observe that if <math>k_j \neq k_{j'}</math>, we have for any <math>\sigma \in \{-1, +1\}</math>, | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi (k_j+\sigma k_{j'}) s}{n}}=0\,. | |||
</math> | |||
On the other hand | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=(a+\mathsf{i} b)^\top( a'+\sigma \mathsf{i} b')=a^\top a'-\sigma b^\top b' + \mathsf{i} \big[b^\top a' + \sigma a^\top b' \big] | |||
</math> | |||
so that <math>a^\top a'=\pm b^\top b'=0</math> and <math>b^\top a'=\pm a^\top b'=0</math> whenever <math>k_j \neq k_{j'}</math>. It yields <math>\varphi_j^\top \varphi_{j'}=0</math>.
Next, consider the case where <math>k_j=k_{j'}</math> so that | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\left\{ | |||
\begin{array}{ll} | |||
0& \text{if $\sigma=1$}\\ | |||
n& \text{if $\sigma=-1$}\\ | |||
\end{array}\right.\,. | |||
</math> | |||
On the one hand if <math>j \neq j'</math>, it can only be the case that <math>\varphi_j^\top \varphi_{j'} \in \{b^\top a'\,,\, a^\top b'\}</math> but the same argument as above yields <math>b^\top a'=\pm a^\top b'=0</math> since the imaginary part of the inner product is still <math> 0 </math>. Hence, in that case, <math>\varphi_j^\top \varphi_{j'}=0</math>. | |||
On the other hand, if <math>j=j'</math>, then <math>a=a'</math> and <math>b=b'</math> so that it yields <math>a^\top a'=|a|_2^2=n</math> and <math>b^\top b'=|b|_2^2=n</math> which is equivalent to <math>\varphi_j^\top \varphi_j=|\varphi_j|_2^2=n</math>. | |||
Therefore, the design matrix <math>\Phi</math> is such that | |||
<math display="block"> | |||
\begin{equation*} | |||
\Phi^\top \Phi=nI_M\,. | |||
\end{equation*} | |||
</math>}} | |||
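The lemma is easy to confirm numerically; a sketch with the regular design and <math>M=n-1</math> (the value of <math>n</math> is arbitrary):

```python
import numpy as np

# For the regular design X_i = (i-1)/n and the first M <= n-1 functions of
# the trigonometric basis, the design matrix satisfies Phi^T Phi = n I_M.
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

n = 64
M = n - 1
X = np.arange(n) / n                               # regular design X_i = (i-1)/n
Phi = np.column_stack([phi(j, X) for j in range(1, M + 1)])
assert np.allclose(Phi.T @ Phi, n * np.eye(M), atol=1e-8)
```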
===Integrated squared error=== | |||
As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if <math>\theta \in \Theta(\beta, Q)</math>, then <math>a_j^2\theta_j^2 \to 0</math> as <math>j \to \infty</math> so that <math>|\theta_j|=o(j^{-\beta})</math>. Therefore, the <math>\theta_j</math>s decay polynomially to zero and it makes sense to approximate <math>f</math> by its truncated Fourier series | |||
<math display="block"> | |||
\sum_{j=1}^M \theta_j^* \varphi_j=:\varphi_{\theta^*}^M | |||
</math> | |||
for any fixed <math>M</math>. This truncation leads to a systematic error that vanishes as <math>M \to \infty</math>. We are interested in understanding the rate at which this happens. | |||
The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter <math>M</math> and the smoothness <math>\beta</math>. | |||
{{proofcard|Lemma|TH:bias|For any integer <math>M \ge 1</math>, and <math>f \in \Theta(\beta, Q)</math>, <math>\beta > 1/2</math>, it holds | |||
<span id="EQ:bias1"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:bias1} | |||
\|\varphi_{\theta^*}^M -f\|_{L_2}^2=\sum_{j > M} |\theta^*_j|^2 \le QM^{-2\beta}\,,
\end{equation} | |||
</math> | |||
and for <math>M =n-1</math>, we have | |||
<span id="EQ:bias2"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:bias2} | |||
|\varphi_{\theta^*}^{n-1} -f|_2^2\le 2n\Big(\sum_{j\ge n}|\theta^*_j|\Big)^2\lesssim Qn^{2-2\beta}\,. | |||
\end{equation} | |||
</math> | |||
|Note that for any <math>\theta \in \Theta(\beta, Q)</math>, if <math>\beta > 1/2</math>, then | |||
<math display="block"> | |||
\begin{align*} | |||
\sum_{j=2}^\infty|\theta_j|&=\sum_{j=2}^\infty a_j|\theta_j|\frac1{a_j}\\ | |||
&\le \sqrt{\sum_{j=2}^\infty a_j^2\theta_j^2\sum_{j=2}^\infty\frac1{a_j^2}}\quad \text{by Cauchy-Schwarz}\\ | |||
&\le \sqrt{Q\sum_{j=1}^\infty\frac1{j^{2\beta}}} < \infty | |||
\end{align*} | |||
</math> | |||
Since <math>\{\varphi_j\}_j</math> forms an orthonormal system in <math>L_2([0,1])</math>, we have | |||
<math display="block"> | |||
\min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2}^2= \|\varphi_{\theta^*}^M -f\|_{L_2}^2 =\sum_{j > M} |\theta^*_j|^2\,. | |||
</math> | |||
When <math>\theta^* \in \Theta(\beta,Q)</math>, we have | |||
<math display="block"> | |||
\sum_{j > M} |\theta_j^*|^2=\sum_{j > M} a_j^2|\theta_j^*|^2\frac1{a_j^2}\le \frac1{a_{M+1}^2}Q\le \frac{Q}{M^{2\beta}}\,. | |||
</math> | |||
To prove the second part of the lemma, observe that | |||
<math display="block"> | |||
|\varphi_{\theta^*}^{n-1} -f|_2=\big|\sum_{j\ge n}\theta_j^* \varphi_j\big|_2 \le \sqrt{2n}\sum_{j\ge n}|\theta_j^*|\,,
</math> | |||
where in the last inequality, we used the fact that for the trigonometric basis <math>|\varphi_j|_2 \le \sqrt{2n}, j \ge 1</math> regardless of the choice of the design <math>X_1,\ldots, X_n</math>. | |||
When <math>\theta^* \in \Theta(\beta,Q)</math>, we have | |||
<math display="block"> | |||
\begin{equation*} | |||
\sum_{j\ge n} |\theta^*_j|=\sum_{j\ge n} a_j|\theta^*_j|\frac{1}{a_j}\le \sqrt{\sum_{j\ge n} a_j^2|\theta^*_j|^2}\sqrt{\sum_{j\ge n}\frac{1}{a_j^2}}\lesssim \sqrt{Q} n^{\frac{1}{2}-\beta}\,. | |||
\end{equation*} | |||
</math>}} | |||
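The first bound of the lemma can be illustrated numerically on a sample sequence from a Sobolev ellipsoid (the choice <math>\theta_j=j^{-\beta-1}</math> below is arbitrary; any sequence with finite <math>Q</math> works):

```python
import numpy as np

# Check the truncation bound sum_{j > M} theta_j^2 <= Q M^(-2 beta) for an
# example sequence theta_j = j^(-beta - 1) (illustrative choice), with
# Q = sum_j a_j^2 theta_j^2 and a_j as defined in \eqref{EQ:defaj}.
beta = 2.0
J = 200000                                  # large cutoff standing in for infinity
j = np.arange(1, J + 1, dtype=float)
a = np.where(j % 2 == 0, j**beta, (j - 1)**beta)   # note a_1 = 0
theta = j**(-beta - 1)
Q = np.sum(a**2 * theta**2)

for M in (5, 10, 50, 100):
    tail = np.sum(theta[M:]**2)             # sum over j > M (0-based slicing)
    assert tail <= Q * M**(-2 * beta)
```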
Note that the truncated Fourier series <math>\varphi_{\theta^*}^M</math> is an oracle: this is what we see when we view <math>f</math> through the lens of functions with only low frequency harmonics. | |||
To estimate <math>\varphi_{\theta^*} = \varphi_{\theta^\ast}^M</math>, consider the estimator <math>\varphi_{\thetals}</math> where | |||
<math display="block"> | |||
\thetals \in \argmin_{\theta \in \R^M} \sum_{i=1}^n \big(Y_i -\varphi_\theta(X_i)\big)^2\,, | |||
</math> | |||
which should be such that <math>\varphi_{\thetals}</math> is close to <math>\varphi_{\theta^*}</math>. For this estimator, we have proved ([[#TH:LS_mis |Theorem]]) an oracle inequality for the <math>\MSE</math> that is of the form | |||
<math display="block"> | |||
|\varphi_{\thetals}^M-f|_2^2\le\inf_{\theta \in \R^M}|\varphi_\theta^M-f|_2^2+C\sigma^2M\log(1/\delta)\,, \qquad C > 0\,. | |||
</math> | |||
It yields | |||
<math display="block"> | |||
\begin{align*} | |||
|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2&\le 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(f-\varphi_{\theta^*}^M)+C\sigma^2M\log(1/\delta)\\ | |||
&= 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j > M}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\\ | |||
&=2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\,, | |||
\end{align*} | |||
</math> | |||
where we used [[#LEM:regdesign |Lemma]] in the last equality. Together with\eqref{EQ:bias2} and Young's inequality <math>2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0</math> for any <math>\alpha > 0</math>, we get | |||
<math display="block"> | |||
2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)\le \alpha|\varphi_{\thetals}^M-\varphi_{\theta^*}^M | |||
|_2^2 + \frac{C}{\alpha}Qn^{2-2\beta}\,, | |||
</math> | |||
for some positive constant <math>C</math> when <math>\theta^* \in \Theta(\beta, Q)</math>. As a result, | |||
<span id="EQ:almostLSNP"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:almostLSNP} | |||
|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2 \lesssim \frac{1}{\alpha(1-\alpha)}Qn^{2-2\beta}+ \frac{\sigma^2M}{1-\alpha}\log(1/\delta) | |||
\end{equation} | |||
</math> | |||
for any <math>\alpha \in (0,1)</math>. Since [[#LEM:regdesign |Lemma]] implies <math>|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2</math>, we have proved the following theorem.
{{proofcard|Theorem|TH:UB_ls_NP|Fix <math>\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math> and assume the general regression model\eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Moreover, let <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> and <math>n</math> be large enough so that <math>M \le n-1</math>. Then the least squares estimator <math>\thetals</math> defined in\eqref{EQ:defthetaLSmis} with <math>\{\varphi_j\}_{j=1}^M</math> being the trigonometric basis, satisfies with probability <math>1-\delta</math>, for <math>n</math> large enough, | |||
<math display="block"> | |||
\|\varphi^M_{\thetals} -f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,,
</math> | |||
where the constant factors may depend on <math>\beta, Q</math> and <math>\sigma</math>. | |||
|Choosing <math>\alpha=1/2</math> for example and absorbing <math>Q</math> in the constants, we get from\eqref{EQ:almostLSNP} and [[#LEM:regdesign |Lemma]] that | |||
for <math>M \le n-1</math>, | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-\varphi^M_{\theta^*}\|^2_{L_2([0,1])}\lesssim n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. | |||
</math> | |||
Using now [[#TH:bias |Lemma]] and <math>\sigma^2 \le 1</math>, we get | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim M^{-2\beta}+n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. | |||
</math> | |||
Taking <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil\le n-1</math> for <math>n</math> large enough yields | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))+n^{1-2\beta}\,.
</math> | |||
To conclude the proof, simply note that for the prescribed <math>\beta</math>, we have <math>n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}</math>.}}
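The choice <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> simply balances the bias term <math>M^{-2\beta}</math> against the variance term <math>M/n</math>; a quick numeric illustration (the values of <math>\beta</math> and <math>n</math> are arbitrary):

```python
import numpy as np

# Balance M^(-2 beta) against M/n: the prescribed M = ceil(n^(1/(2 beta + 1)))
# achieves the rate n^(-2 beta/(2 beta + 1)) up to a constant factor.
beta, n = 2.0, 10000
Ms = np.arange(1, n, dtype=float)
bound = Ms**(-2 * beta) + Ms / n               # bias + variance proxy
M_rule = int(np.ceil(n**(1 / (2 * beta + 1))))
rate = n**(-2 * beta / (2 * beta + 1))
assert bound[M_rule - 1] <= 3 * rate           # prescribed M attains the rate
assert bound.min() <= bound[M_rule - 1]
```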
===Adaptive estimation=== | |||
The rate attained by the projection estimator <math>\varphi_{\thetals}</math> with <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter <math>\beta</math>, which is typically unknown, to determine the truncation level <math>M</math>. The purpose of ''adaptive estimation'' is precisely to adapt to the unknown <math>\beta</math>, that is, to build an estimator that does not depend on <math>\beta</math> and yet attains a rate of the order of <math>Cn^{-\frac{2\beta}{2\beta+1}}</math> (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis} respectively. In view of [[#LEM:regdesign |Lemma]], the design matrix <math>\Phi</math> actually satisfies the assumption <math>\textsf{ORT}</math> when we work with the trigonometric basis. This has two useful implications:
<ul><li> Both estimators are actually thresholding estimators and can therefore be implemented efficiently.
</li> | |||
<li> The condition <math>\textsf{INC($k$)}</math> is automatically satisfied for any <math>k \ge 1</math>. | |||
</li> | |||
</ul> | |||
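The first point can be made concrete: under <math>\textsf{ORT}</math>, the least squares coefficients are <math>z_j=\varphi_j^\top Y/n</math>, and the BIC estimator keeps only those exceeding a threshold in absolute value. A minimal sketch, where the signal, noise level, and threshold <math>\tau</math> are illustrative rather than the calibrated values from the text:

```python
import numpy as np

# Hard thresholding of empirical Fourier coefficients under ORT (sketch).
# The threshold tau is illustrative, not the calibrated BIC value.
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

rng = np.random.default_rng(0)
n, sigma = 256, 0.1
X = np.arange(n) / n                                     # regular design
f = np.sin(2 * np.pi * X) + 0.5 * np.cos(4 * np.pi * X)  # toy smooth signal
Y = f + sigma * rng.standard_normal(n)

Phi = np.column_stack([phi(j, X) for j in range(1, n)])  # M = n - 1
z = Phi.T @ Y / n                      # least squares coefficients under ORT
tau = 4 * sigma / np.sqrt(n)           # illustrative threshold
theta_hat = np.where(np.abs(z) > tau, z, 0.0)
# With high probability only phi_3 (sin term) and phi_4 (cos term) survive;
# their coefficients are close to 1/sqrt(2) and 0.5/sqrt(2).
assert theta_hat[2] != 0 and theta_hat[3] != 0
```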
These observations lead to the following corollary. | |||
{{proofcard|Corollary|COR:adap|Fix <math>\beta > (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math>, let <math>n</math> be large enough to ensure <math>n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil</math>, and assume the general regression model\eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Let <math>\{\varphi_j\}_{j=1}^{n-1}</math> be the trigonometric basis. Denote by <math>\varphi^{n-1}_{\thetabic}</math> (resp. <math>\varphi^{n-1}_{\thetalasso}</math>) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over <math>\R^{n-1}</math> with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then <math>\varphi^{n-1}_{\hat \theta}</math>, where <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, satisfies with probability <math>1-\delta</math>,
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2 \lesssim (n/\log n)^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,, | |||
</math> | |||
where the constants may depend on <math>\beta</math> and <math>Q</math>. | |||
|For <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, adapting the proofs of [[#TH:BIC_mis |Theorem]] for the BIC estimator and [[#TH:lasso_fast_mis |Theorem]] for the Lasso estimator, we get that for any <math>\theta \in \R^{n-1}</math>, with probability <math>1-\delta</math>,
<math display="block"> | |||
|\varphi^{n-1}_{\hat \theta} -f|_2^2 \le \frac{1+\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + R(|\theta|_0)\,,
</math> | |||
where | |||
<math display="block"> | |||
R(|\theta|_0):=\frac{C\sigma^2}{\alpha(1-\alpha)}|\theta|_0\log(en/\delta)\,.
</math> | |||
It yields | |||
<math display="block"> | |||
\begin{align*} | |||
|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2& \le \frac{2\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 +2(\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta)^\top (f - \varphi^{n-1}_\theta)+R(|\theta|_0)\\ | |||
&\le \Big( \frac{2\alpha}{1-\alpha}+\frac{1}{\alpha}\Big)|\varphi^{n-1}_\theta-f|_2^2+\alpha|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2 + R(|\theta|_0)\,, | |||
\end{align*} | |||
</math> | |||
where we used Young's inequality once again. Let <math> M \in \mathbb{N} </math> be an integer to be determined later, choose <math>\alpha=1/2</math>, and take <math>\theta=\theta^*_M</math>, where <math>\theta^*_M</math> is equal to <math>\theta^*</math> on its first <math>M</math> coordinates and <math>0</math> otherwise, so that <math>\varphi^{n-1}_{\theta^*_M}=\varphi^M_{\theta^*}</math>. It yields
<math display="block"> | |||
|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}|_2^2\lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2+ R(M)\lesssim|\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2+|\varphi^{n-1}_{\theta^*}-f|_2^2+ R(M) | |||
</math> | |||
Next, it follows from\eqref{EQ:bias2} that <math>|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}</math>. Together with [[#LEM:regdesign |Lemma]], it yields | |||
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2\lesssim \|\varphi^{n-1}_{\theta^*} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2+Qn^{1-2\beta}+ \frac{R(M)}{n}\,. | |||
</math> | |||
Moreover, using\eqref{EQ:bias1}, we find that | |||
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2\lesssim M^{-2\beta}+Qn^{1-2\beta}+ \frac{M}{n}\sigma^2\log(en/\delta) \,. | |||
</math> | |||
To conclude the proof, choose <math>M=\lceil (n/\log n)^{\frac{1}{2\beta+1}}\rceil</math> and observe that the choice of <math>\beta</math> ensures that <math>n^{1-2\beta} \lesssim M^{-2\beta}</math>.}} | |||
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see <ref name="Tsy09">{{cite journal|last=LeCam|first=L.|journal=Ann. Statist.|year=1973|title=Convergence of estimates under dimensionality restrictions|volume=1|number=5|doi=10.1214/aos/1034276635|publisher=Omnipress|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1108.1170v6}}</ref>{{rp|at=Chapter3}}). The reason why we get this extra logarithmic factor here is because we use a hammer that's too big. Indeed, BIC and Lasso allow for ''holes'' in the Fourier decomposition, so we only get part of their potential benefits.
==General references== | |||
{{cite arXiv|last1=Rigollet|first1=Philippe|last2=Hütter|first2=Jan-Christian|year=2023|title=High-Dimensional Statistics|eprint=2310.19244|class=math.ST}} | |||
==Notes== | |||
{{Reflist|group=Notes}} | |||
==References== | |||
{{reflist}} |
Arguably, the strongest assumption that we made in Chapter Linear Regression Model is that the regression function [math]f(x)[/math] is of the form [math]f(x)=x^\top \theta^*[/math]. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models. Throughout this chapter, we assume the following model:
where [math]\eps=(\eps_1, \ldots, \eps_n)^\top [/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math]. Here [math]X_i \in \R^d[/math]. When dealing with fixed design, it will be convenient to consider the vector [math]g \in \R^n[/math] defined for any function [math]g \,:\, \R^d \to \R[/math] by [math]g=(g(X_1), \ldots, g(X_n))^\top[/math]. In this case, we can write for any estimator [math]\hat f \in \R^n[/math] of [math]f[/math],
Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: [math]\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso[/math]. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of [math]f[/math] if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.
Oracle inequalities
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator [math]\thetals[/math], we constrain ourselves to estimating functions that are of the form [math]x\mapsto x^\top \theta[/math], even though [math]f[/math] itself may not be of this form. Therefore, the oracle [math]\hat f[/math] is the linear function that is the closest to [math]f[/math]. Rather than trying to approximate [math]f[/math] by a linear function [math]f(x)\approx \theta^\top x[/math], we make the model a bit more general and consider a dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] of functions where [math]\varphi_j\,:\, \R^d \to \R[/math]. In this case, we can actually remove the assumption that [math]X \in \R^d[/math]. Indeed, the goal is now to estimate [math]f[/math] using a linear combination of the functions in the dictionary:
If [math]M=d[/math] and [math]\varphi_j(X)=X^{(j)}[/math] returns the [math]j[/math]th coordinate of [math]X \in \R^d[/math], then the goal is to approximate [math]f(x)[/math] by [math]\theta^\top x[/math]. Nevertheless, the use of a dictionary allows for a much more general framework.
Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
- The least squares estimator:
[[math]] \begin{equation} \label{EQ:defthetaLSmis} \thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \end{equation} [[/math]]
- The least squares estimator constrained to [math]K \subset \R^M[/math]:
[[math]] \thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 [[/math]]
- The BIC estimator:
[[math]] \begin{equation} \label{EQ:defBICmis} \thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\} \end{equation} [[/math]]
- The Lasso estimator:
[[math]] \begin{equation} \label{EQ:defLassomis} \thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\} \end{equation} [[/math]]
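The plain least squares estimator over a dictionary is a one-line computation even when the model is misspecified. Below is a minimal numpy sketch (the dictionary, the target function, and all constants are illustrative choices, not from the text): even though [math]f[/math] is not in the span of the dictionary, the least squares fit is well defined and approaches the best linear combination of the atoms, i.e., the oracle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Misspecified setting: the regression function f is NOT in the span of the dictionary.
f = lambda x: np.abs(x - 0.3)
n, sigma = 200, 0.5
X = rng.uniform(0.0, 1.0, size=n)
Y = f(X) + sigma * rng.standard_normal(n)

# Illustrative dictionary H = {phi_1, ..., phi_M}: monomials x^0, ..., x^{M-1}.
M = 5
Phi = np.column_stack([X**j for j in range(M)])  # Phi[i, j] = phi_{j+1}(X_i)

# Least squares over the dictionary: argmin_theta (1/n) sum_i (Y_i - phi_theta(X_i))^2.
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# Empirical residual mean square; the fitted function converges to the oracle
# (best linear combination of the atoms), not to f itself.
mse = np.mean((Phi @ theta_ls - Y) ** 2)
```

The constrained, BIC, and Lasso variants differ only in how this objective is penalized or restricted.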
Let [math]R(\cdot)[/math] be a risk function and let [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] be a dictionary of functions from [math]\R^d[/math] to [math]\R[/math]. Let [math]K[/math] be a subset of [math]\R^M[/math]. The oracle on [math]K[/math] with respect to [math]R[/math] is defined by [math]\varphi_{\bar \theta}[/math], where [math]\bar \theta \in K[/math] is such that
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
Oracle inequality for the least squares estimator
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of extending the analysis for linear models to this case is essentially already captured by the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the least squares estimator [math]\thetals[/math] satisfies for some numerical constant [math]C \gt 0[/math],
Note that by definition
Sparse oracle inequality for the BIC estimator
The analysis for more complicated estimators such as the BIC in Chapter Linear Regression Model allows us to derive oracle inequalities for these estimators.
Recall that the proof of Theorem for the BIC estimator begins as follows:
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error [math]\MSE(\varphi_\theta)[/math] and the complexity of [math]\theta[/math] as measured by its sparsity. In particular, this result, which is sometimes called sparse oracle inequality, implies the following oracle inequality. Define the oracle [math]\bar \theta[/math] to be such that
Then, with probability at least [math]1-\delta[/math],
If the linear model happens to be correct, then we simply have [math]\MSE(\varphi_{\bar \theta})=0[/math].
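For intuition, the BIC criterion \eqref{EQ:defBICmis} can be minimized exactly by enumerating all [math]2^M[/math] supports when [math]M[/math] is tiny. The numpy sketch below does this for an illustrative dictionary and an illustrative penalty level (both hypothetical choices, not prescribed by the text); the selected support trades off empirical MSE against sparsity.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, M, sigma = 100, 6, 0.5

# Toy dictionary of cosines; only two atoms (j = 0 and j = 1) carry signal.
X = rng.uniform(0.0, 1.0, size=n)
Phi = np.column_stack([np.cos(2 * np.pi * j * X) for j in range(M)])
f_vals = 1.0 + 0.8 * np.cos(2 * np.pi * X)
Y = f_vals + sigma * rng.standard_normal(n)

# Illustrative BIC-type penalty level tau^2 (constant chosen for the demo).
tau2 = 4 * sigma**2 * np.log(np.e * M) / n

# Exhaustive minimization of (1/n)|Y - Phi theta|^2 + tau^2 |theta|_0 over supports.
best_score, best_theta = np.inf, np.zeros(M)
for k in range(M + 1):
    for S in combinations(range(M), k):
        theta = np.zeros(M)
        if S:
            sol = np.linalg.lstsq(Phi[:, list(S)], Y, rcond=None)[0]
            theta[list(S)] = sol
        score = np.mean((Y - Phi @ theta) ** 2) + tau2 * k
        if score < best_score:
            best_score, best_theta = score, theta
```

The [math]2^M[/math] enumeration is, of course, only feasible for very small [math]M[/math]; the point of the later ORT discussion is that the minimization becomes trivial for orthogonal designs.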
Sparse oracle inequality for the Lasso
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the [math]n\times M[/math] matrix [math]\Phi[/math] with elements [math]\Phi_{i,j}=\varphi_j(X_i)[/math].
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Moreover, assume that there exists an integer [math]k[/math] such that the matrix [math]\Phi[/math] satisfies assumption [math]\textsf{INC$(k)$}[/math]. Then, the Lasso estimator [math]\thetalasso[/math] with regularization parameter given by
From the definition of [math]\thetalasso[/math], for any [math]\theta \in \R^M[/math],
Maurey's argument
If there is no sparse [math]\theta[/math] such that [math]\MSE(\varphi_{\theta})[/math] is small, Theorem is useless whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors [math]\theta[/math] with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case, for example, if [math]\theta[/math] has a small [math]\ell_1[/math] norm but is not sparse. For such [math]\theta[/math], the Lasso estimator still enjoys slow rates as in Theorem, which can be easily extended to the misspecified case (see Problem). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector [math]\theta \in \R^M[/math] such that [math]|\theta|_1\le 1[/math], there exists a vector [math]\theta'[/math] that is sparse and for which [math]\MSE(\varphi_{\theta'})[/math] is not much larger than [math]\MSE(\varphi_\theta)[/math]. The following theorem quantifies exactly the tradeoff between sparsity and [math]\MSE[/math]. It is often attributed to B. Maurey and was published by Pisier [1]. This is why it is referred to as Maurey's argument.
Let [math]\{\varphi_1, \dots, \varphi_M\}[/math] be a dictionary normalized in such a way that
Define
Maurey's argument implies the following corollary.
Assume that the assumptions of Theorem hold and that the dictionary [math]\{\varphi_1, \dots, \varphi_M\}[/math] is normalized in such a way that
Choosing [math]\alpha=1/3[/math] in Theorem yields
- If [math]1 \le \bar k \le M[/math], then we get
[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} [[/math]]
- If [math]\bar k \le 1[/math], then
[[math]] |\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, [[/math]]which yields[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} [[/math]]
- If [math]\bar k \ge M[/math], then
[[math]] \frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. [[/math]]Therefore, on the one hand, if [math]M \ge \frac{|\theta|_1}{\sigma\sqrt{\log(eM)/n}}[/math], we get[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. [[/math]]On the other hand, if [math]M \le \frac{|\theta|_1}{\sigma\sqrt{\log(eM)/n}}[/math], then for any [math]\theta \in \R^M[/math], we have[[math]] \begin{equation*} \frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. \end{equation*} [[/math]]
Combined,
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the [math]\ell_0[/math] norm as in Theorem. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [2]. Maurey's argument allows us to enjoy the best of both the [math]\ell_0[/math] and the [math]\ell_1[/math] worlds. The rate adapts to the sparsity of the problem and can even be generalized to [math]\ell_q[/math]-sparsity (see Problem). However, it is clear from the proof that this argument is limited to squared [math]\ell_2[/math] norms such as the one appearing in [math]\MSE[/math], and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces [1][3] using more sophisticated arguments.
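Maurey's random sparsification is easy to simulate: draw [math]k[/math] atoms i.i.d. with probabilities proportional to [math]|\theta_j|[/math] and average signed spikes of mass [math]|\theta|_1/k[/math]. The sketch below (the dictionary and the coefficient sequence are illustrative) exhibits the squared approximation error decaying at the [math]|\theta|_1^2/k[/math] rate promised by the argument.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 500, 50

# Normalized dictionary evaluated on a design grid: each atom is bounded by 1.
X = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([np.cos(2 * np.pi * (j + 1) * X) for j in range(M)])

# A dense (non-sparse) theta with polynomially decaying, Zipf-like coefficients.
theta = np.array([(-1.0) ** j / (j + 1) ** 2 for j in range(M)])
l1 = np.sum(np.abs(theta))
target = Phi @ theta

def maurey_sparsify(theta, k, rng):
    """Draw k atoms with prob. proportional to |theta_j|; average signed spikes."""
    p = np.abs(theta) / np.sum(np.abs(theta))
    J = rng.choice(len(theta), size=k, p=p)
    out = np.zeros(len(theta))
    np.add.at(out, J, np.sign(theta[J]) * np.sum(np.abs(theta)) / k)
    return out  # at most k nonzero coordinates, same l1 norm as theta

# Average squared approximation error for a few sparsity levels k.
errs = {k: np.mean([np.mean((Phi @ maurey_sparsify(theta, k, rng) - target) ** 2)
                    for _ in range(50)]) for k in (2, 8, 32)}
```

By construction [math]\E[\varphi_{\theta'}]=\varphi_\theta[/math], so the error is pure variance and shrinks like [math]1/k[/math].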
Nonparametric regression
So far, the oracle inequalities that we have derived do not deal with the approximation error [math]\MSE(\varphi_\theta)[/math]. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] is chosen appropriately. In what follows, we assume for simplicity that [math]d=1[/math] so that [math]f\,:\, \R \to \R[/math] and [math]\varphi_j\,:\, \R \to \R[/math].
Fourier decomposition
Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math]:
We will also deal with the case where [math]M=\infty[/math]. When [math]\cH[/math] is an orthonormal system, the coefficients [math]\theta_j^* \in \R[/math] defined by
are called Fourier coefficients of [math]f[/math]. Assume now that the regression function [math]f[/math] admits the following decomposition
There exist many choices for the orthonormal system, and we give only two examples.
Example Trigonometric basis. This is an orthonormal basis of [math]L_2([0,1])[/math]. It is defined by
for [math]k=1,2, \dots[/math] and [math]x \in [0,1][/math]. The fact that it is indeed an orthonormal system can easily be checked using trigonometric identities.
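Orthonormality can also be confirmed numerically. The sketch below evaluates the pairwise inner products of the first seven trigonometric basis functions by midpoint quadrature on [math][0,1][/math] and recovers the identity Gram matrix (up to floating-point error); the indexing convention used here is one common form of the basis.

```python
import numpy as np

def trig_basis(j, x):
    """phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0
                         else np.sin(2 * np.pi * k * x))

# Midpoint rule on a fine grid: averages approximate integrals over [0, 1].
x = (np.arange(10000) + 0.5) / 10000
G = np.array([[np.mean(trig_basis(j, x) * trig_basis(l, x))
               for l in range(1, 8)] for j in range(1, 8)])
```

The Gram matrix `G` should be (numerically) the [math]7\times 7[/math] identity, reflecting [math]\int_0^1 \varphi_j\varphi_{j'}=\delta_{jj'}[/math].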
The next example has received a lot of attention in the signal (sound, image, ...) processing community.
Example Wavelets. Let [math]\psi\,:\, \R \to \R[/math] be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions
It can be shown that for a suitable [math]\psi[/math], the dictionary [math]\{\psi_{j,k}, j,k \in \Z\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math] and sometimes a basis. In the latter case, for any function [math]g \in L_2([0,1])[/math], it holds
The coefficients [math]\theta_{jk}[/math] are called wavelet coefficients of [math]g[/math]. The simplest example is given by the Haar system obtained by taking [math]\psi[/math] to be the following piecewise constant function (see Figure). We will not give more details about wavelets here but simply point the interested reader to [4].
[[File: | 400px | thumb | The Haar mother wavelet ]]
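For the Haar system specifically, orthonormality of the [math]\psi_{j,k}[/math] can be checked directly, since the functions are piecewise constant. A short numpy sketch (the selected pairs [math](j,k)[/math] are arbitrary):

```python
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x)
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def psi(j, k, x):
    """Rescaled and translated wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * haar_mother(2.0 ** j * x - k)

# Check orthonormality on [0, 1] by midpoint quadrature on a dyadic grid.
x = (np.arange(2**16) + 0.5) / 2**16
pairs = [(0, 0), (1, 0), (1, 1), (2, 3)]
G = np.array([[np.mean(psi(j, k, x) * psi(jj, kk, x))
               for (jj, kk) in pairs] for (j, k) in pairs])
```

Because every function involved is constant on dyadic intervals, the quadrature is exact and `G` equals the [math]4\times4[/math] identity.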
Sobolev classes and ellipsoids
We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that [math]f^{(k)}[/math] denotes the [math]k[/math]-th derivative of [math]f[/math].
Fix parameters [math]\beta \in \{1, 2, \dots\}[/math] and [math]L \gt 0[/math]. The Sobolev class of functions [math]W(\beta,L)[/math] is defined by
Any function [math]f \in W(\beta, L)[/math] can be represented[Notes 1] as its Fourier expansion along the trigonometric basis:
where [math]\theta^*=\{\theta_j^*\}_{j \ge 1}[/math] is in the space of squared summable sequences [math]\ell_2(\N)[/math] defined by
For any [math]\beta \gt 0[/math], define the coefficients
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.
Fix [math]\beta\ge 1[/math] and [math]L \gt 0[/math] and let [math]\{\varphi_j\}_{j \ge 1}[/math] denote the trigonometric basis of [math]L_2([0,1])[/math]. Moreover, let [math]\{a_j\}_{j\ge 1}[/math] be defined as in\eqref{EQ:defaj}. A function [math]f \in W(\beta,L)[/math] can be represented as
Let us first define the Fourier coefficients [math]\{s_k(j)\}_{k \ge 1}[/math] of the [math]j[/math]th derivative [math]f^{(j)}[/math] of [math]f[/math] for [math]j=1, \ldots, \beta[/math]:
It can actually be shown that the converse is true, that is, any function with Fourier coefficients in [math]\Theta(\beta, Q)[/math] belongs to [math]W(\beta,L)[/math], but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write [math]f \in \Theta(\beta, Q)[/math] in this case and consider any real value for [math]\beta[/math].
The Sobolev ellipsoids enjoy the following properties
- For any [math]Q \gt 0[/math],
[[math]] 0 \lt \beta' \lt \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). [[/math]]
- For any [math]Q \gt 0[/math],
[[math]] f \in \Theta(\beta, Q) \ \text{with}\ \beta \gt \frac12 \ \Rightarrow \ f\ \text{is continuous}. [[/math]]
The proof is left as an exercise (Problem). It turns out that the first [math]n-1[/math] functions in the trigonometric basis are not only orthonormal with respect to the inner product of [math]L_2[/math], but also with respect to the inner product associated with the fixed design, [math]\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)[/math], when the design is chosen to be regular, i.e., [math]X_i=(i-1)/n[/math], [math]i=1, \ldots, n[/math].
Assume that [math]\{X_1, \ldots, X_n\}[/math] is the regular design, i.e., [math]X_i=(i-1)/n[/math]. Then, for any [math]M\le n-1[/math], the design matrix [math]\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}[/math] satisfies the [math]\textsf{ORT}[/math] condition.
Fix [math]j,j'\in \{1, \ldots, n-1\}[/math], [math]j\neq j'[/math], and consider the inner product [math]\varphi_j^\top \varphi_{j'}[/math]. Write [math]k_{j}=\lfloor j/2\rfloor[/math] for the integer part of [math]j/2[/math] and define the vectors [math]a, b, a', b' \in \R^n[/math] with coordinates such that [math]e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}[/math] and [math]e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}[/math] for [math] s \in \{0, \dots, n-1\} [/math]. It holds that
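The conclusion of this lemma is easy to verify numerically: on the regular design, the Gram matrix of the first [math]n-1[/math] trigonometric basis functions equals [math]n I_{n-1}[/math], which is the [math]\textsf{ORT}[/math] condition. A numpy sketch with an illustrative choice of [math]n[/math]:

```python
import numpy as np

n = 64
X = np.arange(n) / n          # regular design: X_i = (i - 1)/n
M = n - 1                     # the lemma requires M <= n - 1

def trig(j, x):
    """phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])
# ORT: Phi^T Phi = n I_M on the regular design.
gram = Phi.T @ Phi
```

The check fails for [math]M = n[/math], which is why the lemma stops at [math]n-1[/math] basis functions.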
Integrated squared error
As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if [math]\theta \in \Theta(\beta, Q)[/math], then [math]a_j^2\theta_j^2 \to 0[/math] as [math]j \to \infty[/math] so that [math]|\theta_j|=o(j^{-\beta})[/math]. Therefore, the [math]\theta_j[/math]s decay polynomially to zero and it makes sense to approximate [math]f[/math] by its truncated Fourier series
for any fixed [math]M[/math]. This truncation leads to a systematic error that vanishes as [math]M \to \infty[/math]. We are interested in understanding the rate at which this happens.
The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter [math]M[/math] and the smoothness [math]\beta[/math].
Note that for any [math]\theta \in \Theta(\beta, Q)[/math], if [math]\beta \gt 1/2[/math], then
To prove the second part of the lemma, observe that
Note that the truncated Fourier series [math]\varphi_{\theta^*}^M[/math] is an oracle: this is what we see when we view [math]f[/math] through the lens of functions with only low frequency harmonics. To estimate [math]\varphi_{\theta^*}^M[/math], consider the estimator [math]\varphi_{\thetals}[/math] where
which should be such that [math]\varphi_{\thetals}[/math] is close to [math]\varphi_{\theta^*}[/math]. For this estimator, we have proved (Theorem) an oracle inequality for the [math]\MSE[/math] that is of the form
It yields
where we used Lemma in the last equality. Together with\eqref{EQ:bias2} and Young's inequality [math]2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0[/math] for any [math]\alpha \gt 0[/math], we get
for some positive constant [math]C[/math] when [math]\theta^* \in \Theta(\beta, Q)[/math]. As a result,
for any [math]\alpha \in (0,1)[/math]. Since Lemma implies [math]|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2[/math], we have proved the following theorem.
Fix [math]\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Moreover, let [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] and [math]n[/math] be large enough so that [math]M \le n-1[/math]. Then the least squares estimator [math]\thetals[/math] defined in\eqref{EQ:defthetaLSmis} with [math]\{\varphi_j\}_{j=1}^M[/math] being the trigonometric basis, satisfies with probability [math]1-\delta[/math], for [math]n[/math] large enough,
Choosing [math]\alpha=1/2[/math] for example and absorbing [math]Q[/math] in the constants, we get from\eqref{EQ:almostLSNP} and Lemma that for [math]M \le n-1[/math],
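The resulting procedure is simple to implement: under [math]\textsf{ORT}[/math], the least squares coefficients are just rescaled inner products of [math]Y[/math] with the basis columns, [math]\thetals = \frac1n \Phi^\top Y[/math]. A numpy sketch with an illustrative smooth target and [math]\beta=2[/math] (the target function and constants are hypothetical choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma = 2.0, 0.5
n = 512
X = np.arange(n) / n                                        # regular design
f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)
Y = f(X) + sigma * rng.standard_normal(n)

def trig(j, x):
    """Trigonometric basis with the usual indexing."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

# Optimal truncation level M = ceil(n^{1/(2 beta + 1)}).
M = int(np.ceil(n ** (1 / (2 * beta + 1))))
Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])

theta_hat = Phi.T @ Y / n                                   # least squares under ORT
mse = np.mean((Phi @ theta_hat - f(X)) ** 2)                # in-sample estimation error
```

The error splits into the squared bias of the truncation and a variance of order [math]\sigma^2 M/n[/math], matching the tradeoff behind the choice of [math]M[/math].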
Adaptive estimation
The rate attained by the projection estimator [math]\varphi_{\thetals}[/math] with [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter [math]\beta[/math], which is typically unknown, in order to determine the truncation level [math]M[/math]. The purpose of adaptive estimation is precisely to adapt to the unknown [math]\beta[/math], that is, to build an estimator that does not depend on [math]\beta[/math] and yet attains a rate of the order of [math]Cn^{-\frac{2\beta}{2\beta+1}}[/math] (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis} respectively. In view of Lemma, the design matrix [math]\Phi[/math] actually satisfies the assumption [math]\textsf{ORT}[/math] when we work with the trigonometric basis. This has two useful implications:
- Both estimators are actually thresholding estimators and can therefore be implemented efficiently
- The condition [math]\textsf{INC($k$)}[/math] is automatically satisfied for any [math]k \ge 1[/math].
These observations lead to the following corollary.
Fix [math]\beta \gt (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and [math]n[/math] large enough to ensure [math]n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil[/math], and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Let [math]\{\varphi_j\}_{j=1}^{n-1}[/math] be the trigonometric basis. Denote by [math]\varphi^{n-1}_{\thetabic}[/math] (resp. [math]\varphi^{n-1}_{\thetalasso}[/math]) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over [math]\R^{n-1}[/math] with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then [math]\varphi^{n-1}_{\hat \theta}[/math], where [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], satisfies with probability [math]1-\delta[/math],
For [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], adapting the proofs of Theorem for the BIC estimator and Theorem for the Lasso estimator, for any [math]\theta \in \R^{n-1}[/math], with probability [math]1-\delta[/math]
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [5] (Chapter 3)). The reason why we get this extra logarithmic factor here is that we use a hammer that is too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.
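Under [math]\textsf{ORT}[/math], the BIC and Lasso estimators reduce to hard and soft thresholding, respectively, of the empirical Fourier coefficients, which is what makes the adaptive procedure cheap to run. A numpy sketch with an illustrative universal-type threshold (the target function, threshold constant, and sample size are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 512, 0.5
X = np.arange(n) / n                                        # regular design
f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)
Y = f(X) + sigma * rng.standard_normal(n)

def trig(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

M = n - 1
Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])
coef = Phi.T @ Y / n                          # empirical Fourier coefficients (ORT)

tau = 2 * sigma * np.sqrt(np.log(n) / n)      # illustrative universal-type threshold
theta_bic = np.where(np.abs(coef) > tau, coef, 0.0)                 # hard thresholding
theta_lasso = np.sign(coef) * np.maximum(np.abs(coef) - tau, 0.0)   # soft thresholding

mse_hard = np.mean((Phi @ theta_bic - f(X)) ** 2)
mse_full = np.mean((Phi @ coef - f(X)) ** 2)  # no truncation: pure variance, much worse
```

Note that thresholding keeps whichever frequencies are large, without knowing [math]\beta[/math]; "holes" in the spectrum are allowed, which is exactly the extra flexibility discussed above.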
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
Notes
References
- 1.0 1.1 Pisier, Gilles (1981). "Remarques sur un résultat non publié de B. Maurey". Séminaire d'Analyse fonctionnelle, École Polytechnique.
- "Exponential screening and optimal rates of sparse estimation" (2011). Ann. Statist. 39. John Wiley & Sons Inc.. doi: .
- van Handel, Ramon (2016). "Probability in High Dimension". APC 550 Lecture Notes, Princeton University.
- Mallat, Stéphane (2009). "A Wavelet Tour of Signal Processing: The Sparse Way". Academic Press.
- Tsybakov, Alexandre B. (2009). "Introduction to Nonparametric Estimation". Springer Series in Statistics. Springer, New York.