<div class="d-none">
<math>
\newcommand{\DS}{\displaystyle}
\newcommand{\intbeta}{\lfloor \beta \rfloor}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cD}{\mathcal{D}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\cG}{\mathcal{G}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cJ}{\mathcal{J}}
\newcommand{\cK}{\mathcal{K}}
\newcommand{\cL}{\mathcal{L}}
\newcommand{\cM}{\mathcal{M}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cO}{\mathcal{O}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cQ}{\mathcal{Q}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\cU}{\mathcal{U}}
\newcommand{\cV}{\mathcal{V}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\cZ}{\mathcal{Z}}
\newcommand{\sg}{\mathsf{subG}}
\newcommand{\subE}{\mathsf{subE}}
\newcommand{\bA}{\mathbf{A}}
\newcommand{\bB}{\mathbf{B}}
\newcommand{\bC}{\mathbf{C}}
\newcommand{\bD}{\mathbf{D}}
\newcommand{\bE}{\mathbf{E}}
\newcommand{\bF}{\mathbf{F}}
\newcommand{\bG}{\mathbf{G}}
\newcommand{\bH}{\mathbf{H}}
\newcommand{\bI}{\mathbf{I}}
\newcommand{\bJ}{\mathbf{J}}
\newcommand{\bK}{\mathbf{K}}
\newcommand{\bM}{\mathbf{M}}
\newcommand{\bN}{\mathbf{N}}
\newcommand{\bO}{\mathbf{O}}
\newcommand{\bP}{\mathbf{P}}
\newcommand{\bp}{\mathbf{p}}
\newcommand{\bQ}{\mathbf{Q}}
\newcommand{\bS}{\mathbf{S}}
\newcommand{\bT}{\mathbf{T}}
\newcommand{\bU}{\mathbf{U}}
\newcommand{\bV}{\mathbf{V}}
\newcommand{\bX}{\mathbf{X}}
\newcommand{\bY}{\mathbf{Y}}
\newcommand{\bZ}{\mathbf{Z}}
\newcommand{\bflambda}{\boldsymbol{\lambda}}
\newcommand{\bftheta}{\boldsymbol{\theta}}
\newcommand{\bfg}{\boldsymbol{g}}
\newcommand{\bfy}{\boldsymbol{y}}
\def\thetaphat{\hat{\bftheta}_\bp}
\def\bflam{\boldsymbol{\lambda}}
\def\Lam{\Lambda}
\def\lam{\lambda}
\def\bfpi{\boldsymbol{\pi}}
\def\bfz{\boldsymbol{z}}
\def\bfw{\boldsymbol{w}}
\def\bfeta{\boldsymbol{\eta}}
\newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}}
\newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}}
\newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}}
\newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}}
\newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}}
%\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}}
\newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}}
\newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}}
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\q}{\field{Q}}
\newcommand{\Z}{\field{Z}}
\newcommand{\X}{\field{X}}
\newcommand{\Y}{\field{Y}}
\newcommand{\bbS}{\field{S}}
\newcommand{\n}{\mathcal{N}}
\newcommand{\x}{\mathcal{X}}
\newcommand{\pp}{{\sf p}}
\newcommand{\hh}{{\sf h}}
\newcommand{\ff}{{\sf f}}
\newcommand{\Bern}{\mathsf{Ber}}
\newcommand{\Bin}{\mathsf{Bin}}
\newcommand{\Lap}{\mathsf{Lap}}
\newcommand{\tr}{\mathsf{Tr}}
\newcommand{\phin}{\varphi_n}
\newcommand{\phinb}{\overline \varphi_n(t)}
\newcommand{\pn}{\p_{\kern-0.25em n}}
\newcommand{\pnm}{\p_{\kern-0.25em n,m}}
\newcommand{\psubm}{\p_{\kern-0.25em m}}
\newcommand{\psubp}{\p_{\kern-0.25em p}}
\newcommand{\cfi}{\cF_{\kern-0.25em \infty}}
\newcommand{\e}{\mathrm{e}}
\newcommand{\ic}{\mathrm{i}}
\newcommand{\Leb}{\mathrm{Leb}_d}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\ddelta}{d_{\symdiffsmall}}
\newcommand{\dsubh}{d_H}
\newcommand{\indep}{\perp\kern-0.95em{\perp}}
\newcommand{\supp}{\mathop{\mathrm{supp}}}
\newcommand{\rank}{\mathop{\mathrm{rank}}}
\newcommand{\vol}{\mathop{\mathrm{vol}}}
\newcommand{\conv}{\mathop{\mathrm{conv}}}
\newcommand{\card}{\mathop{\mathrm{card}}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand{\ud}{\mathrm{d}}
\newcommand{\var}{\mathrm{var}}
\newcommand{\re}{\mathrm{Re}}
\newcommand{\MSE}{\mathsf{MSE}}
\newcommand{\im}{\mathrm{Im}}
\newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt
height 6pt depth 1.5pt}\vspace{0.5cm}\par}
\newcommand{\bi}[1]{^{(#1)}}
\newcommand{\eps}{\varepsilon}
\newcommand{\Deq}{\stackrel{\mathcal{D}}{=}}
\newcommand{\ubar}{\underbar}
\newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}}
\newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}}
\newcommand{\hint}[1]{\texttt{[Hint:#1]}}
\newcommand{\vp}{\vspace{.25cm}}
\newcommand{\vm}{\vspace{.5cm}}
\newcommand{\vg}{\vspace{1cm}}
\newcommand{\vgneg}{\vspace{-1cm}}
\newcommand{\vneg}{\vspace{-.5cm}}
\newcommand{\vpneg}{\vspace{-.25cm}}
\newcommand{\tp}{\ptsize{10}}
\newcommand{\douzp}{\ptsize{12}}
\newcommand{\np}{\ptsize{9}}
\newcommand{\hp}{\ptsize{8}}
\newcommand{\red}{\color{red}}
\newcommand{\ndpr}[1]{{\textsf{\red[#1]}}}
\newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} }
\newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em}
\newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber}
\newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
\newcommand{\MIT}[1]{{\color{MITred} #1}}
\newcommand{\dHyp}{\{-1,1\}^d}
\newcommand{\thetahard}{\hat \theta^{hrd}}
\newcommand{\thetasoft}{\hat \theta^{sft}}
\newcommand{\thetabic}{\hat \theta^{bic}}
\newcommand{\thetalasso}{\hat \theta^{\cL}}
\newcommand{\thetaslope}{\hat \theta^{\cS}}
\newcommand{\thetals}{\hat \theta^{ls}}
\newcommand{\thetalsm}{\tilde \theta^{ls_X}}
\newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau}
\newcommand{\thetalsK}{\hat \theta^{ls}_K}
\newcommand{\muls}{\hat \mu^{ls}}
</math>
</div>
Arguably, the strongest assumption that we made in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] is that the regression function <math>f(x)</math> is of the form <math>f(x)=x^\top \theta^*</math>. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be ''robust'' to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models.
Throughout this chapter, we assume the following model:
<span id="EQ:regmodgen"/>
<math display="block">
\begin{equation}
\label{EQ:regmodgen}
Y_i=f(X_i)+\eps_i,\quad i=1, \ldots, n\,,
\end{equation}
</math>
where <math>\eps=(\eps_1, \ldots, \eps_n)^\top </math> is sub-Gaussian with variance proxy <math>\sigma^2</math>. Here <math>X_i \in \R^d</math>.
When dealing with fixed design, it will be convenient to consider the vector <math>g \in \R^n</math> defined for any function <math>g \,:\, \R^d \to \R</math> by <math>g=(g(X_1), \ldots, g(X_n))^\top</math>. In this case, we can write for any estimator <math>\hat f \in \R^n</math> of <math>f</math>,
<math display="block">
\MSE(\hat f)=\frac1n|\hat f - f|_2^2\,.
</math>
Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: <math>\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso</math>. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of <math>f</math> if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using ''oracle inequalities''.
==Oracle inequalities==
===Oracle inequalities===
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without knowledge of the quantity of interest, here the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models.
When employing the least squares estimator <math>\thetals</math>, we constrain ourselves to estimating functions that are of the form <math>x\mapsto x^\top \theta</math>, even though <math>f</math> itself may not be of this form. Therefore, the oracle <math>\hat f</math> is the linear function that is the closest to <math>f</math>.
Rather than trying to approximate <math>f</math> by a linear function <math>f(x)\approx \theta^\top x</math>, we make the model a bit more general and consider a dictionary <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> of functions where <math>\varphi_j\,:\, \R^d \to \R</math>. In this case, we can actually remove the assumption that <math>X \in \R^d</math>. Indeed, the goal is now to estimate <math>f</math> using a linear combination of the functions in the dictionary:
<math display="block">
f \approx \varphi_\theta:=\sum_{j=1}^M \theta_j \varphi_j\,.
</math>
{{alert-info |
If <math>M=d</math> and <math>\varphi_j(X)=X^{(j)}</math> returns the <math>j</math>th coordinate of <math>X \in \R^d</math> then the goal is to approximate <math>f(x)</math> by <math>\theta^\top x</math>. Nevertheless, the use of a dictionary allows for a much more general framework.
}}
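As a concrete, entirely hypothetical illustration of this framework, the following sketch builds the matrix with entries <math>\varphi_j(X_i)</math> for a small polynomial dictionary and computes the least squares fit over linear combinations <math>\varphi_\theta</math>; the regression function, the dictionary, and the sample sizes are choices made for the example only.

```python
import numpy as np

# Hypothetical example: a nonlinear regression function f(x) = cos(pi x)
# approximated over the polynomial dictionary {phi_1 = 1, phi_2 = x, phi_3 = x^2}.
rng = np.random.default_rng(0)
n, M, sigma = 200, 3, 0.1
X = rng.uniform(-1.0, 1.0, n)
f = np.cos(np.pi * X)                      # true regression function, not linear
Y = f + sigma * rng.standard_normal(n)     # Y_i = f(X_i) + eps_i

# Design matrix with entries Phi[i, j] = phi_j(X_i).
Phi = np.column_stack([np.ones(n), X, X ** 2])

# Least squares over dictionary coefficients: phi_theta = Phi @ theta.
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

def mse(theta):
    """(1/n) |phi_theta - f|_2^2, the fixed-design MSE used in the text."""
    return float(np.mean((Phi @ theta - f) ** 2))
```

Even though <math>f</math> is not in the span of the dictionary, the fit tracks the best quadratic approximation of <math>f</math>, which is precisely the behavior that the oracle inequalities of this chapter quantify.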
Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
<ul><li> The least squares estimator:
<span id="EQ:defthetaLSmis"/>
<math display="block">
\begin{equation}
\label{EQ:defthetaLSmis}
\thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
\end{equation}
</math>
</li>
<li> The least squares estimator constrained to <math>K \subset \R^M</math>:
<math display="block">
\thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
</math>
</li>
<li> The BIC estimator:
<span id="EQ:defBICmis"/>
<math display="block">
\begin{equation}
\label{EQ:defBICmis}
\thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\}
\end{equation}
</math>
</li>
<li> The Lasso estimator:
<span id="EQ:defLassomis"/>
<math display="block">
\begin{equation}
\label{EQ:defLassomis}
\thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\}
\end{equation}
</math>
</li>
</ul>
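Numerically, the Lasso objective above can be minimized by proximal gradient descent (ISTA); this is one standard algorithmic route among several, and the sketch below is a minimal version of it (the problem sizes, noise level, and <math>\tau</math> are hypothetical choices for illustration).

```python
import numpy as np

def soft_threshold(v, s):
    """Proximal operator of s * |.|_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def lasso_ista(Phi, Y, tau, n_iter=500):
    """ISTA sketch for argmin_theta (1/n)|Y - Phi theta|_2^2 + 2 tau |theta|_1."""
    n, M = Phi.shape
    step = n / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz const. of the gradient
    theta = np.zeros(M)
    for _ in range(n_iter):
        grad = (2.0 / n) * Phi.T @ (Phi @ theta - Y)
        theta = soft_threshold(theta - step * grad, 2.0 * tau * step)
    return theta

def lasso_obj(theta, Phi, Y, tau):
    """The penalized objective being minimized."""
    return float(np.mean((Y - Phi @ theta) ** 2) + 2.0 * tau * np.abs(theta).sum())

# Hypothetical demo: a sparse linear signal plus noise.
rng = np.random.default_rng(1)
n, M = 100, 50
Phi = rng.standard_normal((n, M))
theta_true = np.zeros(M)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = Phi @ theta_true + 0.1 * rng.standard_normal(n)
theta_hat = lasso_ista(Phi, Y, tau=0.05)
```

Each iteration takes a gradient step on the quadratic term and applies soft-thresholding, the proximal operator of the <math>\ell_1</math> penalty; the BIC objective, by contrast, is not amenable to this scheme because of the nonconvex <math>|\theta|_0</math> penalty.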
{{defncard|label=|id=def:oracle|
Let <math>R(\cdot)</math> be a risk function and let <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> be a dictionary of functions from <math>\R^d</math> to <math>\R</math>. Let <math>K</math> be a subset of <math>\R^M</math>. The ''oracle'' on <math>K</math> with respect to <math>R</math> is defined by <math>\varphi_{\bar \theta}</math>, where <math>\bar \theta \in K</math> is such that
<math display="block">
R(\varphi_{\bar \theta}) \le R(\varphi_\theta) \,, \qquad \forall\, \theta \in K\,.
</math>
Moreover, <math>R_{K}=R(\varphi_{\bar \theta})</math> is called the ''oracle risk'' on <math>K</math>. An estimator <math>\hat f</math> is said to satisfy an oracle inequality (over <math>K</math>) with remainder term <math>\phi</math> in expectation (resp. with high probability) if there exists a constant <math>C\ge 1</math> such that
<math display="block">
\E R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M}(K)\,,
</math>
or
<math display="block">
\p\big\{R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K) \big\} \ge 1-\delta \,, \qquad \forall \ \delta > 0\,,
</math>
respectively. If <math>C=1</math>, the oracle inequality is sometimes called ''exact''.}}
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
===Oracle inequality for the least squares estimator===
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of the extension to this case is essentially already captured by the analysis of the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
{{proofcard|Theorem|TH:LS_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the least squares estimator <math>\thetals</math> satisfies for some numerical constant <math>C > 0</math>,
<math display="block">
\MSE(\varphi_{\thetals})\le \inf_{\theta \in \R^M}\MSE(\varphi_\theta)+C\frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability at least <math>1-\delta</math>.
|Note that by definition
<math display="block">
|Y-\varphi_{\thetals}|_2^2 \le |Y-\varphi_{\bar \theta}|_2^2\,,
</math>
where <math>\varphi_{\bar \theta}</math> denotes the orthogonal projection of <math>f</math> onto the linear span of <math>\varphi_1, \ldots, \varphi_M</math>. Since <math>Y=f+\eps</math>, we get
<math display="block">
|f-\varphi_{\thetals}|_2^2 \le |f-\varphi_{\bar \theta}|_2^2+2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Moreover, by Pythagoras's theorem, we have
<math display="block">
|f-\varphi_{\thetals}|_2^2 - |f-\varphi_{\bar \theta}|_2^2=|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\,.
</math>
It yields
<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\le 2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,.
</math>
Using the same steps as the ones following equation [[guide:090050c501#EQ:bound_ls_1 |EQ:bound_ls_1]] for the well-specified case, we get
<math display="block">
|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\lesssim \frac{\sigma^2M}{n}\log(1/\delta)
</math>
with probability <math>1-\delta</math>. The result of the theorem follows.}}
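The rate in this theorem can be checked by a small simulation (an illustration under assumed parameter choices, not part of the proof): since <math>\varphi_{\thetals}</math> is the projection of <math>Y</math> onto the span of the dictionary, the excess risk over the oracle equals <math>|P\eps|_2^2/n</math> for the projection matrix <math>P</math>, which concentrates at order <math>\sigma^2M/n</math>.

```python
import numpy as np

# Illustrative Monte Carlo check (hypothetical parameter choices): the excess
# risk MSE(phi_thetals) - MSE(phi_thetabar) should be of order sigma^2 * M / n.
rng = np.random.default_rng(2)
n, M, sigma = 500, 10, 1.0
X = rng.uniform(-1.0, 1.0, n)
f = np.exp(X)                                      # misspecified: f is not in the span
Phi = np.column_stack([X ** j for j in range(M)])  # polynomial dictionary
P = Phi @ np.linalg.pinv(Phi)                      # projection onto span(phi_1, ..., phi_M)
oracle_mse = np.mean((f - P @ f) ** 2)             # MSE of the oracle phi_thetabar

excesses = []
for _ in range(200):
    Y = f + sigma * rng.standard_normal(n)
    f_hat = P @ Y                                  # phi_thetals is the projection of Y
    excesses.append(np.mean((f - f_hat) ** 2) - oracle_mse)
mean_excess = float(np.mean(excesses))             # concentrates around sigma^2 * M / n
```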
===Sparse oracle inequality for the BIC estimator===
The analysis for more complicated estimators such as the BIC in Chapter [[guide:090050c501#chap:GSM |Linear Regression Model]] allows us to derive oracle inequalities for these estimators.
{{proofcard|Theorem|TH:BIC_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Then, the BIC estimator <math>\thetabic</math> with regularization parameter
<span id="EQ:taubicmis"/>
<math display="block">
\begin{equation}
\label{EQ:taubicmis}
\tau^2=16\log(6)\frac{\sigma^2}{n} + 32\frac{\sigma^2 \log (eM)}{n}
\end{equation}
</math>
satisfies for some numerical constant <math>C > 0</math>,
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{3\MSE(\varphi_\theta)+&\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\}+ \frac{C\sigma^2}{n}\log(1/\delta)
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|Recall that the proof of [[guide:090050c501#TH:BIC |Theorem]] for the BIC estimator begins as follows:
<math display="block">
\frac1n|Y-\varphi_{\thetabic}|_2^2 + \tau^2|\thetabic|_0 \le \frac1n|Y-\varphi_{\theta}|_2^2 + \tau^2|\theta|_0\,.
</math>
This is true for any <math>\theta \in \R^M</math>.
It implies
<math display="block">
|f-\varphi_{\thetabic}|_2^2 + n\tau^2|\thetabic|_0 \le |f-\varphi_{ \theta}|_2^2+2\eps^\top(\varphi_{\thetabic}-\varphi_{ \theta}) + n\tau^2|\theta|_0\,.
</math>
Note that if <math>\thetabic= \theta</math>, the result is trivial. Otherwise,
<math display="block">
\begin{align*}
2\eps^\top(\varphi_{\thetabic}-\varphi_\theta) &=2\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)|\varphi_{\thetabic}-\varphi_\theta|_2\\
&\le \frac2\alpha\Big[\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)\Big]^2 + \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\,,
\end{align*}
</math>
where we used Young's inequality <math>2ab\le \frac2\alpha a^2+\frac{\alpha}{2} b^2</math>, which is valid for <math>a,b \in \R</math> and <math>\alpha > 0</math>.
Next, since
<math display="block">
\frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\le \alpha|\varphi_{\thetabic}-f|_2^2+\alpha|\varphi_\theta-f|_2^2\,,
</math>
we get for <math>\alpha=1/2</math>, writing <math>\cU(x)=x/|x|_2</math>,
<math display="block">
\begin{align*}
\frac12|\varphi_{\thetabic}-f|_2^2 &\le \frac32|\varphi_{ \theta}-f|_2^2+n\tau^2|\theta|_0 +4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic|_0\\
&\le \frac32|\varphi_{ \theta}-f|_2^2+2n\tau^2|\theta|_0+4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic-\theta|_0.
\end{align*}
</math>
We conclude as in the proof of [[guide:090050c501#TH:BIC |Theorem]].}}
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error <math>\MSE(\varphi_\theta)</math> and the complexity of <math>\theta</math> as measured by its sparsity. In particular, this result, which is sometimes called ''sparse oracle inequality'', implies the following oracle inequality. Define the oracle <math>\bar \theta</math> to be such that
<math display="block">
\MSE(\varphi_{\bar \theta})=\min_{\theta \in \R^M}\MSE(\varphi_\theta)\,.
</math>
Then, with probability at least <math>1-\delta</math>,
<math display="block">
\MSE(\varphi_{\thetabic})\le 3\MSE(\varphi_{\bar \theta})+\frac{C\sigma^2}{n}\Big[|\bar \theta|_0\log (eM)+ \log(1/\delta)\Big]\,.
</math>
If the linear model happens to be correct, then we simply have <math>\MSE(\varphi_{\bar \theta})=0</math>.
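The tradeoff that this bound captures can also be seen numerically. In the hypothetical sketch below (a triangle-wave <math>f</math>, a cosine dictionary, and a greedy support selection used as a cheap stand-in for the exact <math>\ell_0</math> oracle), the approximation error decreases in the sparsity <math>k</math> while the complexity term <math>\sigma^2 k\log(eM)/n</math> grows; their sum is what the BIC estimator mimics.

```python
import numpy as np

# Hypothetical illustration of the approximation/sparsity tradeoff.
n, M, sigma = 256, 64, 0.5
X = np.linspace(0.0, 1.0, n)
f = np.minimum(X, 1.0 - X)        # triangle wave: not sparse in the dictionary,
                                  # but well approximated by a few cosines
Phi = np.column_stack([np.cos(np.pi * j * X) for j in range(M)])

def best_fit_mse(S):
    """Approximation error (1/n)|phi_theta - f|_2^2 of the least squares
    fit of f over the atoms indexed by S."""
    coefs, *_ = np.linalg.lstsq(Phi[:, S], f, rcond=None)
    return float(np.mean((Phi[:, S] @ coefs - f) ** 2))

# Greedy stand-in for the l0 oracle: rank atoms by correlation with f.
order = np.argsort(-np.abs(Phi.T @ f))
ks = [1, 2, 4, 8, 16, 32]
approx = [best_fit_mse(order[:k]) for k in ks]               # decreasing in k
penalty = [sigma ** 2 * k * np.log(np.e * M) / n for k in ks]  # increasing in k
tradeoff = [a + p for a, p in zip(approx, penalty)]
```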
===Sparse oracle inequality for the Lasso===
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the <math>n\times M</math> matrix <math>\Phi</math> with elements <math>\Phi_{i,j}=\varphi_j(X_i)</math>.
{{proofcard|Theorem|TH:lasso_fast_mis|Assume the general regression model \eqref{EQ:regmodgen} with <math>\eps\sim\sg_n(\sigma^2)</math>. Moreover, assume that there exists an integer <math>k</math> such that the matrix <math>\Phi</math> satisfies assumption <math>\textsf{INC$(k)$}</math>. Then, the Lasso estimator <math>\thetalasso</math> with regularization parameter given by
<span id{{=}}"EQ:taulassomis"/>
<math display="block">
\begin{equation}
\label{EQ:taulassomis}
2\tau=8\sigma\sqrt{\frac{2\log(2M)}{n}}+8\sigma\sqrt{\frac{2\log(1/\delta)}{n}}
\end{equation}
</math>
satisfies for some numerical constant <math>C</math>,
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetalasso})\le \inf_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\Big\{ \MSE(\varphi_\theta) &+\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\}
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|From the definition of <math>\thetalasso</math>, for any <math>\theta \in \R^M</math>,
<math display="block">
\begin{align*}
\frac1n|Y-\varphi_{\thetalasso}|_2^2
\le \frac1n|Y-\varphi_{\theta}|_2^2 + 2\tau|\theta|_1- 2\tau|\thetalasso|_1 \,.
\end{align*}
</math>
Expanding the squares, adding <math>\tau|\thetalasso-\theta|_1</math> on each side and multiplying by <math>n</math>, we get
<span id{{=}}"EQ:pr:TH:lassofastmis1"/>
<math display="block">
\begin{align}
\leadeq{|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1}\\
\le {} & 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) +n\tau|\thetalasso-\theta|_1+ 2n\tau|\theta|_1- 2n\tau|\thetalasso|_1 \,.
\label{EQ:pr:TH:lassofastmis1}
\end{align}
</math>
Next, note that \textsf{INC}<math>(k)</math> for any <math>k \ge 1</math> implies that <math>|\varphi_j|_2 \le 2\sqrt{n}</math> for all <math>j=1, \ldots, M</math>. Applying Hölder's inequality, we get that with probability <math>1-\delta</math>, it holds that
<math display="block">
2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) \le \frac{n\tau}2|\thetalasso-\theta|_1\,.
</math>
Therefore, taking <math>S=\supp(\theta)</math> to be the support of <math>\theta</math>, we get that the right-hand side of \eqref{EQ:pr:TH:lassofastmis1} is bounded by
<span id{{=}}"EQ:pr:TH:lassofastmis2"/>
<math display="block">
\begin{align}
|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1
&\le 2n\tau|\thetalasso-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso|_1\nonumber \\
&= 2n\tau|\thetalasso_S-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso_S|_1\nonumber\\
&\le 4n\tau|\thetalasso_S-\theta|_1\label{EQ:pr:TH:lassofastmis2}
\end{align}
</math>
with probability <math>1-\delta</math>.
It implies that either <math>\MSE(\varphi_{\thetalasso})\le \MSE(\varphi_\theta)</math> or that
<math display="block">
|\thetalasso_{S^c}-\theta_{S^c}|_1\le 3 |\thetalasso_{S}-\theta_{S}|_1\,,
</math>
so that <math>\thetalasso-\theta</math> satisfies the cone condition [[guide:090050c501#EQ:conecond |EQ:conecond]].
Using now the Cauchy-Schwarz inequality and [[guide:090050c501#lem:inc |Lemma]], respectively, assuming that <math>|\theta|_0\le k</math>, we get
<math display="block">
4n\tau|\thetalasso_S-\theta|_1 \le 4n\tau\sqrt{|S|}|\thetalasso_S-\theta|_2 \le 4\tau\sqrt{2n|\theta|_0}|\varphi_{\thetalasso}-\varphi_{\theta}|_2\,.
</math>
Using now Young's inequality <math>2ab\le \frac2{\alpha} a^2+\frac{\alpha}{2} b^2</math>, we get
<math display="block">
\begin{align*}
4n\tau|\thetalasso_S-\theta|_1& \le \frac{16\tau^2n|\theta|_0}{\alpha}+ \frac{\alpha}{2}|\varphi_{\thetalasso}-\varphi_{\theta}|_2^2\\
&\le \frac{16\tau^2n|\theta|_0}{\alpha}+ \alpha|\varphi_{\thetalasso}-f|_2^2+\alpha|\varphi_{\theta}-f|_2^2\,.
\end{align*}
</math>
Combining this result with \eqref{EQ:pr:TH:lassofastmis1} and \eqref{EQ:pr:TH:lassofastmis2}, we find
<math display="block">
(1-\alpha)\MSE(\varphi_{\thetalasso}) \le (1+\alpha)\MSE(\varphi_{\theta})+ \frac{16\tau^2|\theta|_0}{\alpha}\,.
</math>
To conclude the proof, it only remains to divide both sides of the above inequality by <math>1-\alpha</math> and take <math>\alpha=1/2</math>.}}
===Maurey's argument===
If there is no sparse <math>\theta</math> such that <math>\MSE(\varphi_{\theta})</math> is small, [[#TH:BIC_mis |Theorem]] is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors <math>\theta</math> with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if <math>\theta</math> has a small <math>\ell_1</math> norm but is not sparse. For such <math>\theta</math>, the Lasso estimator still enjoys slow rates as in [[guide:090050c501#TH:lasso_slow |Theorem]], which can be easily extended to the misspecified case (see [[exercise:165e02b2e2 |Problem]]). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector <math>\theta \in \R^M</math> such that <math>|\theta|_1\le 1</math>, there exists a vector <math>\theta'</math> that is sparse and for which <math>\MSE(\varphi_{\theta'})</math> is not much larger than <math>\MSE(\varphi_\theta)</math>. The following theorem quantifies exactly this tradeoff between sparsity and <math>\MSE</math>. It is often attributed to B. Maurey and was published by Pisier <ref name="Pis81">{{cite journal|last=Pisier|first=Gilles|journal=Séminaire d'analyse fonctionnelle (Polytechnique)|year=1981|title=Remarques sur un résultat non publié de B. Maurey}}</ref>. This is why it is referred to as ''Maurey's argument''.
{{proofcard|Theorem|thm-1|Let <math>\{\varphi_1, \dots, \varphi_M\}</math> be a dictionary normalized in such a way that
<math display="block">
\max_{1\le j \le M}|\varphi_j|_2\le D\sqrt{n}\,.
</math>
Then for any integer <math>k</math> such that <math>1\le k \le M</math> and any positive <math>R</math>, we have
<math display="block">
\min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\MSE(\varphi_\theta) \le \min_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}\MSE(\varphi_\theta) + \frac{D^2 R^2}{k}\,.
</math>
|Define
<math display="block">
\bar \theta \in \argmin_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}|\varphi_\theta -f|_2^2
</math>
and assume without loss of generality that <math>|\bar \theta_1| \ge |\bar \theta_2|\ge \ldots \ge |\bar \theta_M|</math>.
Let now <math>U \in \R^n</math> be a random vector with values in <math>\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}</math> defined by
<math display = "block">
\begin{alignat*}{2}
\p(U=R\,\mathrm{sign}(\bar \theta_j)\varphi_j)&=\frac{|\bar \theta_j|}{R}\,, &\quad &j=1, \ldots, M,\\
\p(U=0)&=1-\frac{|\bar \theta|_1}{R}\,.
\end{alignat*}
</math>
Note that <math>\E[U]=\varphi_{\bar \theta}</math> and <math>|U|_2 \le RD \sqrt{n}</math>. Let now <math>U_1, \ldots, U_k</math> be <math>k</math> independent copies of <math>U</math> and define their average
<math display="block">
\bar U=\frac{1}{k}\sum_{i=1}^kU_i\,.
</math>
Note that <math>\bar U=\varphi_{\tilde \theta}</math> for some <math>\tilde \theta \in\R^M</math> such that <math>|\tilde \theta|_0\le k</math>. Moreover, using the Pythagorean theorem,
<math display="block">
\begin{align*}
\E|f-\bar U|_2^2&=\E|f-\varphi_{\bar \theta}+\varphi_{\bar \theta}-\bar U|_2^2\\
&=|f-\varphi_{\bar \theta}|_2^2+\E|\varphi_{\bar \theta}-\bar U|_2^2\\
&= |f-\varphi_{\bar \theta}|_2^2 + \frac{\E|U-\E[U]|_2^2}{k}\\
&\le |f-\varphi_{\bar \theta}|_2^2 + \frac{(RD\sqrt{n})^2}{k}\,.
\end{align*}
</math>
To conclude the proof, note that
<math display="block">
\E|f-\bar U|_2^2 =\E|f-\varphi_{\tilde \theta}|_2^2 \ge \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}|f-\varphi_{\theta}|_2^2
</math>
and divide by <math>n</math>.}}
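Maurey's sampling scheme can also be run as a simulation. The sketch below (dimensions and the vector <math>\theta</math> are hypothetical; it compares <math>\bar U</math> to <math>\varphi_\theta</math> directly, which isolates the variance step of the proof) draws <math>k</math> atoms i.i.d. with probabilities <math>|\theta_j|/R</math> and checks that the <math>k</math>-sparse average is within about <math>D^2R^2/k</math> of <math>\varphi_\theta</math> on average.

```python
import numpy as np

# Hypothetical simulation of Maurey's sampling argument.
rng = np.random.default_rng(4)
n, M, k = 400, 30, 10
Phi = rng.standard_normal((n, M))
Phi *= np.sqrt(n) / np.linalg.norm(Phi, axis=0)  # normalize so |phi_j|_2 = sqrt(n), i.e. D = 1
theta = rng.standard_normal(M)
theta /= np.abs(theta).sum()                     # |theta|_1 = 1, so we may take R = 1
D, R = 1.0, 1.0
target = Phi @ theta                             # the vector phi_theta
probs = np.abs(theta) / R                        # sums to 1 here, so P(U = 0) = 0

gaps = []
for _ in range(1000):
    draws = rng.choice(M, size=k, p=probs)       # indices of k i.i.d. copies of U
    U_bar = np.mean(R * np.sign(theta[draws]) * Phi[:, draws], axis=1)
    gaps.append(np.mean((U_bar - target) ** 2))  # (1/n)|U_bar - phi_theta|_2^2
mean_gap = float(np.mean(gaps))                  # its expectation is at most D^2 R^2 / k
```

Each realization of `U_bar` is supported on at most <math>k</math> atoms by construction, so the simulation exhibits a <math>k</math>-sparse combination whose average squared distance to <math>\varphi_\theta</math> matches the <math>D^2R^2/k</math> bound (up to sampling fluctuation in the Monte Carlo average).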
Maurey's argument implies the following corollary.
{{proofcard|Corollary|cor-1|Assume that the assumptions of [[#TH:BIC_mis |Theorem]] hold and that the dictionary <math>\{\varphi_1, \dots, \varphi_M\}</math> is normalized in such a way that
<math display="block">
\max_{1\le j \le M}|\varphi_j|_2\le \sqrt{n}\,.
</math>
Then there exists a constant <math>C > 0</math> such that the BIC estimator satisfies
<math display="block">
\begin{align*}
\MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{2\MSE(\varphi_\theta)+C\Big[&\frac{\sigma^2|\theta|_0\log (eM)}{n}\wedge \sigma|\theta|_1\sqrt{\frac{\log (eM)}{n}}\Big]\Big\}\\
&+ C\frac{\sigma^2\log(1/\delta)}{n}
\end{align*}
</math>
with probability at least <math>1-\delta</math>.
|Choosing <math>\alpha=1/3</math> in [[#TH:BIC_mis |Theorem]] yields | |||
<math display="block"> | |||
\MSE(\varphi_{\thetabic})\le 2\inf_{\theta \in \R^M}\Big\{\MSE(\varphi_\theta)+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\}+C\frac{\sigma^2\log(1/\delta)}{n} | |||
</math> | |||
Let <math>\theta' \in \R^M</math>. | |||
It follows from Maurey's argument that for any <math> k \in [M] </math>, there exists <math>\theta = \theta(\theta', k) \in \R^M</math> such that <math>|\theta|_0 = k</math> and | |||
<math display="block"> | |||
\MSE(\varphi_{\theta}) \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k} | |||
</math> | |||
It implies that | |||
<math display="block"> | |||
\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} | |||
</math> | |||
and furthermore that | |||
<math display="block"> | |||
\inf_{\theta \in \R^M}\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} | |||
</math> | |||
Since the above bound holds for any <math> \theta' \in \mathbb{R}^M </math> and <math> k \in [M] </math>, we can take an infimum with respect to both <math>\theta'</math> and <math>k</math> on the right-hand side to get | |||
<math display="block"> | |||
\begin{align*} | |||
\inf_{\theta \in \R^M}&\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \Big\}\\ | |||
&\le \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}\,. | |||
\end{align*} | |||
</math> | |||
To control the minimum over <math>k</math>, we need to consider three cases for the quantity | |||
<math display="block"> | |||
\bar k=\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log (eM)}}. | |||
</math> | |||
<ul><li> If <math>1 \le \bar k \le M</math>, then we get | |||
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} | |||
</math> | |||
</li> | |||
<li> If <math>\bar k \le 1</math>, then | |||
<math display="block"> | |||
|\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, | |||
</math> | |||
which yields | |||
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} | |||
</math> | |||
</li> | |||
<li> If <math>\bar k \ge M</math>, then | |||
<math display="block"> | |||
\frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. | |||
</math> | |||
Therefore, on the one hand, if <math>M \ge \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}</math>, we get
<math display="block"> | |||
\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. | |||
</math> | |||
On the other hand, if <math>M \le \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}</math>, then for any <math>\theta \in \R^M</math>, we have
<math display="block"> | |||
\begin{equation*} | |||
\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. | |||
\end{equation*} | |||
</math> | |||
</li> | |||
</ul> | |||
Combined, | |||
<math display="block"> | |||
\begin{align*} | |||
\leadeq{\inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}}\\ | |||
\le {} & \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'}) + C \sigma | \theta' |_1 \sqrt{\frac{\log(eM)}{n}} + C \frac{\sigma^2 \log(eM)}{n} \Big\},
\end{align*} | |||
</math> | |||
which together with Theorem 3.4 yields the claim.}} | |||
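The balancing step in the proof above (the first case, <math>1\le \bar k \le M</math>) can be checked numerically. A minimal sketch, with illustrative values of <math>\sigma</math>, <math>n</math>, <math>M</math> and <math>|\theta'|_1</math>: taking <math>k=\lceil \bar k\rceil</math> shows the minimum over <math>k</math> is at most <math>2\sigma|\theta'|_1\sqrt{\log(eM)/n}</math> plus an additive <math>\sigma^2\log(eM)/n</math>.

```python
import numpy as np

# Numerical check of the balancing step (illustrative values): when
# 1 <= kbar <= M, taking k = ceil(kbar) shows that
#   min_k ( |theta'|_1^2/k + sigma^2 k log(eM)/n )
# is at most 2*sigma*|theta'|_1*sqrt(log(eM)/n) + sigma^2*log(eM)/n.
sigma, n, M = 1.0, 1000, 50
t1 = 3.0                                  # stands in for |theta'|_1
logeM = np.log(np.e * M)
kbar = (t1 / sigma) * np.sqrt(n / logeM)
assert 1 <= kbar <= M                     # we are in the first case

ks = np.arange(1, M + 1)
best = (t1**2 / ks + sigma**2 * ks * logeM / n).min()
target = sigma * t1 * np.sqrt(logeM / n)
assert best <= 2 * target + sigma**2 * logeM / n
```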
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the <math>\ell_0</math> norm as in [[#TH:BIC_mis |Theorem]]. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of<ref name="RigTsy11">{{cite journal|last1=Rigollet|first1=Philippe|last2=Tsybakov|first2=Alexandre|journal=Ann. Statist.|year=2011|title=Exponential screening and optimal rates of sparse estimation|volume=39|number=2|doi=10.1214/aos/1034276635|publisher=John Wiley & Sons Inc.|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1206.5349}}</ref>. | |||
Maurey's argument allows us to enjoy the best of both the <math>\ell_0</math> and the <math>\ell_1</math> worlds. The rate adapts to the sparsity of the problem and can even be generalized to <math>\ell_q</math>-sparsity (see [[exercise:Db743c87c5|Problem]]). However, it is clear from the proof that this argument is limited to squared <math>\ell_2</math> norms such as the one appearing in <math>\MSE</math>, and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces <ref name="Pis81"/><ref name=" DonDarGur97">{{cite journal|last=Ramon|first=van|journal=Constructive Approximation|year=2017|title=Probability in High Dimension|volume=13|number=2|doi=10.1214/aos/1034276635|publisher=Springer-Verlag|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=ArXiv:1502.00665}}</ref> using more sophisticated arguments.
==Nonparametric regression== | |||
So far, the oracle inequalities that we have derived do not deal with the approximation error <math>\MSE(\varphi_\theta)</math>. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> is chosen appropriately. In what follows, we assume for simplicity that <math>d=1</math> so that <math>f\,:\, \R \to \R</math> and <math>\varphi_j\,:\, \R \to \R</math>. | |||
===Fourier decomposition=== | |||
Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary <math>\cH=\{\varphi_1, \ldots, \varphi_M\}</math> forms an orthonormal system of <math>L_2([0,1])</math>: | |||
<math display="block"> | |||
\int_0^1\varphi_j^2(x)\ud x =1\,, \quad \int_0^1 \varphi_j(x)\varphi_k(x)\ud x=0, \ \forall\, j\neq k\,. | |||
</math> | |||
We will also deal with the case where <math>M=\infty</math>. | |||
When <math>\cH</math> is an orthonormal system, the coefficients <math>\theta_j^* \in \R</math> defined by | |||
<math display="block"> | |||
\theta_j^* =\int_0^1 f(x) \varphi_j(x)\ud x\,, | |||
</math> | |||
are called ''Fourier coefficients'' of <math>f</math>. | |||
Assume now that the regression function <math>f</math> admits the following decomposition | |||
<math display="block"> | |||
f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,. | |||
</math> | |||
There exist many choices for the orthonormal system; we give only two examples.
'''Example''' | |||
<span id = "ex:trigbasis"/> | |||
''Trigonometric basis''. This is an orthonormal basis of <math>L_2([0,1])</math>. It is defined by | |||
<math display="block"> | |||
\begin{eqnarray*} | |||
\varphi_1&\equiv&1\\ | |||
\varphi_{2k}(x)&=&\sqrt{2} \cos(2\pi k x)\,,\\ | |||
\varphi_{2k+1}(x)&=&\sqrt{2} \sin(2\pi k x)\,, | |||
\end{eqnarray*} | |||
</math> | |||
for <math>k=1,2, \dots</math> and <math>x \in [0,1]</math>. The fact that it is indeed an orthonormal system can be easily checked using trigonometric identities.
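The orthonormality relations can also be verified numerically; a small sketch (midpoint quadrature on a uniform grid, with `phi` indexing the basis as in the display above):

```python
import numpy as np

# Midpoint-rule check that the first five trigonometric basis functions
# are orthonormal in L_2([0,1]).  phi follows the indexing above:
# phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x).
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

N = 10000
x = (np.arange(N) + 0.5) / N              # midpoints of a uniform grid on [0, 1]
G = np.array([[np.mean(phi(j, x) * phi(l, x)) for l in range(1, 6)]
              for j in range(1, 6)])      # Gram matrix of L_2 inner products
assert np.allclose(G, np.eye(5), atol=1e-8)
```

The midpoint rule is exact up to floating-point error here because the integrands are trigonometric polynomials of low frequency.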
The next example has received a lot of attention in the signal (sound, image, ...) processing community. | |||
'''Example''' | |||
''Wavelets''. | |||
Let <math>\psi\,:\, \R \to \R</math> be a sufficiently smooth and compactly supported function, called the ''mother wavelet''. Define the system of functions
<math display="block"> | |||
\psi_{jk}(x)=2^{j/2}\psi(2^jx-k)\,, \quad j,k \in \Z\,. | |||
</math> | |||
It can be shown that for a suitable <math>\psi</math>, the dictionary <math>\{\psi_{jk}\,,\ j,k \in \Z\}</math> forms an orthonormal system of <math>L_2([0,1])</math> and sometimes a basis. In the latter case, for any function <math>g \in L_2([0,1])</math>, it holds
<math display="block"> | |||
g=\sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty \theta_{jk}\psi_{jk}\,, \quad \theta_{jk}=\int_{0}^1 g(x)\psi_{jk}(x)\ud x\,. | |||
</math> | |||
The coefficients <math>\theta_{jk}</math> are called ''wavelet coefficients'' of <math>g</math>. | |||
The simplest example is given by the ''Haar system'' obtained by taking <math>\psi</math> to be the following piecewise constant function (see [[#FIG:haar |Figure]]). We will not give more details about wavelets here but simply point the interested reader to<ref name="Mal09">{{cite journal|last=Ryan|first=O'Donnell|journal=SIAM J. Comput.|year=2005|title=The PCP Theorem and Hardness of Approximation|volume=24|number=2|doi=10.1214/aos/1034276635|publisher=W. H. Freeman & Co.|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1011.3396}}</ref>. | |||
<math display="block"> | |||
\psi(x)=\left\{ | |||
\begin{array}{ll} | |||
1& 0\le x < 1/2\\ | |||
-1 & 1/2\le x \le 1\\ | |||
0 & \text{otherwise} | |||
\end{array} | |||
\right. | |||
</math> | |||
<div id="FIG:haar" class="d-flex justify-content-center"> | |||
[[File: | 400px | thumb | The Haar mother wavelet ]] | |||
</div> | |||
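For the Haar mother wavelet above, orthonormality of the <math>\psi_{jk}</math> on <math>[0,1]</math> can be checked directly; a minimal sketch over a small finite family (the choice of levels <math>j=0,1,2</math> is illustrative):

```python
import numpy as np

# Haar mother wavelet as defined above, and a check that the functions
# psi_{jk}(x) = 2^(j/2) psi(2^j x - k) for j = 0, 1, 2 and k = 0, ..., 2^j - 1
# form an orthonormal family in L_2([0,1]).
def psi(x):
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def psi_jk(j, k, x):
    return 2**(j / 2) * psi(2**j * x - k)

N = 2**14
x = (np.arange(N) + 0.5) / N                       # midpoint grid on [0, 1]
fam = [(j, k) for j in range(3) for k in range(2**j)]
G = np.array([[np.mean(psi_jk(j1, k1, x) * psi_jk(j2, k2, x)) for j2, k2 in fam]
              for j1, k1 in fam])
assert np.allclose(G, np.eye(len(fam)), atol=1e-12)
```

Because the functions are piecewise constant with dyadic breakpoints, the grid averages above are exact integrals up to floating-point error.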
===Sobolev classes and ellipsoids=== | |||
We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that <math>f^{(k)}</math> denotes the <math>k</math>-th derivative of <math>f</math>. | |||
{{defncard|label=|id=|Fix parameters <math>\beta \in \{1, 2, \dots\}</math> and <math>L > 0</math>. The Sobolev class of functions <math>W(\beta,L)</math> is defined by | |||
<math display="block"> | |||
\begin{align*} | |||
W(\beta,L)=\Big\{f\,:\,& [0,1] \to \R\ :\ f\in L_2([0,1])\,, f^{(\beta-1)} \text{ is absolutely continuous and }\\ | |||
& \int_{0}^1[f^{(\beta)}]^2\le L^2\,, f^{(j)}(0)= f^{(j)}(1), j=0, \ldots, \beta-1\Big\} | |||
\end{align*} | |||
</math> | |||
}} | |||
Any function <math>f \in W(\beta, L)</math> can be represented<ref group="Notes" >In the sense that
<math display="block"> | |||
\lim_{k \to \infty} \int_0^1|f(t)-\sum_{j=1}^k \theta_j \varphi_j(t)|^2\ud t=0 | |||
</math> | |||
</ref> as its Fourier expansion along the trigonometric basis: | |||
<math display="block"> | |||
f(x)=\theta_1^*\varphi_1(x) + \sum_{k=1}^\infty\big(\theta_{2k}^*\varphi_{2k}(x)+\theta_{2k+1}^*\varphi_{2k+1}(x)\big)\,, \quad \forall \ x \in [0,1]\,, | |||
</math> | |||
where <math>\theta^*=\{\theta_j^*\}_{j \ge 1}</math> is in the space of square-summable sequences <math>\ell_2(\N)</math> defined by
<math display="block"> | |||
\ell_2(\N)=\Big\{\theta\,:\, \sum_{j=1}^\infty \theta_j^2 < \infty\Big\}\,. | |||
</math> | |||
For any <math>\beta > 0</math>, define the coefficients | |||
<span id="EQ:defaj"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:defaj} | |||
a_j=\left\{ | |||
\begin{array}{cl} | |||
j^\beta& \text{for $j$ even}\\ | |||
(j-1)^\beta & \text{for $j$ odd} | |||
\end{array} | |||
\right. | |||
\end{equation} | |||
</math> | |||
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients. | |||
{{proofcard|Theorem|thm-2|Fix <math>\beta\ge 1</math> and <math>L > 0</math> and let <math>\{\varphi_j\}_{j \ge 1}</math> denote the trigonometric basis of <math>L_2([0,1])</math>. Moreover, let <math>\{a_j\}_{j\ge 1}</math> be defined as in\eqref{EQ:defaj}. A function <math>f \in W(\beta,L)</math> can be represented as | |||
<math display="block"> | |||
f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,, | |||
</math> | |||
where the sequence <math>\{\theta_j^*\}_{j \ge 1}</math> belongs to Sobolev ellipsoid of <math>\ell_2(\N)</math> defined by | |||
<math display="block"> | |||
\Theta(\beta, Q)=\Big\{\theta \in \ell_2(\N)\,:\, \sum_{j=1}^\infty a_j^2\theta_j^2\le Q\Big\} | |||
</math> | |||
for <math>Q=L^2/\pi^{2\beta}</math>. | |||
|Let us first define the Fourier coefficients <math>\{s_k(j)\}_{k \ge 1}</math> of the <math>j</math>th derivative <math>f^{(j)}</math> of <math>f</math> for <math>j=1, \ldots, \beta</math>: | |||
<math display="block"> | |||
\begin{align*} | |||
s_1(j)&=\int_0^1f^{(j)}(t)\ud t=f^{(j-1)}(1)-f^{(j-1)}(0)=0\,,\\ | |||
s_{2k}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\cos(2\pi kt)\ud t\,,\\ | |||
s_{2k+1}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\sin(2\pi kt)\ud t\,,\\ | |||
\end{align*} | |||
</math> | |||
The Fourier coefficients of <math>f</math> are given by <math>\theta_k=s_k(0)</math>. | |||
Using integration by parts, we find that | |||
<math display="block"> | |||
\begin{align*} | |||
s_{2k}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \cos(2\pi kt)\Big|_0^1 + (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ | |||
&= \sqrt{2}[f^{(\beta-1)}(1)-f^{(\beta-1)}(0)]+(2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ | |||
&= (2\pi k)s_{2k+1}(\beta-1)\,. | |||
\end{align*} | |||
</math> | |||
Moreover, | |||
<math display="block"> | |||
\begin{align*} | |||
s_{2k+1}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \sin(2\pi kt)\Big|_0^1 - (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \cos(2\pi kt)\ud t\\ | |||
&= -(2\pi k)s_{2k}(\beta-1)\,. | |||
\end{align*} | |||
</math> | |||
In particular, it yields | |||
<math display="block"> | |||
s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^2\big[s_{2k}(\beta-1)^2+s_{2k+1}(\beta-1)^2\big] | |||
</math> | |||
By induction, we find that for any <math>k \ge 1</math>, | |||
<math display="block"> | |||
s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big) | |||
</math> | |||
Next, it follows from the definition\eqref{EQ:defaj} of <math>a_j</math> that
<math display="block"> | |||
\begin{align*} | |||
\sum_{k=1}^\infty(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big)&=\pi^{2\beta}\sum_{k=1}^\infty a_{2k}^2\theta_{2k}^2 + \pi^{2\beta}\sum_{k=1}^\infty a_{2k+1}^2\theta_{2k+1}^2\\ | |||
&=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. | |||
\end{align*} | |||
</math> | |||
Together with the Parseval identity, it yields | |||
<math display="block"> | |||
\int_0^1\big(f^{(\beta)}(t)\big)^2\ud t=\sum_{k=1}^\infty s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. | |||
</math> | |||
To conclude, observe that since <math>f \in W(\beta, L)</math>, we have | |||
<math display="block"> | |||
\int_0^1\big(f^{(\beta)}(t)\big)^2\ud t \le L^2\,, | |||
</math> | |||
so that <math>\theta \in \Theta(\beta, L^2/\pi^{2\beta})</math>.}}
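The identity at the end of the proof can be sanity-checked on a concrete function. For <math>f(x)=\sin(2\pi x)=\theta_3\varphi_3(x)</math> with <math>\theta_3=1/\sqrt{2}</math> and <math>\beta=1</math>, both sides equal <math>2\pi^2</math>; a numeric sketch:

```python
import numpy as np

# Check int_0^1 (f^(beta))^2 dt = pi^(2 beta) sum_j a_j^2 theta_j^2 for
# f(x) = sin(2 pi x) and beta = 1.  Here theta_3 = 1/sqrt(2) is the only
# nonzero Fourier coefficient (phi_3 = sqrt(2) sin(2 pi x)) and a_3 = (3-1)^beta = 2.
beta = 1
theta3 = 1 / np.sqrt(2)
a3 = (3 - 1)**beta
rhs = np.pi**(2 * beta) * a3**2 * theta3**2        # = 2 * pi^2

N = 100000
x = (np.arange(N) + 0.5) / N
fprime = 2 * np.pi * np.cos(2 * np.pi * x)         # f' for f = sin(2 pi x)
lhs = np.mean(fprime**2)                           # midpoint rule for the integral
assert abs(rhs - 2 * np.pi**2) < 1e-12
assert abs(lhs - rhs) < 1e-8
```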
It can actually be shown that the converse is true, that is, any function with Fourier coefficients in <math>\Theta(\beta, Q)</math> belongs to <math> W(\beta,L)</math>, but we will not need this.
In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write <math>f \in \Theta(\beta, Q)</math> in this case and consider any real value for <math>\beta</math>. | |||
{{proofcard|Proposition|prop-1|The Sobolev ellipsoids enjoy the following properties | |||
<ul style{{=}}"list-style-type:lower-roman"><li>For any <math>Q > 0</math>, | |||
<math display="block"> | |||
0 < \beta' < \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). | |||
</math></li> | |||
<li>For any <math>Q > 0</math> and any <math>f</math> with Fourier coefficients in <math>\Theta(\beta, Q)</math>,
<math display="block">
\beta > \frac12 \ \Rightarrow \ f\ \text{is continuous}.
</math></li>
</ul>|}} | |||
The proof is left as an exercise ([[exercise:143f8e5f8a |Problem]]). | |||
It turns out that the first <math> n </math> functions in the trigonometric basis are not only orthonormal with respect to the inner product of <math>L_2</math>, but also with respect to the inner product associated with the fixed design, <math>\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)</math>, when the design is chosen to be ''regular'', i.e., <math>X_i=(i-1)/n</math>, <math>i=1, \ldots, n</math>.
{{proofcard|Lemma|LEM:regdesign|Assume that <math>\{X_1, \ldots, X_n\}</math> is the regular design, i.e., <math>X_i=(i-1)/n</math>. Then, for any <math>M\le n-1</math>, the design matrix <math>\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}</math> satisfies the <math>\textsf{ORT}</math> condition. | |||
|Fix <math>j,j'\in \{1, \ldots, n-1\}</math>, <math>j\neq j'</math>, and consider the inner product <math>\varphi_j^\top \varphi_{j'}</math>. Write <math>k_{j}=\lfloor j/2\rfloor</math> for the integer part of <math>j/2</math> and define the vectors <math>a, b, a', b' \in \R^n</math> with coordinates such that <math>e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}</math> and <math>e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}</math> for <math> s \in \{0, \dots, n-1\} </math>. It holds that | |||
<math display="block"> | |||
\frac{1}{2}\varphi_j^\top \varphi_{j'}\in \{a^\top a'\,,\, b^\top b'\,,\, b^\top a'\,,\, a^\top b'\}\,, | |||
</math> | |||
depending on the parity of <math>j</math> and <math>j'</math>. | |||
On the one hand, observe that if <math>k_j \neq k_{j'}</math>, we have for any <math>\sigma \in \{-1, +1\}</math>, | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi (k_j+\sigma k_{j'}) s}{n}}=0\,. | |||
</math> | |||
On the other hand | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=(a+\mathsf{i} b)^\top( a'+\sigma \mathsf{i} b')=a^\top a'-\sigma b^\top b' + \mathsf{i} \big[b^\top a' + \sigma a^\top b' \big] | |||
</math> | |||
so that <math>a^\top a'=\pm b^\top b'=0</math> and <math>b^\top a'=\pm a^\top b'=0</math> whenever <math>k_j \neq k_{j'}</math>. It yields <math>\varphi_j^\top \varphi_{j'}=0</math>.
Next, consider the case where <math>k_j=k_{j'}</math> so that | |||
<math display="block"> | |||
\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\left\{ | |||
\begin{array}{ll} | |||
0& \text{if $\sigma=1$}\\ | |||
n& \text{if $\sigma=-1$}\\ | |||
\end{array}\right.\,. | |||
</math> | |||
On the one hand if <math>j \neq j'</math>, it can only be the case that <math>\varphi_j^\top \varphi_{j'} \in \{b^\top a'\,,\, a^\top b'\}</math> but the same argument as above yields <math>b^\top a'=\pm a^\top b'=0</math> since the imaginary part of the inner product is still <math> 0 </math>. Hence, in that case, <math>\varphi_j^\top \varphi_{j'}=0</math>. | |||
On the other hand, if <math>j=j'</math>, then <math>a=a'</math> and <math>b=b'</math> so that it yields <math>a^\top a'=|a|_2^2=n</math> and <math>b^\top b'=|b|_2^2=n</math> which is equivalent to <math>\varphi_j^\top \varphi_j=|\varphi_j|_2^2=n</math>. | |||
Therefore, the design matrix <math>\Phi</math> is such that | |||
<math display="block"> | |||
\begin{equation*} | |||
\Phi^\top \Phi=nI_M\,. | |||
\end{equation*} | |||
</math>}} | |||
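The lemma is easy to confirm numerically; a sketch with the regular design and <math>M=n-1</math> (the value of <math>n</math> is arbitrary):

```python
import numpy as np

# For the regular design X_i = (i-1)/n and the first M <= n-1 functions of
# the trigonometric basis, the design matrix satisfies Phi^T Phi = n I_M.
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

n = 64
M = n - 1
X = np.arange(n) / n                               # regular design X_i = (i-1)/n
Phi = np.column_stack([phi(j, X) for j in range(1, M + 1)])
assert np.allclose(Phi.T @ Phi, n * np.eye(M), atol=1e-8)
```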
===Integrated squared error=== | |||
As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if <math>\theta \in \Theta(\beta, Q)</math>, then <math>a_j^2\theta_j^2 \to 0</math> as <math>j \to \infty</math> so that <math>|\theta_j|=o(j^{-\beta})</math>. Therefore, the <math>\theta_j</math>s decay polynomially to zero and it makes sense to approximate <math>f</math> by its truncated Fourier series | |||
<math display="block"> | |||
\sum_{j=1}^M \theta_j^* \varphi_j=:\varphi_{\theta^*}^M | |||
</math> | |||
for any fixed <math>M</math>. This truncation leads to a systematic error that vanishes as <math>M \to \infty</math>. We are interested in understanding the rate at which this happens. | |||
The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter <math>M</math> and the smoothness <math>\beta</math>. | |||
{{proofcard|Lemma|TH:bias|For any integer <math>M \ge 1</math>, and <math>f \in \Theta(\beta, Q)</math>, <math>\beta > 1/2</math>, it holds | |||
<span id="EQ:bias1"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:bias1} | |||
\|\varphi_{\theta^*}^M -f\|_{L_2}^2=\sum_{j > M} |\theta^*_j|^2 \le QM^{-2\beta}\,,
\end{equation} | |||
</math> | |||
and for <math>M =n-1</math>, we have | |||
<span id="EQ:bias2"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:bias2} | |||
|\varphi_{\theta^*}^{n-1} -f|_2^2\le 2n\Big(\sum_{j\ge n}|\theta^*_j|\Big)^2\lesssim Qn^{2-2\beta}\,. | |||
\end{equation} | |||
</math> | |||
|Note that for any <math>\theta \in \Theta(\beta, Q)</math>, if <math>\beta > 1/2</math>, then | |||
<math display="block"> | |||
\begin{align*} | |||
\sum_{j=2}^\infty|\theta_j|&=\sum_{j=2}^\infty a_j|\theta_j|\frac1{a_j}\\ | |||
&\le \sqrt{\sum_{j=2}^\infty a_j^2\theta_j^2\sum_{j=2}^\infty\frac1{a_j^2}}\quad \text{by Cauchy-Schwarz}\\ | |||
&\le \sqrt{Q\sum_{j=1}^\infty\frac1{j^{2\beta}}} < \infty | |||
\end{align*} | |||
</math> | |||
Since <math>\{\varphi_j\}_j</math> forms an orthonormal system in <math>L_2([0,1])</math>, we have | |||
<math display="block"> | |||
\min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2}^2= \|\varphi_{\theta^*}^M -f\|_{L_2}^2 =\sum_{j > M} |\theta^*_j|^2\,. | |||
</math> | |||
When <math>\theta^* \in \Theta(\beta,Q)</math>, we have | |||
<math display="block"> | |||
\sum_{j > M} |\theta_j^*|^2=\sum_{j > M} a_j^2|\theta_j^*|^2\frac1{a_j^2}\le \frac1{a_{M+1}^2}Q\le \frac{Q}{M^{2\beta}}\,. | |||
</math> | |||
To prove the second part of the lemma, observe that | |||
<math display="block"> | |||
|\varphi_{\theta^*}^{n-1} -f|_2=\big|\sum_{j\ge n}\theta_j^* \varphi_j\big|_2 \le \sqrt{2n}\sum_{j\ge n}|\theta_j^*|\,,
</math> | |||
where in the last inequality, we used the fact that for the trigonometric basis <math>|\varphi_j|_2 \le \sqrt{2n}, j \ge 1</math> regardless of the choice of the design <math>X_1,\ldots, X_n</math>. | |||
When <math>\theta^* \in \Theta(\beta,Q)</math>, we have | |||
<math display="block"> | |||
\begin{equation*} | |||
\sum_{j\ge n} |\theta^*_j|=\sum_{j\ge n} a_j|\theta^*_j|\frac{1}{a_j}\le \sqrt{\sum_{j\ge n} a_j^2|\theta^*_j|^2}\sqrt{\sum_{j\ge n}\frac{1}{a_j^2}}\lesssim \sqrt{Q} n^{\frac{1}{2}-\beta}\,. | |||
\end{equation*} | |||
</math>}} | |||
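The first bound of the lemma can be illustrated numerically on a sample sequence from a Sobolev ellipsoid (the choice <math>\theta_j=j^{-\beta-1}</math> below is arbitrary; any sequence with finite <math>Q</math> works):

```python
import numpy as np

# Check the truncation bound sum_{j > M} theta_j^2 <= Q M^(-2 beta) for an
# example sequence theta_j = j^(-beta - 1) (illustrative choice), with
# Q = sum_j a_j^2 theta_j^2 and a_j as defined in \eqref{EQ:defaj}.
beta = 2.0
J = 200000                                  # large cutoff standing in for infinity
j = np.arange(1, J + 1, dtype=float)
a = np.where(j % 2 == 0, j**beta, (j - 1)**beta)   # note a_1 = 0
theta = j**(-beta - 1)
Q = np.sum(a**2 * theta**2)

for M in (5, 10, 50, 100):
    tail = np.sum(theta[M:]**2)             # sum over j > M (0-based slicing)
    assert tail <= Q * M**(-2 * beta)
```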
Note that the truncated Fourier series <math>\varphi_{\theta^*}^M</math> is an oracle: this is what we see when we view <math>f</math> through the lens of functions with only low frequency harmonics. | |||
To estimate <math>\varphi_{\theta^*} = \varphi_{\theta^\ast}^M</math>, consider the estimator <math>\varphi_{\thetals}</math> where | |||
<math display="block"> | |||
\thetals \in \argmin_{\theta \in \R^M} \sum_{i=1}^n \big(Y_i -\varphi_\theta(X_i)\big)^2\,, | |||
</math> | |||
which should be such that <math>\varphi_{\thetals}</math> is close to <math>\varphi_{\theta^*}</math>. For this estimator, we have proved ([[#TH:LS_mis |Theorem]]) an oracle inequality for the <math>\MSE</math> that is of the form | |||
<math display="block"> | |||
|\varphi_{\thetals}^M-f|_2^2\le\inf_{\theta \in \R^M}|\varphi_\theta^M-f|_2^2+C\sigma^2M\log(1/\delta)\,, \qquad C > 0\,. | |||
</math> | |||
It yields | |||
<math display="block"> | |||
\begin{align*} | |||
|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2&\le 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(f-\varphi_{\theta^*}^M)+C\sigma^2M\log(1/\delta)\\ | |||
&= 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j > M}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\\ | |||
&=2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\,, | |||
\end{align*} | |||
</math> | |||
where we used [[#LEM:regdesign |Lemma]] in the last equality. Together with\eqref{EQ:bias2} and Young's inequality <math>2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0</math> for any <math>\alpha > 0</math>, we get | |||
<math display="block"> | |||
2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)\le \alpha|\varphi_{\thetals}^M-\varphi_{\theta^*}^M | |||
|_2^2 + \frac{C}{\alpha}Qn^{2-2\beta}\,, | |||
</math> | |||
for some positive constant <math>C</math> when <math>\theta^* \in \Theta(\beta, Q)</math>. As a result, | |||
<span id="EQ:almostLSNP"/> | |||
<math display="block"> | |||
\begin{equation} | |||
\label{EQ:almostLSNP} | |||
|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2 \lesssim \frac{1}{\alpha(1-\alpha)}Qn^{2-2\beta}+ \frac{\sigma^2M}{1-\alpha}\log(1/\delta) | |||
\end{equation} | |||
</math> | |||
for any <math>\alpha \in (0,1)</math>. Since [[#LEM:regdesign |Lemma]] implies <math>|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2</math>, we have proved the following theorem.
{{proofcard|Theorem|TH:UB_ls_NP|Fix <math>\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math> and assume the general regression model\eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Moreover, let <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> and <math>n</math> be large enough so that <math>M \le n-1</math>. Then the least squares estimator <math>\thetals</math> defined in\eqref{EQ:defthetaLSmis} with <math>\{\varphi_j\}_{j=1}^M</math> being the trigonometric basis, satisfies with probability <math>1-\delta</math>, for <math>n</math> large enough, | |||
<math display="block"> | |||
\|\varphi^M_{\thetals} -f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,,
</math> | |||
where the constant factors may depend on <math>\beta, Q</math> and <math>\sigma</math>. | |||
|Choosing <math>\alpha=1/2</math> for example and absorbing <math>Q</math> in the constants, we get from\eqref{EQ:almostLSNP} and [[#LEM:regdesign |Lemma]] that | |||
for <math>M \le n-1</math>, | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-\varphi^M_{\theta^*}\|^2_{L_2([0,1])}\lesssim n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. | |||
</math> | |||
Using now [[#TH:bias |Lemma]] and <math>\sigma^2 \le 1</math>, we get | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim M^{-2\beta}+n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. | |||
</math> | |||
Taking <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil\le n-1</math> for <math>n</math> large enough yields | |||
<math display="block"> | |||
\|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))+n^{1-2\beta}\,.
</math> | |||
To conclude the proof, simply note that for the prescribed <math>\beta</math>, we have <math>n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}</math>.}}
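The choice <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> simply balances the bias term <math>M^{-2\beta}</math> against the variance term <math>M/n</math>; a quick numeric illustration (the values of <math>\beta</math> and <math>n</math> are arbitrary):

```python
import numpy as np

# Balance M^(-2 beta) against M/n: the prescribed M = ceil(n^(1/(2 beta + 1)))
# achieves the rate n^(-2 beta/(2 beta + 1)) up to a constant factor.
beta, n = 2.0, 10000
Ms = np.arange(1, n, dtype=float)
bound = Ms**(-2 * beta) + Ms / n               # bias + variance proxy
M_rule = int(np.ceil(n**(1 / (2 * beta + 1))))
rate = n**(-2 * beta / (2 * beta + 1))
assert bound[M_rule - 1] <= 3 * rate           # prescribed M attains the rate
assert bound.min() <= bound[M_rule - 1]
```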
===Adaptive estimation=== | |||
The rate attained by the projection estimator <math>\varphi_{\thetals}</math> with <math>M=\lceil n^{\frac{1}{2\beta+1}}\rceil</math> is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter <math>\beta</math>, which is typically unknown, to determine the truncation level <math>M</math>. The purpose of ''adaptive estimation'' is precisely to adapt to the unknown <math>\beta</math>, that is, to build an estimator that does not depend on <math>\beta</math> and yet attains a rate of the order of <math>Cn^{-\frac{2\beta}{2\beta+1}}</math> (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis} respectively. In view of [[#LEM:regdesign |Lemma]], the design matrix <math>\Phi</math> actually satisfies the assumption <math>\textsf{ORT}</math> when we work with the trigonometric basis. This has two useful implications:
<ul><li> Both estimators are actually thresholding estimators and can therefore be implemented efficiently.
</li> | |||
<li> The condition <math>\textsf{INC($k$)}</math> is automatically satisfied for any <math>k \ge 1</math>. | |||
</li> | |||
</ul> | |||
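The first point can be made concrete: under <math>\textsf{ORT}</math>, the least squares coefficients are <math>z_j=\varphi_j^\top Y/n</math>, and the BIC estimator keeps only those exceeding a threshold in absolute value. A minimal sketch, where the signal, noise level, and threshold <math>\tau</math> are illustrative rather than the calibrated values from the text:

```python
import numpy as np

# Hard thresholding of empirical Fourier coefficients under ORT (sketch).
# The threshold tau is illustrative, not the calibrated BIC value.
def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

rng = np.random.default_rng(0)
n, sigma = 256, 0.1
X = np.arange(n) / n                                     # regular design
f = np.sin(2 * np.pi * X) + 0.5 * np.cos(4 * np.pi * X)  # toy smooth signal
Y = f + sigma * rng.standard_normal(n)

Phi = np.column_stack([phi(j, X) for j in range(1, n)])  # M = n - 1
z = Phi.T @ Y / n                      # least squares coefficients under ORT
tau = 4 * sigma / np.sqrt(n)           # illustrative threshold
theta_hat = np.where(np.abs(z) > tau, z, 0.0)
# With high probability only phi_3 (sin term) and phi_4 (cos term) survive;
# their coefficients are close to 1/sqrt(2) and 0.5/sqrt(2).
assert theta_hat[2] != 0 and theta_hat[3] != 0
```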
These observations lead to the following corollary. | |||
{{proofcard|Corollary|COR:adap|Fix <math>\beta > (1+\sqrt{5})/4\simeq 0.81, Q > 0, \delta > 0</math>, let <math>n</math> be large enough to ensure <math>n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil</math>, and assume the general regression model\eqref{EQ:regmodgen} with <math>f \in \Theta(\beta,Q)</math> and <math>\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1</math>. Let <math>\{\varphi_j\}_{j=1}^{n-1}</math> be the trigonometric basis. Denote by <math>\varphi^{n-1}_{\thetabic}</math> (resp. <math>\varphi^{n-1}_{\thetalasso}</math>) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over <math>\R^{n-1}</math> with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then <math>\varphi^{n-1}_{\hat \theta}</math>, where <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, satisfies with probability <math>1-\delta</math>,
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2 \lesssim (n/\log n)^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,, | |||
</math> | |||
where the constants may depend on <math>\beta</math> and <math>Q</math>. | |||
|For <math>\hat \theta \in \{\thetabic, \thetalasso\}</math>, adapting the proofs of [[#TH:BIC_mis |Theorem]] for the BIC estimator and [[#TH:lasso_fast_mis |Theorem]] for the Lasso estimator, we get that for any <math>\theta \in \R^{n-1}</math>, with probability <math>1-\delta</math>,
<math display="block"> | |||
|\varphi^{n-1}_{\hat \theta} -f|_2^2 \le \frac{1+\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + R(|\theta|_0)\,,
</math> | |||
where | |||
<math display="block"> | |||
R(|\theta|_0):=\frac{C\sigma^2}{\alpha(1-\alpha)}|\theta|_0\log(en/\delta)\,.
</math> | |||
It yields | |||
<math display="block"> | |||
\begin{align*} | |||
|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2& \le \frac{2\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 +2(\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta)^\top (f - \varphi^{n-1}_\theta)+R(|\theta|_0)\\ | |||
&\le \Big( \frac{2\alpha}{1-\alpha}+\frac{1}{\alpha}\Big)|\varphi^{n-1}_\theta-f|_2^2+\alpha|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2 + R(|\theta|_0)\,, | |||
\end{align*} | |||
</math> | |||
where we used Young's inequality once again. Let <math> M \in \mathbb{N} </math> be an integer to be determined later, choose <math>\alpha=1/2</math>, and take <math>\theta=\theta^*_M</math>, where <math>\theta^*_M</math> is equal to <math>\theta^*</math> on its first <math>M</math> coordinates and <math>0</math> otherwise, so that <math>\varphi^{n-1}_{\theta^*_M}=\varphi^M_{\theta^*}</math>. It yields
<math display="block"> | |||
|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}|_2^2\lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2+ R(M)\lesssim|\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2+|\varphi^{n-1}_{\theta^*}-f|_2^2+ R(M) | |||
</math> | |||
Next, it follows from\eqref{EQ:bias2} that <math>|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}</math>. Together with [[#LEM:regdesign |Lemma]], it yields | |||
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2\lesssim \|\varphi^{n-1}_{\theta^*} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2+Qn^{1-2\beta}+ \frac{R(M)}{n}\,. | |||
</math> | |||
Moreover, using\eqref{EQ:bias1}, we find that | |||
<math display="block"> | |||
\|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2\lesssim M^{-2\beta}+Qn^{1-2\beta}+ \frac{M}{n}\sigma^2\log(en/\delta) \,. | |||
</math> | |||
To conclude the proof, choose <math>M=\lceil (n/\log n)^{\frac{1}{2\beta+1}}\rceil</math> and observe that the choice of <math>\beta</math> ensures that <math>n^{1-2\beta} \lesssim M^{-2\beta}</math>.}} | |||
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see <ref name="Tsy09">{{cite journal|last=LeCam|first=L.|journal=Ann. Statist.|year=1973|title=Convergence of estimates under dimensionality restrictions|volume=1|number=5|doi=10.1214/aos/1034276635|publisher=Omnipress|url-access=http://dx.doi.org/10.1214/aos/1034276635|eprint=1108.1170v6}}</ref>{{rp|at=Chapter3}}). The reason why we get this extra logarithmic factor here is because we use a hammer that's too big. Indeed, BIC and Lasso allow for ''holes'' in the Fourier decomposition, so we only get part of their potential benefits.
==General references== | |||
{{cite arXiv|last1=Rigollet|first1=Philippe|last2=Hütter|first2=Jan-Christian|year=2023|title=High-Dimensional Statistics|eprint=2310.19244|class=math.ST}} | |||
==Notes== | |||
{{Reflist|group=Notes}} | |||
==References== | |||
{{reflist}} |
Arguably, the strongest assumption that we made in Chapter Linear Regression Model is that the regression function [math]f(x)[/math] is of the form [math]f(x)=x^\top \theta^*[/math]. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models. Throughout this chapter, we assume the following model:
where [math]\eps=(\eps_1, \ldots, \eps_n)^\top [/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math]. Here [math]X_i \in \R^d[/math]. When dealing with fixed design, it will be convenient to consider the vector [math]g \in \R^n[/math] defined for any function [math]g \,:\, \R^d \to \R[/math] by [math]g=(g(X_1), \ldots, g(X_n))^\top[/math]. In this case, we can write for any estimator [math]\hat f \in \R^n[/math] of [math]f[/math],
Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: [math]\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso[/math]. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of [math]f[/math] if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.
Oracle inequalities
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator [math]\thetals[/math], we constrain ourselves to estimating functions that are of the form [math]x\mapsto x^\top \theta[/math], even though [math]f[/math] itself may not be of this form. Therefore, the oracle [math]\hat f[/math] is the linear function that is the closest to [math]f[/math]. Rather than trying to approximate [math]f[/math] by a linear function [math]f(x)\approx \theta^\top x[/math], we make the model a bit more general and consider a dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] of functions where [math]\varphi_j\,:\, \R^d \to \R[/math]. In this case, we can actually remove the assumption that [math]X \in \R^d[/math]. Indeed, the goal is now to estimate [math]f[/math] using a linear combination of the functions in the dictionary:
If [math]M=d[/math] and [math]\varphi_j(X)=X^{(j)}[/math] returns the [math]j[/math]th coordinate of [math]X \in \R^d[/math], then the goal is to approximate [math]f(x)[/math] by [math]\theta^\top x[/math]. Nevertheless, the use of a dictionary allows for a much more general framework.
Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
- The least squares estimator:
[[math]] \begin{equation} \label{EQ:defthetaLSmis} \thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \end{equation} [[/math]]
- The least squares estimator constrained to [math]K \subset \R^M[/math]:
[[math]] \thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 [[/math]]
- The BIC estimator:
[[math]] \begin{equation} \label{EQ:defBICmis} \thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\} \end{equation} [[/math]]
- The Lasso estimator:
[[math]] \begin{equation} \label{EQ:defLassomis} \thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\} \end{equation} [[/math]]
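The plain least squares estimator over a dictionary is a one-line computation even when the model is misspecified. Below is a minimal numpy sketch (the dictionary, the target function, and all constants are illustrative choices, not from the text): even though [math]f[/math] is not in the span of the dictionary, the least squares fit is well defined and approaches the best linear combination of the atoms, i.e., the oracle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Misspecified setting: the regression function f is NOT in the span of the dictionary.
f = lambda x: np.abs(x - 0.3)
n, sigma = 200, 0.5
X = rng.uniform(0.0, 1.0, size=n)
Y = f(X) + sigma * rng.standard_normal(n)

# Illustrative dictionary H = {phi_1, ..., phi_M}: monomials x^0, ..., x^{M-1}.
M = 5
Phi = np.column_stack([X**j for j in range(M)])  # Phi[i, j] = phi_{j+1}(X_i)

# Least squares over the dictionary: argmin_theta (1/n) sum_i (Y_i - phi_theta(X_i))^2.
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# Empirical residual mean square; the fitted function converges to the oracle
# (best linear combination of the atoms), not to f itself.
mse = np.mean((Phi @ theta_ls - Y) ** 2)
```

The constrained, BIC, and Lasso variants differ only in how this objective is penalized or restricted.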
Let [math]R(\cdot)[/math] be a risk function and let [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] be a dictionary of functions from [math]\R^d[/math] to [math]\R[/math]. Let [math]K[/math] be a subset of [math]\R^M[/math]. The oracle on [math]K[/math] with respect to [math]R[/math] is defined by [math]\varphi_{\bar \theta}[/math], where [math]\bar \theta \in K[/math] is such that
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
Oracle inequality for the least squares estimator
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of extending the analysis for linear models to this case is essentially already captured by the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the least squares estimator [math]\thetals[/math] satisfies for some numerical constant [math]C \gt 0[/math],
Note that by definition
Sparse oracle inequality for the BIC estimator
The analysis for more complicated estimators such as the BIC in Chapter Linear Regression Model allows us to derive oracle inequalities for these estimators.
Recall that the proof of Theorem for the BIC estimator begins as follows:
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error [math]\MSE(\varphi_\theta)[/math] and the complexity of [math]\theta[/math] as measured by its sparsity. In particular, this result, which is sometimes called sparse oracle inequality, implies the following oracle inequality. Define the oracle [math]\bar \theta[/math] to be such that
Then, with probability at least [math]1-\delta[/math],
If the linear model happens to be correct, then we simply have [math]\MSE(\varphi_{\bar \theta})=0[/math].
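For intuition, the BIC criterion \eqref{EQ:defBICmis} can be minimized exactly by enumerating all [math]2^M[/math] supports when [math]M[/math] is tiny. The numpy sketch below does this for an illustrative dictionary and an illustrative penalty level (both hypothetical choices, not prescribed by the text); the selected support trades off empirical MSE against sparsity.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, M, sigma = 100, 6, 0.5

# Toy dictionary of cosines; only two atoms (j = 0 and j = 1) carry signal.
X = rng.uniform(0.0, 1.0, size=n)
Phi = np.column_stack([np.cos(2 * np.pi * j * X) for j in range(M)])
f_vals = 1.0 + 0.8 * np.cos(2 * np.pi * X)
Y = f_vals + sigma * rng.standard_normal(n)

# Illustrative BIC-type penalty level tau^2 (constant chosen for the demo).
tau2 = 4 * sigma**2 * np.log(np.e * M) / n

# Exhaustive minimization of (1/n)|Y - Phi theta|^2 + tau^2 |theta|_0 over supports.
best_score, best_theta = np.inf, np.zeros(M)
for k in range(M + 1):
    for S in combinations(range(M), k):
        theta = np.zeros(M)
        if S:
            sol = np.linalg.lstsq(Phi[:, list(S)], Y, rcond=None)[0]
            theta[list(S)] = sol
        score = np.mean((Y - Phi @ theta) ** 2) + tau2 * k
        if score < best_score:
            best_score, best_theta = score, theta
```

The [math]2^M[/math] enumeration is, of course, only feasible for very small [math]M[/math]; the point of the later ORT discussion is that the minimization becomes trivial for orthogonal designs.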
Sparse oracle inequality for the Lasso
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the [math]n\times M[/math] matrix [math]\Phi[/math] with elements [math]\Phi_{i,j}=\varphi_j(X_i)[/math].
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Moreover, assume that there exists an integer [math]k[/math] such that the matrix [math]\Phi[/math] satisfies assumption [math]\textsf{INC$(k)$}[/math]. Then, the Lasso estimator [math]\thetalasso[/math] with regularization parameter given by
From the definition of [math]\thetalasso[/math], for any [math]\theta \in \R^M[/math],
Maurey's argument
If there is no sparse [math]\theta[/math] such that [math]\MSE(\varphi_{\theta})[/math] is small, Theorem is useless whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors [math]\theta[/math] with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case, for example, if [math]\theta[/math] has a small [math]\ell_1[/math] norm but is not sparse. For such [math]\theta[/math], the Lasso estimator still enjoys slow rates as in Theorem, which can be easily extended to the misspecified case (see Problem). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector [math]\theta \in \R^M[/math] such that [math]|\theta|_1\le 1[/math], there exists a vector [math]\theta'[/math] that is sparse and for which [math]\MSE(\varphi_{\theta'})[/math] is not much larger than [math]\MSE(\varphi_\theta)[/math]. The following theorem quantifies exactly the tradeoff between sparsity and [math]\MSE[/math]. It is often attributed to B. Maurey and was published by Pisier [1]. This is why it is referred to as Maurey's argument.
Let [math]\{\varphi_1, \dots, \varphi_M\}[/math] be a dictionary normalized in such a way that
Define
Maurey's argument implies the following corollary.
Assume that the assumptions of Theorem hold and that the dictionary [math]\{\varphi_1, \dots, \varphi_M\}[/math] is normalized in such a way that
Choosing [math]\alpha=1/3[/math] in Theorem yields
- If [math]1 \le \bar k \le M[/math], then we get
[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} [[/math]]
- If [math]\bar k \le 1[/math], then
[[math]] |\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, [[/math]]which yields[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} [[/math]]
- If [math]\bar k \ge M[/math], then
[[math]] \frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. [[/math]]Therefore, on the one hand, if [math]M \ge \frac{|\theta|_1}{\sigma\sqrt{\log(eM)/n}}[/math], we get[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. [[/math]]On the other hand, if [math]M \le \frac{|\theta|_1}{\sigma\sqrt{\log(eM)/n}}[/math], then for any [math]\theta \in \R^M[/math], we have[[math]] \begin{equation*} \frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. \end{equation*} [[/math]]
Combined,
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the [math]\ell_0[/math] norm as in Theorem. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [2]. Maurey's argument allows us to enjoy the best of both the [math]\ell_0[/math] and the [math]\ell_1[/math] worlds. The rate adapts to the sparsity of the problem and can even be generalized to [math]\ell_q[/math]-sparsity (see Problem). However, it is clear from the proof that this argument is limited to squared [math]\ell_2[/math] norms such as the one appearing in [math]\MSE[/math], and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces [1][3] using more sophisticated arguments.
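Maurey's random sparsification is easy to simulate: draw [math]k[/math] atoms i.i.d. with probabilities proportional to [math]|\theta_j|[/math] and average signed spikes of mass [math]|\theta|_1/k[/math]. The sketch below (the dictionary and the coefficient sequence are illustrative) exhibits the squared approximation error decaying at the [math]|\theta|_1^2/k[/math] rate promised by the argument.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 500, 50

# Normalized dictionary evaluated on a design grid: each atom is bounded by 1.
X = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([np.cos(2 * np.pi * (j + 1) * X) for j in range(M)])

# A dense (non-sparse) theta with polynomially decaying, Zipf-like coefficients.
theta = np.array([(-1.0) ** j / (j + 1) ** 2 for j in range(M)])
l1 = np.sum(np.abs(theta))
target = Phi @ theta

def maurey_sparsify(theta, k, rng):
    """Draw k atoms with prob. proportional to |theta_j|; average signed spikes."""
    p = np.abs(theta) / np.sum(np.abs(theta))
    J = rng.choice(len(theta), size=k, p=p)
    out = np.zeros(len(theta))
    np.add.at(out, J, np.sign(theta[J]) * np.sum(np.abs(theta)) / k)
    return out  # at most k nonzero coordinates, same l1 norm as theta

# Average squared approximation error for a few sparsity levels k.
errs = {k: np.mean([np.mean((Phi @ maurey_sparsify(theta, k, rng) - target) ** 2)
                    for _ in range(50)]) for k in (2, 8, 32)}
```

By construction [math]\E[\varphi_{\theta'}]=\varphi_\theta[/math], so the error is pure variance and shrinks like [math]1/k[/math].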
Nonparametric regression
So far, the oracle inequalities that we have derived do not deal with the approximation error [math]\MSE(\varphi_\theta)[/math]. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] is chosen appropriately. In what follows, we assume for simplicity that [math]d=1[/math] so that [math]f\,:\, \R \to \R[/math] and [math]\varphi_j\,:\, \R \to \R[/math].
Fourier decomposition
Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math]:
We will also deal with the case where [math]M=\infty[/math]. When [math]\cH[/math] is an orthonormal system, the coefficients [math]\theta_j^* \in \R[/math] defined by
are called Fourier coefficients of [math]f[/math]. Assume now that the regression function [math]f[/math] admits the following decomposition
There exist many choices for the orthonormal system, and we give only two examples.
Example Trigonometric basis. This is an orthonormal basis of [math]L_2([0,1])[/math]. It is defined by
for [math]k=1,2, \dots[/math] and [math]x \in [0,1][/math]. The fact that it is indeed an orthonormal system can easily be checked using trigonometric identities.
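Orthonormality can also be confirmed numerically. The sketch below evaluates the pairwise inner products of the first seven trigonometric basis functions by midpoint quadrature on [math][0,1][/math] and recovers the identity Gram matrix (up to floating-point error); the indexing convention used here is one common form of the basis.

```python
import numpy as np

def trig_basis(j, x):
    """phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0
                         else np.sin(2 * np.pi * k * x))

# Midpoint rule on a fine grid: averages approximate integrals over [0, 1].
x = (np.arange(10000) + 0.5) / 10000
G = np.array([[np.mean(trig_basis(j, x) * trig_basis(l, x))
               for l in range(1, 8)] for j in range(1, 8)])
```

The Gram matrix `G` should be (numerically) the [math]7\times 7[/math] identity, reflecting [math]\int_0^1 \varphi_j\varphi_{j'}=\delta_{jj'}[/math].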
The next example has received a lot of attention in the signal (sound, image, ...) processing community.
Example Wavelets. Let [math]\psi\,:\, \R \to \R[/math] be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions
It can be shown that for a suitable [math]\psi[/math], the dictionary [math]\{\psi_{j,k}, j,k \in \Z\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math] and sometimes a basis. In the latter case, for any function [math]g \in L_2([0,1])[/math], it holds
The coefficients [math]\theta_{jk}[/math] are called wavelet coefficients of [math]g[/math]. The simplest example is given by the Haar system obtained by taking [math]\psi[/math] to be the following piecewise constant function (see Figure). We will not give more details about wavelets here but simply point the interested reader to [4].
[[File: | 400px | thumb | The Haar mother wavelet ]]
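For the Haar system specifically, orthonormality of the [math]\psi_{j,k}[/math] can be checked directly, since the functions are piecewise constant. A short numpy sketch (the selected pairs [math](j,k)[/math] are arbitrary):

```python
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x)
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def psi(j, k, x):
    """Rescaled and translated wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * haar_mother(2.0 ** j * x - k)

# Check orthonormality on [0, 1] by midpoint quadrature on a dyadic grid.
x = (np.arange(2**16) + 0.5) / 2**16
pairs = [(0, 0), (1, 0), (1, 1), (2, 3)]
G = np.array([[np.mean(psi(j, k, x) * psi(jj, kk, x))
               for (jj, kk) in pairs] for (j, k) in pairs])
```

Because every function involved is constant on dyadic intervals, the quadrature is exact and `G` equals the [math]4\times4[/math] identity.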
Sobolev classes and ellipsoids
We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that [math]f^{(k)}[/math] denotes the [math]k[/math]-th derivative of [math]f[/math].
Fix parameters [math]\beta \in \{1, 2, \dots\}[/math] and [math]L \gt 0[/math]. The Sobolev class of functions [math]W(\beta,L)[/math] is defined by
Any function [math]f \in W(\beta, L)[/math] can be represented[Notes 1] as its Fourier expansion along the trigonometric basis:
where [math]\theta^*=\{\theta_j^*\}_{j \ge 1}[/math] is in the space of squared summable sequences [math]\ell_2(\N)[/math] defined by
For any [math]\beta \gt 0[/math], define the coefficients
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.
Fix [math]\beta\ge 1[/math] and [math]L \gt 0[/math] and let [math]\{\varphi_j\}_{j \ge 1}[/math] denote the trigonometric basis of [math]L_2([0,1])[/math]. Moreover, let [math]\{a_j\}_{j\ge 1}[/math] be defined as in\eqref{EQ:defaj}. A function [math]f \in W(\beta,L)[/math] can be represented as
Let us first define the Fourier coefficients [math]\{s_k(j)\}_{k \ge 1}[/math] of the [math]j[/math]th derivative [math]f^{(j)}[/math] of [math]f[/math] for [math]j=1, \ldots, \beta[/math]:
It can actually be shown that the converse is true, that is, any function with Fourier coefficients in [math]\Theta(\beta, Q)[/math] belongs to [math]W(\beta,L)[/math], but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write [math]f \in \Theta(\beta, Q)[/math] in this case and consider any real value for [math]\beta[/math].
The Sobolev ellipsoids enjoy the following properties
- For any [math]Q \gt 0[/math],
[[math]] 0 \lt \beta' \lt \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). [[/math]]
- For any [math]Q \gt 0[/math],
[[math]] f \in \Theta(\beta, Q) \ \text{with}\ \beta \gt \frac12 \ \Rightarrow \ f\ \text{is continuous}. [[/math]]
The proof is left as an exercise (Problem). It turns out that the first [math]n-1[/math] functions in the trigonometric basis are not only orthonormal with respect to the inner product of [math]L_2[/math], but also with respect to the inner product associated with the fixed design, [math]\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)[/math], when the design is chosen to be regular, i.e., [math]X_i=(i-1)/n[/math], [math]i=1, \ldots, n[/math].
Assume that [math]\{X_1, \ldots, X_n\}[/math] is the regular design, i.e., [math]X_i=(i-1)/n[/math]. Then, for any [math]M\le n-1[/math], the design matrix [math]\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}[/math] satisfies the [math]\textsf{ORT}[/math] condition.
Fix [math]j,j'\in \{1, \ldots, n-1\}[/math], [math]j\neq j'[/math], and consider the inner product [math]\varphi_j^\top \varphi_{j'}[/math]. Write [math]k_{j}=\lfloor j/2\rfloor[/math] for the integer part of [math]j/2[/math] and define the vectors [math]a, b, a', b' \in \R^n[/math] with coordinates such that [math]e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}[/math] and [math]e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}[/math] for [math] s \in \{0, \dots, n-1\} [/math]. It holds that
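The conclusion of this lemma is easy to verify numerically: on the regular design, the Gram matrix of the first [math]n-1[/math] trigonometric basis functions equals [math]n I_{n-1}[/math], which is the [math]\textsf{ORT}[/math] condition. A numpy sketch with an illustrative choice of [math]n[/math]:

```python
import numpy as np

n = 64
X = np.arange(n) / n          # regular design: X_i = (i - 1)/n
M = n - 1                     # the lemma requires M <= n - 1

def trig(j, x):
    """phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])
# ORT: Phi^T Phi = n I_M on the regular design.
gram = Phi.T @ Phi
```

The check fails for [math]M = n[/math], which is why the lemma stops at [math]n-1[/math] basis functions.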
Integrated squared error
As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if [math]\theta \in \Theta(\beta, Q)[/math], then [math]a_j^2\theta_j^2 \to 0[/math] as [math]j \to \infty[/math] so that [math]|\theta_j|=o(j^{-\beta})[/math]. Therefore, the [math]\theta_j[/math]s decay polynomially to zero and it makes sense to approximate [math]f[/math] by its truncated Fourier series
for any fixed [math]M[/math]. This truncation leads to a systematic error that vanishes as [math]M \to \infty[/math]. We are interested in understanding the rate at which this happens.
The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter [math]M[/math] and the smoothness [math]\beta[/math].
Note that for any [math]\theta \in \Theta(\beta, Q)[/math], if [math]\beta \gt 1/2[/math], then
To prove the second part of the lemma, observe that
Note that the truncated Fourier series [math]\varphi_{\theta^*}^M[/math] is an oracle: this is what we see when we view [math]f[/math] through the lens of functions with only low frequency harmonics. To estimate [math]\varphi_{\theta^*}^M[/math], consider the estimator [math]\varphi_{\thetals}[/math] where
which should be such that [math]\varphi_{\thetals}[/math] is close to [math]\varphi_{\theta^*}[/math]. For this estimator, we have proved (Theorem) an oracle inequality for the [math]\MSE[/math] that is of the form
It yields
where we used Lemma in the last equality. Together with\eqref{EQ:bias2} and Young's inequality [math]2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0[/math] for any [math]\alpha \gt 0[/math], we get
for some positive constant [math]C[/math] when [math]\theta^* \in \Theta(\beta, Q)[/math]. As a result,
for any [math]\alpha \in (0,1)[/math]. Since Lemma implies [math]|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2[/math], we have proved the following theorem.
Fix [math]\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Moreover, let [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] and [math]n[/math] be large enough so that [math]M \le n-1[/math]. Then the least squares estimator [math]\thetals[/math] defined in\eqref{EQ:defthetaLSmis} with [math]\{\varphi_j\}_{j=1}^M[/math] being the trigonometric basis, satisfies with probability [math]1-\delta[/math], for [math]n[/math] large enough,
Choosing [math]\alpha=1/2[/math] for example and absorbing [math]Q[/math] in the constants, we get from\eqref{EQ:almostLSNP} and Lemma that for [math]M \le n-1[/math],
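The resulting procedure is simple to implement: under [math]\textsf{ORT}[/math], the least squares coefficients are just rescaled inner products of [math]Y[/math] with the basis columns, [math]\thetals = \frac1n \Phi^\top Y[/math]. A numpy sketch with an illustrative smooth target and [math]\beta=2[/math] (the target function and constants are hypothetical choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma = 2.0, 0.5
n = 512
X = np.arange(n) / n                                        # regular design
f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)
Y = f(X) + sigma * rng.standard_normal(n)

def trig(j, x):
    """Trigonometric basis with the usual indexing."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

# Optimal truncation level M = ceil(n^{1/(2 beta + 1)}).
M = int(np.ceil(n ** (1 / (2 * beta + 1))))
Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])

theta_hat = Phi.T @ Y / n                                   # least squares under ORT
mse = np.mean((Phi @ theta_hat - f(X)) ** 2)                # in-sample estimation error
```

The error splits into the squared bias of the truncation and a variance of order [math]\sigma^2 M/n[/math], matching the tradeoff behind the choice of [math]M[/math].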
Adaptive estimation
The rate attained by the projection estimator [math]\varphi_{\thetals}[/math] with [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter [math]\beta[/math], which is typically unknown, in order to determine the truncation level [math]M[/math]. The purpose of adaptive estimation is precisely to adapt to the unknown [math]\beta[/math], that is, to build an estimator that does not depend on [math]\beta[/math] and yet attains a rate of the order of [math]Cn^{-\frac{2\beta}{2\beta+1}}[/math] (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis} respectively. In view of Lemma, the design matrix [math]\Phi[/math] actually satisfies the assumption [math]\textsf{ORT}[/math] when we work with the trigonometric basis. This has two useful implications:
- Both estimators are actually thresholding estimators and can therefore be implemented efficiently
- The condition [math]\textsf{INC($k$)}[/math] is automatically satisfied for any [math]k \ge 1[/math].
These observations lead to the following corollary.
Fix [math]\beta \gt (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and [math]n[/math] large enough to ensure [math]n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil[/math], and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Let [math]\{\varphi_j\}_{j=1}^{n-1}[/math] be the trigonometric basis. Denote by [math]\varphi^{n-1}_{\thetabic}[/math] (resp. [math]\varphi^{n-1}_{\thetalasso}[/math]) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over [math]\R^{n-1}[/math] with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then [math]\varphi^{n-1}_{\hat \theta}[/math], where [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], satisfies with probability [math]1-\delta[/math],
For [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], adapting the proofs of Theorem for the BIC estimator and Theorem for the Lasso estimator, for any [math]\theta \in \R^{n-1}[/math], with probability [math]1-\delta[/math]
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [5] (Chapter 3)). The reason why we get this extra logarithmic factor here is that we use a hammer that is too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.
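Under [math]\textsf{ORT}[/math], the BIC and Lasso estimators reduce to hard and soft thresholding, respectively, of the empirical Fourier coefficients, which is what makes the adaptive procedure cheap to run. A numpy sketch with an illustrative universal-type threshold (the target function, threshold constant, and sample size are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 512, 0.5
X = np.arange(n) / n                                        # regular design
f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)
Y = f(X) + sigma * rng.standard_normal(n)

def trig(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    g = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * g(2 * np.pi * k * x)

M = n - 1
Phi = np.column_stack([trig(j, X) for j in range(1, M + 1)])
coef = Phi.T @ Y / n                          # empirical Fourier coefficients (ORT)

tau = 2 * sigma * np.sqrt(np.log(n) / n)      # illustrative universal-type threshold
theta_bic = np.where(np.abs(coef) > tau, coef, 0.0)                 # hard thresholding
theta_lasso = np.sign(coef) * np.maximum(np.abs(coef) - tau, 0.0)   # soft thresholding

mse_hard = np.mean((Phi @ theta_bic - f(X)) ** 2)
mse_full = np.mean((Phi @ coef - f(X)) ** 2)  # no truncation: pure variance, much worse
```

Note that thresholding keeps whichever frequencies are large, without knowing [math]\beta[/math]; "holes" in the spectrum are allowed, which is exactly the extra flexibility discussed above.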
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
Notes
References
- 1.0 1.1 Pisier, Gilles (1981). "Remarques sur un résultat non publié de B. Maurey". Séminaire d'Analyse fonctionnelle, École Polytechnique.
- "Exponential screening and optimal rates of sparse estimation" (2011). Ann. Statist. 39. John Wiley & Sons Inc.. doi: .
- van Handel, Ramon (2016). "Probability in High Dimension". APC 550 Lecture Notes, Princeton University.
- Mallat, Stéphane (2009). "A Wavelet Tour of Signal Processing: The Sparse Way". Academic Press.
- Tsybakov, Alexandre B. (2009). "Introduction to Nonparametric Estimation". Springer Series in Statistics. Springer, New York.