Misspecified Linear Models

[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} \newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} 
\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} \newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} [/math]

Arguably, the strongest assumption that we made in Chapter Linear Regression Model is that the regression function [math]f(x)[/math] is of the form [math]f(x)=x^\top \theta^*[/math]. What if this assumption is violated? In reality, we do not really believe in the linear model, and we hope that good statistical methods are robust to deviations from it. This is the problem of model misspecification, and in particular of misspecified linear models. Throughout this chapter, we assume the following model:

[[math]] \begin{equation} \label{EQ:regmodgen} Y_i=f(X_i)+\eps_i,\quad i=1, \ldots, n\,, \end{equation} [[/math]]

where [math]\eps=(\eps_1, \ldots, \eps_n)^\top [/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math]. Here [math]X_i \in \R^d[/math]. When dealing with fixed design, it will be convenient to consider the vector [math]g \in \R^n[/math] defined for any function [math]g \,:\, \R^d \to \R[/math] by [math]g=(g(X_1), \ldots, g(X_n))^\top[/math]. In this case, we can write for any estimator [math]\hat f \in \R^n[/math] of [math]f[/math],

[[math]] \MSE(\hat f)=\frac1n|\hat f - f|_2^2\,. [[/math]]

Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: [math]\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso[/math]. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of [math]f[/math] if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.

Oracle inequalities


As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator [math]\thetals[/math], we constrain ourselves to estimating functions that are of the form [math]x\mapsto x^\top \theta[/math], even though [math]f[/math] itself may not be of this form. Therefore, the oracle [math]\hat f[/math] is the linear function that is the closest to [math]f[/math]. Rather than trying to approximate [math]f[/math] by a linear function [math]f(x)\approx \theta^\top x[/math], we make the model a bit more general and consider a dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] of functions where [math]\varphi_j\,:\, \R^d \to \R[/math]. In this case, we can actually remove the assumption that [math]X \in \R^d[/math]. Indeed, the goal is now to estimate [math]f[/math] using a linear combination of the functions in the dictionary:

[[math]] f \approx \varphi_\theta:=\sum_{j=1}^M \theta_j \varphi_j\,. [[/math]]

If [math]M=d[/math] and [math]\varphi_j(X)=X^{(j)}[/math] returns the [math]j[/math]th coordinate of [math]X \in \R^d[/math] then the goal is to approximate [math]f(x)[/math] by [math]\theta^\top x[/math]. Nevertheless, the use of a dictionary allows for a much more general framework.
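For concreteness, here is a minimal numerical sketch (in Python, with an illustrative regression function and a small, hypothetical dictionary of our choosing) of how the dictionary turns the problem back into a linear one: the [math]n\times M[/math] matrix with entries [math]\varphi_j(X_i)[/math] plays the role of a design matrix, and fitting [math]\varphi_\theta[/math] by least squares is ordinary linear regression in that matrix.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 1, size=n)                       # scalar design points
f = lambda x: np.sin(2 * np.pi * x) + x ** 2        # unknown regression function (illustrative)
Y = f(X) + 0.1 * rng.standard_normal(n)             # noisy observations

# A small, hypothetical dictionary: monomials and a few cosines
dictionary = [lambda x, k=k: x ** k for k in range(3)] + \
             [lambda x, k=k: np.cos(2 * np.pi * k * x) for k in range(1, 4)]
M = len(dictionary)

# Design matrix Phi_{ij} = varphi_j(X_i); then varphi_theta evaluated at the design is Phi @ theta
Phi = np.column_stack([phi(X) for phi in dictionary])

theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)  # least squares over the dictionary
print(M, np.mean((Phi @ theta_ls - f(X)) ** 2))     # MSE of varphi_theta_ls with respect to the true f
</syntaxhighlight>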

Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define

  • The least squares estimator:
    [[math]] \begin{equation} \label{EQ:defthetaLSmis} \thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \end{equation} [[/math]]
  • The least squares estimator constrained to [math]K \subset \R^M[/math]:
    [[math]] \thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 [[/math]]
  • The BIC estimator:
    [[math]] \begin{equation} \label{EQ:defBICmis} \thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\} \end{equation} [[/math]]
  • The Lasso estimator:
    [[math]] \begin{equation} \label{EQ:defLassomis} \thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\} \end{equation} [[/math]]
Definition

Let [math]R(\cdot)[/math] be a risk function and let [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] be a dictionary of functions from [math]\R^d[/math] to [math]\R[/math]. Let [math]K[/math] be a subset of [math]\R^M[/math]. The oracle on [math]K[/math] with respect to [math]R[/math] is defined by [math]\varphi_{\bar \theta}[/math], where [math]\bar \theta \in K[/math] is such that

[[math]] R(\varphi_{\bar \theta}) \le R(\varphi_\theta) \,, \qquad \forall\, \theta \in K\,. [[/math]]
Moreover, [math]R_{K}=R(\varphi_{\bar \theta})[/math] is called the oracle risk on [math]K[/math]. An estimator [math]\hat f[/math] is said to satisfy an oracle inequality (over [math]K[/math]) with remainder term [math]\phi[/math] in expectation (resp. with high probability) if there exists a constant [math]C\ge 1[/math] such that

[[math]] \E R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M}(K)\,, [[/math]]
or

[[math]] \p\big\{R(\hat f) \le C\inf_{\theta \in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K) \big\} \ge 1-\delta \,, \qquad \forall \ \delta \gt 0 [[/math]]
respectively. If [math]C=1[/math], the oracle inequality is sometimes called exact.

Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.

Oracle inequality for the least squares estimator

While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of this extension is essentially already captured by the analysis of the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.

Theorem

Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the least squares estimator [math]\thetals[/math] satisfies for some numerical constant [math]C \gt 0[/math],

[[math]] \MSE(\varphi_{\thetals})\le \inf_{\theta \in \R^M}\MSE(\varphi_\theta)+C\frac{\sigma^2M}{n}\log(1/\delta) [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

Note that by definition

[[math]] |Y-\varphi_{\thetals}|_2^2 \le |Y-\varphi_{\bar \theta}|_2^2 [[/math]]
where [math]\varphi_{\bar \theta}[/math] denotes the orthogonal projection of [math]f[/math] onto the linear span of [math]\varphi_1, \ldots, \varphi_M[/math]. Since [math]Y=f+\eps[/math], we get

[[math]] |f-\varphi_{\thetals}|_2^2 \le |f-\varphi_{\bar \theta}|_2^2+2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta}) [[/math]]
Moreover, by Pythagoras's theorem, we have

[[math]] |f-\varphi_{\thetals}|_2^2 - |f-\varphi_{\bar \theta}|_2^2=|\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\,. [[/math]]
It yields

[[math]] |\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\le 2\eps^\top(\varphi_{\thetals}-\varphi_{\bar \theta})\,. [[/math]]
Using the same steps as the ones following equation \eqref{EQ:bound_ls_1} in the well-specified case, we get

[[math]] |\varphi_{\thetals}-\varphi_{\bar \theta}|_2^2\lesssim \frac{\sigma^2M}{n}\log(1/\delta) [[/math]]
with probability [math]1-\delta[/math]. The result of the theorem follows.
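As a sanity check, the following small simulation (with an arbitrary non-linear [math]f[/math], a monomial dictionary and Gaussian noise, all illustrative choices) compares the average excess of [math]\MSE(\varphi_{\thetals})[/math] over the oracle risk to [math]\sigma^2M/n[/math]; for this fixed-design projection estimator the expected excess equals exactly [math]\sigma^2M/n[/math], in line with the high-probability bound above.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, M, sigma = 200, 5, 0.5
X = np.arange(n) / n
f_vec = np.exp(X)                                   # a non-linear regression function (illustrative)

Phi = np.column_stack([X ** j for j in range(M)])   # monomial dictionary, Phi_{ij} = X_i^(j-1)

# Oracle: projection of f onto the span of the dictionary (requires knowledge of f)
theta_bar, *_ = np.linalg.lstsq(Phi, f_vec, rcond=None)
oracle_mse = np.mean((Phi @ theta_bar - f_vec) ** 2)

excess = []
for _ in range(500):
    Y = f_vec + sigma * rng.standard_normal(n)
    theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    excess.append(np.mean((Phi @ theta_ls - f_vec) ** 2) - oracle_mse)

# the average excess equals sigma^2 M / n for this projection estimator
print(oracle_mse, np.mean(excess), sigma ** 2 * M / n)
</syntaxhighlight>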

Sparse oracle inequality for the BIC estimator

The analysis for more complicated estimators such as the BIC in Chapter Linear Regression Model allows us to derive oracle inequalities for these estimators.

Theorem

Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, for any [math]\alpha \in (0,1)[/math], the BIC estimator [math]\thetabic[/math] defined in\eqref{EQ:defBICmis} with regularization parameter given by\eqref{EQ:taubicmis} satisfies, for some numerical constant [math]C \gt 0[/math],

[[math]] \MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{\frac{1+\alpha}{1-\alpha}\MSE(\varphi_\theta)+\frac{C\sigma^2}{n\alpha(1-\alpha)}\Big[|\theta|_0\log (eM)+ \log(1/\delta)\Big]\Big\} [[/math]]

with probability at least [math]1-\delta[/math].


Show Proof

Recall that the proof of Theorem for the BIC estimator in the well-specified case begins as follows:

[[math]] \frac1n|Y-\varphi_{\thetabic}|_2^2 + \tau^2|\thetabic|_0 \le \frac1n|Y-\varphi_{\theta}|_2^2 + \tau^2|\theta|_0\,. [[/math]]
This is true for any [math]\theta \in \R^M[/math]. It implies

[[math]] |f-\varphi_{\thetabic}|_2^2 + n\tau^2|\thetabic|_0 \le |f-\varphi_{ \theta}|_2^2+2\eps^\top(\varphi_{\thetabic}-\varphi_{ \theta}) + n\tau^2|\theta|_0\,. [[/math]]
Note that if [math]\thetabic= \theta[/math], the result is trivial. Otherwise,

[[math]] \begin{align*} 2\eps^\top(\varphi_{\thetabic}-\varphi_\theta) &=2\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)|\varphi_{\thetabic}-\varphi_\theta|_2\\ &\le \frac2\alpha\Big[\eps^\top\Big(\frac{\varphi_{\thetabic}-\varphi_\theta}{|\varphi_{\thetabic}-\varphi_\theta|_2}\Big)\Big]^2 + \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\,, \end{align*} [[/math]]
where we used Young's inequality [math]2ab\le \frac2\alpha a^2+\frac{\alpha}{2} b^2[/math], which is valid for [math]a,b \in \R[/math] and [math]\alpha \gt 0[/math]. Next, since

[[math]] \frac{\alpha}{2} |\varphi_{\thetabic}-\varphi_\theta|_2^2\le \alpha|\varphi_{\thetabic}-f|_2^2+\alpha|\varphi_\theta-f|_2^2\,, [[/math]]
we get, taking [math]\alpha=1/2[/math] for concreteness (a general [math]\alpha \in (0,1)[/math] is handled in the same way),

[[math]] \begin{align*} \frac12|\varphi_{\thetabic}-f|_2^2 &\le \frac32|\varphi_{ \theta}-f|_2^2+n\tau^2|\theta|_0 +4\big[\eps^\top\cU(\varphi_{\thetabic}-\varphi_{\theta})\big]^2- n\tau^2|\thetabic|_0\\ &\le \frac32|\varphi_{ \theta}-f|_2^2+2n\tau^2|\theta|_0+4\big[\eps^\top\cU(\varphi_{\thetabic -\theta})\big]^2- n\tau^2|\thetabic-\theta|_0. \end{align*} [[/math]]
We conclude as in the proof of Theorem.


The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error [math]\MSE(\varphi_\theta)[/math] and the complexity of [math]\theta[/math] as measured by its sparsity. In particular, this result, which is sometimes called sparse oracle inequality, implies the following oracle inequality. Define the oracle [math]\bar \theta[/math] to be such that

[[math]] \MSE(\varphi_{\bar \theta})=\min_{\theta \in \R^M}\MSE(\varphi_\theta). [[/math]]

Then, with probability at least [math]1-\delta[/math],

[[math]] \MSE(\varphi_{\thetabic})\le 3\MSE(\varphi_{\bar \theta})+\frac{C\sigma^2}{n}\Big[|\bar \theta|_0\log (eM)+ \log(1/\delta)\Big]\,. [[/math]]

If the linear model happens to be correct, then we simply have [math]\MSE(\varphi_{\bar \theta})=0[/math].
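Because of the [math]\ell_0[/math] penalty, \eqref{EQ:defBICmis} is not a convex program; for a small dictionary it can be computed by exhaustive search over supports. The sketch below does exactly that, with an illustrative design, a hypothetical choice of [math]\tau^2[/math] of order [math]\sigma^2\log(eM)/n[/math], and a target that happens to be sparse in the dictionary (the helper name is ours).

<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def bic_estimator(Phi, Y, tau2):
    """Exhaustive-search BIC: minimize |Y - Phi theta|_2^2 / n + tau2 * |theta|_0."""
    n, M = Phi.shape
    best_theta, best_crit = np.zeros(M), np.mean(Y ** 2)        # start from the empty support
    for k in range(1, M + 1):
        for S in combinations(range(M), k):
            cols = list(S)
            theta_S, *_ = np.linalg.lstsq(Phi[:, cols], Y, rcond=None)
            crit = np.mean((Y - Phi[:, cols] @ theta_S) ** 2) + tau2 * k
            if crit < best_crit:
                best_crit = crit
                best_theta = np.zeros(M)
                best_theta[cols] = theta_S
    return best_theta

# toy usage: f is 2-sparse in the dictionary, tau^2 is of order sigma^2 log(eM) / n
rng = np.random.default_rng(2)
n, M, sigma = 100, 8, 0.5
Phi = rng.standard_normal((n, M))
f_vec = 2.0 * Phi[:, 0] - 1.5 * Phi[:, 3]
Y = f_vec + sigma * rng.standard_normal(n)
tau2 = 4 * sigma ** 2 * np.log(np.e * M) / n
print(np.nonzero(bic_estimator(Phi, Y, tau2))[0])    # typically recovers the support {0, 3}
</syntaxhighlight>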

Sparse oracle inequality for the Lasso

To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the [math]n\times M[/math] matrix [math]\Phi[/math] with elements [math]\Phi_{i,j}=\varphi_j(X_i)[/math].

Theorem

Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Moreover, assume that there exists an integer [math]k[/math] such that the matrix [math]\Phi[/math] satisfies assumption [math]\textsf{INC$(k)$}[/math]. Then, the Lasso estimator [math]\thetalasso[/math] with regularization parameter given by

[[math]] \begin{equation} \label{EQ:taulassomis} 2\tau=8\sigma\sqrt{\frac{2\log(2M)}{n}}+8\sigma\sqrt{\frac{2\log(1/\delta)}{n}} \end{equation} [[/math]]
satisfies for some numerical constant [math]C[/math],

[[math]] \begin{align*} \MSE(\varphi_{\thetalasso})\le \inf_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\Big\{ \MSE(\varphi_\theta) &+\frac{C\sigma^2}{n}|\theta|_0\log (eM/\delta)\Big\} \end{align*} [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

From the definition of [math]\thetalasso[/math], for any [math]\theta \in \R^M[/math],

[[math]] \begin{align*} \frac1n|Y-\varphi_{\thetalasso}|_2^2 \le \frac1n|Y-\varphi_{\theta}|_2^2 + 2\tau|\theta|_1- 2\tau|\thetalasso|_1 \,. \end{align*} [[/math]]
Expanding the squares, adding [math]\tau|\thetalasso-\theta|_1[/math] on each side and multiplying by [math]n[/math], we get

[[math]] \begin{align} \leadeq{|\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1}\\ \le {} & 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) +n\tau|\thetalasso-\theta|_1+ 2n\tau|\theta|_1- 2n\tau|\thetalasso|_1 \,. \label{EQ:pr:TH:lassofastmis1} \end{align} [[/math]]
Next, note that \textsf{INC}[math](k)[/math] for any [math]k \ge 1[/math] implies that [math]|\varphi_j|_2 \le 2\sqrt{n}[/math] for all [math]j=1, \ldots, M[/math]. Applying Hölder's inequality, we get that, with probability [math]1-\delta[/math],

[[math]] 2\eps^\top(\varphi_{\thetalasso}-\varphi_{\theta}) \le \frac{n\tau}2|\thetalasso-\theta|_1. [[/math]]
Therefore, taking [math]S=\supp(\theta)[/math] to be the support of [math]\theta[/math], we get from \eqref{EQ:pr:TH:lassofastmis1} that

[[math]] \begin{align} |\varphi_{\thetalasso}-f|_2^2-|\varphi_{\theta}-f|_2^2+n\tau|\thetalasso-\theta|_1 &\le 2n\tau|\thetalasso-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso|_1\nonumber \\ &= 2n\tau|\thetalasso_S-\theta|_1+2n\tau|\theta|_1- 2n\tau|\thetalasso_S|_1\nonumber\\ &\le 4n\tau|\thetalasso_S-\theta|_1\label{EQ:pr:TH:lassofastmis2} \end{align} [[/math]]
with probability [math]1-\delta[/math]. It implies that either [math]\MSE(\varphi_{\thetalasso})\le \MSE(\varphi_\theta)[/math] or that

[[math]] |\thetalasso_{S^c}-\theta_{S^c}|_1\le 3 |\thetalasso_{S}-\theta_{S}|_1\,. [[/math]]
so that [math]\thetalasso-\theta[/math] satisfies the cone condition \eqref{EQ:conecond}. Using now the Cauchy-Schwarz inequality and Lemma, respectively, and assuming that [math]|\theta|_0\le k[/math], we get

[[math]] 4n\tau|\thetalasso_S-\theta|_1 \le 4n\tau\sqrt{|S|}|\thetalasso_S-\theta|_2 \le 4\tau\sqrt{2n|\theta|_0}|\varphi_{\thetalasso}-\varphi_{\theta}|_2\,. [[/math]]
Using now the inequality [math]2ab\le \frac2{\alpha} a^2+\frac{\alpha}{2} b^2[/math], we get

[[math]] \begin{align*} 4n\tau|\thetalasso_S-\theta|_1& \le \frac{16\tau^2n|\theta|_0}{\alpha}+ \frac{\alpha}{2}|\varphi_{\thetalasso}-\varphi_{\theta}|_2^2\\ &\le \frac{16\tau^2n|\theta|_0}{\alpha}+ \alpha|\varphi_{\thetalasso}-f|_2^2+\alpha|\varphi_{\theta}-f|_2^2 \end{align*} [[/math]]
Combining this result with\eqref{EQ:pr:TH:lassofastmis1} and\eqref{EQ:pr:TH:lassofastmis2}, we find

[[math]] (1-\alpha)\MSE(\varphi_{\thetalasso}) \le (1+\alpha)\MSE(\varphi_{\theta})+ \frac{16\tau^2|\theta|_0}{\alpha}\,. [[/math]]
To conclude the proof, it only remains to divide by [math]1-\alpha[/math] on both sides of the above inequality and take [math]\alpha=1/2[/math].
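The Lasso objective \eqref{EQ:defLassomis} is convex and can be minimized, for instance, by proximal gradient descent (ISTA). Below is a minimal sketch, assuming a generic dictionary design matrix and using the regularization level \eqref{EQ:taulassomis}; the data-generating choices and helper names are ours and purely illustrative.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(Phi, Y, tau, n_iter=5000):
    """Minimize |Y - Phi theta|_2^2 / n + 2 tau |theta|_1 by proximal gradient descent."""
    n, M = Phi.shape
    L = 2 * np.linalg.norm(Phi, 2) ** 2 / n          # Lipschitz constant of the smooth part
    theta = np.zeros(M)
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ theta - Y) / n
        theta = soft_threshold(theta - grad / L, 2 * tau / L)
    return theta

# toy usage with the regularization level of the theorem
rng = np.random.default_rng(3)
n, M, sigma, delta = 200, 50, 1.0, 0.1
Phi = rng.standard_normal((n, M))
Y = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 1] + sigma * rng.standard_normal(n)
two_tau = 8 * sigma * (np.sqrt(2 * np.log(2 * M) / n) + np.sqrt(2 * np.log(1 / delta) / n))
theta_lasso = lasso_ista(Phi, Y, two_tau / 2)
print(np.nonzero(np.abs(theta_lasso) > 1e-8)[0])     # typically the two active coordinates (shrunk)
</syntaxhighlight>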

Maurey's argument

If there is no sparse [math]\theta[/math] such that [math]\MSE(\varphi_{\theta})[/math] is small, Theorem is useless whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors [math]\theta[/math] with coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if [math]\theta[/math] has a small [math]\ell_1[/math] norm but is not sparse. For such [math]\theta[/math], the Lasso estimator still enjoys slow rates as in Theorem, which can be easily extended to the misspecified case (see Problem). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector [math]\theta \in \R^M[/math] such that [math]|\theta|_1\le 1[/math], there exists a vector [math]\theta'[/math] that is sparse and for which [math]\MSE(\varphi_{\theta'})[/math] is not much larger than [math]\MSE(\varphi_\theta)[/math]. The following theorem quantifies exactly the tradeoff between sparsity and [math]\MSE[/math]. It is often attributed to B. Maurey and was published by Pisier [1]. This is why it is referred to as Maurey's argument.

Theorem

Let [math]\{\varphi_1, \dots, \varphi_M\}[/math] be a dictionary normalized in such a way that

[[math]] \max_{1\le j \le M}|\varphi_j|_2\le D\sqrt{n}\,. [[/math]]
Then for any integer [math]k[/math] such that [math]1\le k \le M[/math] and any positive [math]R[/math], we have

[[math]] \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}\MSE(\varphi_\theta) \le \min_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}\MSE(\varphi_\theta) + \frac{D^2 R^2}{k}\,. [[/math]]


Show Proof

Define

[[math]] \bar \theta \in \argmin_{\substack{\theta \in \R^M\\ |\theta|_1\le R}}|\varphi_\theta -f|_2^2 [[/math]]
and assume without loss of generality that [math]|\bar \theta_1| \ge |\bar \theta_2|\ge \ldots \ge |\bar \theta_M|[/math]. Let now [math]U \in \R^n[/math] be a random vector with values in [math]\{0, \pm R\varphi_1, \ldots, \pm R \varphi_M\}[/math] defined by

[[math]] \begin{alignat*}{2} \p(U=R\,\mathrm{sign}(\bar \theta_j)\varphi_j)&=\frac{|\bar \theta_j|}{R}\,, &\quad &j=1, \ldots, M,\\ \p(U=0)&=1-\frac{|\bar \theta|_1}{R}\,. \end{alignat*} [[/math]]
Note that [math]\E[U]=\varphi_{\bar \theta}[/math] and [math]|U|_2 \le RD \sqrt{n}[/math]. Let now [math]U_1, \ldots, U_k[/math] be [math]k[/math] independent copies of [math]U[/math] and define their average
[[math]] \bar U=\frac{1}{k}\sum_{i=1}^kU_i\,. [[/math]]
Note that [math]\bar U=\varphi_{\tilde \theta}[/math] for some [math]\tilde \theta \in\R^M[/math] such that [math]|\tilde \theta|_0\le k[/math]. Moreover, using the Pythagorean Theorem,

[[math]] \begin{align*} \E|f-\bar U|_2^2&=\E|f-\varphi_{\bar \theta}+\varphi_{\bar \theta}-\bar U|_2^2\\ &=|f-\varphi_{\bar \theta}|_2^2+\E|\varphi_{\bar \theta}-\bar U|_2^2\\ &= |f-\varphi_{\bar \theta}|_2^2 + \frac{\E|U-\E[U]|_2^2}{k}\\ &\le |f-\varphi_{\bar \theta}|_2^2 + \frac{(RD\sqrt{n})^2}{k} \end{align*} [[/math]]
To conclude the proof, note that

[[math]] \E|f-\bar U|_2^2 =\E|f-\varphi_{\tilde \theta}|_2^2 \ge \min_{\substack{\theta \in \R^M\\ |\theta|_0\le k}}|f-\varphi_{\theta}|_2^2 [[/math]]
and divide by [math]n[/math].
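The probabilistic construction in the proof is easy to simulate: draw [math]k[/math] i.i.d. copies of [math]U[/math] and average them. The sketch below (with an illustrative target, dictionary, and a vector [math]\bar\theta[/math] on the [math]\ell_1[/math] ball chosen arbitrarily rather than as the true oracle) compares the MSE of the resulting [math]k[/math]-sparse vector to [math]\MSE(\varphi_{\bar\theta})+D^2R^2/k[/math].

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n, M, k = 500, 200, 10
X = np.arange(n) / n
Phi = np.column_stack([np.cos(2 * np.pi * j * X) for j in range(1, M + 1)])  # |phi_j|_2^2 = n/2, so D <= 1
f_vec = np.sin(2 * np.pi * X) * np.exp(X)                                    # illustrative target

# a (non-oracle) vector on the l1-ball of radius R = 1
theta_bar = rng.standard_normal(M)
theta_bar /= np.abs(theta_bar).sum()
R = 1.0

# Maurey sampling: U = R * sign(theta_bar_j) * phi_j with probability |theta_bar_j| / R
probs = np.abs(theta_bar) / R                        # here the probabilities sum to 1, so U is never 0
support = rng.choice(M, size=k, p=probs)
U_bar = np.mean(R * np.sign(theta_bar[support])[None, :] * Phi[:, support], axis=1)

mse_bar = np.mean((Phi @ theta_bar - f_vec) ** 2)
mse_sparse = np.mean((U_bar - f_vec) ** 2)           # U_bar = varphi_theta_tilde with |theta_tilde|_0 <= k
print(mse_sparse - mse_bar, R ** 2 / k)              # the excess is of order D^2 R^2 / k (in expectation)
</syntaxhighlight>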

Maurey's argument implies the following corollary.

Corollary

Assume that the assumptions of Theorem hold and that the dictionary [math]\{\varphi_1, \dots, \varphi_M\}[/math] is normalized in such a way that

[[math]] \max_{1\le j \le M}|\varphi_j|_2\le \sqrt{n}\,. [[/math]]
Then there exists a constant [math]C \gt 0[/math] such that the BIC estimator satisfies

[[math]] \begin{align*} \MSE(\varphi_{\thetabic})\le \inf_{\theta \in \R^M}\Big\{2\MSE(\varphi_\theta)+C\Big[&\frac{\sigma^2|\theta|_0\log (eM)}{n}\wedge \sigma|\theta|_1\sqrt{\frac{\log (eM)}{n}}\Big]\Big\}\\ &+ C\frac{\sigma^2\log(1/\delta)}{n} \end{align*} [[/math]]
with probability at least [math]1-\delta[/math].


Show Proof

Choosing [math]\alpha=1/3[/math] in Theorem yields

[[math]] \MSE(\varphi_{\thetabic})\le 2\inf_{\theta \in \R^M}\Big\{\MSE(\varphi_\theta)+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\}+C\frac{\sigma^2\log(1/\delta)}{n} [[/math]]
Let [math]\theta' \in \R^M[/math]. It follows from Maurey's argument that for any [math] k \in [M] [/math], there exists [math]\theta = \theta(\theta', k) \in \R^M[/math] such that [math]|\theta|_0 = k[/math] and

[[math]] \MSE(\varphi_{\theta}) \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k} [[/math]]
It implies that

[[math]] \MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} [[/math]]
and furthermore that

[[math]] \inf_{\theta \in \R^M}\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n}\Big\} \le \MSE(\varphi_{\theta'})+\frac{2|\theta'|_1^2}{k}+C\frac{\sigma^2 k\log (eM)}{n} [[/math]]
Since the above bound holds for any [math] \theta' \in \mathbb{R}^M [/math] and [math] k \in [M] [/math], we can take an infimum with respect to both [math]\theta'[/math] and [math]k[/math] on the right-hand side to get

[[math]] \begin{align*} \inf_{\theta \in \R^M}&\Big\{\MSE(\varphi_{\theta})+C\frac{\sigma^2|\theta|_0\log (eM)}{n} \Big\}\\ &\le \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}\,. \end{align*} [[/math]]
To control the minimum over [math]k[/math], we need to consider three cases for the quantity

[[math]] \bar k=\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log (eM)}}. [[/math]]

  • If [math]1 \le \bar k \le M[/math], then we get
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} [[/math]]
  • If [math]\bar k \le 1[/math], then
    [[math]] |\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, [[/math]]
    which yields
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} [[/math]]
  • If [math]\bar k \ge M[/math], then
    [[math]] \frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. [[/math]]
    Therefore, on the one hand, if [math]M \ge \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], we get
    [[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. [[/math]]
    On the other hand, if [math]M \le \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], then for any [math]\theta \in \R^M[/math], we have
    [[math]] \begin{equation*} \frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. \end{equation*} [[/math]]

Combined,

[[math]] \begin{align*} \leadeq{\inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'})+C\min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \Big\}}\\ \le {} & \inf_{\theta' \in \R^M}\Big\{\MSE(\varphi_{\theta'}) + C \sigma | \theta' |_1 \sqrt{\frac{\log(eM)}{n}} + C \frac{\sigma^2 \log(eM)}{n} \Big\}, \end{align*} [[/math]]
which together with Theorem 3.4 yields the claim.

Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the [math]\ell_0[/math] norm as in Theorem. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [2]. Maurey's argument allows us to enjoy the best of both the [math]\ell_0[/math] and the [math]\ell_1[/math] worlds. The rate adapts to the sparsity of the problem and can even be generalized to [math]\ell_q[/math]-sparsity (see Problem). However, it is clear from the proof that this argument is limited to squared [math]\ell_2[/math] norms such as the one appearing in [math]\MSE[/math], and its extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces [1][3] using more sophisticated arguments.

Nonparametric regression

So far, the oracle inequalities that we have derived do not deal with the approximation error [math]\MSE(\varphi_\theta)[/math]. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] is chosen appropriately. In what follows, we assume for simplicity that [math]d=1[/math] so that [math]f\,:\, \R \to \R[/math] and [math]\varphi_j\,:\, \R \to \R[/math].

Fourier decomposition

Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math]:

[[math]] \int_0^1\varphi_j^2(x)\ud x =1\,, \quad \int_0^1 \varphi_j(x)\varphi_k(x)\ud x=0, \ \forall\, j\neq k\,. [[/math]]

We will also deal with the case where [math]M=\infty[/math]. When [math]\cH[/math] is an orthonormal system, the coefficients [math]\theta_j^* \in \R[/math] defined by

[[math]] \theta_j^* =\int_0^1 f(x) \varphi_j(x)\ud x\,, [[/math]]

are called Fourier coefficients of [math]f[/math]. Assume now that the regression function [math]f[/math] admits the following decomposition

[[math]] f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,. [[/math]]

There exist many choices for the orthonormal system and we give only two as examples.

Example Trigonometric basis. This is an orthonormal basis of [math]L_2([0,1])[/math]. It is defined by

[[math]] \begin{eqnarray*} \varphi_1&\equiv&1\\ \varphi_{2k}(x)&=&\sqrt{2} \cos(2\pi k x)\,,\\ \varphi_{2k+1}(x)&=&\sqrt{2} \sin(2\pi k x)\,, \end{eqnarray*} [[/math]]

for [math]k=1,2, \dots[/math] and [math]x \in [0,1][/math]. The fact that it is indeed an orthonormal system can easily be checked using trigonometric identities.

The next example has received a lot of attention in the signal (sound, image, ...) processing community.

Example Wavelets. Let [math]\psi\,:\, \R \to \R[/math] be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions

[[math]] \psi_{jk}(x)=2^{j/2}\psi(2^jx-k)\,, \quad j,k \in \Z\,. [[/math]]

It can be shown that for a suitable [math]\psi[/math], the dictionary [math]\{\psi_{j,k}, j,k \in \Z\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math] and sometimes a basis. In the latter case, for any function [math]g \in L_2([0,1])[/math], it holds

[[math]] g=\sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty \theta_{jk}\psi_{jk}\,, \quad \theta_{jk}=\int_{0}^1 g(x)\psi_{jk}(x)\ud x\,. [[/math]]

The coefficients [math]\theta_{jk}[/math] are called the wavelet coefficients of [math]g[/math]. The simplest example is given by the Haar system obtained by taking [math]\psi[/math] to be the following piecewise constant function (see Figure). We will not give more details about wavelets here but simply point the interested reader to [4].

[[math]] \psi(x)=\left\{ \begin{array}{ll} 1& 0\le x \lt 1/2\\ -1 & 1/2\le x \le 1\\ 0 & \text{otherwise} \end{array} \right. [[/math]]

[[File: | 400px | thumb | The Haar mother wavelet ]]

Sobolev classes and ellipsoids

We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that [math]f^{(k)}[/math] denotes the [math]k[/math]-th derivative of [math]f[/math].

Definition

Fix parameters [math]\beta \in \{1, 2, \dots\}[/math] and [math]L \gt 0[/math]. The Sobolev class of functions [math]W(\beta,L)[/math] is defined by

[[math]] \begin{align*} W(\beta,L)=\Big\{f\,:\,& [0,1] \to \R\ :\ f\in L_2([0,1])\,, f^{(\beta-1)} \text{ is absolutely continuous and }\\ & \int_{0}^1[f^{(\beta)}]^2\le L^2\,, f^{(j)}(0)= f^{(j)}(1), j=0, \ldots, \beta-1\Big\} \end{align*} [[/math]]

Any function [math]f \in W(\beta, L)[/math] can be represented[Notes 1] as its Fourier expansion along the trigonometric basis:

[[math]] f(x)=\theta_1^*\varphi_1(x) + \sum_{k=1}^\infty\big(\theta_{2k}^*\varphi_{2k}(x)+\theta_{2k+1}^*\varphi_{2k+1}(x)\big)\,, \quad \forall \ x \in [0,1]\,, [[/math]]

where [math]\theta^*=\{\theta_j^*\}_{j \ge 1}[/math] is in the space of square-summable sequences [math]\ell_2(\N)[/math] defined by

[[math]] \ell_2(\N)=\Big\{\theta\,:\, \sum_{j=1}^\infty \theta_j^2 \lt \infty\Big\}\,. [[/math]]

For any [math]\beta \gt 0[/math], define the coefficients

[[math]] \begin{equation} \label{EQ:defaj} a_j=\left\{ \begin{array}{cl} j^\beta& \text{for $j$ even}\\ (j-1)^\beta & \text{for $j$ odd} \end{array} \right. \end{equation} [[/math]]

With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.

Theorem

Fix [math]\beta\ge 1[/math] and [math]L \gt 0[/math] and let [math]\{\varphi_j\}_{j \ge 1}[/math] denote the trigonometric basis of [math]L_2([0,1])[/math]. Moreover, let [math]\{a_j\}_{j\ge 1}[/math] be defined as in\eqref{EQ:defaj}. A function [math]f \in W(\beta,L)[/math] can be represented as

[[math]] f=\sum_{j=1}^\infty \theta_j^* \varphi_j\,, [[/math]]
where the sequence [math]\{\theta_j^*\}_{j \ge 1}[/math] belongs to the Sobolev ellipsoid of [math]\ell_2(\N)[/math] defined by

[[math]] \Theta(\beta, Q)=\Big\{\theta \in \ell_2(\N)\,:\, \sum_{j=1}^\infty a_j^2\theta_j^2\le Q\Big\} [[/math]]
for [math]Q=L^2/\pi^{2\beta}[/math].


Show Proof

Let us first define the Fourier coefficients [math]\{s_k(j)\}_{k \ge 1}[/math] of the [math]j[/math]th derivative [math]f^{(j)}[/math] of [math]f[/math] for [math]j=1, \ldots, \beta[/math]:

[[math]] \begin{align*} s_1(j)&=\int_0^1f^{(j)}(t)\ud t=f^{(j-1)}(1)-f^{(j-1)}(0)=0\,,\\ s_{2k}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\cos(2\pi kt)\ud t\,,\\ s_{2k+1}(j)&=\sqrt{2}\int_0^1f^{(j)}(t)\sin(2\pi kt)\ud t\,,\\ \end{align*} [[/math]]
The Fourier coefficients of [math]f[/math] are given by [math]\theta_k=s_k(0)[/math]. Using integration by parts, we find that

[[math]] \begin{align*} s_{2k}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \cos(2\pi kt)\Big|_0^1 + (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ &= \sqrt{2}[f^{(\beta-1)}(1)-f^{(\beta-1)}(0)]+(2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \sin(2\pi kt)\ud t\\ &= (2\pi k)s_{2k+1}(\beta-1)\,. \end{align*} [[/math]]
Moreover,

[[math]] \begin{align*} s_{2k+1}(\beta)&=\sqrt{2}f^{(\beta-1)}(t) \sin(2\pi kt)\Big|_0^1 - (2\pi k)\sqrt{2}\int_0^1f^{(\beta-1)}(t) \cos(2\pi kt)\ud t\\ &= -(2\pi k)s_{2k}(\beta-1)\,. \end{align*} [[/math]]
In particular, it yields

[[math]] s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^2\big[s_{2k}(\beta-1)^2+s_{2k+1}(\beta-1)^2\big] [[/math]]
By induction, we find that for any [math]k \ge 1[/math],

[[math]] s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big) [[/math]]
Next, it follows from the definition \eqref{EQ:defaj} of [math]a_j[/math] that

[[math]] \begin{align*} \sum_{k=1}^\infty(2\pi k)^{2\beta}\big(\theta_{2k}^2 + \theta_{2k+1}^2\big)&=\pi^{2\beta}\sum_{k=1}^\infty a_{2k}^2\theta_{2k}^2 + \pi^{2\beta}\sum_{k=1}^\infty a_{2k+1}^2\theta_{2k+1}^2\\ &=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. \end{align*} [[/math]]
Together with the Parseval identity, it yields

[[math]] \int_0^1\big(f^{(\beta)}(t)\big)^2\ud t=\sum_{k=1}^\infty s_{2k}(\beta)^2+s_{2k+1}(\beta)^2=\pi^{2\beta}\sum_{j=1}^\infty a_{j}^2\theta_{j}^2\,. [[/math]]
To conclude, observe that since [math]f \in W(\beta, L)[/math], we have

[[math]] \int_0^1\big(f^{(\beta)}(t)\big)^2\ud t \le L^2\,, [[/math]]
so that [math]\theta \in \Theta(\beta, L^2/\pi^{2\beta})[/math].

It can actually be shown that the converse is true, that is, any function with Fourier coefficients in [math]\Theta(\beta, Q)[/math] belongs to [math] W(\beta,L)[/math], but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write [math]f \in \Theta(\beta, Q)[/math] in this case and allow [math]\beta[/math] to take any positive real value.

Proposition

The Sobolev ellipsoids enjoy the following properties

  • For any [math]Q \gt 0[/math],
    [[math]] 0 \lt \beta' \lt \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). [[/math]]
  • For any [math]Q \gt 0[/math],
    [[math]] \beta \gt \frac12 \ \Rightarrow \ f\ \text{is continuous}. [[/math]]

The proof is left as an exercise (Problem). It turns out that the first [math] n-1 [/math] functions in the trigonometric basis are not only orthonormal with respect to the inner product of [math]L_2[/math], but also with respect to the inner product associated with the fixed design, [math]\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)[/math], when the design is chosen to be regular, i.e., [math]X_i=(i-1)/n[/math], [math]i=1, \ldots, n[/math].

Lemma

Assume that [math]\{X_1, \ldots, X_n\}[/math] is the regular design, i.e., [math]X_i=(i-1)/n[/math]. Then, for any [math]M\le n-1[/math], the design matrix [math]\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}[/math] satisfies the [math]\textsf{ORT}[/math] condition.


Show Proof

Fix [math]j,j'\in \{1, \ldots, n-1\}[/math], [math]j\neq j'[/math], and consider the inner product [math]\varphi_j^\top \varphi_{j'}[/math]. Write [math]k_{j}=\lfloor j/2\rfloor[/math] for the integer part of [math]j/2[/math] and define the vectors [math]a, b, a', b' \in \R^n[/math] with coordinates such that [math]e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}[/math] and [math]e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}[/math] for [math] s \in \{0, \dots, n-1\} [/math]. It holds that

[[math]] \frac{1}{2}\varphi_j^\top \varphi_{j'}\in \{a^\top a'\,,\, b^\top b'\,,\, b^\top a'\,,\, a^\top b'\}\,, [[/math]]
depending on the parity of [math]j[/math] and [math]j'[/math]. On the one hand, observe that if [math]k_j \neq k_{j'}[/math], we have for any [math]\sigma \in \{-1, +1\}[/math],

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi (k_j+\sigma k_{j'}) s}{n}}=0\,. [[/math]]
On the other hand

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=(a+\mathsf{i} b)^\top( a'+\sigma \mathsf{i} b')=a^\top a'-\sigma b^\top b' + \mathsf{i} \big[b^\top a' + \sigma a^\top b' \big] [[/math]]
so that [math]a^\top a'=\pm b^\top b'=0[/math] and [math]b^\top a'=\pm a^\top b'=0[/math] whenever [math]k_j \neq k_{j'}[/math]. It yields [math]\varphi_j^\top \varphi_{j'}=0[/math]. Next, consider the case where [math]k_j=k_{j'}[/math] so that

[[math]] \sum_{s=0}^{n-1}e^{\frac{\mathsf{i}2\pi k_j s}{n}}e^{\sigma\frac{\mathsf{i}2\pi k_{j'} s}{n}}=\left\{ \begin{array}{ll} 0& \text{if $\sigma=1$}\\ n& \text{if $\sigma=-1$}\\ \end{array}\right.\,. [[/math]]
On the one hand if [math]j \neq j'[/math], it can only be the case that [math]\varphi_j^\top \varphi_{j'} \in \{b^\top a'\,,\, a^\top b'\}[/math] but the same argument as above yields [math]b^\top a'=\pm a^\top b'=0[/math] since the imaginary part of the inner product is still [math] 0 [/math]. Hence, in that case, [math]\varphi_j^\top \varphi_{j'}=0[/math]. On the other hand, if [math]j=j'[/math], then [math]a=a'[/math] and [math]b=b'[/math] so that it yields [math]a^\top a'=|a|_2^2=n[/math] and [math]b^\top b'=|b|_2^2=n[/math] which is equivalent to [math]\varphi_j^\top \varphi_j=|\varphi_j|_2^2=n[/math]. Therefore, the design matrix [math]\Phi[/math] is such that

[[math]] \begin{equation*} \Phi^\top \Phi=nI_M\,. \end{equation*} [[/math]]
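The lemma is easy to verify numerically: build the trigonometric design matrix on the regular grid [math]X_i=(i-1)/n[/math] and check that [math]\Phi^\top\Phi=nI_M[/math]. A minimal sketch (the helper trig_basis is ours, not a library function):

<syntaxhighlight lang="python">
import numpy as np

def trig_basis(x, M):
    """First M functions of the trigonometric basis, evaluated at the points x."""
    cols, k = [np.ones_like(x)], 1
    while len(cols) < M:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) < M:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

n, M = 64, 63                                    # the lemma requires M <= n - 1
X = np.arange(n) / n                             # regular design X_i = (i - 1) / n
Phi = trig_basis(X, M)
print(np.allclose(Phi.T @ Phi, n * np.eye(M)))   # True: the ORT condition holds
</syntaxhighlight>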

Integrated squared error

As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if [math]\theta \in \Theta(\beta, Q)[/math], then [math]a_j^2\theta_j^2 \to 0[/math] as [math]j \to \infty[/math] so that [math]|\theta_j|=o(j^{-\beta})[/math]. Therefore, the [math]\theta_j[/math]s decay polynomially to zero and it makes sense to approximate [math]f[/math] by its truncated Fourier series

[[math]] \sum_{j=1}^M \theta_j^* \varphi_j=:\varphi_{\theta^*}^M [[/math]]

for any fixed [math]M[/math]. This truncation leads to a systematic error that vanishes as [math]M \to \infty[/math]. We are interested in understanding the rate at which this happens.

The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter [math]M[/math] and the smoothness [math]\beta[/math].

Lemma

Fix [math]\beta \gt 1/2[/math] and [math]Q \gt 0[/math], and assume that [math]f \in \Theta(\beta, Q)[/math], that is, its Fourier coefficients [math]\theta^*=\{\theta^*_j\}_{j\ge 1}[/math] with respect to the trigonometric basis belong to [math]\Theta(\beta, Q)[/math]. Then, for any integer [math]M \ge 1[/math],

[[math]] \begin{equation} \label{EQ:bias1} \min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2([0,1])}^2=\sum_{j \gt M} |\theta^*_j|^2\le \frac{Q}{M^{2\beta}}\,, \end{equation} [[/math]]

and, for any choice of the design points [math]X_1, \ldots, X_n \in [0,1][/math],

[[math]] \begin{equation} \label{EQ:bias2} |\varphi_{\theta^*}^{n-1} -f|_2^2\lesssim Q n^{2-2\beta}\,. \end{equation} [[/math]]


Show Proof

Note that for any [math]\theta \in \Theta(\beta, Q)[/math], if [math]\beta \gt 1/2[/math], then

[[math]] \begin{align*} \sum_{j=2}^\infty|\theta_j|&=\sum_{j=2}^\infty a_j|\theta_j|\frac1{a_j}\\ &\le \sqrt{\sum_{j=2}^\infty a_j^2\theta_j^2\sum_{j=2}^\infty\frac1{a_j^2}}\quad \text{by Cauchy-Schwarz}\\ &\le \sqrt{Q\sum_{j=1}^\infty\frac1{j^{2\beta}}} \lt \infty \end{align*} [[/math]]
Since [math]\{\varphi_j\}_j[/math] forms an orthonormal system in [math]L_2([0,1])[/math], we have
[[math]] \min_{\theta \in \R^M} \|\varphi_\theta -f\|_{L_2}^2= \|\varphi_{\theta^*}^M -f\|_{L_2}^2 =\sum_{j \gt M} |\theta^*_j|^2\,. [[/math]]
When [math]\theta^* \in \Theta(\beta,Q)[/math], we have

[[math]] \sum_{j \gt M} |\theta_j^*|^2=\sum_{j \gt M} a_j^2|\theta_j^*|^2\frac1{a_j^2}\le \frac1{a_{M+1}^2}Q\le \frac{Q}{M^{2\beta}}\,. [[/math]]

To prove the second part of the lemma, observe that

[[math]] |\varphi_{\theta^*}^{n-1} -f|_2=\big|\sum_{j\ge n}\theta_j^* \varphi_j\big|_2 \le2\sqrt{2n}\sum_{j\ge n}|\theta_j^*|\,, [[/math]]
where in the last inequality, we used the fact that for the trigonometric basis [math]|\varphi_j|_2 \le \sqrt{2n}, j \ge 1[/math] regardless of the choice of the design [math]X_1,\ldots, X_n[/math]. When [math]\theta^* \in \Theta(\beta,Q)[/math], we have

[[math]] \begin{equation*} \sum_{j\ge n} |\theta^*_j|=\sum_{j\ge n} a_j|\theta^*_j|\frac{1}{a_j}\le \sqrt{\sum_{j\ge n} a_j^2|\theta^*_j|^2}\sqrt{\sum_{j\ge n}\frac{1}{a_j^2}}\lesssim \sqrt{Q} n^{\frac{1}{2}-\beta}\,. \end{equation*} [[/math]]
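The tail bound above can be checked numerically: take a sequence on the boundary of the Sobolev ellipsoid and compare the tail sum [math]\sum_{j \gt M}\theta_j^2[/math] to [math]Q M^{-2\beta}[/math]. A small sketch with an illustrative (and rather favorable) choice of [math]\theta[/math]:

<syntaxhighlight lang="python">
import numpy as np

beta, Q = 2.0, 1.0
J = 10 ** 5                                        # truncation of the infinite sequence for the experiment
j = np.arange(1, J + 1)
a = np.where(j % 2 == 0, j ** beta, (j - 1) ** beta)      # the coefficients a_j from the definition above

theta = 1.0 / j ** (beta + 1)                      # an illustrative square-summable sequence
theta *= np.sqrt(Q / np.sum(a ** 2 * theta ** 2))  # rescale so that sum_j a_j^2 theta_j^2 = Q

for M in [10, 100, 1000]:
    tail = np.sum(theta[M:] ** 2)                  # sum_{j > M} theta_j^2
    print(M, tail, Q / M ** (2 * beta))            # the tail is bounded by Q / M^(2 beta)
</syntaxhighlight>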


Note that the truncated Fourier series [math]\varphi_{\theta^*}^M[/math] is an oracle: this is what we see when we view [math]f[/math] through the lens of functions with only low frequency harmonics. To estimate [math]\varphi_{\theta^*} = \varphi_{\theta^\ast}^M[/math], consider the estimator [math]\varphi_{\thetals}[/math] where

[[math]] \thetals \in \argmin_{\theta \in \R^M} \sum_{i=1}^n \big(Y_i -\varphi_\theta(X_i)\big)^2\,, [[/math]]

which should be such that [math]\varphi_{\thetals}[/math] is close to [math]\varphi_{\theta^*}[/math]. For this estimator, we have proved (Theorem) an oracle inequality for the [math]\MSE[/math] that is of the form

[[math]] |\varphi_{\thetals}^M-f|_2^2\le\inf_{\theta \in \R^M}|\varphi_\theta^M-f|_2^2+C\sigma^2M\log(1/\delta)\,, \qquad C \gt 0\,. [[/math]]

It yields

[[math]] \begin{align*} |\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2&\le 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(f-\varphi_{\theta^*}^M)+C\sigma^2M\log(1/\delta)\\ &= 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j \gt M}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\\ &=2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)+C\sigma^2M\log(1/\delta)\,, \end{align*} [[/math]]

where we used Lemma in the last equality. Together with\eqref{EQ:bias2} and Young's inequality [math]2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0[/math] for any [math]\alpha \gt 0[/math], we get

[[math]] 2(\varphi_{\thetals}^M-\varphi_{\theta^*}^M)^\top(\sum_{j\ge n}\theta^*_j\varphi_j)\le \alpha|\varphi_{\thetals}^M-\varphi_{\theta^*}^M |_2^2 + \frac{C}{\alpha}Qn^{2-2\beta}\,, [[/math]]

for some positive constant [math]C[/math] when [math]\theta^* \in \Theta(\beta, Q)[/math]. As a result,

[[math]] \begin{equation} \label{EQ:almostLSNP} |\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2 \lesssim \frac{1}{\alpha(1-\alpha)}Qn^{2-2\beta}+ \frac{\sigma^2M}{1-\alpha}\log(1/\delta) \end{equation} [[/math]]

for any [math]\alpha \in (0,1)[/math]. Since Lemma implies that [math]|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2[/math], we have proved the following theorem.

Theorem

Fix [math]\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Moreover, let [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] and [math]n[/math] be large enough so that [math]M \le n-1[/math]. Then the least squares estimator [math]\thetals[/math] defined in\eqref{EQ:defthetaLSmis} with [math]\{\varphi_j\}_{j=1}^M[/math] being the trigonometric basis, satisfies with probability [math]1-\delta[/math], for [math]n[/math] large enough,

[[math]] \|\varphi^M_{\thetals} -f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,. [[/math]]
where the constant factors may depend on [math]\beta, Q[/math] and [math]\sigma[/math].


Show Proof

Choosing [math]\alpha=1/2[/math] for example and absorbing [math]Q[/math] in the constants, we get from\eqref{EQ:almostLSNP} and Lemma that for [math]M \le n-1[/math],

[[math]] \|\varphi^M_{\thetals}-\varphi^M_{\theta^*}\|^2_{L_2([0,1])}\lesssim n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. [[/math]]
Using now Lemma and [math]\sigma^2 \le 1[/math], we get

[[math]] \|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim M^{-2\beta}+n^{1-2\beta}+\sigma^2\frac{M\log(1/\delta)}{n} \,. [[/math]]
Taking [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil\le n-1[/math] for [math]n[/math] large enough yields

[[math]] \|\varphi^M_{\thetals}-f\|^2_{L_2([0,1])}\lesssim n^{-\frac{2\beta}{2\beta+1}}+n^{1-2\beta} \sigma^2\log(1/\delta)\,. [[/math]]
To conclude the proof, simply note that for the prescribed [math]\beta[/math], we have [math]n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}[/math]\,.
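The rate can be observed in a simulation: on the regular design, project the data onto the first [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] trigonometric basis functions and track the [math]L_2[/math] error as [math]n[/math] grows. The sketch below uses [math]f(x)=|x-1/2|[/math], which belongs to [math]W(1,1)[/math], so we take [math]\beta=1[/math]; all other choices, and the trig_basis helper (repeated from the earlier sketch), are ours.

<syntaxhighlight lang="python">
import numpy as np

def trig_basis(x, M):                                # same helper as in the earlier sketch
    cols, k = [np.ones_like(x)], 1
    while len(cols) < M:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) < M:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

rng = np.random.default_rng(5)
beta, sigma = 1, 0.5
f = lambda x: np.abs(x - 0.5)                        # f is in W(1, 1), hence we take beta = 1
grid = np.linspace(0, 1, 4096, endpoint=False)       # fine grid to approximate the L2([0,1]) norm

for n in [128, 512, 2048, 8192]:
    X = np.arange(n) / n
    Y = f(X) + sigma * rng.standard_normal(n)
    M = int(np.ceil(n ** (1 / (2 * beta + 1))))
    Phi = trig_basis(X, M)
    theta_ls = Phi.T @ Y / n                         # least squares, since Phi^T Phi = n I_M
    err = np.mean((trig_basis(grid, M) @ theta_ls - f(grid)) ** 2)
    print(n, err, n ** (-2 * beta / (2 * beta + 1)))  # error tracks n^(-2 beta / (2 beta + 1))
</syntaxhighlight>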

Adaptive estimation

The rate attained by the projection estimator [math]\varphi_{\thetals}[/math] with [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter [math]\beta[/math], which is typically unknown, to determine the truncation level [math]M[/math]. The purpose of adaptive estimation is precisely to adapt to the unknown [math]\beta[/math], that is, to build an estimator that does not depend on [math]\beta[/math] and yet attains a rate of the order of [math]Cn^{-\frac{2\beta}{2\beta+1}}[/math] (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis} respectively. In view of Lemma, the design matrix [math]\Phi[/math] actually satisfies the assumption [math]\textsf{ORT}[/math] when we work with the trigonometric basis. This has two useful implications:

  • Both estimators are actually thresholding estimators and can therefore be implemented efficiently (a small sketch follows this list)
  • The condition [math]\textsf{INC($k$)}[/math] is automatically satisfied for any [math]k \ge 1[/math].
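A minimal sketch of the first point, assuming the regular design and the trigonometric basis with [math]M=n-1[/math] (so that [math]\textsf{ORT}[/math] holds): the least squares coefficients are [math]\Phi^\top Y/n[/math], BIC keeps the coefficients exceeding [math]\tau[/math] in absolute value (hard thresholding), and the Lasso shrinks them by [math]\tau[/math] (soft thresholding). The threshold used here is a hypothetical [math]\sigma\sqrt{2\log(n)/n}[/math], not the exact values prescribed by \eqref{EQ:taubicmis} and \eqref{EQ:taulassomis}.

<syntaxhighlight lang="python">
import numpy as np

def trig_basis(x, M):                                # same helper as in the earlier sketches
    cols, k = [np.ones_like(x)], 1
    while len(cols) < M:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) < M:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

rng = np.random.default_rng(6)
n, sigma = 1024, 0.5
X = np.arange(n) / n
Y = np.abs(X - 0.5) + sigma * rng.standard_normal(n)

Phi = trig_basis(X, n - 1)
theta_tilde = Phi.T @ Y / n                          # least squares coefficients, since Phi^T Phi = n I
tau = sigma * np.sqrt(2 * np.log(n) / n)             # hypothetical threshold of order sigma sqrt(log(n)/n)
theta_bic = np.where(np.abs(theta_tilde) > tau, theta_tilde, 0.0)                  # hard thresholding
theta_lasso = np.sign(theta_tilde) * np.maximum(np.abs(theta_tilde) - tau, 0.0)    # soft thresholding
print(np.count_nonzero(theta_bic), np.count_nonzero(theta_lasso))                  # both are very sparse
</syntaxhighlight>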

These observations lead to the following corollary.

Corollary

Fix [math]\beta \gt (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and [math]n[/math] large enough to ensure [math]n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil[/math], and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Let [math]\{\varphi_j\}_{j=1}^{n-1}[/math] be the trigonometric basis. Denote by [math]\varphi^{n-1}_{\thetabic}[/math] (resp. [math]\varphi^{n-1}_{\thetalasso}[/math]) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over [math]\R^{n-1}[/math] with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then [math]\varphi^{n-1}_{\hat \theta}[/math], where [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], satisfies with probability [math]1-\delta[/math],

[[math]] \|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2 \lesssim (n/\log n)^{-\frac{2\beta}{2\beta+1}}(1+\sigma^2\log(1/\delta))\,, [[/math]]
where the constants may depend on [math]\beta[/math] and [math]Q[/math].


Show Proof

For [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], adapting the proofs of Theorem for the BIC estimator and Theorem for the Lasso estimator, for any [math]\theta \in \R^{n-1}[/math], with probability [math]1-\delta[/math]

[[math]] |\varphi^{n-1}_{\hat \theta} -f|_2^2 \le \frac{1+\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + R(|\theta|_0)\,. [[/math]]
where

[[math]] R(|\theta|_0):=\frac{C\sigma^2}{\alpha(1-\alpha)}|\theta|_0\log(en/\delta) [[/math]]
It yields

[[math]] \begin{align*} |\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2& \le \frac{2\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 +2(\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta)^\top (f - \varphi^{n-1}_\theta)+R(|\theta|_0)\\ &\le \Big( \frac{2\alpha}{1-\alpha}+\frac{1}{\alpha}\Big)|\varphi^{n-1}_\theta-f|_2^2+\alpha|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_\theta|_2^2 + R(|\theta|_0)\,, \end{align*} [[/math]]
where we used Young's inequality once again. Let [math] M \in \mathbb{N} [/math] be determined later and choose now [math]\alpha=1/2[/math] and [math]\theta=\theta^*_M[/math], where [math]\theta^*_M[/math] is equal to [math]\theta^*[/math] on its first [math]M[/math] coordinates and [math]0[/math] otherwise so that [math]\varphi^{n-1}_{\theta^*_M}=\varphi^M_{\theta^*}[/math]. It yields

[[math]] |\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}|_2^2\lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2+ R(M)\lesssim|\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2+|\varphi^{n-1}_{\theta^*}-f|_2^2+ R(M) [[/math]]
Next, it follows from\eqref{EQ:bias2} that [math]|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}[/math]. Together with Lemma, it yields

[[math]] \|\varphi^{n-1}_{\hat \theta} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2\lesssim \|\varphi^{n-1}_{\theta^*} -\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2+Qn^{1-2\beta}+ \frac{R(M)}{n}\,. [[/math]]
Moreover, using\eqref{EQ:bias1}, we find that

[[math]] \|\varphi^{n-1}_{\hat \theta} -f\|_{L_2([0,1])}^2\lesssim M^{-2\beta}+Qn^{1-2\beta}+ \frac{M}{n}\sigma^2\log(en/\delta) \,. [[/math]]
To conclude the proof, choose [math]M=\lceil (n/\log n)^{\frac{1}{2\beta+1}}\rceil[/math] and observe that the choice of [math]\beta[/math] ensures that [math]n^{1-2\beta} \lesssim M^{-2\beta}[/math].

While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [5], Chapter 3). The reason why we get this extra logarithmic factor here is that we use a hammer that's too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.

General references

Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].

Notes

  1. In the sense that
    [[math]] \lim_{k \to \infty} \int_0^1|f(t)-\sum_{j=1}^k \theta_j \varphi_j(t)|^2\ud t=0 [[/math]]

References

  1. 1.0 1.1 Flaxman, Abraham (2003). "A spectral technique for random satisfiable 3CNF formulas". Ann. Statist. 40. Society for Industrial and Applied Mathematics. doi:10.1214/aos/1034276635. 
  2. "Exponential screening and optimal rates of sparse estimation" (2011). Ann. Statist. 39. John Wiley & Sons Inc.. doi:10.1214/aos/1034276635. 
  3. Ramon, van (2017). "Probability in High Dimension". Constructive Approximation 13. Springer-Verlag. doi:10.1214/aos/1034276635. 
  4. Ryan, O'Donnell (2005). "The PCP Theorem and Hardness of Approximation". SIAM J. Comput. 24. W. H. Freeman & Co.. doi:10.1214/aos/1034276635. 
  5. LeCam, L. (1973). "Convergence of estimates under dimensionality restrictions". Ann. Statist. 1. Omnipress. doi:10.1214/aos/1034276635.