Misspecified Linear Models
[math] \newcommand{\DS}{\displaystyle} \newcommand{\intbeta}{\lfloor \beta \rfloor} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \newcommand{\sg}{\mathsf{subG}} \newcommand{\subE}{\mathsf{subE}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bE}{\mathbf{E}} \newcommand{\bF}{\mathbf{F}} \newcommand{\bG}{\mathbf{G}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bJ}{\mathbf{J}} \newcommand{\bK}{\mathbf{K}} \newcommand{\bM}{\mathbf{M}} \newcommand{\bN}{\mathbf{N}} \newcommand{\bO}{\mathbf{O}} \newcommand{\bP}{\mathbf{P}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bQ}{\mathbf{Q}} \newcommand{\bS}{\mathbf{S}} \newcommand{\bT}{\mathbf{T}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bflambda}{\boldsymbol{\lambda}} \newcommand{\bftheta}{\boldsymbol{\theta}} \newcommand{\bfg}{\boldsymbol{g}} \newcommand{\bfy}{\boldsymbol{y}} \def\thetaphat{\hat{\bftheta}_\bp} \def\bflam{\boldsymbol{\lambda}} \def\Lam{\Lambda} \def\lam{\lambda} \def\bfpi{\boldsymbol{\pi}} \def\bfz{\boldsymbol{z}} \def\bfw{\boldsymbol{w}} \def\bfeta{\boldsymbol{\eta}} \newcommand{\R}{\mathrm{ I}\kern-0.18em\mathrm{ R}} 
\newcommand{\h}{\mathrm{ I}\kern-0.18em\mathrm{ H}} \newcommand{\K}{\mathrm{ I}\kern-0.18em\mathrm{ K}} \newcommand{\p}{\mathrm{ I}\kern-0.18em\mathrm{ P}} \newcommand{\E}{\mathrm{ I}\kern-0.18em\mathrm{ E}} %\newcommand{\Z}{\mathrm{ Z}\kern-0.26em\mathrm{ Z}} \newcommand{\1}{\mathrm{ 1}\kern-0.24em\mathrm{ I}} \newcommand{\N}{\mathrm{ I}\kern-0.18em\mathrm{ N}} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\q}{\field{Q}} \newcommand{\Z}{\field{Z}} \newcommand{\X}{\field{X}} \newcommand{\Y}{\field{Y}} \newcommand{\bbS}{\field{S}} \newcommand{\n}{\mathcal{N}} \newcommand{\x}{\mathcal{X}} \newcommand{\pp}{{\sf p}} \newcommand{\hh}{{\sf h}} \newcommand{\ff}{{\sf f}} \newcommand{\Bern}{\mathsf{Ber}} \newcommand{\Bin}{\mathsf{Bin}} \newcommand{\Lap}{\mathsf{Lap}} \newcommand{\tr}{\mathsf{Tr}} \newcommand{\phin}{\varphi_n} \newcommand{\phinb}{\overline \varphi_n(t)} \newcommand{\pn}{\p_{\kern-0.25em n}} \newcommand{\pnm}{\p_{\kern-0.25em n,m}} \newcommand{\psubm}{\p_{\kern-0.25em m}} \newcommand{\psubp}{\p_{\kern-0.25em p}} \newcommand{\cfi}{\cF_{\kern-0.25em \infty}} \newcommand{\e}{\mathrm{e}} \newcommand{\ic}{\mathrm{i}} \newcommand{\Leb}{\mathrm{Leb}_d} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ddelta}{d_{\symdiffsmall}} \newcommand{\dsubh}{d_H} \newcommand{\indep}{\perp\kern-0.95em{\perp}} \newcommand{\supp}{\mathop{\mathrm{supp}}} \newcommand{\rank}{\mathop{\mathrm{rank}}} \newcommand{\vol}{\mathop{\mathrm{vol}}} \newcommand{\conv}{\mathop{\mathrm{conv}}} \newcommand{\card}{\mathop{\mathrm{card}}} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\ud}{\mathrm{d}} \newcommand{\var}{\mathrm{var}} \newcommand{\re}{\mathrm{Re}} \newcommand{\MSE}{\mathsf{MSE}} \newcommand{\im}{\mathrm{Im}} \newcommand{\epr}{\hfill\hbox{\hskip 4pt\vrule width 5pt height 6pt depth 1.5pt}\vspace{0.5cm}\par} \newcommand{\bi}[1]{^{(#1)}} \newcommand{\eps}{\varepsilon} \newcommand{\Deq}{\stackrel{\mathcal{D}}{=}} 
\newcommand{\ubar}{\underbar} \newcommand{\Kbeta}{K_{\hspace{-0.3mm} \beta}} \newcommand{\crzero}[1]{\overset{r_0}{\underset{#1}{\longleftrightarrow}}} \newcommand{\hint}[1]{\texttt{[Hint:#1]}} \newcommand{\vp}{\vspace{.25cm}} \newcommand{\vm}{\vspace{.5cm}} \newcommand{\vg}{\vspace{1cm}} \newcommand{\vgneg}{\vspace{-1cm}} \newcommand{\vneg}{\vspace{-.5cm}} \newcommand{\vpneg}{\vspace{-.25cm}} \newcommand{\tp}{\ptsize{10}} \newcommand{\douzp}{\ptsize{12}} \newcommand{\np}{\ptsize{9}} \newcommand{\hp}{\ptsize{8}} \newcommand{\red}{\color{red}} \newcommand{\ndpr}[1]{{\textsf{\red[#1]}}} \newcommand\iid{i.i.d\@ifnextchar.{}{.\@\xspace} } \newcommand\MoveEqLeft[1][2]{\kern #1em & \kern -#1em} \newcommand{\leadeq}[2][4]{\MoveEqLeft[#1] #2 \nonumber} \newcommand{\leadeqnum}[2][4]{\MoveEqLeft[#1] #2} \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}} \newcommand{\MIT}[1]{{\color{MITred} #1}} \newcommand{\dHyp}{\{-1,1\}^d} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetahard}{\hat \theta^{hrd}} \newcommand{\thetasoft}{\hat \theta^{sft}} \newcommand{\thetabic}{\hat \theta^{bic}} \newcommand{\thetalasso}{\hat \theta^{\cL}} \newcommand{\thetaslope}{\hat \theta^{\cS}} \newcommand{\thetals}{\hat \theta^{ls}} \newcommand{\thetalsm}{\tilde \theta^{ls_X}} \newcommand{\thetaridge}{\hat\theta^{\mathrm{ridge}}_\tau} \newcommand{\thetalsK}{\hat \theta^{ls}_K} \newcommand{\muls}{\hat \mu^{ls}} [/math]
Arguably, the strongest assumption that we made in Chapter Linear Regression Model is that the regression function [math]f(x)[/math] is of the form [math]f(x)=x^\top \theta^*[/math]. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models. Throughout this chapter, we assume the following model:
where [math]\eps=(\eps_1, \ldots, \eps_n)^\top [/math] is sub-Gaussian with variance proxy [math]\sigma^2[/math]. Here [math]X_i \in \R^d[/math]. When dealing with fixed design, it will be convenient to consider the vector [math]g \in \R^n[/math] defined for any function [math]g \,:\, \R^d \to \R[/math] by [math]g=(g(X_1), \ldots, g(X_n))^\top[/math]. In this case, we can write for any estimator [math]\hat f \in \R^n[/math] of [math]f[/math],
Even though the model may not be linear, we are interested in studying the statistical properties of various linear estimators introduced in the previous chapters: [math]\thetals, \thetalsK, \thetalsm, \thetabic, \thetalasso[/math]. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of [math]f[/math] if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.
Oracle inequalities
As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models. When employing the least squares estimator [math]\thetals[/math], we constrain ourselves to estimating functions that are of the form [math]x\mapsto x^\top \theta[/math], even though [math]f[/math] itself may not be of this form. Therefore, the oracle [math]\hat f[/math] is the linear function that is the closest to [math]f[/math]. Rather than trying to approximate [math]f[/math] by a linear function [math]f(x)\approx \theta^\top x[/math], we make the model a bit more general and consider a dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] of functions where [math]\varphi_j\,:\, \R^d \to \R[/math]. In this case, we can actually remove the assumption that [math]X \in \R^d[/math]. Indeed, the goal is now to estimate [math]f[/math] using a linear combination of the functions in the dictionary:
If [math]M=d[/math] and [math]\varphi_j(X)=X^{(j)}[/math] returns the [math]j[/math]th coordinate of [math]X \in \R^d[/math] then the goal is to approximate [math]f(x)[/math] by [math]\theta^\top x[/math]. Nevertheless, the use of a dictionary allows for a much more general framework.
Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define
- The least squares estimator:
[[math]] \begin{equation} \label{EQ:defthetaLSmis} \thetals\in \argmin_{\theta \in \R^M}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \end{equation} [[/math]]
- The least squares estimator constrained to [math]K \subset \R^M[/math]:
[[math]] \thetalsK \in \argmin_{\theta \in K}\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 [[/math]]
- The BIC estimator:
[[math]] \begin{equation} \label{EQ:defBICmis} \thetabic \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+\tau^2|\theta|_0\Big\} \end{equation} [[/math]]
- The Lasso estimator:
[[math]] \begin{equation} \label{EQ:defLassomis} \thetalasso \in \argmin_{\theta \in \R^M}\Big\{\frac{1}{n} \sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2+2\tau|\theta|_1\Big\} \end{equation} [[/math]]
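To make the dictionary framework concrete, here is a minimal numpy sketch (the particular dictionary, function names, and data-generating choices are hypothetical, for illustration only) that forms the design matrix [math]\Phi_{ij}=\varphi_j(X_i)[/math] and computes the least squares estimator\eqref{EQ:defthetaLSmis} over the dictionary in a deliberately misspecified model:

```python
import numpy as np

# Hypothetical dictionary of M = 4 functions from R to R.
dictionary = [
    lambda x: np.ones_like(x),
    lambda x: x,
    lambda x: np.cos(2 * np.pi * x),
    lambda x: np.sin(2 * np.pi * x),
]

def design_matrix(X, dictionary):
    """Phi[i, j] = phi_j(X_i): one column per dictionary function."""
    return np.column_stack([phi(X) for phi in dictionary])

def least_squares(X, Y, dictionary):
    """theta minimizing (1/n) * sum_i (Y_i - phi_theta(X_i))^2."""
    Phi = design_matrix(X, dictionary)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, n)
f = lambda x: np.abs(x - 0.5)          # misspecified: f is NOT in the span
Y = f(X) + 0.1 * rng.standard_normal(n)

theta_ls = least_squares(X, Y, dictionary)
fitted = design_matrix(X, dictionary) @ theta_ls
mse = np.mean((fitted - f(X)) ** 2)    # small even though f is not of the form phi_theta
```

Even though [math]f[/math] does not belong to the span of the dictionary, the fitted function approximates the oracle, that is, the best linear combination of the dictionary elements.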
Let [math]R(\cdot)[/math] be a risk function and let [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] be a dictionary of functions from [math]\R^d[/math] to [math]\R[/math]. Let [math]K[/math] be a subset of [math]\R^M[/math]. The oracle on [math]K[/math] with respect to [math]R[/math] is defined by [math]\varphi_{\bar \theta}[/math], where [math]\bar \theta \in K[/math] is such that
Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.
Oracle inequality for the least squares estimator
While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of extending the linear-model analysis to this case is essentially already captured by the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Then, the least squares estimator [math]\thetals[/math] satisfies for some numerical constant [math]C \gt 0[/math],
Note that by definition
Sparse oracle inequality for the BIC estimator
The analysis for more complicated estimators such as the BIC in Chapter Linear Regression Model allows us to derive oracle inequalities for these estimators.
Recall that the proof of Theorem for the BIC estimator begins as follows:
The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error [math]\MSE(\varphi_\theta)[/math] and the complexity of [math]\theta[/math] as measured by its sparsity. In particular, this result, which is sometimes called sparse oracle inequality, implies the following oracle inequality. Define the oracle [math]\bar \theta[/math] to be such that
Then, with probability at least [math]1-\delta[/math],
If the linear model happens to be correct, then we simply have [math]\MSE(\varphi_{\bar \theta})=0[/math].
Sparse oracle inequality for the Lasso
To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the [math]n\times M[/math] matrix [math]\Phi[/math] with elements [math]\Phi_{i,j}=\varphi_j(X_i)[/math].
Assume the general regression model\eqref{EQ:regmodgen} with [math]\eps\sim\sg_n(\sigma^2)[/math]. Moreover, assume that there exists an integer [math]k[/math] such that the matrix [math]\Phi[/math] satisfies assumption [math]\textsf{INC$(k)$}[/math]. Then, the Lasso estimator [math]\thetalasso[/math] with regularization parameter given by
From the definition of [math]\thetalasso[/math], for any [math]\theta \in \R^M[/math],
Maurey's argument
If there is no sparse [math]\theta[/math] such that [math]\MSE(\varphi_{\theta})[/math] is small, Theorem is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors, but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors [math]\theta[/math] with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case, for example, if [math]\theta[/math] has a small [math]\ell_1[/math] norm but is not sparse. For such [math]\theta[/math], the Lasso estimator still enjoys slow rates as in Theorem, which can be easily extended to the misspecified case (see Problem). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector [math]\theta \in \R^M[/math] such that [math]|\theta|_1\le 1[/math], there exists a vector [math]\theta'[/math] that is sparse and for which [math]\MSE(\varphi_{\theta'})[/math] is not much larger than [math]\MSE(\varphi_\theta)[/math]. The following theorem quantifies exactly the tradeoff between sparsity and [math]\MSE[/math]. It is often attributed to B. Maurey and was published by Pisier [1]. This is why it is referred to as Maurey's argument.
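Maurey's probabilistic argument is easy to simulate. The following hedged sketch (the orthonormal design and the particular [math]\theta[/math] are hypothetical choices) samples [math]k[/math] atoms i.i.d. with probabilities proportional to [math]|\theta_j|[/math] and averages them; the resulting [math]k[/math]-sparse vector has expected squared error at most [math]|\theta|_1^2/k[/math]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal "dictionary": columns of an orthogonal matrix (hypothetical setup).
n, M = 512, 256
Q, _ = np.linalg.qr(rng.standard_normal((n, M)))
theta = 1.0 / np.arange(1, M + 1)          # polynomially decaying (Zipf-like)
theta /= np.abs(theta).sum()               # normalize so that |theta|_1 = 1
target = Q @ theta

def maurey_sparsify(theta, k, rng):
    """Average k i.i.d. atoms drawn with P(J = j) = |theta_j| / |theta|_1.
    The result is k-sparse and equals theta in expectation."""
    l1 = np.abs(theta).sum()
    idx = rng.choice(len(theta), size=k, p=np.abs(theta) / l1)
    sparse = np.zeros_like(theta)
    for j in idx:
        sparse[j] += np.sign(theta[j]) * l1 / k
    return sparse

# Mean squared error decays like |theta|_1^2 / k as the sparsity budget k grows.
errs = {k: np.mean([np.sum((Q @ maurey_sparsify(theta, k, rng) - target) ** 2)
                    for _ in range(200)])
        for k in (4, 64)}
```

By orthonormality of the columns of [math]Q[/math], the simulated error equals [math]\|\theta'-\theta\|_2^2[/math], and Maurey's bound [math]|\theta|_1^2/k = 1/k[/math] is visible in the values for [math]k=4[/math] and [math]k=64[/math].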
Let [math]\{\varphi_1, \dots, \varphi_M\}[/math] be a dictionary normalized in such a way that
Define
Maurey's argument implies the following corollary.
Assume that the assumptions of Theorem hold and that the dictionary [math]\{\varphi_1, \dots, \varphi_M\}[/math] is normalized in such a way that
Choosing [math]\alpha=1/3[/math] in Theorem yields
- If [math]1 \le \bar k \le M[/math], then we get
[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}} [[/math]]
- If [math]\bar k \le 1[/math], then
[[math]] |\theta'|_1^2\le C\frac{\sigma^2\log (eM)}{n}\,, [[/math]]which yields[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C\frac{\sigma^2\log (eM)}{n} [[/math]]
- If [math]\bar k \ge M[/math], then
[[math]] \frac{\sigma^2M\log (eM)}{n}\le C \frac{|\theta'|_1^2}{M}\,. [[/math]]Therefore, on the one hand, if [math]M \ge \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], we get[[math]] \min_k\Big(\frac{|\theta'|_1^2}{k}+C\frac{\sigma^2k\log (eM)}{n}\Big) \le C \frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. [[/math]]On the other hand, if [math]M \le \frac{|\theta'|_1}{\sigma\sqrt{\log(eM)/n}}[/math], then for any [math]\theta \in \R^M[/math], we have[[math]] \begin{equation*} \frac{\sigma^2|\theta|_0\log (eM)}{n} \le \frac{\sigma^2M\log (eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log (eM)}{n}}\,. \end{equation*} [[/math]]
Combined,
Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the [math]\ell_0[/math] norm as in Theorem. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [2]. Maurey's argument allows us to enjoy the best of both the [math]\ell_0[/math] and the [math]\ell_1[/math] worlds. The rate adapts to the sparsity of the problem and can even be generalized to [math]\ell_q[/math]-sparsity (see Problem). However, it is clear from the proof that this argument is limited to squared [math]\ell_2[/math] norms such as the one appearing in [math]\MSE[/math], and extension to other risk measures is non-trivial. Some work has been done for non-Hilbert spaces [1][3] using more sophisticated arguments.
Nonparametric regression
So far, the oracle inequalities that we have derived do not deal with the approximation error [math]\MSE(\varphi_\theta)[/math]. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] is chosen appropriately. In what follows, we assume for simplicity that [math]d=1[/math] so that [math]f\,:\, \R \to \R[/math] and [math]\varphi_j\,:\, \R \to \R[/math].
Fourier decomposition
Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary [math]\cH=\{\varphi_1, \ldots, \varphi_M\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math]:
We will also deal with the case where [math]M=\infty[/math]. When [math]\cH[/math] is an orthonormal system, the coefficients [math]\theta_j^* \in \R[/math] defined by
are called Fourier coefficients of [math]f[/math]. Assume now that the regression function [math]f[/math] admits the following decomposition
There exist many choices for the orthonormal system, and we give only two examples.
Example Trigonometric basis. This is an orthonormal basis of [math]L_2([0,1])[/math]. It is defined by
for [math]k=1,2, \dots[/math] and [math]x \in [0,1][/math]. The fact that it is indeed an orthonormal system can easily be checked using trigonometric identities.
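As a quick numerical sanity check (a sketch; the indexing convention used below is one common normalization and may differ slightly from the display above), the orthonormality relations can be verified by a midpoint-rule approximation of the integrals over [math][0,1][/math]:

```python
import numpy as np

def phi(j, x):
    """Trigonometric basis, one common convention: phi_1 = 1,
    phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

# Midpoint rule: int_0^1 g(x) dx is approximated by the mean of g at grid midpoints.
x = (np.arange(200_000) + 0.5) / 200_000
G = np.array([[np.mean(phi(j, x) * phi(l, x)) for l in range(1, 8)]
              for j in range(1, 8)])
orthonormal = bool(np.allclose(G, np.eye(7), atol=1e-8))  # <phi_j, phi_l> = 1{j = l}
```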
The next example has received a lot of attention in the signal (sound, image, ...) processing community.
Example Wavelets. Let [math]\psi\,:\, \R \to \R[/math] be a sufficiently smooth and compactly supported function, called a "mother wavelet". Define the system of functions
It can be shown that for a suitable [math]\psi[/math], the dictionary [math]\{\psi_{j,k}, j,k \in \Z\}[/math] forms an orthonormal system of [math]L_2([0,1])[/math] and sometimes a basis. In the latter case, for any function [math]g \in L_2([0,1])[/math], it holds
The coefficients [math]\theta_{jk}[/math] are called wavelet coefficients of [math]g[/math]. The simplest example is given by the Haar system obtained by taking [math]\psi[/math] to be the following piecewise constant function (see Figure). We will not give more details about wavelets here but simply point the interested reader to[4].
[[File: | 400px | thumb | The Haar mother wavelet ]]
Sobolev classes and ellipsoids
We begin by describing a class of smooth functions where smoothness is understood in terms of its number of derivatives. Recall that [math]f^{(k)}[/math] denotes the [math]k[/math]-th derivative of [math]f[/math].
Fix parameters [math]\beta \in \{1, 2, \dots\}[/math] and [math]L \gt 0[/math]. The Sobolev class of functions [math]W(\beta,L)[/math] is defined by
Any function [math]f \in W(\beta, L)[/math] can be represented[Notes 1] as its Fourier expansion along the trigonometric basis:
where [math]\theta^*=\{\theta_j^*\}_{j \ge 1}[/math] is in the space of square-summable sequences [math]\ell_2(\N)[/math] defined by
For any [math]\beta \gt 0[/math], define the coefficients
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.
Fix [math]\beta\ge 1[/math] and [math]L \gt 0[/math] and let [math]\{\varphi_j\}_{j \ge 1}[/math] denote the trigonometric basis of [math]L_2([0,1])[/math]. Moreover, let [math]\{a_j\}_{j\ge 1}[/math] be defined as in\eqref{EQ:defaj}. A function [math]f \in W(\beta,L)[/math] can be represented as
Let us first define the Fourier coefficients [math]\{s_k(j)\}_{k \ge 1}[/math] of the [math]j[/math]th derivative [math]f^{(j)}[/math] of [math]f[/math] for [math]j=1, \ldots, \beta[/math]:
It can actually be shown that the converse is true, that is, any function with Fourier coefficients in [math]\Theta(\beta, Q)[/math] belongs to [math]W(\beta,L)[/math], but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write [math]f \in \Theta(\beta, Q)[/math] in this case and consider any real value for [math]\beta[/math].
The Sobolev ellipsoids enjoy the following properties
- For any [math]Q \gt 0[/math],
[[math]] 0 \lt \beta' \lt \beta \ \Rightarrow \ \Theta(\beta, Q) \subset \Theta(\beta',Q). [[/math]]
- For any [math]Q \gt 0[/math] and [math]f \in \Theta(\beta, Q)[/math],
[[math]] \beta \gt \frac12 \ \Rightarrow \ f\ \text{is continuous}. [[/math]]
The proof is left as an exercise (Problem). It turns out that the first [math]n-1[/math] functions in the trigonometric basis are not only orthonormal with respect to the inner product of [math]L_2[/math], but also with respect to the inner product associated with the fixed design, [math]\langle f,g\rangle:=\frac{1}{n} \sum_{i=1}^n f(X_i)g(X_i)[/math], when the design is chosen to be regular, i.e., [math]X_i=(i-1)/n[/math], [math]i=1, \ldots, n[/math].
Assume that [math]\{X_1, \ldots, X_n\}[/math] is the regular design, i.e., [math]X_i=(i-1)/n[/math]. Then, for any [math]M\le n-1[/math], the design matrix [math]\Phi=\{\varphi_j(X_i)\}_{\substack{1\le i \le n\\1\le j \le M}}[/math] satisfies the [math]\textsf{ORT}[/math] condition.
Fix [math]j,j'\in \{1, \ldots, n-1\}[/math], [math]j\neq j'[/math], and consider the inner product [math]\varphi_j^\top \varphi_{j'}[/math]. Write [math]k_{j}=\lfloor j/2\rfloor[/math] for the integer part of [math]j/2[/math] and define the vectors [math]a, b, a', b' \in \R^n[/math] with coordinates such that [math]e^{\frac{\mathsf{i}2\pi k_j s}{n}}=a_{s+1}+\mathsf{i} b_{s+1}[/math] and [math]e^{\frac{\mathsf{i}2\pi k_{j'} s}{n}}=a'_{s+1}+\mathsf{i} b'_{s+1}[/math] for [math] s \in \{0, \dots, n-1\} [/math]. It holds that
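The lemma can also be checked numerically. In the hedged sketch below, the [math]\textsf{ORT}[/math] condition is taken to mean [math]\Phi^\top\Phi = n I_M[/math], and the basis indexing is one common convention; both are assumptions for the purpose of the demo:

```python
import numpy as np

def phi(j, x):
    """Trigonometric basis, one common convention: phi_1 = 1,
    phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

n = 64
X = np.arange(n) / n                    # regular design X_i = (i-1)/n
M = n - 1
Phi = np.column_stack([phi(j, X) for j in range(1, M + 1)])

# ORT: the columns are orthogonal with squared norm n, i.e. Phi.T @ Phi = n * I_M.
ort = bool(np.allclose(Phi.T @ Phi, n * np.eye(M), atol=1e-9))
```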
Integrated squared error
As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if [math]\theta \in \Theta(\beta, Q)[/math], then [math]a_j^2\theta_j^2 \to 0[/math] as [math]j \to \infty[/math] so that [math]|\theta_j|=o(j^{-\beta})[/math]. Therefore, the [math]\theta_j[/math]s decay polynomially to zero and it makes sense to approximate [math]f[/math] by its truncated Fourier series
for any fixed [math]M[/math]. This truncation leads to a systematic error that vanishes as [math]M \to \infty[/math]. We are interested in understanding the rate at which this happens.
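This rate can be previewed numerically. In the hedged sketch below, the coefficients [math]\theta_j \propto j^{-\beta-1/2}[/math] are a hypothetical choice lying roughly on the boundary of a Sobolev ellipsoid of smoothness [math]\beta[/math]; for such coefficients the squared truncation error [math]\sum_{j \gt M} \theta_j^2[/math] decays like [math]M^{-2\beta}[/math]:

```python
import numpy as np

beta = 2.0
J = np.arange(1, 10**6 + 1, dtype=float)
theta = J ** (-beta - 0.5)   # hypothetical coefficients on the ellipsoid boundary

def tail(M):
    """Squared bias of truncating the series at level M: sum_{j > M} theta_j^2."""
    return np.sum(theta[M:] ** 2)

# Doubling M divides the squared bias by roughly 2^{2 beta} = 16 when beta = 2.
ratio = tail(100) / tail(200)
```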
The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter [math]M[/math] and the smoothness [math]\beta[/math].
Note that for any [math]\theta \in \Theta(\beta, Q)[/math], if [math]\beta \gt 1/2[/math], then
To prove the second part of the lemma, observe that
Note that the truncated Fourier series [math]\varphi_{\theta^*}^M[/math] is an oracle: this is what we see when we view [math]f[/math] through the lens of functions with only low-frequency harmonics. To estimate [math]\varphi_{\theta^*}^M[/math], consider the estimator [math]\varphi_{\thetals}[/math] where
which should be such that [math]\varphi_{\thetals}[/math] is close to [math]\varphi_{\theta^*}[/math]. For this estimator, we have proved (Theorem) an oracle inequality for the [math]\MSE[/math] that is of the form
It yields
where we used Lemma in the last equality. Together with\eqref{EQ:bias2} and Young's inequality [math]2ab\le \alpha a^2+b^2/\alpha, a,b \ge 0[/math] for any [math]\alpha \gt 0[/math], we get
for some positive constant [math]C[/math] when [math]\theta^* \in \Theta(\beta, Q)[/math]. As a result,
for any [math]\alpha \in (0,1)[/math]. Since Lemma implies that [math]|\varphi_{\thetals}^M-\varphi_{\theta^*}^M|_2^2=n\|\varphi_{\thetals}^M-\varphi_{\theta^*}^M\|_{L_2([0,1])}^2[/math], we have proved the following theorem.
Fix [math]\beta\ge (1+\sqrt{5})/4\simeq 0.81, Q \gt 0, \delta \gt 0[/math] and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Moreover, let [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] and [math]n[/math] be large enough so that [math]M \le n-1[/math]. Then the least squares estimator [math]\thetals[/math] defined in\eqref{EQ:defthetaLSmis} with [math]\{\varphi_j\}_{j=1}^M[/math] being the trigonometric basis, satisfies with probability [math]1-\delta[/math], for [math]n[/math] large enough,
Choosing [math]\alpha=1/2[/math] for example and absorbing [math]Q[/math] in the constants, we get from\eqref{EQ:almostLSNP} and Lemma that for [math]M \le n-1[/math],
Adaptive estimation
The rate attained by the projection estimator [math]\varphi_{\thetals}[/math] with [math]M=\lceil n^{\frac{1}{2\beta+1}}\rceil[/math] is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter [math]\beta[/math], which is typically unknown, in order to determine the truncation level [math]M[/math]. The purpose of adaptive estimation is precisely to adapt to the unknown [math]\beta[/math], that is, to build an estimator that does not depend on [math]\beta[/math] and yet attains a rate of the order of [math]Cn^{-\frac{2\beta}{2\beta+1}}[/math] (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in\eqref{EQ:defBICmis} and\eqref{EQ:defLassomis}, respectively. In view of Lemma, the design matrix [math]\Phi[/math] actually satisfies the assumption [math]\textsf{ORT}[/math] when we work with the trigonometric basis. This has two useful implications:
- Both estimators are actually thresholding estimators and can therefore be implemented efficiently.
- The condition [math]\textsf{INC($k$)}[/math] is automatically satisfied for any [math]k \ge 1[/math].
These observations lead to the following corollary.
Fix [math]\beta \gt (1+\sqrt{5})/4\simeq 0.81[/math], [math]Q \gt 0[/math], [math]\delta \gt 0[/math], and [math]n[/math] large enough to ensure [math]n-1\ge \lceil n^{\frac{1}{2\beta+1}}\rceil[/math], and assume the general regression model\eqref{EQ:regmodgen} with [math]f \in \Theta(\beta,Q)[/math] and [math]\eps\sim\sg_n(\sigma^2), \sigma^2 \le 1[/math]. Let [math]\{\varphi_j\}_{j=1}^{n-1}[/math] be the trigonometric basis. Denote by [math]\varphi^{n-1}_{\thetabic}[/math] (resp. [math]\varphi^{n-1}_{\thetalasso}[/math]) the BIC (resp. Lasso) estimator defined in\eqref{EQ:defBICmis} (resp. \eqref{EQ:defLassomis}) over [math]\R^{n-1}[/math] with regularization parameter given by\eqref{EQ:taubicmis} (resp. \eqref{EQ:taulassomis}). Then [math]\varphi^{n-1}_{\hat \theta}[/math], where [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], satisfies with probability [math]1-\delta[/math],
For [math]\hat \theta \in \{\thetabic, \thetalasso\}[/math], adapting the proofs of Theorem for the BIC estimator and Theorem for the Lasso estimator, we get for any [math]\theta \in \R^{n-1}[/math], with probability [math]1-\delta[/math],
While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [5], Chapter 3). The reason why we get this extra logarithmic factor here is because we use a hammer that's too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.
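To summarize the section, here is a hedged end-to-end sketch (the threshold level is an illustrative constant, not the exact [math]\tau[/math] from the theorems, and the basis convention is one common normalization): under the regular design the empirical Fourier coefficients are [math]\Phi^\top Y/n[/math], and by the [math]\textsf{ORT}[/math] property the BIC and Lasso estimators reduce to hard and soft thresholding of these coefficients, respectively.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(j, x):
    """Trigonometric basis, one common convention."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

n = 256
X = np.arange(n) / n                        # regular design X_i = (i-1)/n
f = lambda x: np.minimum(x, 1 - x)          # unknown regression function
sigma = 0.2
Y = f(X) + sigma * rng.standard_normal(n)

M = n - 1
Phi = np.column_stack([phi(j, X) for j in range(1, M + 1)])
theta_hat = Phi.T @ Y / n                   # least squares, since Phi.T @ Phi = n I

# Hard (BIC-like) and soft (Lasso-like) thresholding; the level is illustrative.
tau = 2 * sigma * np.sqrt(np.log(M) / n)
theta_bic = np.where(np.abs(theta_hat) > tau, theta_hat, 0.0)
theta_lasso = np.sign(theta_hat) * np.maximum(np.abs(theta_hat) - tau, 0.0)

mse_full = np.mean((Phi @ theta_hat - f(X)) ** 2)   # keeps all n-1 coefficients
mse_bic = np.mean((Phi @ theta_bic - f(X)) ** 2)    # adaptive, keeps only a few
```

Thresholding keeps the few large low-frequency coefficients and discards the pure-noise ones, which is how the estimator adapts to the unknown smoothness [math]\beta[/math] without using it.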
General references
Rigollet, Philippe; Hütter, Jan-Christian (2023). "High-Dimensional Statistics". arXiv:2310.19244 [math.ST].
Notes
References
- Pisier, Gilles (1981). "Remarques sur un résultat non publié de B. Maurey". Séminaire d'Analyse Fonctionnelle 1980–1981. École Polytechnique.
- Rigollet, Philippe; Tsybakov, Alexandre (2011). "Exponential screening and optimal rates of sparse estimation". Ann. Statist. 39.
- Donahue, Michael J.; Darken, Christian; Gurvits, Leonid; Sontag, Eduardo (1997). "Rates of convex approximation in non-Hilbert spaces". Constructive Approximation 13. Springer-Verlag.
- Ryan, O'Donnell (2005). "The PCP Theorem and Hardness of Approximation". SIAM J. Comput. 24. W. H. Freeman & Co.. doi: .
- Tsybakov, Alexandre (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer.