By Admin
Jun 24'23

This exercise is inspired by one from [1].

Consider the simple linear regression model [math]Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i[/math] with [math]\varepsilon_i \sim \mathcal{N}(0, \sigma^2)[/math]. The data on the covariate and response are: [math]\mathbf{X}^{\top} = (X_1, X_2, \ldots, X_{8})^{\top} = (-2, -1, -1, -1, 0, 1, 2, 2)^{\top}[/math] and [math]\mathbf{Y}^{\top} = (Y_1, Y_2, \ldots, Y_{8})^{\top} = (35, 40, 36, 38, 40, 43, 45, 43)^{\top}[/math], with corresponding elements in the same order.

  • a) Find the ridge regression estimator for the data above for a general value of [math]\lambda[/math].
  • b) Evaluate the fit, i.e. [math]\widehat{Y}_i(\lambda)[/math], for [math]\lambda=10[/math]. Would you judge the fit to be good? If not, what is the most striking feature that you find unsatisfactory?
  • c) Now zero-center the covariate and response data, denote these by [math]\tilde{X}_i[/math] and [math]\tilde{Y}_i[/math], and evaluate the ridge estimator of [math]\tilde{Y}_i = \beta_1 \tilde{X}_i + \varepsilon_i[/math] at [math]\lambda=4[/math]. Verify that, in terms of the original data, the resulting predictor is [math]\widehat{Y}_i(\lambda) = 40 + 1.75 X_i[/math].

Note that the estimate employed in the predictor found in part c) effectively combines the maximum likelihood estimate of the intercept with a ridge regression estimate of the slope. Put differently, only the slope has been regularized/penalized.

  1. Draper, N. R. and Smith, H. (1998). Applied Regression Analysis (3rd edition). John Wiley & Sons.
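
A quick numerical check of parts b) and c) is sketched below in Python (numpy is used here for convenience; it is not part of the original exercise).

<syntaxhighlight lang="python">
import numpy as np

# Covariate and response data of the exercise.
x = np.array([-2, -1, -1, -1, 0, 1, 2, 2], dtype=float)
y = np.array([35, 40, 36, 38, 40, 43, 45, 43], dtype=float)

# Part b): ridge estimator with the intercept included in the penalty,
# evaluated at lambda = 10, and the corresponding fit.
X = np.column_stack([np.ones_like(x), x])
beta_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(2), X.T @ y)
print("fit at lambda = 10:", X @ beta_ridge)

# Part c): zero-center both variables and penalize only the slope (lambda = 4).
x_c, y_c = x - x.mean(), y - y.mean()
slope = np.sum(x_c * y_c) / (np.sum(x_c**2) + 4.0)
print("intercept and slope:", y.mean(), slope)   # should give 40 and 1.75
</syntaxhighlight>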
By Admin
Jun 24'23


Consider the simple linear regression model [math]Y_i = \beta_0 + X_{i} \beta_1 + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math]. The model comprises a single covariate and an intercept. Response and covariate data are: [math]\{(y_i, x_{i})\}_{i=1}^4 = \{ (1.4, 0.0), (1.4, -2.0), (0.8, 0.0), (0.4, 2.0) \}[/math]. Find the value of [math]\lambda[/math] for which the ridge regression estimate, with an unregularized/unpenalized intercept as in part c) of the first exercise above, equals [math](1, -\tfrac{1}{8})^{\top}[/math].
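
A candidate value of [math]\lambda[/math] can be verified numerically; below a minimal Python sketch, which carries over the centring-based treatment of the unpenalized intercept from part c) of the first exercise as an assumption.

<syntaxhighlight lang="python">
import numpy as np

# Data of this exercise: (y_i, x_i) pairs.
y = np.array([1.4, 1.4, 0.8, 0.4])
x = np.array([0.0, -2.0, 0.0, 2.0])

def ridge_unpenalized_intercept(x, y, lam):
    # Centre the data, ridge-estimate the slope on the centred data, and
    # recover the (unpenalized) intercept from the sample means.
    x_c, y_c = x - x.mean(), y - y.mean()
    slope = np.sum(x_c * y_c) / (np.sum(x_c**2) + lam)
    return y.mean() - slope * x.mean(), slope

# Compare candidate values of lambda against the target estimate (1, -1/8).
for lam in (1.0, 2.0, 4.0, 8.0, 16.0):
    print(lam, ridge_unpenalized_intercept(x, y, lam))
</syntaxhighlight>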

By Admin
Jun 24'23

Plot the regularization path of the ridge regression estimator over the range [math]\lambda \in (0, 20{,}000][/math] using the data of the Example.
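
The cross-referenced Example and its data are not reproduced on this page; the sketch below therefore uses the data of the first exercise above purely as a stand-in, to illustrate how such a plot may be produced (numpy and matplotlib assumed available).

<syntaxhighlight lang="python">
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (from the first exercise above); replace with the data of the
# Example referred to in the exercise.
x = np.array([-2, -1, -1, -1, 0, 1, 2, 2], dtype=float)
y = np.array([35, 40, 36, 38, 40, 43, 45, 43], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Evaluate the ridge estimator over a grid of penalty parameters.
lambdas = np.logspace(-3, np.log10(20000), 500)
path = np.array([np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
                 for lam in lambdas])

plt.plot(lambdas, path)
plt.xscale("log")
plt.xlabel(r"$\lambda$")
plt.ylabel("ridge regression estimate")
plt.legend([r"$\hat\beta_0(\lambda)$", r"$\hat\beta_1(\lambda)$"])
plt.show()
</syntaxhighlight>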

By Admin
Jun 24'23

This exercise is freely rendered from [1], but can be found in many other places. The original source is unknown to the author.

Show that the ridge regression estimator can be obtained by ordinary least squares regression on an augmented data set. To this end, augment the matrix [math]\mathbf{X}[/math] with [math]p[/math] additional rows [math]\sqrt{\lambda} \mathbf{I}_{pp}[/math], and augment the response vector [math]\mathbf{Y}[/math] with [math]p[/math] zeros.

  1. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
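
The claimed equivalence can also be checked numerically before proving it; a minimal Python sketch on an arbitrary simulated data set (the data below are made up for illustration and are not part of the exercise):

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary simulated data (for illustration only).
rng = np.random.default_rng(1)
n, p, lam = 10, 3, 2.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Ridge regression estimator.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Ordinary least squares on the augmented data: X gains p rows sqrt(lambda) I_pp,
# Y gains p zeros.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_ols_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_ols_aug))   # expected: True
</syntaxhighlight>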
By Admin
Jun 24'23

[math] \require{textmacros} \def \bbeta {\bf \beta} [/math]

Recall the definitions of [math]\hat{\bbeta}_{\mbox{{\tiny MLS}}}[/math], [math]\mathbf{W}_{\lambda}[/math] and [math]\hat{\bbeta}(\lambda)[/math] from the section The ridge regression estimator. Show that, in contrast to the linear relation between the ridge and maximum likelihood estimators that holds when [math]\mathbf{X}^{\top} \mathbf{X}[/math] is of full rank, high-dimensionally [math]\hat{\bbeta}(\lambda) \not= \mathbf{W}_{\lambda} \hat{\bbeta}_{\mbox{{\tiny MLS}}}[/math].

By Admin
Jun 24'23

[math] \require{textmacros} \def\bbeta{\bf \beta} \def\vvarepsilon{\bf \varepsilon} [/math]

Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with the [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math]. Suppose the parameter [math]\bbeta[/math] is estimated by the ridge regression estimator [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math].

  • The vector of ‘ridge residuals’, defined as [math]\vvarepsilon(\lambda) = \mathbf{Y} - \mathbf{X} \hat{\bbeta}(\lambda)[/math], is normally distributed. Why?
  • Show that [math]\mathbb{E}[\vvarepsilon(\lambda)] = [\mathbf{I}_{nn} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top}] \mathbf{X} \bbeta[/math].
  • Show that [math]\mbox{Var}[\vvarepsilon(\lambda)] = \sigma^2 [\mathbf{I}_{nn} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top}]^2[/math].
  • Could the normal probability plot, i.e. a qq-plot in which the quantiles of the standard normal distribution are plotted against those of the ridge residuals, be used to assess the normality of the latter? Motivate your answer.
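
The mean and variance expressions in the second and third parts can be checked by simulation; below a minimal Python sketch in which the design matrix, coefficient vector, and noise level are made up purely for illustration and are not part of the exercise.

<syntaxhighlight lang="python">
import numpy as np

# Monte Carlo check of the mean and variance of the ridge residuals.
rng = np.random.default_rng(2)
n, p, lam, sigma = 6, 3, 1.5, 0.7
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

# Ridge residuals are A Y with A = I - H(lambda).
A = np.eye(n) - X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

reps = 200_000
Y = X @ beta + sigma * rng.normal(size=(reps, n))
resid = Y @ A.T

# Maximal deviation from the stated formulas (should be close to zero).
print(np.abs(resid.mean(axis=0) - A @ X @ beta).max())
print(np.abs(np.cov(resid, rowvar=False) - sigma**2 * A @ A).max())
</syntaxhighlight>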
By Admin
Jun 24'23

[math] \require{textmacros} \def\bbeta{\bf \beta} \def\vvarepsilon{\bf \varepsilon} [/math]

The coefficients [math]\bbeta[/math] of a linear regression model, [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math], are estimated by [math]\hat{\bbeta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math]. The associated fitted values are then given by [math]\widehat{\mathbf{Y}} = \mathbf{X} \, \hat{\bbeta} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y} = \mathbf{H} \mathbf{Y}[/math], where [math]\mathbf{H} =\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}[/math] is referred to as the hat matrix. The hat matrix [math]\mathbf{H}[/math] is a projection matrix as it satisfies [math]\mathbf{H} = \mathbf{H}^2[/math] (indeed, [math]\mathbf{H}^2 = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} = \mathbf{H}[/math]). Hence, linear regression projects the response [math]\mathbf{Y}[/math] onto the vector space spanned by the columns of [math]\mathbf{X}[/math]. Consequently, the residuals [math]\hat{\vvarepsilon}[/math] and the fitted values [math]\hat{\mathbf{Y}}[/math] are orthogonal. Now consider the ridge estimator of the regression coefficients: [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math]. Let [math]\hat{\mathbf{Y}}(\lambda) = \mathbf{X} \hat{\bbeta}(\lambda)[/math] be the vector of associated fitted values.

  • Show that the ridge hat matrix [math]\mathbf{H}(\lambda) = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top}[/math] is not a projection matrix for any [math]\lambda \gt 0[/math], i.e. [math]\mathbf{H}(\lambda) \not= [\mathbf{H}(\lambda)]^2[/math].
  • Show that for any [math]\lambda \gt 0[/math] the ‘ridge fit’ [math]\widehat{\mathbf{Y}}(\lambda)[/math] is not orthogonal to the associated ‘ridge residuals’ [math]\hat{\vvarepsilon}(\lambda) = \mathbf{Y} - \mathbf{X} \hat{\bbeta}(\lambda)[/math].
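
Both claims can be illustrated numerically before proving them; a minimal Python sketch on an arbitrary simulated design (made up for illustration, not part of the exercise):

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary simulated design and response (for illustration only).
rng = np.random.default_rng(3)
n, p, lam = 8, 3, 2.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)                        # hat matrix
H_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # ridge 'hat' matrix

print(np.allclose(H, H @ H))              # True: H is a projection
print(np.allclose(H_lam, H_lam @ H_lam))  # False: H(lambda) is not

Y_hat = H_lam @ Y                          # ridge fit
resid = Y - Y_hat                          # ridge residuals
print(Y_hat @ resid)                       # nonzero: fit and residuals not orthogonal
</syntaxhighlight>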