[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. This model is fitted to the data [math]\mathbf{X}_{1,\ast} = (4, -2)[/math] and [math]Y_1 = 10[/math] (a single observation), using the ridge regression estimator [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top}_{1,\ast} \mathbf{X}_{1,\ast} + \lambda \mathbf{I}_{22})^{-1} \mathbf{X}_{1,\ast}^{\top} Y_1[/math]. (A numerical sketch in R of parts a) and b) follows the list below.)
- Evaluate the ridge regression estimator for [math]\lambda=5[/math].
- Suppose [math]\bbeta = (1,-1)^{\top}[/math]. Evaluate the bias of the ridge regression estimator.
- Decompose the bias into a component due to the regularization and one attributable to the high-dimensionality of the study.
- Had [math]\bbeta[/math] equalled [math](2,-1)^{\top}[/math], the component of the bias due to the high-dimensionality would have vanished. Explain why.
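The following is a minimal R sketch of parts a) and b), based on the single observation given above; it is an illustration of the closed-form expressions under the stated model, not a model answer.

X      <- matrix(c(4, -2), nrow=1)     # the design matrix, i.e. X_{1,*}
Y      <- 10                           # the response Y_1
lambda <- 5
beta   <- c(1, -1)                     # the true regression parameter assumed in part b)
# part a): ridge regression estimator (X^T X + lambda I)^{-1} X^T Y
betaHat <- solve(t(X) %*% X + lambda * diag(2)) %*% t(X) %*% Y
# part b): bias E[betaHat(lambda)] - beta = [(X^T X + lambda I)^{-1} X^T X - I] beta
bias <- (solve(t(X) %*% X + lambda * diag(2)) %*% t(X) %*% X - diag(2)) %*% beta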
The linear regression model, [math]\mathbf{Y} =\mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math], is fitted to data with the following response, design matrix, and relevant summary statistics:
Hence, [math]p=2[/math] and [math]n=1[/math]. The fitting uses the ridge regression estimator.
- Section Expectation states that the regularization path of the ridge regression estimator, i.e. [math]\{ \hat{\bbeta}(\lambda) : \lambda \gt 0\}[/math], is confined to a line in [math]\mathbb{R}^2[/math]. Give the details of this line and draw it in the [math](\beta_1, \beta_2)[/math]-plane.
- Verify numerically, for a set of penalty parameter values, whether the corresponding estimates [math]\hat{\bbeta}(\lambda)[/math] are indeed confined to the line found in part a). Do this by plotting the estimates in the [math](\beta_1, \beta_2)[/math]-plane (along with the line found in part a)). In this, use the following set of [math]\lambda[/math]'s:
# 100 penalty parameter values, equidistant on the log scale between 1e-15 and 1
lambdas <- exp(seq(log(10^(-15)), log(1), length.out=100))
- Part b) reveals that, for small values of [math]\lambda[/math], the estimates fall outside the line found in part a). Using the theory outlined in Section Expectation, the estimates can be decomposed into a part that falls on this line and a part that is orthogonal to it. The latter is given by [math](\mathbf{I}_{22} - \mathbf{P}_x) \hat{\bbeta}(\lambda)[/math], where [math]\mathbf{P}_x[/math] is the projection matrix onto the space spanned by the rows of [math]\mathbf{X}[/math] (i.e. the columns of [math]\mathbf{X}^{\top}[/math]). Evaluate the projection matrix [math]\mathbf{P}_x[/math].
- Numerical inaccuracy, resulting from the ill-conditionedness of [math]\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{22}[/math], causes [math](\mathbf{I}_{22} - \mathbf{P}_x) \hat{\bbeta}(\lambda) \not= \mathbf{0}_2[/math]. Verify that the observed non-null [math](\mathbf{I}_{22} - \mathbf{P}_x) \hat{\bbeta}(\lambda)[/math] are indeed due to numerical inaccuracy. Hereto generate a log-log plot of the condition number of [math]\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{22}[/math] vs. [math]\| (\mathbf{I}_{22} - \mathbf{P}_x) \hat{\bbeta}(\lambda) \|_2[/math] for the provided set of [math]\lambda[/math]'s (a sketch of this check in R follows the list).
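A sketch in R of parts b) through d). The [math]1 \times 2[/math] design matrix and response below are hypothetical stand-ins, as the exercise's own data are not reproduced here; substitute the actual values.

# hypothetical data standing in for the exercise's response and design matrix
X <- matrix(c(2, 1), nrow=1)
Y <- 3
lambdas <- exp(seq(log(10^(-15)), log(1), length.out=100))
# part b): ridge estimates for all penalty values (one column per lambda);
# tol=0 switches off R's near-singularity check, which the smallest lambdas may trigger
betaHats <- sapply(lambdas, function(lambda){
  solve(t(X) %*% X + lambda * diag(2), t(X) %*% Y, tol=0) })
# part c): projection matrix onto the row space of X
Px <- t(X) %*% solve(X %*% t(X)) %*% X
# part d): log-log plot of the condition number vs. the norm of the off-line component
condNums <- sapply(lambdas, function(lambda){ kappa(t(X) %*% X + lambda * diag(2), exact=TRUE) })
offLine  <- apply((diag(2) - Px) %*% betaHats, 2, function(z){ sqrt(sum(z^2)) })
plot(condNums, offLine, log="xy",
     xlab="condition number", ylab="||(I - Px) betaHat(lambda)||_2")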
Provide an alternative proof of the theorem that states the existence of a positive value of the penalty parameter for which the ridge regression estimator has a superior MSE compared to that of its maximum likelihood counterpart. Hereto show that the derivative of the MSE with respect to the penalty parameter is negative at zero. In this, use the following result from matrix calculus (the derivative of a matrix inverse):
[[math]] \begin{eqnarray*} \frac{\partial}{\partial \lambda} [\mathbf{A} (\lambda)]^{-1} & = & - [\mathbf{A} (\lambda)]^{-1} \Big[ \frac{\partial}{\partial \lambda} \mathbf{A} (\lambda) \Big] [\mathbf{A} (\lambda)]^{-1}, \end{eqnarray*} [[/math]]
and the chain rule
[[math]] \begin{eqnarray*} \frac{\partial}{\partial \lambda} [ \mathbf{A} (\lambda) \mathbf{B} (\lambda) ] & = & \Big[ \frac{\partial}{\partial \lambda} \mathbf{A} (\lambda) \Big] \mathbf{B} (\lambda) + \mathbf{A} (\lambda) \Big[ \frac{\partial}{\partial \lambda} \mathbf{B} (\lambda) \Big], \end{eqnarray*} [[/math]]
where [math]\mathbf{A} (\lambda)[/math] and [math]\mathbf{B} (\lambda)[/math] are square, symmetric matrices parameterized by the scalar [math]\lambda[/math].
Note: the proof in the lecture notes is a stronger one, as it provides an interval of penalty parameter values on which the MSE of the ridge regression estimator is smaller than that of the maximum likelihood one.
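As a possible route (not part of the original exercise statement), recall the standard decomposition of the ridge regression estimator's MSE into a variance and a squared-bias part under the linear model:
[[math]] \begin{eqnarray*} MSE[\hat{\bbeta}(\lambda)] & = & \sigma^2 \mbox{tr}[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{X} \, (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ] + \lambda^2 \bbeta^{\top} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-2} \bbeta. \end{eqnarray*} [[/math]]
Differentiating both terms with respect to [math]\lambda[/math] by means of the identities above and evaluating at [math]\lambda = 0[/math] (the variance term then contributes a strictly negative derivative, the squared-bias term a vanishing one) yields the claim.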
Recall that there exists a [math]\lambda \gt 0[/math] such that [math]MSE(\hat{\bbeta}) \gt MSE[\hat{\bbeta}(\lambda)][/math]. Verify that this carries over to the linear predictor. That is, show that there exists a [math]\lambda \gt 0[/math] such that [math]MSE(\widehat{\mathbf{Y}}) = MSE(\mathbf{X} \hat{\bbeta}) \gt MSE[\mathbf{X}\hat{\bbeta}(\lambda)][/math].
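As a hedged pointer (the symbol [math]\mathbf{b}[/math] for a generic estimator of [math]\bbeta[/math] is introduced here for illustration only), the MSE of the corresponding linear predictor, understood as [math]\mathbb{E} ( \| \mathbf{X} \mathbf{b} - \mathbf{X} \bbeta \|_2^2 )[/math], satisfies:
[[math]] \begin{eqnarray*} MSE(\mathbf{X} \mathbf{b}) & = & \mbox{tr}[ \mathbf{X}^{\top} \mathbf{X} \, \mbox{Var}(\mathbf{b}) ] + [\mathbb{E}(\mathbf{b}) - \bbeta]^{\top} \mathbf{X}^{\top} \mathbf{X} \, [\mathbb{E}(\mathbf{b}) - \bbeta]. \end{eqnarray*} [[/math]]
Comparing this quantity for [math]\mathbf{b} = \hat{\bbeta}[/math] and [math]\mathbf{b} = \hat{\bbeta}(\lambda)[/math] may then proceed along the same lines as in the preceding exercise.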
Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with the [math]\varepsilon_i[/math] i.i.d. normally distributed with zero mean and a common but unknown variance. Information on the response, design matrix, and relevant summary statistics is:
from which the sample size and dimension of the covariate space are immediate.
- Evaluate the ridge regression estimator [math]\hat{\bbeta}(\lambda)[/math] with [math]\lambda=1[/math].
- Evaluate the variance of the ridge regression estimator, i.e. [math]\widehat{\mbox{Var}}[\hat{\bbeta}(\lambda)][/math], for [math]\lambda = 1[/math]. In this, the error variance [math]\sigma^2[/math] is estimated by [math]n^{-1} \| \mathbf{Y} - \mathbf{X} \hat{\bbeta}(\lambda) \|_2^2[/math] (a computational sketch in R follows this list).
- Recall that the ridge regression estimator [math]\hat{\bbeta}(\lambda)[/math] is normally distributed. Consider the interval
[[math]] \begin{eqnarray*} \mathcal{C} & = & \big(\hat{\bbeta}(\lambda) - 2 \{ \widehat{\mbox{Var}}[\hat{\bbeta}(\lambda)] \}^{1/2}, \, \hat{\bbeta}(\lambda) + 2 \{ \widehat{\mbox{Var}}[\hat{\bbeta}(\lambda)] \}^{1/2} \big). \end{eqnarray*} [[/math]]Is this a genuine (approximate) [math]95\%[/math] confidence interval for [math]\bbeta[/math]? If so, motivate. If not, what is the interpretation of this interval?
- Suppose the design matrix is augmented with an extra column identical to the first one. Is the estimate of the error variance unaffected, or not? Motivate.
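A minimal R sketch of parts a) and b). The response and design matrix below are hypothetical stand-ins for the exercise's own data, so only the computations, not the numbers, carry over.

# hypothetical response and design matrix standing in for the exercise's data
Y <- c(1.3, -0.5, 2.1, 0.8)
X <- matrix(c( 1,  2,
              -1,  0,
               2,  1,
               0, -1), nrow=4, byrow=TRUE)
n      <- nrow(X)
p      <- ncol(X)
lambda <- 1
# part a): ridge regression estimator (X^T X + lambda I)^{-1} X^T Y
betaHat <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% Y
# part b): error variance estimate and estimated variance of the ridge estimator,
# Var[betaHat(lambda)] = sigma^2 (X^T X + lambda I)^{-1} X^T X (X^T X + lambda I)^{-1}
sigma2Hat  <- sum((Y - X %*% betaHat)^2) / n
W          <- solve(t(X) %*% X + lambda * diag(p))
varBetaHat <- sigma2Hat * W %*% t(X) %*% X %*% W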
Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math]. The ridge regression estimator of [math]\bbeta[/math] is denoted by [math]\hat{\bbeta}(\lambda)[/math] for [math]\lambda \gt 0[/math].
- Show:
[[math]] \begin{eqnarray*} \mbox{tr}\{ \mbox{Var}[ \widehat{\mathbf{Y}} (\lambda)] \} \, \, \, = \, \, \, \sigma^2 \sum\nolimits_{j=1}^p (\mathbf{D}_x)_{jj}^4 [(\mathbf{D}_x)_{jj}^2 + \lambda ]^{-2}, \end{eqnarray*} [[/math]]where [math]\widehat{\mathbf{Y}} (\lambda) = \mathbf{X} \hat{\bbeta}(\lambda)[/math] and [math]\mathbf{D}_x[/math] is the diagonal matrix containing the singular values of [math]\mathbf{X}[/math] on its diagonal.
- The coefficient of determination is defined as:
[[math]] \begin{eqnarray*} R^2 & = & [\mbox{Var}(\mathbf{Y}) - \mbox{Var}(\widehat{\mathbf{Y}})] / [\mbox{Var}(\mathbf{Y}) ] \, \, \, = \, \, \, [ \mbox{Var}(\mathbf{Y} - \widehat{\mathbf{Y}}) ] / [ \mbox{Var}(\mathbf{Y}) ], \end{eqnarray*} [[/math]]where [math]\widehat{\mathbf{Y}} = \mathbf{X} \hat{\bbeta}[/math] with [math]\hat{\bbeta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math]. Show that the second equality does not hold when [math]\widehat{\mathbf{Y}}[/math] is replaced by the ridge regression predictor defined as [math]\widehat{\mathbf{Y}}(\lambda) = \mathbf{H}(\lambda) \mathbf{Y}[/math], where [math]\mathbf{H}(\lambda) = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top}[/math]. Hint: use the fact that [math]\mathbf{H}(\lambda)[/math] is not a projection matrix, i.e. [math]\mathbf{H}(\lambda) \not= [\mathbf{H}(\lambda)]^2[/math] (a numerical illustration in R is given below).
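Both parts lend themselves to a numerical sanity check. The R sketch below uses simulated data (all numbers hypothetical): it compares the two sides of the trace identity of part a) and illustrates, for part b), that [math]\mathbf{H}(\lambda)[/math] is not idempotent.

set.seed(1)
n <- 10; p <- 3; lambda <- 2; sigma2 <- 1   # hypothetical dimensions and parameters
X <- matrix(rnorm(n * p), nrow=n)           # simulated design matrix
# part a): tr{Var[Yhat(lambda)]} via the ridge hat matrix and via the singular values
H   <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)
lhs <- sigma2 * sum(diag(H %*% t(H)))
d   <- svd(X)$d
rhs <- sigma2 * sum(d^4 / (d^2 + lambda)^2)
# lhs and rhs should agree up to numerical precision
# part b): H(lambda) is not idempotent, unlike the least squares hat matrix
max(abs(H %*% H - H))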