Exercise
[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Consider the linear regression model: [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_{n}, \sigma^2 \mathbf{I}_{nn})[/math]. Let [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] be the ridge regression estimator with penalty parameter [math]\lambda[/math]. The shrinkage of the ridge regression estimator propagates to the scale of the ‘ridge prediction’ [math]\mathbf{X} \hat{\bbeta}(\lambda)[/math]. To correct (a bit) for this shrinkage, [1] propose the alternative ridge regression estimator: [math]\hat{\bbeta}(\alpha) = [ (1-\alpha) \mathbf{X}^{\top} \mathbf{X} + \alpha \mathbf{I}_{pp}]^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] with shrinkage parameter [math]\alpha \in [0,1][/math].
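As a quick numerical illustration (not part of the exercise), both estimators can be computed directly from their closed forms; the simulated design, coefficients, and noise below are arbitrary choices.

```python
import numpy as np

# Simulated data (arbitrary illustration values).
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.standard_normal(n)

def ridge(X, y, lam):
    """Ordinary ridge estimator: (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def alt_ridge(X, y, alpha):
    """Alternative estimator: [(1-alpha) X'X + alpha I]^{-1} X'y."""
    return np.linalg.solve((1 - alpha) * (X.T @ X) + alpha * np.eye(X.shape[1]),
                           X.T @ y)

# With alpha = lam / (1 + lam), the two estimates differ only by the
# factor (1 + lam), as part a) below asks you to show.
lam = 2.0
alpha = lam / (1 + lam)
print(ridge(X, y, lam))
print(alt_ridge(X, y, alpha))
```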
- Let [math]\alpha = \lambda ( 1+ \lambda)^{-1}[/math]. Show that [math]\hat{\bbeta}(\alpha) = (1+\lambda) \hat{\bbeta}(\lambda)[/math] with [math]\hat{\bbeta}(\lambda)[/math] as in the introduction above.
- Use part a) and the parametrization of [math]\alpha[/math] provided there to show that some of the shrinkage has been undone. That is, show: [math]\mbox{Var}[ \mathbf{X} \hat{\bbeta}(\lambda)] \lt \mbox{Var}[ \mathbf{X} \hat{\bbeta}(\alpha)][/math] for any [math]\lambda \gt 0[/math].
- Use the singular value decomposition of [math]\mathbf{X}[/math] to show that [math]\lim_{\alpha \downarrow 0} \hat{\bbeta}(\alpha) = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] (provided this inverse exists) and [math]\lim_{\alpha \uparrow 1} \hat{\bbeta}(\alpha) = \mathbf{X}^{\top} \mathbf{Y}[/math].
- Derive the expectation, variance and mean squared error of [math]\hat{\bbeta}(\alpha)[/math].
- Temporarily assume that [math]p=1[/math] and let [math]\mathbf{X}^{\top} \mathbf{X} = c[/math] for some [math]c \gt 0[/math]. Then, [math]\mbox{MSE}[\hat{\bbeta}(\alpha)] = [ \alpha^2 (c -1)^2 \beta^2 + \sigma^2 c ] \, [ (1-\alpha) c + \alpha ]^{-2}[/math]. Does there exist an [math]\alpha \in (0,1)[/math] such that the mean squared error of [math]\hat{\bbeta}(\alpha)[/math] is smaller than that of its maximum likelihood counterpart? Motivate your answer. Hint: distinguish between different values of [math]c[/math].
- Now assume [math]p \gt 1[/math] and an orthonormal design matrix. Specify the regularization path of the alternative ridge regression estimator [math]\hat{\bbeta}(\alpha)[/math].
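For the [math]p=1[/math] part, a small numerical sketch (not part of the exercise; the values of [math]\beta[/math], [math]\sigma^2[/math], the [math]\alpha[/math]-grid, and the two choices of [math]c[/math] are arbitrary) evaluates the mean squared error, assembled from the estimator's squared bias plus variance, over a grid of [math]\alpha[/math] and compares it with the maximum likelihood MSE [math]\sigma^2/c[/math].

```python
import numpy as np

def mse_alt(alpha, c, beta, sigma2):
    # MSE of the alternative ridge estimator for p = 1, X'X = c:
    # squared bias plus variance, sharing the factor [(1-alpha)c + alpha]^{-2}.
    a = (1 - alpha) * c + alpha
    return (alpha**2 * (c - 1)**2 * beta**2 + sigma2 * c) / a**2

beta, sigma2 = 1.0, 1.0                  # arbitrary illustration values
alphas = np.linspace(1e-3, 0.999, 1000)  # grid over (0, 1)
for c in (0.5, 2.0):                     # one c below 1, one above
    improves = np.any(mse_alt(alphas, c, beta, sigma2) < sigma2 / c)
    print(f"c = {c}: smaller MSE than ML for some alpha: {improves}")
```

Running this for several values of [math]c[/math] on either side of 1 suggests why the hint asks you to distinguish between them.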