[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Consider a pathway comprising three genes called [math]A[/math], [math]B[/math], and [math]C[/math]. Let the random variables [math]Y_{i,a}[/math], [math]Y_{i,b}[/math], and [math]Y_{i,c}[/math] represent the expression levels of genes [math]A[/math], [math]B[/math], and [math]C[/math], respectively, in sample [math]i[/math]. One hundred realizations, i.e. [math]i=1, \ldots, n[/math] with [math]n=100[/math], of [math]Y_{i,a}[/math], [math]Y_{i,b}[/math], and [math]Y_{i,c}[/math] are available from an observational study. In order to assess how the expression levels of gene [math]A[/math] are affected by those of genes [math]B[/math] and [math]C[/math], a medical researcher fits the linear regression model
[[math]] \begin{eqnarray*} Y_{i,a} & = & \beta_b Y_{i,b} + \beta_c Y_{i,c} + \varepsilon_i, \end{eqnarray*} [[/math]]
with [math]\varepsilon_i \sim \mathcal{N}(0, \sigma^2)[/math]. This model is fitted by means of ridge regression, but with a separate penalty parameter, [math]\lambda_{b}[/math] and [math]\lambda_{c}[/math], for the two regression coefficients, [math]\beta_b[/math] and [math]\beta_c[/math], respectively.
- Write down the ridge penalized loss function employed by the researcher.
- Does a different choice of penalty parameter for the second regression coefficient affect the estimation of the first regression coefficient? Motivate your answer.
- The researcher decides that the second covariate [math]Y_{i,c}[/math] is irrelevant. Instead of removing the covariate from the model, the researcher decides to set [math]\lambda_{c} = \infty[/math]. Show that this results in the same ridge estimate for [math]\beta_b[/math] as when fitting (again by means of ridge regression) the model without the second covariate.
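Part c) can also be checked numerically. The sketch below uses simulated expression data (all numbers are hypothetical choices, not part of the exercise) and approximates [math]\lambda_c = \infty[/math] by a very large penalty, comparing the resulting estimate of [math]\beta_b[/math] with the single-covariate ridge fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
yb = rng.normal(size=n)                          # expression of gene B
yc = rng.normal(size=n)                          # expression of gene C
ya = 0.5 * yb - 0.3 * yc + rng.normal(scale=0.5, size=n)  # expression of gene A

def gen_ridge(X, y, Delta):
    # generalized ridge estimator: (X'X + Delta)^{-1} X'y
    return np.linalg.solve(X.T @ X + Delta, X.T @ y)

lam_b = 2.0
X = np.column_stack([yb, yc])

# a huge lambda_c stands in for lambda_c = infinity
beta_full = gen_ridge(X, ya, np.diag([lam_b, 1e10]))
# ridge fit of the model without the second covariate
beta_sub = gen_ridge(yb[:, None], ya, np.array([[lam_b]]))

print(beta_full, beta_sub)  # first coordinates (nearly) coincide; beta_c is (nearly) zero
```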
Consider the linear regression model [math]Y_i = \beta_1 X_{i,1} + \beta_2 X_{i,2} + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math]. Suppose estimates of the regression parameters [math](\beta_1, \beta_2)[/math] of this model are obtained through the minimization of the sum-of-squares augmented with a ridge-type penalty:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta_1 X_{i,1} - \beta_2 X_{i,2})^2 + \lambda (\beta_1^2 + 2 \nu \beta_1 \beta_2 + \beta_2^2), \end{eqnarray*} [[/math]]
with penalty parameters [math]\lambda \in \mathbb{R}_{\gt 0}[/math] and [math]\nu \in (-1, 1)[/math].
- Recall the equivalence between constrained and penalized estimation (cf. Section Constrained estimation). Sketch (for both [math]\nu=0[/math] and [math]\nu=0.9[/math]) the shape of the parameter constraint induced by the penalty above, and describe in words the qualitative difference between the two shapes.
- When [math]\nu = -1[/math] and [math]\lambda \rightarrow \infty[/math] the estimates of [math]\beta_1[/math] and [math]\beta_2[/math] (resulting from minimization of the penalized loss function above) converge towards each other:
[math]\lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1) = \lim_{\lambda \rightarrow \infty} \hat{\beta}_2(\lambda, -1)[/math]. Motivated by this observation, a data scientist incorporates the equality constraint [math]\beta_1 = \beta = \beta_2[/math] explicitly into the model, and s/he estimates the ‘joint regression parameter’ [math]\beta[/math] through the minimization (with respect to [math]\beta[/math]) of:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta X_{i,1} - \beta X_{i,2})^2 + \delta \beta^2, \end{eqnarray*} [[/math]]with penalty parameter [math]\delta \in \mathbb{R}_{\gt 0}[/math]. The data scientist is surprised to find that the resulting estimate [math]\hat{\beta}(\delta)[/math] does not have the same limiting (in the penalty parameter) behavior as [math]\hat{\beta}_1(\lambda, -1)[/math], i.e. [math]\lim_{\delta \rightarrow \infty} \hat{\beta} (\delta) \not= \lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1)[/math]. Explain the misconception of the data scientist.
- Assume that i) [math]n \gg 2[/math], ii) the unpenalized estimates [math](\hat{\beta}_1(0, 0), \hat{\beta}_2(0, 0))^{\top}[/math] equal [math](-2,2)[/math], and iii) that the two covariates [math]X_1[/math] and [math]X_2[/math] are zero-centered, have equal variance, and are strongly negatively correlated. Consider [math](\hat{\beta}_1(\lambda, \nu), \hat{\beta}_2(\lambda, \nu))^{\top}[/math] for both [math]\nu=-0.9[/math] and [math]\nu=0.9[/math]. For which value of [math]\nu[/math] do you expect the sum of the absolute value of the estimates to be largest? Hint: Distinguish between small and large values of [math]\lambda[/math] and think geometrically!
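The limiting behaviour in part b) can be illustrated numerically. The sketch below assumes the penalty takes the form [math]\lambda(\beta_1^2 + 2\nu\beta_1\beta_2 + \beta_2^2)[/math] (an assumption consistent with [math]\nu=-1[/math] yielding [math]\lambda(\beta_1-\beta_2)^2[/math]); the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def beta_hat(lam, nu):
    # minimizer of the penalized sum-of-squares with (assumed) penalty
    # lam * (b1^2 + 2 nu b1 b2 + b2^2) = lam * b' P b
    P = np.array([[1.0, nu], [nu, 1.0]])
    return np.linalg.solve(X.T @ X + lam * P, X.T @ Y)

# for nu = -1 and very large lam the two estimates converge to each other,
# namely to the least squares fit under the equality constraint b1 = b2
b = beta_hat(1e8, -1.0)
z = X[:, 0] + X[:, 1]
b_constrained = (z @ Y) / (z @ z)
print(b, b_constrained)
```

Note that the large-[math]\lambda[/math] limit is not zero here, unlike for the [math]\delta \beta^2[/math] penalty of the data scientist.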
Show that the generalized ridge regression estimator, [math]\hat{\bbeta}(\mathbf{\Delta}) = (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math], too (as in Question) can be obtained by ordinary least squares regression on an augmented data set. Hereto consider the Cholesky decomposition of the penalty matrix: [math]\mathbf{\Delta} = \mathbf{L}^{\top} \mathbf{L}[/math]. Now augment the matrix [math]\mathbf{X}[/math] with [math]p[/math] additional rows comprising the matrix [math]\mathbf{L}[/math], and augment the response vector [math]\mathbf{Y}[/math] with [math]p[/math] zeros.
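The data augmentation trick can be verified numerically. A minimal sketch, with simulated data and an arbitrarily chosen positive definite penalty matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# an (arbitrary) positive definite penalty matrix Delta
A = rng.normal(size=(p, p))
Delta = A @ A.T + p * np.eye(p)

# generalized ridge estimator: (X'X + Delta)^{-1} X'Y
beta_gr = np.linalg.solve(X.T @ X + Delta, X.T @ Y)

# OLS on the augmented data: stack L under X, zeros under Y, with Delta = L'L
C = np.linalg.cholesky(Delta)   # numpy yields lower-triangular C with Delta = C C'
L = C.T                         # hence L = C' satisfies Delta = L'L
X_aug = np.vstack([X, L])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_ols, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(beta_gr, beta_ols))  # True
```

The equality follows since the augmented normal equations read [math](\mathbf{X}^{\top}\mathbf{X} + \mathbf{L}^{\top}\mathbf{L})\bbeta = \mathbf{X}^{\top}\mathbf{Y}[/math].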
Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. Assume [math]\bbeta \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}^{-1})[/math] with [math]\bbeta_0 \in \mathbb{R}^p[/math] and [math]\mathbf{\Delta} \succ 0[/math], and an inverse gamma prior on the error variance. Verify (i.e., work out the details of the derivation) that the posterior mean coincides with the generalized ridge estimator defined as:
[[math]] \begin{eqnarray*} \hat{\bbeta}(\mathbf{\Delta}, \bbeta_0) & = & (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1} (\mathbf{X}^{\top} \mathbf{Y} + \mathbf{\Delta} \bbeta_0). \end{eqnarray*} [[/math]]
Consider the Bayesian linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math], a multivariate normal law as conditional prior distribution on the regression parameter: [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}^{-1})[/math], and an inverse gamma prior on the error variance [math]\sigma^2 \sim \mathcal{IG}(\gamma, \delta)[/math]. The consequences of various choices for the hyperparameters of the prior distribution on [math]\bbeta[/math] are studied.
- Consider the following conditional prior distributions on the regression parameters [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}_a^{-1})[/math] and [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}_b^{-1})[/math] with precision matrices [math]\mathbf{\Delta}_a, \mathbf{\Delta}_b \in \mathcal{S}_{++}^p[/math] such that [math]\mathbf{\Delta}_a \succeq \mathbf{\Delta}_b[/math], i.e. [math]\mathbf{\Delta}_a = \mathbf{\Delta}_b + \mathbf{D}[/math] for some positive semi-definite symmetric matrix of appropriate dimensions. Verify:
[[math]] \begin{eqnarray*} \mbox{Var}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0, \mathbf{\Delta}_a) & \preceq & \mbox{Var}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0, \mathbf{\Delta}_b), \end{eqnarray*} [[/math]]i.e. the smaller (in the positive definite ordering) the variance of the prior, the smaller that of the posterior.
- In the remainder of this exercise assume [math]\mathbf{\Delta}_a = \mathbf{\Delta} = \mathbf{\Delta}_b[/math]. Let [math]\bbeta_t[/math] be the ‘true’ or ‘ideal’ value of the regression parameter, i.e. the one used in the generation of the data, and show that a better initial guess yields a larger posterior density at [math]\bbeta_t[/math]. That is, take two prior mean parameters [math]\bbeta_0 = \bbeta_0^{\mbox{{\tiny (a)}}}[/math] and [math]\bbeta_0 = \bbeta_0^{\mbox{{\tiny (b)}}}[/math] such that the former is closer to [math]\bbeta_t[/math] than the latter. Here closeness is defined in terms of the Mahalanobis distance, which for, e.g., [math]\bbeta_t[/math] and [math]\bbeta_0^{\mbox{{\tiny (a)}}}[/math] is defined as [math]d_M(\bbeta_t, \bbeta_0^{\mbox{{\tiny (a)}}}; \mathbf{\Sigma}) = [(\bbeta_t - \bbeta_0^{\mbox{{\tiny (a)}}})^{\top} \mathbf{\Sigma}^{-1} (\bbeta_t - \bbeta_0^{\mbox{{\tiny (a)}}})]^{1/2}[/math] with positive definite covariance matrix [math]\mathbf{\Sigma} = \sigma^2 \mathbf{\Delta}^{-1}[/math]. Show that the posterior density [math]\pi_{\bbeta \, | \, \sigma^2} (\bbeta \, | \, \sigma^2, \mathbf{X}, \mathbf{Y}, \bbeta_0^{\mbox{{\tiny (a)}}}, \mathbf{\Delta})[/math] is larger at [math]\bbeta = \bbeta_t[/math] than that with the other prior mean parameter.
- Adopt the assumptions of part b) and show that a better initial guess yields a better posterior mean. That is, show
[[math]] \begin{eqnarray*} d_M[\bbeta_t, \mathbb{E}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0^{\mbox{{\tiny (a)}}}, \mathbf{\Delta}); \mathbf{\Sigma}] & \leq & d_M[\bbeta_t, \mathbb{E}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0^{\mbox{{\tiny (b)}}}, \mathbf{\Delta}); \mathbf{\Sigma}], \end{eqnarray*} [[/math]]now with [math]\mathbf{\Sigma} = \sigma^2 (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1}[/math].
The ridge penalty may be interpreted as a multivariate normal prior on the regression coefficients: [math]\bbeta \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I}_{pp})[/math]. Different priors may be considered. In case the covariates are spatially related in some sense (e.g. genomically), it may be of interest to assume a first-order autoregressive prior: [math]\bbeta \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{\Sigma}_a)[/math], in which [math]\mathbf{\Sigma}_a[/math] is a [math](p \times p)[/math]-dimensional correlation matrix with [math](\mathbf{\Sigma}_a)_{j_1, j_2} = \rho^{ | j_1 - j_2 | } [/math] for some correlation coefficient [math]\rho \in [0, 1)[/math].
- The penalized loss function associated with this AR(1) prior is:
[[math]] \begin{eqnarray*} \mathcal{L}(\bbeta; \lambda, \mathbf{\Sigma}_a) & = & \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda \bbeta^{\top} \mathbf{\Sigma}_a^{-1} \bbeta. \end{eqnarray*} [[/math]]Find the minimizer of this loss function.
- What is the effect of [math]\rho[/math] on the ridge estimates? Contrast this to the effect of [math]\lambda[/math]. Illustrate this on (simulated) data.
- Instead of an AR(1) prior assume a prior with a uniform correlation between the elements of [math]\bbeta[/math]. That is, replace [math]\mathbf{\Sigma}_a[/math] by [math]\mathbf{\Sigma}_u[/math], given by [math]\mathbf{\Sigma}_u = (1-\rho) \mathbf{I}_{pp} + \rho \mathbf{1}_{pp}[/math]. Investigate (again on data) the effect of changing from the AR(1) to the uniform prior on the ridge regression estimates.
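For parts b) and c), the illustration on simulated data can be set up as in the sketch below. It assumes the minimizer of the loss above takes the generalized ridge form [math](\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{\Sigma}^{-1})^{-1}\mathbf{X}^{\top}\mathbf{Y}[/math] (cf. part a); the data and the values of [math]\rho[/math] are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 8, 5.0
X = rng.normal(size=(n, p))
beta_true = np.linspace(1.0, -1.0, p)
Y = X @ beta_true + rng.normal(size=n)

def ridge(X, Y, lam, Sigma):
    # assumed minimizer of ||Y - X b||^2 + lam * b' Sigma^{-1} b
    return np.linalg.solve(X.T @ X + lam * np.linalg.inv(Sigma), X.T @ Y)

def Sigma_ar1(p, rho):
    # AR(1) correlation matrix: (Sigma)_{j1,j2} = rho^{|j1-j2|}
    j = np.arange(p)
    return rho ** np.abs(j[:, None] - j[None, :])

def Sigma_unif(p, rho):
    # uniform correlation matrix: (1-rho) I + rho 11'
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

for rho in (0.0, 0.5, 0.9):
    print("AR(1),  rho =", rho, ridge(X, Y, lam, Sigma_ar1(p, rho)))
    print("unif.,  rho =", rho, ridge(X, Y, lam, Sigma_unif(p, rho)))
```

At [math]\rho = 0[/math] both priors reduce to the standard ridge penalty; increasing [math]\rho[/math] changes the direction of shrinkage, whereas [math]\lambda[/math] controls its overall amount.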
Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math]. Suppose estimates of the regression parameters [math]\bbeta[/math] of this model are obtained through the minimization of the sum-of-squares augmented with a ridge-type penalty:
for known [math]\alpha \in [0,1][/math], nonrandom [math]p[/math]-dimensional target vectors [math]\bbeta_{t,a}[/math] and [math]\bbeta_{t,b}[/math] with [math]\bbeta_{t,a} \not= \bbeta_{t,b}[/math], and penalty parameter [math]\lambda \gt 0[/math]. Here [math]\mathbf{Y} = (Y_1, \ldots, Y_n)^{\top}[/math] and [math]\mathbf{X}[/math] is the [math]n \times p[/math] matrix with the [math]n[/math] row-vectors [math]\mathbf{X}_{i,\ast}[/math] stacked.
- When [math]p \gt n[/math] the sum-of-squares part does not have a unique minimum. Does the above employed penalty warrant a unique minimum for the loss function above (i.e., sum-of-squares plus penalty)? Motivate your answer.
- Could it be that for intermediate values of [math]\alpha[/math], i.e. [math]0 \lt \alpha \lt 1[/math], the loss function assumes smaller values than for the boundary values [math]\alpha=0[/math] and [math]\alpha=1[/math]? Motivate your answer.
- Draw the parameter constraint induced by this penalty for [math]\alpha = 0, 0.5[/math] and [math]1[/math] when [math]p = 2[/math].
- Derive the estimator of [math]\bbeta[/math], defined as the minimum of the loss function, explicitly.
- Discuss the behaviour of the estimator for [math]\alpha = 0, 0.5[/math] and [math]1[/math] as [math]\lambda \rightarrow \infty[/math].
Revisit Exercise. There the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math] is considered. The model comprises a single covariate and an intercept. Response and covariate data are: [math]\{(y_i, x_{i,1})\}_{i=1}^4 = \{ (1.4, 0.0), (1.4, -2.0), (0.8, 0.0), (0.4, 2.0) \}[/math].
- Evaluate the generalized ridge regression estimator of [math]\bbeta[/math] with target [math]\bbeta_0 = \mathbf{0}_2[/math] and penalty matrix [math]\mathbf{\Delta}[/math] given by [math](\mathbf{\Delta})_{11} = \lambda = (\mathbf{\Delta})_{22}[/math] and [math](\mathbf{\Delta})_{12} = \tfrac{1}{2} \lambda = (\mathbf{\Delta})_{21}[/math] in which [math]\lambda = 8[/math].
- A data scientist wishes to leave the intercept unpenalized. Hereto s/he sets [math](\mathbf{\Delta})_{11} = 0[/math] in part a). Why does the resulting estimate not coincide with the answer to Exercise? Motivate your answer.
Consider the linear regression model: [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_{n}, \sigma^2 \mathbf{I}_{nn})[/math]. Let [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] be the ridge regression estimator with penalty parameter [math]\lambda[/math]. The shrinkage of the ridge regression estimator propagates to the scale of the ‘ridge prediction’ [math]\mathbf{X} \hat{\bbeta}(\lambda)[/math]. To correct (a bit) for the shrinkage, [1] propose the alternative ridge regression estimator: [math]\hat{\bbeta}(\alpha) = [ (1-\alpha) \mathbf{X}^{\top} \mathbf{X} + \alpha \mathbf{I}_{pp}]^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] with shrinkage parameter [math]\alpha \in [0,1][/math].
- Let [math]\alpha = \lambda ( 1+ \lambda)^{-1}[/math]. Show that [math]\hat{\bbeta}(\alpha) = (1+\lambda) \hat{\bbeta}(\lambda)[/math] with [math]\hat{\bbeta}(\lambda)[/math] as in the introduction above.
- Use part a) and the parametrization of [math]\alpha[/math] provided there to show that some of the shrinkage has been undone. That is, show: [math]\mbox{Var}[ \mathbf{X} \hat{\bbeta}(\lambda)] \lt \mbox{Var}[ \mathbf{X} \hat{\bbeta}(\alpha)][/math] for any [math]\lambda \gt 0[/math].
- Use the singular value decomposition of [math]\mathbf{X}[/math] to show that [math]\lim_{\alpha \downarrow 0} \hat{\bbeta}(\alpha) = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] (should it exist) and [math]\lim_{\alpha \uparrow 1} \hat{\bbeta}(\alpha) = \mathbf{X}^{\top} \mathbf{Y}[/math].
- Derive the expectation, variance and mean squared error of [math]\hat{\bbeta}(\alpha)[/math].
- Temporarily assume that [math]p=1[/math] and let [math]\mathbf{X}^{\top} \mathbf{X} = c[/math] for some [math]c \gt 0[/math]. Then, [math]\mbox{MSE}[\hat{\bbeta}(\alpha)] = [\alpha^2 (c -1)^2 \beta^2 + \sigma^2 c] [ (1-\alpha) c + \alpha ]^{-2}[/math]. Does there exist an [math]\alpha \in (0,1)[/math] such that the mean squared error of [math]\hat{\bbeta}(\alpha)[/math] is smaller than that of its maximum likelihood counterpart? Motivate. Hint: distinguish between different values of [math]c[/math].
- Now assume [math]p \gt 1[/math] and an orthonormal design matrix. Specify the regularization path of the alternative ridge regression estimator [math]\hat{\bbeta}(\alpha)[/math].
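The identity of part a) and the limits of part c) lend themselves to a numerical check; a minimal sketch with simulated data (dimensions and [math]\lambda[/math] chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

def beta_lam(lam):
    # ordinary ridge estimator: (X'X + lam I)^{-1} X'Y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def beta_alpha(alpha):
    # alternative ridge estimator: [(1-alpha) X'X + alpha I]^{-1} X'Y
    return np.linalg.solve((1 - alpha) * X.T @ X + alpha * np.eye(p), X.T @ Y)

lam = 2.5
alpha = lam / (1 + lam)
# part a): beta(alpha) = (1 + lam) * beta(lam)
print(np.allclose(beta_alpha(alpha), (1 + lam) * beta_lam(lam)))  # True
# part c): alpha -> 0 gives OLS, alpha -> 1 gives X'Y
print(beta_alpha(0.0), beta_alpha(1.0))
```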
Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. Goldstein & Smith (1974) proposed a novel generalized ridge estimator of its [math]p[/math]-dimensional regression parameter:
[[math]] \begin{eqnarray*} \hat{\bbeta}_m(\lambda) & = & [(\mathbf{X}^{\top} \mathbf{X})^{m} + \lambda \mathbf{I}_{pp}]^{-1} (\mathbf{X}^{\top} \mathbf{X})^{m-1} \mathbf{X}^{\top} \mathbf{Y}, \end{eqnarray*} [[/math]]
with penalty parameter [math]\lambda \gt 0[/math] and ‘shape’ or ‘rate’ parameter [math]m[/math].
- Assume, only for part a), that [math]n=p[/math] and the design matrix is orthonormal. Show that, irrespective of the choice of [math]m[/math], this generalized ridge regression estimator coincides with the ‘regular’ ridge regression estimator.
- Consider the generalized ridge loss function [math]\| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \bbeta^{\top} \mathbf{A} \bbeta[/math] with [math]\mathbf{A}[/math] a [math]p \times p[/math]-dimensional symmetric matrix. For what [math]\mathbf{A}[/math] does [math]\hat{\bbeta}_m(\lambda)[/math] minimize this loss function?
- Let [math]d_j[/math] be the [math]j[/math]-th singular value of [math]\mathbf{X}[/math]. Show that in [math]\hat{\bbeta}_m(\lambda)[/math] the singular values are shrunken as [math](d_j^{2m} + \lambda)^{-1} d_j^{2m-1}[/math]. Hint: use the singular value decomposition of [math]\mathbf{X}[/math].
- For positive singular values, do larger values of [math]m[/math] lead to more shrinkage? Hint: involve particulars of the singular value in your answer.
- Express [math]\mathbb{E}[\hat{\bbeta}_m(\lambda)][/math] in terms of the design matrix, model and shrinkage parameters ([math]\lambda[/math] and [math]m[/math]).
- Express [math]\mbox{Var}[\hat{\bbeta}_m(\lambda)][/math] in terms of the design matrix, model and shrinkage parameters ([math]\lambda[/math] and [math]m[/math]).
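The shrinkage factors of part c) can be checked numerically. The sketch below assumes the estimator has the form [math]\hat{\bbeta}_m(\lambda) = [(\mathbf{X}^{\top}\mathbf{X})^{m} + \lambda \mathbf{I}_{pp}]^{-1} (\mathbf{X}^{\top}\mathbf{X})^{m-1} \mathbf{X}^{\top}\mathbf{Y}[/math], which is consistent with those factors; data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam, m = 15, 4, 3.0, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# assumed form of the Goldstein & Smith (1974) estimator:
# beta_m(lam) = [(X'X)^m + lam I]^{-1} (X'X)^{m-1} X'Y
XtX = X.T @ X
beta_m = np.linalg.solve(np.linalg.matrix_power(XtX, m) + lam * np.eye(p),
                         np.linalg.matrix_power(XtX, m - 1) @ X.T @ Y)

# the same estimator via the SVD X = U D V':
# singular values enter as d^{2m-1} / (d^{2m} + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (d ** (2 * m - 1) / (d ** (2 * m) + lam) * (U.T @ Y))

print(np.allclose(beta_m, beta_svd))  # True
```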