[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Consider a pathway comprising three genes called [math]A[/math], [math]B[/math], and [math]C[/math]. Let the random variables [math]Y_{i,a}[/math], [math]Y_{i,b}[/math], and [math]Y_{i,c}[/math] represent the expression levels of genes [math]A[/math], [math]B[/math], and [math]C[/math], respectively, in sample [math]i[/math]. One hundred realizations, i.e. [math]i=1, \ldots, n[/math] with [math]n=100[/math], of [math]Y_{i,a}[/math], [math]Y_{i,b}[/math], and [math]Y_{i,c}[/math] are available from an observational study. In order to assess how the expression levels of gene [math]A[/math] are affected by those of genes [math]B[/math] and [math]C[/math], a medical researcher fits the linear regression model
[[math]] \begin{eqnarray*} Y_{i,a} & = & \beta_b Y_{i,b} + \beta_c Y_{i,c} + \varepsilon_i, \end{eqnarray*} [[/math]]
with [math]\varepsilon_i \sim \mathcal{N}(0, \sigma^2)[/math]. This model is fitted by means of ridge regression, but with a separate penalty parameter, [math]\lambda_{b}[/math] and [math]\lambda_{c}[/math], for the two regression coefficients, [math]\beta_b[/math] and [math]\beta_c[/math], respectively.
- Write down the ridge penalized loss function employed by the researcher.
- Does a different choice of penalty parameter for the second regression coefficient affect the estimation of the first regression coefficient? Motivate your answer.
- The researcher decides that the second covariate [math]Y_{i,c}[/math] is irrelevant. Instead of removing the covariate from the model, the researcher decides to set [math]\lambda_{c} = \infty[/math]. Show that this results in the same ridge estimate for [math]\beta_b[/math] as when fitting (again by means of ridge regression) the model without the second covariate.
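Part c) can also be checked numerically. The sketch below uses simulated expression data (all numbers are hypothetical choices, not part of the exercise) and approximates [math]\lambda_c = \infty[/math] by a very large penalty, comparing the resulting estimate of [math]\beta_b[/math] with the single-covariate ridge fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
yb = rng.normal(size=n)                          # expression of gene B
yc = rng.normal(size=n)                          # expression of gene C
ya = 0.5 * yb - 0.3 * yc + rng.normal(scale=0.5, size=n)  # expression of gene A

def gen_ridge(X, y, Delta):
    # generalized ridge estimator: (X'X + Delta)^{-1} X'y
    return np.linalg.solve(X.T @ X + Delta, X.T @ y)

lam_b = 2.0
X = np.column_stack([yb, yc])

# a huge lambda_c stands in for lambda_c = infinity
beta_full = gen_ridge(X, ya, np.diag([lam_b, 1e10]))
# ridge fit of the model without the second covariate
beta_sub = gen_ridge(yb[:, None], ya, np.array([[lam_b]]))

print(beta_full, beta_sub)  # first coordinates (nearly) coincide; beta_c is (nearly) zero
```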
Consider the linear regression model [math]Y_i = \beta_1 X_{i,1} + \beta_2 X_{i,2} + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math]. Suppose estimates of the regression parameters [math](\beta_1, \beta_2)[/math] of this model are obtained through the minimization of the sum-of-squares augmented with a ridge-type penalty:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta_1 X_{i,1} - \beta_2 X_{i,2})^2 + \lambda (\beta_1^2 + 2 \nu \beta_1 \beta_2 + \beta_2^2), \end{eqnarray*} [[/math]]
with penalty parameters [math]\lambda \in \mathbb{R}_{\gt 0}[/math] and [math]\nu \in (-1, 1)[/math].
- Recall the equivalence between constrained and penalized estimation (cf. Section Constrained estimation). Sketch (for both [math]\nu=0[/math] and [math]\nu=0.9[/math]) the shape of the parameter constraint induced by the penalty above, and describe in words the qualitative difference between the two shapes.
- When [math]\nu = -1[/math] and [math]\lambda \rightarrow \infty[/math] the estimates of [math]\beta_1[/math] and [math]\beta_2[/math] (resulting from minimization of the penalized loss function above) converge towards each other:
[math]\lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1) = \lim_{\lambda \rightarrow \infty} \hat{\beta}_2(\lambda, -1)[/math]. Motivated by this observation, a data scientist incorporates the equality constraint [math]\beta_1 = \beta = \beta_2[/math] explicitly into the model, and s/he estimates the ‘joint regression parameter’ [math]\beta[/math] through the minimization (with respect to [math]\beta[/math]) of:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta X_{i,1} - \beta X_{i,2})^2 + \delta \beta^2, \end{eqnarray*} [[/math]]with penalty parameter [math]\delta \in \mathbb{R}_{\gt 0}[/math]. The data scientist is surprised to find that the resulting estimate [math]\hat{\beta}(\delta)[/math] does not have the same limiting (in the penalty parameter) behavior as [math]\hat{\beta}_1(\lambda, -1)[/math], i.e. [math]\lim_{\delta \rightarrow \infty} \hat{\beta} (\delta) \not= \lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1)[/math]. Explain the misconception of the data scientist.
- Assume that i) [math]n \gg 2[/math], ii) the unpenalized estimates [math](\hat{\beta}_1(0, 0), \hat{\beta}_2(0, 0))^{\top}[/math] equal [math](-2,2)[/math], and iii) that the two covariates [math]X_1[/math] and [math]X_2[/math] are zero-centered, have equal variance, and are strongly negatively correlated. Consider [math](\hat{\beta}_1(\lambda, \nu), \hat{\beta}_2(\lambda, \nu))^{\top}[/math] for both [math]\nu=-0.9[/math] and [math]\nu=0.9[/math]. For which value of [math]\nu[/math] do you expect the sum of the absolute value of the estimates to be largest? Hint: Distinguish between small and large values of [math]\lambda[/math] and think geometrically!
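The limiting behaviour in part b) can be illustrated numerically. The sketch below assumes the penalty takes the form [math]\lambda(\beta_1^2 + 2\nu\beta_1\beta_2 + \beta_2^2)[/math] (an assumption consistent with [math]\nu=-1[/math] yielding [math]\lambda(\beta_1-\beta_2)^2[/math]); the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def beta_hat(lam, nu):
    # minimizer of the penalized sum-of-squares with (assumed) penalty
    # lam * (b1^2 + 2 nu b1 b2 + b2^2) = lam * b' P b
    P = np.array([[1.0, nu], [nu, 1.0]])
    return np.linalg.solve(X.T @ X + lam * P, X.T @ Y)

# for nu = -1 and very large lam the two estimates converge to each other,
# namely to the least squares fit under the equality constraint b1 = b2
b = beta_hat(1e8, -1.0)
z = X[:, 0] + X[:, 1]
b_constrained = (z @ Y) / (z @ z)
print(b, b_constrained)
```

Note that the large-[math]\lambda[/math] limit is not zero here, unlike for the [math]\delta \beta^2[/math] penalty of the data scientist.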
Show that the generalized ridge regression estimator, [math]\hat{\bbeta}(\mathbf{\Delta}) = (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math], too (as in Question) can be obtained by ordinary least squares regression on an augmented data set. Hereto consider the Cholesky decomposition of the penalty matrix: [math]\mathbf{\Delta} = \mathbf{L}^{\top} \mathbf{L}[/math]. Now augment the matrix [math]\mathbf{X}[/math] with [math]p[/math] additional rows comprising the matrix [math]\mathbf{L}[/math], and augment the response vector [math]\mathbf{Y}[/math] with [math]p[/math] zeros.
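The data augmentation trick can be verified numerically. A minimal sketch, with simulated data and an arbitrarily chosen positive definite penalty matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# an (arbitrary) positive definite penalty matrix Delta
A = rng.normal(size=(p, p))
Delta = A @ A.T + p * np.eye(p)

# generalized ridge estimator: (X'X + Delta)^{-1} X'Y
beta_gr = np.linalg.solve(X.T @ X + Delta, X.T @ Y)

# OLS on the augmented data: stack L under X, zeros under Y, with Delta = L'L
C = np.linalg.cholesky(Delta)   # numpy yields lower-triangular C with Delta = C C'
L = C.T                         # hence L = C' satisfies Delta = L'L
X_aug = np.vstack([X, L])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_ols, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(beta_gr, beta_ols))  # True
```

The equality follows since the augmented normal equations read [math](\mathbf{X}^{\top}\mathbf{X} + \mathbf{L}^{\top}\mathbf{L})\bbeta = \mathbf{X}^{\top}\mathbf{Y}[/math].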
Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. Assume [math]\bbeta \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}^{-1})[/math] with [math]\bbeta_0 \in \mathbb{R}^p[/math] and [math]\mathbf{\Delta} \succ 0[/math], and an inverse gamma prior on the error variance. Verify (i.e., work out the details of the derivation) that the posterior mean coincides with the generalized ridge estimator defined as:
[[math]] \begin{eqnarray*} \hat{\bbeta}(\mathbf{\Delta}, \bbeta_0) & = & (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1} (\mathbf{X}^{\top} \mathbf{Y} + \mathbf{\Delta} \bbeta_0). \end{eqnarray*} [[/math]]
Consider the Bayesian linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math], a multivariate normal law as conditional prior distribution on the regression parameter: [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}^{-1})[/math], and an inverse gamma prior on the error variance [math]\sigma^2 \sim \mathcal{IG}(\gamma, \delta)[/math]. The consequences of various choices for the hyperparameters of the prior distribution on [math]\bbeta[/math] are studied.
- Consider the following conditional prior distributions on the regression parameters [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}_a^{-1})[/math] and [math]\bbeta \, | \, \sigma^2 \sim \mathcal{N}(\bbeta_0, \sigma^2 \mathbf{\Delta}_b^{-1})[/math] with precision matrices [math]\mathbf{\Delta}_a, \mathbf{\Delta}_b \in \mathcal{S}_{++}^p[/math] such that [math]\mathbf{\Delta}_a \succeq \mathbf{\Delta}_b[/math], i.e. [math]\mathbf{\Delta}_a = \mathbf{\Delta}_b + \mathbf{D}[/math] for some positive semi-definite symmetric matrix of appropriate dimensions. Verify:
[[math]] \begin{eqnarray*} \mbox{Var}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0, \mathbf{\Delta}_a) & \preceq & \mbox{Var}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0, \mathbf{\Delta}_b), \end{eqnarray*} [[/math]]i.e. the smaller (in the positive definite ordering) the variance of the prior, the smaller that of the posterior.
- In the remainder of this exercise assume [math]\mathbf{\Delta}_a = \mathbf{\Delta} = \mathbf{\Delta}_b[/math]. Let [math]\bbeta_t[/math] be the ‘true’ or ‘ideal’ value of the regression parameter, i.e. the one used in the generation of the data, and show that a better initial guess yields a larger posterior density at [math]\bbeta_t[/math]. That is, take two prior mean parameters [math]\bbeta_0 = \bbeta_0^{\mbox{{\tiny (a)}}}[/math] and [math]\bbeta_0 = \bbeta_0^{\mbox{{\tiny (b)}}}[/math] such that the former is closer to [math]\bbeta_t[/math] than the latter. Here closeness is defined in terms of the Mahalanobis distance, which for, e.g., [math]\bbeta_t[/math] and [math]\bbeta_0^{\mbox{{\tiny (a)}}}[/math] is defined as [math]d_M(\bbeta_t, \bbeta_0^{\mbox{{\tiny (a)}}}; \mathbf{\Sigma}) = [(\bbeta_t - \bbeta_0^{\mbox{{\tiny (a)}}})^{\top} \mathbf{\Sigma}^{-1} (\bbeta_t - \bbeta_0^{\mbox{{\tiny (a)}}})]^{1/2}[/math] with positive definite covariance matrix [math]\mathbf{\Sigma} = \sigma^2 \mathbf{\Delta}^{-1}[/math]. Show that the posterior density [math]\pi_{\bbeta \, | \, \sigma^2} (\bbeta \, | \, \sigma^2, \mathbf{X}, \mathbf{Y}, \bbeta_0^{\mbox{{\tiny (a)}}}, \mathbf{\Delta})[/math] is larger at [math]\bbeta = \bbeta_t[/math] than that with the other prior mean parameter.
- Adopt the assumptions of part b) and show that a better initial guess yields a better posterior mean. That is, show
[[math]] \begin{eqnarray*} d_M[\bbeta_t, \mathbb{E}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0^{\mbox{{\tiny (a)}}}, \mathbf{\Delta}); \mathbf{\Sigma}] & \leq & d_M[\bbeta_t, \mathbb{E}(\bbeta \, | \, \sigma^2, \mathbf{Y}, \mathbf{X}, \bbeta_0^{\mbox{{\tiny (b)}}}, \mathbf{\Delta}); \mathbf{\Sigma}], \end{eqnarray*} [[/math]]now with [math]\mathbf{\Sigma} = \sigma^2 (\mathbf{X}^{\top} \mathbf{X} + \mathbf{\Delta})^{-1}[/math].
The ridge penalty may be interpreted as a multivariate normal prior on the regression coefficients: [math]\bbeta \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I}_{pp})[/math]. Different priors may be considered. In case the covariates are spatially related in some sense (e.g. genomically), it may be of interest to assume a first-order autoregressive prior: [math]\bbeta \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{\Sigma}_a)[/math], in which [math]\mathbf{\Sigma}_a[/math] is a [math](p \times p)[/math]-dimensional correlation matrix with [math](\mathbf{\Sigma}_a)_{j_1, j_2} = \rho^{ | j_1 - j_2 | } [/math] for some correlation coefficient [math]\rho \in [0, 1)[/math].
- The penalized loss function associated with this AR(1) prior is:
[[math]] \begin{eqnarray*} \mathcal{L}(\bbeta; \lambda, \mathbf{\Sigma}_a) & = & \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda \bbeta^{\top} \mathbf{\Sigma}_a^{-1} \bbeta. \end{eqnarray*} [[/math]]Find the minimizer of this loss function.
- What is the effect of [math]\rho[/math] on the ridge estimates? Contrast this to the effect of [math]\lambda[/math]. Illustrate this on (simulated) data.
- Instead of an AR(1) prior assume a prior with a uniform correlation between the elements of [math]\bbeta[/math]. That is, replace [math]\mathbf{\Sigma}_a[/math] by [math]\mathbf{\Sigma}_u[/math], given by [math]\mathbf{\Sigma}_u = (1-\rho) \mathbf{I}_{pp} + \rho \mathbf{1}_{pp}[/math]. Investigate (again on data) the effect of changing from the AR(1) to the uniform prior on the ridge regression estimates.
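For parts b) and c), the illustration on simulated data can be set up as in the sketch below. It assumes the minimizer of the loss above takes the generalized ridge form [math](\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{\Sigma}^{-1})^{-1}\mathbf{X}^{\top}\mathbf{Y}[/math] (cf. part a); the data and the values of [math]\rho[/math] are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 8, 5.0
X = rng.normal(size=(n, p))
beta_true = np.linspace(1.0, -1.0, p)
Y = X @ beta_true + rng.normal(size=n)

def ridge(X, Y, lam, Sigma):
    # assumed minimizer of ||Y - X b||^2 + lam * b' Sigma^{-1} b
    return np.linalg.solve(X.T @ X + lam * np.linalg.inv(Sigma), X.T @ Y)

def Sigma_ar1(p, rho):
    # AR(1) correlation matrix: (Sigma)_{j1,j2} = rho^{|j1-j2|}
    j = np.arange(p)
    return rho ** np.abs(j[:, None] - j[None, :])

def Sigma_unif(p, rho):
    # uniform correlation matrix: (1-rho) I + rho 11'
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

for rho in (0.0, 0.5, 0.9):
    print("AR(1),  rho =", rho, ridge(X, Y, lam, Sigma_ar1(p, rho)))
    print("unif.,  rho =", rho, ridge(X, Y, lam, Sigma_unif(p, rho)))
```

At [math]\rho = 0[/math] both priors reduce to the standard ridge penalty; increasing [math]\rho[/math] changes the direction of shrinkage, whereas [math]\lambda[/math] controls its overall amount.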
Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math]. Suppose estimates of the regression parameters [math]\bbeta[/math] of this model are obtained through the minimization of the sum-of-squares augmented with a ridge-type penalty:
for known [math]\alpha \in [0,1][/math], nonrandom [math]p[/math]-dimensional target vectors [math]\bbeta_{t,a}[/math] and [math]\bbeta_{t,b}[/math] with [math]\bbeta_{t,a} \not= \bbeta_{t,b}[/math], and penalty parameter [math]\lambda \gt 0[/math]. Here [math]\mathbf{Y} = (Y_1, \ldots, Y_n)^{\top}[/math] and [math]\mathbf{X}[/math] is the [math]n \times p[/math] matrix with the [math]n[/math] row-vectors [math]\mathbf{X}_{i,\ast}[/math] stacked.
- When [math]p \gt n[/math] the sum-of-squares part does not have a unique minimum. Does the above employed penalty warrant a unique minimum for the loss function above (i.e., sum-of-squares plus penalty)? Motivate your answer.
- Could it be that for intermediate values of [math]\alpha[/math], i.e. [math]0 \lt \alpha \lt 1[/math], the loss function assumes smaller values than for the boundary values [math]\alpha=0[/math] and [math]\alpha=1[/math]? Motivate your answer.
- Draw the parameter constraint induced by this penalty for [math]\alpha = 0, 0.5[/math] and [math]1[/math] when [math]p = 2[/math].
- Derive the estimator of [math]\bbeta[/math], defined as the minimum of the loss function, explicitly.
- Discuss the behaviour of the estimator for [math]\alpha = 0, 0.5[/math] and [math]1[/math] as [math]\lambda \rightarrow \infty[/math].
Revisit Exercise. There the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math] is considered. The model comprises a single covariate and an intercept. Response and covariate data are: [math]\{(y_i, x_{i,1})\}_{i=1}^4 = \{ (1.4, 0.0), (1.4, -2.0), (0.8, 0.0), (0.4, 2.0) \}[/math].
- Evaluate the generalized ridge regression estimator of [math]\bbeta[/math] with target [math]\bbeta_0 = \mathbf{0}_2[/math] and penalty matrix [math]\mathbf{\Delta}[/math] given by [math](\mathbf{\Delta})_{11} = \lambda = (\mathbf{\Delta})_{22}[/math] and [math](\mathbf{\Delta})_{12} = \tfrac{1}{2} \lambda = (\mathbf{\Delta})_{21}[/math] in which [math]\lambda = 8[/math].
- A data scientist wishes to leave the intercept unpenalized. Hereto s/he sets [math](\mathbf{\Delta})_{11} = 0[/math] in part a). Why does the resulting estimate not coincide with the answer to Exercise? Motivate your answer.
Consider the linear regression model: [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_{n}, \sigma^2 \mathbf{I}_{nn})[/math]. Let [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] be the ridge regression estimator with penalty parameter [math]\lambda[/math]. The shrinkage of the ridge regression estimator propagates to the scale of the ‘ridge prediction’ [math]\mathbf{X} \hat{\bbeta}(\lambda)[/math]. To correct (a bit) for the shrinkage, [1] propose the alternative ridge regression estimator: [math]\hat{\bbeta}(\alpha) = [ (1-\alpha) \mathbf{X}^{\top} \mathbf{X} + \alpha \mathbf{I}_{pp}]^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] with shrinkage parameter [math]\alpha \in [0,1][/math].
- Let [math]\alpha = \lambda ( 1+ \lambda)^{-1}[/math]. Show that [math]\hat{\bbeta}(\alpha) = (1+\lambda) \hat{\bbeta}(\lambda)[/math] with [math]\hat{\bbeta}(\lambda)[/math] as in the introduction above.
- Use part a) and the parametrization of [math]\alpha[/math] provided there to show that some of the shrinkage has been undone. That is, show: [math]\mbox{Var}[ \mathbf{X} \hat{\bbeta}(\lambda)] \lt \mbox{Var}[ \mathbf{X} \hat{\bbeta}(\alpha)][/math] for any [math]\lambda \gt 0[/math].
- Use the singular value decomposition of [math]\mathbf{X}[/math] to show that [math]\lim_{\alpha \downarrow 0} \hat{\bbeta}(\alpha) = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] (should it exist) and [math]\lim_{\alpha \uparrow 1} \hat{\bbeta}(\alpha) = \mathbf{X}^{\top} \mathbf{Y}[/math].
- Derive the expectation, variance and mean squared error of [math]\hat{\bbeta}(\alpha)[/math].
- Temporarily assume that [math]p=1[/math] and let [math]\mathbf{X}^{\top} \mathbf{X} = c[/math] for some [math]c \gt 0[/math]. Then, [math]\mbox{MSE}[\hat{\bbeta}(\alpha)] = [\alpha^2 (c -1)^2 \beta^2 + \sigma^2 c] [ (1-\alpha) c + \alpha ]^{-2}[/math]. Does there exist an [math]\alpha \in (0,1)[/math] such that the mean squared error of [math]\hat{\bbeta}(\alpha)[/math] is smaller than that of its maximum likelihood counterpart? Motivate. Hint: distinguish between different values of [math]c[/math].
- Now assume [math]p \gt 1[/math] and an orthonormal design matrix. Specify the regularization path of the alternative ridge regression estimator [math]\hat{\bbeta}(\alpha)[/math].
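The identity of part a) and the limits of part c) lend themselves to a numerical check; a minimal sketch with simulated data (dimensions and [math]\lambda[/math] chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

def beta_lam(lam):
    # ordinary ridge estimator: (X'X + lam I)^{-1} X'Y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def beta_alpha(alpha):
    # alternative ridge estimator: [(1-alpha) X'X + alpha I]^{-1} X'Y
    return np.linalg.solve((1 - alpha) * X.T @ X + alpha * np.eye(p), X.T @ Y)

lam = 2.5
alpha = lam / (1 + lam)
# part a): beta(alpha) = (1 + lam) * beta(lam)
print(np.allclose(beta_alpha(alpha), (1 + lam) * beta_lam(lam)))  # True
# part c): alpha -> 0 gives OLS, alpha -> 1 gives X'Y
print(beta_alpha(0.0), beta_alpha(1.0))
```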
Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. Goldstein & Smith (1974) proposed a novel generalized ridge estimator of its [math]p[/math]-dimensional regression parameter:
[[math]] \begin{eqnarray*} \hat{\bbeta}_m(\lambda) & = & [(\mathbf{X}^{\top} \mathbf{X})^{m} + \lambda \mathbf{I}_{pp}]^{-1} (\mathbf{X}^{\top} \mathbf{X})^{m-1} \mathbf{X}^{\top} \mathbf{Y}, \end{eqnarray*} [[/math]]
with penalty parameter [math]\lambda \gt 0[/math] and ‘shape’ or ‘rate’ parameter [math]m[/math].
- Assume, only for part a), that [math]n=p[/math] and the design matrix is orthonormal. Show that, irrespective of the choice of [math]m[/math], this generalized ridge regression estimator coincides with the ‘regular’ ridge regression estimator.
- Consider the generalized ridge loss function [math]\| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \bbeta^{\top} \mathbf{A} \bbeta[/math] with [math]\mathbf{A}[/math] a [math]p \times p[/math]-dimensional symmetric matrix. For what [math]\mathbf{A}[/math] does [math]\hat{\bbeta}_m(\lambda)[/math] minimize this loss function?
- Let [math]d_j[/math] be the [math]j[/math]-th singular value of [math]\mathbf{X}[/math]. Show that in [math]\hat{\bbeta}_m(\lambda)[/math] the singular values are shrunken as [math](d_j^{2m} + \lambda)^{-1} d_j^{2m-1}[/math]. Hint: use the singular value decomposition of [math]\mathbf{X}[/math].
- For positive singular values, do larger values of [math]m[/math] lead to more shrinkage? Hint: involve particulars of the singular value in your answer.
- Express [math]\mathbb{E}[\hat{\bbeta}_m(\lambda)][/math] in terms of the design matrix, model and shrinkage parameters ([math]\lambda[/math] and [math]m[/math]).
- Express [math]\mbox{Var}[\hat{\bbeta}_m(\lambda)][/math] in terms of the design matrix, model and shrinkage parameters ([math]\lambda[/math] and [math]m[/math]).
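The shrinkage factors of part c) can be checked numerically. The sketch below assumes the estimator has the form [math]\hat{\bbeta}_m(\lambda) = [(\mathbf{X}^{\top}\mathbf{X})^{m} + \lambda \mathbf{I}_{pp}]^{-1} (\mathbf{X}^{\top}\mathbf{X})^{m-1} \mathbf{X}^{\top}\mathbf{Y}[/math], which is consistent with those factors; data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam, m = 15, 4, 3.0, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# assumed form of the Goldstein & Smith (1974) estimator:
# beta_m(lam) = [(X'X)^m + lam I]^{-1} (X'X)^{m-1} X'Y
XtX = X.T @ X
beta_m = np.linalg.solve(np.linalg.matrix_power(XtX, m) + lam * np.eye(p),
                         np.linalg.matrix_power(XtX, m - 1) @ X.T @ Y)

# the same estimator via the SVD X = U D V':
# singular values enter as d^{2m-1} / (d^{2m} + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (d ** (2 * m - 1) / (d ** (2 * m) + lam) * (U.T @ Y))

print(np.allclose(beta_m, beta_svd))  # True
```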