By Admin, Jun 24 '23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider fitting the linear regression model, [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math], to data by means of the ridge regression estimator. This estimator involves a penalty parameter, which is usually required to be positive. It has been suggested, by among others [1], to extend the range of the penalty parameter to the whole set of real numbers, that is, to also tolerate negative values. Here the consequences of allowing negative values of the penalty parameter are investigated. Throughout, use the following numerical values for the design matrix, the response, and the corresponding summary statistics:

[[math]] \begin{eqnarray*} \mathbf{X} \, \, \, = \, \, \, \left( \begin{array}{rr} 5 & 4 \\ -3 & -4 \end{array} \right), \quad \mathbf{Y} \, \, \, = \, \, \, \left( \begin{array}{r} 3 \\ -4 \end{array} \right), \quad \mathbf{X}^{\top} \mathbf{X} \, \, \, = \, \, \, \left( \begin{array}{rr} 34 & 32 \\ 32 & 32 \end{array} \right), \quad \mbox{and} \quad \mathbf{X}^{\top} \mathbf{Y} \, \, \, = \, \, \, \left( \begin{array}{r} 27 \\ 28 \end{array} \right). \end{eqnarray*} [[/math]]

  • For which [math]\lambda \lt 0[/math] is the ridge regression estimator [math]\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{22})^{-1} \mathbf{X}^{\top} \mathbf{Y}[/math] well-defined?
  • Now consider the ridge regression estimator to be defined via the ridge loss function, i.e.
    [[math]] \begin{eqnarray*} \hat{\bbeta} ( \lambda) & = & \arg \min\nolimits_{\bbeta \in \mathbb{R}^2} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda \| \bbeta \|_2^2. \end{eqnarray*} [[/math]]
    Let [math]\lambda = -20[/math]. Plot the level sets of this loss function, and add a point with the corresponding ridge regression estimate [math]\hat{\bbeta}(-20)[/math].
  • Verify that the ridge regression estimate [math]\hat{\bbeta}(-20)[/math] is a saddle point of the ridge loss function, as can also be seen from the contour plot generated in part b). Hereto study the eigenvalues of its Hessian matrix. Moreover, specify the range of negative penalty parameters for which the ridge loss function is convex (and thus has a unique, well-defined minimum). (An R sketch of parts a)-c) is given after this list.)
  • Find the minimum of the ridge loss function.
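
Below is a minimal R sketch of parts a)-c) for the given design matrix and response; the plotting range of the grid is an arbitrary choice made here.

# design matrix and response of the exercise
X      <- matrix(c(5, -3, 4, -4), nrow = 2)
Y      <- c(3, -4)
lambda <- -20

# part a): the estimator is well-defined whenever X^T X + lambda I is invertible,
# i.e. whenever -lambda is not an eigenvalue of X^T X
eigen(t(X) %*% X)$values

# ridge estimate at lambda = -20
bHat <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% Y)

# part b): level sets of the ridge loss over a grid of regression parameters
ridgeLoss <- function(b1, b2){
  b <- c(b1, b2)
  sum((Y - X %*% b)^2) + lambda * sum(b^2)
}
bGrid <- seq(-10, 10, length.out = 101)
loss  <- outer(bGrid, bGrid, Vectorize(ridgeLoss))
contour(bGrid, bGrid, loss, nlevels = 50,
        xlab = expression(beta[1]), ylab = expression(beta[2]))
points(bHat[1], bHat[2], pch = 19, col = "red")

# part c): Hessian of the ridge loss and its eigenvalues
hessian <- 2 * (t(X) %*% X + lambda * diag(2))
eigen(hessian)$values    # one positive, one negative: a saddle point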

References

  1. Hua, T. A. and Gunst, R. F. (1983). Generalized ridge regression: a note on negative ridge parameters. Communications in Statistics-Theory and Methods, 12(1), 37--45
By Admin, Jun 24 '23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math]. Consider the following two ridge regression estimators of this model's regression parameter:

[[math]] \begin{eqnarray*} \arg \min_\bbeta \sum\nolimits_{i=1}^n (Y_{i} - \mathbf{X}_{i,\ast} \bbeta)^2 + \lambda \| \bbeta \|_ 2^2 \quad \mbox{ and } \quad \arg \min_\bbeta \sum\nolimits_{i=1}^n (Y_{i} - \mathbf{X}_{i,\ast} \bbeta)^2 + n \lambda \| \bbeta \|_ 2^2. \end{eqnarray*} [[/math]]

Which of the two do you prefer? Motivate your answer.
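
The following R snippet, on simulated toy data (not part of the exercise), may help in forming an opinion: when every observation is duplicated, the estimator with the [math]n \lambda[/math]-scaled penalty is unaffected, whereas the one with the unscaled penalty shrinks less, because the residual sum of squares has doubled while the penalty has not.

# simulated toy data (not part of the exercise)
set.seed(1)
n <- 10; p <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% c(1, -1, 0.5) + rnorm(n)

ridge <- function(X, Y, lam) drop(solve(t(X) %*% X + lam * diag(ncol(X)), t(X) %*% Y))

# duplicate every observation
Xd <- rbind(X, X); Yd <- c(Y, Y)

lambda <- 2
cbind(ridge(X, Y, lambda), ridge(Xd, Yd, lambda))                  # penalty lambda: estimates differ
cbind(ridge(X, Y, n * lambda), ridge(Xd, Yd, nrow(Xd) * lambda))   # penalty n * lambda: estimates coincide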

By Admin, Jun 24 '23

Verify the expression of the [math]k[/math]-fold linear predictor.

By Admin, Jun 24 '23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

The linear regression model, [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math], is fitted by means of the ridge regression estimator. The design matrix and response are:

[[math]] \begin{eqnarray*} \mathbf{X} = \left( \begin{array}{rr} 2 & -1 \\ 0 & 1 \end{array} \right) \quad \mbox{ and } \quad \mathbf{Y} = \left( \begin{array}{r} 1 \\ \tfrac{1}{2} \end{array} \right). \end{eqnarray*} [[/math]]

The penalty parameter is chosen as the minimizer of the leave-one-out cross-validated squared prediction error (i.e. Allen's PRESS statistic). Show that this minimizer is [math]\lambda = \infty[/math].
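
The requested result calls for an analytical argument, but the following R snippet provides a numerical sanity check: Allen's PRESS statistic, evaluated for the given design matrix and response, decreases monotonically in [math]\lambda[/math] towards its infimum.

# design matrix and response of the exercise
X <- matrix(c(2, 0, -1, 1), nrow = 2)
Y <- c(1, 1/2)

# leave-one-out cross-validated squared prediction error (Allen's PRESS statistic)
press <- function(lambda){
  errors <- sapply(1:2, function(i){
    bLOO <- solve(t(X[-i, , drop = FALSE]) %*% X[-i, , drop = FALSE] + lambda * diag(2),
                  t(X[-i, , drop = FALSE]) %*% Y[-i])
    Y[i] - X[i, , drop = FALSE] %*% bLOO
  })
  mean(errors^2)
}

# PRESS keeps decreasing as lambda grows: its minimizer lies at lambda = infinity
sapply(c(1, 10, 100, 1000, 10^6), press)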

By Admin, Jun 24 '23

PSA (Prostate-Specific Antigen) is a prognostic indicator of prostate cancer: low and high PSA values indicate low and high risk, respectively. PSA interacts with the VEGF pathway, which in cancer aids angiogenesis, i.e. the formation of blood vessels in solid tumors. Assume this interaction can, at least partially, be captured by a linear relationship between PSA and the constituents of the VEGF pathway. Use the prostate cancer data of [1] to estimate this linear relationship by means of the ridge regression estimator. The following R-script downloads and prepares the data.

# load the necessary libraries
library(Biobase)
library(prostateCancerStockholm)
library(penalized)
library(KEGG.db)

# load data
data(stockholm)

# prepare psa data
psa <- pData(stockholm)[,9]
psa <- log(as.numeric(levels(psa)[psa]))

# prepare VEGF pathway data
X <- exprs(stockholm)
kegg2vegf    <- as.list(KEGGPATHID2EXTID)
entrezIDvegf <- as.numeric(unlist(kegg2vegf[names(kegg2vegf) == "hsa04370"]))
entrezIDx    <- as.numeric(fData(stockholm)[,10])
idX2VEGF     <- match(entrezIDvegf, entrezIDx)
entrezIDvegf <- entrezIDvegf[-which(is.na(idX2VEGF))]
idX2VEGF     <- match(entrezIDvegf, entrezIDx)
X            <- t(X[idX2VEGF,])
gSymbols     <- fData(stockholm)[idX2VEGF,13]
gSymbols     <- levels(gSymbols)[gSymbols]
colnames(X)  <- gSymbols

# remove samples with a missing outcome
X   <- X[-which(is.na(psa)), ]
psa <- psa[-which(is.na(psa))]

# center and scale the covariates, and center the response
X   <- sweep(X, 2, apply(X, 2, mean))
X   <- sweep(X, 2, apply(X, 2, sd), "/")
psa <- psa - mean(psa)

  • Find the ridge penalty parameter by means of AIC minimization. Hint: the likelihood can be obtained from the penfit object that is created by the penalized function of the R-package penalized.
  • Find the ridge penalty parameter by means of leave-one-out cross-validation, as implemented by the optL2-function provided by the R-package penalized.
  • Find the ridge penalty parameter by means of leave-one-out cross-validation using Allen's PRESS statistic as performance measure (see Section Cross-validation).
  • Discuss the reasons for the different values of the ridge penalty parameter obtained in parts a), b), and c). Also investigate the consequences of these values for the corresponding regression estimates. (An R sketch of parts a)-c) is given after this list.)
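
A possible sketch of parts a)-c) follows; it assumes the data-preparation script above has been run (so the objects X and psa exist). The lambda grid, the search interval passed to optL2, and the degrees-of-freedom convention in the AIC (the trace of the ridge hat matrix) are choices made here, not prescribed by the exercise.

# part a): AIC(lambda) = 2 * df(lambda) - 2 * loglik(lambda), with the log-likelihood
# extracted from the penfit object and df(lambda) the trace of the ridge hat matrix
library(penalized)
lambdas <- 10^seq(-2, 6, length.out = 100)
aic     <- sapply(lambdas, function(lam){
  fit <- penalized(psa, penalized = X, lambda2 = lam, trace = FALSE)
  H   <- X %*% solve(crossprod(X) + lam * diag(ncol(X)), t(X))
  2 * sum(diag(H)) - 2 * loglik(fit)
})
lambdaAIC <- lambdas[which.min(aic)]

# part b): leave-one-out cross-validation (fold = n) via optL2
lambdaCV <- optL2(psa, penalized = X, minlambda2 = 10^(-2), maxlambda2 = 10^6,
                  fold = length(psa))$lambda

# part c): leave-one-out cross-validation with Allen's PRESS statistic, evaluated
# through the ridge hat matrix (cf. the last exercise on this page)
pressStat <- sapply(lambdas, function(lam){
  H <- X %*% solve(crossprod(X) + lam * diag(ncol(X)), t(X))
  mean((((diag(nrow(X)) - H) %*% psa) / (1 - diag(H)))^2)
})
lambdaPRESS <- lambdas[which.min(pressStat)]

c(AIC = lambdaAIC, LOOCV = lambdaCV, PRESS = lambdaPRESS)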
References

  1. Ross-Adams, H., Lamb, A., Dunning, M., Halim, S., Lindberg, J., Massie, C., Egevad, L., Russell, R., Ramos-Montoya, A., Vowler, S., et al. (2015). Integration of copy number and transcriptomics provides risk stratification in prostate cancer: a discovery and validation cohort study. EBioMedicine, 2(9), 1133--1144
By Admin, Jun 24 '23

The variance of the covariates influences the amount of shrinkage that ridge regularization induces on the regression estimator. Some deal with this by rescaling the covariates to a common unit variance, as discussed at the end of Section Role of the variance of the covariates. Investigate this numerically using the data of the microRNA-mRNA illustration discussed in Section Illustration.

  • Load the data by running the first R-script of Section Illustration.
  • Fit the linear regression model by means of the ridge regression estimator with [math]\lambda=1[/math], using both the scaled and the unscaled covariates. Compare the ordering of the coefficients between the two estimates, as well as their corresponding linear predictors. Do the 50 largest (in absolute value) coefficients differ much between the two estimates? (An R sketch of this part is given after this list.)
  • Repeat part b), now with the penalty parameter chosen by means of LOOCV.
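
A sketch of part b) follows; it assumes the script of Section Illustration has produced a covariate matrix X and a response Y (these object names, as well as the comparison measures used, are assumptions of this sketch, not part of the exercise).

# ridge regression estimator in closed form
ridge <- function(X, Y, lam) drop(solve(crossprod(X) + lam * diag(ncol(X)), crossprod(X, Y)))

lambda    <- 1
Xscaled   <- scale(X)                          # zero mean and unit variance per covariate
bUnscaled <- ridge(X,       Y, lambda)
bScaled   <- ridge(Xscaled, Y, lambda)

# compare the ordering of the (absolute) coefficients and the overlap of the top-50 sets
cor(abs(bUnscaled), abs(bScaled), method = "spearman")
length(intersect(order(abs(bUnscaled), decreasing = TRUE)[1:50],
                 order(abs(bScaled),   decreasing = TRUE)[1:50]))

# compare the corresponding linear predictors
plot(X %*% bUnscaled, Xscaled %*% bScaled,
     xlab = "linear predictor (unscaled covariates)",
     ylab = "linear predictor (scaled covariates)")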
By Admin, Jun 24 '23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the ridge regression estimator [math]\hat{\bbeta}(\lambda)[/math] of the linear regression model parameter [math]\bbeta[/math]. Its penalty parameter [math]\lambda[/math] may be chosen as the minimizer of Allen's PRESS statistic, i.e.: [math]\lambda_{\mbox{{\tiny opt}}} = \arg \min_{\lambda \gt 0} n^{-1} \sum\nolimits_{i=1}^n [Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}_{-i}(\lambda)]^2[/math], with the LOOCV ridge regression estimator [math]\hat{\bbeta}_{-i}(\lambda) = (\mathbf{X}_{- i, \ast}^{\top} \mathbf{X}_{- i, \ast} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}_{- i, \ast}^{\top} \mathbf{Y}_{- i}[/math]. This is computationally demanding, as it involves [math]n[/math] evaluations of [math]\hat{\bbeta}_{-i}(\lambda)[/math]. These refits can be circumvented by rewriting Allen's PRESS statistic. Hereto:

  • Use the Woodbury matrix identity to verify:
    [[math]] \begin{eqnarray*} (\mathbf{X}_{- i, \ast}^{\top} \mathbf{X}_{- i, \ast} + \lambda \mathbf{I}_{pp})^{-1} & = & (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \\ & & + (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}_{i, \ast}^{\top} [ 1 - \mathbf{H}_{ii}(\lambda)]^{-1} \mathbf{X}_{i, \ast} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1}, \end{eqnarray*} [[/math]]
    in which [math]\mathbf{H}_{ii}(\lambda) = \mathbf{X}_{i, \ast} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}_{i, \ast}^{\top}[/math].
  • Rewrite the LOOCV ridge regression estimator to:
    [[math]] \begin{eqnarray*} \hat{\bbeta}_{- i}(\lambda) & = & \hat{\bbeta}(\lambda) - (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}_{i, \ast}^{\top} [ 1 - \mathbf{H}_{ii}(\lambda)]^{-1} [ Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}(\lambda) ]. \end{eqnarray*} [[/math]]
    In this, use part a) and the identity [math]\mathbf{X}_{-i, \ast}^{\top} \mathbf{Y}_{-i} = \mathbf{X}^{\top} \mathbf{Y} - \mathbf{X}_{i, \ast}^{\top} Y_i[/math].
  • Reformulate, using part b), the prediction error as [math]Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}_{-i}(\lambda) = [ 1 - \mathbf{H}_{ii}(\lambda)]^{-1} [ Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}(\lambda) ][/math] and express Allen's PRESS statistic as:
    [[math]] \begin{eqnarray*} n^{-1} \sum\nolimits_{i=1}^n [Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}_{-i}(\lambda)]^2 & = & n^{-1} \| \mathbf{B}(\lambda) [\mathbf{I}_{nn} - \mathbf{H}(\lambda)] \mathbf{Y} \|_ F^2, \end{eqnarray*} [[/math]]
    where [math]\mathbf{B}(\lambda)[/math] is diagonal with [math][\mathbf{B}(\lambda)]_{ii} = [ 1 - \mathbf{H}_{ii}(\lambda)]^{-1}[/math]. (A numerical verification of parts b) and c) is sketched after this list.)
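
The following R snippet verifies parts b) and c) numerically on simulated data: the brute-force leave-one-out prediction errors coincide with the rewritten form [math][ 1 - \mathbf{H}_{ii}(\lambda)]^{-1} [ Y_i - \mathbf{X}_{i, \ast} \hat{\bbeta}(\lambda) ][/math], and hence both evaluations of Allen's PRESS statistic agree.

# simulated data
set.seed(1)
n <- 20; p <- 5; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% rnorm(p) + rnorm(n))

# full-data ridge estimator and hat matrix
A    <- crossprod(X) + lambda * diag(p)
bHat <- drop(solve(A, crossprod(X, Y)))
H    <- X %*% solve(A, t(X))

# brute force: n refits, each leaving out one observation
errLOO <- sapply(1:n, function(i){
  bLOO <- solve(crossprod(X[-i, ]) + lambda * diag(p), crossprod(X[-i, ], Y[-i]))
  Y[i] - X[i, , drop = FALSE] %*% bLOO
})

# shortcut of part c): no refitting required
errShort <- (Y - X %*% bHat) / (1 - diag(H))

max(abs(errLOO - errShort))             # numerically zero
c(mean(errLOO^2), mean(errShort^2))     # identical values of Allen's PRESS statistic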