By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Augment the lasso penalty with the sum of the absolute differences of all pairs of successive regression coefficients:

[[math]] \begin{eqnarray*} \lambda_1 \sum\nolimits_{j=1}^p | \beta_j | + \lambda_{1,f} \sum\nolimits_{j=2}^p | \beta_j - \beta_{j-1} |. \end{eqnarray*} [[/math]]

This augmented lasso penalty is referred to as the fused lasso penalty.

  • Consider the standard multiple linear regression model: [math]Y_i = \sum_{j=1}^p X_{ij} \, \beta_j + \varepsilon_i[/math]. Estimation of the regression parameters takes place via minimization of the penalized sum of squares, in which the fused lasso penalty is used with [math]\lambda_1 =0[/math]. Rewrite the corresponding loss function as a standard lasso problem by applying the following change of variables: [math]\gamma_1 = \beta_1[/math] and [math]\gamma_{j} = \beta_j - \beta_{j-1}[/math].
  • Investigate on simulated data the effect of the second summand of the fused lasso penalty on the parameter estimates. For this part, temporarily set [math]\lambda_1 = 0[/math]; a simulation sketch in R is given after this list.
  • Let [math]\lambda_1[/math] still equal zero. Compare the regression estimates of part b) to the ridge estimates with a first-order autoregressive prior. What, qualitatively, is the difference in the behavior of the two estimates? Hint: plot the full solution path of the penalized estimates for both estimation procedures.
  • How do the estimates of part b) of this question change if we allow [math]\lambda_1 \gt0[/math]?
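A minimal simulation sketch for parts a) and b), assuming the glmnet package (from CRAN) is available. The change of variables of part a) is implemented via a lower-triangular matrix of ones, so that the fusion penalty on [math]\bbeta[/math] becomes an ordinary lasso penalty on [math]\ggamma[/math]. Note that glmnet scales its penalty differently from the text, so its lambda values are only qualitatively comparable to [math]\lambda_{1,f}[/math]; all names and simulation settings below are illustrative.

# minimal sketch for parts a) and b); assumes the glmnet package
library(glmnet)

# simulate data with a piecewise-constant coefficient vector
set.seed(1)
n <- 100; p <- 50
beta <- rep(c(0, 2, 0, -1, 0), each = p / 5)
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% beta + rnorm(n)

# change of variables: beta = L gamma with L the lower-triangular matrix of ones,
# so the fusion penalty on beta is an ordinary lasso penalty on gamma[2:p]
L <- lower.tri(matrix(1, p, p), diag = TRUE) * 1
XL <- X %*% L

# lasso on gamma, leaving gamma_1 unpenalized (this corresponds to lambda_1 = 0)
fit <- glmnet(XL, Y, intercept = FALSE, standardize = FALSE,
              penalty.factor = c(0, rep(1, p - 1)))

# transform back to beta for one value of the penalty parameter and compare
gamma.hat <- as.matrix(coef(fit, s = fit$lambda[10]))[-1, 1]
beta.hat  <- drop(L %*% gamma.hat)
plot(beta.hat, type = "s", xlab = "j", ylab = "estimate")
lines(beta, type = "s", lty = 2)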
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with [math]\varepsilon_i \sim_{i.i.d.} \mathcal{N}(0, \sigma^2)[/math]. The rows of the design matrix [math]\mathbf{X}[/math] are of length two; neither column represents the intercept. Relevant summary statistics of the data on the response [math]\mathbf{Y}[/math] and the covariates are:

[[math]] \begin{eqnarray*} \mathbf{X}^{\top} \mathbf{X} & = & \left( \begin{array}{rr} 40 & -20 \\ -20 & 10 \end{array} \right) \qquad \mbox{ and } \qquad \mathbf{X}^{\top} \mathbf{Y} \, \, \, = \, \, \, \left( \begin{array}{rr} 26 \\ -13 \end{array} \right). \end{eqnarray*} [[/math]]

  • Use lasso regression to regress, without intercept, the response on the first covariate. Draw (i.e. do not merely sketch!) the regularization path of the lasso regression estimator; an R snippet that plots the exact path from the summary statistics follows this list.
  • The two covariates are perfectly collinear. However, their regularization paths do not coincide. Why?
  • Fit the linear regression model with both covariates (and still without intercept) by means of the fused lasso, i.e. [math]\hat{\bbeta} (\lambda_1, \lambda_f) = \arg \min_{\bbeta} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda_1 \| \bbeta \|_1 + \lambda_f | \beta_1 - \beta_2 |[/math] with [math]\lambda_1 = 10[/math] and [math]\lambda_f = \infty[/math]. Hint: at some point in your answer you may wish to write [math]\mathbf{X} = ( \mathbf{X}_{\ast,1} \, \, \, c \mathbf{X}_{\ast,1})[/math] and deduce [math]c[/math] from [math]\mathbf{X}^{\top} \mathbf{X}[/math].
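For part a) the single-covariate lasso estimator has the standard soft-thresholding form (it is the coordinate-descent update appearing later in this exercise set, with [math]\lambda_2 = 0[/math]), so the path can be evaluated directly from the summary statistics. A minimal plotting sketch in base R; the grid of [math]\lambda_1[/math] values is an illustrative choice.

# sketch for part a): the exact regularization path from the summary statistics
xtx <- 40; xty <- 26                       # X_{*,1}^T X_{*,1} and X_{*,1}^T Y
lambda1 <- seq(0, 60, length.out = 200)
beta1 <- sign(xty) * pmax(abs(xty) - lambda1 / 2, 0) / xtx
plot(lambda1, beta1, type = "l", xlab = "lambda1", ylab = "estimate of beta1")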
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_n, \sigma^2 \mathbf{I}_{nn})[/math]. This model (without intercept) is fitted to data using the lasso regression estimator [math]\hat{\bbeta}(\lambda_1) = \arg \min_{\bbeta} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda_1 \| \bbeta \|_1[/math]. The relevant summary statistics of the data are:

[[math]] \begin{eqnarray*} \mathbf{X} = \left( \begin{array}{rr} 1 & -1 \\ -1 & 1 \end{array} \right), \, \mathbf{Y} = \left( \begin{array}{r} -5 \\ 4 \end{array} \right), \, \mathbf{X}^{\top} \mathbf{X} = \left( \begin{array}{rr} 2 & -2 \\ -2 & 2 \end{array} \right), \mbox{ and } \, \mathbf{X}^{\top} \mathbf{Y} = \left( \begin{array}{r} -9 \\ 9 \end{array} \right). \end{eqnarray*} [[/math]]

  • Specify the full set of lasso regression estimates with [math]\lambda_1 = 2[/math] that minimize the lasso loss function for these data.
  • Now consider fitting the linear regression model with the fused lasso estimator [math]\hat{\bbeta}(\lambda_f) = \arg \min_{\bbeta} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda_f | \beta_1 - \beta_2|[/math]. Determine [math]\lambda_{f}^{(0)} \gt 0[/math] such that [math]\hat{\bbeta}(\lambda_f) = (0, 0)^{\top}[/math] for all [math]\lambda_f \gt \lambda_{f}^{(0)}[/math]; a numerical check in R follows this list.
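The following base-R sketch evaluates the fused lasso loss on a grid of coefficient values for a chosen [math]\lambda_f[/math]; it does not give the answer, but it can be used to check a hand calculation. The grid range and resolution are arbitrary choices.

# numerical check: grid search of the fused lasso loss for a chosen lambda_f
X <- matrix(c(1, -1, -1, 1), 2, 2)
Y <- c(-5, 4)
fused.loss <- function(beta, lambda.f) {
  sum((Y - X %*% beta)^2) + lambda.f * abs(beta[1] - beta[2])
}
lambda.f <- 5                                   # plug in your candidate value(s)
grid <- expand.grid(b1 = seq(-5, 5, by = 0.05), b2 = seq(-5, 5, by = 0.05))
loss <- apply(grid, 1, fused.loss, lambda.f = lambda.f)
grid[which.min(loss), ]                         # approximate minimizer; need not be unique on the grid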
By Admin
Jun 25'23

A researcher has measured the expression of 1000 genes in 40 subjects, half of them cases and the other half controls.

  • Describe and explain what would happen if the researcher were to fit an ordinary logistic regression to these data, using case/control status as the response variable; a small simulation in R follows this list.
  • Instead, the researcher chooses to fit a lasso regression, choosing the tuning parameter [math]\lambda[/math] by cross-validation. Out of 1000 genes, 37 get a non-zero regression coefficient in the lasso fit. In the ensuing publication, the researcher writes that the 963 genes with zero regression coefficients were found to be “irrelevant”. What is your opinion about this statement?
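A small simulation sketch in base R illustrating part a). The dimensions match the exercise, but the data are made up and the exact warnings may vary by R version.

# sketch: ordinary logistic regression with p = 1000 predictors and n = 40 subjects
set.seed(1)
n <- 40; p <- 1000
X <- matrix(rnorm(n * p), n, p)            # made-up expression data
status <- rep(0:1, each = n / 2)           # case/control labels
fit <- glm(status ~ X, family = binomial)  # expect warnings about fitted probabilities of 0 or 1
sum(is.na(coef(fit)))                      # most coefficients cannot even be estimated
range(fitted(fit))                         # fitted probabilities are (numerically) 0 and 1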
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the standard linear regression model [math]Y_i = \mathbf{X}_{i,\ast} \bbeta + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math] and with the [math]\varepsilon_i[/math] i.i.d. normally distributed with zero mean and a common variance. Let the first covariate correspond to the intercept. The model is fitted to data by means of the minimization of the sum-of-squares augmented with a lasso penalty in which the intercept is left unpenalized: [math]\lambda_1 \sum_{j=2}^p | \beta_j |[/math] with penalty parameter [math]\lambda_1 \gt 0[/math]. The penalty parameter is chosen through leave-one-out cross-validation (LOOCV). The predictive performance of the model is evaluated, again by means of LOOCV, thus creating a double cross-validation loop. At each inner loop the optimal [math]\lambda_1[/math] yields an empty, intercept-only model, from which a prediction for the left-out sample is obtained. The vector of these predictions is compared to the corresponding observation vector through their Spearman correlation (which measures the monotonicity of a relationship and -- as a correlation measure -- assumes values on the [math][-1,1][/math] interval, with an interpretation analogous to that of the ‘ordinary’ correlation). The latter equals [math]-1[/math]. Why? A small sketch that reproduces this phenomenon numerically is given below.
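A minimal base-R sketch that reproduces the phenomenon under the stated conditions (the response values are made up): with an intercept-only model, the LOOCV prediction for the left-out observation is simply the mean of the remaining responses.

# sketch: LOOCV predictions from an intercept-only model
set.seed(1)
Y <- rnorm(25)                                          # made-up response vector
pred <- sapply(seq_along(Y), function(i) mean(Y[-i]))   # leave-one-out predictions
cor(pred, Y, method = "spearman")                       # equals -1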

By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Load the breast cancer data available via the breastCancerNKI-package (downloadable from Bioconductor) through the following R-code:

# activate the libraries and load the data
# (Biobase provides exprs() and pData(); it normally comes along with the data package)
library(Biobase)
library(breastCancerNKI)
data(nki)

# extract the expression data, remove genes with missing values,
# keep the first 1000 genes, and center each gene around zero
# (the result has samples in the rows and genes in the columns)
X <- exprs(nki)
X <- X[-which(rowSums(is.na(X)) > 0), ]
X <- apply(X[1:1000, ], 1, function(x){ x - mean(x) })

# define the estrogen receptor (ER) status as the response
Y <- pData(nki)[, 8]

The eSet-object nki is now available. It contains the expression profiles of 337 breast cancer patients. Each profile comprises the expression levels of 24481 genes. The R-code above extracts the expression data from the object, removes all genes with missing values, restricts the data to the first thousand of the remaining genes, and centers the expression of each gene around zero (after which the samples are in the rows). The reduction of the gene dimensionality is only for computational speed. Furthermore, the code extracts the estrogen receptor status (short: ER status), an important prognostic indicator for breast cancer, which is to be used as the response variable in the remainder of the exercise.

  • Relate the ER status and the gene expression levels by a logistic regression model, which is fitted by means of the lasso penalized maximum likelihood method. First, find the optimal value of the penalty parameter [math]\lambda_1[/math] by means of cross-validation. This is implemented in the optL1-function of the penalized-package available from CRAN; a sketch of the relevant R calls follows this list.
  • Evaluate whether the cross-validated likelihood indeed attains a maximum at the optimal value of [math]\lambda_1[/math]. This can be done with the profL1-function of the penalized-package.
  • Investigate the sensitivity of the penalty parameter selection with respect to the choice of the cross-validation fold.
  • Does the optimal lambda produce a reasonable fit? And how does it compare to the ‘ridge fit’?
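A sketch of the relevant calls, assuming the penalized package and the objects X and Y constructed above. Argument names are as documented for recent versions of the package (check ?optL1 and ?profL1 if in doubt); the fold counts and the [math]\lambda_1[/math] grid are illustrative choices.

# sketch of the relevant calls (assumes the penalized package from CRAN)
library(penalized)

# part a): cross-validated choice of lambda1 for the lasso-penalized logistic model
opt <- optL1(Y, penalized = X, model = "logistic", fold = 10)
opt$lambda

# part b): profile of the cross-validated likelihood over a grid of lambda1 values,
# reusing the fold assignment of the optimization above
prof <- profL1(Y, penalized = X, model = "logistic", fold = opt$fold,
               minlambda1 = 1, maxlambda1 = 2 * opt$lambda, steps = 20)
plot(prof$lambda, prof$cvl, type = "l",
     xlab = "lambda1", ylab = "cross-validated log-likelihood")

# part c): rerun optL1 with other fold choices (e.g. fold = 5, fold = 20, or
# another random fold assignment) and compare the selected lambda1 values

# part d): the ridge counterpart, for comparison of the fits
optridge <- optL2(Y, penalized = X, model = "logistic", fold = 10)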
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider fitting the linear regression model by means of the elastic net regression estimator.

  • Recall the data augmentation trick from the ridge regression exercises. Use the same trick to show that the elastic net least squares loss function can be rewritten in the form of the traditional lasso loss function. Hint: absorb the ridge part of the elastic net penalty into the sum of squares.
  • The elastic net regression estimator can be evaluated by the coordinate descent procedure outlined in Section Coordinate descent. Show that in such a procedure the [math]j[/math]-th element of the elastic net regression estimate is updated at each step according to (an R sketch of this procedure follows the list):
    [[math]] \begin{eqnarray*} \hat{\beta}_j (\lambda_1, \lambda_2) & = & (\| \mathbf{X}_{\ast, j} \|_2^2 + \lambda_2)^{-1} \mbox{sign}(\mathbf{X}_{\ast, j}^{\top} \tilde{\mathbf{Y}}) \big[ | \mathbf{X}_{\ast, j}^{\top} \tilde{\mathbf{Y}} | - \tfrac{1}{2} \lambda_1 \big]_+. \end{eqnarray*} [[/math]]
    with [math]\tilde{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}_{\ast, \setminus j} \bbeta_{\setminus j}[/math], in which [math]\bbeta_{\setminus j}[/math] contains the current values of the other regression coefficients.
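A minimal coordinate-descent sketch in R that implements exactly this update; the function name, starting values and convergence criterion are illustrative choices.

# sketch: coordinate descent for the elastic net loss
#   || Y - X beta ||_2^2 + lambda1 || beta ||_1 + lambda2 || beta ||_2^2
enet.cd <- function(X, Y, lambda1, lambda2, tol = 1e-8, maxit = 1000) {
  p <- ncol(X)
  beta <- rep(0, p)                                        # start at zero
  for (it in seq_len(maxit)) {
    beta.old <- beta
    for (j in seq_len(p)) {
      Ytilde <- Y - X[, -j, drop = FALSE] %*% beta[-j]     # partial residuals
      xty <- sum(X[, j] * Ytilde)
      beta[j] <- sign(xty) * max(abs(xty) - lambda1 / 2, 0) /
                 (sum(X[, j]^2) + lambda2)                  # the update displayed above
    }
    if (max(abs(beta - beta.old)) < tol) break              # converged
  }
  beta
}

For instance, enet.cd(X, Y, lambda1 = 1, lambda2 = 1) returns the estimate for one pair of penalty parameters.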
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math] with [math]\vvarepsilon \sim \mathcal{N} ( \mathbf{0}_n, \sigma_{\varepsilon}^2 \mathbf{I}_{nn})[/math]. This model (without intercept) is fitted to data using the elastic net estimator [math]\hat{\bbeta}(\lambda_1, \lambda_2) = \arg \min_{\bbeta} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda_1 \| \bbeta \|_1 + \tfrac{1}{2} \lambda_2 \| \bbeta \|_2^2[/math]. The relevant summary statistics of the data are:

[[math]] \begin{eqnarray*} \mathbf{X} = \left( \begin{array}{r} 1 \\ -1 \\ -1 \end{array} \right), \, \mathbf{Y} = \left( \begin{array}{r} -5 \\ 4 \\ 1 \end{array} \right), \, \mathbf{X}^{\top} \mathbf{X} = \left( \begin{array}{r} 3 \end{array} \right), \mbox{ and } \, \mathbf{X}^{\top} \mathbf{Y} = \left( \begin{array}{r} -10 \end{array} \right). \end{eqnarray*} [[/math]]

  • Evaluate, for [math](\lambda_1, \lambda_2) = (3,2)[/math], the elastic net regression estimator of the linear regression model; a numerical check in R follows this list.
  • Now consider the evaluation of the elastic net regression estimator of the linear regression model for the same penalty parameters, [math](\lambda_1, \lambda_2) = (3,2)[/math], but this time involving two covariates. The first covariate is as in part a), the second is orthogonal to it. Do you expect the resulting elastic net estimate of the first regression coefficient [math]\hat{\beta}_1 ( \lambda_1, \lambda_2)[/math] to be larger than, equal to, or smaller than (in absolute value) your answer to part a)? Motivate your answer.
  • Now take in part b) the second covariate equal to the first one. Show that the first coefficient of the elastic net estimate, [math]\hat{\beta}_1 ( \lambda_1, 2 \lambda_2)[/math], is half that of part a). Note: there is no need to know the exact answer to part a).
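A numerical check sketch for part a) in base R: the one-covariate elastic net loss can be minimized directly with optimize(). The data below reproduce the given summary statistics; the search interval is an arbitrary choice.

# sketch: numerical check of part a) via one-dimensional minimization
X <- c(1, -1, -1)
Y <- c(-5, 4, 1)
enet.loss <- function(b, lambda1 = 3, lambda2 = 2) {
  sum((Y - X * b)^2) + lambda1 * abs(b) + 0.5 * lambda2 * b^2
}
optimize(enet.loss, interval = c(-10, 10))$minimum   # compare with your hand calculation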
By Admin
Jun 25'23

[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]

This question is freely copied from [1]: Problem 2.5a, page 43.

Consider the linear regression model [math]\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon[/math]. It is fitted to data from a study with an orthonormal design matrix by means of the adaptive lasso regression estimator, initialized with the OLS/ML regression estimator. Show that the [math]j[/math]-th element of the resulting adaptive lasso regression estimator equals:

[[math]] \begin{eqnarray*} \hat{\beta}_j^{\mbox{{\tiny adapt}}} (\lambda_1) & = & \mbox{sign}(\hat{\beta}_j^{\mbox{{\tiny ols}}}) ( | \hat{\beta}_j^{\mbox{{\tiny ols}}} | - \tfrac{1}{2} \lambda_1 / | \hat{\beta}_j^{\mbox{{\tiny ols}}} |)_+. \end{eqnarray*} [[/math]]
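Under the orthonormal-design assumption the adaptive lasso problem separates per coordinate. The base-R sketch below can be used to verify the displayed expression numerically for a single coordinate; the stated one-dimensional problem and the numbers used are illustrative.

# sketch: check the adaptive lasso soft-thresholding rule for one coordinate
# (orthonormal design; up to an additive constant the j-th coordinate then solves
#  min_b (b - beta.ols)^2 + lambda1 * |b| / |beta.ols|)
beta.ols <- 1.3; lambda1 <- 0.8                   # made-up values
adapt.loss <- function(b) (b - beta.ols)^2 + lambda1 * abs(b) / abs(beta.ols)
numerical <- optimize(adapt.loss, interval = c(-5, 5))$minimum
closed.form <- sign(beta.ols) * max(abs(beta.ols) - 0.5 * lambda1 / abs(beta.ols), 0)
c(numerical, closed.form)                         # should agree up to tolerance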

  1. Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.