
Jun 25'23

Exercise

Load the breast cancer data available via the breastCancerNKI-package (downloadable from BioConductor) through the following R-code:

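# activate the libraries and load the data
library(Biobase)          # provides exprs() and pData()
library(breastCancerNKI)
data(nki)

# extract the expression data and remove genes with missing values
X <- exprs(nki)
X <- X[rowSums(is.na(X)) == 0, ]

# take the first 1000 genes and center each gene around zero;
# apply() returns the result with samples as rows and genes as columns
X <- apply(X[1:1000, ], 1, function(x){ x - mean(x) })

# define ER status as the response
Y <- pData(nki)[,8]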

The eSet-object nki is now available. It contains the expression profiles of 337 breast cancer patients. Each profile comprises the expression levels of 24481 genes. The R-code above extracts the expression data from the object, removes all genes with missing values, subsets the data to the first thousand remaining genes, and centers the expression levels gene-wise around zero. The resulting matrix X has the samples as rows and the genes as columns. The reduction of the gene dimensionality is only for computational speed. Furthermore, the code extracts the estrogen receptor status (short: ER status), an important prognostic indicator for breast cancer, which is to be used as the response variable in the remainder of the exercise.
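To make the preprocessing concrete, the following sketch applies the same steps to a small made-up matrix. The object names toy, complete and centered, as well as all values, are illustrative assumptions and not part of the exercise. It mainly shows that apply() over rows returns the centered genes as columns, so the result is transposed relative to the input.

```r
# Toy stand-in for exprs(nki): 3 genes (rows) by 4 samples (columns).
# The values are made up for illustration only.
toy <- matrix(c(1,  2,  3,  6,
                4, NA,  6,  8,
                7,  8, 12, 13),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))

# Keep only the genes (rows) without missing values.
complete <- toy[rowSums(is.na(toy)) == 0, , drop = FALSE]

# Center each gene around zero. apply() over rows stores each result as a
# column, so the output has samples in rows and genes in columns.
centered <- apply(complete, 1, function(x) x - mean(x))

dim(centered)       # samples by genes
colMeans(centered)  # zero for every gene, up to rounding
```

The logical index rowSums(is.na(toy)) == 0 is used here instead of a negative which() index because the latter selects no rows at all when no gene has a missing value.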

Relate the ER status and the gene expression levels by a logistic regression model, which is fitted by means of the lasso penalized maximum likelihood method. First, find the optimal value of the penalty parameter [math]\lambda_1[/math] by means of cross-validation. This is implemented in the optL1-function of the penalized-package available from CRAN.
Evaluate whether the cross-validated likelihood indeed attains a maximum at the optimal value of [math]\lambda_1[/math]. This can be done with the profL1-function of the same penalized-package.
Investigate the sensitivity of the penalty parameter selection with respect to the choice of the cross-validation fold.
Does the optimal [math]\lambda_1[/math] produce a reasonable fit? And how does it compare to the ‘ridge fit’?
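As a possible starting point, the sketch below runs the calls mentioned above on simulated data rather than on the NKI data, so it is not a worked answer. The objects Xsim, Ysim and folds, the simulated effect sizes, and the grid limits passed to profL1 are all assumptions chosen for illustration. The optL2-function, the ridge counterpart of optL1 in the penalized-package, is used for the ridge comparison.

```r
library(penalized)

# Simulated stand-in for the data: 100 samples, 50 covariates, of which
# only the first two affect the binary response (illustrative choice).
set.seed(1)
n <- 100; p <- 50
Xsim <- matrix(rnorm(n * p), nrow = n)            # samples in rows
eta  <- 1.5 * Xsim[, 1] - 1.5 * Xsim[, 2]
Ysim <- rbinom(n, size = 1, prob = plogis(eta))

# Fix the fold assignment so that all calls below use the same splits.
folds <- sample(rep(1:10, length.out = n))

# Cross-validated choice of the lasso penalty.
opt <- optL1(Ysim, penalized = Xsim, model = "logistic",
             fold = folds, trace = FALSE)
opt$lambda   # selected penalty parameter
opt$cvl      # cross-validated log-likelihood at that value

# Profile of the cross-validated log-likelihood over a grid of penalties
# (grid limits are an arbitrary choice for this simulated example).
prof <- profL1(Ysim, penalized = Xsim, model = "logistic", fold = folds,
               minlambda1 = 1, maxlambda1 = 30, steps = 20, trace = FALSE)
plot(prof$lambda, prof$cvl, type = "l",
     xlab = "lambda1", ylab = "cross-validated log-likelihood")
abline(v = opt$lambda, lty = 2)

# Sensitivity to the fold: repeat the search with other numbers of folds.
lambdaByFold <- sapply(c(5, 10, 20), function(k)
  optL1(Ysim, penalized = Xsim, model = "logistic",
        fold = k, trace = FALSE)$lambda)

# Ridge counterpart, compared on the cross-validated log-likelihood.
ridge <- optL2(Ysim, penalized = Xsim, model = "logistic",
               fold = folds, trace = FALSE)
c(lasso = opt$cvl, ridge = ridge$cvl)
```

For the exercise itself, Xsim and Ysim would be replaced by X and Y from the code at the top. Passing a single number as fold draws a new random split on each call, which is what makes the selected penalty vary between runs.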
