Exercise
[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Load the leukemia data available via the multtest-package (downloadable from Bioconductor) through the following R-code:
# activate library and load the data
library(multtest)
data(golub)
The objects golub and golub.cl are now available. The matrix-object golub contains the expression profiles of 38 leukemia patients, stored column-wise: each column is one patient's profile and comprises the expression levels of 3051 genes (the rows). The numeric-object golub.cl is an indicator variable for the leukemia type of the patient (coded 0 for ALL and 1 for AML).
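As a quick sanity check before modelling, the loaded objects can be inspected. This is a minimal sketch that assumes the multtest-package is installed; it only confirms the dimensions and class sizes.

```r
# Minimal sketch: assumes the multtest package (Bioconductor) is installed.
library(multtest)
data(golub)

# Rows are genes, columns are patients.
dim(golub)        # 3051 38

# Number of patients per leukemia subtype (0 = ALL, 1 = AML).
table(golub.cl)
```

Because the genes are in the rows, the matrix has to be transposed before it can serve as a design matrix with one row per patient.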
- Relate the leukemia subtype and the gene expression levels by a logistic regression model. Fit this model by means of penalized maximum likelihood, employing the ridge penalty with penalty parameter [math]\lambda=1[/math]. This is implemented in the penalized-package available from CRAN. Note: center the expression levels gene-wise around zero.
- Obtain the fits from the regression model. The fit is almost perfect. Could this be due to overfitting the data? Alternatively, could it be that the biological information in the gene expression levels indeed determines the leukemia subtype almost perfectly?
- To discern between the two explanations for the almost perfect fit, randomly shuffle the subtypes. Refit the logistic regression model and obtain the fits. On the basis of this and the previous fit, which explanation is more plausible?
- Compare the fit of the logistic model with different penalty parameters, say [math]\lambda = 1[/math] and [math]\lambda = 1000[/math]. How does [math]\lambda[/math] influence the possibility of overfitting the data?
- Describe what you would do to prevent overfitting.
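The computational steps of the exercise can be sketched as follows. This is one possible setup rather than a reference solution. It assumes the multtest- and penalized-packages are installed and that the ridge penalty parameter [math]\lambda[/math] corresponds to the lambda2 argument of the penalized function. The object names (X, y, yShuffled, and the fit and probability objects) and the random seed are arbitrary choices made for illustration.

```r
# Sketch only: assumes the multtest (Bioconductor) and penalized (CRAN)
# packages are installed.
library(multtest)
library(penalized)
data(golub)

# Design matrix with patients in rows and genes in columns;
# center each gene (column) around zero, without rescaling.
X <- scale(t(golub), center = TRUE, scale = FALSE)
y <- golub.cl

# Ridge logistic regression with penalty parameter 1
# (lambda2 is taken to be the ridge penalty; an assumption of this sketch).
fit1 <- penalized(response = y, penalized = X, lambda2 = 1,
                  model = "logistic", trace = FALSE)
p1 <- fitted(fit1)

# Same model after randomly shuffling the subtype labels.
set.seed(1)  # arbitrary seed, for reproducibility only
yShuffled <- sample(y)
fitShuffled <- penalized(response = yShuffled, penalized = X, lambda2 = 1,
                         model = "logistic", trace = FALSE)
pShuffled <- fitted(fitShuffled)

# Original labels again, now with a much heavier penalty.
fit1000 <- penalized(response = y, penalized = X, lambda2 = 1000,
                     model = "logistic", trace = FALSE)
p1000 <- fitted(fit1000)

# Fitted probabilities next to the labels they were fitted to.
round(cbind(y, p1, p1000, yShuffled, pShuffled), 3)
```

Comparing how closely p1 follows y with how closely pShuffled follows yShuffled speaks to the question about the two explanations, while the columns p1 and p1000 show the effect of the penalty parameter on the fit.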