For this homework, you may only use the following packages:
# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
# specific
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
[7 points] For this question we will use the data in leukemia.csv, which originates from Golub et al. 1999.
The response variable class is a categorical variable. There are two possible responses: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia), both types of leukemia. We will use the many feature variables, which are expression levels of genes, to predict these classes.
Note that this dataset is rather large and you may have difficulty loading it using the “Import Dataset” feature in RStudio. Instead, place the file in the same folder as your .Rmd file and run the following command. (Which you should be doing anyway.) Again, since this dataset is large, use 5-fold cross-validation when needed.
leukemia = read_csv("leukemia.csv", progress = FALSE)
For use with the glmnet package, it will be useful to create a factor response variable y and a feature matrix X as seen below. We won’t test-train split the data since there are so few observations.
y = as.factor(leukemia$class)
X = as.matrix(leukemia[, -1])
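To make the shape of these two objects concrete, here is a minimal sketch on a tiny made-up data frame. The names toy, toy_y, toy_X, gene_1, and gene_2 are hypothetical stand-ins and are not part of leukemia.csv; the sketch only assumes, as the code above does, that the class label sits in the first column.

```r
# Toy stand-in for the leukemia data (hypothetical values and column names):
# the first column is the class label, the rest are numeric features.
toy = data.frame(
  class  = c("ALL", "AML", "ALL", "AML"),
  gene_1 = c(0.2, 1.5, -0.3, 2.1),
  gene_2 = c(1.1, 0.4, 0.9, -0.7)
)

toy_y = as.factor(toy$class)  # factor response; levels are sorted alphabetically
toy_X = as.matrix(toy[, -1])  # numeric matrix with the label column dropped

levels(toy_y)  # "ALL" "AML"
dim(toy_X)     # 4 2
```

With family = "binomial", glmnet treats the last factor level in alphabetical order as the target class, so with these labels the fitted probabilities would refer to AML.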
Do the following:
Fit logistic regressions with a lasso penalty and with a ridge penalty. (Let glmnet choose the \(\lambda\) values.) Create side-by-side plots that show the features entering (or leaving) the models.
Use cross-validation to tune the logistic regression with a lasso penalty. Let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance and the largest \(\lambda\) whose deviance is within one standard error of that minimum. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
Use cross-validation to tune the logistic regression with a ridge penalty. Let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance and the largest \(\lambda\) whose deviance is within one standard error of that minimum. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
Use train() in caret to tune k-nearest neighbors. Do not specify a grid of \(k\) values to try; let caret do so automatically. (It will use 5, 7, 9.) Store the cross-validated accuracy for each. Scale the predictors.
Solution:
uin = 123456789
set.seed(uin)
cv_5 = trainControl(method = "cv", number = 5)
fit_lasso = glmnet(X, y, family = "binomial", alpha = 1)
fit_ridge = glmnet(X, y, family = "binomial", alpha = 0)
par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "lambda", main = "Lasso")
plot(fit_ridge, xvar = "lambda", main = "Ridge")
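The cross-validated tuning step from the task list could look roughly like the following. This is only a sketch under stated assumptions, not the solution for leukemia.csv: it uses simulated data, and the names sim_X, sim_y, cv_lasso, lambda_grid, and lasso_grid, as well as the seed 42, are made up for illustration.

```r
library(glmnet)

set.seed(42)  # arbitrary seed for this simulated example

# Simulated stand-in data (an assumption, not the leukemia data):
# 60 observations, 200 features, two classes driven by the first two features.
n = 60
p = 200
sim_X = matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(sim_X) = paste0("gene_", seq_len(p))
sim_y = factor(ifelse(sim_X[, 1] - sim_X[, 2] + rnorm(n) > 0, "AML", "ALL"))

# 5-fold cross-validated lasso; glmnet chooses the lambda sequence.
# Deviance is the default loss for family = "binomial".
cv_lasso = cv.glmnet(sim_X, sim_y, family = "binomial", alpha = 1, nfolds = 5)

# The lambda minimizing CV deviance, and the largest lambda within one
# standard error of that minimum.
lambda_grid = c(cv_lasso$lambda.min, cv_lasso$lambda.1se)

# Cross-validated deviance for each lambda considered, with error bars.
plot(cv_lasso)

# Tuning grid with the two column names caret uses for method = "glmnet".
lasso_grid = expand.grid(alpha = 1, lambda = lambda_grid)
```

A grid like lasso_grid could then be passed to train() through its tuneGrid argument with method = "glmnet" and the cv_5 control object defined above. Setting alpha = 0 in both the cv.glmnet() call and the grid would give the ridge analogue.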