Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.
“Statisticians, like artists, have the bad habit of falling in love with their models.”

— George Box
For this homework, you may only use the following packages:
# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
# specific
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
If you feel additional general packages would be useful for future homework, please pass these along to the instructor.
[7 points] For this question we will use the data in leukemia.csv, which originates from Golub et al. 1999. The response variable class is categorical. There are two possible responses: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia), both types of leukemia. We will use the many feature variables, which are expression levels of genes, to predict these classes.
Note that this dataset is rather large, and you may have difficulty loading it using the “Import Dataset” feature in RStudio. Instead, place the file in the same folder as your .Rmd file and run the following command. (Which you should be doing anyway.) Again, since this dataset is large, use 5-fold cross-validation when needed.
leukemia = read_csv("leukemia.csv", progress = FALSE)
For use with the glmnet package, it will be useful to create a factor response variable y and a feature matrix X as seen below. We won’t test-train split the data since there are so few observations.
y = as.factor(leukemia$class)
X = as.matrix(leukemia[, -1])
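For orientation, here is a minimal sketch of the glmnet calls the tasks below build on. This is illustrative only, not the required solution; the object names are arbitrary.

fit_lasso = glmnet(X, y, family = "binomial", alpha = 1)  # full lasso path
plot(fit_lasso, xvar = "lambda")                          # coefficients entering/leaving

fit_lasso_cv = cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)
plot(fit_lasso_cv)       # deviance at each lambda considered
fit_lasso_cv$lambda.min  # lambda that minimizes the deviance
fit_lasso_cv$lambda.1se  # lambda within one standard error

Setting alpha = 0 instead gives the ridge versions of the same calls.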
Do the following:
- Fit both a lasso-penalized and a ridge-penalized logistic regression to the full data, without cross-validation. (Let glmnet choose the \(\lambda\) values.) Create side-by-side plots that show the features entering (or leaving) the models.
- Fit a logistic regression with a lasso penalty using cv.glmnet() with 5-fold cross-validation. Let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance, as well as the \(\lambda\) that has a deviance within one standard error. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
- Fit a logistic regression with a ridge penalty using cv.glmnet() with 5-fold cross-validation. Let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance, as well as the \(\lambda\) that has a deviance within one standard error. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
- Fit \(k\)-nearest neighbors using train() in caret. Do not specify a grid of \(k\) values to try; let caret do so automatically. (It will use 5, 7, 9.) Store the cross-validated accuracy for each. Scale the predictors.

[5 points] For this exercise, we will use the College data from the ISLR package. Familiarize yourself with this dataset before performing analyses. We will attempt to predict the Outstate variable.
Test-train split the data using this code.
set.seed(42)
index = createDataPartition(College$Outstate, p = 0.75, list = FALSE)
college_trn = College[index, ]
college_tst = College[-index, ]
Train a total of six models using five-fold cross-validation:

- An additive linear model.
- An elastic net model using additive predictors. Use a tuneLength of 10. (A sketch of one possible call follows the seed code below.)
- An elastic net model that also considers all two-way interactions. Use a tuneLength of 10.
- A well-tuned KNN model.
- A well-tuned KNN model that also considers all two-way interactions.
- A default-tuned random forest.

Before beginning, set a seed equal to your UIN.
uin = 123456789
set.seed(uin)
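As a sketch of the kind of train() call the elastic net tasks describe (one possible configuration, not the required solution; elnet_fit is an arbitrary name):

elnet_fit = train(
  Outstate ~ .,
  data = college_trn,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 5),  # five-fold CV
  tuneLength = 10)                                      # 10 alpha/lambda candidates

The interaction variant could swap in the formula Outstate ~ . ^ 2, and the KNN models could use method = "knn" with preProcess = c("center", "scale") to scale the predictors.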
[5 points] For this exercise we will create data via simulation, then assess how well certain methods perform. Use the code below to create a train and test dataset.
set.seed(42)
sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)
sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))
sim_tst = mlbench.spirals(n = 10000, cycles = 1.5, sd = 0.125)
sim_tst = data.frame(sim_tst$x, class = as.factor(sim_tst$classes))
The training data is plotted below, with colors indicating the class variable, which is the response.
Before proceeding further, set a seed equal to your UIN.
uin = 123456789
set.seed(uin)
We’ll use the following to define 5-fold cross-validation for use with train() from caret.
cv_5 = trainControl(method = "cv", number = 5)
We now tune two models with train(). First, a logistic regression using glm. (This isn’t actually “tuned,” as there are no parameters to tune, but we use train() to perform cross-validation.) Second, we tune a single decision tree using rpart.

We store the results in sim_glm_cv and sim_tree_cv, respectively, and wrap both function calls with system.time() in order to record how long the tuning process takes for each method.
glm_cv_time = system.time({
  sim_glm_cv = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,
    method = "glm")   # logistic regression: nothing to tune
})
tree_cv_time = system.time({
  sim_tree_cv = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,
    method = "rpart") # single decision tree: caret tunes cp
})
We see that both methods are tuned via cross-validation in a similar amount of time.
glm_cv_time["elapsed"]
## elapsed
## 1.348
tree_cv_time["elapsed"]
## elapsed
## 1.107
Repeat the above analysis using a random forest, twice. The first time, use 5-fold cross-validation. (This is how we had been using random forests before we understood them.) The second time, tune the model using OOB samples. We only have two predictors here, so, for both, use the following tuning grid. (One possible setup is sketched after the grid.)
rf_grid = expand.grid(mtry = c(1, 2))
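One way to set up the two random forest tuning runs, as a sketch assuming the objects already defined above (sim_rf_cv, sim_rf_oob, and the timing names are arbitrary):

rf_cv_time = system.time({
  sim_rf_cv = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,                          # 5-fold cross-validation, as before
    tuneGrid = rf_grid,
    method = "rf")
})

rf_oob_time = system.time({
  sim_rf_oob = train(
    class ~ .,
    data = sim_trn,
    trControl = trainControl(method = "oob"),  # tune using OOB samples instead
    tuneGrid = rf_grid,
    method = "rf")
})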
Create a table summarizing the results of these four models (logistic with CV, tree with CV, RF with OOB, RF with CV). For each, report:

- the chosen value of any tuning parameter,
- the elapsed tuning time,
- the resampled (CV or OOB) accuracy, and
- the test accuracy.
[5 points] For this question we will predict the Salary of Hitters. (Hitters is also the name of the dataset.) We first remove the missing data:
Hitters = na.omit(Hitters)
After changing uin to your UIN, use the following code to test-train split the data.
uin = 123456789
set.seed(uin)
hit_idx = createDataPartition(Hitters$Salary, p = 0.6, list = FALSE)
hit_trn = Hitters[hit_idx,]
hit_tst = Hitters[-hit_idx,]
Do the following:

- Tune a boosted tree model using 5-fold cross-validation and the tuning grid given below. (A sketch of one possible call follows this task list.)
gbm_grid = expand.grid(interaction.depth = c(1, 2),
n.trees = c(500, 1000, 1500),
shrinkage = c(0.001, 0.01, 0.1),
n.minobsinnode = 10)
- Tune a random forest using OOB resampling. Consider all possible values of mtry.

Create a table summarizing the results of three models: the tuned boosted tree, the tuned random forest, and a bagged tree. For each, report:

- the resampled (CV or OOB) RMSE, and
- the test RMSE.
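A minimal sketch of the two tuning calls described above (illustrative only; gbm_fit and rf_fit are arbitrary names, and both calls assume the objects defined earlier):

gbm_fit = train(
  Salary ~ .,
  data = hit_trn,
  method = "gbm",
  trControl = trainControl(method = "cv", number = 5),   # 5-fold CV
  tuneGrid = gbm_grid,
  verbose = FALSE)                                       # silence gbm's per-tree output

rf_fit = train(
  Salary ~ .,
  data = hit_trn,
  method = "rf",
  trControl = trainControl(method = "oob"),              # tune with OOB samples
  tuneGrid = expand.grid(mtry = 1:(ncol(hit_trn) - 1)))  # all possible mtry values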
[0.5 point each] Answer the following questions based on your results from the four exercises. (Yes, this means the points available total more than 30. The extra points are “buffer” points; the maximum score you can achieve is still 30.)
(5.1.1) How many observations are in the dataset? How many predictors are in the dataset?
(5.1.2) Based on the deviance plots, do you feel that glmnet considered enough \(\lambda\) values for lasso?
(5.1.3) Based on the deviance plots, do you feel that glmnet considered enough \(\lambda\) values for ridge?
(5.1.4) How does \(k\)-nearest neighbor compare to the penalized methods? Can you explain any difference?
(5.1.5) Based on your results, which model would you choose? Explain.
(5.2.1) Based on the table, which model do you prefer? Justify your answer.
(5.2.2) For both of the elastic net models, report the best tuning parameters from caret. For each, is this ridge, lasso, or somewhere in between? If in between, closer to which?
(5.2.3) Did you scale the predictors when you used KNN? Should you have scaled the predictors when you used KNN?
(5.2.4) Of the two KNN models which works better? Can you explain why?
(5.2.5) What year is this dataset from? What was out-of-state tuition at UIUC at that time?
(5.3.1) Compare the time taken to tune each model. Is the difference between the OOB and CV result for the random forest similar to what you would have expected?
(5.3.2) Compare the tuned value of mtry for each of the random forests. Do they choose the same model?
(5.3.3) Compare the test accuracy of each of the four procedures considered. Briefly explain these results.
(5.4.1) Report the tuned value of mtry for the random forest.
(5.4.2) Create a plot that shows the tuning results for the tuning of the boosted tree model.
(5.4.3) Create a plot of the variable importance for the tuned random forest.
(5.4.4) Create a plot of the variable importance for the tuned boosted tree model.
(5.4.5) According to the random forest, what are the three most important predictors?
(5.4.6) According to the boosted model, what are the three most important predictors?