For this homework, you may only use the following packages:

# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)

# specific
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)

Exercise 1 (Classifying Leukemia)

[7 points] For this question we will use the data in leukemia.csv which originates from Golub et al. 1999.

The response variable class is a categorical variable. There are two possible responses: ALL (acute myeloid leukemia) and AML (acute lymphoblastic leukemia), both types of leukemia. We will use the many feature variables, which are expression levels of genes, to predict these classes.

Note that, this dataset is rather large and you may have difficultly loading it using the “Import Dataset” feature in RStudio. Instead place the file in the same folder as your .Rmd file and run the following command. (Which you should be doing anyway.) Again, since this dataset is large, use 5-fold cross-validation when needed.

leukemia = read_csv("leukemia.csv", progress = FALSE)

For use with the glmnet package, it will be useful to create a factor response variable y and a feature matrix X as seen below. We won’t test-train split the data since there are so few observations.

y = as.factor(leukemia$class)
X = as.matrix(leukemia[, -1])

Do the following:

Set a seed equal to your UIN.
Fit the full path of a logistic regression with both a lasso penalty and a ridge penalty. (Don’t use cross-validation. Also let glmnet choose the \(\lambda\) values.) Create side-by-side plots that shows the features entering (or leaving) the models.
Use cross-validation to tune an logistic regression with a lasso penalty. Again, let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance, as well as the \(\lambda\) that has a deviance within one standard error. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
Use cross-validation to tune an logistic regression with a ridge penalty. Again, let glmnet choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance, as well as the \(\lambda\) that has a deviance within one standard error. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with train() in caret. Use train() to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
Use cross-validation to tune \(k\)-nearest neighbors using train() in caret. Do not specify a grid of \(k\) values to try, let caret do so automatically. (It will use 5, 7, 9.) Store the cross-validated accuracy for each. Scale the predictors.
Summarize these seven models in a table. (Two lasso, two ridge, three knn.) For each report the cross-validated accuracy and the standard deviation of the accuracy.

Solution:

uin = 123456789
set.seed(uin)
cv_5 = trainControl(method = "cv", number = 5)

fit_lasso = glmnet(X, y, family = "binomial", alpha = 1)
fit_ridge = glmnet(X, y, family = "binomial", alpha = 0)

par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "lambda", main = "Lasso")
plot(fit_ridge, xvar = "lambda", main = "Ridge")

fit_lasso_cv = cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)
plot(fit_lasso_cv)

lambda_lasso = expand.grid(alpha = 1, 
                           lambda = c(fit_lasso_cv$lambda.min,
                                      fit_lasso_cv$lambda.1se))

train_lasso = train(X, y, 
                    method = "glmnet", 
                    trControl = cv_5,
                    tuneGrid = lambda_lasso)

fit_ridge_cv = cv.glmnet(X, y, family = "binomial", alpha = 0, nfolds = 5)
plot(fit_ridge_cv)

lambda_ridge = expand.grid(alpha = 0, 
                           lambda = c(fit_ridge_cv$lambda.min, 
                                      fit_ridge_cv$lambda.1se))

train_ridge = train(X, y, 
                    method = "glmnet", 
                    trControl = cv_5,
                    tuneGrid = lambda_ridge)

train_knn = train(X, y, 
                  method = "knn",
                  preProc = c("center", "scale"),
                  trControl = cv_5)

Method	Parameter	Parameter Value	CV-5 Accuracy	Standard Deviation
Lasso	`lambda`	0.037	0.930	0.003
Lasso	`lambda`	0.136	0.917	0.058
Ridge	`lambda`	3.960	0.986	0.032
Ridge	`lambda`	4.996	0.986	0.032
KNN	`k`	5.000	0.876	0.090
KNN	`k`	7.000	0.876	0.090
KNN	`k`	9.000	0.833	0.063

Exercise 2 (The Cost of College)

[5 points] For this exercise, we will use the College data from the ISLR package. Familiarize yourself with this dataset before performing analyses. We will attempt to predict the Outstate variable.

Test-train split the data using this code.

set.seed(42)
index = createDataPartition(College$Outstate, p = 0.75, list = FALSE)
college_trn = College[index, ]
college_tst = College[-index, ]

Train a total of six models using five-fold cross validation.

An additive linear model.
An elastic net model using additive predictors. Use a tuneLength of 10.
An elastic net model that also considers all two-way interactions. Use a tuneLength of 10.
A well-tuned KNN model.
A well-tuned KNN model that also considers all two-way interactions. (Should this work?)
A default-tuned random forest.

Before beginning, set a seed equal to your UIN.

uin = 123456789
set.seed(uin)

Create a table which reports CV and Test RMSE for each.

Solution:

Note that some code, for plotting and summarizing, is hidden. See the .Rmd file for code.

calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

cv_5 = trainControl(method = "cv", number = 5)

set.seed(uin)
fit_lm         = train(Outstate ~ ., data = college_trn, method = "lm", 
                       trControl = cv_5)
fit_glmnet     = train(Outstate ~ ., data = college_trn, method = "glmnet", 
                       trControl = cv_5, tuneLength = 10)
fit_glmnet_int = train(Outstate ~ . ^ 2, data = college_trn, method = "glmnet", 
                       trControl = cv_5, tuneLength = 10)
fit_knn        = train(Outstate ~ ., data = college_trn, method = "knn", 
                       trControl = cv_5, tuneLength = 25, 
                       preProcess = c("center", "scale"))
fit_knn_int    = train(Outstate ~ . ^ 2, data = college_trn, method = "knn", 
                       trControl = cv_5, tuneLength = 25, 
                       preProcess = c("center", "scale"))
fit_rf         = train(Outstate ~ ., data = college_trn, method = "rf", 
                       trControl = cv_5)

get_best_result = function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}

Method	CV RMSE	Test RMSE
Linear Model	1968.034	2069.087
Elastic Net	1982.386	2080.690
Elastic Net with Interactions	1872.169	1964.739
KNN	1916.154	1971.330
KNN with Interactions	1916.701	2060.803
Random Forest	1785.113	1733.767

Exercise 3 (Computation Time)

[5 points] For this exercise we will create data via simulation, then assess how well certain methods perform. Use the code below to create a train and test dataset.

set.seed(42)
sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)
sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))
sim_tst = mlbench.spirals(n = 10000, cycles = 1.5, sd = 0.125)
sim_tst = data.frame(sim_tst$x, class = as.factor(sim_tst$classes))

The training data is plotted below, with colors indicating the class variable, which is the response.

Before proceeding further, set a seed equal to your UIN.

uin = 123456789
set.seed(uin)

We’ll use the following to define 5-fold cross-validation for use with train() from caret.

cv_5 = trainControl(method = "cv", number = 5)

We now tune two models with train(). First, a logistic regression using glm. (This actually isn’t “tuned” as there are not parameters to be tuned, but we use train() to perform cross-validation.) Second we tune a single decision tree using rpart.

We store the results in sim_glm_cv and sim_tree_cv respectively, but we also wrap both function calls with system.time() in order to record how long the tuning process takes for each method.

glm_cv_time = system.time({
  sim_glm_cv  = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,
    method = "glm")
})

tree_cv_time = system.time({
  sim_tree_cv = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,
    method = "rpart")
})

We see that both methods are tuned via cross-validation in a similar amount of time.

glm_cv_time["elapsed"]

## elapsed 
##    1.01

tree_cv_time["elapsed"]

## elapsed 
##   0.704

Repeat the above analysis using a random forest, twice. The first time use 5-fold cross-validation. (This is how we had been using random forests before we understood random forests.) The second time, tune the model using OOB samples. We only have two predictors here, so, for both, use the following tuning grid.

rf_grid = expand.grid(mtry = c(1, 2))

oob  = trainControl(method = "oob")
rf_oob_time = system.time({
  sim_rf_oob = train(
    class ~ .,
    data = sim_trn,
    trControl = oob,
    tuneGrid = rf_grid)
})

rf_cv_time = system.time({
  sim_rf_cv = train(
    class ~ .,
    data = sim_trn,
    trControl = cv_5,
    tuneGrid = rf_grid)
})

Create a table summarizing the results of these four models. (Logistic with CV, Tree with CV, RF with OOB, RF with CV). Report:

Chosen value of tuning parameter (If applicable)
Elapsed tuning time
Resampled (CV or OOB) Accuracy
Test Accuracy

Method	Best Tune	Elapsed	Resampled Accuracy	Test Accuracy
Logistic, CV	NA	1.010	0.6580	0.6606
Tree, CV	0.0196	0.704	0.7528	0.7233
RF, OOB	1.0000	2.768	0.8532	0.8500
RF, CV	1.0000	6.138	0.8500	0.8506

Exercise 4 (Predicting Baseball Salaries)

[5 points] For this question we will predict the Salary of Hitters. (Hitters is also the name of the dataset.) We first remove the missing data:

Hitters = na.omit(Hitters)

After changing uin to your UIN, use the following code to test-train split the data.

uin = 123456789
set.seed(uin)
hit_idx = createDataPartition(Hitters$Salary, p = 0.6, list = FALSE)
hit_trn = Hitters[hit_idx,]
hit_tst = Hitters[-hit_idx,]

Do the following:

Tune a boosted tree model using the following tuning grid and 5-fold cross-validation.

gbm_grid = expand.grid(interaction.depth = c(1, 2),
                       n.trees = c(500, 1000, 1500),
                       shrinkage = c(0.001, 0.01, 0.1),
                       n.minobsinnode = 10)

Tune a random forest using OOB resampling and all possible values of mtry.

Create a table summarizing the results of three models:

Tuned boosted tree model
Tuned random forest model
Bagged tree model

For each, report:

Resampled RMSE
Test RMSE

hit_gbm = train(Salary ~ ., data = hit_trn,
                method = "gbm",
                trControl = cv_5,
                verbose = FALSE,
                tuneGrid = gbm_grid)

rf_grid = rf_grid = expand.grid(mtry = 1:(ncol(hit_trn) - 1))
hit_rf  = train(Salary ~ ., data = hit_trn,
                method = "rf",
                trControl = oob,
                tuneGrid = rf_grid)

# storing the bagged model for making predictions
hit_bag = train(Salary ~ ., data = hit_trn,
                method = "rf",
                trControl = oob,
                tuneGrid = data.frame(mtry = (ncol(hit_trn) - 1)))

calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

gbm_tst_rmse = calc_rmse(predicted = predict(hit_gbm, hit_tst),
                         actual    = hit_tst$Salary)

rf_tst_rmse = calc_rmse(predicted = predict(hit_rf, hit_tst),
                        actual    = hit_tst$Salary)

bag_tst_rmse = calc_rmse(predicted = predict(hit_bag, hit_tst),
                         actual    = hit_tst$Salary)


hitters_results = data.frame(
  c("Boosting", "Random Forest", "Bagging"),
  c(min(hit_gbm$results$RMSE), min(hit_rf$results$RMSE), min(hit_bag$results$RMSE)),
  c(gbm_tst_rmse, rf_tst_rmse, bag_tst_rmse)
)
colnames(hitters_results) = c("Method", "Resampled RMSE", "Test RMSE")
kable_styling(kable(hitters_results, format = "html", digits = 4), full_width = FALSE)

Method	Resampled RMSE	Test RMSE
Boosting	323.9203	257.8148
Random Forest	298.7520	255.5084
Bagging	304.8423	280.6918

Exercise 5 (Concept Checks)

[0.5 point each] Answer the following questions based on your results from the three exercises. Yes this does mean that the total points total more than 30. The extra points are “buffer” points. The maximum points you are able to score is still 30.

Leukemia

(5.1.1) How many observations are in the dataset? How many predictors are in the dataset?

Predictors: 5147
Observations: 72

(5.1.2) Based on the deviance plots, do you feel that glmnet considered enough \(\lambda\) values for lasso?

Yes. We see a nice U-shaped CV error curve.

(5.1.3) Based on the deviance plots, do you feel that glmnet considered enough \(\lambda\) values for ridge?

No. This plot suggests that if we were to try smaller lambda, we could achieve a lower devience (error).

(5.1.4) How does \(k\)-nearest neighbor compare to the penalized methods? Can you explain any difference?

KNN performs worse. This is expected in a high-dimensional setting that we have here due to the curse of dimensionality. (Although, it turns out that if we were to scale the predictors, KNN would actually work reasonably well.)

(5.1.5) Based on your results, which model would you choose? Explain.

The ridge model with the larger lambda. (Using the 1-SE rule.) This model performs much better than the lasso and KNN models. It also achieves the same CV accuracy as the ridge with a smaller lambda, but we prefer the model with a greater penalty, thus less chance of overfitting.

College

(5.2.1) Based on the table, which model do you prefer? Justify your answer.

Random forest, as it achieves the lowest error. Although, predicting college tuition to within 1700 versus 2000 dollars is not all that different, so an argument could be made for a more interpretable model.

(5.2.2) For both of the elastic net models, report the best tuning parameters from caret. For each, is this ridge, lasso, or somewhere in between? If in between, closer to which?

fit_glmnet$bestTune

##   alpha   lambda
## 5   0.1 35.91052

fit_glmnet_int$bestTune

##   alpha   lambda
## 7   0.1 220.8521

Both have an \(\alpha\) value of 0.1, so are in-between, but are closer to ridge.

(5.2.3) Did you scale the predictors when you used KNN? Should you have scaled the predictors when you used KNN?

Yes. Yes. A lower error is found using scaled predictors.

(5.2.4) Of the two KNN models which works better? Can you explain why?

Without interactions seems to work better. Adding all the interactions creates a high dimensional dataset, so we’re suffering from the curse of dimensionality.

(5.2.5) What year is this dataset from? What was out-of-state tuition at UIUC at that time?

According to the documentation, 1995.

College %>% filter(rownames(College) == "University of Illinois - Urbana") %>% select(Outstate)

##   Outstate
## 1     7560

Wow. Remember, this is for out-of-state.

Timing

(5.3.1) Compare the time taken to tune each model. Is the difference between the OOB and CV result for the random forest similar to what you would have expected?

Solution: The speed-up for OOB is only about three times that of 5-fold CV, instead of the five times that would have been expected. There appears to be some additional overhead in using OOB.

rf_cv_time["elapsed"] / rf_oob_time["elapsed"]

##  elapsed 
## 2.217486

(5.3.2) Compare the tuned value of mtry for each of the random forests tuned. Do they choose the same model?

Solution: They choose the same model, although, there were only two to choose from, and they are not very different. In practice, the two methods may differ more.

(5.3.3) Compare the test accuracy of each of the four procedures considered. Briefly explain these results.

Solution:

Logistic: Performs the worst. This is expected as clearly a non-linear decision boundary is needed.
Single Tree: Better than logistic, but not the best seen here. We see above that this is not a very deep tree. It will have non-linear boundaries, but since it uses binary splits, they will be rectangular regions.
Random Forest: First note that both essentially fit the same model. (The exact forests will be different due to randomization.) By using many trees (500) the boundaries will become less rectangular than the single tree, and will better match the spiral data in the data.
See below for plots of decision boundaries created by making predictions from the different models.

Salary

(5.4.1) Report the tuned value of mtry for the random forest.

hit_rf$bestTune

##   mtry
## 4    4

(5.4.2) Create a plot that shows the tuning results for the tuning of the boosted tree model.

plot(hit_gbm)

(5.4.3) Create a plot of the variable importance for the tuned random forest.

(5.4.4) Create a plot of the variable importance for the tuned boosted tree model.

(5.4.5) According to the random forest, what are the three most important predictors?

## [1] "CRBI"  "RBI"   "CRuns"

(5.4.6) According to the boosted model, what are the three most important predictors?

## [1] "CHmRun"  "CRuns"   "PutOuts"

STAT 432 Homework 07

Spring 2018 | Dalpiaz | UIUC

Due: Friday, April 6, 11:59 PM

Exercise 1 (Classifying Leukemia)

Exercise 2 (The Cost of College)

Exercise 3 (Computation Time)

Exercise 4 (Predicting Baseball Salaries)

Exercise 5 (Concept Checks)

Leukemia

College

Timing

Salary