For this homework, you may only use the following packages:

# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)

# specific
library(randomForest)
library(gbm)
library(klaR)
library(ellipse)

Exercise 1 (Tuning KNN Regression with `caret`)

[6 points] For this exercise we will train KNN regression models for the Boston data from the MASS package. Use medv as the response and all other variables as predictors. Use the test-train split given below. When tuning models and reporting cross-validated error, use 5-fold cross-validation.

data(Boston, package = "MASS")
set.seed(1)
bstn_idx = createDataPartition(Boston$medv, p = 0.75, list = FALSE)
bstn_trn = Boston[bstn_idx, ]
bstn_tst = Boston[-bstn_idx, ]

Consider \(k \in \{1, 5, 10, 15, 20, 25, 30, 35\}\) and two pre-processing setups:

Do not scale the predictors.
Do scale the predictors.
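As a rough illustration of what "scaling the predictors" means for a distance-based method like KNN, the sketch below centers and scales each column using training-set statistics and reuses those statistics for the test set. The helper name `standardize_with_train` is hypothetical and is not part of caret; it only mirrors the idea behind `preProcess = c("center", "scale")`.

```r
# Center and scale predictors using statistics from the training data only.
# `standardize_with_train` is a hypothetical helper name, not a caret function.
standardize_with_train = function(trn, tst) {
  trn = as.matrix(trn)
  tst = as.matrix(tst)
  mu  = colMeans(trn)       # training column means
  sdv = apply(trn, 2, sd)   # training column standard deviations
  list(
    trn = scale(trn, center = mu, scale = sdv),
    tst = scale(tst, center = mu, scale = sdv)  # reuse the training statistics
  )
}
```

With the split above, the predictors could be passed as `bstn_trn[, names(bstn_trn) != "medv"]` and `bstn_tst[, names(bstn_tst) != "medv"]`. When `preProcess` is supplied to `train()`, caret handles this step itself, so the helper is only meant to make the transformation explicit.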

Provide plots of cross-validated error versus tuning parameters for both KNN pre-processing setups. Use the same scale on the \(y\) axis for both plots. (You can be lazy and let caret create these plots. Since it uses lattice plotting, putting them side by side or on the same plot would be difficult.)

Solution:

set.seed(1337)
bstn_knnu_mod = train(
  medv ~ .,
  data = bstn_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "knn",
  tuneGrid = expand.grid(k = c(1, 5, 10, 15, 20, 25, 30, 35))
)

set.seed(1337)
bstn_knns_mod = train(
  medv ~ .,
  data = bstn_trn,
  trControl = trainControl(method = "cv", number = 5),
  preProcess = c("center", "scale"),
  method = "knn",
  tuneGrid = expand.grid(k = c(1, 5, 10, 15, 20, 25, 30, 35))
)
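The requested plots are not reproduced here. One possible way to draw both curves on a shared \(y\) scale with base graphics is sketched below. The helper `plot_knn_cv` is hypothetical, and it assumes each argument is a data frame with columns `k` and `RMSE`, which is the layout of the `results` element of a caret `train` object for regression.

```r
# Plot cross-validated RMSE against k for two KNN fits on shared axes.
# `plot_knn_cv` is a hypothetical helper; each argument is assumed to be a
# data frame with columns `k` and `RMSE` (like `model$results` from caret).
plot_knn_cv = function(res_unscaled, res_scaled) {
  # common y range so the two curves are directly comparable
  y_range = range(c(res_unscaled$RMSE, res_scaled$RMSE))
  plot(res_unscaled$k, res_unscaled$RMSE, type = "b", pch = 1,
       col = "dodgerblue", ylim = y_range,
       xlab = "k (number of neighbors)", ylab = "RMSE (cross-validation)")
  lines(res_scaled$k, res_scaled$RMSE, type = "b", pch = 2, col = "darkorange")
  legend("topright", legend = c("Unscaled", "Scaled"),
         pch = c(1, 2), col = c("dodgerblue", "darkorange"), lty = 1)
  invisible(y_range)
}
```

A call such as `plot_knn_cv(bstn_knnu_mod$results, bstn_knns_mod$results)` would use the two models fit above. The lazier route is caret's own `plot()` method, which forwards extra arguments to lattice, so a common `ylim` can be passed to each call.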

Exercise 2 (More Regression with `caret`)

[7 points] For this exercise we will train more regression models for the Boston data from the MASS package. Use medv as the response and all other variables as predictors. Use the test-train split given previously. When tuning models and reporting cross-validated error, use 5-fold cross-validation.

Train a total of three new models:

An additive linear regression
A random forest
- Use the default tuning parameters chosen by caret
A boosted tree model (Use gbm)
- Use the provided tuning grid below

gbm_grid = expand.grid(interaction.depth = c(1, 2, 3),
                       n.trees = (1:20) * 100,
                       shrinkage = c(0.1, 0.3),
                       n.minobsinnode = 20)

Provide plots of error versus tuning parameters for the boosted tree model. Also provide a table that summarizes the cross-validated and test RMSE for each of the three (tuned) models as well as the two models tuned in the previous exercise.

Solution:

set.seed(1337)
bstn_lm_mod = train(
  medv ~ .,
  data = bstn_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "lm"
)

set.seed(1337)
bstn_rf_mod = train(
  medv ~ .,
  data = bstn_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "rf"
)

set.seed(1337)
bstn_gbm_mod = train(
  medv ~ .,
  data = bstn_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "gbm",
  tuneGrid = gbm_grid,
  verbose = FALSE
)

Method	CV RMSE	Test RMSE
Linear Regression	4.92	4.61
KNN, Unscaled	6.66	6.01
KNN, Scaled	4.91	4.28
Random Forest	3.56	3.26
Boosted Trees	4.02	3.59
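The code that computes the test RMSE column is not shown. A minimal sketch of how it might be done follows. The helper names `calc_rmse` and `test_rmse` are assumptions introduced for illustration, and `test_rmse` only assumes the model has a `predict()` method that accepts `newdata`, which caret `train` objects do.

```r
# Root mean squared error between observed and predicted values.
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Test RMSE for a fitted model; `test_rmse` is a hypothetical helper name.
# `model` can be anything with a `predict()` method that accepts `newdata`.
test_rmse = function(model, data, response) {
  calc_rmse(actual    = data[[response]],
            predicted = predict(model, newdata = data))
}
```

For example, `test_rmse(bstn_rf_mod, bstn_tst, "medv")` would give the test RMSE of the random forest. The cross-validated RMSE of a tuned model is stored in the row of `model$results` that matches `model$bestTune`.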

Exercise 3 (Classification with `caret`)

[7 points] For this exercise we will train a number of classifiers using the training data generated below. The categorical response variable is classes and the remaining variables should be used as predictors. When tuning models and reporting cross-validated error, use 10-fold cross-validation. We will not use a test set for this exercise.

set.seed(42)
# simulate data using mlbench
sim_trn = mlbench.2dnormals(n = 500, cl = 7, r = 10, sd = 3)
# create tidy data
sim_trn = data.frame(
  classes = sim_trn$classes,
  sim_trn$x
)

featurePlot(x = sim_trn[, -1], 
            y = sim_trn$classes, 
            plot = "pairs",
            auto.key = list(columns = 2),
            par.settings = list(superpose.symbol = list(pch = 1:9))
)

Fit a total of five models:

LDA
QDA
Naive Bayes
Regularized Discriminant Analysis (RDA)
- Use method rda with caret which requires the klaR package
- Use the default tuning grid
Random Forest
- Use a tuning grid that considers mtry values of 1 and 2

Provide a plot of accuracy versus tuning parameters for the RDA model. Also provide a table that summarizes the cross-validated accuracy and its standard deviation for each of the five (tuned) models.

Solution:

set.seed(1337)
sim_lda_mod = train(
  classes ~ .,
  data = sim_trn,
  trControl = trainControl(method = "cv", number = 10),
  method = "lda"
)

set.seed(1337)
sim_qda_mod = train(
  classes ~ .,
  data = sim_trn,
  trControl = trainControl(method = "cv", number = 10),
  method = "qda"
)

set.seed(1337)
sim_nb_mod = train(
  classes ~ .,
  data = sim_trn,
  trControl = trainControl(method = "cv", number = 10),
  method = "nb"
)

set.seed(1337)
sim_rda_mod = train(
  classes ~ .,
  data = sim_trn,
  trControl = trainControl(method = "cv", number = 10),
  method = "rda"
)

set.seed(1337)
sim_rf_mod = train(
  classes ~ .,
  data = sim_trn,
  trControl = trainControl(method = "cv", number = 10),
  method = "rf",
  tuneGrid = expand.grid(mtry = c(1, 2))
)

Method	CV Acc	SD CV Acc
LDA	0.851	0.049
QDA	0.846	0.055
Naive Bayes	0.849	0.052
RDA	0.853	0.049
RF	0.838	0.069
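The code that assembles this table is not shown. One way it could be built is sketched below, under the assumption that each fit exposes `results` and `bestTune` data frames and that `results` contains `Accuracy` and `AccuracySD` columns, as caret `train` objects for classification do. The helper names `best_cv_row` and `summarize_cv_acc` are hypothetical.

```r
# Pull the resampling summary row for the selected tuning parameters.
# `best_cv_row` is a hypothetical helper; `fit` is assumed to carry `results`
# and `bestTune` data frames, as objects returned by caret's train() do.
best_cv_row = function(fit) {
  merge(fit$results, fit$bestTune)  # joins on the shared tuning-parameter columns
}

# Build a summary table from a named list of fits.
summarize_cv_acc = function(fits) {
  rows = lapply(fits, best_cv_row)
  data.frame(
    Method      = names(fits),
    `CV Acc`    = vapply(rows, function(r) r$Accuracy, numeric(1), USE.NAMES = FALSE),
    `SD CV Acc` = vapply(rows, function(r) r$AccuracySD, numeric(1), USE.NAMES = FALSE),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}
```

A call such as `summarize_cv_acc(list(LDA = sim_lda_mod, QDA = sim_qda_mod, "Naive Bayes" = sim_nb_mod, RDA = sim_rda_mod, RF = sim_rf_mod))` would then return one row per model.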

Exercise 4 (Concept Checks)

[1 point each] Answer the following questions based on your results from the three exercises.

Regression

(a) What value of \(k\) is chosen for KNN without predictor scaling?

bstn_knnu_mod$bestTune$k

## [1] 5

(b) What value of \(k\) is chosen for KNN with predictor scaling?

bstn_knns_mod$bestTune$k

## [1] 10

(c) What are the values of the tuning parameters chosen for the boosted tree regression model?

bstn_gbm_mod$bestTune

##    n.trees interaction.depth shrinkage n.minobsinnode
## 39    1900                 2       0.1             20

(d) Which regression model achieves the lowest cross-validated error?

reg_results[reg_results$`CV RMSE` == min(reg_results$`CV RMSE`), ]

##          Method  CV RMSE Test RMSE
## 4 Random Forest 3.563629  3.255258

(e) Which method achieves the lowest test error?

reg_results[reg_results$`Test RMSE` == min(reg_results$`Test RMSE`), ]

##          Method  CV RMSE Test RMSE
## 4 Random Forest 3.563629  3.255258

Classification

(f) What are the values of the tuning parameters chosen for the RDA model?

sim_rda_mod$bestTune

##   gamma lambda
## 5   0.5    0.5

(g) Based on the scatterplot, which method, LDA or QDA, do you think is more appropriate? Explain.

LDA. The covariance seems to be the same in each class.

(h) Based on the scatterplot, which method, QDA or Naive Bayes, do you think is more appropriate? Explain.

Naive Bayes. The predictors seem to be independent in each class.
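The visual judgments in (g) and (h) can be cross-checked numerically. The sketch below computes per-class covariance and correlation matrices; the helper name `class_cov_summary` is hypothetical. Roughly similar covariance matrices across classes would support LDA over QDA, and within-class correlations near zero would be consistent with the Naive Bayes independence assumption.

```r
# Per-class covariance and correlation matrices of the predictors.
# `class_cov_summary` is a hypothetical helper name.
class_cov_summary = function(x, classes) {
  groups = split(as.data.frame(x), classes)  # one data frame per class
  list(
    cov = lapply(groups, cov),  # compare these matrices across classes
    cor = lapply(groups, cor)   # inspect the off-diagonal entries
  )
}
```

For the simulated data, this could be called as `class_cov_summary(sim_trn[, -1], sim_trn$classes)`.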

(i) Which model achieves the best cross-validated accuracy?

class_results[class_results$`CV Acc` == max(class_results$`CV Acc`), ]

##   Method    CV Acc SD CV Acc
## 4    RDA 0.8534146 0.0488233

(j) Do you believe the model in (i) is the model that should be chosen? Explain.

rda_res = class_results[class_results$`CV Acc` == max(class_results$`CV Acc`), ]
rda_res$`CV Acc` - rda_res$`SD CV Acc`

## [1] 0.8045913

No. The CV accuracies of all the other models are within one SD of the RDA result (that is, above 0.8046). We should pick a less complex model, perhaps LDA or NB.
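The comparison behind this answer can be made explicit for every model at once. The sketch below re-enters the rounded values from the classification summary table; the data frame name `class_tab` and its column names `cv_acc` and `sd_acc` are chosen here for illustration only.

```r
# Values copied (rounded) from the classification summary table.
class_tab = data.frame(
  Method = c("LDA", "QDA", "Naive Bayes", "RDA", "RF"),
  cv_acc = c(0.851, 0.846, 0.849, 0.853, 0.838),
  sd_acc = c(0.049, 0.055, 0.052, 0.049, 0.069)
)

# Threshold: best CV accuracy minus that model's SD (0.853 - 0.049).
best      = which.max(class_tab$cv_acc)
threshold = class_tab$cv_acc[best] - class_tab$sd_acc[best]

# Flag every method whose CV accuracy is at least the threshold.
class_tab$within_one_sd = class_tab$cv_acc >= threshold
class_tab
```

With these rounded values every method is flagged `TRUE`, which is why a simpler model is a defensible choice.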

STAT 432 Homework 06

Spring 2018 | Dalpiaz | UIUC

Due: Friday, March 16, 11:59 PM
