What follows are brief thoughts on each of the three tasks.
# suggested packages
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
library(klaR)
library(gam)
library(e1071)
# feel free to use additional packages
class_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-trn.csv")
class_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-tst.csv")
accuracy = function(actual, predicted) {
mean(actual == predicted)
}
Here we have all numeric predictors and a categorical response, which we coerce to be a factor.
class_trn$y = factor(class_trn$y)
table(class_trn$y)
##
## One Zero
## 454 546
The response variable is `y`, which takes the values `Zero` and `One` and is stored as a factor. Our goal is to classify this response as well as possible. We can see above that there is a slightly higher proportion of `Zero` in the training data.
The following plots will explore the relationship of the ten feature variables with the response variable.
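The plotting code is not shown here. A minimal sketch of one way to produce such plots, assuming `caret`'s `featurePlot` (an assumption; the original approach may differ):

# sketch: density of each feature by class, and boxplots of each feature by class
featurePlot(x = class_trn[, paste0("X", 1:10)], y = class_trn$y, plot = "density",
            scales = list(x = list(relation = "free"), y = list(relation = "free")))
featurePlot(x = class_trn[, paste0("X", 1:10)], y = class_trn$y, plot = "box")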
Based on the above plots we can see that `X6`, `X7`, and `X8` show promise as predictors; however, each would need two cutoffs, that is, a non-linear decision boundary.
The above two plots again reinforce that `X6`, `X7`, and `X8` should be good predictors.
Here we fit a single classification tree. We see that it uses `X6`, `X7`, and `X8` for all of the splits.
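The tree-fitting code is omitted; a sketch using `rpart` and `rpart.plot` (loaded above) that would produce such a tree:

# sketch: fit and plot a single classification tree on the full training data
class_tree = rpart(y ~ ., data = class_trn)
rpart.plot(class_tree)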
The above plot is a variable importance plot generated by fitting a boosted tree model. Again we see that `X6`, `X7`, and `X8` are clearly good predictors.
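The code behind that plot is also omitted. A sketch of one way to generate variable importances from a boosted model, using `gbm` directly (the model and settings here are illustrative, not the original):

# sketch: quick boosted model, then relative influence of each feature
class_trn_gbm = as.data.frame(class_trn)
class_trn_gbm$y = as.numeric(class_trn_gbm$y == "One")  # gbm's bernoulli loss expects a 0/1 response
quick_gbm = gbm(y ~ ., data = class_trn_gbm, distribution = "bernoulli", n.trees = 500)
summary(quick_gbm)  # barplot and table of relative influence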
As a result, we will consider some models that use only `X6`, `X7`, and `X8`.
We will consider five methods:

- Random Forests (`rf`)
- Regularized Discriminant Analysis (`rda`)
- Logistic Regression (`glm`)
- Penalized Logistic Regression (`glmnet`)
- Boosted Trees (`gbm`)

For each method we will consider different sets of features:
- `small`: Only `X6`, `X7`, and `X8`
- `int`: All possible interactions between `X6`, `X7`, and `X8`
- `poly`: All first and second degree terms using `X6`, `X7`, and `X8`
- `full`: All features
- `huge`: All features with all possible two-way interactions

small_form = formula(y ~ X6 + X7 + X8)
int_form = formula(y ~ X6 * X7 * X8)
poly_form = formula(y ~ X6 + X7 + X8 + I(X6 ^ 2) + I(X7 ^ 2) + I(X8 ^ 2))
full_form = formula(y ~ .)
huge_form = formula(y ~ . ^ 2)
`huge` will only be considered with Penalized Logistic Regression (`glmnet`) since it contains a very large number of features. `poly` will only be considered with `glm`.
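To get a sense of how large `huge` actually is, we could count the columns of its model matrix; with 10 numeric predictors this is 1 + 10 + choose(10, 2) = 56 columns. (This check is not part of the original analysis.)

# sketch: number of columns (including intercept) generated by the huge formula
ncol(model.matrix(huge_form, data = class_trn))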
cv_5 = trainControl(method = "cv", number = 5)
Tuning will be done using 5-fold cross-validation. (We could use OOB metrics for Random Forests, but here we sacrifice some computation time so that all methods are compared using the same cross-validation scheme.)
uin = 123456789
set.seed(uin)
rf_small = train(small_form, data = class_trn, method = "rf",
trControl = cv_5, tuneLength = 2)
rf_int = train(int_form, data = class_trn, method = "rf",
trControl = cv_5)
rf_full = train(full_form, data = class_trn, method = "rf",
trControl = cv_5)
Random Forests are fit above using `caret` with `method = "rf"`. For most we use the default tuning grid, which considers a low, medium, and high `mtry`. We specify `tuneLength = 2` for the small feature set to avoid a warning. (With so few predictors, the default settings would result in a tuning grid of length 2 anyway.)
rda_small = train(small_form, data = class_trn, method = "rda",
trControl = cv_5)
rda_int = train(int_form, data = class_trn, method = "rda",
trControl = cv_5)
rda_full = train(full_form, data = class_trn, method = "rda",
trControl = cv_5)
RDA is fit above using `caret` with `method = "rda"`. We simply use the default tuning grid, which will consider both LDA and QDA, as well as some other models.
glm_small = train(small_form, data = class_trn, method = "glm",
trControl = cv_5)
glm_poly = train(poly_form, data = class_trn, method = "glm",
trControl = cv_5)
glm_full = train(full_form, data = class_trn, method = "glm",
trControl = cv_5)
Logistic Regressions are fit above using `caret` with `method = "glm"`, which, with a factor response, automatically uses `family = "binomial"`. There is nothing to tune, but we store the results for comparing cross-validated errors later. One of the models considers quadratic terms for the predictors noted above.
glmn_small = train(small_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_int = train(int_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_full = train(full_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_huge = train(huge_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
Penalized Logistic Regressions are fit above using `caret` with `method = "glmnet"`, which, with a factor response, automatically uses `family = "binomial"`. We simply specify a tuning grid length; `caret` will try 10 \(\alpha\) values and a number of \(\lambda\) values for each.
gbm_grid = expand.grid(interaction.depth = c(1:3), n.trees = 1:3 * 500,
shrinkage = c(0.001, 0.01, 0.1), n.minobsinnode = 10)
gbm_small = train(small_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
gbm_int = train(int_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
gbm_full = train(full_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
Boosted Trees are fit above using `caret` with `method = "gbm"`. We specify a somewhat large custom tuning grid, keeping `interaction.depth` low and using enough trees for the chosen `shrinkage` values.
Method + Features | CV Acc | SD CV Acc |
---|---|---|
gbm_int | 0.809 | 0.046 |
rda_small | 0.808 | 0.007 |
glm_poly | 0.806 | 0.020 |
gbm_small | 0.805 | 0.035 |
gbm_full | 0.804 | 0.022 |
rf_full | 0.803 | 0.026 |
rda_int | 0.797 | 0.014 |
rf_int | 0.792 | 0.033 |
rf_small | 0.788 | 0.016 |
rda_full | 0.769 | 0.028 |
glmn_huge | 0.584 | 0.032 |
glmn_int | 0.570 | 0.012 |
glm_small | 0.556 | 0.023 |
glmn_small | 0.555 | 0.020 |
glmn_full | 0.546 | 0.001 |
glm_full | 0.514 | 0.024 |
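The code that assembled this table is not shown. A sketch of how it could be built from the `caret` fits follows; the `get_best_result` helper and object names below are hypothetical, not from the original:

# sketch: cross-validated accuracy (and its SD) at each model's chosen tuning parameters
get_best_result = function(caret_fit) {
  merge(caret_fit$bestTune, caret_fit$results)
}
model_list = list(gbm_int = gbm_int, rda_small = rda_small, glm_poly = glm_poly)  # etc. for all fits
cv_results = data.frame(
  model = names(model_list),
  cv_acc = sapply(model_list, function(fit) get_best_result(fit)$Accuracy),
  sd_cv_acc = sapply(model_list, function(fit) get_best_result(fit)$AccuracySD)
)
cv_results[order(cv_results$cv_acc, decreasing = TRUE), ]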
Based on the above cross-validated accuracies, we need to decide which model to submit to the autograder. The most obvious choice is `gbm_int`, the boosted tree model using all possible interactions between the selected features. However, it is a rather complicated model with a reasonably large standard deviation compared to the other models. `rda_small` also performs well, but is still a somewhat complicated model. `glm_poly` is very promising, as it is a very simple model with a high cross-validated accuracy. We will select this model.
rda_small$bestTune
## gamma lambda
## 4 0.5 0
gbm_int$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 14 1000 2 0.01 10
Since this was synthetic data, we can actually see the true data generating process.
# define the data generating process
gen_class_data = function(n) {
p = 10
x = round(replicate(p, rnorm(n)), 3)
z = -3 + 1 * x[, 6] ^ 2 + 1 * x[, 7] ^ 2 + 1 * x[, 8] ^ 2
prob = exp(z) / (1 + exp(z))
y = rbinom(length(z), 1, prob)
y = ifelse(y == 1, "One", "Zero")
data.frame(y = as.factor(y), x)
}
As a result, the following function is the best possible model for this data.
f = function(predictors) {
-3 + predictors$X6 ^ 2 + predictors$X7 ^ 2 + predictors$X8 ^ 2
}
This means that our chosen model was very close to the true model. (The signs are flipped due to the coding of the factor variable.)
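The flip happens because `factor()` orders the levels alphabetically, so `glm` (via `caret`) models the probability of the second level, `Zero`, while the data generating process above defines the probability of `One`.

levels(class_trn$y)
## [1] "One"  "Zero"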
glm_poly$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) X6 X7 X8 `I(X6^2)`
## 2.99676 -0.01598 0.07926 0.07175 -0.91487
## `I(X7^2)` `I(X8^2)`
## -1.00952 -1.11753
##
## Degrees of Freedom: 999 Total (i.e. Null); 993 Residual
## Null Deviance: 1378
## Residual Deviance: 856.4 AIC: 870.4
class_tst_y = read_csv("class-tst-y.csv") # this data not released
class_tst_y$y = factor(class_tst_y$y)
Thus the accuracy below is the best we could achieve on the test data.
accuracy(ifelse(f(class_tst) > 0, "One", "Zero"), class_tst_y$y)
## [1] 0.7938
This means that several of the above models did very well. Results with a test accuracy above this value are the result of fitting to the noise of the test data.
accuracy(predict(glm_poly, class_tst), class_tst_y$y)
## [1] 0.7964
reg_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-trn.csv")
reg_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-tst.csv")
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
Here we have a numeric response, and some categorical predictors that we coerce to be factors.
# modify train
reg_trn$X21 = factor(reg_trn$X21)
reg_trn$X22 = factor(reg_trn$X22)
# modify test to match
reg_tst$X21 = factor(reg_tst$X21)
reg_tst$X22 = factor(reg_tst$X22)
histogram(reg_trn$y)
The response variable is `y`. Our goal is to predict this response as well as possible.
The following plots will explore the relationship of the twenty feature variables with the response variable.
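As before, the plotting code is omitted; a sketch (again assuming `featurePlot`) for the numeric features:

# sketch: scatterplots of each numeric feature against the response
featurePlot(x = reg_trn[, paste0("X", 1:20)], y = reg_trn$y, plot = "scatter")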
From the above plots, features `X1`, `X2`, `X3`, `X4`, and `X6` show potential.
Here we fit a single regression tree. We see that it uses `X3`, `X4`, `X6`, and `X22` exclusively.
The above plot is a variable importance plot generated by fitting a boosted tree model. Again we see that `X6` and `X22` are by far the best predictors. Here we see that `X1` and `X2` are not important. Why?
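The matrix below contains the pairwise correlations of the first six features; a call along these lines (an assumption about the original code) produces it:

# correlations among the first six numeric features
cor(reg_trn[, paste0("X", 1:6)])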
## X1 X2 X3 X4 X5 X6
## X1 1.00000000 0.78539221 0.77451418 0.76908252 -0.06340441 -0.01270533
## X2 0.78539221 1.00000000 0.75521760 0.73765804 -0.06386369 -0.03333847
## X3 0.77451418 0.75521760 1.00000000 0.76200946 -0.05414591 -0.04140397
## X4 0.76908252 0.73765804 0.76200946 1.00000000 -0.04550336 -0.05007625
## X5 -0.06340441 -0.06386369 -0.05414591 -0.04550336 1.00000000 0.05939235
## X6 -0.01270533 -0.03333847 -0.04140397 -0.05007625 0.05939235 1.00000000
It seems that `X1` and `X2` are highly correlated with `X3` and `X4`, and so must not add much in terms of prediction beyond them.
As a result, we will consider some models that use only `X3`, `X4`, `X6`, and `X22`.
We will consider four methods:

- Random Forests (`rf`)
- Linear Regression (`lm`)
- Penalized Linear Regression (`glmnet`)
- Generalized Additive Models (`gam`)

For each method we will consider different sets of features:
- `small`: Only `X3`, `X4`, `X6`, and `X22`
- `int`: All possible interactions between `X3`, `X4`, `X6`, and `X22`
- `full`: All features
- `huge`: All features with all possible two-way interactions

small_form = formula(y ~ X3 + X4 + X6 + X22)
int_form = formula(y ~ X3 * X4 * X6 * X22)
full_form = formula(y ~ .)
huge_form = formula(y ~ . ^ 2)
`huge` will only be considered with Penalized Linear Regression (`glmnet`) since it contains a very large number of features.
cv_5 = trainControl(method = "cv", number = 5, verboseIter = FALSE)
Tuning will be done using 5-fold cross-validation. (We could use OOB metrics for Random Forests, but here we sacrifice some computation time so that all methods are compared using the same cross-validation scheme.)
rf_small = train(small_form, data = reg_trn, method = "rf",
trControl = cv_5, tuneLength = 2)
rf_int = train(int_form, data = reg_trn, method = "rf",
trControl = cv_5)
rf_full = train(full_form, data = reg_trn, method = "rf",
trControl = cv_5)
Random Forests are fit above using `caret` with `method = "rf"`. For most we use the default tuning grid, which considers a low, medium, and high `mtry`. We specify `tuneLength = 2` for the small feature set to avoid a warning.
lm_small = train(small_form, data = reg_trn, method = "lm", trControl = cv_5)
lm_int = train(int_form, data = reg_trn, method = "lm", trControl = cv_5)
lm_full = train(full_form, data = reg_trn, method = "lm", trControl = cv_5)
Linear Regressions are fit above using `caret` with `method = "lm"`. There is nothing to tune, but we store the results for comparing cross-validated errors later.
glmn_small = train(small_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_int = train(int_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_full = train(full_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_huge = train(huge_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
Penalized Linear Regressions are fit above using `caret` with `method = "glmnet"`. We simply specify a tuning grid length; `caret` will try 10 \(\alpha\) values and a number of \(\lambda\) values for each.
gam_grid = expand.grid(df = seq(1, 5, by = 0.5))
gam_small = train(y ~ X3 + X4 + X6 + X22, data = reg_trn, method = "gamSpline",
verbose = FALSE, trControl = cv_5, tuneGrid = gam_grid)
gam_full = train(y ~ ., data = reg_trn, method = "gamSpline",
verbose = FALSE, trControl = cv_5, tuneGrid = gam_grid)
Generalized Additive Models are fit above using `caret` with `method = "gamSpline"`. We specify a number of degrees of freedom to try. Note that our ability to tune these models with `caret` is very limited. (For example, each term must use the same degrees of freedom, and we cannot add parametric terms.)
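For contrast, fitting directly with the `gam` package allows different degrees of freedom per smooth term as well as parametric terms. The model below is only a sketch (the terms and `df` values are illustrative), not part of the tuned analysis:

# sketch: smoothing splines with per-term df, plus parametric terms for X6 and X22
gam_direct = gam(y ~ s(X3, df = 4) + s(X4, df = 2) + X6 + X22, data = reg_trn)
summary(gam_direct)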
Method + Features | CV RMSE | SD CV RMSE |
---|---|---|
lm_int | 1.526 | 0.098 |
glmn_int | 1.528 | 0.117 |
glmn_huge | 1.567 | 0.083 |
gam_small | 1.806 | 0.068 |
lm_small | 1.814 | 0.074 |
glmn_small | 1.820 | 0.055 |
glmn_full | 1.837 | 0.149 |
lm_full | 1.855 | 0.208 |
gam_full | 1.856 | 0.073 |
rf_int | 2.124 | 0.126 |
rf_small | 2.296 | 0.104 |
rf_full | 3.105 | 0.202 |
Based on the above cross-validated RMSEs, our choice is somewhat obvious. `lm_int` has both the lowest cross-validated RMSE and reasonably low variability. However, `glmn_int` has very similar performance and actually uses fewer predictors, so we give it a slight preference.
glmn_int$bestTune
## alpha lambda
## 94 1 0.03180609
coef(glmn_int$finalModel, s = glmn_int$bestTune$lambda)
## 24 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.13546667
## X3 1.83483990
## X4 2.73954172
## X6 -0.99636988
## X22E -0.89957323
## X22F -0.79507595
## X3:X4 0.02761843
## X3:X6 .
## X4:X6 .
## X3:X22E .
## X3:X22F -0.05460996
## X4:X22E -1.80091379
## X4:X22F -1.72068483
## X6:X22E .
## X6:X22F .
## X3:X4:X6 .
## X3:X4:X22E .
## X3:X4:X22F .
## X3:X6:X22E .
## X3:X6:X22F .
## X4:X6:X22E .
## X4:X6:X22F .
## X3:X4:X6:X22E .
## X3:X4:X6:X22F .
Since this was synthetic data, we can actually see the true data generating process.
# define the data generating process
gen_reg_data = function(n) {
p = 20
mu = c(1, 2, 3, 4)
sigma = matrix(.9, nrow = 4, ncol = 4) + diag(4) * .3
x_corr = mvrnorm(n, mu = mu, Sigma = sigma)
x_uncorr = replicate(p - 4, rnorm(n, sample(1:(p - 4)), 3))
x_cat_1 = sample(c("A", "B", "C"), n, replace = TRUE)
x_cat_2 = sample(c("D", "E", "F"), n, replace = TRUE)
x = cbind(x_corr, x_uncorr)
x = data.frame(round(x, 3))
x = cbind(x, X21 = x_cat_1, X22 = x_cat_2)
y = round(2 * x[, 3] + x[, 4] + 2 * x[, 4] * (x[, 22] == "D") - 1 * x[, 6] + rnorm(n, 0, 1.5), 3)
data.frame(y, x)
}
As a result, the following function is the best possible model for this data.
f = function(predictors) {
2 * predictors$X3 + predictors$X4 + 2 * predictors$X4 * (predictors$X22 == "D") - 1 * predictors$X6
}
reg_tst_y = read_csv("reg-tst-y.csv") # this data not released
Thus the RMSE below is the best we could achieve on the test data.
rmse(f(reg_tst), reg_tst_y$y)
## [1] 1.501682
However, in general an RMSE of about 1.5 is what we would hope to achieve, since the noise added to the data has a standard deviation of 1.5. The method we used gets very close.
While `lm_int` performs best in terms of cross-validated RMSE, it turns out that `glmn_int` is closer to the true model, as it sets some of the unneeded predictors from `lm_int` to zero. On the test set, `glmn_int` performs slightly better.
rmse(predict(lm_int, reg_tst), reg_tst_y$y)
## [1] 1.520896
rmse(predict(glmn_int, reg_tst), reg_tst_y$y)
## [1] 1.507915
spam_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-trn.csv")
spam_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-tst.csv")
spam_trn$type = factor(spam_trn$type)
score = function(actual, predicted) {
1 * sum(predicted == "spam" & actual == "spam") +
-25 * sum(predicted == "spam" & actual == "nonspam") +
-1 * sum(predicted == "nonspam" & actual == "spam") +
2 * sum(predicted == "nonspam" & actual == "nonspam")
}
For this problem, we simply cut to the chase, and utilize a random forest. This time, we do split off some “test” data.
set.seed(uin)
spam_trn_idx = createDataPartition(spam_trn$type, p = 0.80, list = FALSE)
spam_trn_trn = spam_trn[spam_trn_idx, ]
spam_trn_tst = spam_trn[-spam_trn_idx, ]
# tune for best accuracy, use OOB metrics
rf_mod_spamn_acc = train(type ~ ., data = spam_trn_trn,
method = "rf",
trControl = trainControl(method = "oob"))
# tune for best ROC, caret does not allow OOB, so we use CV
rf_mod_spamn_roc = train(type ~ ., data = spam_trn_trn,
method = "rf",
trControl = trainControl(method = "cv",
classProbs = TRUE,
summaryFunction = twoClassSummary),
metric = "ROC")
cutoffs = seq(0.5, 1, by = 0.01)
score_acc = rep(0, length(cutoffs))
score_roc = rep(0, length(cutoffs))
# test cutoffs for accuracy tuned forest
for (c in seq_along(cutoffs)) {
pred = ifelse(predict(rf_mod_spamn_acc, spam_trn_tst, type = "prob")$spam > cutoffs[c],
"spam", "nonspam")
score_acc[c] = score(predicted = pred,
actual = spam_trn_tst$type)
}
# test cutoffs for ROC tuned forest
for (c in seq_along(cutoffs)) {
pred = ifelse(predict(rf_mod_spamn_roc, spam_trn_tst, type = "prob")$spam > cutoffs[c],
"spam", "nonspam")
score_roc[c] = score(predicted = pred,
actual = spam_trn_tst$type)
}
We see that the forest tuned for ROC performs ever so slightly better. (It hardly seems worth the extra time spent tuning that model.) As expected, we need to increase the cutoff for predicting spam.
(cut_to_use = cutoffs[which.max(score_roc)])
## [1] 0.8
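A quick plot of score against cutoff (not part of the original output) makes both the comparison of the two forests and the chosen cutoff easy to see:

# sketch: held-out score as a function of the cutoff for P(spam)
plot(cutoffs, score_roc, type = "l", col = "dodgerblue",
     ylim = range(score_acc, score_roc),
     xlab = "cutoff for P(spam)", ylab = "score on held-out data")
lines(cutoffs, score_acc, col = "darkorange")
abline(v = cut_to_use, lty = 2)
legend("bottomleft", legend = c("ROC-tuned", "accuracy-tuned"),
       col = c("dodgerblue", "darkorange"), lty = 1)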
It is certainly possible that with more thought we could discover a better filter, but this analysis uses the key concept: modifying the classification cutoff to reflect the asymmetric scoring.
spam_tst_y = read_csv("spam_tst-y.csv") # data not provided
spam_tst_y$type = factor(spam_tst_y$type)
pred_prob = predict(rf_mod_spamn_roc, spam_tst, type = "prob")$spam
pred_class = ifelse(pred_prob > cut_to_use, "spam", "nonspam")
score(actual = spam_tst_y$type, predicted = pred_class) # score on autograder
## [1] 2902
score(actual = spam_tst_y$type, predicted = predict(rf_mod_spamn_roc, spam_tst))
## [1] 2238
accuracy(actual = spam_tst_y$type, predicted = pred_class)
## [1] 0.8604348
accuracy(actual = spam_tst_y$type, predicted = predict(rf_mod_spamn_roc, spam_tst))
## [1] 0.9334783
As expected, we see a decrease in accuracy in exchange for an increase in score.
varImpPlot(rf_mod_spamn_roc$finalModel, n.var = 10)
Looking at the ten most important variables, we see that things like exclamation marks, capital letters, and dollar signs are good predictors. Remember, this is an old dataset collected from HP employees. (Note that `hp` also appears to be important.) This would be a terrible spam classifier for general email data today.
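The same information can be pulled as a table instead of a plot (a quick sketch):

# sketch: the ten largest variable importances from the final forest, as a named vector
head(sort(importance(rf_mod_spamn_roc$finalModel)[, 1], decreasing = TRUE), 10)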