What follows are brief thoughts on each of the three tasks.
# suggested packages
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
library(klaR)
library(gam)
library(e1071)
# feel free to use additional packages
class_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-trn.csv")
class_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-tst.csv")
accuracy = function(actual, predicted) {
mean(actual == predicted)
}
Here we have all numeric predictors and a categorical response, which we coerce to be a factor.
class_trn$y = factor(class_trn$y)
table(class_trn$y)
##
## One Zero
## 454 546
The response variable is `y`, which takes the values `Zero` and `One` and is stored as a factor. Our goal is to classify this response as well as possible. We can see above that there is a slightly higher proportion of `Zero` in the training data.
The following plots will explore the relationship of the ten feature variables with the response variable.
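The plotting code is not shown here. A minimal sketch of one way to produce such plots, assuming `caret`'s `featurePlot` (an assumption; the original approach may differ):

# sketch: density of each feature by class, and boxplots of each feature by class
featurePlot(x = class_trn[, paste0("X", 1:10)], y = class_trn$y, plot = "density",
            scales = list(x = list(relation = "free"), y = list(relation = "free")))
featurePlot(x = class_trn[, paste0("X", 1:10)], y = class_trn$y, plot = "box")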
Based on the above plots we can see that `X6`, `X7`, and `X8` show promise as predictors; however, each would need two cutoffs, that is, a non-linear decision boundary.
The above two plots again reinforce that `X6`, `X7`, and `X8` should be good predictors.
Here we fit a single classification tree. We see that it uses `X6`, `X7`, and `X8` for all of the splits.
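The tree-fitting code is omitted; a sketch using `rpart` and `rpart.plot` (loaded above) that would produce such a tree:

# sketch: fit and plot a single classification tree on the full training data
class_tree = rpart(y ~ ., data = class_trn)
rpart.plot(class_tree)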
The above plot is a variable importance plot generated by fitting a boosted tree model. Again we see that `X6`, `X7`, and `X8` are clearly good predictors.
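The code behind that plot is also omitted. A sketch of one way to generate variable importances from a boosted model, using `gbm` directly (the model and settings here are illustrative, not the original):

# sketch: quick boosted model, then relative influence of each feature
class_trn_gbm = as.data.frame(class_trn)
class_trn_gbm$y = as.numeric(class_trn_gbm$y == "One")  # gbm's bernoulli loss expects a 0/1 response
quick_gbm = gbm(y ~ ., data = class_trn_gbm, distribution = "bernoulli", n.trees = 500)
summary(quick_gbm)  # barplot and table of relative influence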
As a result, we will consider some models that use only `X6`, `X7`, and `X8`.
We will consider five methods:

- Random Forests (`rf`)
- Regularized Discriminant Analysis (`rda`)
- Logistic Regression (`glm`)
- Penalized Logistic Regression (`glmnet`)
- Boosted Trees (`gbm`)

For each method we will consider different sets of features:
- `small`: Only `X6`, `X7`, and `X8`
- `int`: All possible interactions between `X6`, `X7`, and `X8`
- `poly`: All first and second degree terms using `X6`, `X7`, and `X8`
- `full`: All features
- `huge`: All features with all possible two-way interactions

small_form = formula(y ~ X6 + X7 + X8)
int_form = formula(y ~ X6 * X7 * X8)
poly_form = formula(y ~ X6 + X7 + X8 + I(X6 ^ 2) + I(X7 ^ 2) + I(X8 ^ 2))
full_form = formula(y ~ .)
huge_form = formula(y ~ . ^ 2)
`huge` will only be considered with Penalized Logistic Regression (`glmnet`) since it contains a very large number of features. `poly` will only be considered with `glm`.
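To get a sense of how large `huge` actually is, we could count the columns of its model matrix; with 10 numeric predictors this is 1 + 10 + choose(10, 2) = 56 columns. (This check is not part of the original analysis.)

# sketch: number of columns (including intercept) generated by the huge formula
ncol(model.matrix(huge_form, data = class_trn))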
cv_5 = trainControl(method = "cv", number = 5)
Tuning will be done using 5-fold cross-validation. (We could use OOB metrics for Random Forests, but here we sacrifice some computation time so that all methods are compared using the same cross-validation scheme.)
uin = 123456789
set.seed(uin)
rf_small = train(small_form, data = class_trn, method = "rf",
trControl = cv_5, tuneLength = 2)
rf_int = train(int_form, data = class_trn, method = "rf",
trControl = cv_5)
rf_full = train(full_form, data = class_trn, method = "rf",
trControl = cv_5)
Random Forests are fit above using `caret` with `method = "rf"`. For most we use the default tuning grid, which considers a low, medium, and high `mtry`. We specify `tuneLength = 2` for the small feature set to avoid a warning. (With so few predictors, the default settings would result in a tuning grid of length 2 anyway.)
rda_small = train(small_form, data = class_trn, method = "rda",
trControl = cv_5)
rda_int = train(int_form, data = class_trn, method = "rda",
trControl = cv_5)
rda_full = train(full_form, data = class_trn, method = "rda",
trControl = cv_5)
RDA is fit above using `caret` with `method = "rda"`. We simply use the default tuning grid, which will consider both LDA and QDA, as well as some other models.
glm_small = train(small_form, data = class_trn, method = "glm",
trControl = cv_5)
glm_poly = train(poly_form, data = class_trn, method = "glm",
trControl = cv_5)
glm_full = train(full_form, data = class_trn, method = "glm",
trControl = cv_5)
Logistic Regressions are fit above using `caret` with `method = "glm"`, which, with a factor response, automatically uses `family = "binomial"`. There is nothing to tune, but we store the results for comparing cross-validated errors later. One of the models considers quadratic terms for the predictors noted above.
glmn_small = train(small_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_int = train(int_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_full = train(full_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_huge = train(huge_form, data = class_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
Penalized Logistic Regressions are fit above using `caret` with `method = "glmnet"`, which, with a factor response, automatically uses `family = "binomial"`. We simply specify a tuning grid length; `caret` will try 10 \(\alpha\) values and a number of \(\lambda\) values for each.
gbm_grid = expand.grid(interaction.depth = c(1:3), n.trees = 1:3 * 500,
shrinkage = c(0.001, 0.01, 0.1), n.minobsinnode = 10)
gbm_small = train(small_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
gbm_int = train(int_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
gbm_full = train(full_form, data = class_trn, method = "gbm",
verbose = FALSE, trControl = cv_5, tuneGrid = gbm_grid)
Boosted Trees are fit above using `caret` with `method = "gbm"`. We specify a somewhat large custom tuning grid, keeping `interaction.depth` low and using enough trees for the chosen `shrinkage` values.
Method + Features | CV Acc | SD CV Acc |
---|---|---|
gbm_int | 0.809 | 0.046 |
rda_small | 0.808 | 0.007 |
glm_poly | 0.806 | 0.020 |
gbm_small | 0.805 | 0.035 |
gbm_full | 0.804 | 0.022 |
rf_full | 0.803 | 0.026 |
rda_int | 0.797 | 0.014 |
rf_int | 0.792 | 0.033 |
rf_small | 0.788 | 0.016 |
rda_full | 0.769 | 0.028 |
glmn_huge | 0.584 | 0.032 |
glmn_int | 0.570 | 0.012 |
glm_small | 0.556 | 0.023 |
glmn_small | 0.555 | 0.020 |
glmn_full | 0.546 | 0.001 |
glm_full | 0.514 | 0.024 |
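The code that assembled this table is not shown. A sketch of how it could be built from the `caret` fits follows; the `get_best_result` helper and object names below are hypothetical, not from the original:

# sketch: cross-validated accuracy (and its SD) at each model's chosen tuning parameters
get_best_result = function(caret_fit) {
  merge(caret_fit$bestTune, caret_fit$results)
}
model_list = list(gbm_int = gbm_int, rda_small = rda_small, glm_poly = glm_poly)  # etc. for all fits
cv_results = data.frame(
  model = names(model_list),
  cv_acc = sapply(model_list, function(fit) get_best_result(fit)$Accuracy),
  sd_cv_acc = sapply(model_list, function(fit) get_best_result(fit)$AccuracySD)
)
cv_results[order(cv_results$cv_acc, decreasing = TRUE), ]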
Based on the above cross-validated accuracies, we need to decide which model to submit to the autograder. The most obvious choice is `gbm_int`, the boosted tree model using all possible interactions between the selected features. However, it is a rather complicated model with a reasonably large standard deviation compared to the other models. `rda_small` also performs well, but is still a somewhat complicated model. `glm_poly` is very promising, as it is a very simple model with a high cross-validated accuracy. We will select this model.
rda_small$bestTune
## gamma lambda
## 4 0.5 0
gbm_int$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 14 1000 2 0.01 10
Since this was synthetic data, we can actually see the true data generating process.
# define the data generating process
gen_class_data = function(n) {
p = 10
x = round(replicate(p, rnorm(n)), 3)
z = -3 + 1 * x[, 6] ^ 2 + 1 * x[, 7] ^ 2 + 1 * x[, 8] ^ 2
prob = exp(z) / (1 + exp(z))
y = rbinom(length(z), 1, prob)
y = ifelse(y == 1, "One", "Zero")
data.frame(y = as.factor(y), x)
}
As a result, the following function is the best possible model for this data.
f = function(predictors) {
-3 + predictors$X6 ^ 2 + predictors$X7 ^ 2 + predictors$X8 ^ 2
}
This means that our chosen model was very close to the true model. (The signs are flipped due to the coding of the factor variable.)
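The flip happens because `factor()` orders the levels alphabetically, so `glm` (via `caret`) models the probability of the second level, `Zero`, while the data generating process above defines the probability of `One`.

levels(class_trn$y)
## [1] "One"  "Zero"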
glm_poly$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) X6 X7 X8 `I(X6^2)`
## 2.99676 -0.01598 0.07926 0.07175 -0.91487
## `I(X7^2)` `I(X8^2)`
## -1.00952 -1.11753
##
## Degrees of Freedom: 999 Total (i.e. Null); 993 Residual
## Null Deviance: 1378
## Residual Deviance: 856.4 AIC: 870.4
class_tst_y = read_csv("class-tst-y.csv") # this data not released
class_tst_y$y = factor(class_tst_y$y)
Thus the accuracy below is the best we could achieve on the test data.
accuracy(ifelse(f(class_tst) > 0, "One", "Zero"), class_tst_y$y)
## [1] 0.7938
This means that several of the above models did very well. Results with a test accuracy above this value are the result of fitting to the noise of the test data.
accuracy(predict(glm_poly, class_tst), class_tst_y$y)
## [1] 0.7964
reg_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-trn.csv")
reg_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-tst.csv")
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
Here we have a numeric response, and some categorical predictors that we coerce to be factors.
# modify train
reg_trn$X21 = factor(reg_trn$X21)
reg_trn$X22 = factor(reg_trn$X22)
# modify test to match
reg_tst$X21 = factor(reg_tst$X21)
reg_tst$X22 = factor(reg_tst$X22)
histogram(reg_trn$y)
The response variable is `y`. Our goal is to predict this response as well as possible.
The following plots will explore the relationship of the twenty feature variables with the response variable.
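As before, the plotting code is omitted; a sketch (again assuming `featurePlot`) for the numeric features:

# sketch: scatterplots of each numeric feature against the response
featurePlot(x = reg_trn[, paste0("X", 1:20)], y = reg_trn$y, plot = "scatter")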
From the above plots, features `X1`, `X2`, `X3`, `X4`, and `X6` show potential.
Here we fit a single regression tree. We see that it uses `X3`, `X4`, `X6`, and `X22` exclusively.
The above plot is a variable importance plot generated by fitting a boosted tree model. Again we see that `X6` and `X22` are by far the best predictors. Here we see that `X1` and `X2` are not important. Why?
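The matrix below contains the pairwise correlations of the first six features; a call along these lines (an assumption about the original code) produces it:

# correlations among the first six numeric features
cor(reg_trn[, paste0("X", 1:6)])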
## X1 X2 X3 X4 X5 X6
## X1 1.00000000 0.78539221 0.77451418 0.76908252 -0.06340441 -0.01270533
## X2 0.78539221 1.00000000 0.75521760 0.73765804 -0.06386369 -0.03333847
## X3 0.77451418 0.75521760 1.00000000 0.76200946 -0.05414591 -0.04140397
## X4 0.76908252 0.73765804 0.76200946 1.00000000 -0.04550336 -0.05007625
## X5 -0.06340441 -0.06386369 -0.05414591 -0.04550336 1.00000000 0.05939235
## X6 -0.01270533 -0.03333847 -0.04140397 -0.05007625 0.05939235 1.00000000
It seems that `X1` and `X2` are highly correlated with `X3` and `X4`, and so must not add much in terms of prediction beyond them.
As a result, we will consider some models that use only `X3`, `X4`, `X6`, and `X22`.
We will consider four methods:

- Random Forests (`rf`)
- Linear Regression (`lm`)
- Penalized Linear Regression (`glmnet`)
- Generalized Additive Models (`gam`)

For each method we will consider different sets of features:
- `small`: Only `X3`, `X4`, `X6`, and `X22`
- `int`: All possible interactions between `X3`, `X4`, `X6`, and `X22`
- `full`: All features
- `huge`: All features with all possible two-way interactions

small_form = formula(y ~ X3 + X4 + X6 + X22)
int_form = formula(y ~ X3 * X4 * X6 * X22)
full_form = formula(y ~ .)
huge_form = formula(y ~ . ^ 2)
`huge` will only be considered with Penalized Linear Regression (`glmnet`) since it contains a very large number of features.
cv_5 = trainControl(method = "cv", number = 5, verboseIter = FALSE)
Tuning will be done using 5-fold cross-validation. (We could use OOB metrics for Random Forests, but here we sacrifice some computation time so that all methods are compared using the same cross-validation scheme.)
rf_small = train(small_form, data = reg_trn, method = "rf",
trControl = cv_5, tuneLength = 2)
rf_int = train(int_form, data = reg_trn, method = "rf",
trControl = cv_5)
rf_full = train(full_form, data = reg_trn, method = "rf",
trControl = cv_5)
Random Forests are fit above using `caret` with `method = "rf"`. For most we use the default tuning grid, which considers a low, medium, and high `mtry`. We specify `tuneLength = 2` for the small feature set to avoid a warning.
lm_small = train(small_form, data = reg_trn, method = "lm", trControl = cv_5)
lm_int = train(int_form, data = reg_trn, method = "lm", trControl = cv_5)
lm_full = train(full_form, data = reg_trn, method = "lm", trControl = cv_5)
Linear Regressions are fit above using `caret` with `method = "lm"`. There is nothing to tune, but we store the results for comparing cross-validated errors later.
glmn_small = train(small_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_int = train(int_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_full = train(full_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
glmn_huge = train(huge_form, data = reg_trn, method = "glmnet",
trControl = cv_5, tuneLength = 10)
Penalized Linear Regressions are fit above using `caret` with `method = "glmnet"`. We simply specify a tuning grid length; `caret` will try 10 \(\alpha\) values and a number of \(\lambda\) values for each.
gam_grid = expand.grid(df = seq(1, 5, by = 0.5))
gam_small = train(y ~ X3 + X4 + X6 + X22, data = reg_trn, method = "gamSpline",
verbose = FALSE, trControl = cv_5, tuneGrid = gam_grid)
gam_full = train(y ~ ., data = reg_trn, method = "gamSpline",
verbose = FALSE, trControl = cv_5, tuneGrid = gam_grid)
Generalized Additive Models are fit above using `caret` with `method = "gamSpline"`. We specify a number of degrees of freedom to try. Note that our ability to tune these models with `caret` is very limited. (For example, each term must use the same degrees of freedom, and we cannot add parametric terms.)
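For contrast, fitting directly with the `gam` package allows different degrees of freedom per smooth term as well as parametric terms. The model below is only a sketch (the terms and `df` values are illustrative), not part of the tuned analysis:

# sketch: smoothing splines with per-term df, plus parametric terms for X6 and X22
gam_direct = gam(y ~ s(X3, df = 4) + s(X4, df = 2) + X6 + X22, data = reg_trn)
summary(gam_direct)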
Method + Features | CV RMSE | SD CV RMSE |
---|---|---|
lm_int | 1.526 | 0.098 |
glmn_int | 1.528 | 0.117 |
glmn_huge | 1.567 | 0.083 |
gam_small | 1.806 | 0.068 |
lm_small | 1.814 | 0.074 |
glmn_small | 1.820 | 0.055 |
glmn_full | 1.837 | 0.149 |
lm_full | 1.855 | 0.208 |
gam_full | 1.856 | 0.073 |
rf_int | 2.124 | 0.126 |
rf_small | 2.296 | 0.104 |
rf_full | 3.105 | 0.202 |
Based on the above cross-validated RMSEs, our choice is somewhat obvious. `lm_int` has both the lowest cross-validated RMSE and reasonably low variability. However, `glmn_int` has very similar performance and actually uses fewer predictors, so we give it a slight preference.
glmn_int$bestTune
## alpha lambda
## 94 1 0.03180609
coef(glmn_int$finalModel, s = glmn_int$bestTune$lambda)
## 24 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.13546667
## X3 1.83483990
## X4 2.73954172
## X6 -0.99636988
## X22E -0.89957323
## X22F -0.79507595
## X3:X4 0.02761843
## X3:X6 .
## X4:X6 .
## X3:X22E .
## X3:X22F -0.05460996
## X4:X22E -1.80091379
## X4:X22F -1.72068483
## X6:X22E .
## X6:X22F .
## X3:X4:X6 .
## X3:X4:X22E .
## X3:X4:X22F .
## X3:X6:X22E .
## X3:X6:X22F .
## X4:X6:X22E .
## X4:X6:X22F .
## X3:X4:X6:X22E .
## X3:X4:X6:X22F .
Since this was synthetic data, we can actually see the true data generating process.
# define the data generating process
gen_reg_data = function(n) {
p = 20
mu = c(1, 2, 3, 4)
sigma = matrix(.9, nrow = 4, ncol = 4) + diag(4) * .3
x_corr = mvrnorm(n, mu = mu, Sigma = sigma)
x_uncorr = replicate(p - 4, rnorm(n, sample(1:(p - 4)), 3))
x_cat_1 = sample(c("A", "B", "C"), n, replace = TRUE)
x_cat_2 = sample(c("D", "E", "F"), n, replace = TRUE)
x = cbind(x_corr, x_uncorr)
x = data.frame(round(x, 3))
x = cbind(x, X21 = x_cat_1, X22 = x_cat_2)
y = round(2 * x[, 3] + x[, 4] + 2 * x[, 4] * (x[, 22] == "D") - 1 * x[, 6] + rnorm(n, 0, 1.5), 3)
data.frame(y, x)
}
As a result, the following function is the best possible model for this data.
f = function(predictors) {
2 * predictors$X3 + predictors$X4 + 2 * predictors$X4 * (predictors$X22 == "D") - 1 * predictors$X6
}
reg_tst_y = read_csv("reg-tst-y.csv") # this data not released
Thus the RMSE below is the best we could achieve on the test data.
rmse(f(reg_tst), reg_tst_y$y)
## [1] 1.501682
However, in general an RMSE of about 1.5 is what we would hope to achieve, since the noise added to the data has a standard deviation of 1.5. The method we used gets very close.
While `lm_int` performs best in terms of cross-validated RMSE, it turns out that `glmn_int` is closer to the true model, as it sets some of the unneeded predictors from `lm_int` to zero. On the test set, `glmn_int` performs slightly better.
rmse(predict(lm_int, reg_tst), reg_tst_y$y)
## [1] 1.520896
rmse(predict(glmn_int, reg_tst), reg_tst_y$y)
## [1] 1.507915
spam_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-trn.csv")
spam_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-tst.csv")
spam_trn$type = factor(spam_trn$type)
score = function(actual, predicted) {
1 * sum(predicted == "spam" & actual == "spam") +
-25 * sum(predicted == "spam" & actual == "nonspam") +
-1 * sum(predicted == "nonspam" & actual == "spam") +
2 * sum(predicted == "nonspam" & actual == "nonspam")
}
For this problem, we simply cut to the chase, and utilize a random forest. This time, we do split off some “test” data.
set.seed(uin)
spam_trn_idx = createDataPartition(spam_trn$type, p = 0.80, list = FALSE)
spam_trn_trn = spam_trn[spam_trn_idx, ]
spam_trn_tst = spam_trn[-spam_trn_idx, ]
# tune for best accuracy, use OOB metrics
rf_mod_spamn_acc = train(type ~ ., data = spam_trn_trn,
method = "rf",
trControl = trainControl(method = "oob"))
# tune for best ROC, caret does not allow OOB, so we use CV
rf_mod_spamn_roc = train(type ~ ., data = spam_trn_trn,
method = "rf",
trControl = trainControl(method = "cv",
classProbs = TRUE,
summaryFunction = twoClassSummary),
metric = "ROC")
cutoffs = seq(0.5, 1, by = 0.01)
score_acc = rep(0, length(cutoffs))
score_roc = rep(0, length(cutoffs))
# test cutoffs for accuracy tuned forest
for (c in seq_along(cutoffs)) {
pred = ifelse(predict(rf_mod_spamn_acc, spam_trn_tst, type = "prob")$spam > cutoffs[c],
"spam", "nonspam")
score_acc[c] = score(predicted = pred,
actual = spam_trn_tst$type)
}
# test cutoffs for ROC tuned forest
for (c in seq_along(cutoffs)) {
pred = ifelse(predict(rf_mod_spamn_roc, spam_trn_tst, type = "prob")$spam > cutoffs[c],
"spam", "nonspam")
score_roc[c] = score(predicted = pred,
actual = spam_trn_tst$type)
}
We see that the forest tuned for ROC performs ever so slightly better. (It hardly seems worth the extra time spent tuning that model.) As expected, we need to increase the cutoff for predicting spam.
(cut_to_use = cutoffs[which.max(score_roc)])
## [1] 0.8
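A quick plot of score against cutoff (not part of the original output) makes both the comparison of the two forests and the chosen cutoff easy to see:

# sketch: held-out score as a function of the cutoff for P(spam)
plot(cutoffs, score_roc, type = "l", col = "dodgerblue",
     ylim = range(score_acc, score_roc),
     xlab = "cutoff for P(spam)", ylab = "score on held-out data")
lines(cutoffs, score_acc, col = "darkorange")
abline(v = cut_to_use, lty = 2)
legend("bottomleft", legend = c("ROC-tuned", "accuracy-tuned"),
       col = c("dodgerblue", "darkorange"), lty = 1)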
It is certainly possible that with more thought we could discover a better filter, but this analysis uses the key concept: modifying the classification cutoff to reflect the asymmetric scoring.
spam_tst_y = read_csv("spam_tst-y.csv") # data not provided
spam_tst_y$type = factor(spam_tst_y$type)
pred_prob = predict(rf_mod_spamn_roc, spam_tst, type = "prob")$spam
pred_class = ifelse(pred_prob > cut_to_use, "spam", "nonspam")
score(actual = spam_tst_y$type, predicted = pred_class) # score on autograder
## [1] 2902
score(actual = spam_tst_y$type, predicted = predict(rf_mod_spamn_roc, spam_tst))
## [1] 2238
accuracy(actual = spam_tst_y$type, predicted = pred_class)
## [1] 0.8604348
accuracy(actual = spam_tst_y$type, predicted = predict(rf_mod_spamn_roc, spam_tst))
## [1] 0.9334783
As expected, we see a decrease in accuracy in exchange for an increase in score.
varImpPlot(rf_mod_spamn_roc$finalModel, n.var = 10)
Looking at the ten most important variables, we see that things like exclamation marks, capital letters, and dollar signs are good predictors. Remember, this is an old dataset collected from HP employees. (Note that `hp` also appears to be important.) This would be a terrible spam classifier for general email data today.
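The same information can be pulled as a table instead of a plot (a quick sketch):

# sketch: the ten largest variable importances from the final forest, as a named vector
head(sort(importance(rf_mod_spamn_roc$finalModel)[, 1], decreasing = TRUE), 10)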