For this homework, you may only use the following packages:

# general
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)

# specific
library(ISLR)
library(pROC)

Exercise 1 (Logistic Regression for Fuel Efficiency)

[6 points] For this exercise we will use the Auto data from the ISLR package.

data(Auto)

As we have seen before, we drop the name variable. We also coerce origin and cylinders to be factors as they are categorical variables.

We also recode the response variable mpg. Instead of the actual fuel efficiency, we simply label cars that obtain fewer than 30 miles per gallon as 'low' fuel efficiency; those at 30 or above have 'high' fuel efficiency.

Auto = subset(Auto, select = -c(name))
Auto$origin = factor(Auto$origin)
Auto$cylinders = factor(Auto$cylinders)
Auto$mpg = factor(ifelse(Auto$mpg < 30, "low", "high"))

After these modifications, we test-train split the data.

set.seed(1)
auto_trn_idx  = sample(nrow(Auto), size = trunc(0.75 * nrow(Auto)))
auto_trn_data = Auto[auto_trn_idx, ]
auto_tst_data = Auto[-auto_trn_idx, ]

The goal of our modeling in this exercise is to predict whether or not a vehicle is fuel efficient.

Fit five different logistic regressions:

- an intercept-only model,
- a simple model using only horsepower,
- a multiple model using horsepower and origin,
- an additive model using all available predictors, and
- an interaction model using all available predictors and their two-way interactions.

Here we'll define \(p(x) = P(Y = \texttt{low} \mid X = x)\). The variables euro and japan are dummy variables based on the origin variable; R creates them automatically when origin is a factor. Do not make these variables by modifying the data.

Using each of these models to estimate \(p(x)\), that is, the probability of low fuel efficiency given the characteristics of a vehicle, we can create a classifier.

\[ \hat{C}(x) = \begin{cases} \texttt{low} & \hat{p}(x) > 0.5 \\ \texttt{high} & \hat{p}(x) \leq 0.5 \end{cases} \]
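Since the logistic link is monotone, thresholding the estimated probability at 0.5 is equivalent to thresholding the estimated log-odds at 0:

\[ \hat{p}(x) > 0.5 \iff \log\left(\frac{\hat{p}(x)}{1 - \hat{p}(x)}\right) > 0 \]

The solution code below uses this fact, classifying directly from the link-scale predictions.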

For each classifier, obtain train and test classification error rates. Summarize your results in a well-formatted markdown table.

Solution:

# fit models
# note: since the factor levels are high < low, glm models P(Y = low)
mod_intercept   = glm(mpg ~ 1, data = auto_trn_data, family = "binomial")
mod_simple      = glm(mpg ~ horsepower, data = auto_trn_data, family = "binomial")
mod_multiple    = glm(mpg ~ horsepower + origin, data = auto_trn_data, family = "binomial")
mod_additive    = glm(mpg ~ ., data = auto_trn_data, family = "binomial")
mod_interaction = glm(mpg ~ . ^ 2, data = auto_trn_data, family = "binomial")
# helper function
calc_err = function(actual, predicted) {
  mean(actual != predicted)
}
# create model list
auto_mod_list = list(mod_intercept, mod_simple, mod_multiple, mod_additive, mod_interaction)

# get predicted log-odds from fitted models (predict() defaults to the link scale)
auto_trn_logodds = lapply(auto_mod_list, predict, newdata = auto_trn_data)
auto_tst_logodds = lapply(auto_mod_list, predict, newdata = auto_tst_data)

# get predictions from classifiers built using fitted models
# (log-odds > 0 is equivalent to predicted probability > 0.5)
auto_trn_pred = lapply(auto_trn_logodds, function(x) {ifelse(x > 0, "low", "high")})
auto_tst_pred = lapply(auto_tst_logodds, function(x) {ifelse(x > 0, "low", "high")})

# calculate errors
auto_trn_err  = sapply(auto_trn_pred, calc_err, actual = auto_trn_data$mpg)
auto_tst_err  = sapply(auto_tst_pred, calc_err, actual = auto_tst_data$mpg)
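
The summary table below can be assembled along these lines; the data frame construction and the model labels are our additions, mirroring the kable styling used in Exercise 2:

# collect results and render as a table
auto_results = data.frame(
  "Model Name"  = c("Intercept", "Simple", "Multiple", "Additive", "Interaction"),
  "Train Error" = auto_trn_err,
  "Test Error"  = auto_tst_err,
  check.names = FALSE
)
kable_styling(kable(auto_results, format = "html", digits = 3), full_width = FALSE)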
| Model Name  | Train Error | Test Error |
|-------------|-------------|------------|
| Intercept   | 0.224       | 0.245      |
| Simple      | 0.143       | 0.122      |
| Multiple    | 0.136       | 0.102      |
| Additive    | 0.078       | 0.061      |
| Interaction | 0.051       | 0.122      |

Exercise 2 (Detecting Cancer with Logistic Regression)

[6 points] For this exercise we will use data found in wisc-trn.csv and wisc-tst.csv which contain train and test data respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

Consider an additive logistic regression that considers only two predictors, radius and symmetry. Use this model to estimate

\[ p(x) = P(Y = \texttt{M} \mid X = x). \]

Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability, c = 0.1, c = 0.5, and c = 0.9:

\[ \hat{C}(x) = \begin{cases} \texttt{M} & \hat{p}(x) > c \\ \texttt{B} & \hat{p}(x) \leq c \end{cases} \]

We will consider M (malignant) to be the “positive” class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
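
Here, with M as the positive class and TP, FP, TN, FN the counts of true/false positives/negatives on the test data,

\[ \text{Sens} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{Spec} = \frac{\text{TN}}{\text{TN} + \text{FP}}, \qquad \text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}. \]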

Solution:

# import data
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
# coerce to factor
wisc_trn$class = as.factor(wisc_trn$class)
wisc_tst$class = as.factor(wisc_tst$class)
# fit model; since the factor levels are B < M, glm models P(Y = M)
wisc_glm = glm(class ~ radius + symmetry, data = wisc_trn, family = "binomial")
# helper function, specific to glm models: classify with a probability cutoff
get_pred_glm = function(mod, data, pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
# get predicted values for each cutoff
pred_10 = get_pred_glm(wisc_glm, wisc_tst, cut = 0.1, pos = "M", neg = "B")
pred_50 = get_pred_glm(wisc_glm, wisc_tst, cut = 0.5, pos = "M", neg = "B")
pred_90 = get_pred_glm(wisc_glm, wisc_tst, cut = 0.9, pos = "M", neg = "B")
# create table for each cutoff
tab_10 = table(predicted = pred_10, actual = wisc_tst$class)
tab_50 = table(predicted = pred_50, actual = wisc_tst$class)
tab_90 = table(predicted = pred_90, actual = wisc_tst$class)
# convert tables to confusion matrices
con_mat_10 = confusionMatrix(tab_10, positive = "M")
con_mat_50 = confusionMatrix(tab_50, positive = "M")
con_mat_90 = confusionMatrix(tab_90, positive = "M")
metrics = rbind(
  c(con_mat_10$overall["Accuracy"],
    con_mat_10$byClass["Sensitivity"],
    con_mat_10$byClass["Specificity"]),
  c(con_mat_50$overall["Accuracy"],
    con_mat_50$byClass["Sensitivity"],
    con_mat_50$byClass["Specificity"]),
  c(con_mat_90$overall["Accuracy"],
    con_mat_90$byClass["Sensitivity"],
    con_mat_90$byClass["Specificity"])
)

rownames(metrics) = c("c = 0.10", "c = 0.50", "c = 0.90")
kable_styling(kable(metrics, format = "html", digits = 3), full_width = FALSE)
| Cutoff   | Accuracy | Sensitivity | Specificity |
|----------|----------|-------------|-------------|
| c = 0.10 | 0.86     | 0.950       | 0.800       |
| c = 0.50 | 0.91     | 0.825       | 0.967       |
| c = 0.90 | 0.82     | 0.575       | 0.983       |

Exercise 3 (More Sensitivity and Specificity)

[6 points] Continuing the setup (data and model) from Exercise 2, we now create two plots which will help us understand the tradeoff between sensitivity and specificity: a plot of test accuracy, sensitivity, and specificity as functions of the probability cutoff c, and an ROC curve for the test data.

Display these two plots side-by-side.

Hint: Consider creating some functions specific to this exercise for obtaining accuracy, sensitivity, and specificity. (If you didn’t already do so in Exercise 2.)

c = seq(0.01, 0.99, by = 0.01)

Solution:

get_metric = function(cutoff, metric) {
  pred = get_pred_glm(wisc_glm, wisc_tst, cut = cutoff, pos = "M", neg = "B")
  # ensure both classes are present so the table is 2x2 even at extreme cutoffs
  pred = factor(pred, levels = levels(wisc_tst$class))
  tab = table(predicted = pred, actual = wisc_tst$class)
  conmat = confusionMatrix(tab, positive = "M")
  switch(metric,
         acc  = conmat$overall["Accuracy"],
         sens = conmat$byClass["Sensitivity"],
         spec = conmat$byClass["Specificity"])
}
# setup for custom plotting
wisc_acc  = sapply(c, get_metric, metric = "acc")
wisc_sens = sapply(c, get_metric, metric = "sens")
wisc_spec = sapply(c, get_metric, metric = "spec")

# setup for ROC
test_prob = predict(wisc_glm, newdata = wisc_tst, type = "response")
par(mfrow = c(1, 2))

# custom plot
plot(c, wisc_acc, type = "l", ylim = c(0, 1), col = "darkgrey", lwd = 1, lty = 1,
     main = "Accuracy, Sensitivity, Specificity",
     xlab = "Probability Cutoff, c", ylab = "Metric Value")
lines(c, wisc_sens, col = "darkorange", lwd = 2, lty = 3)
lines(c, wisc_spec, col = "dodgerblue", lwd = 2, lty = 2)
legend("bottomleft", c("Accuracy", "Specificity", "Sensitivity"), lty = c(1, 2, 3),
       lwd = 1, col = c("darkgrey", "dodgerblue", "darkorange"), cex = 0.8)

# ROC plot
test_roc = roc(wisc_tst$class ~ test_prob, plot = TRUE, print.auc = TRUE,
               main = "ROC")


Exercise 4 (Logistic Regression Decision Boundary)

[6 points] Continue with the cancer data from previous exercises. Again, consider an additive logistic regression that considers only two predictors, radius and symmetry. Plot the test data with radius as the \(x\) axis, and symmetry as the \(y\) axis, with the points colored according to their tumor status. Add a line which represents the decision boundary for a classifier using 0.5 as a cutoff for predicted probability.

Solution:

wisc_glm = glm(class ~ radius + symmetry, data = wisc_trn, family = "binomial")
# function to calculate decision boundary for glms
glm_boundary_line = function(glm_fit) {
  intercept = as.numeric(-coef(glm_fit)[1] / coef(glm_fit)[3])
  slope = as.numeric(-coef(glm_fit)[2] / coef(glm_fit)[3])
  c(intercept = intercept, slope = slope)
}
# function to add decision boundary for glms to an existing plot
add_glm_boundary = function(glm_fit, line_col = "black") {
  abline(glm_boundary_line(glm_fit), col = line_col, lwd = 3)
}
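
With these helpers, the requested plot can be produced as follows; the colors, plotting character, and legend placement are our choices, not specified by the exercise:

# plot test data colored by tumor status, then add the 0.5-cutoff boundary
plot(symmetry ~ radius, data = wisc_tst,
     col = ifelse(wisc_tst$class == "M", "dodgerblue", "darkorange"),
     pch = 20, main = "Test Data and Decision Boundary")
add_glm_boundary(wisc_glm)
legend("topright", legend = c("M (malignant)", "B (benign)"),
       col = c("dodgerblue", "darkorange"), pch = 20)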


Exercise 5 (Concept Checks)

[1 point each] Answer the following questions based on your results from the previous exercises.

(a) What is \(\hat{\beta}_2\) for the multiple model in Exercise 1?

Solution:

coef(mod_multiple)[3]
##   origin2 
## 0.1243737

That is, \(\hat{\beta}_2 \approx 0.124\), the coefficient on origin2, the dummy variable indicating European origin.

(b) Based on your results in Exercise 1, which of these models performs best?

Solution: The additive model achieves the lowest test error.

(c) Based on your results in Exercise 1, which of these models do you think may be underfitting?

Solution: The intercept, simple, and multiple models, as they are all less flexible than the best (additive) model and obtain higher train and test errors.

(d) Based on your results in Exercise 1, which of these models do you think may be overfitting?

Solution: The interaction model, as it is more flexible than the best model and obtains a lower train error but a higher test error.

(e) Of the classifiers in Exercise 2, which do you prefer?

Solution: The classifier with cutoff c = 0.10.

(f) State the metric you used to make your decision in part (e), and a reason for using that metric.

Solution: The classifier with c = 0.10 achieves the highest sensitivity, meaning it most often labels cancer as cancer! We value this over labeling non-cancer as non-cancer, since missing a cancer is the costlier mistake.