Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.
“Better three hours too soon than a minute too late.”
— William Shakespeare
For this homework, you may only use the following packages:
# general
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
# specific
library(ISLR)
library(pROC)
If you feel additional general packages would be useful for future homework, please pass these along to the instructor.
[6 points] For this exercise we will use the `Auto` data from the `ISLR` package.
data(Auto)
As we have seen before, we drop the `name` variable. We also coerce `origin` and `cylinders` to be factors, as they are categorical variables. We also re-create the response variable `mpg` as a factor. Instead of the actual fuel efficiency, we simply label cars that achieve fewer than 30 miles per gallon as ‘low’ fuel efficiency; those at or above 30 have ‘high’ fuel efficiency.
Auto = subset(Auto, select = -c(name))
Auto$origin = factor(Auto$origin)
Auto$cylinders = factor(Auto$cylinders)
Auto$mpg = factor(ifelse(Auto$mpg < 30, "low", "high"))
After these modifications, we test-train split the data.
set.seed(1)
auto_trn_idx = sample(nrow(Auto), size = trunc(0.75 * nrow(Auto)))
auto_trn_data = Auto[auto_trn_idx, ]
auto_tst_data = Auto[-auto_trn_idx, ]
The goal of our modeling in this exercise is to predict whether or not a vehicle is fuel efficient.
Fit five different logistic regressions.
Here we’ll define \(p(x) = P(Y = \texttt{low} \mid X = x)\). The variables `euro` and `japan` are dummy variables based on the `origin` variable. Do not create these variables by modifying the data.
Using each of these models to estimate \(p(x)\), that is, the probability of `low` fuel efficiency given the characteristics of a vehicle, we can create a classifier:
\[ \hat{C}(x) = \begin{cases} \texttt{low} & \hat{p}(x) > 0.5 \\ \texttt{high} & \hat{p}(x) \leq 0.5 \end{cases} \]
For each classifier, obtain train and test classification error rates. Summarize your results in a well-formatted markdown table.
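As a point of reference (not a required approach), here is a minimal sketch of how a fitted logistic regression could be turned into this classifier and its train and test error rates computed. The additive all-predictor model and the helper function names below are illustrative placeholders, not part of the assignment.

```r
# A sketch only: fit one candidate model (an additive model with all
# predictors, for illustration) and compute its error rates.
mod_additive = glm(mpg ~ ., data = auto_trn_data, family = "binomial")

# Convert estimated p(x) = P(Y = "low" | X = x) into class labels.
# Note: glm() models the probability of the second factor level, which is
# "low" here because the levels of mpg are c("high", "low").
get_pred_class = function(mod, data, cutoff = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  factor(ifelse(probs > cutoff, "low", "high"), levels = levels(data$mpg))
}

# Misclassification (error) rate.
calc_err = function(actual, predicted) {
  mean(actual != predicted)
}

calc_err(auto_trn_data$mpg, get_pred_class(mod_additive, auto_trn_data))
calc_err(auto_tst_data$mpg, get_pred_class(mod_additive, auto_tst_data))
```

The same two helpers can be reused for each of the five models, so only the model-fitting line changes.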
[6 points] For this exercise we will use data found in `wisc-trn.csv` and `wisc-tst.csv`, which contain train and test data, respectively. `wisc.csv` is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)
You should consider coercing the response to be a factor variable.
Consider an additive logistic regression that uses only two predictors, `radius` and `symmetry`. Use this model to estimate
\[ p(x) = P(Y = \texttt{M} \mid X = x). \]
Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability:
\[ \hat{C}(x) = \begin{cases} \texttt{M} & \hat{p}(x) > c \\ \texttt{B} & \hat{p}(x) \leq c \end{cases} \]
We will consider `M` (malignant) to be the “positive” class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
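A minimal sketch of one way to compute these metrics is below. It assumes the response column in the CSV files is named `class` with levels `B` and `M`, and the three cutoff values shown are placeholders; adjust these assumptions to match the actual data and the cutoffs the exercise specifies.

```r
# A sketch only: assumes the response column is named "class" with levels
# "B" and "M"; adjust if the files use different names.
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
wisc_trn$class = factor(wisc_trn$class)
wisc_tst$class = factor(wisc_tst$class)

# glm() models P(Y = "M" | X = x) since "M" is the second factor level.
mod_wisc = glm(class ~ radius + symmetry, data = wisc_trn, family = "binomial")

# Classify using a given cutoff.
wisc_classify = function(mod, data, cutoff) {
  probs = predict(mod, newdata = data, type = "response")
  factor(ifelse(probs > cutoff, "M", "B"), levels = levels(data$class))
}

# Metrics with "M" (malignant) as the positive class.
calc_metrics = function(actual, predicted) {
  c(sensitivity = mean(predicted[actual == "M"] == "M"),
    specificity = mean(predicted[actual == "B"] == "B"),
    accuracy    = mean(actual == predicted))
}

# Placeholder cutoffs; use whichever three cutoffs the exercise specifies.
cutoffs = c(0.1, 0.5, 0.9)
results = sapply(cutoffs, function(cut) {
  calc_metrics(wisc_tst$class, wisc_classify(mod_wisc, wisc_tst, cut))
})
colnames(results) = paste0("c = ", cutoffs)
results
```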
[6 points] Continuing the setup (data and model) from Exercise 2, we now create two plots which will help us understand the tradeoff between sensitivity and specificity.
Create a plot of test sensitivity as a function of the cutoff and a plot of test specificity as a function of the cutoff. (Use the cutoff values defined in the R code below.) Display these two plots side-by-side.
Hint: Consider creating some functions specific to this exercise for obtaining accuracy, sensitivity, and specificity. (If you didn’t already do so in Exercise 2.)
c = seq(0.01, 0.99, by = 0.01)
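A sketch of how the two curves could be computed and displayed side-by-side with base graphics, reusing the hypothetical `mod_wisc`, `wisc_classify()`, and `calc_metrics()` objects from the Exercise 2 sketch above:

```r
# A sketch only: reuses the (hypothetical) mod_wisc, wisc_classify(), and
# calc_metrics() objects from the Exercise 2 sketch.
metrics_by_cutoff = t(sapply(c, function(cut) {
  calc_metrics(wisc_tst$class, wisc_classify(mod_wisc, wisc_tst, cut))
}))

# Two base-graphics plots, side-by-side.
par(mfrow = c(1, 2))
plot(c, metrics_by_cutoff[, "sensitivity"], type = "l",
     xlab = "cutoff", ylab = "sensitivity", main = "Test Sensitivity")
plot(c, metrics_by_cutoff[, "specificity"], type = "l",
     xlab = "cutoff", ylab = "specificity", main = "Test Specificity")
par(mfrow = c(1, 1))
```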
[6 points] Continue with the cancer data from the previous exercises. Again, consider an additive logistic regression that uses only two predictors, `radius` and `symmetry`. Plot the test data with `radius` on the \(x\)-axis and `symmetry` on the \(y\)-axis, with the points colored according to their tumor status. Add a line which represents the decision boundary for a classifier using 0.5 as a cutoff for predicted probability.
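Because the model is additive in `radius` and `symmetry`, a 0.5 cutoff corresponds to the set of points where the linear predictor equals zero, \(\hat{\beta}_0 + \hat{\beta}_1 \cdot \texttt{radius} + \hat{\beta}_2 \cdot \texttt{symmetry} = 0\), which is a straight line in the (`radius`, `symmetry`) plane. A minimal sketch, assuming the fitted model object and test data from the Exercise 2 sketch:

```r
# A sketch only: reuses mod_wisc and wisc_tst from the Exercise 2 sketch.
# With a 0.5 cutoff, the boundary is where the linear predictor equals zero:
#   b0 + b1 * radius + b2 * symmetry = 0
beta = coef(mod_wisc)
boundary_intercept = -beta["(Intercept)"] / beta["symmetry"]
boundary_slope     = -beta["radius"] / beta["symmetry"]

ggplot(wisc_tst, aes(x = radius, y = symmetry, color = class)) +
  geom_point() +
  geom_abline(intercept = boundary_intercept, slope = boundary_slope) +
  labs(title = "Test Data and Decision Boundary (cutoff = 0.5)")
```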
[1 point each] Answer the following questions based on your results from the three exercises.
(a) What is \(\hat{\beta}_2\) for the multiple model in Exercise 1?
(b) Based on your results in Exercise 1, which of these models performs best?
(c) Based on your results in Exercise 1, which of these models do you think may be underfitting?
(d) Based on your results in Exercise 1, which of these models do you think may be overfitting?
(e) Of the classifiers in Exercise 2, which do you prefer?
(f) State the metric you used to make your decision in part (e), and a reason for using that metric.