Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

“Better three hours too soon than a minute too late.”

William Shakespeare

For this homework, you may only use the following packages:

# general

# specific

If you feel additional general packages would be useful for future homework, please pass these along to the instructor.

Exercise 1 (Logistic Regression for Fuel Efficiency)

[6 points] For this exercise we will use the Auto data from the ISLR package.


As we have seen before, we drop the name variable. We also coerce origin and cylinders to be factors as they are categorical variables.

We also replace the response variable mpg. Instead of the actual fuel efficiency, we simply label cars that obtain fewer than 30 miles per gallon as ‘low’ fuel efficiency. Those that obtain 30 or more miles per gallon have ‘high’ fuel efficiency.

Auto = subset(Auto, select = -c(name))
Auto$origin = factor(Auto$origin)
Auto$cylinders = factor(Auto$cylinders)
Auto$mpg = factor(ifelse(Auto$mpg < 30, "low", "high"))

After these modifications, we test-train split the data.

auto_trn_idx  = sample(nrow(Auto), size = trunc(0.75 * nrow(Auto)))
auto_trn_data = Auto[auto_trn_idx, ]
auto_tst_data = Auto[-auto_trn_idx, ]

The goal of our modeling in this exercise is to predict whether or not a vehicle is fuel efficient.

Fit five different logistic regressions.

Here we’ll define \(p(x) = P(Y = \texttt{low} \mid X = x)\). The variables euro and japan are dummy variables based on the origin variable. Do not make these variables by modifying the data.
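One way to use a dummy variable without adding a column to the data is to build the indicator inside the model formula with I(). The sketch below uses simulated stand-in data, not the Auto data; the names sim_data, group, and fit are hypothetical, and which level corresponds to which dummy is an assumption you would check against the data documentation.

```r
# Simulated stand-in data (an assumption for illustration, not the Auto data)
set.seed(42)
n = 200
sim_data = data.frame(
  x     = rnorm(n),
  group = factor(sample(c(1, 2, 3), size = n, replace = TRUE))
)

# The true log-odds depend on x and on membership in group 2
eta = -0.5 + 1.2 * sim_data$x + 0.8 * (sim_data$group == 2)
sim_data$y = factor(ifelse(rbinom(n, size = 1, prob = plogis(eta)) == 1, "low", "high"))

# I() creates the indicator while the model is fit; sim_data gains no columns
fit = glm(y ~ x + I(group == 2), data = sim_data, family = "binomial")

names(coef(fit))
ncol(sim_data)
```

Because the levels of y are ordered alphabetically ("high", then "low"), glm() treats "low" as the event being modeled, which matches the definition of \(p(x)\) above.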

Using each of these models to estimate \(p(x)\), that is estimate the probability of a low fuel efficiency given the characteristics of a vehicle, we can create a classifier.

\[ \hat{C}(x) = \begin{cases} \texttt{low} & \hat{p}(x) > 0.5 \\ \texttt{high} & \hat{p}(x) \leq 0.5 \end{cases} \]
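As a minimal sketch of this rule, the code below fits a logistic regression to simulated data (an assumption for illustration, not one of the Auto models) and converts estimated probabilities into labels with a 0.5 cutoff. The names sim_data, p_hat, and c_hat are hypothetical.

```r
# Simulated stand-in data (not the Auto data)
set.seed(42)
n = 300
sim_data = data.frame(x = rnorm(n))
sim_data$y = factor(
  ifelse(rbinom(n, size = 1, prob = plogis(-0.3 + 1.5 * sim_data$x)) == 1, "low", "high")
)

fit = glm(y ~ x, data = sim_data, family = "binomial")

# type = "response" returns the estimated P(Y = "low" | X = x),
# because "low" is the second level of the factor response
p_hat = predict(fit, newdata = sim_data, type = "response")

# Apply the cutoff: "low" when p_hat > 0.5, otherwise "high"
c_hat = ifelse(p_hat > 0.5, "low", "high")

table(predicted = c_hat, actual = sim_data$y)
```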

For each classifier, obtain train and test classification error rates. Summarize your results in a well-formatted markdown table.
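The classification error rate is the proportion of observations whose predicted label differs from the actual label. The sketch below computes train and test error for a single model on simulated data and collects them in a data frame; the helper names calc_class_err and get_pred are hypothetical, and rendering the result as a markdown table is left to whichever table tool you use.

```r
# Proportion of misclassified observations
calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}

# Simulated stand-in data (not the Auto data)
set.seed(42)
n = 400
sim_data = data.frame(x = rnorm(n))
sim_data$y = factor(
  ifelse(rbinom(n, size = 1, prob = plogis(-0.3 + 1.5 * sim_data$x)) == 1, "low", "high")
)

# Same style of split as used for the Auto data
trn_idx  = sample(nrow(sim_data), size = trunc(0.75 * nrow(sim_data)))
trn_data = sim_data[trn_idx, ]
tst_data = sim_data[-trn_idx, ]

fit = glm(y ~ x, data = trn_data, family = "binomial")

# Classify with a probability cutoff (0.5 by default)
get_pred = function(mod, data, cutoff = 0.5) {
  ifelse(predict(mod, newdata = data, type = "response") > cutoff, "low", "high")
}

results = data.frame(
  model       = "y ~ x",
  train_error = calc_class_err(trn_data$y, get_pred(fit, trn_data)),
  test_error  = calc_class_err(tst_data$y, get_pred(fit, tst_data))
)
results
```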

Exercise 2 (Detecting Cancer with Logistic Regression)

[6 points] For this exercise we will use data found in wisc-trn.csv and wisc-tst.csv which contain train and test data respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

Consider an additive logistic regression that considers only two predictors, radius and symmetry. Use this model to estimate

\[ p(x) = P(Y = \texttt{M} \mid X = x). \]

Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability:

\[ \hat{C}(x) = \begin{cases} \texttt{M} & \hat{p}(x) > c \\ \texttt{B} & \hat{p}(x) \leq c \end{cases} \]

We will consider M (malignant) to be the “positive” class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
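With M as the positive class, sensitivity is the proportion of actual M cases predicted as M, and specificity is the proportion of actual B cases predicted as B. A minimal sketch with toy labels follows; the function names calc_sens, calc_spec, and calc_acc are hypothetical, and the labels are made up for illustration.

```r
# Sensitivity: true positives / all actual positives
calc_sens = function(actual, predicted, positive = "M") {
  sum(actual == positive & predicted == positive) / sum(actual == positive)
}

# Specificity: true negatives / all actual negatives
calc_spec = function(actual, predicted, positive = "M") {
  sum(actual != positive & predicted != positive) / sum(actual != positive)
}

# Accuracy: proportion of correct predictions
calc_acc = function(actual, predicted) {
  mean(actual == predicted)
}

# Toy labels (made up for illustration)
actual    = c("M", "M", "M", "B", "B", "B", "B", "B")
predicted = c("M", "M", "B", "B", "B", "B", "M", "B")

calc_sens(actual, predicted)  # 2 of 3 actual M found: 0.6667
calc_spec(actual, predicted)  # 4 of 5 actual B found: 0.8
calc_acc(actual, predicted)   # 6 of 8 correct: 0.75
```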

Exercise 3 (More Sensitivity and Specificity)

[6 points] Continuing the setup (data and model) from Exercise 2, we now create two plots which will help us understand the tradeoff between sensitivity and specificity.

Display these two plots side-by-side.

Hint: Consider creating some functions specific to this exercise for obtaining accuracy, sensitivity, and specificity. (If you didn’t already do so in Exercise 2.)

c = seq(0.01, 0.99, by = 0.01)
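A grid of cutoffs like this can be swept with sapply(), computing a metric at each value. The sketch below does so on simulated data (an assumption, not the cancer data) and uses par(mfrow = c(1, 2)) as one way to place two base-graphics plots side by side. The grid is stored as cutoffs, a hypothetical name, and the two quantities plotted are chosen only for illustration.

```r
# Simulated stand-in data (not the cancer data)
set.seed(42)
n = 300
sim_data = data.frame(x = rnorm(n))
sim_data$y = factor(
  ifelse(rbinom(n, size = 1, prob = plogis(-0.5 + 2 * sim_data$x)) == 1, "M", "B")
)

fit   = glm(y ~ x, data = sim_data, family = "binomial")
p_hat = predict(fit, type = "response")  # estimated P(Y = "M" | X = x)

cutoffs = seq(0.01, 0.99, by = 0.01)

# Sensitivity at each cutoff: share of actual M classified as M
sens = sapply(cutoffs, function(cut) {
  pred = ifelse(p_hat > cut, "M", "B")
  sum(pred == "M" & sim_data$y == "M") / sum(sim_data$y == "M")
})

# Specificity at each cutoff: share of actual B classified as B
spec = sapply(cutoffs, function(cut) {
  pred = ifelse(p_hat > cut, "M", "B")
  sum(pred == "B" & sim_data$y == "B") / sum(sim_data$y == "B")
})

# Two panels in one row
par(mfrow = c(1, 2))
plot(cutoffs, sens, type = "l", xlab = "Cutoff", ylab = "Sensitivity")
plot(cutoffs, spec, type = "l", xlab = "Cutoff", ylab = "Specificity")
```

Raising the cutoff can only move predictions from M to B, so sensitivity never increases and specificity never decreases along the grid.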

Exercise 4 (Logistic Regression Decision Boundary)

[6 points] Continue with the cancer data from previous exercises. Again, consider an additive logistic regression that considers only two predictors, radius and symmetry. Plot the test data with radius as the \(x\) axis, and symmetry as the \(y\) axis, with the points colored according to their tumor status. Add a line which represents the decision boundary for a classifier using 0.5 as a cutoff for predicted probability.
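For an additive model with two predictors, \(\hat{p}(x) = 0.5\) exactly when the estimated log-odds are zero, that is, when \(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0\). Solving for \(x_2\) gives a line with intercept \(-\hat{\beta}_0 / \hat{\beta}_2\) and slope \(-\hat{\beta}_1 / \hat{\beta}_2\). The sketch below illustrates this on simulated data with generic predictors x1 and x2 (assumptions, not the cancer data); the names bound_int and bound_slope are hypothetical.

```r
# Simulated stand-in data (not the cancer data)
set.seed(42)
n = 300
sim_data = data.frame(x1 = rnorm(n), x2 = rnorm(n))
eta = 0.5 + 1.5 * sim_data$x1 - 1 * sim_data$x2
sim_data$y = factor(ifelse(rbinom(n, size = 1, prob = plogis(eta)) == 1, "M", "B"))

fit = glm(y ~ x1 + x2, data = sim_data, family = "binomial")
b   = coef(fit)

# Boundary for a 0.5 cutoff: x2 = bound_int + bound_slope * x1
bound_int   = unname(-b["(Intercept)"] / b["x2"])
bound_slope = unname(-b["x1"] / b["x2"])

# Points colored by class, with the boundary line added
plot(x2 ~ x1, data = sim_data, pch = 20,
     col = ifelse(sim_data$y == "M", "darkorange", "dodgerblue"))
abline(a = bound_int, b = bound_slope, lwd = 2)

# Check: points on the line should have estimated probability 0.5
x1_grid = c(-1, 0, 1)
on_line = data.frame(x1 = x1_grid, x2 = bound_int + bound_slope * x1_grid)
p_line  = predict(fit, newdata = on_line, type = "response")
p_line
```

A cutoff other than 0.5 would shift the line, since the log-odds would then be set equal to \(\log(c / (1 - c))\) rather than zero.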

Exercise 5 (Concept Checks)

[1 point each] Answer the following questions based on your results from the previous exercises.

(a) What is \(\hat{\beta}_2\) for the multiple model in Exercise 1?

(b) Based on your results in Exercise 1, which of these models performs best?

(c) Based on your results in Exercise 1, which of these models do you think may be underfitting?

(d) Based on your results in Exercise 1, which of these models do you think may be overfitting?

(e) Of the classifiers in Exercise 2, which do you prefer?

(f) State the metric you used to make your decision in part (e), and a reason for using that metric.