Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

“Better three hours too soon than a minute too late.”

—

William Shakespeare

For this homework, you may only use the following packages:

```
# general
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
# specific
library(ISLR)
library(pROC)
```

If you feel additional general packages would be useful for future homework, please pass these along to the instructor.

**[6 points]** For this exercise we will use the `Auto` data from the `ISLR` package.

`data(Auto)`

As we have seen before, we drop the `name` variable. We also coerce `origin` and `cylinders` to be factors, as they are categorical variables.

We also replace the response variable `mpg` with a new categorical version. Instead of the actual fuel efficiency, we simply label cars that obtain fewer than 30 miles per gallon as ‘low’ fuel efficiency; those at 30 miles per gallon or above have ‘high’ fuel efficiency.

```
Auto = subset(Auto, select = -c(name))
Auto$origin = factor(Auto$origin)
Auto$cylinders = factor(Auto$cylinders)
Auto$mpg = factor(ifelse(Auto$mpg < 30, "low", "high"))
```

After these modifications, we test-train split the data.

```
set.seed(1)
auto_trn_idx = sample(nrow(Auto), size = trunc(0.75 * nrow(Auto)))
auto_trn_data = Auto[auto_trn_idx, ]
auto_tst_data = Auto[-auto_trn_idx, ]
```

The goal of our modeling in this exercise is to predict whether or not a vehicle is fuel efficient.

Fit five different logistic regressions.

- **Intercept**: \(\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0\)
- **Simple**: \(\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 \texttt{horsepower}\)
- **Multiple**: \(\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 \texttt{horsepower} + \beta_2 \texttt{euro} + \beta_3 \texttt{japan}\)
- **Additive**: An *additive* model using all available predictors
- **Interaction**: An *interaction* model that includes all first-order terms and all possible two-way interactions

Here we’ll define \(p(x) = P(Y = \texttt{low} \mid X = x)\). The variables `euro` and `japan` are dummy variables based on the `origin` variable. **Do not make these variables by modifying the data.**

Using each of these models to estimate \(p(x)\), that is, the probability of `low` fuel efficiency given the characteristics of a vehicle, we can create a classifier:

\[ \hat{C}(x) = \begin{cases} \texttt{low} & \hat{p}(x) > 0.5 \\ \texttt{high} & \hat{p}(x) \leq 0.5 \end{cases} \]

For each classifier, obtain train and test classification error rates. Summarize your results in a well-formatted markdown table.
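As a rough sketch of the workflow (not the graded solution), the pattern for one model looks like the following. The data frame here is a small synthetic stand-in, since the real `auto_trn_data` / `auto_tst_data` come from the split above; the helper and object names are illustrative.

```r
# misclassification (error) rate helper
calc_err = function(actual, predicted) {
  mean(actual != predicted)
}

# small synthetic stand-in for the train/test split above (illustration only)
set.seed(42)
demo = data.frame(
  mpg        = factor(sample(c("low", "high"), size = 100, replace = TRUE)),
  horsepower = rnorm(100, mean = 100, sd = 25)
)
demo_trn = demo[1:75, ]
demo_tst = demo[76:100, ]

# glm() with family = binomial models the probability of the *second* factor
# level; with levels c("high", "low"), that is P(Y = low), matching p(x)
mod_simple = glm(mpg ~ horsepower, data = demo_trn, family = binomial)

# classify with the 0.5 cutoff, then compute train and test error rates
trn_pred = ifelse(predict(mod_simple, demo_trn, type = "response") > 0.5, "low", "high")
tst_pred = ifelse(predict(mod_simple, demo_tst, type = "response") > 0.5, "low", "high")
trn_err = calc_err(demo_trn$mpg, trn_pred)
tst_err = calc_err(demo_tst$mpg, tst_pred)
```

Repeating this for each of the five formulas and collecting the errors as rows of a data frame gives the body of the summary table. (Note that including the `origin` factor in a formula produces its dummy variables automatically.)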

**[6 points]** For this exercise we will use data found in `wisc-trn.csv` and `wisc-tst.csv`, which contain train and test data, respectively. `wisc.csv` is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

Consider an additive logistic regression that uses *only two predictors*, `radius` and `symmetry`. Use this model to estimate

\[ p(x) = P(Y = \texttt{M} \mid X = x). \]

Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability:

\[ \hat{C}(x) = \begin{cases} M & \hat{p}(x) > c \\ B & \hat{p}(x) \leq c \end{cases} \]

- \(c = 0.1\)
- \(c = 0.5\)
- \(c = 0.9\)

We will consider `M` (malignant) to be the “positive” class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
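One way to organize these computations is a trio of small helpers; the function and object names below are illustrative, not required.

```r
# confusion-matrix based metrics, treating "M" (malignant) as the positive class
get_sens = function(actual, predicted, positive = "M") {
  mean(predicted[actual == positive] == positive)   # true positive rate
}
get_spec = function(actual, predicted, positive = "M") {
  mean(predicted[actual != positive] != positive)   # true negative rate
}
get_acc = function(actual, predicted) {
  mean(actual == predicted)
}

# classify at a given cutoff from estimated probabilities p_hat = P(Y = M | X = x)
classify = function(p_hat, cutoff) {
  ifelse(p_hat > cutoff, "M", "B")
}
```

With `p_hat` obtained from `predict(fit, type = "response")` on the test data, looping over the three cutoffs and storing each triple of metrics as a row produces the single summary table.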

**[6 points]** Continuing the setup (data and model) from Exercise 2, we now create two plots which will help us understand the tradeoff between sensitivity and specificity.

**Plot 1**: A plot that shows (test) accuracy, sensitivity, and specificity as a function of the cutoff \(c\) used to create a classifier based on the logistic regression model.

- Use the test data.
- Consider values of \(c\) from 0.01 to 0.99. (See the `R` code below.)
- Accuracy, sensitivity, and specificity will each be a “line” on the plot.
- Give each line a different color and line type.
- Give the plot a title, axis labels, and a legend.

**Plot 2**: An ROC curve.

- Use the test data.
- Display the AUC value on the plot.

Display these two plots side-by-side.

*Hint:* Consider creating some functions specific to this exercise for obtaining accuracy, sensitivity, and specificity. (If you didn’t already do so in Exercise 2.)

`c = seq(0.01, 0.99, by = 0.01)`
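A skeleton for both plots might look like the following, using the `pROC` package from the allowed list. The labels and probabilities here are simulated stand-ins for the real test data and model, and only the accuracy line of Plot 1 is drawn; sensitivity and specificity follow the same pattern via `lines()`.

```r
library(pROC)

# synthetic stand-in for test labels and estimated probabilities (illustration only)
set.seed(42)
actual = factor(sample(c("B", "M"), size = 200, replace = TRUE))
p_hat  = ifelse(actual == "M", rbeta(200, 4, 2), rbeta(200, 2, 4))

# accuracy as a function of the cutoff
acc_at  = function(cut) mean(ifelse(p_hat > cut, "M", "B") == actual)
cutoffs = seq(0.01, 0.99, by = 0.01)
acc     = sapply(cutoffs, acc_at)

# Plot 1 skeleton: one line shown; add sensitivity and specificity similarly
plot(cutoffs, acc, type = "l", col = "dodgerblue", lty = 1,
     xlab = "Cutoff, c", ylab = "Metric value",
     main = "Metrics vs Cutoff (Test Data)")

# Plot 2: ROC curve with the AUC printed on the plot
test_roc = roc(actual, p_hat, levels = c("B", "M"))
plot(test_roc, print.auc = TRUE, main = "Test ROC Curve")
test_auc = as.numeric(auc(test_roc))
```

Wrapping the two plotting calls in `par(mfrow = c(1, 2))` displays the plots side-by-side.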

**[6 points]** Continue with the cancer data from the previous exercises. Again, consider an additive logistic regression that uses *only two predictors*, `radius` and `symmetry`. Plot the test data with `radius` on the \(x\)-axis and `symmetry` on the \(y\)-axis, with the points colored according to their tumor status. Add a line which represents the decision boundary for a classifier using 0.5 as a cutoff for predicted probability.
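For a logistic regression, \(\hat{p}(x) = 0.5\) exactly when the linear predictor equals zero, so the boundary is the line \(\hat{\beta}_0 + \hat{\beta}_1 \texttt{radius} + \hat{\beta}_2 \texttt{symmetry} = 0\), which can be rearranged and drawn with `abline()`. A minimal sketch, using simulated data in place of `wisc-tst.csv` and illustrative variable names:

```r
# synthetic stand-in for the cancer test data (illustration only)
set.seed(42)
n = 150
class    = factor(sample(c("B", "M"), size = n, replace = TRUE))
radius   = rnorm(n, mean = ifelse(class == "M", 17, 12), sd = 2)
symmetry = rnorm(n, mean = ifelse(class == "M", 0.20, 0.17), sd = 0.02)
demo = data.frame(class, radius, symmetry)

# second factor level is "M", so glm() models P(Y = M | X = x)
fit = glm(class ~ radius + symmetry, data = demo, family = binomial)

# p-hat = 0.5 exactly when b0 + b1 * radius + b2 * symmetry = 0,
# i.e. symmetry = -b0 / b2 - (b1 / b2) * radius
b = coef(fit)
plot(symmetry ~ radius, data = demo,
     col = ifelse(class == "M", "dodgerblue", "darkorange"), pch = 20,
     main = "Decision Boundary, c = 0.5")
abline(a = -b[1] / b[3], b = -b[2] / b[3], lwd = 2)
```

The same rearrangement applies to any cutoff \(c\): replace the zero on the right-hand side with \(\log\left(\frac{c}{1 - c}\right)\) before solving for `symmetry`.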

**[1 point each]** Answer the following questions based on your results from the three exercises.

**(a)** What is \(\hat{\beta}_2\) for the **multiple** model in Exercise 1?

**(b)** Based on your results in Exercise 1, which of these models performs best?

**(c)** Based on your results in Exercise 1, which of these models do you think may be underfitting?

**(d)** Based on your results in Exercise 1, which of these models do you think may be overfitting?

**(e)** Of the classifiers in Exercise 2, which do you prefer?

**(f)** State the metric you used to make your decision in part **(e)**, and a reason for using that metric.