Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.


“Our greatest glory is not in never falling, but in rising every time we fall.”

Confucius


For this homework, you may only use the following packages:

# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)

# specific
library(e1071)
library(nnet)
library(ellipse)

If you feel additional general packages would be useful for future homework, please pass these along to the instructor.


Exercise 1 (Detecting Cancer with KNN)

[6 points] For this exercise we will use data found in wisc-trn.csv and wisc-tst.csv, which contain train and test data, respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

Consider two different preprocessing setups:

Setup 1: Do no preprocessing of the predictors.

Setup 2: Scale and center the predictors, using means and standard deviations learned from the training data to transform both the training and test data.

For each setup, train KNN models with values of k from 1 to 200, using only the variables radius, symmetry, and texture. For each model, calculate the test classification error. Summarize these results in a single plot showing test error as a function of k. (The plot will have two “curves,” one for each setup.) Your plot should be reasonably visually appealing, well-labeled, and include a legend.
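
To make this concrete, here is a minimal sketch of the unscaled setup using knn3() from caret (one of the allowed packages). It assumes the response column is named class; verify the column names against the actual files before using it.

# sketch of the unscaled setup; "class" as the response name is an assumption
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")

# coerce the response to a factor, as suggested above
wisc_trn$class = factor(wisc_trn$class)
wisc_tst$class = factor(wisc_tst$class)

calc_misclass = function(actual, predicted) {
  mean(actual != predicted)
}

k_values = 1:200
err_unscaled = sapply(k_values, function(k) {
  fit = knn3(class ~ radius + symmetry + texture, data = wisc_trn, k = k)
  calc_misclass(wisc_tst$class, predict(fit, wisc_tst, type = "class"))
})

For the second setup, the same loop applies after scaling and centering both sets with the training means and standard deviations.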


Exercise 2 (Bias-Variance Tradeoff, Logistic Regression)

[9 points] Run a simulation study to estimate the bias, variance, and mean squared error of estimating \(p(x)\) using logistic regression. Recall that \(p(x) = P(Y = 1 \mid X = x)\).

Consider the (true) logistic regression model

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = 1 + 2 x_1 - x_2 \]

To specify the full data generating process, consider the following R function.

make_sim_data = function(n_obs = 100) {
  # predictors drawn uniformly over their ranges
  x1 = runif(n = n_obs, min = 0, max = 2)
  x2 = runif(n = n_obs, min = 0, max = 4)
  # true p(x): inverse logit of the linear predictor 1 + 2 * x1 - x2
  prob = exp(1 + 2 * x1 - 1 * x2) / (1 + exp(1 + 2 * x1 - 1 * x2))
  # binary response drawn according to p(x)
  y = rbinom(n = n_obs, size = 1, prob = prob)
  data.frame(y, x1, x2)
}

So, the following generates one simulated dataset according to the data generating process defined above.

sim_data = make_sim_data()

Evaluate estimates of \(p(x_1 = 0.5, x_2 = 0.75)\) from fitting four models:

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 \]

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 \]

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1x_2 \]

Use 2000 simulations of datasets with a sample size of 30 to estimate the squared bias, variance, and mean squared error of estimating \(p(x_1 = 0.5, x_2 = 0.75)\) using \(\hat{p}(x_1 = 0.5, x_2 = 0.75)\) for each model. Report your results using a well-formatted table.
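
For reference, these three quantities are linked: for a fixed point, the mean squared error of an estimator decomposes exactly into squared bias plus variance,

\[ \text{MSE}\left(\hat{p}(x)\right) = E\left[\left(\hat{p}(x) - p(x)\right)^2\right] = \left(E\left[\hat{p}(x)\right] - p(x)\right)^2 + \text{var}\left(\hat{p}(x)\right) \]

so your estimated squared bias and variance should sum, up to simulation noise, to your estimated mean squared error. This is a useful sanity check on the table.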

At the beginning of your simulation study, run the following code, but with your nine-digit Illinois UIN.

set.seed(123456789)
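
One reasonable structure for the study (a sketch, not the required implementation): fit all four models to each simulated dataset, store the predicted probability at the point of interest, and summarize at the end. The model formulas below follow the four specifications above.

n_sims = 2000
x0 = data.frame(x1 = 0.5, x2 = 0.75)  # point of interest
p_hat = matrix(0, nrow = n_sims, ncol = 4)

for (i in 1:n_sims) {
  sim_data = make_sim_data(n_obs = 30)
  fit_0 = glm(y ~ 1, data = sim_data, family = binomial)
  fit_1 = glm(y ~ x1 + x2, data = sim_data, family = binomial)
  fit_2 = glm(y ~ x1 * x2, data = sim_data, family = binomial)
  fit_3 = glm(y ~ x1 * x2 + I(x1 ^ 2) + I(x2 ^ 2), data = sim_data, family = binomial)
  p_hat[i, ] = sapply(list(fit_0, fit_1, fit_2, fit_3),
                      predict, newdata = x0, type = "response")
}

# true probability at x0, from the data generating process
p_true = exp(1 + 2 * 0.5 - 0.75) / (1 + exp(1 + 2 * 0.5 - 0.75))

bias_sq  = (colMeans(p_hat) - p_true) ^ 2
variance = apply(p_hat, 2, var)
mse      = colMeans((p_hat - p_true) ^ 2)

The resulting vectors can be assembled into a data frame and passed to kable() for the summary table.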

Exercise 3 (Comparing Classifiers)

[8 points] Use the data found in hw05-trn.csv and hw05-tst.csv, which contain train and test data, respectively. Use y as the response. Coerce y to be a factor after importing the data if it is not already.

Create a pairs plot with ellipses for the training data, then train the following models using both available predictors:

Additive multinomial logistic regression

LDA (prior estimated from the data)

LDA with a flat prior

QDA (prior estimated from the data)

QDA with a flat prior

Naive Bayes

Calculate test and train error rates for each model. Summarize these results using a single well-formatted table.
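
A sketch of the fits is below. The response y is given above, but the predictor names x1 and x2 are assumptions to verify against the data, and the flat prior uses the four classes referenced in Exercise 4. Note that featurePlot() with plot = "ellipse" is the reason the ellipse package is loaded.

hw05_trn = read.csv("hw05-trn.csv")
hw05_tst = read.csv("hw05-tst.csv")
hw05_trn$y = factor(hw05_trn$y)
hw05_tst$y = factor(hw05_tst$y)

# pairs plot with ellipses for the training data
featurePlot(x = hw05_trn[, c("x1", "x2")], y = hw05_trn$y, plot = "ellipse")

calc_misclass = function(actual, predicted) {
  mean(actual != predicted)
}

fit_multinom = multinom(y ~ ., data = hw05_trn, trace = FALSE)
fit_lda      = lda(y ~ ., data = hw05_trn)
fit_lda_flat = lda(y ~ ., data = hw05_trn, prior = rep(1 / 4, 4))
fit_qda      = qda(y ~ ., data = hw05_trn)
fit_qda_flat = qda(y ~ ., data = hw05_trn, prior = rep(1 / 4, 4))
fit_nb       = naiveBayes(y ~ ., data = hw05_trn)

# predict() returns a list with $class for lda/qda, a factor otherwise
calc_misclass(hw05_tst$y, predict(fit_lda, hw05_tst)$class)
calc_misclass(hw05_tst$y, predict(fit_nb, hw05_tst))
calc_misclass(hw05_tst$y, predict(fit_multinom, hw05_tst))

Repeating the last step for each model, on both the train and test sets, gives the twelve error rates for the table.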


Exercise 4 (Concept Checks)

[1 point each] Answer the following questions based on your results from the previous three exercises.

(a) Based on your results in Exercise 2, which models are performing unbiased estimation?

(b) Based on your results in Exercise 2, which of these models performs best?

(c) In Exercise 3, which model performs best?

(d) In Exercise 3, why does Naive Bayes perform poorly?

(e) In Exercise 3, which performs better, LDA or QDA? Why?

(f) In Exercise 3, which prior performs better: one estimated from the data, or a flat prior? Why?

(g) In Exercise 3, of the four classes, which is the easiest to classify?