For this homework, you may only use the following packages:

# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)

# specific
library(e1071)
library(nnet)
library(ellipse)

# Exercise 1 (Detecting Cancer with KNN)

[6 points] For this exercise we will use data found in wisc-trn.csv and wisc-tst.csv which contain train and test data respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

Consider two different preprocessing setups:

• Setup 1: numeric variables are not scaled.
• Setup 2: numeric variables are scaled to have mean 0 and standard deviation 1.

For each setup, train KNN models using values of k from 1 to 200, using only the variables radius, symmetry, and texture. For each, calculate the test classification error. Summarize these results in a single plot of test error as a function of k. (The plot will have two “curves,” one for each setup.) Your plot should be reasonably visually appealing, well-labeled, and include a legend.

Solution:

# import data
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
# coerce to factor
wisc_trn$class = as.factor(wisc_trn$class)
wisc_tst$class = as.factor(wisc_tst$class)
# helper function for classification error
calc_err = function(actual, predicted) {
  mean(actual != predicted)
}
# define values of k to tune over
k = 1:200
# define model setups
wisc_knn_form_1 = as.formula(class ~ radius + symmetry + texture)
wisc_knn_form_2 = as.formula(class ~ scale(radius) + scale(symmetry) + scale(texture))
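
A side note on Setup 2 (not part of the required solution): scale() inside a formula standardizes whatever data it is handed, so at prediction time the test set is scaled by its own means and standard deviations rather than by the training statistics. A sketch of an alternative using caret::preProcess, with illustrative object names:

```r
# learn centering and scaling parameters from the training predictors only
pre_proc = preProcess(wisc_trn[, c("radius", "symmetry", "texture")],
                      method = c("center", "scale"))
# apply the *training* statistics to both train and test data
wisc_trn_scaled = predict(pre_proc, wisc_trn)
wisc_tst_scaled = predict(pre_proc, wisc_tst)
```

With modest sample sizes the two approaches can give noticeably different test errors, since the test statistics are themselves noisy.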
# fit models for each k within each setup
set.seed(1)
wisc_knn_mod_1 = lapply(k, function(x) {knn3(wisc_knn_form_1, data = wisc_trn, k = x)})
wisc_knn_mod_2 = lapply(k, function(x) {knn3(wisc_knn_form_2, data = wisc_trn, k = x)})
# get all predictions
wisc_knn_pred_1 = lapply(wisc_knn_mod_1, predict, newdata = wisc_tst, type = "class")
wisc_knn_pred_2 = lapply(wisc_knn_mod_2, predict, newdata = wisc_tst, type = "class")
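
The requested summary plot can be drawn with base graphics. A sketch, assuming wisc_knn_err_1 and wisc_knn_err_2 hold the 200 test error rates for the two setups:

```r
# plot test error as a function of k for both setups
plot(k, wisc_knn_err_1, type = "l", col = "dodgerblue", lwd = 2,
     ylim = range(c(wisc_knn_err_1, wisc_knn_err_2)),
     xlab = "k (number of neighbors)", ylab = "Test Classification Error",
     main = "Test Error vs. k")
lines(k, wisc_knn_err_2, col = "darkorange", lwd = 2)
legend("bottomright", legend = c("Setup 1: Unscaled", "Setup 2: Scaled"),
       col = c("dodgerblue", "darkorange"), lty = 1, lwd = 2)
grid()
```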
# calculate test classification error for each k for each setup
wisc_knn_err_1 = sapply(wisc_knn_pred_1, calc_err, actual = wisc_tst$class)
wisc_knn_err_2 = sapply(wisc_knn_pred_2, calc_err, actual = wisc_tst$class)

# Exercise 2 (Bias-Variance Tradeoff, Logistic Regression)

[9 points] Run a simulation study to estimate the bias, variance, and mean squared error of estimating $$p(x)$$ using logistic regression. Recall that $$p(x) = P(Y = 1 \mid X = x)$$.

Consider the (true) logistic regression model

$\log \left( \frac{p(x)}{1 - p(x)} \right) = 1 + 2 x_1 - x_2$

To specify the full data generating process, consider the following R function.

make_sim_data = function(n_obs = 100) {
  x1 = runif(n = n_obs, min = 0, max = 2)
  x2 = runif(n = n_obs, min = 0, max = 4)
  prob = exp(1 + 2 * x1 - 1 * x2) / (1 + exp(1 + 2 * x1 - 1 * x2))
  y = rbinom(n = n_obs, size = 1, prob = prob)
  data.frame(y, x1, x2)
}

So, the following generates one simulated dataset according to the data generating process defined above.

sim_data = make_sim_data()

Evaluate estimates of $$p(x_1 = 0.5, x_2 = 0.75)$$ from fitting four models:

$\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0$

$\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

$\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2$

$\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1x_2$

Use 2000 simulations of datasets with a sample size of 30 to estimate squared bias, variance, and the mean squared error of estimating $$p(x_1 = 0.5, x_2 = 0.75)$$ using $$\hat{p}(x_1 = 0.5, x_2 = 0.75)$$ for each model. Report your results using a well formatted table.

At the beginning of your simulation study, run the following code, but with your nine-digit Illinois UIN.

set.seed(123456789)
# setup simulation
n_sims = 2000
n_models = 4
x = data.frame(x1 = 0.5, x2 = 0.75) # fixed point at which we make predictions
predictions = matrix(0, nrow = n_sims, ncol = n_models)
# perform simulations
for (sim in 1:n_sims) {

  # generate a dataset according to the data generating process
  sim_data = make_sim_data(n_obs = 30)

  # fit models
  fit_1 = glm(y ~ 1, data = sim_data, family = "binomial")
  fit_2 = glm(y ~ ., data = sim_data, family = "binomial")
  fit_3 = glm(y ~ x1 * x2, data = sim_data, family = "binomial")
  fit_4 = glm(y ~ x1 * x2 + I(x1 ^ 2) + I(x2 ^ 2), data = sim_data, family = "binomial")

  # get predictions
  predictions[sim, 1] = predict(fit_1, x, type = "response")
  predictions[sim, 2] = predict(fit_2, x, type = "response")
  predictions[sim, 3] = predict(fit_3, x, type = "response")
  predictions[sim, 4] = predict(fit_4, x, type = "response")
}
# helper functions from R4SL

get_var = function(estimate) {
  mean((estimate - mean(estimate)) ^ 2)
}

get_bias = function(estimate, truth) {
  mean(estimate) - truth
}

get_mse = function(truth, estimate) {
  mean((estimate - truth) ^ 2)
}
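
These helpers implement the finite-sample version of the bias-variance decomposition. Because get_var() computes the variance about the mean of the estimates (dividing by $$n$$, not $$n - 1$$), the identity

$\underbrace{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{p}_i - p \right)^2}_{\text{MSE}} = \underbrace{\left( \bar{\hat{p}} - p \right)^2}_{\text{bias}^2} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{p}_i - \bar{\hat{p}} \right)^2}_{\text{variance}}$

holds exactly, so in the summary table each mean squared error should equal bias squared plus variance, up to rounding.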
# true function p(x) as defined by the data generating process
p = function(x) {
  with(x,
       exp(1 + 2 * x1 - 1 * x2) / (1 + exp(1 + 2 * x1 - 1 * x2))
  )
}
# value we are trying to estimate
p(x = x)
##  0.7772999
# calculate bias, variance, and mse of predictions for each logistic regression
bias = apply(predictions, 2, get_bias, truth = p(x))
variance = apply(predictions, 2, get_var)
mse = apply(predictions, 2, get_mse, truth = p(x))
# summarize results
results = data.frame(
  c("Intercept Only", "Additive", "Interaction", "Full Second Order"),
  round(mse, 5),
  round(bias ^ 2, 5),
  round(variance, 5)
)
colnames(results) = c("Logistic Regression Model",
                      "Mean Squared Error",
                      "Bias Squared",
                      "Variance")
rownames(results) = NULL
knitr::kable(results)
| Logistic Regression Model | Mean Squared Error | Bias Squared | Variance |
|:--------------------------|-------------------:|-------------:|---------:|
| Intercept Only            | 0.02089            | 0.01343      | 0.00746  |
| Interaction               | 0.03482            | 0.00001      | 0.03480  |
| Full Second Order         | 0.04362            | 0.00002      | 0.04359  |

# Exercise 3 (Comparing Classifiers)

# setup parameters
num_obs = 1000

# means
mu_1 = c(12, 8.5)
mu_2 = c(22, 10)
mu_3 = c(12, 15)
mu_4 = c(12, 20)

# sigmas
sigma_1 = matrix(c(10, -4, -4, 8), 2, 2)
sigma_2 = matrix(c(5, -3, -3, 5), 2, 2)
sigma_3 = matrix(c(8, 3, 3, 8), 2, 2)
sigma_4 = matrix(c(8, 6, 6, 8), 2, 2)

# control randomization
set.seed(42)

# make train data
hw05_trn = data.frame(

  # create response
  as.factor(c(rep("A", num_obs / 2), rep("B", num_obs),
              rep("C", num_obs * 2), rep("D", num_obs))),

  # create predictors
  rbind(
    mvrnorm(n = num_obs / 2, mu = mu_1, Sigma = sigma_1),
    mvrnorm(n = num_obs,     mu = mu_2, Sigma = sigma_2),
    mvrnorm(n = num_obs * 2, mu = mu_3, Sigma = sigma_3),
    mvrnorm(n = num_obs,     mu = mu_4, Sigma = sigma_4)
  )

)
# label variables
colnames(hw05_trn) = c("y", "x1", "x2")

# make test data
hw05_tst = data.frame(

  # create response
  as.factor(c(rep("A", num_obs), rep("B", num_obs),
              rep("C", num_obs), rep("D", num_obs))),

  # create predictors
  rbind(
    mvrnorm(n = num_obs, mu = mu_1, Sigma = sigma_1),
    mvrnorm(n = num_obs, mu = mu_2, Sigma = sigma_2),
    mvrnorm(n = num_obs, mu = mu_3, Sigma = sigma_3),
    mvrnorm(n = num_obs, mu = mu_4, Sigma = sigma_4)
  )

)
# label variables
colnames(hw05_tst) = c("y", "x1", "x2")

# write to files
write.csv(hw05_trn, "hw05-trn.csv", row.names = FALSE)
write.csv(hw05_tst, "hw05-tst.csv", row.names = FALSE)

# clear workspace
rm(hw05_trn)
rm(hw05_tst)

[8 points] Use the data found in hw05-trn.csv and hw05-tst.csv which contain train and test data respectively. Use y as the response. Coerce y to be a factor after importing the data if it is not already.

Create a pairs plot with ellipses for the training data, then train the following models using both available predictors:

• Additive Logistic Regression (Multinomial Regression)
• LDA (with Priors estimated from data)
• LDA with Flat Prior
• QDA (with Priors estimated from data)
• QDA with Flat Prior
• Naive Bayes (with Priors estimated from data)

Calculate test and train error rates for each model. Summarize these results using a single well-formatted table.

Solution:

# read data
hw05_trn = read.csv("hw05-trn.csv")
hw05_tst = read.csv("hw05-tst.csv")

# coerce characters to factors
hw05_trn$y = as.factor(hw05_trn$y)
hw05_tst$y = as.factor(hw05_tst$y)

Note: when using the older read.csv(), strings are automatically imported as factors by default. This would seem useful here, but it is a terrible idea in general. It is better to import as character, then later explicitly coerce to a factor if desired. For this reason, read_csv() does not even provide an option to import characters as factors. (At least not one this instructor is aware of.)
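
To illustrate the recommended pattern with read_csv() (readr is loaded via tidyverse), a sketch using the same filenames as above:

```r
# import as character, then explicitly coerce the response to a factor
hw05_trn = read_csv("hw05-trn.csv") %>% mutate(y = as.factor(y))
hw05_tst = read_csv("hw05-tst.csv") %>% mutate(y = as.factor(y))
```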

# density plots for each predictor by class
featurePlot(x = hw05_trn[, 2:3],
            y = hw05_trn$y,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            auto.key = list(columns = 4))
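
The requested pairs plot with ellipses and the model fitting itself are not shown above. A sketch using the loaded packages (caret for featurePlot, nnet for multinomial regression, MASS for LDA/QDA, e1071 for naive Bayes) and the calc_err() helper from Exercise 1; object names are illustrative:

```r
# pairs plot with ellipses for the training data (uses the ellipse package)
featurePlot(x = hw05_trn[, 2:3],
            y = hw05_trn$y,
            plot = "ellipse",
            auto.key = list(columns = 4))

# additive multinomial logistic regression
mod_multi = multinom(y ~ x1 + x2, data = hw05_trn, trace = FALSE)
# LDA and QDA, with priors estimated from the data and with flat priors
mod_lda      = lda(y ~ x1 + x2, data = hw05_trn)
mod_lda_flat = lda(y ~ x1 + x2, data = hw05_trn, prior = rep(1 / 4, 4))
mod_qda      = qda(y ~ x1 + x2, data = hw05_trn)
mod_qda_flat = qda(y ~ x1 + x2, data = hw05_trn, prior = rep(1 / 4, 4))
# naive Bayes with priors estimated from the data
mod_nb = naiveBayes(y ~ x1 + x2, data = hw05_trn)

# example error calculation; predict() for lda/qda returns a list with $class
calc_err(actual = hw05_tst$y, predicted = predict(mod_lda, hw05_tst)$class)
```

Train errors follow the same pattern with hw05_trn in place of hw05_tst, and the results can be collected into a data.frame and passed to knitr::kable() as in Exercise 2.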