Please see the homework instructions document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

Exercise 1 (Data Scaling?)

[8 points] This exercise will use data in hw03-train-data.csv and hw03-test-data.csv, which are the train and test datasets, respectively. Both datasets contain multiple predictors and a numeric response y.

Fit a total of six \(k\)-nearest neighbors models. Consider three values of \(k\): 1, 5, and 25. To make a total of six models, consider both scaled and unscaled \(X\) data. For each model, use all available predictors.

Summarize these results using a single well-formatted table which displays test RMSE, k, and whether or not scaling was used.
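As a rough illustration of the workflow described above, the sketch below compares scaled and unscaled \(k\)-nearest neighbors fits over the three values of \(k\). Everything in it is an assumption made for illustration: the data are simulated stand-ins (not the homework CSV files), and `knn_predict` and `rmse` are hypothetical base R helpers, not functions from any package used in the course. Scaling the test predictors with the training means and standard deviations is one common convention, not necessarily the one expected here.

```r
# Hypothetical helper: KNN regression in base R (no package dependencies).
# Predicts each test row as the mean response of its k nearest training rows.
knn_predict = function(trn_x, trn_y, tst_x, k) {
  trn_x = as.matrix(trn_x)
  tst_x = as.matrix(tst_x)
  unname(apply(tst_x, 1, function(row) {
    # Euclidean distance from this test point to every training point
    d = sqrt(colSums((t(trn_x) - row) ^ 2))
    mean(trn_y[order(d)[1:k]])
  }))
}

# Root mean squared error between observed and predicted responses
rmse = function(actual, predicted) sqrt(mean((actual - predicted) ^ 2))

# Simulated stand-in data: two predictors on very different scales
set.seed(1)
n = 200
sim = data.frame(x1 = runif(n), x2 = runif(n, min = 0, max = 1000))
sim$y = sin(2 * pi * sim$x1) + rnorm(n, sd = 0.1)
trn = sim[1:100, ]
tst = sim[101:200, ]
preds = c("x1", "x2")

# Scale both sets using the training means and standard deviations
trn_scaled = scale(trn[, preds])
tst_scaled = scale(tst[, preds],
                   center = attr(trn_scaled, "scaled:center"),
                   scale  = attr(trn_scaled, "scaled:scale"))

# One row per model: three values of k, with and without scaling
results = expand.grid(k = c(1, 5, 25), scaled = c(FALSE, TRUE))
results$test_rmse = mapply(function(k, scaled) {
  trn_x = if (scaled) trn_scaled else trn[, preds]
  tst_x = if (scaled) tst_scaled else tst[, preds]
  rmse(tst$y, knn_predict(trn_x, trn$y, tst_x, k))
}, results$k, results$scaled)

results
```

The resulting data frame has one row per model and can be passed to a table formatter such as `knitr::kable()` to produce a single summary table.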

Exercise 2 (KNN versus Linear Models)

[9 points] Find a \(k\)-nearest neighbors model that outperforms an additive linear model for predicting mpg in the Auto data from the ISLR package. Use the following data cleaning and test-train split to perform this analysis. Keep all of the predictor variables as numeric variables. Report the test RMSE for both the additive linear model and your chosen model. For your model, also note what value of \(k\) you used, as well as whether or not you scaled the \(X\) data.

# install.packages("ISLR")
library(ISLR)
auto = Auto[, !names(Auto) %in% c("name")]

set.seed(42)
auto_idx = sample(1:nrow(auto), size = round(0.5 * nrow(auto)))
auto_trn = auto[auto_idx, ]
auto_tst = auto[-auto_idx, ]

The additive linear model can be fit using:

lm(mpg ~ ., data = auto_trn)
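A minimal sketch of how a test RMSE for an additive linear model might be computed is given below. The helper `calc_rmse` is a hypothetical name, and the built-in `mtcars` data (restricted to a few columns) is used purely as a stand-in so the sketch runs without ISLR installed; the same pattern would apply to `auto_trn` and `auto_tst`.

```r
# Hypothetical helper: root mean squared error
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Stand-in data (an assumption for illustration; not the Auto data)
cars = mtcars[, c("mpg", "wt", "hp", "disp")]

# Same style of 50/50 split as above, with a placeholder seed
set.seed(1)
idx = sample(1:nrow(cars), size = round(0.5 * nrow(cars)))
trn = cars[idx, ]
tst = cars[-idx, ]

# Fit on the training data only, then evaluate on the held-out test data
fit = lm(mpg ~ ., data = trn)
lm_tst_rmse = calc_rmse(actual = tst$mpg,
                        predicted = predict(fit, newdata = tst))
lm_tst_rmse
```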

Exercise 3 (Bias-Variance Tradeoff, KNN)

[8 points] Run a modified version of the simulation study found in Section 8.3 of R4SL. Use the same data generating process to simulate data:

f = function(x) {
  x ^ 2
}

get_sim_data = function(f, sample_size = 100) {
  x = runif(n = sample_size, min = 0, max = 1)
  y = rnorm(n = sample_size, mean = f(x), sd = 0.3)
  data.frame(x, y)
}

So, the following generates one simulated dataset according to the data generating process defined above.

sim_data = get_sim_data(f)

Evaluate predictions of \(f(x = 0.90)\) for three models:

\(k\)-nearest neighbors with \(k = 1\): \(\hat{f}_1(x)\)
\(k\)-nearest neighbors with \(k = 10\): \(\hat{f}_{10}(x)\)
\(k\)-nearest neighbors with \(k = 100\): \(\hat{f}_{100}(x)\)

For simplicity, when fitting the \(k\)-nearest neighbors models, do not scale \(X\) data.

Use 500 simulations to estimate squared bias, variance, and the mean squared error of estimating \(f(0.90)\) using \(\hat{f}_k(0.90)\) for each \(k\). Report your results using a well-formatted table.
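The estimation step can be sketched as follows for a single, arbitrarily chosen \(k\). This is a sketch under stated assumptions, not the Section 8.3 code: `knn_at` is a hypothetical base R helper for one-dimensional KNN prediction at a single point, \(k = 5\) is an illustrative value that is not one of the three requested, and the seed is a placeholder.

```r
# Data generating process, as defined above
f = function(x) {
  x ^ 2
}

get_sim_data = function(f, sample_size = 100) {
  x = runif(n = sample_size, min = 0, max = 1)
  y = rnorm(n = sample_size, mean = f(x), sd = 0.3)
  data.frame(x, y)
}

# Hypothetical helper: KNN prediction at a single point x0 with one predictor
knn_at = function(sim_data, x0, k) {
  nearest = order(abs(sim_data$x - x0))[1:k]
  mean(sim_data$y[nearest])
}

set.seed(1)   # placeholder seed for this sketch only
n_sims = 500
x0 = 0.90
k = 5         # illustrative value of k

# One prediction of f(x0) per simulated dataset
predictions = replicate(n_sims, knn_at(get_sim_data(f), x0 = x0, k = k))

# Simulation-based estimates
bias_sq  = (mean(predictions) - f(x0)) ^ 2
variance = mean((predictions - mean(predictions)) ^ 2)
mse      = mean((predictions - f(x0)) ^ 2)

data.frame(k = k, bias_sq = bias_sq, variance = variance, mse = mse)
```

Because the variance here is computed with a divisor of `n_sims` rather than `n_sims - 1`, the estimated mean squared error equals the estimated squared bias plus the estimated variance exactly, which gives a quick check on the computation.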

At the beginning of your simulation study, run the following code, but with your nine-digit Illinois UIN.

set.seed(123456789)

Exercise 4 (Concept Checks)

[1 point each] Answer the following questions based on your results from the three exercises.

(a) Based on your results in Exercise 1, which \(k\) performed best?

(b) Based on your results in Exercise 1, was scaling the data appropriate?

(c) Based on your results in Exercise 2, why do you think it was so easy to find a \(k\)-nearest neighbors model that met this criterion?

(d) Based on your results in Exercise 3, which of the three models do you think are providing unbiased predictions?

(e) Based on your results in Exercise 3, which model is predicting best at \(x = 0.90\)?

Homework 03

STAT 430, Fall 2017

Due: Friday, September 29, 11:59 PM
