Lab Goal: The goal of this lab is to dig into the knnreg() function from the caret package. This version of $$k$$-nearest neighbors will allow us to use formula syntax!

“Rarely is the question asked: Is our children learning?”

George W. Bush

You should use the .Rmd file which created this document as a template for the lab. This file can be found in the .zip file that contains all the necessary files for the lab.

# Packages

For this lab, use only methods available from the following packages:

library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(microbenchmark)
library(randomForest)

If you haven’t already, make sure each is installed!

# Data

For this lab we will again use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)

data(birthwt)

Last time we read the documentation for this data and, based on our goal, dropped the low variable from the dataset. (It doesn’t make sense to include a variable indicating low birthweight if our goal is to predict birthweight. That would be of no use in practice.)

birthwt = subset(birthwt, select = -c(low))

We also coerced certain variables to be factor variables, as it was clear that they were categorical variables.

birthwt$race = factor(ifelse(birthwt$race == 1, "white",
                      ifelse(birthwt$race == 2, "black", "other")))
birthwt$smoke = factor(birthwt$smoke)
birthwt$ht = factor(birthwt$ht)
birthwt$ui = factor(birthwt$ui)

Finally, we test-train split the data.

set.seed(42)
bwt_trn_idx  = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]

# Model Training

[Exercise] Train three “different” $$k$$-nearest neighbors models, each with k = 5. The “difference” will be in what we call the preprocessing of the data. Note that we won’t actually modify the data, but we will use formula syntax to handle the preprocessing within the model fitting.

• Model 1
• Numeric variables not scaled.
• Factor variables remain factors. knnreg() will use one-hot encoding.
• Model 2
• Numeric variables are scaled to have mean 0 and standard deviation 1.
• Factor variables remain factors. knnreg() will use one-hot encoding.
• Model 3
• Numeric variables are scaled to have mean 0 and standard deviation 1.
• Coerce the race variable to be numeric. All other factors remain unchanged.
# your code here
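One possible approach is sketched below. It assumes the predictor names from the birthwt documentation (age, lwt, race, smoke, ptl, ht, ui, ftv); scale() is applied inside the formula so the data itself is never modified.

```r
# Model 1: predictors as-is; the formula one-hot encodes the factors
bwt_knn_1 = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)

# Model 2: scale() standardizes each numeric predictor within the formula
bwt_knn_2 = knnreg(bwt ~ scale(age) + scale(lwt) + race + smoke +
                     scale(ptl) + ht + ui + scale(ftv),
                   data = bwt_trn_data, k = 5)

# Model 3: as Model 2, but race coerced to numeric instead of one-hot encoded
bwt_knn_3 = knnreg(bwt ~ scale(age) + scale(lwt) + as.numeric(race) + smoke +
                     scale(ptl) + ht + ui + scale(ftv),
                   data = bwt_trn_data, k = 5)
```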

[Exercise] Output the first 6 rows of the training data.

# your code here
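For example:

```r
# first six rows of the training data
head(bwt_trn_data)
```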

[Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorithm in Model 1.

# your code here
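One way to inspect the predictor matrix a formula generates is model.matrix(), as sketched below for Model 1; note that model.matrix() adds an intercept column, which is dropped here since knnreg() does not use one. The same idea applies to the Model 2 and Model 3 formulas.

```r
# predictor matrix built by the Model 1 formula (intercept column dropped)
head(model.matrix(bwt ~ ., data = bwt_trn_data)[, -1])
```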

[Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorithm in Model 2.

# your code here

[Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorithm in Model 3.

# your code here

# Model Evaluation

[Exercise] Calculate test and train RMSE for each model.

# your code here
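A sketch of one approach, using a hypothetical helper function calc_rmse() (not part of any loaded package):

```r
# hypothetical helper: root mean squared error
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# for a fitted model stored in `fit`:
# calc_rmse(bwt_trn_data$bwt, predict(fit, newdata = bwt_trn_data))  # train RMSE
# calc_rmse(bwt_tst_data$bwt, predict(fit, newdata = bwt_tst_data))  # test RMSE
```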

[Exercise] Summarize these results in a table. (Model, Train/Test RMSE.) Output the results as a well-formatted markdown table.

# your code here
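One possible skeleton, with the RMSE values left as NA placeholders to be filled in from the previous exercise:

```r
# hypothetical results data frame; replace the NAs with the computed RMSEs
results = data.frame(
  Model        = c("Model 1", "Model 2", "Model 3"),
  "Train RMSE" = c(NA, NA, NA),
  "Test RMSE"  = c(NA, NA, NA),
  check.names  = FALSE
)
kable(results, format = "markdown", digits = 2)
```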

# Results?

[Exercise] Did preprocessing make a difference?

# Fast Train, Slow Test?

When using $$k$$-nearest neighbors, we say that it is fast at train time, slow at test (predict) time. Let’s see if this is true for the specific implementation used in knnreg().

[Exercise] Use the microbenchmark() function from the microbenchmark package to compare the runtimes of the following two lines.

fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
pred = predict(fit, newdata = bwt_tst_data)
# your code here
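A minimal sketch, assuming `fit` has already been created by the line above so that the prediction timing does not include fitting:

```r
# compare fit time to predict time for knnreg()
microbenchmark(
  train   = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
  predict = predict(fit, newdata = bwt_tst_data),
  times = 100
)
```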

[Exercise] Are the results what you expected? If not, try to explain.

[Exercise] Use the microbenchmark() function from the microbenchmark package to compare fitting $$k$$-nearest neighbors with k = 5 to fitting a random forest.

# your code here
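A sketch of one approach; `times` is reduced here on the assumption that fitting a random forest is considerably slower than fitting $$k$$-nearest neighbors:

```r
# compare fitting knn (k = 5) to fitting a random forest on the same data
microbenchmark(
  knn = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
  rf  = randomForest(bwt ~ ., data = bwt_trn_data),
  times = 10
)
```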