# Packages

For this lab, use only methods availible from the following packges:

library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(microbenchmark)
library(randomForest)

If you haven’t already, make sure each is installed!

# Data

For this lab we will again use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)

data(birthwt)

Last time w read the documentation for this data and based on our goal dropped the low variable from the dataset. (It doesn’t make sense to including a variable indiciating low birthweight if our goal is to predict birthweight. That would be of no use in practice.)

birthwt = subset(birthwt, select = -c(low))

We also coerced certain variables to be factor variables, as it was clear that they were categorical variables.

birthwt$race = factor(ifelse(birthwt$race == 1, "white",
ifelse(birthwt$race == 2, "black", "other"))) birthwt$smoke = factor(birthwt$smoke) birthwt$ht = factor(birthwt$ht) birthwt$ui = factor(birthwt$ui) We then finally test-train split the data. set.seed(42) bwt_trn_idx = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt))) bwt_trn_data = birthwt[bwt_trn_idx, ] bwt_tst_data = birthwt[-bwt_trn_idx, ] # Model Training [Exercise] Train three “different” $$k$$-nearest neighbors models, each with k = 5. The “difference” will be in what we call the preprocessing of the data. Note that we won’t actually modify the data, but we will use formula syntax to handle the preprocessing within the model fitting. • Model 1 • Numeric variables not scaled. • Factor variables remain factors. knnreg() will use one-hot encoding. • Model 2 • Numeric variables are scaled to have mean 0 and standard deviation 1. • Factor variables remain factors. knnreg() will use one-hot encoding. • Model 3 • Numeric variables are scaled to have mean 0 and standard deviation 1. • Coerce the race variable to be numeric. All ther factors remain unchanged. bwt_knn_mod_1 = knnreg(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = bwt_trn_data, k = 5) bwt_knn_mod_2 = knnreg(bwt ~ scale(age) + scale(lwt) + race + smoke + scale(ptl) + ht + ui + scale(ftv), data = bwt_trn_data, k = 5) bwt_knn_mod_3 = knnreg(bwt ~ scale(age) + scale(lwt) + as.numeric(race) + smoke + scale(ptl) + ht + ui + scale(ftv), data = bwt_trn_data, k = 5) [Exercise] Output the first 6 rows of the traning data. head(bwt_trn_data) ## age lwt race smoke ptl ht ui ftv bwt ## 61 24 105 black 1 0 0 0 0 2381 ## 67 22 130 white 1 0 0 0 1 2410 ## 141 30 95 white 1 0 0 0 2 3147 ## 36 24 138 white 0 0 0 0 0 2100 ## 215 25 120 white 0 0 0 0 2 3983 ## 190 29 135 white 0 0 0 0 1 3651 [Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorith in Model 1. head(bwt_knn_mod_1$learn$X) ## age lwt raceother racewhite smoke1 ptl ht1 ui1 ftv ## 61 24 105 0 0 1 0 0 0 0 ## 67 22 130 0 1 1 0 0 0 1 ## 141 30 95 0 1 1 0 0 0 2 ## 36 24 138 0 1 0 0 0 0 0 ## 215 25 120 0 1 0 0 0 0 2 ## 190 29 135 0 1 0 0 0 0 1 [Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorith in Model 2. head(bwt_knn_mod_2$learn$X) ## scale(age) scale(lwt) raceother racewhite smoke1 scale(ptl) ht1 ui1 ## 61 0.07216633 -0.881215186 0 0 1 -0.4172794 0 0 ## 67 -0.30887190 0.004023814 0 1 1 -0.4172794 0 0 ## 141 1.21528103 -1.235310786 0 1 1 -0.4172794 0 0 ## 36 0.07216633 0.287300294 0 1 0 -0.4172794 0 0 ## 215 0.26268545 -0.350071786 0 1 0 -0.4172794 0 0 ## 190 1.02476192 0.181071614 0 1 0 -0.4172794 0 0 ## scale(ftv) ## 61 -0.7938904 ## 67 0.1017808 ## 141 0.9974520 ## 36 -0.7938904 ## 215 0.9974520 ## 190 0.1017808 [Exercise] Output the first 6 rows of the $$X$$ data (predictor data frame) supplied to the $$k$$-nearest neighbors algorith in Model 3. head(bwt_knn_mod_3$learn$X) ## scale(age) scale(lwt) as.numeric(race) smoke1 scale(ptl) ht1 ui1 ## 61 0.07216633 -0.881215186 1 1 -0.4172794 0 0 ## 67 -0.30887190 0.004023814 3 1 -0.4172794 0 0 ## 141 1.21528103 -1.235310786 3 1 -0.4172794 0 0 ## 36 0.07216633 0.287300294 3 0 -0.4172794 0 0 ## 215 0.26268545 -0.350071786 3 0 -0.4172794 0 0 ## 190 1.02476192 0.181071614 3 0 -0.4172794 0 0 ## scale(ftv) ## 61 -0.7938904 ## 67 0.1017808 ## 141 0.9974520 ## 36 -0.7938904 ## 215 0.9974520 ## 190 0.1017808 # Model Evaluation [Exercise] Calculate test and train RMSE for each model. calc_rmse = function(actual, predicted) { sqrt(mean((actual - predicted) ^ 2)) } # create model list mod_list = list(bwt_knn_mod_1, bwt_knn_mod_2, bwt_knn_mod_3) # get train and test predictions trn_pred = lapply(mod_list, predict, newdata = bwt_trn_data) tst_pred = lapply(mod_list, predict, newdata = bwt_tst_data) # get train and test RMSE trn_rmse = sapply(trn_pred, calc_rmse, actual = bwt_trn_data$bwt)
tst_rmse = sapply(tst_pred, calc_rmse, actual = bwt_tst_data\$bwt)

[Exercise] Summarize these results in a table. (Model, Train/Test RMSE.) Output the results as a well-formatted markdown table.

# create df of results
results = data.frame(
mod = c("Model 1", "Model 2", "Model 3"),
trn_rmse = trn_rmse,
tst_rmse = tst_rmse
)
colnames(results) = c("Model", "Train RMSE", "Test RMSE")

# create results table
kable_styling(kable(results, format = "html", digits = 3), full_width = FALSE)
Model Train RMSE Test RMSE
Model 1 638.582 720.808
Model 2 585.120 707.995
Model 3 583.951 717.642

# Results?

[Exercise] Did preprocessing make a difference?

A little bit, but nothing all that significant in the context of the problem.

# Fast Train, Slow Test?

When using $$k$$-nearest neighbors, we say that it is fast at train time, slow at test (predict) time. Let’s see if this is true for the specific implementation used in knnreg().

[Exercise] Use the microbenchmark() function from the microbenchmark package to compare the runtimes of the following two lines.

fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
pred = predict(fit, newdata = bwt_tst_data)
microbenchmark(fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
times = 100, unit = "ms")
## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##   fit 1.022332 1.139804 1.434355 1.299095 1.681022 3.097059   100
microbenchmark(pred = predict(fit, newdata = bwt_tst_data),
times = 100, unit = "ms")
## Unit: milliseconds
##  expr      min      lq     mean    median       uq      max neval
##  pred 0.896311 0.92754 1.095276 0.9828045 1.147853 2.264674   100

[Exercise] Are the results what you expected? If not, try to explain.

The model “fitting” when we call knnreg() involves re-creating the training dataset, that is applying the preprocessing. For this data we have here, this takes about as long as making predictions on the test data.

[Exercise] Use the microbenchmark() function from the microbenchmark package to compare fitting $$k$$-nearest neighbors with k = 5 to fitting a random forest.

microbenchmark(fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
times = 100, unit = "ms")
## Unit: milliseconds
##  expr      min      lq     mean   median       uq      max neval
##   fit 1.003668 1.02187 1.118344 1.037106 1.094433 2.007259   100
microbenchmark(fit = randomForest(bwt ~ ., data = bwt_trn_data),
times = 100, unit = "ms")
## Unit: milliseconds
##  expr     min       lq     mean   median       uq      max neval
##   fit 51.1207 54.28044 64.41487 57.05296 59.55442 297.6403   100