For this lab, use only methods available from the following packages:
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(microbenchmark)
library(randomForest)
If you haven’t already, make sure each is installed!
For this lab we will again use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)
data(birthwt)
Last time we read the documentation for this data and, based on our goal, dropped the low variable from the dataset. (It doesn’t make sense to include a variable indicating low birthweight if our goal is to predict birthweight. That would be of no use in practice.)
birthwt = subset(birthwt, select = -c(low))
We also coerced certain variables to be factor variables, as it was clear that they were categorical variables.
birthwt$race = factor(ifelse(birthwt$race == 1, "white",
ifelse(birthwt$race == 2, "black", "other")))
birthwt$smoke = factor(birthwt$smoke)
birthwt$ht = factor(birthwt$ht)
birthwt$ui = factor(birthwt$ui)
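Since the tidyverse is loaded, the same coercion of the 0/1 columns could also be written in one step with mutate() and across(). This is only a sketch of an equivalent alternative, not what the lab itself uses:
birthwt = birthwt %>%
  mutate(across(c(smoke, ht, ui), factor))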
We then finally test-train split the data.
set.seed(42)
bwt_trn_idx = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]
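As a quick sanity check (an addition, not part of the original lab): birthwt has 189 rows, so with a 70% split we should see 132 training and 57 test observations.
nrow(bwt_trn_data)
nrow(bwt_tst_data)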
[Exercise] Train three “different” \(k\)-nearest neighbors models, each with k = 5. The “difference” will be in what we call the preprocessing of the data. Note that we won’t actually modify the data, but we will use formula syntax to handle the preprocessing within the model fitting.
Model 1: Use all of the predictors without modification. knnreg() will use one-hot encoding for the factor variables.
Model 2: Scale the numeric predictors. knnreg() will again use one-hot encoding for the factor variables.
Model 3: Scale the numeric predictors and coerce the race variable to be numeric. All other factors remain unchanged.
bwt_knn_mod_1 = knnreg(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = bwt_trn_data, k = 5)
bwt_knn_mod_2 = knnreg(bwt ~ scale(age) + scale(lwt) + race + smoke + scale(ptl) +
ht + ui + scale(ftv), data = bwt_trn_data, k = 5)
bwt_knn_mod_3 = knnreg(bwt ~ scale(age) + scale(lwt) + as.numeric(race) + smoke +
scale(ptl) + ht + ui + scale(ftv), data = bwt_trn_data, k = 5)
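Under the hood, the formula interface builds a design matrix, which is where the one-hot (dummy) encoding of factors like race happens. A minimal sketch of that encoding, using model.matrix() directly purely for illustration:
head(model.matrix(bwt ~ race, data = bwt_trn_data))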
[Exercise] Output the first 6 rows of the training data.
head(bwt_trn_data)
## age lwt race smoke ptl ht ui ftv bwt
## 61 24 105 black 1 0 0 0 0 2381
## 67 22 130 white 1 0 0 0 1 2410
## 141 30 95 white 1 0 0 0 2 3147
## 36 24 138 white 0 0 0 0 0 2100
## 215 25 120 white 0 0 0 0 2 3983
## 190 29 135 white 0 0 0 0 1 3651
[Exercise] Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 1.
head(bwt_knn_mod_1$learn$X)
## age lwt raceother racewhite smoke1 ptl ht1 ui1 ftv
## 61 24 105 0 0 1 0 0 0 0
## 67 22 130 0 1 1 0 0 0 1
## 141 30 95 0 1 1 0 0 0 2
## 36 24 138 0 1 0 0 0 0 0
## 215 25 120 0 1 0 0 0 0 2
## 190 29 135 0 1 0 0 0 0 1
[Exercise] Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 2.
head(bwt_knn_mod_2$learn$X)
## scale(age) scale(lwt) raceother racewhite smoke1 scale(ptl) ht1 ui1
## 61 0.07216633 -0.881215186 0 0 1 -0.4172794 0 0
## 67 -0.30887190 0.004023814 0 1 1 -0.4172794 0 0
## 141 1.21528103 -1.235310786 0 1 1 -0.4172794 0 0
## 36 0.07216633 0.287300294 0 1 0 -0.4172794 0 0
## 215 0.26268545 -0.350071786 0 1 0 -0.4172794 0 0
## 190 1.02476192 0.181071614 0 1 0 -0.4172794 0 0
## scale(ftv)
## 61 -0.7938904
## 67 0.1017808
## 141 0.9974520
## 36 -0.7938904
## 215 0.9974520
## 190 0.1017808
[Exercise] Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 3.
head(bwt_knn_mod_3$learn$X)
## scale(age) scale(lwt) as.numeric(race) smoke1 scale(ptl) ht1 ui1
## 61 0.07216633 -0.881215186 1 1 -0.4172794 0 0
## 67 -0.30887190 0.004023814 3 1 -0.4172794 0 0
## 141 1.21528103 -1.235310786 3 1 -0.4172794 0 0
## 36 0.07216633 0.287300294 3 0 -0.4172794 0 0
## 215 0.26268545 -0.350071786 3 0 -0.4172794 0 0
## 190 1.02476192 0.181071614 3 0 -0.4172794 0 0
## scale(ftv)
## 61 -0.7938904
## 67 0.1017808
## 141 0.9974520
## 36 -0.7938904
## 215 0.9974520
## 190 0.1017808
[Exercise] Calculate test and train RMSE for each model.
calc_rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
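A quick check of this helper on made-up vectors, where every prediction is off by exactly 3 (values chosen only for illustration):
calc_rmse(actual = c(0, 0, 0), predicted = c(3, 3, 3)) # should return 3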
# create model list
mod_list = list(bwt_knn_mod_1, bwt_knn_mod_2, bwt_knn_mod_3)
# get train and test predictions
trn_pred = lapply(mod_list, predict, newdata = bwt_trn_data)
tst_pred = lapply(mod_list, predict, newdata = bwt_tst_data)
# get train and test RMSE
trn_rmse = sapply(trn_pred, calc_rmse, actual = bwt_trn_data$bwt)
tst_rmse = sapply(tst_pred, calc_rmse, actual = bwt_tst_data$bwt)
[Exercise] Summarize these results in a table. (Model, Train/Test RMSE.) Output the results as a well-formatted markdown table.
# create df of results
results = data.frame(
mod = c("Model 1", "Model 2", "Model 3"),
trn_rmse = trn_rmse,
tst_rmse = tst_rmse
)
colnames(results) = c("Model", "Train RMSE", "Test RMSE")
# create results table
kable_styling(kable(results, format = "html", digits = 3), full_width = FALSE)
| Model   | Train RMSE | Test RMSE |
|---------|------------|-----------|
| Model 1 | 638.582    | 720.808   |
| Model 2 | 585.120    | 707.995   |
| Model 3 | 583.951    | 717.642   |
[Exercise] Did preprocessing make a difference?
A little: scaling the numeric predictors (Models 2 and 3) reduced both train and test RMSE, but the gains are small relative to test errors of roughly 700 grams, so nothing all that significant in the context of the problem.
When using \(k\)-nearest neighbors, we say that it is fast at train time, slow at test (predict) time. Let’s see if this is true for the specific implementation used in knnreg().
[Exercise] Use the microbenchmark() function from the microbenchmark package to compare the runtimes of the following two lines.
fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
pred = predict(fit, newdata = bwt_tst_data)
microbenchmark(fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
times = 100, unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq max neval
## fit 1.022332 1.139804 1.434355 1.299095 1.681022 3.097059 100
microbenchmark(pred = predict(fit, newdata = bwt_tst_data),
times = 100, unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq max neval
## pred 0.896311 0.92754 1.095276 0.9828045 1.147853 2.264674 100
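The two timings can also be requested in a single microbenchmark() call, which puts the distributions side by side; a sketch (timings will of course vary by machine):
microbenchmark(fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
               pred = predict(fit, newdata = bwt_tst_data),
               times = 100, unit = "ms")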
[Exercise] Are the results what you expected? If not, try to explain.
The model “fitting” when we call knnreg() involves re-creating the training dataset, that is, applying the preprocessing. For the data we have here, this takes about as long as making predictions on the test data.
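One way to probe this explanation (an addition; it assumes knnreg() accepts a predictor matrix and response vector directly, which caret documents as its matrix method) is to build the design matrix once, outside the timing, so the “fit” is little more than storing the data:
X_trn = model.matrix(bwt ~ . - 1, data = bwt_trn_data) # dummy-encode once, up front
microbenchmark(fit_xy = knnreg(X_trn, y = bwt_trn_data$bwt, k = 5),
               times = 100, unit = "ms")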
[Exercise] Use the microbenchmark() function from the microbenchmark package to compare fitting \(k\)-nearest neighbors with k = 5 to fitting a random forest.
microbenchmark(fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
times = 100, unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq max neval
## fit 1.003668 1.02187 1.118344 1.037106 1.094433 2.007259 100
microbenchmark(fit = randomForest(bwt ~ ., data = bwt_trn_data),
times = 100, unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq max neval
## fit 51.1207 54.28044 64.41487 57.05296 59.55442 297.6403 100
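A natural follow-up (again an addition to the lab) is to compare prediction times: the usual story is that a random forest is slow to fit but its trees predict quickly, while \(k\)-nearest neighbors defers most of its work to prediction. A sketch to check this on the test data:
fit_knn = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
fit_rf = randomForest(bwt ~ ., data = bwt_trn_data)
microbenchmark(knn_pred = predict(fit_knn, newdata = bwt_tst_data),
               rf_pred = predict(fit_rf, newdata = bwt_tst_data),
               times = 100, unit = "ms")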