
**Lab Goal:** The goal of this lab is to dig into the `knnreg()` function from the `caret` package. This version of \(k\)-nearest neighbors will allow us to use formula syntax!

> “Rarely is the question asked: Is our children learning?”
>
> — George W. Bush

You should use the `.Rmd` file which created this document as a template for the lab. This file can be found in the `.zip` file that contains all the necessary files for the lab.

For this lab, use only methods available from the following packages:

```
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(microbenchmark)
library(randomForest)
```

If you haven’t already, make sure each is installed!

For this lab we will again use the `birthwt` data from the `MASS` package. Our **goal** in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)

`data(birthwt)`

Last time we read the documentation for this data and, based on our goal, dropped the `low` variable from the dataset. (It doesn’t make sense to include a variable indicating low birthweight if our goal is to predict birthweight. That would be of no use in practice.)

`birthwt = subset(birthwt, select = -c(low))`

We also coerced certain variables to be factor variables, as it was clear that they were categorical variables.

```
birthwt$race = factor(ifelse(birthwt$race == 1, "white",
                      ifelse(birthwt$race == 2, "black", "other")))
birthwt$smoke = factor(birthwt$smoke)
birthwt$ht = factor(birthwt$ht)
birthwt$ui = factor(birthwt$ui)
```

Finally, we test-train split the data.

```
set.seed(42)
bwt_trn_idx = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]
```

**[Exercise]** Train three “different” \(k\)-nearest neighbors models, each with `k = 5`. The “difference” will be in what we call the preprocessing of the data. Note that we won’t actually modify the data, but we will use formula syntax to handle the preprocessing within the model fitting.

**Model 1**

- Numeric variables not scaled.
- Factor variables remain factors. `knnreg()` will use one-hot encoding.

**Model 2**

- Numeric variables are scaled to have mean 0 and standard deviation 1.
- Factor variables remain factors. `knnreg()` will use one-hot encoding.

**Model 3**

- Numeric variables are scaled to have mean 0 and standard deviation 1.
- Coerce the `race` variable to be numeric. All other factors remain unchanged.

`# your code here`
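One possible sketch, handling all preprocessing through the formula. The object names `knn_mod_1` through `knn_mod_3` and the use of `scale()` and `as.numeric()` inside the formula are illustrative choices, not the only way to do this:

```
# a sketch: preprocessing handled entirely through formula syntax

# model 1: numeric predictors as-is, factors remain factors
knn_mod_1 = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)

# model 2: numeric predictors scaled, factors remain factors
knn_mod_2 = knnreg(bwt ~ scale(age) + scale(lwt) + scale(ptl) + scale(ftv) +
                     race + smoke + ht + ui,
                   data = bwt_trn_data, k = 5)

# model 3: numeric predictors scaled, race coerced to numeric
knn_mod_3 = knnreg(bwt ~ scale(age) + scale(lwt) + scale(ptl) + scale(ftv) +
                     as.numeric(race) + smoke + ht + ui,
                   data = bwt_trn_data, k = 5)
```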

**[Exercise]** Output the first 6 rows of the training data.

`# your code here`
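A minimal sketch, assuming the split above produced `bwt_trn_data`:

```
# first six rows of the training data
head(bwt_trn_data)
```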

**[Exercise]** Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 1.

`# your code here`
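One way to inspect this is through the fitted object: `knnreg()` keeps the processed predictor matrix it was trained on. Note that `$learn$X` is an internal detail of `caret` and could change between versions; `knn_mod_1` is the assumed name from the earlier sketch:

```
# design matrix actually seen by the Model 1 fit
head(knn_mod_1$learn$X)
```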

**[Exercise]** Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 2.

`# your code here`
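The same pattern, assuming the Model 2 fit is named `knn_mod_2`:

```
head(knn_mod_2$learn$X)
```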

**[Exercise]** Output the first 6 rows of the \(X\) data (predictor data frame) supplied to the \(k\)-nearest neighbors algorithm in Model 3.

`# your code here`
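And likewise for Model 3, assuming the name `knn_mod_3`:

```
head(knn_mod_3$learn$X)
```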

**[Exercise]** Calculate test and train RMSE for each model.

`# your code here`
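A sketch, again assuming the fits are named `knn_mod_1` through `knn_mod_3`; the helper `calc_rmse()` is defined here for illustration:

```
# root mean squared error helper
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

knn_mods = list(knn_mod_1, knn_mod_2, knn_mod_3)

# train RMSE for each model
knn_trn_rmse = sapply(knn_mods, function(mod)
  calc_rmse(actual = bwt_trn_data$bwt,
            predicted = predict(mod, bwt_trn_data)))

# test RMSE for each model
knn_tst_rmse = sapply(knn_mods, function(mod)
  calc_rmse(actual = bwt_tst_data$bwt,
            predicted = predict(mod, bwt_tst_data)))
```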

**[Exercise]** Summarize these results in a table. (Model, Train/Test RMSE.) Output the results as a well-formatted markdown table.

`# your code here`
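For example, building on the vectors from the previous sketch, `kable()` from the `knitr` package renders a markdown table:

```
knn_results = data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  `Train RMSE` = knn_trn_rmse,
  `Test RMSE` = knn_tst_rmse,
  check.names = FALSE
)
kable(knn_results, digits = 2)
```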

**[Exercise]** Did preprocessing make a difference?

When using \(k\)-nearest neighbors, we say that it is fast at train time, slow at test (predict) time. Let’s see if this is true for the specific implementation used in `knnreg()`.

**[Exercise]** Use the `microbenchmark()` function from the `microbenchmark` package to compare the runtimes of the following two lines.

```
fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
pred = predict(fit, newdata = bwt_tst_data)
```

`# your code here`
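One possible setup; note that `fit` is created once beforehand so the `predict()` expression can be timed on its own:

```
# time training and prediction separately
fit = knnreg(bwt ~ ., data = bwt_trn_data, k = 5)
microbenchmark(
  "train"   = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
  "predict" = predict(fit, newdata = bwt_tst_data)
)
```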

**[Exercise]** Are the results what you expected? If not, try to explain.

**[Exercise]** Use the `microbenchmark()` function from the `microbenchmark` package to compare fitting \(k\)-nearest neighbors with `k = 5` to fitting a random forest.

`# your code here`
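A sketch; `times = 10` is an arbitrary reduction from the default of 100, since random forests take much longer to fit:

```
# compare fit times: k-nearest neighbors vs a default random forest
microbenchmark(
  "knn" = knnreg(bwt ~ ., data = bwt_trn_data, k = 5),
  "rf"  = randomForest(bwt ~ ., data = bwt_trn_data),
  times = 10
)
```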