Packages

For this lab, use only methods available from the following packages:

library(MASS)
library(tidyverse)
library(knitr)
library(kableExtra)

If you haven’t already, make sure each is installed!
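
If any of these are missing, one common idiom installs only the absent ones. A minimal sketch (package names as above):

# install any of the lab's packages that are not already present
pkgs = c("MASS", "tidyverse", "knitr", "kableExtra")
install.packages(pkgs[!pkgs %in% rownames(installed.packages())])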


Data

For this lab we will use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)

data(birthwt)

[Exercise] Read the documentation for this data. Based on our goal, one of the variables should not be used in our analysis. Remove this variable.

# low is an indicator of birth weight < 2.5 kg, derived directly from the
# response bwt, so keeping it would leak the quantity we want to predict
birthwt = subset(birthwt, select = -c(low))

[Exercise] Based on reading the documentation, it should be clear that some of the variables are categorical. Coerce these variables to be factor variables.

We will always encode categorical data as factors when using formula syntax to specify a model. For full details on factor variables, see the Factors chapter from Hadley Wickham’s book R for Data Science.

# in the raw data, race is coded 1 = white, 2 = black, 3 = other
birthwt$race = factor(ifelse(birthwt$race == 1, "white", 
                             ifelse(birthwt$race == 2, "black", "other")))
birthwt$smoke = factor(birthwt$smoke)
birthwt$ht = factor(birthwt$ht)
birthwt$ui = factor(birthwt$ui)
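
A quick way to confirm the coercion worked is to inspect the structure of the data frame:

# race, smoke, ht, and ui should now show as Factor columns
str(birthwt)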

[Exercise] After completing the previous preparations, test-train split the data using the following code:

set.seed(42)
bwt_trn_idx  = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]
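
As a sanity check, the split should be roughly 70/30; with 189 rows in birthwt, trunc(0.70 * 189) gives 132 training rows, leaving 57 for testing:

# verify the split: 132 train rows, 57 test rows
nrow(bwt_trn_data)
nrow(bwt_tst_data)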

Model Training

Let black, white, other, and smoke each be dummy variables. For example, white is 1 if the mother’s race is white; otherwise it is 0.
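
R creates this dummy coding automatically whenever a factor appears in a model formula; one level of each factor is absorbed into the intercept as the reference level. You can inspect the expansion with model.matrix(). A minimal sketch using the training data:

# show the indicator columns R builds for the race and smoke factors
# (black is the reference level for race, so only raceother and racewhite appear)
head(model.matrix(bwt ~ race + smoke, data = bwt_trn_data))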

[Exercise] Fit (train) a total of five models:

bwt_mod_1 = lm(bwt ~ ., data = bwt_trn_data)
bwt_mod_2 = lm(bwt ~ . ^ 2, data = bwt_trn_data)
bwt_mod_3 = lm(bwt ~ age + lwt, data = bwt_trn_data)
bwt_mod_4 = lm(bwt ~ age + lwt + race, data = bwt_trn_data)
bwt_mod_5 = lm(bwt ~ age * lwt * smoke, data = bwt_trn_data)
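
Before evaluating these models, note how different they are in size; the two-way interaction model in particular estimates many more coefficients than the others, which foreshadows the overfitting we see below. A quick check:

# count the coefficients estimated by each model
sapply(list(bwt_mod_1, bwt_mod_2, bwt_mod_3, bwt_mod_4, bwt_mod_5),
       function(mod) length(coef(mod)))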

Model Evaluation

Consider two evaluation metrics, RMSE and \(R^2\), which can both be evaluated on train or test data:

\[ \text{RMSE}(\hat{f}, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2} \]

\[ R^2(\hat{f}, \mathcal{D}) = 1 - \frac{\sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2}{\sum_{i \in \mathcal{D}} (y_i - \bar{y})^2} \]

In both cases, \(\hat{f}\) is a model that has been fit to training data, thus \(\hat{f}(x_i)\) is the model’s prediction given the predictors for observation \(i\) from dataset \(\mathcal{D}\), which could be either test or train. Similarly, \(y_i\) is the true value of the response for observation \(i\) from dataset \(\mathcal{D}\).

We supply R functions to calculate both.

# RMSE
calc_lm_rmse = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  sqrt(mean((actual - predicted) ^ 2))
}

# R2
calc_lm_r2 = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  1 - ((sum((actual - predicted) ^ 2)) / (sum((actual - mean(actual)) ^ 2)))
}

Note that these functions are specific both to the models we are using and the data we are using. They should not be re-used outside of this lab! Don’t just copy and paste them into homework and expect them to do anything useful!
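
If you do want a reusable metric, the usual pattern is to pass vectors of actual and predicted values instead of hard-coding the response column. A minimal sketch (the name calc_rmse is our own, not from any package):

# a response-agnostic RMSE that works for any actual/predicted pair
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# example: reproduces calc_lm_rmse(bwt_mod_1, bwt_trn_data)
calc_rmse(actual = bwt_trn_data$bwt,
          predicted = predict(bwt_mod_1, bwt_trn_data))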

[Exercise] Calculate test and train RMSE for each model.

# create model list
mod_list = list(bwt_mod_1, bwt_mod_2, bwt_mod_3, bwt_mod_4, bwt_mod_5)
trn_rmse = sapply(mod_list, calc_lm_rmse, data = bwt_trn_data)
tst_rmse = sapply(mod_list, calc_lm_rmse, data = bwt_tst_data)

Here we used the sapply() function to apply calc_lm_rmse() to each model in the list; it is a compact replacement for the explicit loop sketched below.
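
A sketch of that loop, for comparison (trn_rmse_loop is our own illustrative name):

# explicit-loop equivalent of the sapply() call for train RMSE
trn_rmse_loop = rep(0, length(mod_list))
for (i in seq_along(mod_list)) {
  trn_rmse_loop[i] = calc_lm_rmse(mod_list[[i]], data = bwt_trn_data)
}
all.equal(trn_rmse, trn_rmse_loop)  # should be TRUE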

[Exercise] Calculate test and train \(R^2\) for each model.

# calculate train and test R^2 using the model list created above
trn_r2 = sapply(mod_list, calc_lm_r2, data = bwt_trn_data)
tst_r2 = sapply(mod_list, calc_lm_r2, data = bwt_tst_data)

[Exercise] Summarize these results in a table. (Model, Train/Test RMSE, Train/Test \(R^2\).) Output the results as a well-formatted table.

# create df of results
results = data.frame(
  mod = c("`bwt ~ .`", "`bwt ~ . ^ 2`", "`bwt ~ age + lwt`", 
          "`bwt ~ age + lwt + race`", "`bwt ~ age * lwt * smoke`"),
  trn_rmse = trn_rmse,
  tst_rmse = tst_rmse,
  trn_r2 = trn_r2,
  tst_r2 = tst_r2
)
colnames(results) = c("Model", "Train RMSE", "Test RMSE", "Train R2", "Test R2")

# create results table
kable_styling(kable(results, format = "html", digits = 3), full_width = FALSE)
Model                       Train RMSE   Test RMSE   Train R2   Test R2
bwt ~ .                        610.066     700.586      0.320    -0.008
bwt ~ . ^ 2                    509.609    1223.517      0.525    -2.076
bwt ~ age + lwt                716.231     722.785      0.062    -0.073
bwt ~ age + lwt + race         687.964     731.563      0.135    -0.100
bwt ~ age * lwt * smoke        699.546     684.335      0.106     0.038
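
Rather than reading the winner off the table, you can also select it programmatically; a minimal sketch using the vectors computed above:

# model 5 (bwt ~ age * lwt * smoke) has the lowest test RMSE
which.min(tst_rmse)
min(tst_rmse)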

Results?

[Exercise] You might have noticed that you would select the same model using RMSE or \(R^2\). So \(R^2\) is as good as RMSE, right?

No!!! \(R^2\) is a unitless quantity, whereas RMSE, because of the square root, estimates the typical size of a prediction error in the original units of the response variable, here grams, which makes it directly interpretable.
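
One way to see the difference is to rescale the response and watch each metric: RMSE changes with the units, while \(R^2\) does not. A minimal sketch that refits the best model with bwt in kilograms instead of grams (bwt_trn_kg and bwt_mod_5_kg are our own illustrative names):

# refit model 5 with the response converted from grams to kilograms
bwt_trn_kg = bwt_trn_data
bwt_trn_kg$bwt = bwt_trn_kg$bwt / 1000
bwt_mod_5_kg = lm(bwt ~ age * lwt * smoke, data = bwt_trn_kg)

# train RMSE shrinks by a factor of 1000; train R^2 is unchanged
calc_lm_rmse(bwt_mod_5_kg, bwt_trn_kg)
calc_lm_r2(bwt_mod_5_kg, bwt_trn_kg)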

[Exercise] Create a 95% prediction interval for the first observation in the test dataset using the best model of these five. Do you feel that this interval is valid?

predict(bwt_mod_5, bwt_tst_data[1, ], interval = "prediction", level = 0.95)
##         fit      lwr      upr
## 87 2561.925 1097.957 4025.893
# diagnostic plots for the chosen model
par(mfrow = c(2, 2))
plot(bwt_mod_5)

The diagnostics look reasonable, so we don’t have any reason to doubt the interval’s validity. However, with an interval this wide, spanning nearly 3000 grams, we should question the usefulness of our predictions!