Lab Goal: The goal of this lab is to dig into the lm() function and highlight the use of formula syntax. *apply() functions and factor variables are introduced as well.


“See, in my line of work you got to keep repeating things over and over and over again for the truth to sink in, to kind of catapult the propaganda.”

George W. Bush


You should use the .Rmd file which created this document as a template for the lab. This file can be found in the .zip file that contains all the necessary files for the lab.


Packages

For this lab, use only methods availible from the following packges:

library(MASS)
library(tidyverse)
library(knitr)
library(kableExtra)

If you haven’t already, make sure each is installed!


Data

For this lab we will use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)

data(birthwt)

[Exercise] Read the documentation for this data. Based on our goal, one of the variables should not be used in our analysis. Remove this variable.

# your code here

[Exercise] Based on reading the documentation, it should be clear that some of the variables are categorical. Coerce these variables to be factor variables.

We will always encode categorical data as factors when using formula syntax to specify a model. For full details on factor variables, see the Factors chapter from Hadley Wickham’s book R for Data Science.

# your code here

[Exercise] After completing the previous preparations, test-train split the data using the following code:

set.seed(42)
bwt_trn_idx  = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]

Model Training

Let black, white, other, and smoke each be dummy variables. For example white is 1 is the mother’s race is white, otherwise it is 0.

[Exercise] Fit (train) a total of five models:

# your code here

Model Evaluation

Consider two evaluation metrics, RMSE and \(R^2\), which can both be evaluated on train or test data:

\[ \text{RMSE}(\hat{f}, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2} \]

\[ R^2(\hat{f}, \mathcal{D}) = 1 - \frac{\sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2}{\sum_{i \in \mathcal{D}} (y_i - \bar{y})^2} \]

In both cases, \(\hat{f}\) is a model that has been fit to training data, thus \(\hat{f}(x_i)\) is the model’s prediction given the predictors for observation \(i\) from dataset \(\mathcal{D}\), which could be either test or train. Similiarly \(y_i\) is the true value of the response for observation \(i\) from dataset \(\mathcal{D}\)

We supply R functions to calculate both.

# RMSE
calc_lm_rmse = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  sqrt(mean((actual - predicted) ^ 2))
}

# R2
calc_lm_r2 = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  1 - ((sum((actual - predicted) ^ 2)) / (sum((actual - mean(actual)) ^ 2)))
}

Note that these functions are specific both to the models we are using and the data we are using. They should not be re-used outside of this lab! Don’t just copy and paste them into homework and expect them to do anything useful!

[Exercise] Calculate test and train RMSE for each model.

# your code here

[Exercise] Calculate test and train \(R^2\) for each model.

# your code here

[Exercise] Summarize these results in a table. (Model, Train/Test RMSE, Train/Test \(R^2\).) Output the results as a well-formatted markdown table.

# your code here

Results?

[Exercise] You might have noticed that you would select the same model using RMSE or \(R^2\). So \(R^2\) is as good as RMSE, right?

[Exercise] Create a 95% prediction interval for the first observation in the test dataset using the best model of these five. Do you feel that this interval is valid?

# your code here