Lab Goal: The goal of this lab is to dig into the lm() function and highlight the use of formula syntax. The *apply() functions and factor variables are introduced as well.
“See, in my line of work you got to keep repeating things over and over and over again for the truth to sink in, to kind of catapult the propaganda.”
— George W. Bush
You should use the .Rmd file which created this document as a template for the lab. This file can be found in the .zip file that contains all the necessary files for the lab.
For this lab, use only methods available from the following packages:
library(MASS)
library(tidyverse)
library(knitr)
library(kableExtra)
If you haven’t already, make sure each is installed!
For this lab we will use the birthwt data from the MASS package. Our goal in analyzing this data is to predict the birthweight of newborns at Baystate Medical Center, Springfield, Mass. (This data is from 1986…)
data(birthwt)
[Exercise] Read the documentation for this data. Based on our goal, one of the variables should not be used in our analysis. Remove this variable.
# your code here
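A sketch of one possible answer: the low variable is an indicator of birth weight below 2.5 kg, so it is derived directly from the response bwt and should not be used as a predictor.

# remove `low`, since it is computed directly from the response `bwt`
birthwt = subset(birthwt, select = -low)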
[Exercise] Based on reading the documentation, it should be clear that some of the variables are categorical. Coerce these variables to be factor variables.
We will always encode categorical data as factors when using formula syntax to specify a model. For full details on factor variables, see the Factors chapter from Hadley Wickham’s book R for Data Science.
# your code here
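One possible sketch, reading race, smoke, ht, and ui as the categorical variables (ptl and ftv are counts and are left numeric here):

# coerce the categorical predictors to factors
birthwt$race  = factor(birthwt$race, labels = c("white", "black", "other"))
birthwt$smoke = factor(birthwt$smoke)
birthwt$ht    = factor(birthwt$ht)
birthwt$ui    = factor(birthwt$ui)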
[Exercise] After completing the previous preparations, test-train split the data using the following code:
set.seed(42)
bwt_trn_idx = sample(nrow(birthwt), size = trunc(0.70 * nrow(birthwt)))
bwt_trn_data = birthwt[bwt_trn_idx, ]
bwt_tst_data = birthwt[-bwt_trn_idx, ]
Let black, white, other, and smoke each be dummy variables. For example, white is 1 if the mother’s race is white, otherwise it is 0.
[Exercise] Fit (train) a total of five models:
# your code here
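The five model specifications are not reproduced here, so as a purely hypothetical illustration of lm() formula syntax, two fits might look like:

# hypothetical examples only; replace with the lab's five specified models
mod_1 = lm(bwt ~ age + lwt, data = bwt_trn_data)
mod_2 = lm(bwt ~ ., data = bwt_trn_data)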
Consider two evaluation metrics, RMSE and \(R^2\), which can both be evaluated on train or test data:
\[ \text{RMSE}(\hat{f}, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2} \]
\[ R^2(\hat{f}, \mathcal{D}) = 1 - \frac{\sum_{i \in \mathcal{D}} (y_i - \hat{f}(x_i))^2}{\sum_{i \in \mathcal{D}} (y_i - \bar{y})^2} \]
In both cases, \(\hat{f}\) is a model that has been fit to training data, thus \(\hat{f}(x_i)\) is the model’s prediction given the predictors for observation \(i\) from dataset \(\mathcal{D}\), which could be either test or train. Similarly, \(y_i\) is the true value of the response for observation \(i\) from dataset \(\mathcal{D}\).
We supply R functions to calculate both.
# RMSE of a fitted model's predictions for bwt on the supplied data
calc_lm_rmse = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  sqrt(mean((actual - predicted) ^ 2))
}
# R2: proportion of variance in bwt explained by the model on the supplied data
calc_lm_r2 = function(mod, data) {
  actual = data$bwt
  predicted = predict(mod, data)
  1 - ((sum((actual - predicted) ^ 2)) / (sum((actual - mean(actual)) ^ 2)))
}
Note that these functions are specific both to the models we are using and the data we are using. They should not be re-used outside of this lab! Don’t just copy and paste them into homework and expect them to do anything useful!
[Exercise] Calculate test and train RMSE for each model.
# your code here
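One possible sketch, assuming the five fitted models are collected in a list named mod_list (a hypothetical name):

# mod_list = list(mod_1, mod_2, mod_3, mod_4, mod_5)
trn_rmse = sapply(mod_list, calc_lm_rmse, data = bwt_trn_data)
tst_rmse = sapply(mod_list, calc_lm_rmse, data = bwt_tst_data)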
[Exercise] Calculate test and train \(R^2\) for each model.
# your code here
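Similarly, again assuming the hypothetical mod_list:

trn_r2 = sapply(mod_list, calc_lm_r2, data = bwt_trn_data)
tst_r2 = sapply(mod_list, calc_lm_r2, data = bwt_tst_data)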
[Exercise] Summarize these results in a table. (Model, Train/Test RMSE, Train/Test \(R^2\).) Output the results as a well-formatted markdown table.
# your code here
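One possible sketch, built from the hypothetical vectors computed above:

# collect the metrics in a data frame, then print as a markdown table
results = data.frame(
  Model        = paste("Model", 1:5),
  `Train RMSE` = trn_rmse,
  `Test RMSE`  = tst_rmse,
  `Train R2`   = trn_r2,
  `Test R2`    = tst_r2,
  check.names  = FALSE
)
kable(results, format = "markdown", digits = 3)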
[Exercise] You might have noticed that you would select the same model using RMSE or \(R^2\). So \(R^2\) is as good as RMSE, right?
[Exercise] Create a 95% prediction interval for the first observation in the test dataset using the best model of these five. Do you feel that this interval is valid?
# your code here
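A sketch, assuming the chosen model is stored as best_mod (a hypothetical name):

# 95% prediction interval for the first observation in the test data
predict(best_mod, newdata = bwt_tst_data[1, ], interval = "prediction", level = 0.95)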