Lab Goal: Create a test-train split of the data. Use test metrics to assess how well a model performs.
You can use the .Rmd file that created this document as a template for this lab.
The following code will generate some regression data.
# define the data generating process
simulate_regression_data = function(sample_size = 500) {
  x1 = rnorm(n = sample_size)
  x2 = rnorm(n = sample_size)
  x3 = runif(n = sample_size, min = 0, max = 4)
  x4 = rnorm(n = sample_size, mean = 2, sd = 1.5)
  x5 = runif(n = sample_size)
  f = (2 * x3) + (3 * x4) + (x4 ^ 2)
  eps = rnorm(n = sample_size, sd = 2)
  y = f + eps
  data.frame(y, x1, x2, x3, x4, x5)
}
Since we are generating the data, we know the true form of \(f({\bf x})\).
\[ f({\bf x}) = 2 x_3 + 3 x_4 + x_4^2 \]
Then, the data generating process is
\[ Y = f({\bf x}) + \epsilon \]
where
\[ \epsilon \sim N(0, \sigma^2 = 4) \]
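As a quick sanity check on this data generating process, one can recompute \(f({\bf x})\) from the simulated predictors and confirm that what remains looks like noise with standard deviation close to 2. The sketch below repeats the simulation function so that it runs on its own; the names true_f, check_data, and noise are introduced here for illustration only and are not part of the lab code.

```r
# repeat the data generating process so this block is self-contained
simulate_regression_data = function(sample_size = 500) {
  x1 = rnorm(n = sample_size)
  x2 = rnorm(n = sample_size)
  x3 = runif(n = sample_size, min = 0, max = 4)
  x4 = rnorm(n = sample_size, mean = 2, sd = 1.5)
  x5 = runif(n = sample_size)
  f = (2 * x3) + (3 * x4) + (x4 ^ 2)
  eps = rnorm(n = sample_size, sd = 2)
  y = f + eps
  data.frame(y, x1, x2, x3, x4, x5)
}

# hypothetical helper: the true regression function f(x)
true_f = function(data) {
  (2 * data$x3) + (3 * data$x4) + (data$x4 ^ 2)
}

# a large sample, used only for this check
set.seed(1)
check_data = simulate_regression_data(sample_size = 10000)

# the noise is whatever remains after removing the signal
noise = check_data$y - true_f(check_data)
mean(noise) # should be near 0
sd(noise)   # should be near 2, so the variance is near 4
```

Note that x1, x2, and x5 are simulated but never enter \(f({\bf x})\), so they carry no information about the response.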
[Exercise] Modify (and run) the following code to generate a dataset of size 1000 from the above data generating process.
# generate an example dataset
set.seed(42)
example_data = simulate_regression_data()
The following code takes the dataset we generated and splits it into two datasets of equal size: one for training (trn_data) and one for evaluation (tst_data).
[Exercise] Modify (and run) the following code to create a training set that is roughly 30% of the original dataset. The remaining 70% should be used for the test set. Note that these percentages are completely arbitrary and are for illustrative purposes only. In practice, we’ll give more consideration to the size of the test set. In particular, we’ll consider:
set.seed(42)
trn_idx = sample(nrow(example_data), size = trunc(0.50 * nrow(example_data)))
trn_data = example_data[trn_idx, ]
tst_data = example_data[-trn_idx, ]
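The index-based split above works because negative indexing drops exactly the rows that were sampled for training, so the two sets never overlap. A minimal sketch on a made-up data frame shows how this could be checked; toy_data, toy_idx, toy_trn, and toy_tst are hypothetical names, and the 25% training fraction is arbitrary.

```r
# a small made-up dataset with an id column so rows can be tracked
toy_data = data.frame(id = 1:20, value = (1:20) ^ 2)

# sample row indices for training, then use the rest for testing
set.seed(42)
toy_idx = sample(nrow(toy_data), size = trunc(0.25 * nrow(toy_data)))
toy_trn = toy_data[toy_idx, ]
toy_tst = toy_data[-toy_idx, ]

# sizes add up to the original, and no row appears in both sets
nrow(toy_trn)
nrow(toy_tst)
length(intersect(toy_trn$id, toy_tst$id))
```

Here the training set has 5 rows, the test set has 15, and the intersection of the two sets of ids is empty.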
[Exercise] Fit the following five models using the training data. Never fit models using test data.
y ~ x3
y ~ x3 + x4
y ~ x3 + poly(x4, 2, raw = TRUE)
y ~ x3 + poly(x4, 8, raw = TRUE)
y ~ x1 + x2 + poly(x3, 10, raw = TRUE) + poly(x4, 10, raw = TRUE) + x5
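Several of these formulas use poly() with raw = TRUE, which adds the plain powers of a variable as predictors. As a sketch of what that means, the code below fits a degree-2 model two ways on made-up data and compares the results; toy, fit_poly, and fit_manual are names assumed here for illustration.

```r
# made-up data, loosely mimicking x3 and x4 from the lab
set.seed(42)
toy = data.frame(
  x3 = runif(100, min = 0, max = 4),
  x4 = rnorm(100, mean = 2, sd = 1.5)
)
toy$y = 2 * toy$x3 + 3 * toy$x4 + toy$x4 ^ 2 + rnorm(100, sd = 2)

# degree-2 polynomial in x4, written with poly()
fit_poly = lm(y ~ x3 + poly(x4, 2, raw = TRUE), data = toy)

# the same model with the squared term written out by hand
fit_manual = lm(y ~ x3 + x4 + I(x4 ^ 2), data = toy)

# coefficient names differ, but the estimates and fitted values agree
all.equal(unname(coef(fit_poly)), unname(coef(fit_manual)))
all.equal(fitted(fit_poly), fitted(fit_manual))
```

Both comparisons return TRUE, so poly(x4, 2, raw = TRUE) is simply a compact way to write x4 + I(x4 ^ 2). The same shorthand scales to the degree-8 and degree-10 terms in the later formulas.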
We will use root mean squared error (RMSE) as our metric to evaluate the models we just fit.
\[ \text{RMSE}(\hat{f}, \text{Data}) = \sqrt{\frac{1}{n}\displaystyle\sum_{i = 1}^{n}\left(y_i - \hat{f}({\bf x}_i)\right)^2} \]
As an R function, this becomes:
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
where actual is \(y_i\) and predicted is \(\hat{f}({\bf x}_i)\).
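To make the formula concrete, here is a small worked example with made-up numbers. The vectors toy_actual and toy_predicted are hypothetical; the function is repeated so the block runs on its own.

```r
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# three made-up observations and predictions
toy_actual = c(1, 2, 3)
toy_predicted = c(1, 2, 5)

# errors are 0, 0, -2, so the squared errors are 0, 0, 4
# their mean is 4 / 3, and the square root of that is the RMSE
toy_rmse = rmse(actual = toy_actual, predicted = toy_predicted)
toy_rmse # sqrt(4 / 3), roughly 1.155
```

Because of the square root, RMSE is on the same scale as the response, which makes it easier to interpret than the mean squared error.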
\[ \text{RMSE}_{\text{Train}} = \text{RMSE}(\hat{f}, \text{Train Data}) = \sqrt{\frac{1}{n_{\text{Tr}}}\displaystyle\sum_{i \in \text{Train}}\left(y_i - \hat{f}({\bf x}_i)\right)^2} \]
\[ \text{RMSE}_{\text{Test}} = \text{RMSE}(\hat{f}, \text{Test Data}) = \sqrt{\frac{1}{n_{\text{Te}}}\displaystyle\sum_{i \in \text{Test}}\left(y_i - \hat{f}({\bf x}_i)\right)^2} \]
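In both definitions \(\hat{f}\) is the same model, fit once to the training data; only the data used for evaluation changes. One way this could look in R is sketched below on a made-up single-predictor dataset, using predict() with the newdata argument. The names toy_data, toy_trn, toy_tst, toy_fit, trn_rmse, and tst_rmse are assumptions for illustration, not the lab's models.

```r
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# made-up data: a linear signal plus noise with standard deviation 0.5
set.seed(42)
toy_data = data.frame(x = runif(200))
toy_data$y = 1 + 2 * toy_data$x + rnorm(200, sd = 0.5)

# split the rows in half
toy_idx = sample(nrow(toy_data), size = trunc(0.50 * nrow(toy_data)))
toy_trn = toy_data[toy_idx, ]
toy_tst = toy_data[-toy_idx, ]

# fit using the training data only
toy_fit = lm(y ~ x, data = toy_trn)

# evaluate the same fitted model on each dataset
trn_rmse = rmse(actual = toy_trn$y,
                predicted = predict(toy_fit, newdata = toy_trn))
tst_rmse = rmse(actual = toy_tst$y,
                predicted = predict(toy_fit, newdata = toy_tst))

c(train = trn_rmse, test = tst_rmse)
```

Since this toy model matches the form of the true signal, both values should land near the noise standard deviation of 0.5.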
[Exercise] For each of the five models we fit, obtain train and test RMSE as defined above. Based on these metrics, which model performs the best?