You can find the solutions rmarkdown file here.

Generate Data

The following code will generate some regression data.

# define the data generating process
simulate_regression_data = function(sample_size = 500) {
  x1 = rnorm(n = sample_size)
  x2 = rnorm(n = sample_size)
  x3 = runif(n = sample_size, min = 0, max = 4)
  x4 = rnorm(n = sample_size, mean = 2, sd = 1.5)
  x5 = runif(n = sample_size)
  f = (2 * x3) + (3 * x4) + (x4 ^ 2)
  eps = rnorm(n = sample_size, sd = 2)
  y = f + eps 
  data.frame(y, x1, x2, x3, x4, x5)
}

Since we are generating the data, we know the true form of \(f({\bf x})\).

\[ f({\bf x}) = 2 x_3 + 3 x_4 + x_4^2 \] Then, the data generating process is

\[ Y = f({\bf x}) + \epsilon \]

where

\[ \epsilon \sim N(0, \sigma^2 = 4) \]

[Exercise] Modify (and run) the following code to generate a dataset of size 1000 from the above data generating process.

# generate an example dataset
set.seed(42)
example_data = simulate_regression_data()

Solution:

# generate an example dataset
set.seed(42)
example_data = simulate_regression_data(sample_size = 1000)

Test-Train Split the Data

The following code takes the dataset we generated and splits it into two dataset of equal size. One for training (trn_data) and one for evaluation (tst_data).

[Exercise] Modify (and run) the following code to create a training set that is roughly 30% of the original dataset. The remaining 70% should be used for the test set. Note that these percentages are completely arbitrary and and for illustrative purposes only. In practice, we’ll give more consideration to the size of the test set. In particular, we’ll consider:

Does it contain enough data to be representative of the population?
Does it leave enough data for training? (Can we estimate the model parameters well?)

set.seed(42)
trn_idx = sample(nrow(example_data), size = trunc(0.50 * nrow(example_data)))
trn_data = example_data[trn_idx, ]
tst_data = example_data[-trn_idx, ]

Solution:

set.seed(42)
trn_idx = sample(nrow(example_data), size = trunc(0.30 * nrow(example_data)))
trn_data = example_data[trn_idx, ]
tst_data = example_data[-trn_idx, ]

Fit Models

[Exercise] Fit the following five models using the training data. Never fit models using test data.

y ~ x3
y ~ x3 + x4
y ~ x3 + poly(x4, 2, raw = TRUE)
y ~ x3 + poly(x4, 8, raw = TRUE)
y ~ x1 + x2 + poly(x3, 10, raw = TRUE) + poly(x4, 10, raw = TRUE) + x5

Solution:

mod_1 = lm(y ~ x3, data = trn_data)
mod_2 = lm(y ~ x3 + x4, data = trn_data)
mod_3 = lm(y ~ x3 + poly(x4, 2, raw = TRUE), data = trn_data)
mod_4 = lm(y ~ x3 + poly(x4, 8, raw = TRUE), data = trn_data)
mod_5 = lm(y ~ x1 + x2 + poly(x3, 10, raw = TRUE) + poly(x4, 10, raw = TRUE) + x5, 
           data = trn_data)

Evaluate Models

We will use RMSE for our metric to evaluate the models we just fit.

\[ \text{RMSE}(\hat{f}, \text{Data}) = \sqrt{\frac{1}{n}\displaystyle\sum_{i = 1}^{n}\left(y_i - \hat{f}(\bf{x}_i)\right)^2} \]

As an R function, this becomes:

rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

where actual is \(y_i\) and predicted is \(\hat{f}(\bf{x}_i)\).

\[ \text{RMSE}_{\text{Train}} = \text{RMSE}(\hat{f}, \text{Train Data}) = \sqrt{\frac{1}{n_{\text{Tr}}}\displaystyle\sum_{i \in \text{Train}}^{}\left(y_i - \hat{f}(\bf{x}_i)\right)^2} \]

\[ \text{RMSE}_{\text{Test}} = \text{RMSE}(\hat{f}, \text{Test Data}) = \sqrt{\frac{1}{n_{\text{Te}}}\displaystyle\sum_{i \in \text{Test}}^{}\left(y_i - \hat{f}(\bf{x}_i)\right)^2} \]

[Exercise] For each of the five models we fit, obtain train and test RMSE as defined above. Based on these metrics, which model performs the best?

Solution:

# example for model 1
rmse(actual = trn_data$y, predicted = predict(mod_1, trn_data)) # train RMSE

## [1] 10.17229

rmse(actual = tst_data$y, predicted = predict(mod_1, tst_data)) # test RMSE

## [1] 11.76528

We could repeat the process above, however, any time you find yourself copy-and-pasting to repeat the same code, you should probably write a function and/or use *apply() type functions.

get_rmse = function(model, data, response) {
  rmse(actual = data[, response], 
       predicted = predict(model, data))
}

get_complexity = function(model) {
  length(coef(model))
}

model_list = list(mod_1, mod_2, mod_3, mod_4, mod_5)

trn_rmse = sapply(model_list, get_rmse, data = trn_data, response = "y")
tst_rmse = sapply(model_list, get_rmse, data = tst_data, response = "y")
model_complexity = sapply(model_list, get_complexity)

Model	Train RMSE	Test RMSE	Predictors
`mod_1`	10.1722881	11.7652842	2
`mod_2`	3.7476653	4.2528501	3
`mod_3`	1.8214745	2.1175768	4
`mod_4`	1.8161304	2.5811559	10
`mod_5`	1.7599179	22.7834649	24

Here we’ve arranged the results as a table. This table shows that mod_3 performs the best. (Note that we only ever consider Test RMSE to make this determination.) This should be no surprise since it is estimating a model of the correct form.

Also, we’ve added a column for model complexity. Since mod_3 performs the best we can say that mod_1 and mod_2 are underfitting. They are less complex and perform worse. Similarly mod_4 and mod_5 are overfitting since they are more complex and perform worse.

Lab 03 - Test-Train Split

STAT 430, Fall 2017

Generate Data

Test-Train Split the Data

Fit Models

Evaluate Models