For this lab, we will use the airquality dataset, which is one of the datasets built into R.

During Lecture

Data Cleaning

This dataset contains some missing data. For simplicity, we will remove the rows that contain missing values. Think about why this may or may not be a reasonable thing to do. (We’ll return to this idea later. For now we want to focus on modeling.)

airquality_cleaned = na.omit(airquality)

Test-Train Split

Now we want to test-train split the data. That is, we want a training dataset for fitting our models and a testing dataset for evaluating them. Since the split is based on randomly selected observations, we first set a seed value so the same split can be reproduced.

set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]

[Exercise] How many observations are used in the test set?

nrow(tst_data)
## [1] 34

EDA

We’ve already started working with this data, but we should really take a step back and ask ourselves a question: what is this data? Whenever you ask yourself this question, you should “look” at the data, which means doing three things:

  • Read the metadata, in this case the R documentation.
    • Where did this data come from?
    • What is an observation in this dataset?
    • What are the variables in this dataset?
  • View the data in tabular form. This can be done by clicking the dataset in the RStudio Environment panel, or by using the View() function on the dataset. (See the code sketch after this list.)
    • What is the type of each variable?
    • Are categorical variables coded as factors?
  • Plot the data.
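For instance, the first two steps might look something like the following sketch, which assumes the cleaned dataset from above is still in the workspace:

?airquality                 # metadata: source, what an observation is, variable definitions
str(airquality_cleaned)     # type of each variable, and whether any are coded as factors
# View(airquality_cleaned)  # tabular view in RStudio (interactive, so not run when rendering)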

[Exercise] Create a plot that shows all possible scatterplots of two variables in the training dataset.

plot(trn_data, col = "darkgrey")

[Exercise] Since we will be focusing on predicting Ozone using Temp, create a scatterplot that shows only this relationship using the training data.

plot(Ozone ~ Temp, data = trn_data, col = "darkgrey", pch = 20,
     main = "Air Quality in New York, 1973: Ozone (ppb) vs Temp (F)")
grid()

Polynomial Models

[Exercise] Fit a total of five polynomial models to the training data that can be used to predict Ozone from Temp. Use polynomial degrees from 1 to 5.

poly_fit_1 = lm(Ozone ~ poly(Temp, 1), data = trn_data)
poly_fit_2 = lm(Ozone ~ poly(Temp, 2), data = trn_data)
poly_fit_3 = lm(Ozone ~ poly(Temp, 3), data = trn_data)
poly_fit_4 = lm(Ozone ~ poly(Temp, 4), data = trn_data)
poly_fit_5 = lm(Ozone ~ poly(Temp, 5), data = trn_data)

[Exercise] Predict Ozone for a temperature of 89 degrees Fahrenheit using the degree three polynomial model.

predict(poly_fit_3, data.frame(Temp = 89))
##        1 
## 74.40082

KNN Models

[Exercise] Use KNN with k = 5 to make predictions for each of the observations in both the train and test datasets. Store the results in vectors named knn_pred_trn and knn_pred_tst.

To do so you will need the knn.reg() function from the FNN package. The knn.reg() function works very differently from lm(). Check the documentation!

library(FNN)
knn_pred_trn = knn.reg(train = trn_data["Temp"], test = trn_data["Temp"], y = trn_data["Ozone"], k = 5)$pred
knn_pred_tst = knn.reg(train = trn_data["Temp"], test = tst_data["Temp"], y = trn_data["Ozone"], k = 5)$pred

Results

[Exercise] Calculate both train and test RMSE for the KNN model above using the predictions you have stored. You might find the provided calc_rmse() function useful.

calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# train RMSE
calc_rmse(actual = trn_data$Ozone, predicted = knn_pred_trn)
## [1] 20.03146
# test RMSE
calc_rmse(actual = tst_data$Ozone, predicted = knn_pred_tst)
## [1] 21.85557

Table

[Exercise] Create a table that summarizes the results from each model fit. (The five polynomial models and the single KNN model.) For each model note the type of model, the value of the tuning parameter, the train RMSE, and the test RMSE. (Consider the polynomial degree a tuning parameter.) The result should be a table with a header, six rows, and four columns. In the final rendered document, hide the code used to create the table.

Hint: First create a data frame, then use the kable() function from the knitr package. For fake bonus points, use the kable_styling() function from the kableExtra package to control the width of the table output.

Model Type   Tuning Parameter   Train RMSE   Test RMSE
poly         1                  23.98        23.08
poly         2                  22.60        21.96
poly         3                  22.58        21.75
poly         4                  22.04        20.87
poly         5                  21.92        21.15
knn          5                  20.03        21.86
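One possible way to assemble this table, shown here as a sketch (in the rendered document the chunk would use the echo = FALSE chunk option to hide the code), is to compute the train and test RMSE for each fitted model and collect everything in a data frame:

poly_fits = list(poly_fit_1, poly_fit_2, poly_fit_3, poly_fit_4, poly_fit_5)

# train and test RMSE for each polynomial model
poly_trn_rmse = sapply(poly_fits, function(fit)
  calc_rmse(actual = trn_data$Ozone, predicted = predict(fit, trn_data)))
poly_tst_rmse = sapply(poly_fits, function(fit)
  calc_rmse(actual = tst_data$Ozone, predicted = predict(fit, tst_data)))

results = data.frame(
  "Model Type"       = c(rep("poly", 5), "knn"),
  "Tuning Parameter" = c(1:5, 5),
  "Train RMSE"       = c(poly_trn_rmse, calc_rmse(trn_data$Ozone, knn_pred_trn)),
  "Test RMSE"        = c(poly_tst_rmse, calc_rmse(tst_data$Ozone, knn_pred_tst)),
  check.names = FALSE
)

knitr::kable(results, digits = 2)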

After Lecture

Plot Results

[Exercise] Recreate the scatterplot of Ozone versus Temp from above. Add to this plot the polynomial model that performed best, as well as the fitted KNN model. Can you center this plot in the rendered document? Again, hide the code used to create the plot.
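A sketch of one way to do this follows, assuming the degree 4 polynomial (the lowest test RMSE in the table above) is the “best” polynomial model; centering and hiding the code would be handled with the chunk options fig.align = "center" and echo = FALSE:

plot(Ozone ~ Temp, data = trn_data, col = "darkgrey", pch = 20,
     main = "Air Quality in New York, 1973: Ozone (ppb) vs Temp (F)")
grid()

temp_grid = seq(min(trn_data$Temp), max(trn_data$Temp), by = 0.1)

# degree 4 polynomial fit
lines(temp_grid, predict(poly_fit_4, data.frame(Temp = temp_grid)),
      col = "dodgerblue", lwd = 2)

# KNN fit with k = 5
lines(temp_grid,
      knn.reg(train = trn_data["Temp"], test = data.frame(Temp = temp_grid),
              y = trn_data["Ozone"], k = 5)$pred,
      col = "darkorange", lwd = 2)

legend("topleft", legend = c("Polynomial, degree 4", "KNN, k = 5"),
       col = c("dodgerblue", "darkorange"), lwd = 2)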

Multiple Predictors

So far we’ve only been using one of the available predictors. Why not use them all? (Maybe we should only use some of them though… We’ll return to this thought later.)

[Exercise] Fit an additive linear model with Ozone as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models?

add_mod = lm(Ozone ~ ., data = trn_data)
calc_rmse(actual = tst_data$Ozone,
          predicted = predict(add_mod, tst_data))
## [1] 18.87183

This model outperforms each of the previous models.

Random Forest

[Exercise] Fit a random forest with Ozone as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models? To do so, use the randomForest() function from the randomForest package. Check the documentation, but its syntax is very similar to that of lm().

library(randomForest)
rf_mod = randomForest(Ozone ~ ., data = trn_data)
calc_rmse(actual = tst_data$Ozone,
          predicted = predict(rf_mod, tst_data))
## [1] 17.63263

This model again outperforms each of the previous models. Random forest is a method that we will build toward understanding throughout the course.

Motivation

We’ve skipped one big question when analyzing this data. We’ve fit a bunch of models with the goal of predicting the Ozone variable, but why? Why are predictions useful in this situation? This is a question you should be asking yourself whenever performing a predictive analysis.

Best Practices

You might note that some code in this document is very repetitive. Anytime we notice that, we should probably rethink our coding strategy. However, the repetition here is somewhat intentional: it makes clearer what is “happening” from a modeling and prediction perspective. Eventually we’ll be comfortable adding a layer of abstraction to make the code better, but for now our main goal is understanding the task we are performing. We don’t want the code to get in the way.

You might also notice some repetition in the chunk options, for example, repeatedly setting the figure alignment to center for chunks that produce plots. (Remember, one plot per chunk.) Instead of doing this for each chunk that produces a plot, we could set default chunk options. Maybe we’ll try that next time…
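If we did, a sketch of what that might look like is a single setup chunk near the top of the R Markdown file:

# a possible setup chunk: these defaults then apply to every chunk in the document
knitr::opts_chunk$set(fig.align = "center")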