Lab Goal: The goal of this lab is to introduce the regression task, which is a part of supervised learning.
You should use the .Rmd file that created this document as a template for the lab. This file can be found in the .zip file that contains all the necessary files for the lab.
For this lab we will use the airquality dataset, which is a default dataset in R.
This dataset contains some missing data. For simplicity, we will remove it. Think about why this may or may not be a reasonable thing to do. (We’ll return to this idea later. For now we want to focus on modeling.)
airquality_cleaned = na.omit(airquality)
Now we want to test-train split the data. That is, we want a training dataset for fitting our models, and a testing dataset for evaluating our models. Since this split will be based on randomly selected observations in the dataset, we first set a seed value to be able to reproduce the same split again.
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]
[Exercise] How many observations are used in the test set?
# your code here
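One possible answer, as a minimal sketch assuming the tst_data object created above:
nrow(tst_data)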
We’ve already started working with this data, but we should really take a step back and ask ourselves a question. What is this data? Whenever you ask yourself this question, you should “look” at the data. You should do three things:
- Check the R documentation.
- Use the View() function on the dataset.
[Exercise] Create a plot that shows all possible scatterplots of two variables in the training dataset.
# your code here
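A sketch of one way to do this with base R, assuming the trn_data object from above:
# scatterplots for every pair of variables in the training data
pairs(trn_data, col = "dodgerblue")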
[Exercise] Since we will be focusing on predicting Ozone using Temp, create a scatterplot that shows only this relationship using the training data.
# your code here
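A possible sketch using base R graphics, again assuming trn_data; the axis labels use the units given in the airquality documentation (Ozone in parts per billion, Temp in degrees Fahrenheit):
plot(Ozone ~ Temp, data = trn_data,
     xlab = "Temperature (degrees F)",
     ylab = "Ozone (parts per billion)",
     main = "Ozone vs Temperature (Training Data)",
     pch  = 20,
     col  = "dodgerblue")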
[Exercise] Fit a total of five polynomial models to the training data that can be used to predict Ozone from Temp. Use polynomial degrees from 1 to 5.
# your code here
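One way to sketch this with lm() and poly(); the model names poly_1 through poly_5 are just illustrative choices:
poly_1 = lm(Ozone ~ poly(Temp, degree = 1), data = trn_data)
poly_2 = lm(Ozone ~ poly(Temp, degree = 2), data = trn_data)
poly_3 = lm(Ozone ~ poly(Temp, degree = 3), data = trn_data)
poly_4 = lm(Ozone ~ poly(Temp, degree = 4), data = trn_data)
poly_5 = lm(Ozone ~ poly(Temp, degree = 5), data = trn_data)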
[Exercise] Predict Ozone for a temperature of 89 degrees Fahrenheit using the degree three polynomial model.
# your code here
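A sketch assuming the (hypothetically named) degree three model poly_3 from the sketch above:
predict(poly_3, newdata = data.frame(Temp = 89))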
[Exercise] Use KNN with k = 5 to make predictions for each of the observations in both the train and test datasets. Store the results in vectors named knn_pred_trn and knn_pred_tst.
To do so you will need the knn.reg() function from the FNN package. The knn.reg() function is very different from the lm() function. Check the documentation!
library(FNN)
# your code here
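A possible sketch, assuming trn_data and tst_data from above; the helper names X_trn and X_tst are just illustrative. Note that knn.reg() takes its predictors as a data frame or matrix rather than a formula, and returns its predictions in the pred element of its result:
# KNN needs the predictor as a data frame (or matrix), not a plain vector
X_trn = trn_data["Temp"]
X_tst = tst_data["Temp"]
knn_pred_trn = knn.reg(train = X_trn, test = X_trn, y = trn_data$Ozone, k = 5)$pred
knn_pred_tst = knn.reg(train = X_trn, test = X_tst, y = trn_data$Ozone, k = 5)$pred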
[Exercise] Calculate both train and test RMSE for the KNN model above using the predictions you have stored. You might find the provided calc_rmse() function useful.
calc_rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
# your code here
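A sketch, assuming the knn_pred_trn and knn_pred_tst vectors from the previous exercise:
calc_rmse(actual = trn_data$Ozone, predicted = knn_pred_trn)
calc_rmse(actual = tst_data$Ozone, predicted = knn_pred_tst)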
[Exercise] Create a table that summarizes the results from each model fit. (The five polynomial models and the single KNN model.) For each model note the type of model, the value of the tuning parameter, the train RMSE, and the test RMSE. (Consider the polynomial degree a tuning parameter.) The result should be a table with a header, six rows, and four columns. In the final rendered document, hide the code used to create the table.
Hint: First create a data frame, then use the kable() function from the knitr package. For fake bonus points, use the kable_styling() function from the kableExtra package to control the width of the table output.
library(knitr)
library(kableExtra)
# your code here
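One possible sketch, assuming the (hypothetically named) poly_1 through poly_5 models and the KNN predictions from the earlier sketches; to hide the code in the rendered document, set echo = FALSE in the chunk options:
# collect the polynomial fits, then compute train and test RMSE for each
poly_models = list(poly_1, poly_2, poly_3, poly_4, poly_5)
poly_trn_rmse = sapply(poly_models, function(mod) calc_rmse(trn_data$Ozone, predict(mod, trn_data)))
poly_tst_rmse = sapply(poly_models, function(mod) calc_rmse(tst_data$Ozone, predict(mod, tst_data)))
results = data.frame(
  Model = c(rep("Polynomial", 5), "KNN"),
  `Tuning Parameter` = c(1:5, 5),
  `Train RMSE` = c(poly_trn_rmse, calc_rmse(trn_data$Ozone, knn_pred_trn)),
  `Test RMSE`  = c(poly_tst_rmse, calc_rmse(tst_data$Ozone, knn_pred_tst)),
  check.names = FALSE
)
kable(results, digits = 2) %>%
  kable_styling(full_width = FALSE)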
[Exercise] Recreate the scatterplot of Ozone versus Temp from above. Add to this plot the polynomial model that performed best, as well as the fitted KNN model. Can you center this plot in the rendered document? Again, hide the code used to create the plot.
# your code here
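A sketch of one approach, supposing for illustration that the degree three polynomial had the best test RMSE (use whichever degree actually won in your table); centering can be done with the fig.align = "center" chunk option and the code hidden with echo = FALSE:
plot(Ozone ~ Temp, data = trn_data,
     xlab = "Temperature (degrees F)",
     ylab = "Ozone (parts per billion)",
     pch  = 20,
     col  = "grey")
# a fine grid of temperatures for drawing smooth fitted curves
temp_grid = data.frame(Temp = seq(min(trn_data$Temp), max(trn_data$Temp), by = 0.1))
lines(temp_grid$Temp, predict(poly_3, newdata = temp_grid),
      col = "dodgerblue", lwd = 2)
lines(temp_grid$Temp,
      knn.reg(train = trn_data["Temp"], test = temp_grid, y = trn_data$Ozone, k = 5)$pred,
      col = "darkorange", lwd = 2)
legend("topleft", legend = c("Degree 3 Polynomial", "KNN, k = 5"),
       col = c("dodgerblue", "darkorange"), lwd = 2)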
So far we’ve only been using one of the available predictors. Why not use them all? (Maybe we should only use some of them though… We’ll return to this thought later.)
[Exercise] Fit an additive linear model with Ozone as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models?
# your code here
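A sketch, assuming the training and test data from above; the name add_mod is just an illustrative choice:
add_mod = lm(Ozone ~ ., data = trn_data)
calc_rmse(actual = tst_data$Ozone, predicted = predict(add_mod, newdata = tst_data))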
[Exercise] Fit a random forest with Ozone as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models? To do so, use the randomForest() function from the randomForest package. Check the documentation, but its syntax is very similar to that of lm().
library(randomForest)
# your code here
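A sketch using the default tuning parameters; the seed is reset because fitting a random forest involves randomness:
set.seed(42)
rf_mod = randomForest(Ozone ~ ., data = trn_data)
calc_rmse(actual = tst_data$Ozone, predicted = predict(rf_mod, newdata = tst_data))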
We’ve skipped one big question when analyzing this data. We’ve fit a bunch of models with the goal of predicting the Ozone variable, but why? Why are predictions useful in this situation? This is a question you should be asking yourself whenever performing a predictive analysis.
You might note that some of the code in this document is very repetitive. Any time we notice that, we should probably rethink our coding strategy. However, here the repetition is somewhat intentional: it makes clearer what is “happening” from a modeling and prediction perspective. Eventually we’ll be OK with adding a layer of abstraction above this to make our code better, but for now our main goal is understanding the task we are performing. We don’t want the code to get in the way.
You might also notice some repetition in the chunk options, for example, repeatedly setting the figure alignment to center for chunks that produce plots. (Remember, one plot per chunk.) Instead of always doing this for each chunk that produces a plot, we could set default chunk options. Maybe we’ll try that next time…