For this lab we will use the airquality dataset, which is a default dataset in R.
This dataset contains some missing data. For simplicity, we will remove every observation that has a missing value. Think about why this may or may not be a reasonable thing to do. (We’ll return to this idea later. For now we want to focus on modeling.)
airquality_cleaned = na.omit(airquality)
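Before dropping anything, it can help to see how much data is actually missing. The following is a minimal sketch using only base R; the name n_missing is introduced here for illustration and is not part of the lab.

```r
# Count the missing values in each column of the original dataset
n_missing = colSums(is.na(airquality))
n_missing

# Compare the number of rows before and after removing incomplete observations
airquality_cleaned = na.omit(airquality)
nrow(airquality)
nrow(airquality_cleaned)
```

If the missing values are concentrated in one or two variables, removing rows may discard a noticeable share of the data, which is one reason removal is not always a reasonable choice.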
Now we want to test-train split the data. That is, we want a training dataset for fitting our models, and a testing dataset for evaluating our models. Since this split will be based on randomly selected observations in the dataset, we first set a seed value to be able to reproduce the same split again.
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]
[Exercise] How many observations are used in the test set?
nrow(tst_data)
## [1] 34
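The split can also be sanity-checked. The sketch below repeats the split so that it runs on its own, then confirms that the two sets cover the data without overlapping and that resetting the seed reproduces the same indices. The name trn_idx_again is hypothetical and used only for this check.

```r
# Recreate the split from above
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]

# Every observation should land in exactly one of the two sets
nrow(trn_data) + nrow(tst_data) == nrow(airquality_cleaned)

# No row should appear in both sets (row names identify the original observations)
length(intersect(rownames(trn_data), rownames(tst_data))) == 0

# Resetting the seed and sampling again should give identical indices
set.seed(42)
trn_idx_again = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
identical(trn_idx, trn_idx_again)
```

Note that the particular rows selected for a given seed can differ between R versions, so the seed guarantees reproducibility only within the same setup.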
We’ve already started working with this data, but we should really take a step back and ask ourselves a question. What is this data? Whenever you ask yourself this question, you should “look” at the data. You should do three things:
Read the R documentation for the dataset.
Use the View() function on the dataset.
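One possible way to carry out these checks from the console is sketched below. The calls to str(), head(), and summary() are additions that the lab does not mention; they are included as non-interactive alternatives to View(), which needs an interactive session such as RStudio.

```r
# The help page and the spreadsheet-style viewer only make sense interactively
if (interactive()) {
  help("airquality")  # variable descriptions, units, and source
  View(airquality)    # opens the data in a viewer window
}

# Non-interactive ways to "look" at the data
str(airquality)       # variable names, types, and the first few values
head(airquality)      # first six rows
summary(airquality)   # per-variable summaries, including NA counts
```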
[Exercise] Create a plot that shows all possible scatterplots of two variables in the training dataset.
plot(trn_data, col = "darkgrey")
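For a data frame with more than two columns, plot() hands the work to pairs(), so the scatterplot matrix can also be requested explicitly. The sketch below recreates the training data so that it runs on its own, and adds a correlation matrix as a numeric companion to the plots. The correlation step and the name trn_cor are assumptions added for illustration, not part of the exercise.

```r
# Recreate the training data from above
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]

# Draw the scatterplot matrix directly
pairs(trn_data, col = "darkgrey")

# Pairwise correlations summarize the same relationships numerically
trn_cor = cor(trn_data)
round(trn_cor, 2)
```

The correlations only capture linear association, so the plots remain the better tool for spotting curved relationships or unusual observations.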