For this lab we will use the airquality dataset, which is built into R.

During Lecture

Data Cleaning

This dataset contains some missing values. For simplicity, we will remove every observation that has at least one. Think about why this may or may not be a reasonable thing to do. (We’ll return to this idea later. For now we want to focus on modeling.)

airquality_cleaned = na.omit(airquality)
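Before moving on, it can help to see how much data this step discards. The following is a minimal sketch using only base R; it counts the missing values in each variable and compares the number of rows before and after cleaning.

```r
# Count missing values per variable in the built-in airquality data
colSums(is.na(airquality))

# Drop every observation (row) that has at least one missing value
airquality_cleaned = na.omit(airquality)

# Compare the number of observations before and after cleaning
nrow(airquality)
nrow(airquality_cleaned)
```

If the drop in row count is large, or concentrated in one variable, that is a hint that simply removing rows deserves a second look.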

Test-Train Split

Now we want to test-train split the data. That is, we want a training dataset for fitting our models, and a testing dataset for evaluating our models. Since this split will be based on randomly selected observations in the dataset, we first set a seed value to be able to reproduce the same split again.

set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]
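As a sanity check on the split, the sketch below repeats the steps above and then confirms two things: the training and testing sets together cover every cleaned observation exactly once, and resetting the seed reproduces the same indices. The name trn_idx_again is a hypothetical helper introduced only for this check.

```r
airquality_cleaned = na.omit(airquality)

set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]

# The two pieces should account for every cleaned observation
nrow(trn_data) + nrow(tst_data) == nrow(airquality_cleaned)

# No observation should appear in both sets
length(intersect(rownames(trn_data), rownames(tst_data))) == 0

# Resetting the seed should reproduce the same training indices
# (trn_idx_again is a hypothetical name used only for this comparison)
set.seed(42)
trn_idx_again = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
identical(trn_idx, trn_idx_again)
```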

[Exercise] How many observations are used in the test set?

nrow(tst_data)
## [1] 34

EDA

We’ve already started working with this data, but we should really take a step back and ask ourselves a question. What is this data? Whenever you ask yourself this question, you should “look” at the data. You should do three things:

  • Read the metadata, in this case the R documentation.
    • Where did this data come from?
    • What is an observation in this dataset?
    • What are the variables in this dataset?
  • View the data in tabular form. This can be done by clicking the dataset in the RStudio Environment panel, or by using the View() function on the dataset.
    • What is the type of each variable?
    • Are categorical variables coded as factors?
  • Plot the data.
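The first two steps in the list above can be sketched at the console roughly as follows. This is one possible approach using base R functions, not the only way to inspect a dataset.

```r
# Open the R documentation for the dataset (run in an interactive session)
# ?airquality

# Compact overview: dimensions, variable names, types, and first values
str(airquality)

# First few rows in tabular form
# (View(airquality) opens a spreadsheet-style viewer in RStudio)
head(airquality)

# Type of each variable
sapply(airquality, class)

# Is any variable stored as a factor?
any(sapply(airquality, is.factor))
```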

[Exercise] Create a plot that shows all possible scatterplots of two variables in the training dataset.

plot(trn_data, col = "darkgrey")
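A numeric companion to the scatterplot matrix is the matrix of pairwise correlations. The sketch below recreates the training data so it can run on its own; trn_cor is a hypothetical name, and Pearson correlation (the default for cor()) only captures linear association, so it should be read alongside the plots rather than instead of them.

```r
airquality_cleaned = na.omit(airquality)

set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]

# Pairwise Pearson correlations between all variables in the training data
# (trn_cor is a hypothetical name used for illustration)
trn_cor = cor(trn_data)

# Round for easier reading
round(trn_cor, 2)
```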