**Lab Goal:** The goal of this lab is to introduce the regression task, which is a part of supervised learning.

You should use the `.Rmd` file which created this document as a template for the lab. This file can be found in the `.zip` file that contains all the necessary files for the lab.

For this lab we will use the `airquality` dataset, which is a default dataset in `R`.

This dataset contains some missing data. For simplicity, we will remove it. Think about why this may or may not be a reasonable thing to do. (We’ll return to this idea later. For now we want to focus on modeling.)

`airquality_cleaned = na.omit(airquality)`
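Before discarding rows, it can help to see how much data is actually missing. A quick check, using only base `R` (the counts below are for the built-in `airquality` dataset):

```r
# How many rows of airquality contain at least one missing value?
sum(!complete.cases(airquality))

# How many missing values does each column have?
colSums(is.na(airquality))

# Rows before and after removing incomplete observations
nrow(airquality)
nrow(na.omit(airquality))
```

Notice that removing every row with any missing value discards a sizable fraction of the data, which is part of why you should think about whether this is reasonable.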

Now we want to test-train split the data. That is, we want a **training** dataset for *fitting* our models, and a **testing** dataset for *evaluating* our models. Since this split will be based on randomly selected observations in the dataset, we first set a seed value to be able to reproduce the same split again.

```
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]
```

**[Exercise]** How many observations are used in the test set?

`# your code here`
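If you get stuck, one possible approach is sketched below. The split is repeated so the chunk is self-contained; in your lab you can simply use the `tst_data` object created earlier.

```r
# repeat the test-train split so this chunk runs on its own
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
tst_data = airquality_cleaned[-trn_idx, ]

# number of observations held out for testing
nrow(tst_data)
```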

We’ve already started working with this data, but we should really take a step back and ask ourselves a question. What is this data? Whenever you ask yourself this question, you should **“look”** at the data. You should do three things:

- Read the metadata, in this case the `R` documentation.
  - Where did this data come from?
  - What is an observation in this dataset?
  - What are the variables in this dataset?
- View the data in tabular form. This can be done by clicking the dataset in the RStudio **Environment** panel, or by using the `View()` function on the dataset.
  - What is the type of each variable?
  - Are categorical variables coded as factors?
- Plot the data.

**[Exercise]** Create a plot that shows all possible scatterplots of two variables in the training dataset.

`# your code here`

**[Exercise]** Since we will be focusing on predicting `Ozone` using `Temp`, create a scatterplot that shows only this relationship using the training data.

`# your code here`

**[Exercise]** Fit a total of five polynomial models to the training data that can be used to predict `Ozone` from `Temp`. Use polynomial degrees from `1` to `5`.

`# your code here`
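One common pattern for fitting several polynomial degrees is to call `lm()` with `poly()` inside a loop, storing the fits in a list. The sketch below repeats the split so it is self-contained; `poly_fits` is just an illustrative name.

```r
# repeat the test-train split so this chunk runs on its own
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]

# fit polynomial models of degree 1 through 5, stored in a list
poly_fits = lapply(1:5, function(d) {
  lm(Ozone ~ poly(Temp, degree = d), data = trn_data)
})
```

Storing the models in a list makes it easy to compute predictions and RMSE for all five fits later without copy-pasting code.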

**[Exercise]** Predict `Ozone` for a temperature of 89 degrees Fahrenheit using the degree three polynomial model.

`# your code here`
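The key idea here is that `predict()` needs a `newdata` data frame whose column name matches the predictor used in the model. A self-contained sketch (the names `fit_deg3` and `pred_89` are illustrative):

```r
# repeat the test-train split so this chunk runs on its own
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]

# degree three polynomial model
fit_deg3 = lm(Ozone ~ poly(Temp, degree = 3), data = trn_data)

# predicted Ozone at Temp = 89; the column name must match the predictor
pred_89 = predict(fit_deg3, newdata = data.frame(Temp = 89))
pred_89
```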

**[Exercise]** Use KNN with `k = 5` to make predictions for each of the observations in both the train and test datasets. Store the results in **vectors** named `knn_pred_trn` and `knn_pred_tst`.

To do so you will need the `knn.reg()` function from the `FNN` package. The `knn.reg()` function works very differently from the `lm()` function. Check the documentation!

```
library(FNN)
# your code here
```

**[Exercise]** Calculate both train and test RMSE for the KNN model above using the predictions you have stored. You might find the provided `calc_rmse()` function useful.

```
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
# your code here
```
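As a sanity check that you are using `calc_rmse()` correctly, here is how it might be applied to a simple degree-one model (shown only as an illustration; your exercise uses the KNN predictions instead):

```r
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# repeat the test-train split so this chunk runs on its own
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]

# a simple linear model, used only to demonstrate calc_rmse()
fit = lm(Ozone ~ Temp, data = trn_data)
rmse_trn = calc_rmse(actual = trn_data$Ozone, predicted = predict(fit, trn_data))
rmse_tst = calc_rmse(actual = tst_data$Ozone, predicted = predict(fit, tst_data))
rmse_trn
rmse_tst
```

Note the pattern: train RMSE compares training actuals against predictions on the training data, while test RMSE compares test actuals against predictions on the test data.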

**[Exercise]** Create a table that summarizes the results from each model fit. (The five polynomial models and the single KNN model.) For each model note the type of model, the value of the tuning parameter, the train RMSE, and the test RMSE. (Consider the polynomial degree a tuning parameter.) The result should be a table with a header, six rows, and four columns. In the final rendered document, hide the code used to create the table.

Hint: First create a data frame, then use the `kable()` function from the `knitr` package. For fake bonus points, use the `kable_styling()` function from the `kableExtra` package to control the width of the table output.

```
library(knitr)
library(kableExtra)
# your code here
```

**[Exercise]** Recreate the scatterplot of `Ozone` versus `Temp` from above. Add to this plot the polynomial model that performed best, as well as the fitted KNN model. Can you center this plot in the rendered document? Again, hide the code used to create the plot.

`# your code here`

So far we’ve only been using one of the available predictors. Why not use them all? (Maybe we should only use some of them though… We’ll return to this thought later.)

**[Exercise]** Fit an additive linear model with `Ozone` as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models?

`# your code here`
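In `R`'s formula syntax, `Ozone ~ .` means "use every other column as a predictor," which is the quickest way to specify an additive model here. A self-contained sketch (`fit_add` and `rmse_add` are illustrative names):

```r
# repeat the test-train split so this chunk runs on its own
airquality_cleaned = na.omit(airquality)
set.seed(42)
trn_idx = sample(nrow(airquality_cleaned), size = trunc(0.70 * nrow(airquality_cleaned)))
trn_data = airquality_cleaned[trn_idx, ]
tst_data = airquality_cleaned[-trn_idx, ]

# additive model: Ozone ~ . uses all remaining variables as predictors
fit_add = lm(Ozone ~ ., data = trn_data)

# test RMSE for the additive model
rmse_add = sqrt(mean((tst_data$Ozone - predict(fit_add, tst_data)) ^ 2))
rmse_add
```

One thing to think about: `Month` and `Day` are stored as numeric, so `~ .` treats them as continuous predictors. Whether that is sensible is worth questioning.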

**[Exercise]** Fit a **random forest** with `Ozone` as the response and the remaining variables as predictors. Calculate the test RMSE for this model. Does this improve upon the previous models? To do so, use the `randomForest()` function from the `randomForest` package. Check the documentation, but its syntax is very similar to that of `lm()`.

```
library(randomForest)
# your code here
```

We’ve skipped one big question when analyzing this data. We’ve fit a bunch of models with the goal of predicting the `Ozone` variable, but **why**? Why are predictions useful in this situation? This is a question you should be asking yourself whenever performing a predictive analysis.

You might note that some code in this document is very repetitive. Anytime we think that, we probably need to rethink our coding strategies. However, this is somewhat intentional in this document. It is clearer what is “happening” from a modeling and predicting perspective. Eventually we’ll be OK with adding a layer of abstraction above this to make our code better, but for now our main goal is understanding the task we are performing. We don’t want the code to get in the way.

You might also notice some repetition in the chunk options, for example, repeatedly setting the figure alignment to center for chunks that produce plots. (Remember, one plot per chunk.) Instead of always doing this for each chunk that produces a plot, we could set default chunk options. Maybe we’ll try that next time…