---
title: 'STAT 3202: Lab 08, Simple Linear Regression in R'
author: "Autumn 2018, OSU"
date: 'Due: Friday, November 9'
output:
  html_document:
    theme: spacelab
    toc: yes
    df_print: paged
  pdf_document: default
urlcolor: BrickRed
---
***
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
**Goal:** After completing this lab, you should be able to...

- *Use* the `lm()` function in `R` to fit regression models.
- *Plot* fitted SLR models.
- *Predict* new observations using SLR models.
In this lab we will use, but not focus on...

- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:

- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission and grading.
- You may use [this document](lab-08-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
- The following readings may be very useful:
    - [ASWR: Chapter 7](https://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html)
    - [ASWR: Chapter 8](https://daviddalpiaz.github.io/appliedstats/inference-for-simple-linear-regression.html)
***
# Exercise 0 - Cars
In class we looked at the (boring) `cars` dataset. Use `?cars` to learn more about this dataset. (For example, the year that it was gathered.)
```{r}
head(cars)
```
```{r}
plot(dist ~ speed, data = cars, pch = 20)
grid()
```
Our purpose with this dataset was to fit a line that summarized the data. We did this with the `lm()` function in `R`.
```{r}
cars_mod = lm(dist ~ speed, data = cars)
```
Using the `summary()` function on the result of the `lm()` function produced some useful output, including the slope and intercept of the line that we fit.
```{r}
summary(cars_mod)
```
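If you need just the estimated coefficients, rather than the full summary output, they can also be extracted directly from the fitted model object; a quick sketch:

```r
# extract the estimated intercept and slope as a named vector
coef(cars_mod)

# the residual standard error and R^2 are stored in the summary object
summary(cars_mod)$sigma
summary(cars_mod)$r.squared
```

This is often more convenient than reading values off the printed summary when you need them for further calculation.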
We could use the `abline()` function to add this line to a plot.
```{r}
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
```
Next let's look at the `predict()` function. We will use it to create three different estimates, the latter two of which we will explore this week during class, but which are easy to compute in `R`.

- A **point estimate** for the *mean* stopping distance for a particular speed.
- A **confidence interval** for the *mean* stopping distance for a particular speed.
- A **prediction interval** for a new observation of stopping distance at a particular speed.
To understand the `predict()` function, we must first understand its first two arguments:

- `object`: the output of using the `lm()` function. We can think of this as a model that we have stored. (For example, `cars_mod` above.)
- `newdata`: new $x$ data for which we would like to predict the $y$ value. This must be a **data frame** that has column names that exist in the **data frame** used to fit the model, in particular, the variable used as the predictor variable.
```{r}
names(cars)
```
Here we see that the `cars` data frame used to fit the model `cars_mod` has two variables, `speed` (which we used as the predictor variable, $x$) and `dist` (which we used as the response variable, $y$).
The following chunk estimates the mean stopping distance of a car traveling at 30 miles per hour.
```{r}
predict(object = cars_mod,
newdata = data.frame(speed = 30))
```
Note that we created a data frame that was immediately passed to `newdata` which contained a variable `speed`, and a single observation of `30`. (This might seem like a lot of work. Why not just use `newdata = 30`? Well, for one, that doesn't work, but more importantly, later you'll see that the `predict()` function is more powerful than we are showing in this lab.)
```{r}
data.frame(speed = 30)
```
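As a side note, the data frame passed to `newdata` may contain more than one row, in which case `predict()` returns an estimate for each row; a minimal sketch:

```r
# estimate the mean stopping distance at several speeds at once
predict(cars_mod, newdata = data.frame(speed = c(10, 20, 30)))
```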
Now let's add two more arguments, `interval` and `level`. By doing so below, we are creating a 95% confidence interval for the mean stopping distance of a car traveling 30 miles per hour.
```{r}
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "confidence",
level = 0.95)
```
This returns three values:

- `fit`: the point estimate that we had already obtained
- `lwr`: the lower bound of the interval
- `upr`: the upper bound of the interval

So here we are 95% confident that the **mean** (average) stopping distance of a car traveling 30 miles per hour is between 87.44 and 113.35. But what if instead of the mean, we are interested in a new observation?
```{r}
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "prediction",
level = 0.95)
```
This code creates a 95% **prediction interval**. That means that we are 95% confident that a car traveling 30 miles per hour will stop between 55.6667 and 145.1196. Notice that this interval is *much wider* than the interval for the mean! (We'll discuss this in detail on Wednesday.)
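To visualize the difference between the two interval types, one approach (not required for this lab) is to compute both over a grid of speeds and add them to the scatterplot as bands; a sketch:

```r
# evaluate both intervals over a fine grid of speeds
speed_grid = data.frame(speed = seq(min(cars$speed), max(cars$speed), by = 0.1))
conf_int = predict(cars_mod, newdata = speed_grid,
                   interval = "confidence", level = 0.95)
pred_int = predict(cars_mod, newdata = speed_grid,
                   interval = "prediction", level = 0.95)

# plot the data, the fitted line, and both bands
plot(dist ~ speed, data = cars, pch = 20, ylim = range(pred_int))
grid()
abline(cars_mod, col = "red")
lines(speed_grid$speed, conf_int[, "lwr"], col = "blue", lty = 2)
lines(speed_grid$speed, conf_int[, "upr"], col = "blue", lty = 2)
lines(speed_grid$speed, pred_int[, "lwr"], col = "darkgreen", lty = 3)
lines(speed_grid$speed, pred_int[, "upr"], col = "darkgreen", lty = 3)
```

The prediction band (dotted) sits everywhere outside the confidence band (dashed), and both are narrowest near the mean of the observed speeds.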
# Exercise 1 - Cats
For this exercise we will use the `cats` dataset from the `MASS` package. You should use `?cats` to learn about the background of this dataset.
```{r}
library(MASS)
head(cats)
```
- Suppose we would like to understand the size of a cat's heart based on the body weight of a cat. Use the following chunk to fit a simple linear model in `R` that accomplishes this task. Store the results in a variable called `cat_model`. Output the result of calling `summary()` on `cat_model`. (You should be able to identify the estimate for the intercept, the estimate for the slope, $R^2$, and the residual standard error from this output.)
```{r}
# your code here
```
- Use your model to estimate the mean heart weight of cats that weigh 2.7 kg.
```{r}
# your code here
```
- Use your model to create a 99% confidence interval for the mean heart weight of cats that weigh 1.5 kg.
```{r}
# your code here
```
- Create a scatterplot of the data and add the fitted regression line. Make sure your plot is well labeled and is somewhat visually appealing.
```{r}
# your code here
```
***
# Exercise 2 - Goalie Penalty Minutes
For this exercise we will use the data stored in [`goalies.txt`](https://daviddalpiaz.github.io/stat3202-au18/data/goalies.txt). It contains career data for 462 players in the National Hockey League who played goaltender at some point up to and including the 2014-2015 season. The variables in the dataset are:
- `Player` - Player Name
- `First` - First Year in League
- `Last` - Last Year in League
- `GP` - Games Played
- `W` - Wins
- `L` - Losses
- `GA` - Goals Against
- `SA` - Shots Against
- `SV` - Saves
- `SV_PCT` - Save Percentage
- `GAA` - Goals Against Average
- `SO` - Shutouts
- `MIN` - Minutes
- `PIM` - Penalties in Minutes
The data is imported in the following chunk. We select only certain columns from the original data and remove rows with missing data.
```{r}
goalies = read.csv("https://daviddalpiaz.github.io/stat3202-au18/data/goalies.txt")
goalies = na.omit(subset(goalies,
select = c(Player, First, Last, GP, W, L, GA,
SA, SV, SV_PCT, GAA, SO, MIN, PIM)))
head(goalies)
```
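Before modeling, it may be worth a quick check of the dimensions after subsetting and `na.omit()`; note the row count can be slightly below the 462 players in the raw file if any rows contained missing values:

```r
# number of rows and columns remaining after removing missing data
dim(goalies)
```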
Let's take a look at a couple in particular. First, [Crazy Eddie Belfour](https://en.wikipedia.org/wiki/Ed_Belfour) because, [Go Blackhawks!](https://www.youtube.com/watch?v=PO5SnehKowM&t=420)
```{r}
subset(goalies, Player == "Ed Belfour*")
```
Next, the current goaltender for your Columbus Blue Jackets, [Sergei BOBROVSKY!](https://www.youtube.com/watch?v=omZPhiT2PeQ)
```{r}
subset(goalies, Player == "Sergei Bobrovsky")
```
- Suppose we would like to understand the number of wins (`W`) that a goalie obtains based on his penalty minutes (`PIM`). Use the following chunk to fit a simple linear model in `R` that accomplishes this task. Store the results in a variable called `wins_model_1`. Output the result of calling `summary()` on `wins_model_1`. (You should be able to identify the estimate for the intercept, the estimate for the slope, $R^2$, and the residual standard error from this output.)
```{r}
# your code here
```
- Use your model to estimate the mean wins of a goalie with 400 career penalty minutes.
```{r}
# your code here
```
- Use your model to create a 99% prediction interval for the career wins of a goalie with 200 penalty minutes.
```{r}
# your code here
```
- Create a scatterplot of the data and add the fitted regression line. Make sure your plot is well labeled and is somewhat visually appealing. (This plot should make you suspicious of the previous analysis.)
```{r}
# your code here
```
***
# Exercise 3 - Goalie Saves
Return to the `goalies` dataset from the previous exercise.
- Suppose we would like to understand the number of wins (`W`) that a goalie obtains based on his saves (`SV`). Use the following chunk to fit a simple linear model in `R` that accomplishes this task. Store the results in a variable called `wins_model_2`. Output the result of calling `summary()` on `wins_model_2`. (You should be able to identify the estimate for the intercept, the estimate for the slope, $R^2$, and the residual standard error from this output.)
```{r}
# your code here
```
- Use your model to estimate the mean wins of a goalie with 10000 career saves.
```{r}
# your code here
```
- Use your model to create a 90% prediction interval for the career wins of a goalie with 5000 career saves.
```{r}
# your code here
```
- Create a scatterplot of the data and add the fitted regression line. Make sure your plot is well labeled and is somewhat visually appealing. (This plot should look much better than the previous.)
```{r}
# your code here
```
***