Goal: After completing this lab, you should be able to use the `lm()` function in R to fit regression models.

In this lab we will use, but not focus on, R Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:
In class we looked at the (boring) `cars` dataset. Use `?cars` to learn more about this dataset. (For example, the year that it was gathered.)
head(cars)
plot(dist ~ speed, data = cars, pch = 20)
grid()
Our purpose with this dataset was to fit a line that summarized the data. We did this with the `lm()` function in R.
cars_mod = lm(dist ~ speed, data = cars)
Using the `summary()` function on the result of the `lm()` function produced some useful output, including the slope and intercept of the line that we fit.
summary(cars_mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
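If you only need the estimated coefficients rather than the full summary, the `coef()` function extracts them directly from the fitted model. (A small aside; nothing here is required for the lab.)

```r
# Extract the estimated intercept and slope from the fitted model
coef(cars_mod)

# The coefficients can also be stored and used individually
beta_0 = coef(cars_mod)[1]  # intercept, about -17.58
beta_1 = coef(cars_mod)[2]  # slope, about 3.93
```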
We could use the `abline()` function to add this line to a plot.
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
Next let’s look at the `predict()` function. We will use it to create three different estimates, the latter two of which we will explore this week during class, but all are easy to do in R.
To understand the `predict()` function, we must first understand its first two arguments:

- `object`: the output of using the `lm()` function. We can think of this as a model that we have stored. (For example, `cars_mod` above.)
- `newdata`: new \(x\) data for which we would like to predict the \(y\) value. This must be a data frame with column names that exist in the data frame used to fit the model, in particular the variable used as the predictor variable.

names(cars)
## [1] "speed" "dist"
Here we see that the `cars` data frame used to fit the model `cars_mod` has two variables: `speed` (which we used as the predictor variable, \(x\)) and `dist` (which we used as the response variable, \(y\)).
The following chunk estimates the mean stopping distance of a car traveling at 30 miles per hour.
predict(object = cars_mod,
newdata = data.frame(speed = 30))
## 1
## 100.3932
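As a sanity check, this point estimate is simply the fitted line evaluated at `speed = 30`, that is, the estimated intercept plus the estimated slope times 30. A quick verification using `coef()`:

```r
# Compute the prediction "by hand" from the estimated coefficients
coef(cars_mod)[1] + coef(cars_mod)[2] * 30  # matches 100.3932, up to rounding
```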
Note that we created a data frame, which was immediately passed to `newdata`, containing a variable `speed` with a single observation of `30`. (This might seem like a lot of work. Why not just use `newdata = 30`? Well, for one, that doesn’t work, but more importantly, later you’ll see that the `predict()` function is more powerful than we are showing in this lab.)
data.frame(speed = 30)
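One hint of that extra power: `newdata` can contain several rows, and `predict()` will return an estimate for each. A short sketch (these speeds are chosen arbitrarily for illustration):

```r
# Predict the mean stopping distance at several speeds at once
new_speeds = data.frame(speed = c(10, 20, 30))
predict(object = cars_mod, newdata = new_speeds)
```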
Now let’s add two more arguments, `interval` and `level`. By doing so below, we are creating a 95% confidence interval for the mean stopping distance of a car traveling 30 miles per hour.
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "confidence",
level = 0.95)
## fit lwr upr
## 1 100.3932 87.43543 113.3509
This returns three values:

- `fit`: the point estimate that we had already obtained
- `lwr`: the lower bound of the interval
- `upr`: the upper bound of the interval

So here we are 95% confident that the mean (average) stopping distance of a car traveling 30 miles per hour is between 87.44 and 113.35. But what if instead of the mean, we are interested in a new observation?
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "prediction",
level = 0.95)
## fit lwr upr
## 1 100.3932 66.86529 133.921
This code creates a 95% prediction interval. That means that we are 95% confident that a car traveling 30 miles per hour will stop between 66.87 and 133.92. Notice that this interval is much wider than the interval for the mean! (We’ll discuss this in detail on Wednesday.)
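To see the difference between the two intervals at a glance, one optional sketch (not required for the lab) computes both over a grid of speeds and draws them as bands around the fitted line. Note that the prediction band sits everywhere outside the confidence band:

```r
# Grid of speeds covering the observed data
speed_grid = data.frame(speed = seq(5, 25, by = 0.5))

# Confidence band (for the mean) and prediction band (for a new observation)
conf_band = predict(cars_mod, newdata = speed_grid,
                    interval = "confidence", level = 0.95)
pred_band = predict(cars_mod, newdata = speed_grid,
                    interval = "prediction", level = 0.95)

plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
lines(speed_grid$speed, conf_band[, "lwr"], col = "blue", lty = 2)
lines(speed_grid$speed, conf_band[, "upr"], col = "blue", lty = 2)
lines(speed_grid$speed, pred_band[, "lwr"], col = "darkgreen", lty = 3)
lines(speed_grid$speed, pred_band[, "upr"], col = "darkgreen", lty = 3)
```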
For this exercise we will use the `cats` dataset from the `MASS` package. You should use `?cats` to learn about the background of this dataset.
library(MASS)
head(cats)
We would like to model the heart weight (`Hwt`) of a cat based on its body weight (`Bwt`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `cat_model`. Output the result of calling `summary()` on `cat_model`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here
For this exercise we will use the data stored in goalies.txt
. It contains career data for 462 players in the National Hockey League who played goaltender at some point up to and including the 2014-2015 season. The variables in the dataset are:
- `Player` - Player Name
- `First` - First Year in League
- `Last` - Last Year in League
- `GP` - Games Played
- `W` - Wins
- `L` - Losses
- `GA` - Goals Against
- `SA` - Shots Against
- `SV` - Saves
- `SV_PCT` - Save Percentage
- `GAA` - Goals Against Average
- `SO` - Shutouts
- `MIN` - Minutes
- `PIM` - Penalties in Minutes

The data is imported in the following chunk. We select only certain columns from the original data and remove rows with missing data.
goalies = read.csv("https://daviddalpiaz.github.io/stat3202-au18/data/goalies.txt")
goalies = na.omit(subset(goalies,
select = c(Player, First, Last, GP, W, L, GA,
SA, SV, SV_PCT, GAA, SO, MIN, PIM)))
head(goalies)
Let’s take a look at a couple in particular. First, Crazy Eddie Belfour because, Go Blackhawks!
subset(goalies, Player == "Ed Belfour*")
Next, the current goaltender for your Columbus Blue Jackets, Sergei BOBROVSKY!
subset(goalies, Player == "Sergei Bobrovsky")
We would like to model the number of wins (`W`) that a goalie obtains based on his penalty minutes (`PIM`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `wins_model_1`. Output the result of calling `summary()` on `wins_model_1`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here
Return to the `goalies` dataset from the previous exercise.
We would like to model the number of wins (`W`) that a goalie obtains based on his saves (`SV`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `wins_model_2`. Output the result of calling `summary()` on `wins_model_2`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here