Goal: After completing this lab, you should be able to use the `lm()` function in R to fit regression models.

In this lab we will use, but not focus on, R Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:
In class we looked at the (boring) `cars` dataset. Use `?cars` to learn more about this dataset. (For example, the year that it was gathered.)
head(cars)
plot(dist ~ speed, data = cars, pch = 20)
grid()
Our purpose with this dataset was to fit a line that summarized the data. We did this with the `lm()` function in R.
cars_mod = lm(dist ~ speed, data = cars)
Using the `summary()` function on the result of the `lm()` function produced some useful output, including the slope and intercept of the line that we fit.
summary(cars_mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
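If you only need the estimated coefficients rather than the full summary, the `coef()` function extracts them directly from the fitted model. (A small aside; nothing here is required for the lab.)

```r
# Extract the estimated intercept and slope from the fitted model
coef(cars_mod)

# The coefficients can also be stored and used individually
beta_0 = coef(cars_mod)[1]  # intercept, about -17.58
beta_1 = coef(cars_mod)[2]  # slope, about 3.93
```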
We could use the `abline()` function to add this line to a plot.
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
Next let’s look at the `predict()` function. We will use it to create three different estimates, the latter two of which we will explore this week during class, but all are easy to do in R.
To understand the `predict()` function, we must first understand its first two arguments:

- `object`: the output of using the `lm()` function. We can think of this as a model that we have stored. (For example, `cars_mod` above.)
- `newdata`: new \(x\) data for which we would like to predict the \(y\) value. This must be a data frame with column names that exist in the data frame used to fit the model, in particular the variable used as the predictor variable.

names(cars)
## [1] "speed" "dist"
Here we see that the `cars` data frame used to fit the model `cars_mod` has two variables: `speed` (which we used as the predictor variable, \(x\)) and `dist` (which we used as the response variable, \(y\)).
The following chunk estimates the mean stopping distance of a car traveling at 30 miles per hour.
predict(object = cars_mod,
newdata = data.frame(speed = 30))
## 1
## 100.3932
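As a sanity check, this point estimate is simply the fitted line evaluated at `speed = 30`, that is, the estimated intercept plus the estimated slope times 30. A quick verification using `coef()`:

```r
# Compute the prediction "by hand" from the estimated coefficients
coef(cars_mod)[1] + coef(cars_mod)[2] * 30  # matches 100.3932, up to rounding
```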
Note that we created a data frame, which was immediately passed to `newdata`, containing a variable `speed` with a single observation of `30`. (This might seem like a lot of work. Why not just use `newdata = 30`? Well, for one, that doesn’t work, but more importantly, later you’ll see that the `predict()` function is more powerful than we are showing in this lab.)
data.frame(speed = 30)
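One hint of that extra power: `newdata` can contain several rows, and `predict()` will return an estimate for each. A short sketch (these speeds are chosen arbitrarily for illustration):

```r
# Predict the mean stopping distance at several speeds at once
new_speeds = data.frame(speed = c(10, 20, 30))
predict(object = cars_mod, newdata = new_speeds)
```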
Now let’s add two more arguments, `interval` and `level`. By doing so below, we are creating a 95% confidence interval for the mean stopping distance of a car traveling 30 miles per hour.
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "confidence",
level = 0.95)
## fit lwr upr
## 1 100.3932 87.43543 113.3509
This returns three values:

- `fit`: the point estimate that we had already obtained
- `lwr`: the lower bound of the interval
- `upr`: the upper bound of the interval

So here we are 95% confident that the mean (average) stopping distance of a car traveling 30 miles per hour is between 87.44 and 113.35. But what if instead of the mean, we are interested in a new observation?
predict(object = cars_mod,
newdata = data.frame(speed = 30),
interval = "prediction",
level = 0.95)
## fit lwr upr
## 1 100.3932 66.86529 133.921
This code creates a 95% prediction interval. That means that we are 95% confident that a car traveling 30 miles per hour will stop between 66.87 and 133.92. Notice that this interval is much wider than the interval for the mean! (We’ll discuss this in detail on Wednesday.)
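To see the difference between the two intervals at a glance, one optional sketch (not required for the lab) computes both over a grid of speeds and draws them as bands around the fitted line. Note that the prediction band sits everywhere outside the confidence band:

```r
# Grid of speeds covering the observed data
speed_grid = data.frame(speed = seq(5, 25, by = 0.5))

# Confidence band (for the mean) and prediction band (for a new observation)
conf_band = predict(cars_mod, newdata = speed_grid,
                    interval = "confidence", level = 0.95)
pred_band = predict(cars_mod, newdata = speed_grid,
                    interval = "prediction", level = 0.95)

plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
lines(speed_grid$speed, conf_band[, "lwr"], col = "blue", lty = 2)
lines(speed_grid$speed, conf_band[, "upr"], col = "blue", lty = 2)
lines(speed_grid$speed, pred_band[, "lwr"], col = "darkgreen", lty = 3)
lines(speed_grid$speed, pred_band[, "upr"], col = "darkgreen", lty = 3)
```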
For this exercise we will use the `cats` dataset from the `MASS` package. You should use `?cats` to learn about the background of this dataset.
library(MASS)
head(cats)
We would like to model the heart weight (`Hwt`) of a cat based on its body weight (`Bwt`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `cat_model`. Output the result of calling `summary()` on `cat_model`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here
For this exercise we will use the data stored in goalies.txt
. It contains career data for 462 players in the National Hockey League who played goaltender at some point up to and including the 2014-2015 season. The variables in the dataset are:
- `Player` - Player Name
- `First` - First Year in League
- `Last` - Last Year in League
- `GP` - Games Played
- `W` - Wins
- `L` - Losses
- `GA` - Goals Against
- `SA` - Shots Against
- `SV` - Saves
- `SV_PCT` - Save Percentage
- `GAA` - Goals Against Average
- `SO` - Shutouts
- `MIN` - Minutes
- `PIM` - Penalties in Minutes

The data is imported in the following chunk. We select only certain columns from the original data and remove rows with missing data.
goalies = read.csv("https://daviddalpiaz.github.io/stat3202-au18/data/goalies.txt")
goalies = na.omit(subset(goalies,
select = c(Player, First, Last, GP, W, L, GA,
SA, SV, SV_PCT, GAA, SO, MIN, PIM)))
head(goalies)
Let’s take a look at a couple in particular. First, Crazy Eddie Belfour because, Go Blackhawks!
subset(goalies, Player == "Ed Belfour*")
Next, the current goaltender for your Columbus Blue Jackets, Sergei BOBROVSKY!
subset(goalies, Player == "Sergei Bobrovsky")
We would like to model the number of wins (`W`) that a goalie obtains based on his penalty minutes (`PIM`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `wins_model_1`. Output the result of calling `summary()` on `wins_model_1`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here
Return to the `goalies` dataset from the previous exercise.
We would like to model the number of wins (`W`) that a goalie obtains based on his saves (`SV`). Use the following chunk to fit a simple linear model in R that accomplishes this task. Store the results in a variable called `wins_model_2`. Output the result of calling `summary()` on `wins_model_2`. (You should be able to identify the estimate for the intercept, the estimate for the slope, \(R^2\), and the residual standard error from this output.)

# your code here
# your code here
# your code here
# your code here