Goal: After completing this lab, you should be able to use the lm() function in R to fit regression models.

In this lab we will use, but not focus on, R Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:
lm()
In class we looked at the (boring) cars dataset. Use ?cars to learn more about this dataset. (For example, the year that it was gathered.)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
plot(dist ~ speed, data = cars, pch = 20)
grid()
Our purpose with this dataset was to fit a line that summarized the data. We did this with the lm() function in R.
cars_mod = lm(dist ~ speed, data = cars)
Using the summary() function on the result of the lm() function produced some useful output, including the slope and intercept of the line that we fit.
summary(cars_mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
We could use the abline() function to add this line to a plot.
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
We also discussed that what we were doing when we “fit a line” was estimating a particular probability model. That model, which we call the simple linear regression model, is
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where
\[ \epsilon_i \sim N(0, \sigma^2) \]
Let’s write a function to simulate data according to this model.
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
sample_size = length(x)
epsilon = rnorm(n = sample_size, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x + epsilon
data.frame(predictor = x, response = y)
}
Notice that the function takes as input the \(x\) values of the regression, in addition to the intercept, slope, and noise standard deviation. This is because the \(x\) values are not viewed as random. But based on them, and the specified line, the \(Y\) values are randomly generated.
Here we first generate some \(x\) values. (Yes, “randomly”, but after we generate them, we consider them fixed. Admittedly this is a little confusing. Later, when we repeat the process, we will generate the \(x\) values once, and the \(Y\) values repeatedly based on that first set of \(x\) values.) Then we use the sim_slr() function to simulate from the model when the true values of the parameters are \(\beta_0 = 6\), \(\beta_1 = -1\), and \(\sigma = 2\):
set.seed(42)
x_vals = runif(n = 25)
sim_data_1 = sim_slr(x = x_vals, beta_0 = 6, beta_1 = -1, sigma = 2)
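To make the fixed-\(x\) idea concrete, here is a small sketch that simulates twice from the same \(x\) values (sim_slr() is repeated so the chunk stands alone; the parameter values match those used above, and the x values are arbitrary):

```r
# sim_slr() as defined above, repeated so this chunk is self-contained
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n = length(x), mean = 0, sd = sigma)
  data.frame(predictor = x, response = beta_0 + beta_1 * x + epsilon)
}

set.seed(42)
x_fixed = runif(n = 5)

# two simulations from the same x values
sim_a = sim_slr(x = x_fixed, beta_0 = 6, beta_1 = -1, sigma = 2)
sim_b = sim_slr(x = x_fixed, beta_0 = 6, beta_1 = -1, sigma = 2)

all(sim_a$predictor == sim_b$predictor) # TRUE: the x values are fixed
any(sim_a$response == sim_b$response)   # FALSE: the Y values are random
```

Both simulated data frames share identical predictor columns, but their response columns differ, because only \(\epsilon_i\) (and hence \(Y_i\)) is redrawn.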
We then fit the model. And then check the results.
mod_1 = lm(response ~ predictor, data = sim_data_1)
summary(mod_1)
##
## Call:
## lm(formula = response ~ predictor, data = sim_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3721 -1.5615 0.1598 1.0745 3.3083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6356 0.8003 9.541 1.84e-09 ***
## predictor -2.4999 1.1725 -2.132 0.0439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.761 on 23 degrees of freedom
## Multiple R-squared: 0.165, Adjusted R-squared: 0.1287
## F-statistic: 4.546 on 1 and 23 DF, p-value: 0.04389
plot(response ~ predictor, data = sim_data_1, pch = 20)
grid()
abline(mod_1, col = "red")
Let’s do this again, this time for the model when the true values of the parameters are \(\beta_0 = 9\), \(\beta_1 = 2\), and \(\sigma = 5\).
First, let’s simulate data according to this model.
set.seed(42)
x_vals = runif(n = 50)
sim_data_2 = sim_slr(x = x_vals, beta_0 = 9, beta_1 = 2, sigma = 5)
head(sim_data_2)
## predictor response
## 1 0.9148060 8.6772664
## 2 0.9370754 9.5878039
## 3 0.2861395 0.7564636
## 4 0.8304476 12.9613820
## 5 0.6417455 7.0835167
## 6 0.5190959 12.3154425
Now we’d like to use this data to estimate \(\beta_0\), \(\beta_1\), and \(\sigma\). (Yes, we know the true values, but pretend we didn’t.) Let’s fit the model.
mod_2 = lm(response ~ predictor, data = sim_data_2)
We’ll create a plot for good measure.
plot(response ~ predictor, data = sim_data_2, pch = 20)
grid()
abline(mod_2, col = "red")
Now we’ll look at the results of using the summary() function.
summary(mod_2)
##
## Call:
## lm(formula = response ~ predictor, data = sim_data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6581 -3.2330 0.9429 3.4540 8.2295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.828 1.546 5.711 6.91e-07 ***
## predictor 1.803 2.307 0.781 0.438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.904 on 48 degrees of freedom
## Multiple R-squared: 0.01256, Adjusted R-squared: -0.008011
## F-statistic: 0.6106 on 1 and 48 DF, p-value: 0.4384
This information tells us each of the estimates, but it’s cluttered by other results. The following code pulls out the specific values that we are interested in. (See if you can match them back to the summary output.)
coef(mod_2)[1] # estimate of beta_0
## (Intercept)
## 8.827826
coef(mod_2)[2] # estimate of beta_1
## predictor
## 1.802672
sigma(mod_2) # estimate of sigma
## [1] 4.903643
Since these are the values that we will be interested in for this lab, let’s write a function that replaces summary() and returns just the values of interest.
get_slr_ests = function(mod) {
c(beta_0_hat = unname(coef(mod)[1]),
beta_1_hat = unname(coef(mod)[2]),
s_e = sigma(mod))
}
get_slr_ests(mod_2)
## beta_0_hat beta_1_hat s_e
## 8.827826 1.802672 4.903643
Perfect! In order, we have \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(s_e\), which estimate \(\beta_0\), \(\beta_1\), and \(\sigma\), respectively.
How good are these estimates though? Well, let’s think about their distributions…
In class, we worked towards, but did not fully derive, the distributions of the regression estimates. We simply said that they were
\[ \hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right) \]
and
\[ \hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right) \]
where
\[ S_{xx} = \sum_{i = 1}^{n}(x_i - \bar{x})^2. \]
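These formulas translate directly into R. The following sketch computes \(S_{xx}\) and the two theoretical variances; the \(x\) values and \(\sigma = 2\) are arbitrary choices for illustration, not values from the exercises:

```r
# arbitrary x values and sigma, for illustration only
set.seed(42)
x = runif(n = 15)
sigma = 2
n = length(x)

# Sxx = sum of squared deviations of x from its mean
Sxx = sum((x - mean(x)) ^ 2)

# theoretical variances of the estimators, from the formulas above
var_beta_0_hat = sigma ^ 2 * (1 / n + mean(x) ^ 2 / Sxx)
var_beta_1_hat = sigma ^ 2 / Sxx

c(Sxx = Sxx, var_beta_0_hat = var_beta_0_hat, var_beta_1_hat = var_beta_1_hat)
```

Note that both variances shrink as \(S_{xx}\) grows, which is why more spread-out \(x\) values (and larger samples) give more precise estimates.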
Consider samples of size \(n = 15\) from the model
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:
Use 10000 simulations to verify the distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Also use these simulations to attempt to understand the distribution of \(s_e^2\). To carry out this analysis, do the following:
Hint: Write a for() loop. Each time through the loop, do the following:

- Simulate data from the model using the sim_slr() function.
- Fit a model to the simulated data using the lm() function.
- Fill in a row of the reg_ests_1 data frame with the output from using get_slr_ests().

# simulation setup
n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)
# simulation results setup
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
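As a sketch of the loop described in the hint: sim_slr() and get_slr_ests() are repeated from earlier so the chunk stands alone, and the true parameter values used here (beta_0 = 2, beta_1 = 3, sigma = 1) are placeholders, since the exercise's actual values do not appear in this excerpt.

```r
# Sketch of the simulation loop. The parameter values below are
# placeholders -- substitute the true values given in the exercise.
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n = length(x), mean = 0, sd = sigma)
  data.frame(predictor = x, response = beta_0 + beta_1 * x + epsilon)
}
get_slr_ests = function(mod) {
  c(beta_0_hat = unname(coef(mod)[1]),
    beta_1_hat = unname(coef(mod)[2]),
    s_e = sigma(mod))
}

n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")

for (i in 1:n_sims) {
  # simulate, fit, store
  sim_data = sim_slr(x = x_vals_1, beta_0 = 2, beta_1 = 3, sigma = 1)
  mod = lm(response ~ predictor, data = sim_data)
  reg_ests_1[i, ] = get_slr_ests(mod)
}

colMeans(reg_ests_1) # averages should be near the true parameter values
```

Because the same x_vals_1 vector is reused every iteration, only the \(Y\) values change between simulations, matching the fixed-\(x\) view discussed earlier.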
Repeat Exercise 1, but this time consider samples of size \(n = 30\) from the model
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:
# simulation setup
n_sims = 10000
set.seed(42)
x_vals_2 = runif(n = 30)
# simulation results setup
reg_ests_2 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_2) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
Let’s look at these results in a slightly different way. Plot side-by-side scatterplots (one for each of the previous two exercises) with the simulated values of \(\hat{\beta}_1\) on the \(y\) axis and the simulated values of \(\hat{\beta}_0\) on the \(x\) axis. Comment on these distributions. (Every syllabus I’ve seen for 3201 suggests this distribution has been covered, but we’ll put that to the test here.)
par(mfrow = c(1, 2))
# scatterplot for results of Exercise 1 here
# scatterplot for results of Exercise 2 here