Goal: After completing this lab, you should be able to use the lm() function in R to fit regression models.

In this lab we will use, but not focus on, R Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:
lm()
In class we looked at the (boring) cars dataset. Use ?cars to learn more about this dataset. (For example, the year that it was gathered.)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
plot(dist ~ speed, data = cars, pch = 20)
grid()
Our purpose with this dataset was to fit a line that summarized the data. We did this with the lm() function in R.
cars_mod = lm(dist ~ speed, data = cars)
Using the summary() function on the result of the lm() function produced some useful output, including the slope and intercept of the line that we fit.
summary(cars_mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
We could use the abline() function to add this line to a plot.
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
We also discussed that what we were doing when we “fit a line” was estimating a particular probability model. That model, which we call the simple linear regression model, is
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where
\[ \epsilon_i \sim N(0, \sigma^2) \]
Let’s write a function to simulate data according to this model.
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
sample_size = length(x)
epsilon = rnorm(n = sample_size, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x + epsilon
data.frame(predictor = x, response = y)
}
Notice that the function takes as input the \(x\) values of the regression, in addition to the intercept, slope, and noise standard deviation. This is because the \(x\) values are not viewed as random. But based on them, and the specified line, the \(Y\) values are randomly generated.
Here we first generate some \(x\) values. (Yes, “randomly”, but after we generate them, we consider them fixed. Admittedly this is a little confusing. Later, when we repeat the process, we will generate the \(x\) values once, and the \(Y\) values repeatedly based on that first set of \(x\) values.) Then we use the sim_slr() function to simulate from the model when the true values of the parameters are \(\beta_0 = 6\), \(\beta_1 = -1\), and \(\sigma = 2\):
set.seed(42)
x_vals = runif(n = 25)
sim_data_1 = sim_slr(x = x_vals, beta_0 = 6, beta_1 = -1, sigma = 2)
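To make the fixed-\(x\) idea concrete, here is a small sketch that simulates twice from the same \(x\) values (sim_slr() is repeated so the chunk stands alone; the parameter values match those used above, and the x values are arbitrary):

```r
# sim_slr() as defined above, repeated so this chunk is self-contained
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n = length(x), mean = 0, sd = sigma)
  data.frame(predictor = x, response = beta_0 + beta_1 * x + epsilon)
}

set.seed(42)
x_fixed = runif(n = 5)

# two simulations from the same x values
sim_a = sim_slr(x = x_fixed, beta_0 = 6, beta_1 = -1, sigma = 2)
sim_b = sim_slr(x = x_fixed, beta_0 = 6, beta_1 = -1, sigma = 2)

all(sim_a$predictor == sim_b$predictor) # TRUE: the x values are fixed
any(sim_a$response == sim_b$response)   # FALSE: the Y values are random
```

Both simulated data frames share identical predictor columns, but their response columns differ, because only \(\epsilon_i\) (and hence \(Y_i\)) is redrawn.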
We then fit the model. And then check the results.
mod_1 = lm(response ~ predictor, data = sim_data_1)
summary(mod_1)
##
## Call:
## lm(formula = response ~ predictor, data = sim_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3721 -1.5615 0.1598 1.0745 3.3083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6356 0.8003 9.541 1.84e-09 ***
## predictor -2.4999 1.1725 -2.132 0.0439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.761 on 23 degrees of freedom
## Multiple R-squared: 0.165, Adjusted R-squared: 0.1287
## F-statistic: 4.546 on 1 and 23 DF, p-value: 0.04389
plot(response ~ predictor, data = sim_data_1, pch = 20)
grid()
abline(mod_1, col = "red")
Let’s do this again, this time for the model when the true values of the parameters are \(\beta_0 = 9\), \(\beta_1 = 2\), and \(\sigma = 5\).
First, let’s simulate data according to this model.
set.seed(42)
x_vals = runif(n = 50)
sim_data_2 = sim_slr(x = x_vals, beta_0 = 9, beta_1 = 2, sigma = 5)
head(sim_data_2)
## predictor response
## 1 0.9148060 8.6772664
## 2 0.9370754 9.5878039
## 3 0.2861395 0.7564636
## 4 0.8304476 12.9613820
## 5 0.6417455 7.0835167
## 6 0.5190959 12.3154425
Now we’d like to use this data to estimate \(\beta_0\), \(\beta_1\), and \(\sigma\). (Yes, we know the true values, but pretend we didn’t.) Let’s fit the model.
mod_2 = lm(response ~ predictor, data = sim_data_2)
We’ll create a plot for good measure.
plot(response ~ predictor, data = sim_data_2, pch = 20)
grid()
abline(mod_2, col = "red")
Now we’ll look at the results of using the summary() function.
summary(mod_2)
##
## Call:
## lm(formula = response ~ predictor, data = sim_data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6581 -3.2330 0.9429 3.4540 8.2295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.828 1.546 5.711 6.91e-07 ***
## predictor 1.803 2.307 0.781 0.438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.904 on 48 degrees of freedom
## Multiple R-squared: 0.01256, Adjusted R-squared: -0.008011
## F-statistic: 0.6106 on 1 and 48 DF, p-value: 0.4384
This information tells us each of the estimates, but it’s cluttered by other results. The following code pulls out the specific values that we are interested in. (See if you can match them back to the summary output.)
coef(mod_2)[1] # estimate of beta_0
## (Intercept)
## 8.827826
coef(mod_2)[2] # estimate of beta_1
## predictor
## 1.802672
sigma(mod_2) # estimate of sigma
## [1] 4.903643
Since these are the values that we will be interested in for this lab, let’s write a function that replaces summary() and returns just the values of interest.
get_slr_ests = function(mod) {
c(beta_0_hat = unname(coef(mod)[1]),
beta_1_hat = unname(coef(mod)[2]),
s_e = sigma(mod))
}
get_slr_ests(mod_2)
## beta_0_hat beta_1_hat s_e
## 8.827826 1.802672 4.903643
Perfect! In order, we have \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(s_e\), which estimate \(\beta_0\), \(\beta_1\), and \(\sigma\), respectively.
How good are these estimates though? Well, let’s think about their distributions…
In class, we worked towards, but did not fully derive, the distributions of the regression estimates. We simply said that they were
\[ \hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right) \]
and
\[ \hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right) \]
where
\[ S_{xx} = \sum_{i = 1}^{n}(x_i - \bar{x})^2. \]
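These formulas translate directly into R. The following sketch computes \(S_{xx}\) and the two theoretical variances; the \(x\) values and \(\sigma = 2\) are arbitrary choices for illustration, not values from the exercises:

```r
# arbitrary x values and sigma, for illustration only
set.seed(42)
x = runif(n = 15)
sigma = 2
n = length(x)

# Sxx = sum of squared deviations of x from its mean
Sxx = sum((x - mean(x)) ^ 2)

# theoretical variances of the estimators, from the formulas above
var_beta_0_hat = sigma ^ 2 * (1 / n + mean(x) ^ 2 / Sxx)
var_beta_1_hat = sigma ^ 2 / Sxx

c(Sxx = Sxx, var_beta_0_hat = var_beta_0_hat, var_beta_1_hat = var_beta_1_hat)
```

Note that both variances shrink as \(S_{xx}\) grows, which is why more spread-out \(x\) values (and larger samples) give more precise estimates.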
Consider samples of size \(n = 15\) from the model
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:
Use 10000 simulations to verify the distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Also use these simulations to attempt to understand the distribution of \(s_e^2\). To carry out this analysis, do the following:
Hint: Write a for() loop. Each time through the loop, do the following:

- Simulate data from the model using the sim_slr() function.
- Fit a model to the simulated data using the lm() function.
- Fill in a row of the reg_ests_1 data frame with the output from using get_slr_ests().

# simulation setup
n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)
# simulation results setup
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
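As a sketch of the loop described in the hint: sim_slr() and get_slr_ests() are repeated from earlier so the chunk stands alone, and the true parameter values used here (beta_0 = 2, beta_1 = 3, sigma = 1) are placeholders, since the exercise's actual values do not appear in this excerpt.

```r
# Sketch of the simulation loop. The parameter values below are
# placeholders -- substitute the true values given in the exercise.
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n = length(x), mean = 0, sd = sigma)
  data.frame(predictor = x, response = beta_0 + beta_1 * x + epsilon)
}
get_slr_ests = function(mod) {
  c(beta_0_hat = unname(coef(mod)[1]),
    beta_1_hat = unname(coef(mod)[2]),
    s_e = sigma(mod))
}

n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")

for (i in 1:n_sims) {
  # simulate, fit, store
  sim_data = sim_slr(x = x_vals_1, beta_0 = 2, beta_1 = 3, sigma = 1)
  mod = lm(response ~ predictor, data = sim_data)
  reg_ests_1[i, ] = get_slr_ests(mod)
}

colMeans(reg_ests_1) # averages should be near the true parameter values
```

Because the same x_vals_1 vector is reused every iteration, only the \(Y\) values change between simulations, matching the fixed-\(x\) view discussed earlier.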
Repeat Exercise 1, but this time consider samples of size \(n = 30\) from the model
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:
# simulation setup
n_sims = 10000
set.seed(42)
x_vals_2 = runif(n = 30)
# simulation results setup
reg_ests_2 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_2) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
Let’s look at these results in a slightly different way. Plot side-by-side scatterplots (one for each of the previous two exercises) with the simulated values of \(\hat{\beta}_1\) on the \(y\) axis and the simulated values of \(\hat{\beta}_0\) on the \(x\) axis. Comment on these distributions. (Every syllabus I’ve seen for 3201 suggests this distribution has been covered, but we’ll put that to the test here.)
par(mfrow = c(1, 2))
# scatterplot for results of Exercise 1 here
# scatterplot for results of Exercise 2 here