Goal: After completing this lab, you should be able to…

In this lab we will use, but not focus on…

Some additional notes:


Exercise 0A - Using lm()

In class we looked at the (boring) cars dataset. Use ?cars to learn more about this dataset. (For example, the year in which the data were gathered.)

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
plot(dist ~ speed, data = cars, pch = 20)
grid()

Our purpose with this dataset was to fit a line that summarized the data. We did this with the lm() function in R.

cars_mod = lm(dist ~ speed, data = cars)

Using the summary() function on the result of the lm() function produced some useful output, including the slope and intercept of the line that we fit.

summary(cars_mod)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We could use the abline() function to add this line to a plot.

plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")

Exercise 0B - Simulating SLR

We also discussed that what we were doing when we “fit a line” was estimating a particular probability model. That model, which we call the simple linear regression model, is

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where

\[ \epsilon_i \sim N(0, \sigma^2) \]

Let’s write a function to simulate data according to this model.

sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  # note: sigma is the standard deviation of the errors, not the variance
  sample_size = length(x)
  epsilon = rnorm(n = sample_size, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + epsilon
  data.frame(predictor = x, response = y)
}

Notice that the function takes as input the \(x\) values of the regression, in addition to the intercept, slope, and error standard deviation. This is because the \(x\) values are not viewed as random. Based on them, and on the specified line, the \(Y\) values are randomly generated.
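To see this concretely, here is a quick check (not part of the lab itself): calling sim_slr() twice with the same \(x\) values, using a throwaway vector x_demo, returns the same predictor column both times but different responses, since only the errors are redrawn.

x_demo = c(1, 2, 3)
sim_slr(x = x_demo)  # same predictors each call...
sim_slr(x = x_demo)  # ...but different responses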

Here we first generate some \(x\) values. (Yes, “randomly”, but after we generate them, we consider them fixed. Admittedly this is a little confusing. Later, when we repeat the process, we will generate the \(x\) values once and the \(Y\) values repeatedly, based on that first set of \(x\) values.) Then we use the sim_slr() function to simulate from the model with true parameter values \(\beta_0 = 6\), \(\beta_1 = -1\), and \(\sigma = 2\):

set.seed(42)
x_vals = runif(n = 25)
sim_data_1 = sim_slr(x = x_vals, beta_0 = 6, beta_1 = -1, sigma = 2)

We then fit the model and check the results.

mod_1 = lm(response ~ predictor, data = sim_data_1)
summary(mod_1)
## 
## Call:
## lm(formula = response ~ predictor, data = sim_data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3721 -1.5615  0.1598  1.0745  3.3083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.6356     0.8003   9.541 1.84e-09 ***
## predictor    -2.4999     1.1725  -2.132   0.0439 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.761 on 23 degrees of freedom
## Multiple R-squared:  0.165,  Adjusted R-squared:  0.1287 
## F-statistic: 4.546 on 1 and 23 DF,  p-value: 0.04389
plot(response ~ predictor, data = sim_data_1, pch = 20)
grid()
abline(mod_1, col = "red")

Exercise 0C - Estimation in SLR

Let’s do this again, this time for the model with \(\beta_0 = 9\), \(\beta_1 = 2\), and \(\sigma = 5\), and a larger sample of size \(n = 50\).

First, let’s simulate data according to this model.

set.seed(42)
x_vals = runif(n = 50)
sim_data_2 = sim_slr(x = x_vals, beta_0 = 9, beta_1 = 2, sigma = 5)
head(sim_data_2)
##   predictor   response
## 1 0.9148060  8.6772664
## 2 0.9370754  9.5878039
## 3 0.2861395  0.7564636
## 4 0.8304476 12.9613820
## 5 0.6417455  7.0835167
## 6 0.5190959 12.3154425

Now we’d like to use this data to estimate \(\beta_0\), \(\beta_1\), and \(\sigma\).

(Yes, we know the true values, but pretend we didn’t.) Let’s fit the model.

mod_2 = lm(response ~ predictor, data = sim_data_2)

We’ll create a plot for good measure.

plot(response ~ predictor, data = sim_data_2, pch = 20)
grid()
abline(mod_2, col = "red")

Now we’ll look at the results of using the summary() function.

summary(mod_2)
## 
## Call:
## lm(formula = response ~ predictor, data = sim_data_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6581  -3.2330   0.9429   3.4540   8.2295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.828      1.546   5.711 6.91e-07 ***
## predictor      1.803      2.307   0.781    0.438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.904 on 48 degrees of freedom
## Multiple R-squared:  0.01256,    Adjusted R-squared:  -0.008011 
## F-statistic: 0.6106 on 1 and 48 DF,  p-value: 0.4384

This output gives us each of the estimates, but they are cluttered by other results. The following code pulls out the specific values that we are interested in. (See if you can match them back to the summary output.)

coef(mod_2)[1] # estimate of beta_0
## (Intercept) 
##    8.827826
coef(mod_2)[2] # estimate of beta_1
## predictor 
##  1.802672
sigma(mod_2)   # estimate of sigma
## [1] 4.903643

Since these are the values that we will be interested in for this lab, let’s write a function that replaces summary() and returns just the values of interest.

get_slr_ests = function(mod) {
  # return the estimates of beta_0, beta_1, and sigma as a named vector
  c(beta_0_hat = unname(coef(mod)[1]), 
    beta_1_hat = unname(coef(mod)[2]), 
    s_e = sigma(mod))
}
get_slr_ests(mod_2)
## beta_0_hat beta_1_hat        s_e 
##   8.827826   1.802672   4.903643

Perfect! In order, we have \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(s_e\), which estimate \(\beta_0\), \(\beta_1\), and \(\sigma\), respectively.

How good are these estimates though? Well, let’s think about their distributions…

Exercise 1 - Distributions of Regression Estimates

In class, we worked towards, but did not fully derive, the distributions of the regression estimates. We simply said that they were

\[ \hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right) \]

and

\[ \hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right) \]

where

\[ S_{xx} = \sum_{i = 1}^{n}(x_i - \bar{x})^2. \]
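For reference, these quantities are straightforward to compute in R. Here is a minimal sketch, assuming a numeric vector x of fixed predictor values and a known error standard deviation sigma (both placeholders here, not objects defined earlier in the lab):

# theoretical variances of the estimators, given fixed x values
Sxx = sum((x - mean(x)) ^ 2)
var_beta_0_hat = sigma ^ 2 * (1 / length(x) + mean(x) ^ 2 / Sxx)
var_beta_1_hat = sigma ^ 2 / Sxx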

Consider samples of size \(n = 15\) from the model

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:

Use 10000 simulations to verify the distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Also use these simulations to attempt to understand the distribution of \(s_e^2\). To carry out this analysis, do the following:

Hint: Write a for() loop. Each time through the loop: simulate a dataset using sim_slr() with the \(x\) values below, fit the model with lm(), and store the results of get_slr_ests() in the corresponding row of the results data frame.

# simulation setup
n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)

# simulation results setup
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")

# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
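If the shape of the loop is unclear, here is a minimal sketch of one way to fill in the “perform the simulations” step above. The parameter values passed to sim_slr() are placeholders only; substitute the true values specified for this exercise.

for (i in 1:n_sims) {
  sim_data = sim_slr(x = x_vals_1, beta_0 = 1, beta_1 = 1, sigma = 1)  # placeholder parameters
  sim_mod = lm(response ~ predictor, data = sim_data)
  reg_ests_1[i, ] = get_slr_ests(sim_mod)  # store the three estimates for this simulation
}

The same pattern applies in Exercise 2, with x_vals_2 and reg_ests_2 in place of x_vals_1 and reg_ests_1.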

Exercise 2 - Distributions of Regression Estimates, Again

Repeat Exercise 1, but this time consider samples of size \(n = 30\) from the model

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where \(\epsilon_i \sim N(0, \sigma^2)\). In this case, the true values of the parameters are known to be:

# simulation setup
n_sims = 10000
set.seed(42)
x_vals_2 = runif(n = 30)

# simulation results setup
reg_ests_2 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_2) = c("beta_0_hat", "beta_1_hat", "s_e")

# write code here to perform the simulations
# perform misc_calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2

Exercise 3 - Joint Distribution of Regression Estimates

Let’s look at these results in a slightly different way. Plot side-by-side scatterplots (one for each of the previous two exercises) with the simulated values of \(\hat{\beta}_1\) on the \(y\) axis and the simulated values of \(\hat{\beta}_0\) on the \(x\) axis. Comment on these distributions. (Every syllabus I’ve seen for 3201 suggests this distribution has been covered, but we’ll put that to the test here.)

par(mfrow = c(1, 2))
# scatterplot for results of Exercise 1 here
# scatterplot for results of Exercise 2 here
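As a starting point, here is a minimal sketch for one of the two panels, assuming reg_ests_1 has been filled in by your Exercise 1 simulations:

plot(beta_1_hat ~ beta_0_hat, data = reg_ests_1, pch = 20,
     xlab = "beta_0_hat", ylab = "beta_1_hat")
grid()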