---
title: 'STAT 3202: Lab 07, Regression Estimates Sampling Distributions'
author: "Autumn 2018, OSU"
date: 'Due: Friday, November 2'
output:
  html_document:
    theme: spacelab
    toc: yes
  pdf_document: default
urlcolor: BrickRed
---
***
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
**Goal:** After completing this lab, you should be able to...
- *Use* the `lm()` function in `R` to fit regression models.
- *Use* simulation to understand the distributions of regression estimates.
In this lab we will use, but not focus on...
- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.
Some additional notes:
- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission and grading.
- You may use [this document](lab-07-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
- The following readings may be very useful:
- [ASWR: Chapter 7](https://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html)
- [ASWR: Chapter 8](https://daviddalpiaz.github.io/appliedstats/inference-for-simple-linear-regression.html)
***
# Exercise 0A - Using `lm()`
In class we looked at the (boring) `cars` dataset. Use `?cars` to learn more about this dataset. (For example, the year that it was gathered.)
```{r}
head(cars)
```
```{r}
plot(dist ~ speed, data = cars, pch = 20)
grid()
```
Our purpose with this dataset was to fit a line that summarized the data. We did this with the `lm()` function in `R`.
```{r}
cars_mod = lm(dist ~ speed, data = cars)
```
Using the `summary()` function on the result of the `lm()` function produced some useful output, including the slope and intercept of the line that we fit.
```{r}
summary(cars_mod)
```
We could use the `abline()` function to add this line to a plot.
```{r}
plot(dist ~ speed, data = cars, pch = 20)
grid()
abline(cars_mod, col = "red")
```
# Exercise 0B - Simulating SLR
We also discussed that what we were doing when we "fit a line" was estimating a particular probability model. That model, which we call the simple linear regression model, is
$$
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where
$$
\epsilon_i \sim N(0, \sigma^2)
$$
Let's write a function to simulate data according to this model.
```{r}
sim_slr = function(x, beta_0 = 10, beta_1 = 5, sigma = 1) {
sample_size = length(x)
epsilon = rnorm(n = sample_size, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x + epsilon
data.frame(predictor = x, response = y)
}
```
Notice that the function takes as input the $x$ values of the regression, in addition to the intercept, slope, and standard deviation. This is because the $x$ values are **not** viewed as random. But based on them, and the specified line, the $Y$ values are randomly generated.
Here we first generate some $x$ values. (Yes, "randomly," but after we generate them, we consider them fixed. Admittedly this is a little confusing. Later, when we repeat the process, we will generate the $x$ values once and the $Y$ values repeatedly, based on that first set of $x$ values.) Then we use the `sim_slr()` function to simulate from the model when the true values of the parameters are:
- $\beta_0 = 6$
- $\beta_1 = -1$
- $\sigma^2 = 4$
```{r}
set.seed(42)
x_vals = runif(n = 25)
sim_data_1 = sim_slr(x = x_vals, beta_0 = 6, beta_1 = -1, sigma = 2)
```
We then fit the model and check the results.
```{r}
mod_1 = lm(response ~ predictor, data = sim_data_1)
```
```{r}
summary(mod_1)
```
```{r}
plot(response ~ predictor, data = sim_data_1, pch = 20)
grid()
abline(mod_1, col = "red")
```
# Exercise 0C - Estimation in SLR
Let's do this again, this time for the model when
- $\beta_0 = 9$
- $\beta_1 = 2$
- $\sigma^2 = 25$
First, let's simulate data according to this model.
```{r}
set.seed(42)
x_vals = runif(n = 50)
sim_data_2 = sim_slr(x = x_vals, beta_0 = 9, beta_1 = 2, sigma = 5)
```
```{r}
head(sim_data_2)
```
Now we'd like to use this data to estimate
- $\beta_0$
- $\beta_1$
- $\sigma^2$
(Yes, we know the true values, but pretend we didn't.) Let's fit the model.
```{r}
mod_2 = lm(response ~ predictor, data = sim_data_2)
```
We'll create a plot for good measure.
```{r}
plot(response ~ predictor, data = sim_data_2, pch = 20)
grid()
abline(mod_2, col = "red")
```
Now we'll look at the results of using the `summary()` function.
```{r}
summary(mod_2)
```
This information tells us each of the estimates, but it's cluttered by other results. The following code pulls out the specific values that we are interested in. (See if you can match them back to the summary output.)
```{r}
coef(mod_2)[1] # estimate of beta_0
coef(mod_2)[2] # estimate of beta_1
sigma(mod_2) # estimate of sigma
```
Since these are the values that we will be interested in for this lab, let's write a function that replaces `summary()` and returns just the values of interest.
```{r}
get_slr_ests = function(mod) {
c(beta_0_hat = unname(coef(mod)[1]),
beta_1_hat = unname(coef(mod)[2]),
s_e = sigma(mod))
}
```
```{r}
get_slr_ests(mod_2)
```
Perfect! In order, we have $\hat{\beta}_0$, $\hat{\beta}_1$, and $s_e$, which estimate
- $\beta_0$
- $\beta_1$
- $\sigma$
How good are these estimates though? Well, let's think about their distributions...
# Exercise 1 - Distributions of Regression Estimates
In class, we worked towards, but did not fully derive, the distributions of the regression estimates. We simply said that they were
$$
\hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right)
$$
and
$$
\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right)
$$
where
$$
S_{xx} = \sum_{i = 1}^{n}(x_i - \bar{x})^2.
$$
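As a quick sanity check, $S_{xx}$ and the two theoretical variances can be computed directly in `R`. The $x$ values and $\sigma$ below are made-up illustrations, **not** the values used in the exercises.

```{r}
# Sketch only: computing S_xx and the theoretical variances of the estimators
# for a made-up predictor vector and sigma (not the lab's actual values).
x = 1:5        # hypothetical x values
sigma = 3      # hypothetical true standard deviation
n = length(x)
Sxx = sum((x - mean(x)) ^ 2)
var_beta_0_hat = sigma ^ 2 * (1 / n + mean(x) ^ 2 / Sxx)
var_beta_1_hat = sigma ^ 2 / Sxx
c(Sxx = Sxx, var_beta_0 = var_beta_0_hat, var_beta_1 = var_beta_1_hat)
# here: Sxx = 10, Var[beta_0_hat] = 9.9, Var[beta_1_hat] = 0.9
```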
Consider samples of size $n = 15$ from the model
$$
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where $\epsilon_i \sim N(0, \sigma^2)$. In this case, the true values of the parameters are known to be:
- $\beta_0 = 4$
- $\beta_1 = -2$
- $\sigma^2 = 9$
Use 10000 simulations to verify the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$. Also use these simulations to attempt to understand the distribution of $s_e^2$. To carry out this analysis, do the following:
- Simulate from the true model 10000 times; for each simulation, store the estimates of the three parameters.
- Plot three histograms side-by-side:
- A histogram of the simulated values of $\hat{\beta}_0$. Add a density curve for the **true** distribution of $\hat{\beta}_0$.
- A histogram of the simulated values of $\hat{\beta}_1$. Add a density curve for the **true** distribution of $\hat{\beta}_1$.
- A histogram of the simulated values of $s_e^2$.
- Comment on how well the true distributions match the simulated values for the two $\beta$s. Also comment on the shape of the distribution of $s_e^2$.
Hint: Write a `for()` loop. Each time through the loop, do the following:
- Simulate data with the `sim_slr()` function.
- Fit the SLR model with the `lm()` function.
- Update the $i$-th row of the `reg_ests_1` data frame with the output from using `get_slr_ests()`.
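The steps in the hint above can be sketched as a miniature loop. This is an illustration only: it uses just a few simulations and its own `demo_` objects so it does not answer the exercise, and it assumes the `sim_slr()` and `get_slr_ests()` functions defined earlier in this document.

```{r}
# Illustration only: a tiny simulate-fit-store loop (the exercise uses 10000
# simulations and the setup in the next chunk).
demo_sims = 5
demo_x = runif(n = 15)
demo_ests = data.frame(matrix(0, nrow = demo_sims, ncol = 3))
colnames(demo_ests) = c("beta_0_hat", "beta_1_hat", "s_e")
for (i in 1:demo_sims) {
  demo_data = sim_slr(x = demo_x, beta_0 = 4, beta_1 = -2, sigma = 3)
  demo_mod = lm(response ~ predictor, data = demo_data)
  demo_ests[i, ] = get_slr_ests(demo_mod)
}
demo_ests
```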
```{r}
# simulation setup
n_sims = 10000
set.seed(42)
x_vals_1 = runif(n = 15)
# simulation results setup
reg_ests_1 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_1) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
```
```{r}
# perform miscellaneous calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
```
```{r}
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
```
- **Replace this text with your comments on the plots.**
# Exercise 2 - Distributions of Regression Estimates, Again
Repeat Exercise 1, but this time consider samples of size $n = 30$ from the model
$$
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where $\epsilon_i \sim N(0, \sigma^2)$. In this case, the true values of the parameters are known to be:
- $\beta_0 = -10$
- $\beta_1 = 5$
- $\sigma^2 = 4$
```{r}
# simulation setup
n_sims = 10000
set.seed(42)
x_vals_2 = runif(n = 30)
# simulation results setup
reg_ests_2 = data.frame(matrix(0, nrow = n_sims, ncol = 3))
colnames(reg_ests_2) = c("beta_0_hat", "beta_1_hat", "s_e")
# write code here to perform the simulations
```
```{r}
# perform miscellaneous calculations
## Sxx
## Var[beta_0_hat]
## Var[beta_1_hat]
```
```{r}
par(mfrow = c(1, 3))
# histogram for distribution of beta_0_hat (add curve for true distribution)
# histogram for distribution of beta_1_hat (add curve for true distribution)
# histogram for distribution of s_e^2
```
- **Replace this text with your comments on the plots.**
# Exercise 3 - Joint Distribution of Regression Estimates
Let's look at these results in a slightly different way. Plot side-by-side scatterplots (one for each of the previous two exercises) with the simulated values of $\hat{\beta}_1$ on the $y$ axis and the simulated values of $\hat{\beta}_0$ on the $x$ axis. Comment on these distributions. (Every syllabus I've seen for 3201 suggests this distribution has been covered, but we'll put that to the test here.)
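The general shape of such a plot can be sketched with made-up data. The `demo_ests` data frame below is a hypothetical stand-in; in the exercise you would use the estimates stored in your simulations above instead.

```{r}
# Sketch only: beta_1_hat against beta_0_hat, using a made-up data frame of
# "estimates" in place of actual simulation results.
demo_ests = data.frame(beta_0_hat = rnorm(100, mean = 4, sd = 1),
                       beta_1_hat = rnorm(100, mean = -2, sd = 2))
plot(beta_1_hat ~ beta_0_hat, data = demo_ests, pch = 20)
grid()
```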
```{r}
par(mfrow = c(1, 2))
# scatterplot for results of Exercise 1 here
# scatterplot for results of Exercise 2 here
```
- **Replace this text with your comments on the plots.**
***