---
title: 'STAT 3202: Lab 03, Confidence Interval Coverage'
author: "Spring 2019, OSU"
date: 'Due: Friday, February 8'
output:
  html_document:
    theme: spacelab
    toc: yes
  pdf_document: default
urlcolor: BrickRed
---

***

```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```

**Goal:** After completing this lab, you should be able to...

- *Calculate* confidence intervals for proportions.
- *Create* `R` functions.
- *Perform* simulation studies using `R`.

In this lab we will use, but not focus on...

- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

Some additional notes:

- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission, and grading.
- You may use [this document](lab-03-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
- Unlike Lab 02, this lab does not focus on data analysis. Instead, we are using `R` to learn about an anlysis method we have discussed. (Does it really work? When does it work well?)
- Recall the tutorial information posted in [Lab `01`](https://daviddalpiaz.github.io/stat3202-sp19/lab/lab-01.html#exercise-7---curiosity-reading-template-videos). It may be extremely useful here.

***

# Background

In class we will discuss a "large sample" procedure for calculating a $(1 - \alpha) \cdot 100\%$ confidence interval for a binary proportion $p$. That is, we have binary data where one class (category) can be labeled $1$ (perhaps "dog") and the other class is labeled $0$ ("cat"). 

Specifically, consider $X_1, X_2, \ldots X_n$ that are iid Bernoulli random variables. In other words,

$$
X_i \sim \text{bern}(p)
$$

which indicates

$$
P[X_i = 1] = p, \quad P[X_i = 0] = 1 - p
$$

Alternatively, if we write $X = X_1 + X_2 + \cdots + X_n$, then,

$$
X \sim \text{binom}(n, p)
$$

Our goal is to create a confidence interval for this parameter $p$ using the $n$ data points $X_1, X_2, \ldots X_n$.

We have already seen that $\hat{p} = \frac{\sum_{i = 1}^{n}x_i}{n} = \frac{X}{n}$ is a natural estimator for $p$.

Also, recall that

$$
\text{E}\left[\hat{p}\right] = p
$$

$$
\text{Var}\left[\hat{p}\right] = \frac{p(1 - p)}{n}
$$

The central limit theorem gives

$$
\hat{p} \overset{\text{approx}}{\sim} N\left(p,  \frac{p(1 - p)}{n}\right)
$$

Using these three facts, together with [Slutky's theorem](https://en.wikipedia.org/wiki/Slutsky%27s_theorem), and then some manipulating of probability statements, we obtain an approximate $(1 - \alpha) \cdot 100\%$ confidence interval for $p$ given by

$$
\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

where $z_{\alpha/2}$ is the critical value such that

$$
P[Z > z_{\alpha/2}] = \alpha/2
$$

Let's take a a look at some synthetic data from this setup. In particular, let's take a sample from a population with

$$
X_i \sim \text{bern}(p = 0.2)
$$

```{r}
rbinom(n = 50, size = 1, prob = 0.2)
```

Here we are using the `rbinom()` function to create a random sample of 50 Bernoulli random variables with $p = 0.2$. (Since `size = 1`, these are technically Bernoulli.) Re-run this line of code to obtain different random sample. (Also, try modifying `prob` to sample from different populations and see how that changes the samples.)

Let's create a particular sample so that we can look at some statistics of that sample.

```{r}
set.seed(42)
example_data = rbinom(n = 50, size = 1, prob = 0.2)
```

By setting a seed, we can control the randomization and the sample stored in `example_data` will be the same for anyone who runs these two lines of code in order.

```{r}
sum(example_data) # number of 1's in the data
length(example_data) # sample size of the data
sum(example_data) / length(example_data) # calculating p-hat
mean(example_data) # easier p-hat calculating
```

The above code calculates some relevant statistics for this data. In particular, we can use the `mean()` function to calculate $\hat{p}$ rather directly. (Note that here we are storing data in a vector, not a data frame, since we are only considering a single variable.)

Now let's calculate a 95% confidence interval for $p$ using this data. The following code does so with descriptive comments.

```{r}
# calculate the sample size of the data
n = length(example_data)
# calculate the estimate of p
p_hat = mean(example_data)
# calculate the standard error of the estimate
se = sqrt(p_hat * (1 - p_hat) / n)
# obtain the critical value of the confidence interval
z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE) # 1.960
# calculate the lower bound of the interval
p_hat - z_star * se
# calculate the upper bound of the interval
p_hat + z_star * se
```

We could consider returning the interval together as a vector with *named* elements.

```{r}
c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
```

Since we simulated this data, we know that the true value of $p$ is $0.2$. Is this in the interval we obtained?

```{r}
(p_hat - z_star * se < 0.2) & (0.2 < p_hat + z_star * se)
```

Note that we need to evaluate two comparisons here. (Unfortunately `p_hat - z_star * se < 0.2 < p_hat + z_star * se` is not valid code.) 

In this lab, we are going to create a lot of confidence intervals. Rather than re-writing the above code over and over again, we should write a function. [Don't repeat yourself!](https://en.wikipedia.org/wiki/Don't_repeat_yourself)

```{r}
calc_ci_95 = function(data){
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}
```

Here we have a function `calc_ci_95()` that takes as input a vector `data` and outputs a vector of length two which contains the lower and upper bound of a 95% confidence interval. (The input `data` could actually be anything, but will cause problems if it's not a vector that contains elements that are either `0` or `1`.)

We then run this function on our data.

```{r}
calc_ci_95(data = example_data)
```

We get the same result, but the code is much cleaner.

```{r}
ci = calc_ci_95(data = example_data)
ci["lower"] < 0.2 & 0.2 < ci["upper"]
```

We again check and verify that the true value is not within the interval.

***

# Exercise 1 - Calculting Confidence Intervals, Writing Functions

For this exercise we will use:

- A sample size of $n = 8$
- A true proportion $p = 0.58$

The following chunk generates a data set from this population and stores it in `e1_data`.

```{r}
set.seed(42)
e1_data = rbinom(n = 8, size = 1, prob = 0.58)
```

Our previous function `calc_ci_95()` is sort of silly. It only calculates 95% confidence intervals. Write a new function `calc_ci()` that is a modification of `calc_ci_95()` that can calculate a confidence interval of any confidence level.

This function will have two **arguments**:

- `data`: like `calc_ci_95()` takes a vector of `0`s and `1`s
- `conf_level`: values between `0` and `1` that specifies $(1 - \alpha)$ which we will call the *confidence level* of the interval.
    - Set a default value of `0.95`.

The function should **output**:

- A named vector with the *lower* and *upper* bounds of the specified interval.

```{r}
# remove this comment and write your function here
```

Test your function by running the following code. The middle line should return a 95% confidence interval, which for this data should be (0.0395261, 0.7104739).

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
calc_ci(data = e1_data, conf_level = 0.99)
calc_ci(data = e1_data)
calc_ci(data = e1_data, conf_level = 0.90)
```

***

# Exercise 2 - 90% CI Coverage, "Small $n$"

What we are really interested in with this lab is the [**coverage** probability](https://en.wikipedia.org/wiki/Coverage_probability) of these confidence intervals. That is, how often will a confidence interval contain the true value of the parameter.

Wait...

Shouldn't a 90% confidence interval have a 90% coverage probability?

Maybe not! Remember, we derived these intervals using *approximate* distributions, in particular

$$
\hat{p} \overset{\text{approx}}{\sim} N\left(p,  \frac{p(1 - p)}{n}\right)
$$

If this approximation isn't good, the interval won't have the correct coverage!

Use the following chunk to perform a simulation study in order to calculate the coverage for a 90% confidence interval using:

- A sample size of $n = 8$
- A true proportion $p = 0.58$
- A large number of simulations, 1000

Essentially we will repeat the following process 1000 times:

- *Generate* a sample of size 8 from a population with $p = 0.58$.
- *Calculate* a 90% confidence interval for this data.
- *Check* if the true value of the parameter, $p = 0.58$, is in this interval. 

```{r, eval = FALSE}
# remove these comments and update the for loop below
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
cover = rep(0, 1000)

for (i in seq_along(1:1000)) {
  # generate data here
  #   n = 8, p = 0.58
  # calcualte 90% confidence interval here
  #   use your function!
  # update the i-th element of cover here
  #   TRUE if 0.58 is in interval
  #   FALSE if 0.58 is not in interval
}

c(coverage = mean(cover))
```

***

# Exercise 3 - 90% CI Coverage, "Large $n$," DRY!

The results of the previous exercise should be a coverage probability that is too low! Why? Because with $n = 8$, the following approximation is poor!

$$
\hat{p} \overset{\text{approx}}{\sim} N\left(p,  \frac{p(1 - p)}{n}\right)
$$

We are going to continue to perform these coverage calculations... so you should write a function. Write a function called `calc_coverage`.

This function will have four **arguments**:

- `num_sims`: the number of simulations to use
    - Set a default value of `1000`.
- `p`: the true value of the parameter
    - Set a default value of `0.5`.
- `sample_size`: the sample size of the data that is repeatedly generated
    - Set a default value of `20`.
- `conf_level`: the confidence level
    - Set a default value of `0.95`.

The function should **output**:

- The *coverage* probability. (A number between 0 and 1.)

Use the previous exercise as a template for the function.

```{r}
# write function here
```

Run the following code which will calculate the coverage when:

- $p = 0.58$ (true parameter)
- $n = 100$ (sample size)
- $(1 - \alpha) = 0.90$ (90% confidence intervals)
- `num_sims = 1000` (default number of simulations)

This should be an improvement over the previous exercise. (Closer to 0.90).

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.58, sample_size = 100, conf_level = 0.90)
```

***

# Exercise 4 - 95% CI Coverage, "Large $n$"

Now that we have the `calc_coverage()` function, we can move a lot faster. Calculate the coverage of a 95% confidence interval using a sample size of 100 when $p = 0.58$.

Run the following code to do so! (We're essentially just double-checking your function.)

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.58, sample_size = 100)
```

***

# Exercise 5 - 95% CI Coverage, Small $p$?

The last two exercises should have verified that a "large" $n$ gives good coverage. In the previous two cases, 100 was large enough. Is that always the case?

Calculate the coverage of a 95% confidence interval using a sample size of 100 when $p = 0.03$.

Run the following code to do so! (We're double-checking your function, and figuring out when coverage is actually good.)

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.03, sample_size = 100)
```

***

# Exercise 6 - 95% CI Coverage, Small $p$?

Hmm. In the last exercise $n = 100$ wasn't enough.

Calculate the coverage of a 95% confidence interval using a sample size of 2000 when $p = 0.03$.

Run the following code to do so! (This should indicate that for any $p$ there is some $n$ that is large enough. The smaller the $p$, then larger the $n$ needed.)

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.03, sample_size = 2000)
```

***

# Exercise 7 - Number of Simulations?

Technically, throughout this document we haven't really been calculating the coverage directly. Instead, in some sense we're actually estimating it using simulation. (Estimating and estimation procedure...)

We should note that the number of simulations used has an effect on this "calculation." Run the following code to get a better understanding of this idea.

```{r, eval = FALSE}
# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(num_sims = 10, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 100, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 1000, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 10000, p = 0.35, sample_size = 50, conf_level = 0.90)
```

***

# Additional Thoughts

We've really only considered a small subset of the possible:

- parameters, $p$
- sample sizes, $n$
- confidence levels, $(1 - \alpha)$
- number of simulations

In the solutions, we'll include much more detailed simulation studies that further reinforce the idea of this lab, which is a [well studied topic](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) in mathematical statistics.

***

***