Goal: After completing this lab, you should be able to…

In this lab we will use, but not focus on…

Some additional notes:


Background

In class we will discuss a “large sample” procedure for calculating a \((1 - \alpha) \cdot 100\%\) confidence interval for a binary proportion \(p\). That is, we have binary data where one class (category) can be labeled \(1\) (perhaps “dog”) and the other class is labeled \(0\) (“cat”).

Specifically, consider \(X_1, X_2, \ldots X_n\) that are iid Bernoulli random variables. In other words,

\[ X_i \sim \text{bern}(p) \]

which indicates

\[ P[X_i = 1] = p, \quad P[X_i = 0] = 1 - p \]

Alternatively, if we write \(X = X_1 + X_2 + \cdots + X_n\), then,

\[ X \sim \text{binom}(n, p) \]
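
(As a quick aside, you can see this equivalence in R: summing \(n\) Bernoulli draws behaves like a single binomial draw. The values \(n = 50\) and \(p = 0.2\) below are arbitrary choices used only for this illustration.)

sum(rbinom(n = 50, size = 1, prob = 0.2)) # sum of 50 Bernoulli(0.2) draws
rbinom(n = 1, size = 50, prob = 0.2)      # a single binomial(50, 0.2) draw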

Our goal is to create a confidence interval for this parameter \(p\) using the \(n\) data points \(X_1, X_2, \ldots X_n\).

We have already seen that \(\hat{p} = \frac{\sum_{i = 1}^{n}X_i}{n} = \frac{X}{n}\) is a natural estimator for \(p\).

Also, recall that

\[ \text{E}\left[\hat{p}\right] = p \]

\[ \text{Var}\left[\hat{p}\right] = \frac{p(1 - p)}{n} \]

The central limit theorem gives

\[ \hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right) \]
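
(To get a feel for this approximation, one option is to simulate many values of \(\hat{p}\) and compare their mean and variance to the theoretical values above. The sketch below is purely illustrative; the 5000 repetitions and the choices \(n = 50\) and \(p = 0.2\) are arbitrary.)

# illustrative sketch: simulate many values of p-hat (arbitrary n, p, and repetitions)
p_hats = replicate(5000, mean(rbinom(n = 50, size = 1, prob = 0.2)))
mean(p_hats) # should be close to p = 0.2
var(p_hats)  # should be close to 0.2 * 0.8 / 50 = 0.0032
hist(p_hats) # roughly bell-shaped, as the central limit theorem suggests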

Using these three facts, together with Slutsky’s theorem, and some manipulation of probability statements, we obtain an approximate \((1 - \alpha) \cdot 100\%\) confidence interval for \(p\) given by

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

where \(z_{\alpha/2}\) is the critical value such that

\[ P[Z > z_{\alpha/2}] = \alpha/2 \]
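
In R, this critical value can be computed with the qnorm() function. For example, with \(\alpha = 0.05\) (a 95% interval):

qnorm(p = 0.05 / 2, lower.tail = FALSE) # z_{0.025}, approximately 1.96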

Let’s take a look at some synthetic data from this setup. In particular, let’s take a sample from a population with

\[ X_i \sim \text{bern}(p = 0.2) \]

rbinom(n = 50, size = 1, prob = 0.2)
##  [1] 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

Here we are using the rbinom() function to create a random sample of 50 Bernoulli random variables with \(p = 0.2\). (Since size = 1, these are technically Bernoulli.) Re-run this line of code to obtain a different random sample. (Also, try modifying prob to sample from different populations and see how that changes the samples.)

Let’s create a particular sample so that we can look at some statistics of that sample.

set.seed(42)
example_data = rbinom(n = 50, size = 1, prob = 0.2)

By setting a seed, we can control the randomization and the sample stored in example_data will be the same for anyone who runs these two lines of code in order.
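
(If you want to convince yourself of this, the following quick check, which is purely illustrative, draws the sample twice with the same seed and compares the results.)

set.seed(42)
sample_a = rbinom(n = 50, size = 1, prob = 0.2)
set.seed(42)
sample_b = rbinom(n = 50, size = 1, prob = 0.2)
identical(sample_a, sample_b) # TRUE, since the seed controls the draws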

sum(example_data) # number of 1's in the data
## [1] 18
length(example_data) # sample size of the data
## [1] 50
sum(example_data) / length(example_data) # calculating p-hat
## [1] 0.36
mean(example_data) # easier p-hat calculation
## [1] 0.36

The above code calculates some relevant statistics for this data. In particular, we can use the mean() function to calculate \(\hat{p}\) rather directly. (Note that here we are storing data in a vector, not a data frame, since we are only considering a single variable.)

Now let’s calculate a 95% confidence interval for \(p\) using this data. The following code does so with descriptive comments.

# calculate the sample size of the data
n = length(example_data)
# calculate the estimate of p
p_hat = mean(example_data)
# calculate the standard error of the estimate
se = sqrt(p_hat * (1 - p_hat) / n)
# obtain the critical value of the confidence interval
z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE) # 1.960
# calculate the lower bound of the interval
p_hat - z_star * se
## [1] 0.2269532
# calculate the upper bound of the interval
p_hat + z_star * se
## [1] 0.4930468

We could consider returning the interval together as a vector with named elements.

c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
##     lower     upper 
## 0.2269532 0.4930468

Since we simulated this data, we know that the true value of \(p\) is \(0.2\). Is this in the interval we obtained?

(p_hat - z_star * se < 0.2) & (0.2 < p_hat + z_star * se)
## [1] FALSE

Note that we need to evaluate two comparisons here. (Unfortunately, p_hat - z_star * se < 0.2 < p_hat + z_star * se does not do what you might hope; R would evaluate the first comparison to TRUE or FALSE and then compare that logical value to the upper bound.)
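
(If you find yourself repeating this kind of check, you could wrap it in a small helper. The function below, is_covered(), is just a suggested sketch and is not used elsewhere in this lab.)

# suggested helper (not required): check whether a value lies inside an interval
is_covered = function(interval, value) {
  (interval[1] < value) & (value < interval[2])
}
is_covered(c(p_hat - z_star * se, p_hat + z_star * se), 0.2)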

In this lab, we are going to create a lot of confidence intervals. Rather than re-writing the above code over and over again, we should write a function. Don’t repeat yourself!

calc_ci_95 = function(data){
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}

Here we have a function calc_ci_95() that takes as input a vector data and outputs a named vector of length two containing the lower and upper bounds of a 95% confidence interval. (The input data could technically be anything, but anything other than a vector whose elements are all 0 or 1 will cause problems.)
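
(If you wanted to guard against such inputs, a basic check could be added at the top of the function. The version below, calc_ci_95_checked(), is an optional sketch and not part of the code used in this lab.)

# optional sketch: the same function with a simple input check added
calc_ci_95_checked = function(data) {
  stopifnot(is.numeric(data), all(data %in% c(0, 1)))
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}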

We then run this function on our data.

calc_ci_95(data = example_data)
##     lower     upper 
## 0.2269532 0.4930468

We get the same result, but the code is much cleaner.

ci = calc_ci_95(data = example_data)
ci["lower"] < 0.2 & 0.2 < ci["upper"]
## lower 
## FALSE

We again check and verify that the true value is not within the interval.


Exercise 1 - Calculating Confidence Intervals, Writing Functions

For this exercise we will use:

The following chunk generates a data set from this population and stores it in e1_data.

set.seed(42)
e1_data = rbinom(n = 8, size = 1, prob = 0.58)

Our previous function calc_ci_95() is sort of silly. It only calculates 95% confidence intervals. Write a new function calc_ci(), a modification of calc_ci_95(), that can calculate a confidence interval at any confidence level.

This function will have two arguments:

The function should output:

# remove this comment and write your function here

Test your function by running the following code. The middle line should return a 95% confidence interval, which for this data should be (0.0395261, 0.7104739).

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
calc_ci(data = e1_data, conf_level = 0.99)
calc_ci(data = e1_data)
calc_ci(data = e1_data, conf_level = 0.90)

Exercise 2 - 90% CI Coverage, “Small \(n\)”

What we are really interested in with this lab is the coverage probability of these confidence intervals. That is, how often does a confidence interval contain the true value of the parameter?

Wait…

Shouldn’t a 90% confidence interval have a 90% coverage probability?

Maybe not! Remember, we derived these intervals using approximate distributions, in particular

\[ \hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right) \]

If this approximation isn’t good, the interval won’t have the correct coverage!

Use the following chunk to perform a simulation study in order to calculate the coverage for a 90% confidence interval using:

Essentially we will repeat the following process 1000 times:

# remove these comments and update the for loop below
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
cover = rep(0, 1000)

for (i in seq_along(cover)) {
  # generate data here
  #   n = 8, p = 0.58
  # calculate 90% confidence interval here
  #   use your function!
  # update the i-th element of cover here
  #   TRUE if 0.58 is in interval
  #   FALSE if 0.58 is not in interval
}

c(coverage = mean(cover))

Exercise 3 - 90% CI Coverage, “Large \(n\),” DRY!

The result of the previous exercise should be a coverage probability that is too low! Why? Because with \(n = 8\), the following approximation is poor!

\[ \hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right) \]

We are going to continue to perform these coverage calculations… so you should write a function. Write a function called calc_coverage().

This function will have four arguments:

The function should output:

Use the previous exercise as a template for the function.

# write function here

Run the following code which will calculate the coverage when:

This should be an improvement over the previous exercise (closer to 0.90).

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.58, sample_size = 100, conf_level = 0.90)

Exercise 4 - 95% CI Coverage, “Large \(n\)”

Now that we have the calc_coverage() function, we can move a lot faster. Calculate the coverage of a 95% confidence interval using a sample size of 100 when \(p = 0.58\).

Run the following code to do so! (We’re essentially just double-checking your function.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.58, sample_size = 100)

Exercise 5 - 95% CI Coverage, Small \(p\)?

The last two exercises should have verified that a “large” \(n\) gives good coverage. In the previous two cases, 100 was large enough. Is that always the case?

Calculate the coverage of a 95% confidence interval using a sample size of 100 when \(p = 0.03\).

Run the following code to do so! (We’re double-checking your function, and figuring out when coverage is actually good.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.03, sample_size = 100)

Exercise 6 - 95% CI Coverage, Small \(p\)?

Hmm. In the last exercise \(n = 100\) wasn’t enough.

Calculate the coverage of a 95% confidence interval using a sample size of 2000 when \(p = 0.03\).

Run the following code to do so! (This should indicate that for any \(p\) there is some \(n\) that is large enough. The smaller the \(p\), the larger the \(n\) needed.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.03, sample_size = 2000)

Exercise 7 - Number of Simulations?

Technically, throughout this document we haven’t really been calculating the coverage directly. Instead, in some sense we’re actually estimating it using simulation. (Estimating an estimation procedure…)

We should note that the number of simulations used has an effect on this “calculation.” Run the following code to get a better understanding of this idea.

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(num_sims = 10, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 100, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 1000, p = 0.35, sample_size = 50, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 10000, p = 0.35, sample_size = 50, conf_level = 0.90)
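
As an aside (not required for this exercise), note that the estimated coverage is itself just a sample proportion, so its Monte Carlo standard error shrinks like \(\sqrt{c(1 - c) / m}\), where \(c\) is the true coverage and \(m\) is the number of simulations. For a coverage near 0.90:

sqrt(0.90 * (1 - 0.90) / c(10, 100, 1000, 10000)) # rough standard errors for each num_sims above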

Additional Thoughts

We’ve really only considered a small subset of the possible:

In the solutions, we’ll include much more detailed simulation studies that further reinforce the idea of this lab, which is a well-studied topic in mathematical statistics.