Goal: After completing this lab, you should be able to…

• Calculate confidence intervals for proportions.
• Create R functions.
• Perform simulation studies using R.

In this lab we will use, but not focus on…

• R Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.

• Please see Carmen for information about submission and grading.
• You may use this document as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
• Unlike Lab 02, this lab does not focus on data analysis. Instead, we are using R to learn about an analysis method we have discussed. (Does it really work? When does it work well?)
• Recall the tutorial information posted in Lab 01. It may be extremely useful here.

# Background

In class we discussed a “large sample” procedure for calculating a $$(1 - \alpha) \cdot 100\%$$ confidence interval for a binary proportion $$p$$. That is, we have binary data where one class can be labeled $$1$$ and the other class is labeled $$0$$.

Specifically, we have $$X_1, X_2, \ldots X_n$$ that are iid Bernoulli random variables. In other words,

$X_i \sim \text{bern}(p)$

which indicates

$P[X_i = 1] = p, \quad P[X_i = 0] = 1 - p$

Alternatively, if $$X = X_1 + X_2 + \cdots + X_n$$,

$X \sim \text{binom}(n, p)$
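These two descriptions are equivalent: the sum of $$n$$ independent Bernoulli draws has the same distribution as a single binomial draw. A quick sketch (the values of `n` and `p` here are arbitrary choices for illustration):

```r
set.seed(1)
# sum of 50 Bernoulli(p = 0.2) draws, one realization of X
sum(rbinom(n = 50, size = 1, prob = 0.2))
# a single binomial(n = 50, p = 0.2) draw, also one realization of X
rbinom(n = 1, size = 50, prob = 0.2)
# both have expected value n * p = 50 * 0.2 = 10
mean(replicate(2000, sum(rbinom(n = 50, size = 1, prob = 0.2))))
```

The last line averages many simulated sums and should land near 10.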

Our goal is to create a confidence interval for this parameter $$p$$ using the $$n$$ data points $$X_1, X_2, \ldots X_n$$.

We have already seen that $$\hat{p} = \frac{\sum_{i = 1}^{n}x_i}{n} = \frac{X}{n}$$ is a natural estimator for $$p$$.

Also, recall that

$\text{E}\left[\hat{p}\right] = p$

$\text{Var}\left[\hat{p}\right] = \frac{p(1 - p)}{n}$

The central limit theorem gives

$\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right)$
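We can check this approximation with a quick simulation (a sketch; the choices of `n`, `p`, and the number of replications below are arbitrary):

```r
set.seed(1)
n = 100
p = 0.2

# draw many values of p-hat, each computed from a fresh sample of size n
p_hats = replicate(5000, mean(rbinom(n = n, size = 1, prob = p)))

mean(p_hats)               # should be close to p = 0.2
var(p_hats)                # should be close to p * (1 - p) / n = 0.0016
hist(p_hats, breaks = 30)  # should look roughly normal
```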

Using these three facts, together with Slutsky’s theorem, and then some manipulation of probability statements, we obtain an approximate $$(1 - \alpha) \cdot 100\%$$ confidence interval for $$p$$ given by

$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

where $$z_{\alpha/2}$$ is the critical value such that

$P[Z > z_{\alpha/2}] = \alpha/2$

Let’s take a look at some synthetic data from this setup. In particular, let’s take a sample from a population with

$X_i \sim \text{bern}(p = 0.2)$

rbinom(n = 50, size = 1, prob = 0.2)
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0

Here we are using the rbinom() function to create a random sample of 50 Bernoulli random variables with $$p = 0.2$$. (Since size = 1, these are technically Bernoulli.) Re-run this line of code to obtain a different random sample. (Also, try modifying prob to sample from different populations and see how that changes the samples.)

Let’s create a particular sample so that we can look at some statistics of that sample.

set.seed(42)
example_data = rbinom(n = 50, size = 1, prob = 0.2)

By setting a seed, we can control the randomization and the sample stored in example_data will be the same for anyone who runs these two lines of code in order.

sum(example_data) # number of 1's in the data
## [1] 18
length(example_data) # sample size of the data
## [1] 50
sum(example_data) / length(example_data) # calculating p-hat
## [1] 0.36
mean(example_data) # easier p-hat calculation
## [1] 0.36

The above code calculates some relevant statistics for this data. In particular, we can use the mean() function to calculate $$\hat{p}$$ rather directly. (Note that here we are storing data in a vector, not a data frame, since we are only considering a single variable.)

Now let’s calculate a 95% confidence interval for $$p$$ using this data. The following code does so with descriptive comments.

# calculate the sample size of the data
n = length(example_data)
# calculate the estimate of p
p_hat = mean(example_data)
# calculate the standard error of the estimate
se = sqrt(p_hat * (1 - p_hat) / n)
# obtain the critical value of the confidence interval
z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE) # 1.960
# calculate the lower bound of the interval
p_hat - z_star * se
## [1] 0.2269532
# calculate the upper bound of the interval
p_hat + z_star * se
## [1] 0.4930468

We could consider returning the interval together as a vector with named elements.

c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
##     lower     upper
## 0.2269532 0.4930468

Since we simulated this data, we know that the true value of $$p$$ is $$0.2$$. Is this in the interval we obtained?

(p_hat - z_star * se < 0.2) & (0.2 < p_hat + z_star * se)
## [1] FALSE

Note that we need to evaluate two comparisons here. (Unfortunately p_hat - z_star * se < 0.2 < p_hat + z_star * se is not valid code.)
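If you find yourself writing this pair of comparisons repeatedly, a small helper function keeps it readable. (in_interval() is not a base R function; it is a hypothetical helper defined here for illustration.)

```r
# hypothetical helper: is value strictly inside the interval (lower, upper)?
in_interval = function(value, lower, upper) {
  (lower < value) & (value < upper)
}

# the interval calculated above was (0.2269532, 0.4930468)
in_interval(0.2, lower = 0.2269532, upper = 0.4930468)  # FALSE, as before
```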

In this lab, we are going to create a lot of confidence intervals. Rather than re-writing the above code over and over again, we should write a function. Don’t repeat yourself!

calc_ci_95 = function(data) {
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = 0.05 / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}

Here we have a function calc_ci_95() that takes as input a vector data and outputs a vector of length two which contains the lower and upper bound of a 95% confidence interval. (The input data could actually be anything, but will cause problems if it’s not a vector that contains elements that are either 0 or 1.)

We then run this function on our data.

calc_ci_95(data = example_data)
##     lower     upper
## 0.2269532 0.4930468

We get the same result, but the code is much cleaner.

ci = calc_ci_95(data = example_data)
ci["lower"] < 0.2 & 0.2 < ci["upper"]
## lower
## FALSE

We again check and verify that the true value is not within the interval.

# Exercise 1 - Calculating Confidence Intervals, Writing Functions

For this exercise we will use:

• A sample size of $$n = 10$$
• A true proportion $$p = 0.42$$

The following chunk generates a data set from this population and stores it in e1_data.

set.seed(42)
e1_data = rbinom(n = 10, size = 1, prob = 0.42)

Our previous function calc_ci_95() is sort of silly. It only calculates 95% confidence intervals. Write a new function calc_ci() that is a modification of calc_ci_95() that can calculate a confidence interval of any confidence level.

This function will have two arguments:

• data: like calc_ci_95() takes a vector of 0s and 1s
• conf_level: a value between 0 and 1 that specifies $$(1 - \alpha)$$, which we will call the confidence level of the interval.
• Set a default value of 0.95.

The function should output:

• A named vector with the lower and upper bounds of the specified interval.
# remove this comment and write your function here

Test your function by running the following code. The middle line should return a 95% confidence interval, which for this data should be (0.4159742, 0.9840258).

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
calc_ci(data = e1_data, conf_level = 0.90)
calc_ci(data = e1_data)
calc_ci(data = e1_data, conf_level = 0.99)
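For reference, one possible sketch of calc_ci() is below; your version may differ, and the posted solutions may organize it differently. The only real change from calc_ci_95() is computing $$\alpha$$ from conf_level:

```r
# one possible sketch: generalize calc_ci_95() by computing alpha
# from the supplied confidence level
calc_ci = function(data, conf_level = 0.95) {
  alpha = 1 - conf_level
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = alpha / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}
```

With the default conf_level = 0.95, this reproduces calc_ci_95() exactly.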

# Exercise 2 - 90% CI Coverage, “Small $$n$$”

What we are really interested in with this lab is the coverage probability of these confidence intervals. That is, how often does a confidence interval contain the true value of the parameter?

Wait…

Shouldn’t a 90% confidence interval have a 90% coverage probability?

Maybe not! Remember, we derived these intervals using approximate distributions, in particular

$\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right)$

If this approximation isn’t good, the interval won’t have the correct coverage!

Use the following chunk to perform a simulation study in order to calculate the coverage for a 90% confidence interval using:

• A sample size of $$n = 10$$
• A true proportion $$p = 0.42$$
• A large number of simulations, 1000

Essentially we will repeat the following process 1000 times:

• Generate a sample of size 10 from a population with $$p = 0.42$$.
• Calculate a 90% confidence interval for this data.
• Check if the true value of the parameter, $$p = 0.42$$, is in this interval.
# remove these comments and update for loop below
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
cover = rep(0, 1000)

for (i in 1:1000) {
  # generate data here
  #   n = 10, p = 0.42
  # calculate 90% confidence interval here
  #   use your function!
  # update the i-th element of cover here
  #   TRUE if 0.42 is in the interval
  #   FALSE if 0.42 is not in the interval
}

c(coverage = mean(cover))

# Exercise 3 - 90% CI Coverage, “Large $$n$$,” DRY!

The results of the previous exercise should be a coverage probability that is too low! Why? Because with $$n = 10$$, the following approximation is poor!

$\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1 - p)}{n}\right)$

We’re going to continue to perform these coverage calculations… so you should write a function. Write a function called calc_coverage().

This function will have four arguments:

• num_sims: the number of simulations to use
• Set a default value of 1000.
• p: the true value of the parameter
• Set a default value of 0.5.
• sample_size: the sample size of the data that is repeatedly generated
• Set a default value of 20.
• conf_level: the confidence level
• Set a default value of 0.95.

The function should output:

• The coverage probability. (A number between 0 and 1.)

Use the previous exercise as a template for the function.

# write function here
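One possible sketch is below. It assumes a calc_ci() like the one from Exercise 1 (repeated here so the chunk is self-contained); your version may differ:

```r
# assumed calc_ci() from Exercise 1, included so this chunk runs on its own
calc_ci = function(data, conf_level = 0.95) {
  alpha = 1 - conf_level
  n = length(data)
  p_hat = mean(data)
  se = sqrt(p_hat * (1 - p_hat) / n)
  z_star = qnorm(p = alpha / 2, lower.tail = FALSE)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}

# one possible sketch of calc_coverage(): simulate, interval, check, repeat
calc_coverage = function(num_sims = 1000, p = 0.5, sample_size = 20,
                         conf_level = 0.95) {
  cover = rep(0, num_sims)
  for (i in seq_len(num_sims)) {
    data = rbinom(n = sample_size, size = 1, prob = p)
    ci = calc_ci(data = data, conf_level = conf_level)
    cover[i] = (ci["lower"] < p) & (p < ci["upper"])
  }
  mean(cover)
}
```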

Run the following code which will calculate the coverage when:

• $$p = 0.42$$ (true parameter)
• $$n = 50$$ (sample size)
• $$(1 - \alpha) = 0.90$$ (90% confidence intervals)
• num_sims = 1000 (default number of simulations)

This should be an improvement over the previous exercise. (Closer to 0.90.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.42, sample_size = 50, conf_level = 0.90)

# Exercise 4 - 95% CI Coverage, “Large $$n$$”

Now that we have the calc_coverage() function, we can move a lot faster. Calculate the coverage of a 95% confidence interval using a sample size of 50 when $$p = 0.42$$.

Run the following code to do so! (We’re essentially just double-checking your function.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.42, sample_size = 50)

# Exercise 5 - 95% CI Coverage, Small $$p$$?

The last two exercises should have verified that a “large” $$n$$ gives good coverage. In the previous two cases, 50 was large enough. Is that always the case?

Calculate the coverage of a 95% confidence interval using a sample size of 50 when $$p = 0.07$$.

Run the following code to do so! (We’re double-checking your function, and figuring out when coverage is actually good.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.07, sample_size = 50)

# Exercise 6 - 95% CI Coverage, Small $$p$$?

Hmm. In the last exercise $$n = 50$$ wasn’t enough.

Calculate the coverage of a 95% confidence interval using a sample size of 500 when $$p = 0.07$$.

Run the following code to do so! (This should indicate that for any $$p$$ there is some $$n$$ that is large enough. The smaller the $$p$$, the larger the $$n$$ needed.)

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(p = 0.07, sample_size = 500)

# Exercise 7 - Number of Simulations?

Technically, throughout this document we haven’t really been calculating the coverage directly. Instead, in some sense, we’re actually estimating it using simulation. (Estimating an estimation procedure…)

We should note that the number of simulations used has an effect on this “calculation.” Run the following code to get a better understanding of this idea.

# remove these comments
# change eval = FALSE to eval = TRUE in chunk options
set.seed(42)
calc_coverage(num_sims = 10, p = 0.35, sample_size = 500, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 100, p = 0.35, sample_size = 500, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 1000, p = 0.35, sample_size = 500, conf_level = 0.90)
set.seed(42)
calc_coverage(num_sims = 10000, p = 0.35, sample_size = 500, conf_level = 0.90)
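This run-to-run variability can be quantified. The simulated coverage is itself a sample proportion (each simulation either covers or doesn’t), so its Monte Carlo standard error is roughly sqrt(c * (1 - c) / num_sims), where c is the true coverage. A sketch, assuming the true coverage is about 0.90:

```r
# approximate Monte Carlo standard error of the estimated coverage,
# assuming the true coverage is about 0.90
true_coverage = 0.90
num_sims = c(10, 100, 1000, 10000)
round(sqrt(true_coverage * (1 - true_coverage) / num_sims), 4)
```

With only 10 simulations the estimate can easily be off by around 0.1; with 10000 it is usually within a few thousandths, which is why the estimates above stabilize as num_sims grows.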

We’ve really only considered a small subset of the possible:

• parameters, $$p$$
• sample sizes, $$n$$
• confidence levels, $$(1 - \alpha)$$
• number of simulations

In the solutions, we’ll include much more detailed simulation studies that further reinforce the ideas of this lab; the coverage of these intervals is a well-studied topic in mathematical statistics.

The above code creates plots for a large number of values of $$p$$, for various sample sizes, for a confidence level of $$0.90$$. (They would look similar but shifted for different confidence levels.) Hopefully this illustrates that there is a relationship between $$n$$ and $$p$$ that affects the coverage of these intervals.

The code to generate the plots is “hidden.” Check the corresponding R Markdown document for details. The code to perform the simulations could be written a number of different ways, some which would be considered “better” and more efficient. This presentation was used for clarity of ideas.

The plots above were created using 10000 simulations for each value of $$p$$. Is that enough? Too many? The following code and plot essentially re-create Exercise 7, but for any number of simulations. We see that as the number of simulations increases, the “estimated” coverage converges to the value that we expect.