---
title: 'STAT 3202: Lab 04, Simulation and Bootstrap CIs'
author: "Spring 2019, OSU"
date: 'Due: Friday, February 22'
output:
html_document:
theme: spacelab
toc: yes
pdf_document: default
urlcolor: BrickRed
---
***
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
**Goal:** After completing this lab, you should be able to...
- *Simulate* statistics from known distributions to estimate sampling distributions.
- *Bootstrap* any statistic.
- *Create* confidience intervals using bootstrap resampling.
In this lab we will use, but not focus on...
- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.
Some additional notes:
- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission, and grading.
- You may use [this document](lab-04-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
- The [reading posted with the lab](https://daviddalpiaz.github.io/stat3202-sp19/#lab-04) and the associated `R` Markdown file contain details on performing the required tasks through a number of examples.
- This lab will also serve as Homework 05. Your grade on this lab will also count as your grade on Homework 05.
***
# Exercise 1 - How Large is Large ?
For this exercise we will use:
- Random samples of size $n = 5$, $n = 25$, and $n = 100$.
- Samples from an gamma distribution with
- $\alpha = 0.2$, that is, `shape = 0.2`
- $\beta = 1.4$, that is, `scale = 1.4`
Consider using the sample mean, $\bar{x}$, to estimate the mean, $\mu = \text{E}[X] = \alpha\beta = 0.28$.
If $n$ is "large" then the central limit theorem suggests that
$$
\bar{X} \stackrel{approx}{\sim} N\left(\alpha\beta, \frac{\alpha\beta^2}{n}\right)
$$
which with some additional work we could then use to create confidence intervals. (We'd also need to estimate the variance.)
However, when is this approximation **good**?
Perform three simulation studies:
- Study 1: Samples of size $n = 5$
- Study 2: Samples of size $n = 25$
- Study 3: Samples of size $n = 100$
For each, simulate a sample of the specified size from a given gamma distribution 10000 times. For each simulation calculate and store the sample mean.
For each study create a histogram of the simulated sample means. (These will serve as an estimate of the sampling distribution of $\bar{X}$.) For each, overlay the distribution if the CLT approximation was appropriate:
$$
N\left(\alpha\beta, \frac{\alpha\beta^2}{n}\right)
$$
The chunks below outline this procedure.
**Hint:** Done correctly, you should find that the approximation is bad for $n = 5$, reasonable for $n = 100$ and you may be uncertain about $n = 25$.
```{r,}
set.seed(42)
n = 5
sample_means_n_5 = rep(0, 10000)
# perform simulations for n = 5 here
```
```{r}
set.seed(42)
n = 25
sample_means_n_25 = rep(0, 10000)
# perform simulations for n = 25 here
```
```{r}
set.seed(42)
n = 100
sample_means_n_100 = rep(0, 10000)
# perform simulations for n = 100 here
```
```{r, fig.height = 4, fig.width = 12}
par(mfrow = c(1, 3))
# create histogram for n = 5 here
# add curve for normal density assuming CLT applies
# create histogram for n = 25 here
# add curve for normal density assuming CLT applies
# create histogram for n = 100 here
# add curve for normal density assuming CLT applies
```
***
# Exercise 2 - How Long is a Trump Tweet?
Twitter has become an increasingly important part of the American political discourse. The 2016 Presidential election was unique in that all major contenders were somewhat prolific tweeters. The eventual winner, Donald Trump, was undoubtedly the most prolific of them all, and continues to be an active twitter user now that we over two years into his presidency.
This use of Twitter sparked an interesting analysis by [David Robinson](http://varianceexplained.org/) who is currently the Chief Data Scientist at DataCamp and former Data Scientist at StackOverflow. His analysis, [__*"Text analysis of Trump's tweets confirms he writes only the (angrier) Android half"*__](http://varianceexplained.org/r/trump-tweets/), became very popular leading up to the election.
Let's take a look at this data. To do so, we'll need a couple packages:
```{r, message = FALSE, warning = FALSE}
library(dplyr)
library(tidyr)
```
Then to load the data in a data frame named `trump_tweets_df`, we use:
```{r}
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
```
We then create a new data frame named `tweets` based on the raw data:
```{r}
tweets = trump_tweets_df %>%
select(id, statusSource, text, created) %>%
extract(statusSource, "source", "Twitter for (.*?)<") %>%
filter(source %in% c("iPhone", "Android"))
```
```{r}
tweets
```
This dataset is a collection of 1390 tweets from Twitter user `@realDonaldTrump`. For this exercise we will be interested in the `text` variable which contains the text of each tweet.
For example:
```{r}
tweets[2, "text"]
```
More specifically, we'll be interested in the lengths of these tweets:
```{r}
tweet_lengths = nchar(tweets$text)
head(tweet_lengths)
```
```{r}
hist(tweet_lengths, col = "darkgrey",
main = "@realdonaldtrump Tweets",
xlab = "Number of Characters")
box()
grid()
```
Here we see that the sample standard deviation, $s$, is
```{r}
sd(tweet_lengths)
```
Notice that many of these tweets are close to the (at the time) 140 character limit.
- Create a 90% bootstrap confidence interval for $\sigma$, the true standard deviation of tweet length in characters. Use 10000 bootstrap samples. Also plot a histogram of the bootstrap replicates of $s$, the estimator of $\sigma$.
```{r}
set.seed(1)
boot_sd = rep(0, 10000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_sd
```
```{r}
# create histogram here
```
```{r}
# calculate confidence interval here
```
***
# Exercise 3 - How Much Do Professors Earn?
For this exercise we will use the `Salaries` data from the `carData` package.
```{r}
head(carData::Salaries)
```
We'll focus on the `salary` variable.
```{r}
prof_salary = carData::Salaries$salary
```
```{r}
hist(prof_salary, col = "darkgrey", breaks = 20,
xlab = "Salary (Dollars)", main = "Histogram of Professor Salaries")
box()
grid()
```
What is the 10th [percentile](https://en.wikipedia.org/wiki/Percentile), $\hat{p}_{0.10}$ of this data? That is, what is the salary that 10% of the professors make less than?
```{r}
quantile(prof_salary, probs = 0.10)
```
- Create a 99% bootstrap confidence interval for $p_{0.10}$, the true 10th percentile of professor salaries. Use 5000 bootstrap samples. Also plot a histogram of the bootstrap replicates of $\hat{p}_{0.10}$.
```{r}
set.seed(1)
boot_10th = rep(0, 5000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_10th
```
```{r}
# create histogram here
```
```{r}
# calculate confidence interval here
```
***
# Exercise 4 - How Long Will You Survive Cancer?
For this exercise we will use the `Melanoma` data from the `MASS` package.
```{r}
head(MASS::Melanoma)
```
We'll focus on the `time` variable which is survival time in days.
```{r}
mel_survive = MASS::Melanoma$time
```
```{r}
hist(mel_survive, col = "darkgrey",
xlab = "Survival (Days)", main = "Histogram of Melanoma Survival")
box()
grid()
```
What is the probability of surviving longer that 10 years? That is, if $X$ is the survival time in years, what is $P[X > 10]$?
With this data, we could estimate. We calculate $\hat{P}[X > 10]$ using
```{r}
mean(mel_survive > 10 * 365)
```
- Create a 95% bootstrap confidence interval for $P[X > 10]$, the probability of surviving longer that 10 years. Use 20000 bootstrap samples. Also plot a histogram of the bootstrap replicates of $\hat{P}[X > 10]$.
```{r}
set.seed(1)
boot_10_year_surv = rep(0, 20000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_10_year_surv
```
```{r}
# create histogram here
```
```{r}
# calculate confidence interval here
```
***
# Exercise 5 - Who's Tweeting?
Returning to the tweet data, the David Robinson analysis attempted to show that there were two different groups tweeting on the `@realDonaldTrump` account. (This is a common occurrence for public figures. Some will be from a media team, while others are from the individual. Some choose to disclose when this occurs, others, like the Donald Trump account, do not.)
In this case the hypothesis was that the tweets sent from an iPhone were from campaign staff, while the tweets sent from Android were sent from Donald Trump.
```{r}
android_tweets = nchar(subset(tweets, source == "Android")$text)
iphone_tweets = nchar(subset(tweets, source == "iPhone")$text)
```
The posted analysis uses many detailed analyses to argue for this difference, but we'll resort to some simpler methods.
```{r, fig.height = 5, fig.width = 10}
par(mfrow = c(1, 2))
hist(android_tweets, col = "darkorange",
main = "@realdonaldtrump Android Tweets",
xlab = "Number of Characters")
box()
grid()
hist(iphone_tweets, col = "dodgerblue",
main = "@realdonaldtrump iPhone Tweets",
xlab = "Number of Characters")
box()
grid()
```
Looking at these two histograms, we see a clear difference in distributions. However, this could just be due to chance...
Further simplifying, let's look at the sample standard deviations of these two datasets.
```{r}
sd(android_tweets)
sd(iphone_tweets)
```
- $s_{\text{Android}} = 26.34$
- $s_{\text{iPhone}} = 30.86$
If they were sent by the same person, you'd expect them to be equal. (The distributions should be the same, so the standard deviations should be the same.) Or in reality, at least close, but different due to random chance.
Is this difference due to chance?
- Create a 95% bootstrap confidence interval for $\frac{{\sigma}_{\text{Android}}}{{\sigma}_{\text{iPhone}}}$ the ratio of standard deviation of tweet lengths between DJT Android and iPhone tweets. Use 5000 bootstrap samples. Also plot a histogram of the bootstrap replicates of $\frac{{s}_{\text{Android}}}{{s}_{\text{iPhone}}}$, the estimates of the ratio of standard deviations. (If the interval contains the value one, we might believe the two groups are the same people, if not, we will doubt they are the same people.)
**Hint:** This is a two sample problem. You'll need to create two "resamples," one of the Android data and one of the iPhone data to create each bootstrap resample.
```{r}
set.seed(1)
boot_sd_ratio = rep(0, 5000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_sd_ratio
```
```{r}
# create histogram here
```
```{r}
# calculate confidence interval here
```
***