Goal: After completing this lab, you should be able to…

In this lab we will use, but not focus on…

Some additional notes:


Exercise 1 - How Large is Large?

For this exercise we will use:

Consider using the sample mean, \(\bar{x}\), to estimate the mean, \(\mu = \text{E}[X] = \alpha\beta = 0.36\).

If \(n\) is “large” then the central limit theorem suggests that

\[ \bar{X} \stackrel{\text{approx}}{\sim} N\left(\alpha\beta, \frac{\alpha\beta^2}{n}\right) \]

which with some additional work we could then use to create confidence intervals. (We’d also need to estimate the variance.)
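
As a rough sketch of that additional work, the interval below replaces the unknown variance with the sample variance. The gamma parameters (`alpha = 2`, `beta = 0.5`), the sample size, and the 95% level are placeholder assumptions, not the values used in this lab.

```r
# Hypothetical gamma parameters, chosen only for illustration
alpha = 2    # shape
beta  = 0.5  # scale

set.seed(42)
n = 100
x = rgamma(n, shape = alpha, scale = beta)

# CLT-based 95% interval: x_bar +/- z * s / sqrt(n),
# where the sample standard deviation s stands in for the unknown one
x_bar  = mean(x)
se     = sd(x) / sqrt(n)
clt_ci = x_bar + c(-1, 1) * qnorm(0.975) * se
clt_ci
```

Whether an interval like this has close to 95% coverage depends on how well the normal approximation holds.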

However, when is this approximation good?

Perform three simulation studies, one for each of the sample sizes \(n = 10\), \(n = 30\), and \(n = 100\).

For each, simulate a sample of the specified size from a given gamma distribution 5000 times. For each simulation calculate and store the sample mean.
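
A minimal sketch of one such study is shown here. The gamma parameters are hypothetical stand-ins (this is not the distribution specified for the lab), and `sample_means` is just an illustrative name.

```r
# Hypothetical gamma parameters, not the ones specified for this lab
alpha = 2    # shape
beta  = 0.5  # scale

set.seed(42)
n = 30
sample_means = rep(0, 5000)

# each iteration draws one sample of size n and stores its mean
for (i in seq_along(sample_means)) {
  sample_means[i] = mean(rgamma(n, shape = alpha, scale = beta))
}

# compare simulated moments to alpha * beta and alpha * beta^2 / n
c(simulated = mean(sample_means), theory = alpha * beta)
c(simulated = var(sample_means),  theory = alpha * beta ^ 2 / n)
```

The mean and variance of \(\bar{X}\) match these formulas for any \(n\), so agreement here says nothing about normality. It is the shape of the histogram that shows whether the approximation is reasonable.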

For each study, create a histogram of the simulated sample means. (These will serve as an estimate of the sampling distribution of \(\bar{X}\).) On each histogram, overlay the density that \(\bar{X}\) would follow if the CLT approximation were appropriate:

\[ N\left(\alpha\beta, \frac{\alpha\beta^2}{n}\right) \]
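
One possible way to draw such an overlay is sketched below, again with hypothetical gamma parameters rather than the lab's. Note that `dnorm()` expects a standard deviation, so the variance \(\alpha\beta^2 / n\) is passed through `sqrt()`.

```r
# Hypothetical gamma parameters, not the ones specified for this lab
alpha = 2    # shape
beta  = 0.5  # scale
n     = 30

set.seed(42)
sample_means = replicate(5000, mean(rgamma(n, shape = alpha, scale = beta)))

# probability = TRUE puts the histogram on the density scale,
# so the normal curve is directly comparable
hist(sample_means, probability = TRUE, breaks = 30, col = "darkgrey",
     main = "n = 30", xlab = "Sample Mean")
box()
grid()
curve(dnorm(x, mean = alpha * beta, sd = sqrt(alpha * beta ^ 2 / n)),
      add = TRUE, col = "dodgerblue", lwd = 2)
```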

The chunks below outline this procedure.

Hint: Done correctly, you should find that the approximation is bad for \(n = 10\), reasonable for \(n = 100\) and you may be uncertain about \(n = 30\).

set.seed(42)
n = 10
sample_means_n_10 = rep(0, 5000)
# perform simulations for n = 10 here

set.seed(42)
n = 30
sample_means_n_30 = rep(0, 5000)
# perform simulations for n = 30 here

set.seed(42)
n = 100
sample_means_n_100 = rep(0, 5000)
# perform simulations for n = 100 here

par(mfrow = c(1, 3))

# create histogram for n = 10 here
# add curve for normal density assuming CLT applies

# create histogram for n = 30 here
# add curve for normal density assuming CLT applies

# create histogram for n = 100 here
# add curve for normal density assuming CLT applies

Exercise 2 - How Long is a Trump Tweet?

Twitter has become an increasingly important part of American political discourse. The 2016 presidential election was unique in that all major contenders were somewhat prolific tweeters. The eventual winner, Donald Trump, was undoubtedly the most prolific of them all, and he continues to be an active Twitter user as we approach two years into his presidency.

This use of Twitter sparked an interesting analysis by David Robinson, who is currently the Chief Data Scientist at DataCamp and a former Data Scientist at Stack Overflow. His analysis, “Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half”, became very popular leading up to the election.

Let’s take a look at this data. To do so, we’ll need a couple of packages:

library(dplyr)
library(tidyr)

Then, to load the data into a data frame named trump_tweets_df, we use:

load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))

We then create a new data frame named tweets based on the raw data:

tweets = trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
tweets
## # A tibble: 1,390 x 4
##    id          source text                             created            
##    <chr>       <chr>  <chr>                            <dttm>             
##  1 7626698825… Andro… My economic policy speech will … 2016-08-08 15:20:44
##  2 7626415954… iPhone Join me in Fayetteville, North … 2016-08-08 13:28:20
##  3 7624396589… iPhone "#ICYMI: \"Will Media Apologize… 2016-08-08 00:05:54
##  4 7624253718… Andro… Michael Morell, the lightweight… 2016-08-07 23:09:08
##  5 7624008698… Andro… "The media is going crazy. They… 2016-08-07 21:31:46
##  6 7622845333… Andro… I see where Mayor Stephanie Raw… 2016-08-07 13:49:29
##  7 7621109187… iPhone Thank you Windham, New Hampshir… 2016-08-07 02:19:37
##  8 7621069044… iPhone ".@Larry_Kudlow - 'Donald Trump… 2016-08-07 02:03:39
##  9 7621044117… Andro… I am not just running against C… 2016-08-07 01:53:45
## 10 7620164261… iPhone "#CrookedHillary is not fit to … 2016-08-06 20:04:08
## # ... with 1,380 more rows
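
To see what the regular expression passed to `extract()` captures, the sketch below applies the same pattern with base R to three made-up `statusSource` strings. The strings are illustrative assumptions about the raw format, not rows copied from the data.

```r
# Made-up examples of what a raw statusSource value might look like
status_source = c(
  "<a href=\"http://twitter.com/download/android\">Twitter for Android</a>",
  "<a href=\"http://twitter.com/download/iphone\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com\">Twitter Web Client</a>"
)

pattern = "Twitter for (.*?)<"

# regexec() returns the full match followed by each capture group;
# element 2 is the text captured by (.*?)
matches = regmatches(status_source, regexec(pattern, status_source))
device  = sapply(matches, function(m) if (length(m) >= 2) m[2] else NA_character_)
device
```

The third string has no "Twitter for ..." text, so it yields `NA`. In the pipeline above, rows like that are then dropped by the `filter()` step.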

This dataset is a collection of 1390 tweets from the Twitter user @realDonaldTrump. For this exercise we will be interested in the text variable, which contains the text of each tweet.

For example:

tweets[2, "text"]
## # A tibble: 1 x 1
##   text                                                                     
##   <chr>                                                                    
## 1 Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets…

More specifically, we’ll be interested in the lengths of these tweets:

tweet_lengths = nchar(tweets$text)
head(tweet_lengths)
## [1]  67 114  64 134 135 138

hist(tweet_lengths, col = "darkgrey",
     main = "@realdonaldtrump Tweets",
     xlab = "Number of Characters")
box()
grid()

Here we see that the sample median, \(\hat{m}\), is

median(tweet_lengths)
## [1] 127

Notice that many of these tweets are close to the 140-character limit that applied at the time.

set.seed(1)
boot_med = rep(0, 10000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_med
# create histogram here
# calculate confidence interval here
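
One way the resampling step could look is sketched below, using a percentile interval. To stay self-contained it uses simulated stand-in lengths (`fake_lengths`, a hypothetical name) rather than `tweet_lengths`. The 95% level and the percentile method are also assumptions, since the lab may ask for a different interval.

```r
set.seed(1)
# stand-in data: 200 made-up "tweet lengths" between 20 and 140 characters
fake_lengths = sample(20:140, size = 200, replace = TRUE)

boot_med = rep(0, 10000)
for (i in seq_along(boot_med)) {
  # resample the data with replacement, same size as the original sample
  resample = sample(fake_lengths, replace = TRUE)
  boot_med[i] = median(resample)
}

hist(boot_med, col = "darkgrey",
     main = "Bootstrap Sample Medians", xlab = "Median Number of Characters")
box()
grid()

# 95% percentile interval: the middle 95% of the bootstrap replicates
quantile(boot_med, probs = c(0.025, 0.975))
```

Because the median of discrete data can take only a limited set of values, the histogram of replicates will often look lumpy rather than smooth.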

Exercise 3 - How Much Do Professors Earn?

For this exercise we will use the Salaries data from the carData package.

head(carData::Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

We’ll focus on the salary variable.

prof_salary = carData::Salaries$salary
hist(prof_salary, col = "darkgrey", breaks = 20,
     xlab = "Salary (Dollars)", main = "Histogram of Professor Salaries")
box()
grid()

What is the 25th percentile, \(\hat{p}_{0.25}\), of this data? That is, what is the salary below which 25% of the professors fall?

quantile(prof_salary, probs = 0.25)
##   25% 
## 91000

set.seed(1)
boot_25th = rep(0, 5000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_25th
# create histogram here
# calculate confidence interval here
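
The resampling idea carries over to any statistic; only the function applied to each resample changes. The sketch below uses `replicate()` and made-up right-skewed values (`fake_salary`, a hypothetical stand-in for `prof_salary`), with a 95% percentile interval assumed.

```r
set.seed(1)
# stand-in data: 300 made-up right-skewed "salaries"
fake_salary = rlnorm(300, meanlog = log(100000), sdlog = 0.25)

# replicate() evaluates the expression 5000 times and collects the results
boot_25th = replicate(5000, {
  resample = sample(fake_salary, replace = TRUE)
  quantile(resample, probs = 0.25, names = FALSE)
})

quantile(boot_25th, probs = c(0.025, 0.975))
```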

Exercise 4 - How Long Will You Survive Cancer?

For this exercise we will use the Melanoma data from the MASS package.

head(MASS::Melanoma)
##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2   30      3   1  56 1968      0.65     0
## 3   35      2   1  41 1977      1.34     0
## 4   99      3   0  71 1968      2.90     0
## 5  185      1   1  52 1965     12.08     1
## 6  204      1   1  28 1971      4.84     1

We’ll focus on the time variable, which is survival time in days.

mel_survive = MASS::Melanoma$time
hist(mel_survive, col = "darkgrey",
     xlab = "Survival (Days)", main = "Histogram of Melanoma Survival")
box()
grid()

What is the probability of surviving longer than 5 years? That is, if \(X\) is the survival time in years, what is \(P[X > 5]\)?

With this data, we can estimate this probability. Since the recorded times are in days, we calculate \(\hat{P}[X > 5]\) using

mean(mel_survive > 5 * 365)
## [1] 0.595122

set.seed(1)
boot_5_year_surv = rep(0, 20000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_5_year_surv
# create histogram here
# calculate confidence interval here
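
A proportion can be bootstrapped in the same way, since the mean of a logical vector is the proportion of `TRUE` values. The sketch below assumes made-up survival times (`fake_days`, a hypothetical stand-in for `mel_survive`) and a 95% percentile interval.

```r
set.seed(1)
# stand-in data: 200 made-up survival times in days
fake_days = round(rexp(200, rate = 1 / 2500))

boot_5_year_surv = rep(0, 20000)
for (i in seq_along(boot_5_year_surv)) {
  resample = sample(fake_days, replace = TRUE)
  # proportion of the resample that exceeds 5 years (in days)
  boot_5_year_surv[i] = mean(resample > 5 * 365)
}

quantile(boot_5_year_surv, probs = c(0.025, 0.975))
```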

Exercise 5 - Who’s Tweeting?

Returning to the tweet data, the David Robinson analysis attempted to show that there were two different groups tweeting on the @realDonaldTrump account. (This is a common occurrence for public figures. Some tweets will be from a media team, while others are from the individual. Some accounts choose to disclose when this occurs; others, like the Donald Trump account, do not.)

In this case the hypothesis was that the tweets sent from an iPhone were from campaign staff, while the tweets sent from an Android were sent by Donald Trump.

android_tweets = nchar(subset(tweets, source == "Android")$text)
iphone_tweets  = nchar(subset(tweets, source == "iPhone")$text)

The posted analysis uses many detailed analyses to argue for this difference, but we’ll resort to some simpler methods.

par(mfrow = c(1, 2))
hist(android_tweets, col = "darkorange",
     main = "@realdonaldtrump Android Tweets",
     xlab = "Number of Characters")
box()
grid()
hist(iphone_tweets, col = "dodgerblue",
     main = "@realdonaldtrump iPhone Tweets",
     xlab = "Number of Characters")
box()
grid()

Looking at these two histograms, we see a clear difference in distributions. However, this could just be due to chance…

Further simplifying, let’s look at the sample medians of these two datasets.

median(android_tweets)
## [1] 135
median(iphone_tweets)
## [1] 108

If they were sent by the same person, you’d expect them to be equal. (The distributions should be the same, so the medians should be the same.) In reality, you’d expect them to be at least close, differing only due to random chance.

Is this difference due to chance?

Hint: This is a two sample problem. You’ll need to create two “resamples,” one of the Android data and one of the iPhone data to create each bootstrap resample.

set.seed(1)
boot_diff_med = rep(0, 5000)
# perform bootstrap resampling here
# store the bootstrap replicates in boot_diff_med
# create histogram here
# calculate confidence interval here
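
Following the hint, a two-sample version might look like the sketch below. The two groups are made-up stand-ins (`fake_android` and `fake_iphone` are hypothetical names), and the 95% percentile interval is an assumption rather than the lab's stated requirement.

```r
set.seed(1)
# stand-in data for two groups of made-up lengths
fake_android = sample(60:140, size = 150, replace = TRUE)
fake_iphone  = sample(40:140, size = 120, replace = TRUE)

boot_diff_med = rep(0, 5000)
for (i in seq_along(boot_diff_med)) {
  # resample each group separately, keeping each group's own sample size
  android_resample = sample(fake_android, replace = TRUE)
  iphone_resample  = sample(fake_iphone, replace = TRUE)
  boot_diff_med[i] = median(android_resample) - median(iphone_resample)
}

diff_ci = quantile(boot_diff_med, probs = c(0.025, 0.975))
diff_ci

# TRUE if the interval excludes 0, which would make "no difference" implausible
unname(diff_ci[1] > 0 | diff_ci[2] < 0)
```

Resampling the groups separately preserves the two sample sizes and makes no assumption that the groups share a distribution.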