---
title: 'STAT 3202: Lab 07, ANOVA'
author: "Spring 2019, OSU"
date: 'Due: Friday, March 29'
output:
html_document:
theme: spacelab
toc: yes
df_print: paged
pdf_document: default
urlcolor: BrickRed
---
***
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 8)
```
**Goal:** After completing this lab, you should be able to...
- *Perform* ANOVA tests.
- *Perform* model diagnostics on ANOVA tets.
In this lab we will use, but not focus on...
- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.
Some additional notes:
- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission, and grading.
- You may use [this document](lab-07-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so.
- Since this week is an exam week, the hope is that it is possible to complete this lab during lab! (If not, blame Dave.)
***
# Exercise 0A - Pain Reduction
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
library(GGally)
```
For this lab we will again use some elements of the [`tidyverse`](https://www.tidyverse.org/) as a preview of the next lab which will focus on using the `tidyverse`. (And hopefully be helpful for your second group projects.) The `GGally` package is need to put plots created using `ggplot2` side-by-side. (You may need to install this package.)
**Note:** To get these document easier to read, some code as been suppressed. Please see the RMarkdown document for access to all code used to create this document.
Recall the pain reduction data from class.
```{r, message = FALSE, warning = FALSE}
pain = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/pain.csv")
```
```{r, echo = FALSE}
pain %>% group_by(treatment) %>% summarise(mean = mean(change),
var = var(change),
n = n())
```
```{r, echo = FALSE}
pain_p1 = ggplot(pain, aes(x = treatment, y = change, col = treatment)) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2))
pain_p2 = ggplot(pain, aes(x = change, fill = treatment)) +
geom_density(alpha = .6)
ggmatrix(list(pain_p1, pain_p2), nrow = 1, ncol = 2)
```
In class, we tested
$$
H_0: \mu_B = \mu_C = \mu_M
$$
against an alternative where at least one pair of means are different. Here,
- $\mu_B$ is the mean pain reduction for the audiobook group.
- $\mu_C$ is the mean pain reduction for the control group.
- $\mu_M$ is the mean pain reduction for the music group.
```{r}
pain_anova_model = aov(change ~ treatment, data = pain)
summary(pain_anova_model)
```
The above code gives us the ANOVA table, and importantly the p-value for the above test.
```{r, echo = FALSE}
par(mfrow = c(1, 2))
qqnorm(residuals(pain_anova_model), pch = 20)
qqline(residuals(pain_anova_model), lwd = 2, col = "darkred")
grid()
plot(fitted(pain_anova_model), residuals(pain_anova_model),
xlab = "Fitted Values", ylab = "Residuals", pch = 20,
main = "Residuals vs Fitted Values")
grid()
abline(h = 0, col = "darkred", lwd = 2)
```
- A **valid** test? Looking at the qq-plot we have no reason to doubt the normality assumption, and looking at the residuals vs fitted plot, we have no reason to doubt the equal variance assumption, so we believe this to be a valid test.
- **Significance**? Supposing that we performed this test at a significance level of 0.05, based on the above p-value we reject the null hypothesis and there is sufficient evidence to suggest that the treatment *causes* a difference in mean pain reduction. (We can claim causality as this was an *experiment*.)
- **Difference**? Based on the estimated means, it seems that both treatments (audiobook and music) may be effective, but further testing should be done.
****
# Exercise 0B - Dead Poets
> "O Captain! My Captain!"
Recall the writers data from class. Do writers of non-fictions, novels, and poems die at different ages?
```{r, message = FALSE, warning = FALSE}
deadpoets = read_table2("https://daviddalpiaz.github.io/stat3202-sp19/data/deadpoets.txt",
col_types = cols(Type1 = col_skip()))
```
```{r, echo = FALSE}
deadpoets %>% group_by(Type) %>% summarise(mean = mean(Age),
var = var(Age),
n = n())
```
```{r, echo = FALSE}
deadpoets_p1 = ggplot(deadpoets, aes(x = Type, y = Age, col = Type)) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2))
deadpoets_p2 = ggplot(deadpoets, aes(x = Age, fill = Type)) +
geom_density(alpha = .6)
ggmatrix(list(deadpoets_p1, deadpoets_p2), nrow = 1, ncol = 2)
```
In class, we tested
$$
H_0: \mu_N = \mu_F = \mu_P
$$
against an alternative where at least one pair of means are different. Here,
- $\mu_N$ is the mean age of death for the non-fiction writers.
- $\mu_F$ is the mean age of death for the novel (fiction) writers.
- $\mu_P$ is the mean age of death for the poets.
```{r}
deadpoets_anova_model = aov(Age ~ Type, data = deadpoets)
summary(deadpoets_anova_model)
```
```{r, echo = FALSE}
par(mfrow = c(1, 2))
qqnorm(residuals(deadpoets_anova_model), pch = 20)
qqline(residuals(deadpoets_anova_model), lwd = 2, col = "darkred")
grid()
plot(fitted(deadpoets_anova_model), residuals(deadpoets_anova_model),
xlab = "Fitted Values", ylab = "Residuals", pch = 20,
main = "Residuals vs Fitted Values")
grid()
abline(h = 0, col = "darkred", lwd = 2)
```
- A **valid** test? The qq-plot is OK, so we'll accept the normality assumption. The residuals versus fitted plot looks good, so we have no reason to doubt the equal variance assumption. This seems like a valid test.
- **Significance**? Supposing that we performed this test at a significance level of 0.05, based on the above p-value we reject the null hypothesis and there is sufficient evidence to suggest that type of writing is *associated* with age at death. (This is an *observational* study.)
- **Difference**? It seems that in this data poets die youngest, non-fiction writers live the longest.
****
# Exercise 1 - 2019 NCAA Tournament, First Round
Does the "region" that an NCAA tournament game is played in effect the point differential? (More sports data? Note from Dave: This is my fault! But maybe you'll enjoy the next two exercises more...)
To investigate this question, we will use data in the [GitHub repository "2019 NCAA Men's March Madness"](https://github.com/daviddalpiaz/data/tree/master/ncaa-2019-mens-march-madness) which also contains a description of the data. (As of the time of this lab, only data for the first round is used.) Note that the regions are where the games are played in the tournament and are not dependent on where the schools are located.
```{r, message = FALSE, warning = FALSE}
ncaa = read_csv("https://raw.githubusercontent.com/daviddalpiaz/data/master/ncaa-2019-mens-march-madness/ncaa-2019-mens-march-madness.csv")
ncaa = ncaa %>% mutate(score_diff = highseedscore - lowseedscore)
```
We made one change to the data, which is to create a new variable called `score_diff` which stores the difference in points scored between the higher seeded team (say a 1 seed) compared to the lower seed (say a 16 seed.).
```{r, echo = FALSE}
ncaa
```
A `score_diff` less than `0` indicates an upset, of which there were 12 in the first round. (But I imagine one matters a lot more than the others...)
```{r}
sum(ncaa$score_diff < 0)
```
```{r, echo = FALSE}
ncaa %>% group_by(region) %>% summarise(mean = mean(score_diff),
var = var(score_diff),
n = n())
```
```{r, echo = FALSE}
ncaa_p1 = ggplot(ncaa, aes(x = region, y = score_diff, col = region)) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2))
ncaa_p2 = ggplot(ncaa, aes(x = score_diff, fill = region)) +
geom_density(alpha = .6)
ggmatrix(list(ncaa_p1, ncaa_p2), nrow = 1, ncol = 2)
```
Perform the test
$$
H_0: \mu_E = \mu_M = \mu_S = \mu_W
$$
against an alternative where at least one pair of means are different. Here,
- $\mu_E$ is the mean score difference for the East region.
- $\mu_M$ is the mean score difference for the Midwest region.
- $\mu_S$ is the mean score difference for the South region.
- $\mu_W$ is the mean score difference for the West region.
Do the following:
- Create the ANOVA table for this test.
- Check the assumptions (normality and equal variance) of this test.
- State your conclusion using a significance level of 0.05. When doing so consider:
- Is this an experiment or observational data?
- Did you perform a valid test?
- Have you found a statistically significant results?
- If so, how are the means different?
```{r}
# fit model and output ANOVA table here
```
```{r}
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot ehre
```
- **REMOVE THIS TEXT AND STATE YOUR CONCLUSION HERE**
****
# Exercise 2 - Caffine, Naps, and Learning
Help! I need to memorize something quickly. Should I:
- Take a nap?
- Take 200mg of caffeine?
- Do nothing? (Take a placebo.)
That, among other things, was investigated in a 2009 study title [Comparing the benefits of Caffeine, Naps and Placebo on Verbal, Motor and Perceptual Memory](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603066/)
Because the data for this study is not publicly available, data that has been simulated to replicate a particular finding can be found [here](https://github.com/daviddalpiaz/data/tree/master/caffeine-naps-placebo). A [description of the data](https://github.com/daviddalpiaz/data/tree/master/caffeine-naps-placebo) can be found there as well.
```{r, message = FALSE, warning = FALSE}
caf_nap_recall = read_csv("https://github.com/daviddalpiaz/data/raw/master/caffeine-naps-placebo/caf-nap-recall.csv")
```
```{r, echo = FALSE}
caf_nap_recall
```
In brief, an experiment was run to determine how the above treatments affected memory, in particular the ability to remember a list of words.
```{r, echo = FALSE}
caf_nap_recall %>% group_by(treatment) %>% summarise(mean = mean(words_recalled),
var = var(words_recalled),
n = n())
```
```{r, echo = FALSE}
caff_p1 = ggplot(caf_nap_recall, aes(x = treatment, y = words_recalled, col = treatment)) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2))
caff_p2 = ggplot(caf_nap_recall, aes(x = words_recalled, fill = treatment)) +
geom_density(alpha = .6)
ggmatrix(list(caff_p1, caff_p2), nrow = 1, ncol = 2)
```
Perform the test
$$
H_0: \mu_C = \mu_N = \mu_P
$$
against an alternative where at least one pair of means are different. Here,
- $\mu_C$ is the mean number of words remembered for the caffeine group.
- $\mu_N$ is the mean number of words remembered for the nap group.
- $\mu_P$ is the mean number of words remembered for the placebo group.
Do the following:
- Create the ANOVA table for this test.
- Check the assumptions (normality and equal variance) of this test.
- State your conclusion using a significance level of 0.05. When doing so consider:
- Is this an experiment or observational data?
- Did you perform a valid test?
- Have you found a statistically significant results?
- If so, how are the means different?
```{r}
# fit model and output ANOVA table here
```
```{r}
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot ehre
```
- **REMOVE THIS TEXT AND STATE YOUR CONCLUSION HERE**
****
# Exercise 3 - Congress is Old
Like it or not, age is a large part of American political discourse. [The current congress, the 115th, is considered one of the oldest](https://www.quorum.us/data-driven-insights/the-115th-congress-is-among-the-oldest-in-history/175/). Even though they aren't the oldest, some of the better know senators are pretty old. For example: [Patrick Leahy](https://en.wikipedia.org/wiki/Patrick_Leahy), [Orrin Hatch](https://en.wikipedia.org/wiki/Orrin_Hatch), [Chuck Grassley](https://en.wikipedia.org/wiki/Chuck_Grassley), [Dianne Feinstein](https://en.wikipedia.org/wiki/Dianne_Feinstein). (Meanwhile, in the house, [Alexandria Ocasio-Cortez](https://en.wikipedia.org/wiki/Alexandria_Ocasio-Cortez) is the youngest woman ever to serve in the United States Congress at the age of 29.)
Age will almost undoubtedly be discussed at length in the 2020 Democratic primary where the most popular candidate, [Senator Bernie Sanders of Vermont](https://en.wikipedia.org/wiki/Bernie_Sanders) is currently 77 years old, slightly older than [Donald Trump](https://en.wikipedia.org/wiki/Donald_Trump) who is currently 72 years old. Considered old when he ran in 2008, former Alaska Senator [Mike Gravel](https://en.wikipedia.org/wiki/Mike_Gravel) has formed an exploratory committee. Should he officially enter the race, he will likely be the oldest candidate at 88 years old.
(For fun, can you find these congress members in this dataset? If not, we'll learn how to do this next lab!)
The popular blog FiveThirtyEight published data related to this issue in the post titled [Both Republicans And Democrats Have an Age Problem](https://fivethirtyeight.com/features/both-republicans-and-democrats-have-an-age-problem/) and made the data available on [GitHub](https://github.com/fivethirtyeight/data/tree/master/congress-age). It contains demographic information on the 80th - 113th congresses.
```{r, message = FALSE, warning = FALSE}
congress_terms = read_csv("https://github.com/fivethirtyeight/data/raw/master/congress-age/congress-terms.csv")
congress_terms = congress_terms %>% filter(party == "D" | party == "R" | party == "I")
```
For some simplicity, we will limit our analysis to Democrats, Republicans, and "Independents."
```{r, echo = FALSE}
congress_terms
```
Before further analysis, we will investigate a plot of the age of the congress members throughout time. (Here we drop the Independents for an easier to read plot.)
```{r, message = FALSE, warning = FALSE, echo = FALSE}
congress_terms %>% filter(party == "D" | party == "R") %>%
ggplot(aes(x = jitter(congress), y = age, col = party)) +
geom_point(alpha = 0.5) +
geom_smooth() +
scale_color_manual(values = c("dodgerblue", "lightcoral"))
```
If you are at all familiar with American politics, you should be able to identify [Strom Thurmond](https://en.wikipedia.org/wiki/Strom_Thurmond) on this plot rather easily.
Now, ignoring the time issue (which is a simplistic thing to do, but we proceed anyway), is there a difference in age between Republicans, Democrats, and Independents? (We're also dealing with highly imbalanced data, but again, we proceed anyway.)
```{r, echo = FALSE}
congress_terms %>% group_by(party) %>% summarise(mean = mean(age),
var = var(age),
n = n())
```
```{r, echo = FALSE}
congress_p1 = ggplot(congress_terms, aes(x = party, y = age, col = party)) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2)) +
scale_color_manual(values = c("dodgerblue", "lightgreen", "lightcoral"))
congress_p2 = ggplot(congress_terms, aes(x = age, fill = party)) +
geom_density(alpha = .6) +
scale_color_manual(values = c("dodgerblue", "lightgreen", "lightcoral"))
ggmatrix(list(congress_p1, congress_p2), nrow = 1, ncol = 2)
```
Perform the test
$$
H_0: \mu_D = \mu_I = \mu_R
$$
against an alternative where at least one pair of means are different. Here,
- $\mu_D$ is the mean age for the Democrats.
- $\mu_I$ is the mean age for the Independents.
- $\mu_R$ is the mean age for Republicans.
Do the following:
- Create the ANOVA table for this test.
- Check the assumptions (normality and equal variance) of this test.
- State your conclusion using a significance level of 0.05. When doing so consider:
- Is this an experiment or observational data?
- Did you perform a valid test?
- Have you found a statistically significant results?
- If so, how are the means different?
```{r}
# fit model and output ANOVA table here
```
```{r}
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot ehre
```
- **REMOVE THIS TEXT AND STATE YOUR CONCLUSION HERE**