Goal: After completing this lab, you should be able to…

In this lab we will use, but not focus on…

Some additional notes:


Exercise 0A - Pain Reduction

library(tidyverse)
library(GGally)

For this lab we will again use some elements of the tidyverse as a preview of the next lab, which will focus on using the tidyverse. (This will hopefully also be helpful for your second group projects.) The GGally package is needed to put plots created using ggplot2 side-by-side. (You may need to install this package.)

Note: To make this document easier to read, some code has been suppressed. Please see the RMarkdown document for access to all code used to create this document.

Recall the pain reduction data from class.

pain = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/pain.csv")

In class, we tested

\[ H_0: \mu_B = \mu_C = \mu_M \]

against an alternative where at least one pair of means are different. Here,

pain_anova_model = aov(change ~ treatment, data = pain)
summary(pain_anova_model)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## treatment    2   7.30   3.652   5.073 0.00979 **
## Residuals   51  36.72   0.720                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above code gives us the ANOVA table, and importantly the p-value for the above test.
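To see where the F value and p-value in this table come from, they can be recomputed from the other columns. The following is a minimal sketch that uses only the rounded numbers printed above (the variable names are chosen here for illustration), so the results should agree with the table only up to rounding.

```r
# Values copied from the printed ANOVA table above (rounded)
ms_treatment = 3.652  # Mean Sq for treatment
ms_residual  = 0.720  # Mean Sq for residuals
df_treatment = 2      # Df for treatment
df_residual  = 51     # Df for residuals

# The F statistic is the ratio of the two mean squares
f_stat = ms_treatment / ms_residual

# The p-value is the upper tail area of the F distribution
p_value = pf(f_stat, df1 = df_treatment, df2 = df_residual, lower.tail = FALSE)

c(f_stat = f_stat, p_value = p_value)
```

A small p-value here corresponds to a large F statistic, that is, to more variation between the treatment means than the within-group variation would suggest under the null hypothesis.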


Exercise 0B - Dead Poets

“O Captain! My Captain!”

Recall the writers data from class. Do writers of nonfiction, novels, and poems die at different ages?

deadpoets = read_table2("https://daviddalpiaz.github.io/stat3202-sp19/data/deadpoets.txt", 
    col_types = cols(Type1 = col_skip()))

In class, we tested

\[ H_0: \mu_N = \mu_F = \mu_P \]

against an alternative where at least one pair of means are different. Here,

deadpoets_anova_model = aov(Age ~ Type, data = deadpoets)
summary(deadpoets_anova_model)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## Type          2   2744  1372.1   6.563 0.00197 **
## Residuals   120  25088   209.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
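If the F value or p-value is needed as a number, rather than read off the printed table, it can be extracted from the summary object. The sketch below uses simulated data with arbitrary group means, not the deadpoets data, so that it runs on its own. The names toy, toy_model, and toy_table are hypothetical.

```r
# Simulated stand-in data (NOT the deadpoets data); the means are arbitrary
set.seed(42)
toy = data.frame(
  Type = rep(c("A", "B", "C"), each = 20),
  Age  = c(rnorm(20, mean = 70, sd = 14),
           rnorm(20, mean = 65, sd = 14),
           rnorm(20, mean = 60, sd = 14))
)

toy_model = aov(Age ~ Type, data = toy)

# summary() of an aov fit is a list; its first element is the ANOVA table
toy_table = summary(toy_model)[[1]]

# The first row is the grouping variable, the second row is the residuals
f_value = toy_table[["F value"]][1]
p_value = toy_table[["Pr(>F)"]][1]

c(f_value = f_value, p_value = p_value)
```

The same indexing should work on deadpoets_anova_model, since it is also a one-way aov fit.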


Exercise 1 - 2019 NCAA Tournament, First Round

Does the “region” that an NCAA tournament game is played in affect the point differential? (More sports data? Note from Dave: This is my fault! But maybe you’ll enjoy the next two exercises more…)

To investigate this question, we will use data in the GitHub repository “2019 NCAA Men’s March Madness” which also contains a description of the data. (As of the time of this lab, only data for the first round is used.) Note that the regions are where the games are played in the tournament and are not dependent on where the schools are located.

ncaa = read_csv("https://raw.githubusercontent.com/daviddalpiaz/data/master/ncaa-2019-mens-march-madness/ncaa-2019-mens-march-madness.csv")
ncaa = ncaa %>% mutate(score_diff = highseedscore - lowseedscore)

We made one change to the data: we created a new variable called score_diff, which stores the points scored by the higher seeded team (say, a 1 seed) minus the points scored by the lower seeded team (say, a 16 seed).

A score_diff less than 0 indicates an upset, of which there were 12 in the first round. (But I imagine one matters a lot more than the others…)

sum(ncaa$score_diff < 0)
## [1] 12

Perform the test

\[ H_0: \mu_E = \mu_M = \mu_S = \mu_W \]

against an alternative where at least one pair of means are different. Here,

Do the following:

# fit model and output ANOVA table here
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot here
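As a rough sketch of what filled-in placeholders might look like, the following uses R's built-in PlantGrowth data rather than the ncaa data. The object name pg_model is hypothetical, and this is only one of several reasonable ways to draw these two diagnostic plots.

```r
# Built-in example data: plant weight under three conditions (NOT the ncaa data)
pg_model = aov(weight ~ group, data = PlantGrowth)
summary(pg_model)

par(mfrow = c(1, 2))

# Normal Q-Q plot of the residuals, with a reference line
qqnorm(residuals(pg_model))
qqline(residuals(pg_model))

# Residuals versus fitted values, with a horizontal line at zero
plot(fitted(pg_model), residuals(pg_model),
     xlab = "Fitted", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```

In a one-way ANOVA the fitted values are the group means, so the second plot shows one vertical strip of residuals per group.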

Exercise 2 - Caffeine, Naps, and Learning

Help! I need to memorize something quickly. Should I:

That, among other things, was investigated in a 2009 study titled Comparing the benefits of Caffeine, Naps and Placebo on Verbal, Motor and Perceptual Memory.

Because the data for this study is not publicly available, we will use data that has been simulated to replicate a particular finding, which can be found here. A description of the data can be found there as well.

caf_nap_recall = read_csv("https://github.com/daviddalpiaz/data/raw/master/caffeine-naps-placebo/caf-nap-recall.csv")

In brief, an experiment was run to determine how the above treatments affected memory, in particular the ability to remember a list of words.

Perform the test

\[ H_0: \mu_C = \mu_N = \mu_P \]

against an alternative where at least one pair of means are different. Here,

Do the following:

# fit model and output ANOVA table here
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot here

Exercise 3 - Congress is Old

Like it or not, age is a large part of American political discourse. The current Congress, the 115th, is considered one of the oldest. Even though they aren’t the oldest, some of the better-known senators are pretty old. For example: Patrick Leahy, Orrin Hatch, Chuck Grassley, Dianne Feinstein. (Meanwhile, in the House, Alexandria Ocasio-Cortez is the youngest woman ever to serve in the United States Congress at the age of 29.)

Age will almost undoubtedly be discussed at length in the 2020 Democratic primary, where the most popular candidate, Senator Bernie Sanders of Vermont, is currently 77 years old, slightly older than Donald Trump, who is currently 72 years old. Considered old when he ran in 2008, former Alaska Senator Mike Gravel has formed an exploratory committee. Should he officially enter the race, he will likely be the oldest candidate at 88 years old.

(For fun, can you find these congress members in this dataset? If not, we’ll learn how to do this next lab!)

The popular blog FiveThirtyEight published data related to this issue in the post titled Both Republicans And Democrats Have an Age Problem and made the data available on GitHub. It contains demographic information on the 80th through 113th Congresses.

congress_terms = read_csv("https://github.com/fivethirtyeight/data/raw/master/congress-age/congress-terms.csv")
congress_terms = congress_terms %>% filter(party == "D" | party == "R" | party == "I")

For some simplicity, we will limit our analysis to Democrats, Republicans, and “Independents.”

Before further analysis, we will investigate a plot of the age of the congress members throughout time. (Here we drop the Independents for an easier-to-read plot.)

If you are at all familiar with American politics, you should be able to identify Strom Thurmond on this plot rather easily.

Now, ignoring the time issue (which is a simplistic thing to do, but we proceed anyway), is there a difference in age between Republicans, Democrats, and Independents? (We’re also dealing with highly imbalanced data, but again, we proceed anyway.)

Perform the test

\[ H_0: \mu_D = \mu_I = \mu_R \]

against an alternative where at least one pair of means are different. Here,

Do the following:

# fit model and output ANOVA table here
par(mfrow = c(1, 2))
# put qqplot here
# put residuals vs fitted plot here