Goal: After completing this lab, you should be able to…

In this lab we will use, but not focus on…

Some additional notes:


Exercise 1 - 2019 Ohio State Basketball

library(tidyverse)

For this lab we will use some elements of the tidyverse as a preview of a lab to come that will focus on using the tidyverse. (If you do not have the tidyverse package installed, you will need to do so. Note that the tidyverse package is actually a collection of other packages.)
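If needed, the package can be installed from CRAN before loading it:

# install the tidyverse collection of packages (only needs to be done once)
install.packages("tidyverse")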

# load data
osu_bb_2019_games = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/osu-bb-2019-games.csv")
osu_bb_2019_games

For this exercise we will use data on the OSU Men’s Basketball games from the 2018-2019 season, excluding any games in the soon-to-be-played 2019 NCAA Tournament, where OSU is an 11 seed. While an 11 seed isn’t great, have a look at this video by Jon Bois, which explains some of the weirdness around certain seeds in the tournament.

In particular, we’ll investigate the personal fouls called against OSU compared to their opponents. Specifically, we will look at the difference between the number of personal fouls committed by OSU and by their opponent in each game. That is, we have “paired” data, so we will investigate the differences.

# create the difference data as a separate vector
osu_bb_2019_games %>% mutate(pf_diff = PF - OPPPF) %>% 
  select(pf_diff) %>% unlist() %>% unname() -> pf_diff
head(pf_diff)
## [1]   3  -3  -1 -12 -11   5

For example, in the fifth game of the season, OSU had 11 fewer personal fouls than their opponent, Samford.
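We can check this directly by indexing the vector of differences. (A negative value means OSU committed fewer personal fouls than their opponent.)

pf_diff[5]
## [1] -11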

Suppose we are interested in testing:

\[ H_0: \mu_D = 0 \quad \text{vs.} \quad H_A: \mu_D < 0 \]

where \(\mu_D\) is the true mean difference in personal fouls (OSU minus opponent).

There are a number of ways we could go about testing this, although each approach uses somewhat different or more specific null and alternative hypotheses.

We could consider a t-test:

t.test(pf_diff, alternative = "less")
## 
##  One Sample t-test
## 
## data:  pf_diff
## t = -0.70983, df = 32, p-value = 0.2415
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 1.008247
## sample estimates:
##  mean of x 
## -0.7272727

Or, we could consider a Wilcoxon signed rank test:

wilcox.test(pf_diff, alternative = "less")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  pf_diff
## V = 244.5, p-value = 0.3608
## alternative hypothesis: true location is less than 0

We could also consider a sign test:

binom.test(x = sum(pf_diff > 0), n = length(pf_diff), p = 0.5, alternative = "less")
## 
##  Exact binomial test
## 
## data:  sum(pf_diff > 0) and length(pf_diff)
## number of successes = 14, number of trials = 33, p-value = 0.2434
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.5814382
## sample estimates:
## probability of success 
##              0.4242424

But maybe none of these seem right to us.

qplot(pf_diff, binwidth = 3)

So what should we do?

Use a permutation test based on the statistic

\[ t = \frac{\bar{x}_D}{s_D / \sqrt{n}} \]

to test the above hypotheses, where \(\bar{x}_D\) is the sample mean of the differences and \(s_D\) is the sample standard deviation of the differences. Use 10000 permutations.

set.seed(42)
# generate t statistics for the personal foul data via permutation here
# calculate t statistic on observed data here
# plot the empirical distribution of the permuted statistics
# add a vertical line indicating the observed value
# calculate the p-value here
# that is, calculate the proportion of the permuted statistics that are
# less than the observed value
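If you get stuck, here is a minimal sketch of one possible approach, assuming we permute by randomly flipping the signs of the observed differences (the usual permutation scheme for paired differences under the null hypothesis). The names perm_t and obs_t are simply illustrative.

n = length(pf_diff)

# permute: randomly flip the sign of each difference, then recompute t
perm_t = replicate(10000, {
  flipped = sample(c(-1, 1), size = n, replace = TRUE) * pf_diff
  mean(flipped) / (sd(flipped) / sqrt(n))
})

# t statistic on the observed data
obs_t = mean(pf_diff) / (sd(pf_diff) / sqrt(n))

# empirical distribution of the permuted statistics, with the observed value marked
hist(perm_t, breaks = 30, xlab = "t", main = "Permutation Distribution")
abline(v = obs_t, col = "dodgerblue", lwd = 2)

# p-value: proportion of permuted statistics less than the observed value
mean(perm_t < obs_t)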

Exercise 2 - 2018 Ohio State Football

Does Ohio State football score more points when playing at home, or on the road?

For this exercise we will use data on the OSU Football games from the 2018 season, including postseason games.

# load data
osu_fb_2018_games = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/osu-fb-2018-games.csv")
osu_fb_2018_games

For the purposes of this highly simplistic analysis, we will consider games played on a neutral field, like the Rose Bowl, to be “away” games.

# modify data
osu_fb_2018_games$Home = ifelse(is.na(osu_fb_2018_games$Home), "home", "away")
osu_fb_2018_games

Let’s more specifically test:

\[ H_0: \mu_{\text{home}} = \mu_{\text{away}} \quad \text{vs.} \quad H_A: \mu_{\text{home}} \neq \mu_{\text{away}} \]

where \(\mu_{\text{home}}\) and \(\mu_{\text{away}}\) are the true mean points scored in home and away games, respectively.

Here we are assuming that we have two independent samples, one for home and one for away. (We’ll live with this assumption, but we should be highly suspicious of it. There is a ton of dependence in this data. We’re also ignoring opponent strength, and the fact that there were some coaching changes throughout the year….)

There are a number of ways we could go about testing this, although each approach uses somewhat different or more specific null and alternative hypotheses.

We could consider a t-test that does not assume equal variance in the two groups:

t.test(Pts ~ Home, data = osu_fb_2018_games)
## 
##  Welch Two Sample t-test
## 
## data:  Pts by Home
## t = -2.2599, df = 11.122, p-value = 0.04485
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -32.9709597  -0.4576117
## sample estimates:
## mean in group away mean in group home 
##           34.00000           50.71429

We could consider a t-test that does assume equal variance in the two groups:

t.test(Pts ~ Home, data = osu_fb_2018_games, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  Pts by Home
## t = -2.2599, df = 12, p-value = 0.04322
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -32.828765  -0.599806
## sample estimates:
## mean in group away mean in group home 
##           34.00000           50.71429

We could consider a Wilcoxon rank sum test, better known at OSU as the Mann-Whitney U test:

wilcox.test(Pts ~ Home, data = osu_fb_2018_games)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Pts by Home
## W = 8.5, p-value = 0.04716
## alternative hypothesis: true location shift is not equal to 0

Again, maybe none of these seem right to us.

osu_fb_2018_games %>% ggplot(aes(x = Pts)) + geom_histogram(binwidth = 20) + facet_wrap(~Home)

osu_fb_2018_games %>% ggplot(aes(x = Pts, col = Home)) + geom_line(stat = "density")

So what should we do?

Use a permutation test based on the statistic

\[ t = \frac{(\bar{x} - \bar{y}) - 0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

to test the above hypotheses. Use 10000 permutations.

If you would like to follow the code used in the practice problems, you will need to create subsets of the points variable for the home and away games. If you would like to use the data as-is, consider the following code:

sample(osu_fb_2018_games$Home)
##  [1] "away" "away" "home" "away" "home" "away" "home" "home" "home" "away"
## [11] "home" "away" "home" "away"
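Alternatively, a minimal sketch of creating the home and away subsets mentioned above (assuming the points column is named Pts, as in the t-tests):

# subset points by the home/away indicator created earlier
pts_home = osu_fb_2018_games$Pts[osu_fb_2018_games$Home == "home"]
pts_away = osu_fb_2018_games$Pts[osu_fb_2018_games$Home == "away"]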
set.seed(42)
# generate t statistics for the scoring data via permutation here
# calculate t statistic on observed data here
# plot the empirical distribution of the permuted statistics
# add a vertical line indicating the observed value (and any value "as extreme")
# calculate the p-value here
# recall that this is a "two-sided" test
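If you get stuck, here is a minimal sketch of one possible approach, assuming the home/away labels are permuted with sample() as above. The helper function calc_t2 and the object names are simply illustrative.

# two-sample t statistic using a pooled standard deviation
calc_t2 = function(pts, group) {
  x = pts[group == "home"]
  y = pts[group == "away"]
  n_1 = length(x)
  n_2 = length(y)
  s_p = sqrt(((n_1 - 1) * var(x) + (n_2 - 1) * var(y)) / (n_1 + n_2 - 2))
  (mean(x) - mean(y)) / (s_p * sqrt(1 / n_1 + 1 / n_2))
}

# permute the home/away labels, recomputing the statistic each time
perm_t = replicate(10000, calc_t2(osu_fb_2018_games$Pts, sample(osu_fb_2018_games$Home)))

# t statistic on the observed data
obs_t = calc_t2(osu_fb_2018_games$Pts, osu_fb_2018_games$Home)

# empirical distribution of the permuted statistics, with values "as extreme" marked
hist(perm_t, breaks = 30, xlab = "t", main = "Permutation Distribution")
abline(v = c(-abs(obs_t), abs(obs_t)), col = "dodgerblue", lwd = 2)

# two-sided p-value: proportion of permuted statistics at least as extreme as observed
mean(abs(perm_t) >= abs(obs_t))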