---
title: 'STAT 3202: Lab 06, Permutation Testing'
author: "Spring 2019, OSU"
date: 'Due: Friday, March 22'
output:
  html_document:
    theme: spacelab
    toc: yes
    df_print: paged
  pdf_document: default
urlcolor: BrickRed
---
***
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
**Goal:** After completing this lab, you should be able to...
- *Use* permutation tests
In this lab we will use, but not focus on...
- `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete.
Some additional notes:
- Please see [**Carmen**](https://carmen.osu.edu/) for information about submission and grading.
- You may use [this document](lab-06-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input contain a comment saying so.
- Some code from [this set of practice problems](https://daviddalpiaz.github.io/stat3202-sp19/homework/pp-07-assign.html) may be of use, in particular the code seen in the [solutions](https://daviddalpiaz.github.io/stat3202-sp19/homework/pp-07-soln.html).
***
# Exercise 1 - 2019 Ohio State Basketball
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```
For this lab we will use some elements of the [`tidyverse`](https://www.tidyverse.org/) as a preview for a lab to come which will focus on using the `tidyverse`. (If you do not have the `tidyverse` package installed, you will need to do so. Note that the `tidyverse` package is actually a collection of other packages.)
```{r, message = FALSE, warning = FALSE}
# load data
osu_bb_2019_games = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/osu-bb-2019-games.csv")
osu_bb_2019_games
```
For this exercise we will use data on the [OSU Men's Basketball games from the 2018 - 2019 season](https://www.sports-reference.com/cbb/schools/ohio-state/2019-gamelogs.html), excluding any games in the soon to be played [2019 NCAA Tournament](http://www.espn.com/mens-college-basketball/tournament/bracket/_/id/201922/2019-ncaa-tournament) where OSU is an 11 seed. While an 11 seed isn't great, have a look at [this video](https://www.youtube.com/watch?v=4a1TUszkMfI) by Jon Bois which explains some of the weirdness around certain seeds in the tournament.
In particular, we'll investigate the [personal fouls](https://en.wikipedia.org/wiki/Personal_foul_%28basketball%29) called on OSU compared to their opponents. Specifically, we will look at the difference between the number of personal fouls called on OSU and the number called on their opponent *in each game*. That is, we have "paired" data, so we will analyze the per-game differences.
```{r}
# create difference data (OSU fouls minus opponent fouls) as a separate vector
osu_bb_2019_games %>% mutate(pf_diff = PF - OPPPF) %>%
select(pf_diff) %>% unlist() %>% unname() -> pf_diff
head(pf_diff)
```
For example, in the fifth game of the season, OSU had 11 fewer personal fouls than their opponent, Samford.
Suppose we are interested in testing:
- $H_0$: There is no difference between the distribution of fouls called on OSU and the distribution of fouls called on their opponents.
- $H_A$: OSU is given fewer fouls than their opponents. Specifically, the distribution of fouls for OSU is shifted lower ("to the left") compared to their opponents, which makes this a one-sided "less-than" alternative. (Which might lead us to believe the referees are favoring OSU. *But this analysis is far too simple to draw that conclusion.*)
There are a number of ways we could go about testing this. (Although with different or more specific null and alternative hypotheses.)
We could consider a t-test:
```{r}
t.test(pf_diff, alternative = "less")
```
Or, we could consider a Wilcoxon signed rank test:
```{r, message = FALSE, warning = FALSE}
wilcox.test(pf_diff, alternative = "less")
```
We could also consider a sign test:
```{r}
# count positive differences; games with a zero difference (ties) are dropped from n
binom.test(x = sum(pf_diff > 0), n = sum(pf_diff != 0), p = 0.5, alternative = "less")
```
But maybe none of these seem right to us.
- Perhaps we don't believe the normal assumption required to perform the t-test. (Although, we could probably use a large sample $z$ procedure here, but again, that's an assumption we'd have to make.)
- Perhaps we don't understand the sort of weird assumptions of the Wilcoxon test.
- Perhaps we understand that the sign test generally has low power.
```{r}
qplot(pf_diff, binwidth = 3)
```
So what should we do?
Use a **permutation test** that permutes the data and, for each permutation, recomputes the *statistic*
$$
t = \frac{\bar{x}_D}{s_D / \sqrt{n}}
$$
to test the above hypotheses, where $\bar{x}_D$ is the sample mean of the differences, $s_D$ is the sample standard deviation of the differences, and $n$ is the number of games. Use 10000 permutations.
- Create a histogram that illustrates the distribution of the statistic used.
- Report the p-value of the test.
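As a rough illustration of the mechanics, here is a minimal sketch on made-up differences, not the lab data. It assumes that, under $H_0$, each paired difference is equally likely to be positive or negative, so "permuting" paired data amounts to randomly flipping signs. The names `toy_diff`, `paired_t()`, `toy_perm_t`, and `toy_obs_t` are hypothetical and used only for this example.

```{r}
# hypothetical paired differences (NOT the lab data)
toy_diff = c(-4, 2, -7, -1, 3, -5, 0, -2, 1, -6)

# one-sample t statistic for a vector of differences
paired_t = function(d) {
  mean(d) / (sd(d) / sqrt(length(d)))
}

set.seed(1)
# assumption: under H0 the sign of each difference is arbitrary,
# so flip signs at random and recompute the statistic each time
toy_perm_t = replicate(10000, {
  signs = sample(c(-1, 1), size = length(toy_diff), replace = TRUE)
  paired_t(signs * toy_diff)
})

# statistic on the (toy) observed data
toy_obs_t = paired_t(toy_diff)

# one-sided ("less") p-value: proportion of permuted statistics
# at or below the observed statistic
mean(toy_perm_t <= toy_obs_t)
```

Whether to count ties with the observed value (`<=` versus `<`) is a design choice; with 10000 permutations the two usually give very similar answers.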
```{r}
set.seed(42)
# generate t statistics for the personal foul data via permutation here
```
```{r}
# calculate t statistic on observed data here
```
```{r}
# plot empirical distribution of permuted statistic
# add a vertical line indicating the observed value
```
```{r}
# calculate the p-value here
# that is, calculate the proportion of the permuted statistics that are
# less than the observed value
```
***
# Exercise 2 - 2018 Ohio State Football
Does Ohio State football score more points when playing at [home](https://en.wikipedia.org/wiki/Ohio_Stadium), or on the road?
For this exercise we will use data on the [OSU Football games from the 2018 season](https://www.sports-reference.com/cfb/schools/ohio-state/2018-schedule.html), including postseason games.
```{r, message = FALSE, warning = FALSE}
# load data
osu_fb_2018_games = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/osu-fb-2018-games.csv")
osu_fb_2018_games
```
For the purposes of this highly simplistic analysis, we will treat games played on a neutral field, like the [Rose Bowl](https://en.wikipedia.org/wiki/2019_Rose_Bowl), as "away" games.
```{r}
# recode Home: a missing value becomes "home", anything else (including neutral sites) becomes "away"
osu_fb_2018_games$Home = ifelse(is.na(osu_fb_2018_games$Home), "home", "away")
osu_fb_2018_games
```
Let's more specifically test:
- $H_0$: There is no difference between the distribution of points scored by OSU at "home" and when playing an "away" game.
- $H_A$: There **is** a difference between the distribution of points scored by OSU at "home" and when playing an "away" game. (In particular, a shift up or down.) This is a "two-sided" test.
Here we are assuming that we have two independent samples, one for home and one for away. (We'll live with this assumption, but we should be highly suspicious of it. There is a ton of dependence in this data. We're also ignoring opponent strength, and the fact that there were some coaching changes throughout the year....)
There are a number of ways we could go about testing this. (Although with different or more specific null and alternative hypotheses.)
We could consider a t-test that does **not** assume equal variance in the two groups:
```{r}
t.test(Pts ~ Home, data = osu_fb_2018_games)
```
We could consider a t-test that **does** assume equal variance in the two groups:
```{r}
t.test(Pts ~ Home, data = osu_fb_2018_games, var.equal = TRUE)
```
We could consider a Wilcoxon rank sum test, better known [at OSU](https://en.wikipedia.org/wiki/Henry_Mann) as the [Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test):
```{r, message = FALSE, warning = FALSE}
wilcox.test(Pts ~ Home, data = osu_fb_2018_games)
```
Again, maybe none of these seem right to us.
- Perhaps we don't believe the normal assumption required to perform either of the t-tests. (And here the sample size probably isn't large enough to consider a large sample procedure.)
- Perhaps we don't understand the sort of weird assumptions of the Wilcoxon test.
```{r}
osu_fb_2018_games %>% ggplot(aes(x = Pts)) + geom_histogram(binwidth = 20) + facet_wrap(~Home)
```
```{r}
osu_fb_2018_games %>% ggplot(aes(x = Pts, col = Home)) + geom_line(stat = "density")
```
So what should we do?
Use a **permutation test** that permutes the data and, for each permutation, recomputes the *statistic*
$$
t = \frac{(\bar{x} - \bar{y}) - 0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
$$
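to test the above hypotheses, where $\bar{x}$ and $\bar{y}$ are the sample means of points scored in the two groups, $n_1$ and $n_2$ are the group sizes, and $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is the pooled standard deviation computed from the group sample standard deviations $s_1$ and $s_2$. This is the statistic used by the equal variance t-test. Use 10000 permutations.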
- Create a histogram that illustrates the distribution of the statistic used.
- Report the p-value of the test.
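To make the formula concrete, here is a small sketch that computes the pooled statistic by hand on made-up scores, not the lab data, and compares it to what `t.test()` reports with `var.equal = TRUE`. The names `toy_x`, `toy_y`, and `pooled_t()` are hypothetical and used only for illustration.

```{r}
# hypothetical scores for two groups (NOT the lab data)
toy_x = c(31, 45, 28, 52, 38, 41, 49)
toy_y = c(24, 35, 20, 27, 42, 30)

# pooled two-sample t statistic, written to mirror the formula
pooled_t = function(x, y) {
  n1 = length(x)
  n2 = length(y)
  # pooled standard deviation
  s_p = sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / (s_p * sqrt(1 / n1 + 1 / n2))
}

pooled_t(toy_x, toy_y)

# should agree with the statistic from the equal variance t-test
unname(t.test(toy_x, toy_y, var.equal = TRUE)$statistic)
```

Note that the sign of the statistic depends on which group is treated as $x$; for a two-sided test this does not affect the conclusion.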
If you would like to follow the code used in the practice problems, you will need to create subsets of the points variable for the home and away games. If you would like to use the data as-is, consider the following code, which randomly shuffles the home and away labels while leaving the points untouched:
```{r}
sample(osu_fb_2018_games$Home)
```
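One possible way to build on label shuffling is sketched below, using a made-up data frame with the same two columns rather than the lab data. It assumes the labels are exchangeable under $H_0$, and it takes the pooled statistic from `t.test()` instead of computing it by hand. The names `toy_games`, `label_t()`, `toy_perm_t2`, and `toy_obs_t2` are hypothetical.

```{r}
# hypothetical data in the same shape as the lab data (NOT the lab data)
toy_games = data.frame(
  Pts  = c(31, 45, 28, 52, 38, 41, 49, 24, 35, 20, 27, 42, 30),
  Home = rep(c("home", "away"), times = c(7, 6))
)

# pooled t statistic for a given labeling of the points
label_t = function(pts, labels) {
  unname(t.test(pts ~ labels, var.equal = TRUE)$statistic)
}

set.seed(1)
# assumption: under H0 the labels are exchangeable,
# so shuffle them and recompute the statistic each time
toy_perm_t2 = replicate(10000, label_t(toy_games$Pts, sample(toy_games$Home)))

# statistic on the (toy) observed labeling
toy_obs_t2 = label_t(toy_games$Pts, toy_games$Home)

# two-sided p-value: proportion of permuted statistics
# at least as extreme (in absolute value) as the observed one
mean(abs(toy_perm_t2) >= abs(toy_obs_t2))
```

Comparing absolute values is one common way to define "as extreme" for a two-sided test when the permutation distribution is roughly symmetric around zero.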
```{r}
set.seed(42)
# generate t statistics for the scoring data via permutation here
```
```{r}
# calculate t statistic on observed data here
```
```{r}
# plot empirical distribution of permuted statistic
# add a vertical line indicating the observed value (and any value "as extreme")
```
```{r}
# calculate the p-value here
# recall that this is a "two-sided" test
```