Goal: After completing this lab, you will…

In this lab we will use, but not focus on…

Some additional notes:


Exercise 0 - OSU Basketball

In the previous few labs, we’ve “used” the tidyverse but haven’t made much effort to explain it. We’ll try to demystify at least some of it now. Some of the material in this “Exercise 0” originated in R for DataFest.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ readr   1.3.1  
## ✔ tibble  2.1.1       ✔ purrr   0.3.2  
## ✔ tidyr   0.8.3       ✔ dplyr   0.8.0.1
## ✔ stringr 1.4.0       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Hopefully at this point loading the tidyverse packages is no longer an issue since we took care of it in previous labs. We’ll return to a familiar dataset about Ohio State basketball.

osu_bb = read_csv("https://github.com/daviddalpiaz/r4df-osu-2019/raw/master/data/osu-bb-2019-games.csv")

Note that we are using the read_csv() function to read in this data. It comes from the readr package, which is part of the tidyverse, and it is the function used when you click the “Import Dataset” button in RStudio and select the readr option.

While it won’t make a difference for the analysis we are about to perform, we could be more careful when loading the data, especially around data types.

osu_bb = read_csv("https://github.com/daviddalpiaz/r4df-osu-2019/raw/master/data/osu-bb-2019-games.csv", 
    col_types = cols(Date = col_date(format = "%m-%d-%y"),
                     `3PPERC` = col_number(), 
                     FGPERC = col_number(), 
                     FTPERC = col_number(), 
                     OPP3PPERC = col_number(), 
                     OPPFGPERC = col_number(), 
                     OPPFTPERC = col_number()))
osu_bb

One thing you might note here is that in the Home variable, home games are denoted by NA values. This isn’t great. To fix this, we will use the fct_explicit_na() function from the forcats package, which is part of the tidyverse.

osu_bb$Home = fct_explicit_na(osu_bb$Home, "H")

See if you can spot what changed.

osu_bb

The remainder of this exercise will focus on functions from the dplyr package. These are often referred to as “verbs,” as they describe actions that we will perform on a tibble. (Which is a special type of data frame.)
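To see what makes a tibble special, here is a quick comparison on a small made-up dataset (purely illustrative, not part of the lab data). A tibble prints its dimensions and the type of each column, while a plain data frame does not:

```r
library(tibble)

# a small made-up tibble
scores = tibble(team = c("OSU", "MSU", "UM"), pts = c(80, 72, 75))

# printing a tibble shows "A tibble: 3 x 2" and column types like <chr>, <dbl>
scores

# the same data as a plain data frame prints with no such information
as.data.frame(scores)
```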

We will mostly explain what some of these do by example. Hopefully for the most part it is clear what is happening.

away_games = filter(osu_bb, Home == "@")
away_games

Above we create a new dataset called away_games which stores only the away games. Below, we see code that does the same thing.

away_games = osu_bb %>% 
  filter(Home == "@")

This code uses the “pipe” operator, %>%. Why do we want to use such a thing? Consider doing multiple filters. Here we are obtaining a dataset of away games where the opponent scores less than 60 points. (And simply returning the result, and not storing it.)

osu_bb %>% 
  filter(Home == "@") %>% 
  filter(OPPPTS < 60)

Above we use the pipe operator. Below, we call filter twice. Eww. That’s ugly. Too many parentheses.

filter(filter(osu_bb, Home == "@"), OPPPTS < 60)

Note that, within a call to filter() it is actually possible to impose multiple conditions.

osu_bb %>% 
  filter(Home == "@", OPPPTS < 60)

As another example, below we create a dataset called neutral_court_wins which contains games OSU played on a neutral court and won. Dave isn’t happy about that first one. (In summary, filter() selects rows of the data.)

neutral_court_wins = osu_bb %>% 
  filter(Home == "N", Outcome == "W")
neutral_court_wins

Now let’s start to chain together multiple verbs.

osu_bb %>% filter(Home == "@") %>% 
  summarize(avg_fg = mean(FG))

Here we have found the average number of field goals made during away games. Note that what is returned is a tibble. (A main theme of the tidyverse.) If instead you want a vector here, use unlist().

osu_bb %>% filter(Home == "@") %>% 
  summarize(avg_fg = mean(FG)) %>% unlist()
## avg_fg 
##     21

We could also add other statistic calculations inside of summarize().

osu_bb %>% filter(Home == "@") %>% 
  summarize(avg_fg = mean(FG), sd_fg = sd(FG))

The count() verb is rather self-explanatory.

osu_bb %>% filter(Home == "@") %>% 
  count(Outcome)

See if you can figure out what arrange() does:

osu_bb %>% filter(Outcome == "W") %>% 
  count(Home) %>% arrange(-n)

The filter() function subsets by row. In contrast, the select() function selects columns. (This won’t be super useful for our purposes. It is mostly useful for limiting output when printing.)

osu_bb %>% 
  filter(Home == "@") %>%
  arrange(`3P`, desc(`3PPERC`)) %>% 
  select(Date, Home, Opponent, Outcome, `3P`, `3PPERC`)

Combining summarize() and group_by() can be very powerful. By adding the group_by() function here, we are performing the average free throw calculation on three subsets of the data: one for away games, one for neutral-site games, and one for home games.

osu_bb %>% summarize(avg_ft = mean(FT))
osu_bb %>% group_by(Home) %>% summarize(avg_ft = mean(FT))

Here’s another example:

osu_bb %>% 
  group_by(Home) %>% 
  summarize(num_games = n(),
            n_ft = sum(FT), 
            avg_ft = mean(FT), 
            sd_ft = sd(FT),
            max_ft = max(FT)) %>% 
  arrange(-n_ft)

While select() can only “select” existing columns, the mutate() function is more important, as it is used to create new variables based on existing variables. (Here we use a select statement to make it easier to see what we did.)

osu_bb %>% 
  mutate(shots = FG + FT, shots_2_idk = 2 * shots) %>%
  select(Date, Home, Opponent, Outcome, PTS, OPPPTS, shots, shots_2_idk)

If we wanted to instead store this result, without the column select, we would do something like this:

a_dataset_name = osu_bb %>% 
  mutate(shots = FG + FT, shots_2_idk = 2 * shots)
a_dataset_name

Congratulations, you are now probably 5% less confused about the tidyverse.


Exercise 1 - Flight Delay Data

The data that you will be working with for Project II consists of flight arrival and departure information for all commercial US flights during the years 1987 - 2008. It was provided as part of an American Statistical Association Statistical Computing / Statistical Graphics poster competition in 2009. Details are available here:

Data on individual years may be downloaded here:

Additional information about the data such as the airport codes and carrier codes are available as supplementary data here:

For this lab (and potentially the project) we will work specifically with the data for 2008. Do the following:

# put your code to read the data here.
# since this is "big data" it might take a couple minutes to run.
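A minimal sketch of the reading step, assuming you have downloaded the 2008 data as a file named 2008.csv in your working directory (that file name and location are assumptions; adjust the path to wherever you saved the download). To keep the demo below self-contained, it reads a tiny literal string instead, which read_csv() also accepts:

```r
library(readr)

# reading the real file would look like this (path is an assumption):
# flights = read_csv("2008.csv")

# read_csv() also accepts literal text; this just demonstrates that it
# returns a tibble with parsed column types
demo = read_csv("Origin,Dest,ArrDelay\nCMI,ORD,12\nCMH,ATL,-3")
demo
```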

The next chunk verifies you read the data.

flights

The next chunk modifies the data to make month and day of the week categorical variables. (Which will help later.)

# REMOVE eval = FALSE FROM THIS CHUNK (after writing code to load the data)
flights$Month = month.name[flights$Month]
day_names = c("Monday", "Tuesday", "Wednesday", "Thursday", 
              "Friday", "Saturday", "Sunday")
flights$DayOfWeek = factor(day_names[flights$DayOfWeek], levels = day_names)

The next chunk helps verify the previous step.

flights

And lastly, the next chunk tells us the “size” of the data in rows and columns. (It’s kinda “big.”)

dim(flights)

Exercise 2 - Running Through O’Hare Terminal 3

Dave has spent some time flying from Champaign (CMI) to O’Hare (ORD) in Chicago. Trouble is, that flight is often late to arrive. This then creates a situation much like the John Hughes and Chris Columbus films Home Alone and Home Alone 2. That is, running through Terminal 3, the American Airlines terminal.

Create a subset of the flights data that contains only rows for flights departing from Champaign and arriving in Chicago O’Hare called champaign_to_chicago. Remove eval = FALSE from the follow-up chunk and run it to create a plot of this data.
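The subsetting step is the filter() pattern from Exercise 0. Here is a hedged sketch on a tiny made-up tibble; the column names Origin and Dest match this dataset’s conventions, but verify them against your loaded flights data:

```r
library(dplyr)
library(tibble)

# a tiny made-up tibble with the same column names as the flight data
toy = tibble(Origin   = c("CMI", "CMH", "CMI"),
             Dest     = c("ORD", "ORD", "ATL"),
             ArrDelay = c(15, -2, 30))

# keep only rows that depart CMI and arrive at ORD
cmi_to_ord = toy %>% 
  filter(Origin == "CMI", Dest == "ORD")
cmi_to_ord
```

The same two-condition filter, applied to flights instead of toy, produces champaign_to_chicago.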

# your code here
ggplot(champaign_to_chicago, aes(x = DayOfWeek, y = ArrDelay)) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(0.2))

Exercise 3 - Get Me Off Of This Plane!

Time in the air is something that is more or less a fixed cost. But what about time taxiing? That’s just extra misery. Is there a difference in how long you spend taxiing on different airlines?

Create a subset of the flights data that contains only flights from American Airlines, Delta, and United called taxi_data. Add a new variable called extra_time which is the time spent taxiing in and out in hours. Remove eval = FALSE and run the follow-up chunk.
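A sketch of the pattern on a tiny made-up tibble: %in% handles “one of several carriers,” and mutate() builds the new variable. The carrier codes (AA, DL, UA) and the TaxiIn/TaxiOut column names are assumptions based on the dataset’s usual conventions; check them against your data:

```r
library(dplyr)
library(tibble)

# made-up rows; taxi times are in minutes, as in the flight data
toy = tibble(UniqueCarrier = c("AA", "WN", "DL"),
             TaxiIn  = c(6, 4, 10),
             TaxiOut = c(18, 12, 24))

# keep the three carriers of interest, then convert total taxi time to hours
taxi_demo = toy %>% 
  filter(UniqueCarrier %in% c("AA", "DL", "UA")) %>% 
  mutate(extra_time = (TaxiIn + TaxiOut) / 60)
taxi_demo
```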

# your code here
ggplot(taxi_data, aes(x = extra_time)) +
  geom_histogram() +
  facet_grid(rows = vars(UniqueCarrier))

Exercise 4 - Leaving OSU For Thanksgiving

Create a subset of the flights data that contains only flights from American Airlines, Delta, and United that are departing from Columbus (CMH) in November called columbus_november. Remove eval = FALSE and run the follow-up chunk.
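This is the same multi-condition filter() pattern again, sketched on a toy tibble. Note that after the earlier month-name conversion, Month holds names like "November" rather than numbers; the carrier codes and column names are assumptions to verify against your data:

```r
library(dplyr)
library(tibble)

# made-up rows mimicking the flight data after the month-name conversion
toy = tibble(UniqueCarrier = c("UA", "AA", "US"),
             Origin = c("CMH", "CMH", "CMH"),
             Month  = c("November", "November", "December"))

# three carriers, departing Columbus, in November
cmh_nov = toy %>% 
  filter(UniqueCarrier %in% c("AA", "DL", "UA"),
         Origin == "CMH",
         Month == "November")
cmh_nov
```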

# your code here
ggplot(columbus_november, aes(x = factor(DayOfWeek), y = DepDelay)) + 
  geom_boxplot() +
  geom_jitter(position = position_jitter(0.2))

Exercise 5 - Home For The Holidays

Will you make it home in time for the holidays? Do the following:

Summarize the flights data for December departures from Columbus (CMH) on American Airlines, Delta, and United. Report the number of flights and the proportion that arrive more than 60 minutes late. (This is challenging!) You will need to use filter(), group_by(), and summarize(). Also note that the mean() function has an argument called na.rm. (See the documentation for details.) Your result should be a tibble where each row corresponds to an airline.
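The trickiest part is the proportion: taking the mean() of a logical condition gives the proportion of TRUE values, and na.rm = TRUE drops missing delays. Here is that idea on a toy tibble (the column names are assumptions to check against your data; the filter step is omitted since the toy data is already “filtered”):

```r
library(dplyr)
library(tibble)

# made-up rows; one missing arrival delay to show why na.rm matters
toy = tibble(UniqueCarrier = c("AA", "AA", "DL", "DL"),
             ArrDelay = c(75, 10, NA, 90))

# one row per carrier: flight count, and proportion arriving > 60 min late
late_summary = toy %>% 
  group_by(UniqueCarrier) %>% 
  summarize(n_flights = n(),
            prop_late = mean(ArrDelay > 60, na.rm = TRUE))
late_summary
```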

# your code here