Goal: After completing this lab, you will…

In this lab we will use, but not focus on…

Some additional notes:

Exercise 0 - OSU Basketball

In the previous few labs, we’ve “used” the tidyverse but haven’t made much effort to explain it. We’ll try to de-mystify at least some of it now. Some of the material in this “Exercise 0” originated in R for DataFest.

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ readr   1.3.1  
## ✔ tibble  2.1.1       ✔ purrr   0.3.2  
## ✔ tidyr   0.8.3       ✔ dplyr
## ✔ ggplot2 3.1.0       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Hopefully at this point loading the tidyverse packages is no longer an issue since we took care of it in previous labs. We’ll return to a familiar dataset about Ohio State basketball.

osu_bb = read_csv("https://github.com/daviddalpiaz/r4df-osu-2019/raw/master/data/osu-bb-2019-games.csv")

Note that we are using the read_csv() function to read in this data. This is a function from the readr package, which is part of the tidyverse. It is the same function RStudio uses when you click the “Import Dataset” button and select the readr option.

While it won’t make a difference for the analysis we are about to perform, we could be more careful when loading the data, especially around data types.

osu_bb = read_csv("https://github.com/daviddalpiaz/r4df-osu-2019/raw/master/data/osu-bb-2019-games.csv", 
    col_types = cols(Date = col_date(format = "%m-%d-%y"),
                     `3PPERC` = col_number(), 
                     FGPERC = col_number(), 
                     FTPERC = col_number(), 
                     OPP3PPERC = col_number(), 
                     OPPFGPERC = col_number(), 
                     OPPFTPERC = col_number()))
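To see what the col_types argument does without downloading anything, here is a minimal sketch on a tiny CSV string. The two rows are made up for illustration and are not from the OSU data; only the column names Date and FGPERC and the date format are borrowed from the call above.

```r
library(readr)

# Two made-up rows; not real game data.
# I() tells read_csv() to treat the string as literal data, not a file path.
toy_csv = I("Date,FGPERC\n11-07-18,45.5\n11-11-18,51.2\n")

toy = read_csv(toy_csv,
    col_types = cols(Date = col_date(format = "%m-%d-%y"),
                     FGPERC = col_number()))

class(toy$Date)    # the Date column is parsed as a date, not text
class(toy$FGPERC)  # the percentage column is numeric
```

Without col_types, read_csv() has to guess each column's type, and a date written as month-day-year with a two-digit year is unlikely to be guessed as a date.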

One thing you might note here is that in the Home variable, home games are denoted by NA values. This isn’t great. To fix this, we will use the fct_explicit_na() function from the forcats package, which is part of the tidyverse.

osu_bb$Home = fct_explicit_na(osu_bb$Home, "H")
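As a small sketch of what that line does, here is the same function applied to a short made-up vector. The values are hypothetical and only mimic the Home column. Note that forcats 1.0.0 and later deprecate fct_explicit_na() in favor of fct_na_value_to_level(), so on a newer installation you may see a warning when running this.

```r
library(forcats)

# Made-up location codes that mimic the Home column: NA stands for a home game.
home = c("@", NA, "N", NA)

# Replace the missing values with an explicit "H" level.
home_fixed = fct_explicit_na(factor(home), "H")

home_fixed          # the NA entries now show up as H
levels(home_fixed)  # H has been added as a level
```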

See if you can spot what changed.


The remainder of this exercise will focus on functions from the dplyr package. These are often referred to as “verbs” as they describe an action that we will perform on a tibble, which is a special type of data frame.
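If the idea of a tibble is new, a quick check shows that it really is still a data frame. This sketch uses mtcars, a dataset that ships with R, rather than the basketball data.

```r
library(tibble)

# mtcars is a plain data frame; as_tibble() converts it to a tibble.
cars_tbl = as_tibble(mtcars)

class(mtcars)    # "data.frame"
class(cars_tbl)  # "tbl_df" "tbl" "data.frame"

is.data.frame(cars_tbl)  # TRUE, so data frame functions still work on it
```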

We will mostly explain what some of these do by example. Hopefully for the most part it is clear what is happening.

away_games = filter(osu_bb, Home == "@")

Above we create a new dataset called away_games which stores only the away games. Below, we see code that does the same.

away_games = osu_bb %>% 
  filter(Home == "@")

This code uses the “pipe” operator, %>%. The pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is the same as f(x, y). Why do we want to use such a thing? Consider doing multiple filters. Here we obtain a dataset of away games where the opponent scored fewer than 60 points. (We simply return the result rather than storing it.)

osu_bb %>% 
  filter(Home == "@") %>% 
  filter(OPPPTS < 60)

Above we chain the two filters with the pipe operator. Below, we nest one filter() call inside the other. Eww. That’s ugly. Too many parentheses.

filter(filter(osu_bb, Home == "@"), OPPPTS < 60)
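To convince yourself that the piped and nested versions really do the same thing, here is a small sketch on a made-up tibble. The four rows are hypothetical and only reuse the Home and OPPPTS column names. It also shows a third option: filter() accepts several conditions in one call and keeps only the rows that satisfy all of them.

```r
library(dplyr)
library(tibble)

# Made-up rows; not real game data.
toy_games = tibble(
  Home   = c("@", "H", "@", "N"),
  OPPPTS = c(55, 58, 72, 49)
)

# Three ways to keep away games where the opponent scored fewer than 60.
piped    = toy_games %>% filter(Home == "@") %>% filter(OPPPTS < 60)
nested   = filter(filter(toy_games, Home == "@"), OPPPTS < 60)
combined = filter(toy_games, Home == "@", OPPPTS < 60)

# Print each result to compare them by eye.
piped
nested
combined
```

All three should return the same single row. Which form to use is mostly a matter of readability: the pipe reads top to bottom in the order the steps happen.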