All models are wrong, but some models are useful.

George Box


In this document we will discuss how to use (well-known) probability distributions to model univariate data (a single variable) in R. We will call this process “fitting” a model.

For the purpose of this document, the variables that we would like to model are assumed to be a random sample from some population. That is, we are considering models of the form

\[ X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta) \]

Models of this form assume that the data we obtained can be modeled by random variables that are independent and identically distributed according to some distribution with parameter \(\theta\). It is important to be aware that these are assumptions of the model that we are using. For now, we will use these (often implicit) assumptions without spending much time challenging them. (Specifically, we won’t criticize the independence and identical-distribution assumptions. If our data truly is a random sample, there shouldn’t be much of an issue.) For the more explicit assumption that we make (the choice of distribution), we will learn how to criticize it.
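
To make the iid assumption concrete, here is a minimal sketch that simulates a sample of this form. The normal distribution and the values of n, theta_mean, and theta_sd are assumptions chosen purely for illustration; they are not taken from any data in this document.

```r
# Illustrative sketch: simulate an iid sample from an assumed distribution.
# The normal distribution and all parameter values below are hypothetical.
set.seed(42)        # make the simulation reproducible
n = 100             # sample size (arbitrary choice)
theta_mean = 7      # hypothetical mean parameter
theta_sd = 1.5      # hypothetical standard deviation parameter

# Every draw comes from the same distribution (identically distributed),
# and no draw depends on any other (independent).
x = rnorm(n, mean = theta_mean, sd = theta_sd)

length(x)  # number of simulated observations
mean(x)    # sample mean, which should be near theta_mean
sd(x)      # sample standard deviation, which should be near theta_sd
```

With real data we only observe the values in x; the distribution and its parameter are unknown and must be chosen and estimated.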

Some natural questions should arise when modeling a variable:

To fit a probability model and answer these questions, we will generally use the following procedure:

To illustrate this process and introduce some new concepts such as a QQ-Plot, let’s look at several examples.


Fitbit Sleep Data

The file fitbit-sleep.csv contains data relevant to Dave’s (Professor Dalpiaz) sleep. Note that we are reading the data directly from the web into a variable named fitbit_sleep. Here we use the read_csv() function from the readr package, which technically makes fitbit_sleep a tibble. A tibble is much like a data frame, but with a few differences we won’t notice here.

library(readr)
fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/fitbit-sleep.csv")

After reading in the data, we need to look at the data. One way to do this is to simply output the entire dataset. Since it is a tibble, it prints in an easy-to-read fashion.

fitbit_sleep

In RStudio, you could use View(fitbit_sleep) to view the data in a data viewer window.

When we look at the data in this manner, what we’re really looking for is the overall structure of the data. We like to know:

Normally these questions can be answered in some form of documentation. If we are dealing with a dataset from R or an R package, often ?dataset_name will answer these questions.
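
As a rough sketch of inspecting a dataset's structure in R, the following uses the built-in faithful dataset as a stand-in. That choice is an assumption made so the example runs without a download; it is not the data analyzed in this document.

```r
# faithful ships with base R (datasets package), so nothing needs to be downloaded.
# It is used here only as a stand-in example dataset.
n_obs = nrow(faithful)       # number of rows (observations)
n_var = ncol(faithful)       # number of columns (variables)
var_names = names(faithful)  # names of the variables

n_obs
n_var
var_names

str(faithful)   # type of each variable, with a preview of its values
head(faithful)  # first six rows of the data

# ?faithful     # run interactively to open the documentation for the dataset
```

The same functions work on a tibble such as one returned by read_csv().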

For this dataset, each row represents a night of sleep recorded by a Fitbit Versa. The columns are:

Note that technically these are all simply what the Fitbit has measured. There is certainly a lot of measurement error. (These quantities are being estimated by the Fitbit based on movement and heart rate.)

Hours of Sleep, Normal Model

Let’s create a new variable called hour_sleep that measures the hours of sleep each night that we will attempt to model.

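# convert the recorded minutes of sleep to hours
hour_sleep = fitbit_sleep$min_sleep / 60
# probability = TRUE puts the histogram on the density scale (total area of 1)
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
     main = "Histogram of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()  # add reference grid lines
box()   # draw a border around the plot region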