All models are wrong, but some models are useful.
In this document we will discuss how to use (well-known) probability distributions to model univariate data (a single variable) in R
. We will call this process “fitting” a model.
For the purpose of this document, the variables that we would like to model are assumed to be a random sample from some population. That is, we are considering models of the form
\[ X_1, X_2, \ldots, X_n \overset {iid} \sim f(x \mid \theta) \]
Models of this form assume that the data we obtained can be modeled by random variables that are independent and identically distributed according to some distribution with parameter \(\theta\). It is important to be aware that these are assumptions of the model that we are using. For now, we will use these (often implicit) assumptions, but not necessarily spend much time challenging them. (Specifically, we won’t criticize the independence and identical assumptions. If our data truly is a random sample, there shouldn’t be much of an issue.) For other more explicit assumptions (the distribution assumption) that we make, we will learn how to criticize them.
Some natural questions should arise when modeling a variable:
To fit a probability model and answer these questions, we will generally use the following procedure:
To illustrate this process and introduce some new concepts such as a QQ-Plot, let’s look a several examples.
The file fitbit-sleep.csv
contains data relevant to Dave’s (Professor Dalpiaz) sleep. Note that we are reading the data directly form the web into a variable named fitbit_sleep
. Here we use the read_csv()
function form the readr
package which technically makes fitbit_sleep
a tibble
, which is much like a data frame, but with a few differences we won’t notice here.
library(readr)
fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/fitbit-sleep.csv")
After reading in the data, we need to look at the data. One way to do this is to simply output the entire dataset. Since it is a tibble, it does so in an easy to read fashion.
fitbit_sleep
In RStudio, you could use View(fitbit_sleep)
to view the data in a data viewer window.
When we look at the data in this manner, what we’re really looking for is overall structure of the data. We like to know:
Normally these questions can be answered is some form of documentation. If we are dealing with a dataset from R
or an R
package, often ?dataset_name
will answer these questions.
For this dataset, each row represents a night of sleep recorded by a Fitbit Versa. The columns are:
start
- date and time Fitbit recognized beginning of sleepend
- date and time Fitbit recognized end of sleepmin_sleep
- total time spent asleep in minutesmin_awake
- total time spent awake in minutesnum_awake
- number of awakenings during sleeptotal_bed
- total time in bed in minutes, including both asleep and awakemin_REM
- time spent in the REM sleep stage in minutesmin_light
- time spent in the light sleep stages (NREM 1 and 2) in minutesmin_deep
- time spent in the deep sleep stages (NREM 3 and 4) in minutesNote that technically these are all simply what the Fitbit has measured. There is certainly a lot of measurement error. (These quantities are being estimated by the Fitbit based on movement and heart rate.)
Let’s create a new variable called hour_sleep
that measures the hours of sleep each night that we will attempt to model.
hour_sleep = fitbit_sleep$min_sleep / 60
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
main = "Histogram of Hours of Sleep",
xlab = "Hours of Sleep")
grid()
box()