All models are wrong, but some models are useful.

In this document we will discuss how to use (well-known) probability distributions to model univariate data (a single variable) in R. We will call this process “fitting” a model.

For the purpose of this document, the variables that we would like to model are assumed to be a random sample from some population. That is, we are considering models of the form

$X_1, X_2, \ldots, X_n \overset {iid} \sim f(x \mid \theta)$

Models of this form assume that the data we obtained can be modeled by random variables that are independent and identically distributed according to some distribution with parameter $$\theta$$. It is important to be aware that these are assumptions of the model that we are using. For now, we will use these (often implicit) assumptions, but not necessarily spend much time challenging them. (Specifically, we won’t criticize the independence and identical assumptions. If our data truly is a random sample, there shouldn’t be much of an issue.) For other more explicit assumptions (the distribution assumption) that we make, we will learn how to criticize them.

Some natural questions should arise when modeling a variable:

• What probability models are appropriate? Specifically, what distribution should be used?
• How do you fit a particular model? That is, how do we fit a chosen probability distribution to some data?
• How do we know if a model that we fit is good?

To fit a probability model and answer these questions, we will generally use the following procedure:

• Look at the data.
• Assume a probability distribution.
• Estimate the model parameters, that is, the parameters of the chosen distribution.
• Check how well the estimated model fits using a QQ-Plot.

To illustrate this process and introduce some new concepts such as a QQ-Plot, let’s look a several examples.

# Fitbit Sleep Data

The file fitbit-sleep.csv contains data relevant to Dave’s (Professor Dalpiaz) sleep. Note that we are reading the data directly form the web into a variable named fitbit_sleep. Here we use the read_csv() function form the readr package which technically makes fitbit_sleep a tibble, which is much like a data frame, but with a few differences we won’t notice here.

library(readr)
fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-au18/data/fitbit-sleep.csv")

After reading in the data, we need to look at the data. One way to do this is to simply output the entire dataset. Since it is a tibble, it does so in an easy to read fashion.

fitbit_sleep

In RStudio, you could use View(fitbit_sleep) to view the data in a data viewer window.

When we look at the data in this manner, what we’re really looking for is overall structure of the data. We like to know:

• What are the rows (observations) in this dataset?
• What are the columns (variables) in this dataset? Why type of variables are they?

Normally these questions can be answered is some form of documentation. If we are dealing with a dataset from R or an R package, often ?dataset_name will answer these questions.

For this dataset, each row represents a night of sleep recorded by a Fitbit Versa. The columns are:

• start - date and time Fitbit recognized beginning of sleep
• end - date and time Fitbit recognized end of sleep
• min_sleep - total time spent asleep in minutes
• min_awake - total time spent awake in minutes
• num_awake - number of awakenings during sleep
• total_bed - total time in bed in minutes, including both asleep and awake
• min_REM - time spent in the REM sleep stage in minutes
• min_light - time spent in the light sleep stages (NREM 1 and 2) in minutes
• min_deep - time spent in the deep sleep stages (NREM 3 and 4) in minutes

Note that technically these are all simply what the Fitbit has measured. There is certainly a lot of measurement error. (These quantities are being estimated by the Fitbit based on movement and heart rate.)

## Hours of Sleep, Normal Model

Let’s create a new variable called hour_sleep that measures the hours of sleep each night that we will attempt to model.

hour_sleep = fitbit_sleep\$min_sleep / 60
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
main = "Histogram of Hours of Sleep",
xlab = "Hours of Sleep")
grid()
box()