All models are wrong, but some models are useful.

In this document we will discuss how to use (well-known) probability distributions to model univariate data (a single variable) in `R`

. We will call this process “fitting” a model.

For the purpose of this document, the variables that we would like to model are assumed to be a random sample from some population. That is, we are considering models of the form

\[ X_1, X_2, \ldots, X_n \overset {iid} \sim f(x \mid \theta) \]

Models of this form assume that the data we obtained can be modeled by random variables that are *independent* and *identically distributed* according to some distribution with parameter \(\theta\). It is important to be aware that these are **assumptions** of the model that we are using. For now, we will use these (often implicit) assumptions, but not necessarily spend much time challenging them. (Specifically, we won’t criticize the independence and identical assumptions. If our data truly is a random sample, there shouldn’t be much of an issue.) For other more explicit assumptions (the distribution assumption) that we make, we will learn how to criticize them.

Some natural questions should arise when modeling a variable:

- What probability models are appropriate? Specifically, what distribution should be used?
- How do you fit a particular model? That is, how do we fit a chosen probability distribution to some data?
- How do we know if a model that we fit is
*good*?

To fit a probability model and answer these questions, we will generally use the following procedure:

**Look**at the data.**Assume**a probability distribution.**Estimate**the model parameters, that is, the parameters of the chosen distribution.**Check**how well the estimated model fits using a QQ-Plot.

To illustrate this process and introduce some new concepts such as a QQ-Plot, let’s look a several examples.

The file `fitbit-sleep.csv`

contains data relevant to Dave’s (Professor Dalpiaz) sleep. Note that we are reading the data directly form the web into a variable named `fitbit_sleep`

. Here we use the `read_csv()`

function form the `readr`

package which technically makes `fitbit_sleep`

a `tibble`

, which is much like a data frame, but with a few differences we won’t notice here.

```
library(readr)
fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/fitbit-sleep.csv")
```

After reading in the data, we need to **look** at the data. One way to do this is to simply output the entire dataset. Since it is a tibble, it does so in an easy to read fashion.

`fitbit_sleep`

In RStudio, you could use `View(fitbit_sleep)`

to view the data in a data viewer window.

When we **look** at the data in this manner, what we’re really looking for is overall structure of the data. We like to know:

- What are the rows (observations) in this dataset?
- What are the columns (variables) in this dataset? Why type of variables are they?

Normally these questions can be answered is some form of documentation. If we are dealing with a dataset from `R`

or an `R`

package, often `?dataset_name`

will answer these questions.

For this dataset, each row represents a night of sleep recorded by a Fitbit Versa. The columns are:

- date and time Fitbit recognized beginning of sleep`start`

- date and time Fitbit recognized end of sleep`end`

- total time spent asleep in minutes`min_sleep`

- total time spent awake in minutes`min_awake`

- number of awakenings during sleep`num_awake`

- total time in bed in minutes, including both asleep and awake`total_bed`

- time spent in the REM sleep stage in minutes`min_REM`

- time spent in the light sleep stages (NREM 1 and 2) in minutes`min_light`

- time spent in the deep sleep stages (NREM 3 and 4) in minutes`min_deep`

Note that technically these are all simply what the Fitbit has measured. There is certainly a lot of measurement error. (These quantities are being estimated by the Fitbit based on movement and heart rate.)

Let’s create a new variable called `hour_sleep`

that measures the hours of sleep each night that we will attempt to model.

`hour_sleep = fitbit_sleep$min_sleep / 60`

```
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
main = "Histogram of Hours of Sleep",
xlab = "Hours of Sleep")
grid()
box()
```