
All models are wrong, but some models are useful.

George Box


In this document we will discuss how to use (well-known) probability distributions to model univariate data (a single variable) in R. We will call this process “fitting” a model.

For the purpose of this document, the variables that we would like to model are assumed to be a random sample from some population. That is, we are considering models of the form

$$X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)$$

Models of this form assume that the data we obtained can be modeled by random variables that are independent and identically distributed according to some distribution with parameter $\theta$. It is important to be aware that these are assumptions of the model that we are using. For now, we will work with these (often implicit) assumptions, but not necessarily spend much time challenging them. (Specifically, we won't criticize the independence and identically distributed assumptions. If our data truly is a random sample, there shouldn't be much of an issue.) For the more explicit assumption that we make, the choice of distribution, we will learn how to criticize it.
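To make this concrete, here is a minimal sketch of what the model assumes: if the parameter were known, our data should look like a sample simulated from the corresponding distribution. (The normal distribution and the parameter values below are placeholders chosen only for illustration.)

set.seed(1)
# simulate an iid sample of size n = 100 from a normal distribution
# with "known" parameters; in practice we go the other direction and
# estimate the parameters from data
x_sim = rnorm(n = 100, mean = 5, sd = 2)
head(x_sim)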

Some natural questions should arise when modeling a variable:

- What distribution should we use to model the variable?
- How do we estimate the parameters of that distribution?
- How well does the fitted model describe the data we observed?

To fit a probability model and answer these questions, we will generally use the following procedure:

- Look at the data, usually with a histogram or barplot.
- Choose a candidate distribution.
- Estimate the parameters of that distribution, for example with maximum likelihood or the method of moments.
- Assess the fit, for example by adding the fitted density to a plot of the data and by creating a QQ plot.

To illustrate this process and introduce some new concepts, such as the QQ plot, let's look at several examples.


Fitbit Sleep Data

The file fitbit-sleep.csv contains data relevant to Dave's (Professor Dalpiaz) sleep. Note that we are reading the data directly from the web into a variable named fitbit_sleep. Here we use the read_csv() function from the readr package, which technically makes fitbit_sleep a tibble, which is much like a data frame, but with a few differences we won't notice here.

library(readr)
fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/fitbit-sleep.csv")

After reading in the data, we need to look at the data. One way to do this is to simply output the entire dataset. Since it is a tibble, it does so in an easy to read fashion.

fitbit_sleep
start                end                  min_sleep  min_awake  num_awake  total_bed  min_REM  min_light
2018-09-05 11:13PM   2018-09-06 8:31AM          495         63         54        558      101        330
2018-09-05 12:04AM   2018-09-05 7:48AM          399         65         27        464       74        227
2018-09-03 11:34PM   2018-09-04 8:15AM          469         52         34        521      106        331
2018-09-02 11:24PM   2018-09-03 8:32AM          485         63         41        548       80        322
2018-09-02 12:25AM   2018-09-02 8:43AM          456         42         35        498      130        232
2018-08-31 11:59PM   2018-09-01 8:15AM          430         66         33        496       68        295
2018-08-30 11:21PM   2018-08-31 7:45AM          449         55         33        504       78        307
2018-08-29 10:16PM   2018-08-30 7:23AM          474         73         38        547      112        297
2018-08-28 10:56PM   2018-08-29 7:33AM          447         70         37        517       73        310
2018-08-28 12:09AM   2018-08-28 8:05AM          404         72         34        476       93        260
(first ten rows shown)

In RStudio, you could use View(fitbit_sleep) to view the data in a data viewer window.

When we look at the data in this manner, what we're really looking for is the overall structure of the data. We would like to know:

- What does each row (observation) represent?
- What does each column (variable) represent?
- What type of values does each variable take?

Normally these questions can be answered in some form of documentation. If we are dealing with a dataset from R or an R package, often ?dataset_name will answer these questions.

For this dataset, each row represents a night of sleep recorded by a Fitbit Versa. The columns are:

- start: the date and time sleep began
- end: the date and time sleep ended
- min_sleep: minutes spent asleep
- min_awake: minutes spent awake while in bed
- num_awake: the number of awakenings during the night
- total_bed: total minutes spent in bed
- min_REM: minutes of REM sleep
- min_light: minutes of light sleep

Note that technically these are all simply what the Fitbit has measured. There is certainly a lot of measurement error. (These quantities are being estimated by the Fitbit based on movement and heart rate.)

Hours of Sleep, Normal Model

Let’s create a new variable called hour_sleep that measures the hours of sleep each night that we will attempt to model.

hour_sleep = fitbit_sleep$min_sleep / 60
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
     main = "Histogram of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()
box()

This histogram gives some insight into why we wouldn’t simply use the empirical distribution. In some sense, we can think of using the empirical distribution as “overfitting.” It places zero probability on too much of the range of the variable, and places too much probability on the observations we have observed. By fitting a probability model, we will “smooth” away these issues.
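To see the empirical distribution directly, we could plot the empirical CDF, which places probability mass 1/n on each observed value; a quick sketch:

# empirical CDF of hours of sleep; it jumps by 1/n at each observation
# and places zero probability between and beyond the observed values
plot(ecdf(hour_sleep),
     main = "Empirical CDF of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()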

Here, we can argue that a continuous distribution would make the most sense. While the Fitbit only records sleep to the nearest minute, in reality sleep time could take on any value between 0 and 24 hours.

Without giving it too much thought, we'll attempt to model this data with a normal distribution. While a normal distribution does allow for the possibility of any real number, the vast majority of the probability is within three standard deviations of the mean.
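As a quick check of that claim, we can calculate the probability that a normal random variable falls within three standard deviations of its mean.

# probability a normal random variable is within three standard
# deviations of its mean
pnorm(3) - pnorm(-3)
## [1] 0.9973002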

To “fit” the normal model, we simply need to estimate its two parameters, $\mu$ and $\sigma^2$. Recall that the probability density function for a normal random variable is given by

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \quad x \in \mathbb{R}, \ \mu \in \mathbb{R}, \ \sigma^2 > 0$$

The maximum likelihood estimators for $\mu$ and $\sigma^2$ are given by

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Notice that $\bar{x}$ can be calculated with the mean() function in R. However, the var() function in R calculates

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

While this is not the MLE (or MoM), the difference is so small that we'll use it since it makes our life easier.
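If we did want the MLE of $\sigma^2$, it is easy to compute directly; a quick sketch of both routes:

# MLE of sigma^2, computed directly (divide by n)
mean((hour_sleep - mean(hour_sleep)) ^ 2)

# equivalently, rescale the divide-by-(n - 1) version from var()
var(hour_sleep) * (length(hour_sleep) - 1) / length(hour_sleep)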

hsleep_xbar = mean(hour_sleep)
hsleep_s2   = var(hour_sleep)
c(hsleep_xbar, hsleep_s2)
## [1] 7.400000 0.700254

Here we see these two estimates. In this dataset, Dave sleeps on average 7.4 hours per night, with a variance of about 0.7, which corresponds to a standard deviation of about 0.84 hours.

To get a sense of how well this model fits, we add its density to the histogram of the data. Note that the dnorm() function takes the standard deviation, not the variance, as an argument.

hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
     main = "Histogram of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()
box()
curve(dnorm(x, mean = mean(hour_sleep), sd = sd(hour_sleep)), 
      add = TRUE, col = "darkorange", lwd = 2)

It turns out that it's difficult to tell how well this fits using only a histogram. This becomes even more apparent once you realize that histograms are very sensitive to bin width selection.

par(mfrow = c(1, 2))

# default histogram
hist(hour_sleep, probability = TRUE, xlim = c(3, 12),
     main = "Histogram of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()
box()
curve(dnorm(x, mean = mean(hour_sleep), sd = sd(hour_sleep)), 
      add = TRUE, col = "darkorange", lwd = 2)

# modified bin selection
hist(hour_sleep, probability = TRUE, xlim = c(3, 12), breaks = 20,
     main = "Histogram of Hours of Sleep",
     xlab = "Hours of Sleep")
grid()
box()
curve(dnorm(x, mean = mean(hour_sleep), sd = sd(hour_sleep)), 
      add = TRUE, col = "darkorange", lwd = 2)

Instead of using a histogram to check how well our model fits, we'll instead use a quantile-quantile plot, often simply called a QQ plot. We'll save the details for later, but the idea of a QQ plot is simple. On one axis, normally the y axis, we plot the actual data that we observed. On the other axis, the x axis, we plot where we would expect the data to be if it were sampled from the distribution that we fit. We then add a line for $y = x$. If the model fits well, the points should fall close to the line. If the model fits poorly, the points will deviate from the line.
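The theoretical quantiles are obtained by evaluating the quantile function of the fitted distribution at a grid of probabilities, one per observation, which is what the ppoints() function used below provides. For example:

# probabilities at which the theoretical quantiles are evaluated
ppoints(5)
## [1] 0.1190476 0.3095238 0.5000000 0.6904762 0.8809524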

qqplot(x = qnorm(ppoints(hour_sleep), 
                 mean = mean(hour_sleep), sd = sd(hour_sleep)),
       y = hour_sleep,
       xlim = c(4, 10), ylim = c(4, 10),
       main = "QQ-Plot: Hours of Sleep, Normal Distribution",
       xlab = "Theoretical Quantiles, Normal Distribution",
       ylab = "Sample Quantiles, Hours of Sleep")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

Without being an “expert” in QQ plots, it might not be immediately clear whether these points are close enough to the line. Reading QQ plots is actually a bit of an art. For now, we'll say that this is pretty good! (More on “good” and “bad” QQ plots later.)

To reiterate why we like this model better than using the empirical distribution, we calculate the probability that Dave sleeps between 5.5 and 6 hours two ways.

# using empirical distribution
mean(5.5 < hour_sleep & hour_sleep < 6)
## [1] 0
# using normal model
diff(pnorm(q = c(5.5, 6), mean = mean(hour_sleep), sd = sd(hour_sleep)))
## [1] 0.03557407

Note that the smallest observation of hour_sleep in this dataset is 5.42. Thus it seems reasonable to place positive probability on sleeping between 5.5 and 6 hours. This is accomplished by the normal model, but not by the empirical distribution.

Awakenings, Poisson Model

Now let’s try to model the number of times Dave wakes up during the night.

awakes = fitbit_sleep$num_awake

Since this is “count” data, it actually is discrete. To visualize a discrete distribution, we often use a barplot instead of a histogram.

barplot(table(factor(awakes, levels = 0:55)) / length(awakes), ylim = c(0, 0.15),
        names.arg = as.character(0:55),
        xlab = "Number of Awakenings",
        main = "Fitbit Sleep Data: Number of Awakenings")
box()
grid()

Let's try a Poisson model. To “fit” the model, we need to estimate $\lambda$, which we have repeatedly seen can be done using the sample mean, $\bar{x}$.

mean(awakes)
## [1] 34.77778

When we represent this distribution as a barplot, we obtain the following.

barplot(dpois(0:55, lambda = mean(awakes)), ylim = c(0, 0.15),
        names.arg = as.character(0:55),
        xlab = "Number of Awakenings",
        main = "Fitbit Sleep Data: Modeled Number of Awakenings")
box()
grid()

It’s hard to tell how well this “fits,” so we create a QQ plot.

qqplot(x = qpois(ppoints(awakes), lambda = mean(awakes)),
       y = awakes,
       main = "QQ-Plot: Awakenings, Poisson Distribution",
       xlab = "Theoretical Quantiles, Poisson Distribution",
       ylab = "Sample Quantiles, Number of Awakenings")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

This QQ plot looks pretty good!

However, when we represented the fitted Poisson model as a barplot, its shape may have looked very familiar. Even though the normal distribution is a continuous model, and this data is discrete, let's try a normal model.

hist(awakes, probability = TRUE, xlim = c(0, 60),
     main = "Histogram of Number of Awakenings",
     xlab = "Awakenings")
grid()
box()
curve(dnorm(x, mean = mean(awakes), sd = sd(awakes)), 
      add = TRUE, col = "darkorange", lwd = 2)

The fitted normal model on top of the histogram looks pretty good…

qqplot(x = qnorm(ppoints(awakes), 
                 mean = mean(awakes), sd = sd(awakes)),
       y = awakes,
       main = "QQ-Plot: Awakenings, Normal Distribution",
       xlab = "Theoretical Quantiles, Normal Distribution",
       ylab = "Sample Quantiles, Number of Awakenings")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

The QQ plot looks good as well! Both of these models are reasonable. This is largely because a normal distribution is a decent approximation of a Poisson distribution, especially for “large” values of $\lambda$.
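To get a rough sense of how close the two fitted models are, we could compare a probability calculated under each; a small sketch, using the fitted value of $\lambda$ for both the mean and the variance of the normal model:

# P(X <= 40) under the fitted Poisson model and under a normal model
# with matching mean (lambda) and standard deviation (sqrt(lambda))
lambda_hat = mean(awakes)
ppois(40, lambda = lambda_hat)
pnorm(40, mean = lambda_hat, sd = sqrt(lambda_hat))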


New Zealand River Data

The data stored in nz-rivers.txt records information about rivers in New Zealand.

nz_rivers = read_table2("https://daviddalpiaz.github.io/stat3202-sp19/data/nz-rivers.txt")

Notice that here we used the read_table2() function to import the data. This is because the file is not a .csv file; it is delimited by whitespace instead. If you use the Import Dataset tool in RStudio, it will automatically detect this.

nz_rivers
River          Length  FlowsInto
Clarence          209  Pacific
Conway             48  Pacific
Waiau             169  Pacific
Hurunui           138  Pacific
Waipara            64  Pacific
Ashley             97  Pacific
Waimakariri       161  Pacific
Selwyn             95  Pacific
Rakaia            145  Pacific
Ashburton          90  Pacific
(first ten rows shown)

This dataset contains information on rivers in New Zealand including their name, length, and the body of water the river flows into.

River Length, Gamma Model

We will try to model the length of the rivers, which in this dataset is given in kilometers.

hist(nz_rivers$Length, probability = TRUE, 
     main = "NZ River Lengths", xlab = "River Length (km)", 
     ylim = c(0, 0.020), xlim = c(0, 400), breaks = 12)
box()
grid()

By looking at the histogram, and simply understanding that rivers have some positive length, we would like to consider probability distributions that place density on the positive numbers. A number of probability distributions have this feature, including the chi-square, F, exponential, and gamma distributions. Let's try the gamma.

Recall that the probability density function of a gamma random variable is given by

$$f(x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0, \ \alpha > 0, \ \beta > 0$$

Note that R will use shape instead of α and scale instead of β.

Also, the gamma distribution has

$$\text{E}[X] = \alpha\beta$$

and

$$\text{Var}[X] = \alpha\beta^2$$

Below are plots of the gamma distribution for various values of the parameters. By estimating these parameters from the data, hopefully we can find a gamma distribution that matches the shape of the above histogram.
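Plots like these could be produced with curve(); the particular shape and scale values here are only illustrative choices.

# gamma densities for a few illustrative shape / scale combinations
curve(dgamma(x, shape = 1, scale = 2), from = 0, to = 30, ylim = c(0, 0.5),
      col = "darkorange", lwd = 2, xlab = "x", ylab = "f(x)",
      main = "Gamma Densities, Various Parameters")
curve(dgamma(x, shape = 2, scale = 2), add = TRUE, col = "dodgerblue", lwd = 2)
curve(dgamma(x, shape = 5, scale = 2), add = TRUE, col = "darkgreen", lwd = 2)
legend("topright", col = c("darkorange", "dodgerblue", "darkgreen"), lwd = 2,
       legend = c("shape = 1, scale = 2", "shape = 2, scale = 2", "shape = 5, scale = 2"))
grid()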

While it is not possible to find an analytic solution for the MLE, given $X_1, X_2, \ldots, X_n$ assumed to be a random sample from a gamma distribution, we can use the method of moments to obtain estimators:

$$\tilde{\alpha} = \frac{\bar{x}^2}{\overline{x^2} - \bar{x}^2}$$

$$\tilde{\beta} = \frac{\overline{x^2} - \bar{x}^2}{\bar{x}}$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\overline{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2$$

We can apply these estimators to this data in R.

# calculating sample moments
len_samp_moment_1 = mean(nz_rivers$Length)
len_samp_moment_2 = mean(nz_rivers$Length ^ 2)

# method of moments estimators
len_alpha_mom = len_samp_moment_1 ^ 2 / (len_samp_moment_2 - len_samp_moment_1 ^ 2)
len_beta_mom  = len_samp_moment_1 / len_alpha_mom

# estimates for this dataset
c(len_alpha_mom, len_beta_mom)
## [1]  2.189716 44.342529

We then add this fitted gamma model to the histogram.

hist(nz_rivers$Length, probability = TRUE, 
     main = "NZ River Lengths", xlab = "River Length (km)", 
     ylim = c(0, 0.020), xlim = c(0, 400), breaks = 12)
box()
grid()
curve(dgamma(x, shape = len_alpha_mom, scale = len_beta_mom), 
      add = TRUE, col = "darkorange", lwd = 2)

Then to better verify the fit, we create a QQ plot.

qqplot(x = qgamma(ppoints(nz_rivers$Length), 
                 shape = len_alpha_mom, scale = len_beta_mom),
       y = nz_rivers$Length,
       xlim = c(0, 400), ylim = c(0, 400),
       main = "QQ-Plot: NZ River Length, Gamma Distribution",
       xlab = "Theoretical Quantiles, Gamma Distribution",
       ylab = "Sample Quantiles, River Length")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

This QQ plot isn’t the best, but it isn’t terrible.


Cocaine Seizure Data

This dataset comes from STRIDE, the System to Retrieve Information from Drug Evidence. It contains all cocaine seizures in the US from 2007 that have a known weight. For a full dataset description, use ?cocaine after loading the ggvis package.

cocaine = ggvis::cocaine
cocaine
The printed output shows the first rows of the data. The columns are state <chr>, potency <dbl>, weight <dbl>, month <int>, and price <dbl>.

Cocaine Potency, Beta Model

First, let’s model the potency of the cocaine seized.

range(cocaine$potency)
## [1]  2 98
hist(cocaine$potency, probability = TRUE, 
     main = "Potency of Seized Cocaine", xlab = "Potency  (0% = all filler, 100% = pure cocaine)", 
     ylim = c(0, 0.025), xlim = c(0, 100), breaks = 15)
box()
grid()

We see that potency can range from 0 (all filler) to 100 (pure cocaine). We aren’t aware of any distributions on the interval from 0 to 100, but we do know one on the interval from 0 to 1, the beta distribution.

In order to use the beta distribution, we’ll first perform a quick transformation.

coke_pot = cocaine$potency / 100
range(coke_pot)
## [1] 0.02 0.98
hist(coke_pot, probability = TRUE, 
     main = "Potency of Seized Cocaine", xlab = "Potency  (0 = all filler, 1 = pure cocaine)", 
     ylim = c(0, 2.5), xlim = c(0, 1), breaks = 15)
box()
grid()

Notice that the shape stays exactly the same.

Recall that a random variable $X$ that follows a beta distribution has probability density function

$$f(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad 0 \le x \le 1, \ \alpha > 0, \ \beta > 0$$

Also, the beta distribution has

$$\text{E}[X] = \frac{\alpha}{\alpha + \beta}$$

and

$$\text{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

Below are plots of the beta distribution for various values of the parameters. By estimating these parameters from the data, hopefully we can find a beta distribution that matches the shape of the above histogram.
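As with the gamma example, plots like these could be produced with curve(); the parameter values here are only illustrative choices.

# beta densities for a few illustrative shape1 / shape2 combinations
curve(dbeta(x, shape1 = 0.5, shape2 = 0.5), from = 0, to = 1, ylim = c(0, 4),
      col = "darkorange", lwd = 2, xlab = "x", ylab = "f(x)",
      main = "Beta Densities, Various Parameters")
curve(dbeta(x, shape1 = 2, shape2 = 2), add = TRUE, col = "dodgerblue", lwd = 2)
curve(dbeta(x, shape1 = 4, shape2 = 2), add = TRUE, col = "darkgreen", lwd = 2)
legend("topleft", col = c("darkorange", "dodgerblue", "darkgreen"), lwd = 2,
       legend = c("shape1 = 0.5, shape2 = 0.5", "shape1 = 2, shape2 = 2", "shape1 = 4, shape2 = 2"))
grid()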

While it is not possible to find an analytic solution for the MLE, given $X_1, X_2, \ldots, X_n$ assumed to be a random sample from a beta distribution, we can use the method of moments to obtain estimators:

$$\tilde{\alpha} = \bar{x} \left( \frac{\bar{x}(1 - \bar{x})}{s^2} - 1 \right)$$

$$\tilde{\beta} = (1 - \bar{x}) \left( \frac{\bar{x}(1 - \bar{x})}{s^2} - 1 \right)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Let’s apply these estimators to this dataset.

coke_pot_alpha_hat = mean(coke_pot) * ((mean(coke_pot) * (1 - mean(coke_pot))) / (var(coke_pot)) - 1)
coke_pot_beta_hat = (1 - mean(coke_pot)) * ((mean(coke_pot) * (1 - mean(coke_pot))) / (var(coke_pot)) - 1)
c(coke_pot_alpha_hat, coke_pot_beta_hat)
## [1] 4.138140 2.531346

Note that R will use shape1 instead of α and shape2 instead of β.

hist(coke_pot, probability = TRUE, 
     main = "Potency of Seized Cocaine", xlab = "Potency  (0 = all filler, 1 = pure cocaine)", 
     ylim = c(0, 2.5), xlim = c(0, 1), breaks = 15)
box()
grid()
curve(dbeta(x, shape1 = coke_pot_alpha_hat, shape2 = coke_pot_beta_hat), 
      add = TRUE, col = "darkorange", lwd = 2)

This looks pretty good, but let’s verify with a QQ plot.

qqplot(x = qbeta(ppoints(coke_pot), shape1 = coke_pot_alpha_hat, shape2 = coke_pot_beta_hat),
       y = coke_pot,
       xlim = c(0, 1), ylim = c(0, 1),
       main = "QQ-Plot: Cocaine Potency, Beta Distribution",
       xlab = "Theoretical Quantiles, Beta Distribution",
       ylab = "Sample Quantiles, Cocaine Potency")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

The left tail isn't perfect, but again, this looks pretty good.

Cocaine Price, Some Model?

Now let’s try to model the price of the cocaine seized.

hist(cocaine$price, probability = TRUE, 
     main = "Price of Seized Cocaine", xlab = "Estimate Value, USD", 
     breaks = 15)
box()
grid()

This is a very right-skewed distribution. We know a number of distributions that are valid for positive numbers. Let’s try exponential.

Note that R parameterizes the exponential distribution using rate, which corresponds to $1/\theta$ in the probability density function

$$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}, \quad x > 0, \ \theta > 0$$

Also note that the MLE for $\theta$ is $\bar{X}$.

hist(cocaine$price, probability = TRUE, 
     main = "Price of Seized Cocaine", xlab = "Estimate Value, USD", 
     breaks = 15)
box()
grid()
curve(dexp(x, rate = 1 / mean(cocaine$price)), 
      add = TRUE, col = "darkorange")

This looks OK, but let’s check with a QQ plot.

qqplot(x = qexp(ppoints(cocaine$price), rate = 1 / mean(cocaine$price)),
       y = cocaine$price,
       main = "QQ-Plot: Cocaine Price, Exponential Distribution",
       xlab = "Theoretical Quantiles, Exponential Distribution",
       ylab = "Sample Quantiles, Cocaine Price")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

This actually highlights an issue. It seems that this distribution actually fits poorly in the right tail.

We could try some other distributions, but instead let's cheat a bit and use something called a kernel density estimate.

hist(cocaine$price, probability = TRUE, 
     main = "Price of Seized Cocaine", xlab = "Estimate Value, USD", 
     breaks = 15)
box()
grid()
lines(density(cocaine$price), col = "darkorange")

Wow, look at that, it fits great, which we can verify with a QQ plot.

library(spatstat)
qqplot(x = quantile.density(density(cocaine$price), ppoints(cocaine$price)),
       y = cocaine$price,
       main = "QQ-Plot: Cocaine Potency, KDE",
       xlab = "Theoretical Quantiles, Kernel Density Estimate",
       ylab = "Sample Quantiles, Cocaine Price")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

That's a big improvement! Why not just use this all the time? There are actually a few reasons, most of which are hard to explain currently. Once we start doing hypothesis testing, we'll see that the lack of a model parameter here would make things difficult. (Also, there is technically a tuning (hyper) parameter being used here; we are just using the default value.)

Kernel density estimation is an example of a nonparametric method; again, notice that we didn't estimate a parameter. We'll discuss some related methods later in the course.
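To see the effect of that tuning parameter, we could vary the bandwidth of the kernel density estimate through the adjust argument of density(); a small sketch:

# kernel density estimates with less and more smoothing than the default
hist(cocaine$price, probability = TRUE, 
     main = "Price of Seized Cocaine", xlab = "Estimated Value, USD", 
     breaks = 15)
box()
grid()
lines(density(cocaine$price, adjust = 0.5), col = "darkorange")  # narrower bandwidth
lines(density(cocaine$price, adjust = 2),   col = "dodgerblue")  # wider bandwidth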


QQ Plots

To get a better sense of how QQ plots work, we provide four examples.

set.seed(42)
sim_norm_data = rnorm(n = 50, mean = 2, sd = 3)
qqplot(x = qnorm(ppoints(sim_norm_data), mean = 2, sd = 3),
       y = sim_norm_data,
       xlim = c(-7, 11), ylim = c(-7, 11),
       main = "QQ-Plot: Simulated Normal Data",
       xlab = "Theoretical Quantiles, Normal Distribution",
       ylab = "Sample Quantiles, Simulated Normal Data")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

set.seed(42)
sim_norm_data = rnorm(n = 50, mean = 2, sd = 3)
qqplot(x = qexp(ppoints(sim_norm_data), rate = 2),
       y = sim_norm_data,
       main = "QQ-Plot: Simulated Normal Data",
       xlab = "Theoretical Quantiles, Exponential Distribution",
       ylab = "Sample Quantiles, Simulated Normal Data")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

set.seed(42)
sim_exp_data = rexp(n = 50, rate = 2)
qqplot(x = qnorm(ppoints(sim_exp_data), mean = 0, sd = 1),
       y = sim_exp_data,
       main = "QQ-Plot: Simulated Exponential Data",
       xlab = "Theoretical Quantiles, Normal Distribution",
       ylab = "Sample Quantiles, Simulated Exponential Data")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()

set.seed(42)
sim_t_data = rt(n = 50, df = 3)
qqplot(x = qnorm(ppoints(sim_t_data), mean = 0, sd = 1),
       y = sim_t_data,
       main = "QQ-Plot: Simulated t Data",
       xlab = "Theoretical Quantiles, Normal Distribution",
       ylab = "Sample Quantiles, Simulated t Data")
abline(a = 0, b = 1, col = "dodgerblue", lwd = 2)
grid()


Process Recap