--- title: 'STAT 3202: Lab 02, Fitting a Probability Model' author: "Spring 2019, OSU" date: 'Due: Friday, February 1' output: html_document: theme: spacelab toc: yes pdf_document: default urlcolor: BrickRed --- *** ```{r setup, include = FALSE} knitr::opts_chunk$set(fig.align = "center") ``` **Goal:** After completing this lab, you should be able to... - *Fit* any well-known distribution (that is implemented in `R`) using either maximum likelihood or method of moments estimators. - *Plot* a fitted distribution by adding it to a histogram of data that was modeled. - *Check* how well a distribution models some data using a [QQ-Plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). In this lab we will use, but not focus on... - `R` Markdown. This document will serve as a template. It is pre-formatted and already contains chunks that you need to complete. - Importing data. Code to import the necessary data is provided. Some additional notes: - Please see [**Carmen**](https://carmen.osu.edu/) for information about submission, and grading. - You may use [this document](lab-02-assign.Rmd) as a template. You do not need to remove directions. Chunks that require your input have a comment indicating to do so. - The [reading posted with the lab](https://daviddalpiaz.github.io/stat3202-sp19/#lab-02) and the associated `R` Markdown file contain details on performing the required tasks through a number of examples. - Recall the tutorial information posted in [Lab `01`](https://daviddalpiaz.github.io/stat3202-sp19/lab/lab-01.html#exercise-7---curiosity-reading-template-videos). It may be extremely useful here. *** # Exercise 1 - Sleep Data, Beta Distribution For this exercise, we will return to the sleep data from class. Please refer to the [notes]() for a complete description of this dataset. ```{r, message = FALSE, warning = FALSE} library(readr) fitbit_sleep = read_csv("https://daviddalpiaz.github.io/stat3202-sp19/data/fitbit-sleep.csv") ``` The above chunk will read the data as a data frame named `fitbit_sleep`. (Actually, a [`tibble`](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), but that is unimportant at the moment.) In general, you should avoid using an absolute reference to data on your own machine. Instead, you should use a relative reference, and include the data with your `R` Markdown document. Here we avoid the issue altogether and simply read the data directly from the web. This way, the above code will work on anyone's machine. ```{r} percent_deep = fitbit_sleep$min_deep / fitbit_sleep$min_sleep ``` For this exercise, we will investigate a new variable called `percent_deep` that is created above. This variable measures the proportion of time asleep spent in the [deep sleep stages](https://en.wikipedia.org/wiki/Slow-wave_sleep). We will consider two probability models for this variable, a beta model and a normal model. ```{r} range(percent_deep) ``` Notice that the range of the observed data that is calculated above. Furthermore, since this variable is a proportion, it *could* take values between `0` and `1`. (Although, it is rather unlikely that you would spend all night in deep sleep.) If we think of the sample space of the [continuous distributions](https://daviddalpiaz.github.io/stat3202-sp19/tables/continuous.pdf) that we know, we should remember that the beta distribution takes as input values between `0` and `1`. This makes beta a good candidate. Recall, that a random variable $X$ that follows a beta distribution has probability density function $$ f(x \mid \alpha, \beta) = \left[\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\right] x^{\alpha - 1}(1 - y)^{\beta - 1}, 0 \leq x \leq 1, \alpha > 0, \beta > 0 $$ Also, the beta distribution has $$ \text{E}[X] = \frac{\alpha}{\alpha + \beta} $$ and $$ \text{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} $$ While it is not possible to find an analytic solution for the MLE, given $X_1, X_2, \ldots X_n$ assumed to be a random sample from a beta distribution, we can use method of moments to obtain estimators: $$ \tilde{\alpha} = \bar{x}\left(\frac{\bar{x}(1 - \bar{x})}{s^2} - 1\right) $$ $$ \tilde{\beta} = (1 - \bar{x})\left(\frac{\bar{x}(1 - \bar{x})}{s^2} - 1\right) $$ where $$ \bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i $$ $$ s^2 = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_i - \bar{x})^2 $$ Do the following: - Calculate estimates for the parameters assuming a beta distribution using this data. (Use the provided method of moments estimators above.) - Add the fitted model to the provided histogram. (That is, add the density for the specific beta distribution you found.) - Create a QQ-plot to better determine how well this distribution fits the data. ```{r} # perform estimate calculations here # remove these comments and insert your code in this chunk ``` ```{r} hist(percent_deep, xlim = c(0, 1), ylim = c(0, 12), probability = TRUE, main = "Histogram of Percent Deep Sleep", xlab = "Percent Deep Sleep") box() grid() # remove this comment and add code to add the beta density here ``` ```{r, fig.height = 5, fig.width = 5} # remove this comment and add code to create a QQ-plot here ``` *** # Exercise 2 - Sleep Data, Normal Distribution In class we saw that [many distributions approximate each other](http://www.math.wm.edu/~leemis/2008amstat.pdf). Investigate the use of a normal model for the `percent_deep` variable. Do the following: - Calculate estimates for the parameters assuming a normal distribution using this data. (Hint: You may simply use $\bar{x}$ and $s^2$ even though $s^2$ is neither the MLE nor MoM estimator.) - Add the fitted model to the provided histogram. (That is, add the density for the specific normal distribution you found.) - Create a QQ-plot to better determine how well this distribution fits the data. ```{r} # perform estimate calculations here # remove these comments and insert your code in this chunk ``` ```{r} hist(percent_deep, xlim = c(0, 1), ylim = c(0, 12), probability = TRUE, main = "Histogram of Percent Deep Sleep", xlab = "Percent Deep Sleep") box() grid() # remove this comment and add code to add the normal density here ``` ```{r, fig.height = 5, fig.width = 5} # remove this comment and add code to create a QQ-plot here ``` *** # Exercise 3 - Sleep Data, Comparing Models After the two previous exercises, you will likely conclude that both models are "good." However, they are different. To highlight one of the differences, use both models to estimate the probability of less than 5% of sleep being deep sleep. ```{r} # remove this comment and perform beta calculation here ``` ```{r} # remove this comment and perform normal calculation here ``` *** # Exercise 4 - Diabetes Data, Normal Distribution For the remaining three exercises we will use the `diabetes` dataset from the `faraway` package. To learn more about this dataset, use `?faraway` after loading the required package ```{r} library(faraway) diabetes = na.omit(diabetes) ``` Note that before proceeding, we remove any observations from the dataset that contain an `NA`. While this is convenient, in practice this should be given much more thought. We will be interested in the variable `hdl` which records a [high-density lipoprotein](https://en.wikipedia.org/wiki/High-density_lipoprotein) measurement (the so-called "good" cholesterol) for each individual in this dataset in mg/dL. Do the following: - Calculate estimates for the parameters assuming a normal distribution using this data. (Hint: You may simply use $\bar{x}$ and $s^2$ even though $s^2$ is neither the MLE nor MoM estimator.) - Add the fitted model to the provided histogram. (That is, add the density for the specific normal distribution you found.) - Create a QQ-plot to better determine how well this distribution fits the data. ```{r} # perform estimate calculations here # remove these comments and insert your code in this chunk ``` ```{r} hist(diabetes$hdl, xlim = c(0, 150), ylim = c(0, 0.04), probability = TRUE, main = "Histogram of High Density Lipoprotein (hdl)", xlab = "High Density Lipoprotein (hdl), mg/dL", breaks = 20) box() grid() # remove this comment and add code to add the normal density here ``` ```{r, fig.height = 5, fig.width = 5} # remove this comment and add code to create a QQ-plot here ``` *** # Exercise 5 - Diabetes Data, Gamma Distribution Hopefully you are unsatisfied with how well the normal model fits in the previous exercise. Let's instead consider a gamma distribution which gives positive density for positive values. Recall, that a random variable $X$ that follows a gamma distribution has probability density function $$ f(x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha - 1}e^{-x/\beta}, x > 0, \alpha > 0, \beta > 0 $$ Also, the gamma distribution has $$ \text{E}[X] = \alpha\beta $$ and $$ \text{Var}[X] = \alpha\beta^2 $$ While it is not possible to find an analytic solution for the MLE, given $X_1, X_2, \ldots X_n$ assumed to be a random sample from a gamma distribution, we can use method of moments to obtain estimators: $$ \tilde{\alpha} = \frac{\bar{x}^2}{\overline{x^2} - \bar{x}^2} $$ $$ \tilde{\beta} = \frac{\overline{x^2} - \bar{x}^2}{\bar{x}} $$ where $$ \bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i $$ $$ \overline{x^2} = \frac{1}{n}\sum_{i = 1}^{n}x_i^2 $$ Do the following: - Calculate estimates for the parameters assuming a gamma distribution using this data. (Use the provided method of moments estimators above.) - Add the fitted model to the provided histogram. (That is, add the density for the specific gamma distribution you found.) - Create a QQ-plot to better determine how well this distribution fits the data. ```{r} # perform estimate calculations here # remove these comments and insert your code in this chunk ``` ```{r} hist(diabetes$hdl, xlim = c(0, 150), ylim = c(0, 0.04), probability = TRUE, main = "Histogram of High Density Lipoprotein (hdl)", xlab = "High Density Lipoprotein (hdl), mg/dL", breaks = 20) box() grid() # remove this comment and add code to add the gamma density here ``` ```{r, fig.height = 5, fig.width = 5} # remove this comment and add code to create a QQ-plot here ``` *** # Exercise 6 - Diabetes Data, Comparing Models With the "better" of the previous two models, estimate the probability of an HDL above 75 mg/dL. ```{r} # remove this comment and perform probability calculation here ``` ***