Exercise 1

For this exercise, we will use the diabetes dataset from the faraway package.

(a) Install and load the faraway package. Do not include the installation command in your .Rmd file. (If you do, the package will be reinstalled every time you knit your file.) Do include the command to load the package into your environment.

Solution:

library(faraway)

(b) Coerce the data to be a tibble instead of a data frame. (You will need the tibble package to do so.) How many observations are in this dataset? How many variables? Who are the individuals in this dataset?

Solution:

library(tibble)
diabetes = as_tibble(diabetes)
diabetes
## # A tibble: 403 x 19
##       id  chol stab…   hdl ratio glyhb loca…   age gend… heig… weig… frame
##  * <int> <int> <int> <int> <dbl> <dbl> <fct> <int> <fct> <int> <int> <fct>
##  1  1000   203    82    56  3.60  4.31 Buck…    46 fema…    62   121 medi…
##  2  1001   165    97    24  6.90  4.44 Buck…    29 fema…    64   218 large
##  3  1002   228    92    37  6.20  4.64 Buck…    58 fema…    61   256 large
##  4  1003    78    93    12  6.50  4.63 Buck…    67 male     67   119 large
##  5  1005   249    90    28  8.90  7.72 Buck…    64 male     68   183 medi…
##  6  1008   248    94    69  3.60  4.81 Buck…    34 male     71   190 large
##  7  1011   195    92    41  4.80  4.84 Buck…    30 male     69   191 medi…
##  8  1015   227    75    44  5.20  3.94 Buck…    37 male     59   170 medi…
##  9  1016   177    87    49  3.60  4.84 Buck…    45 male     69   166 large
## 10  1022   263    89    40  6.60  5.78 Buck…    55 fema…    63   202 small
## # ... with 393 more rows, and 7 more variables: bp.1s <int>, bp.1d <int>,
## #   bp.2s <int>, bp.2d <int>, waist <int>, hip <int>, time.ppn <int>
?diabetes

The tibble has 403 observations and 19 variables. According to the documentation (?diabetes), the individuals are African Americans from central Virginia.

(c) What is the mean HDL level (High Density Lipoprotein) of individuals in this sample?

Solution:

any(is.na(diabetes$hdl))
## [1] TRUE
anyNA(diabetes$hdl)
## [1] TRUE
mean(diabetes$hdl, na.rm = TRUE)
## [1] 50.44527

Notice that we need to deal with some missing data. We remove only the observations that are missing the variable of interest. Had we instead removed every observation with missing data in any variable, we would have fewer observations with which to calculate this statistic.
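To make the distinction concrete, here is a small sketch on a made-up data frame. The name `toy` and its values are assumptions for illustration and are not taken from `diabetes`. It contrasts dropping missing values only in the variable of interest with dropping every row that has a missing value anywhere.

```r
# Hypothetical toy data: one NA in hdl, one NA in a different column
toy = data.frame(
  hdl  = c(40, 50, NA, 60, 70),
  chol = c(200, NA, 180, 220, 210)
)

# Remove missing values only from the variable of interest
mean(toy$hdl, na.rm = TRUE)   # averages 4 observations

# Remove every row that has a missing value in any column
toy_complete = na.omit(toy)
nrow(toy_complete)            # only 3 rows remain
mean(toy_complete$hdl)        # averages those 3 observations
```

The second approach discards a row whose `hdl` value was perfectly usable, which is why the solution uses `na.rm = TRUE` on the single column.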

(d) What is the mean HDL of females in this sample?

Solution:

mean(subset(diabetes, gender == "female")$hdl)
## [1] 52.11111
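
The call above returns a number only because no female in this sample has a missing HDL value; if one did, the result would be NA. A more defensive variant might look like the following sketch, which uses a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical toy data; values are made up for illustration
toy = data.frame(
  gender = factor(c("female", "female", "female", "male", "male")),
  hdl    = c(50, NA, 60, 40, 44)
)

# Subset first, then drop missing HDL values when averaging
mean(subset(toy, gender == "female")$hdl, na.rm = TRUE)

# Or compute the mean for every gender at once
tapply(toy$hdl, toy$gender, mean, na.rm = TRUE)
```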

(e) Create a scatter plot of total cholesterol (y-axis) vs weight (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.

Solution:

plot(chol ~ weight, data = diabetes,
     xlab = "Weight (Pounds)",
     ylab = "Total Cholesterol (mg/dL)",
     main = "Total Cholesterol vs Weight",
     pch  = 20,
     cex  = 2,
     col  = "darkorange")

Overall, we see very little trend. Average total cholesterol seems nearly constant for different weights.
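
A visual impression like this can also be checked numerically. The sketch below is one possible approach and is not part of the original solution. It computes the sample correlation on a made-up data frame (`toy` is hypothetical), with `use = "complete.obs"` so that rows missing either variable are dropped. A value near 0 would be consistent with little linear trend.

```r
# Hypothetical toy data with a missing value in each column
toy = data.frame(
  weight = c(120, 150, 180, 210, NA, 240),
  chol   = c(210, 190, 205, NA, 200, 195)
)

# Correlation using only rows where both variables are observed
r = cor(toy$weight, toy$chol, use = "complete.obs")
r
```

The same call applied to `diabetes$weight` and `diabetes$chol` would give the corresponding value for the real data.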

(f) Create side-by-side boxplots for HDL by gender. Use non-default colors for the plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does there seem to be a difference in HDL level between the genders? Briefly explain.

Solution:

boxplot(hdl ~ gender, data = diabetes,
        xlab   = "Gender",
        ylab   = "High-Density Lipoprotein (mg/dL)",
        main   = "HDL vs Gender",
        pch    = 20,
        cex    = 2,
        col    = "darkorange",
        border = "dodgerblue")

Aside from slightly less variation among females, there seems to be very little difference in HDL level between the genders.
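
The numbers behind a boxplot can be inspected directly, which helps when judging differences in center and spread by eye. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical). It relies on `boxplot()` accepting `plot = FALSE`, in which case it returns the summaries without drawing anything.

```r
# Hypothetical toy data; values are made up for illustration
toy = data.frame(
  gender = factor(rep(c("female", "male"), each = 5)),
  hdl    = c(45, 50, 52, 55, 60, 30, 42, 50, 58, 75)
)

# plot = FALSE returns the summaries the boxplot would draw
bp = boxplot(hdl ~ gender, data = toy, plot = FALSE)
bp$names   # group labels, in plotting order
bp$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker

# Interquartile range per group as a simple measure of spread
tapply(toy$hdl, toy$gender, IQR)
```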


Exercise 2

For this exercise we will use the data stored in nutrition.csv. It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA. It is a cleaned version totaling 5138 observations and is current as of September 2015.

The variables in the dataset are:

(a) Create a histogram of Calories. Do not modify R’s default bin selection. Make the plot presentable. Describe the shape of the histogram. Do you notice anything unusual?

Solution:

library(readr)
nutrition = read_csv("nutrition.csv")
hist(nutrition$Calories,
     xlab   = "Calories (kcal)",
     main   = "Histogram of Calories for Various Foods",
     border = "dodgerblue",
     col    = "darkorange")
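
The exercise asks that R's default bin selection be left alone, so it can help to see what that default produces. The sketch below uses simulated values as a stand-in for a calorie column (`x` is hypothetical and not drawn from nutrition.csv). It shows how to inspect the breaks `hist()` chooses when `breaks` is not specified.

```r
set.seed(42)
# Hypothetical right-skewed values standing in for a calorie column
x = rexp(200, rate = 1 / 200)

# plot = FALSE returns the histogram object without drawing it
h = hist(x, plot = FALSE)
h$breaks           # bin edges R chose by default
length(h$counts)   # number of bins actually used

# The default rule, Sturges' formula, only suggests a bin count;
# R then picks "pretty" break points, so the two counts can differ
nclass.Sturges(x)
```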