For this exercise, we will use the diabetes dataset from the faraway package.
(a) Install and load the faraway package. Do not include the installation command in your .Rmd file. (If you do, it will install the package every time you knit your file.) Do include the command to load the package into your environment.
Solution:
library(faraway)
(b) Coerce the data to be a tibble instead of a data frame. (You will need the tibble package to do so.) How many observations are in this dataset? How many variables? Who are the individuals in this dataset?
Solution:
library(tibble)
diabetes = as_tibble(diabetes)
diabetes
## # A tibble: 403 x 19
## id chol stab… hdl ratio glyhb loca… age gend… heig… weig… frame
## * <int> <int> <int> <int> <dbl> <dbl> <fct> <int> <fct> <int> <int> <fct>
## 1 1000 203 82 56 3.60 4.31 Buck… 46 fema… 62 121 medi…
## 2 1001 165 97 24 6.90 4.44 Buck… 29 fema… 64 218 large
## 3 1002 228 92 37 6.20 4.64 Buck… 58 fema… 61 256 large
## 4 1003 78 93 12 6.50 4.63 Buck… 67 male 67 119 large
## 5 1005 249 90 28 8.90 7.72 Buck… 64 male 68 183 medi…
## 6 1008 248 94 69 3.60 4.81 Buck… 34 male 71 190 large
## 7 1011 195 92 41 4.80 4.84 Buck… 30 male 69 191 medi…
## 8 1015 227 75 44 5.20 3.94 Buck… 37 male 59 170 medi…
## 9 1016 177 87 49 3.60 4.84 Buck… 45 male 69 166 large
## 10 1022 263 89 40 6.60 5.78 Buck… 55 fema… 63 202 small
## # ... with 393 more rows, and 7 more variables: bp.1s <int>, bp.1d <int>,
## # bp.2s <int>, bp.2d <int>, waist <int>, hip <int>, time.ppn <int>
?diabetes
We find there are 403 observations and 19 variables that describe African Americans from central Virginia.
(c) What is the mean HDL level (High Density Lipoprotein) of individuals in this sample?
Solution:
any(is.na(diabetes$hdl))
## [1] TRUE
anyNA(diabetes$hdl)
## [1] TRUE
mean(diabetes$hdl, na.rm = TRUE)
## [1] 50.44527
Notice that we need to deal with some missing data. We only remove observations with missing data from the variable of interest. Had we instead removed any observation with missing data, we would have less data to calculate this statistic.
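The difference between the two approaches can be seen on a small made-up data frame. This is only a sketch: the values below are hypothetical and are not taken from diabetes.

```r
# Hypothetical toy data with missing values in two different columns
toy = data.frame(
  hdl  = c(50, NA, 60, 40),
  chol = c(200, 180, NA, 220)
)

# Drop missing values only from the variable of interest: 3 values are used
hdl_only = mean(toy$hdl, na.rm = TRUE)

# Drop every row with any missing value: only 2 rows remain
complete_rows = na.omit(toy)
hdl_complete = mean(complete_rows$hdl)

hdl_only      # 50
hdl_complete  # 45
```

Here the row with a missing chol still has a usable hdl value, which na.omit() throws away.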
(d) What is the mean HDL of females in this sample?
Solution:
mean(subset(diabetes, gender == "female")$hdl)
## [1] 52.11111
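The means for both genders can also be computed in one call. The following is a minimal sketch using tapply() on hypothetical toy values (not the diabetes data). Passing na.rm = TRUE through to mean() keeps a missing value in either group from turning that group's mean into NA.

```r
# Hypothetical toy data, for illustration only
toy = data.frame(
  gender = c("female", "male", "female", "male"),
  hdl    = c(50, NA, 60, 40)
)

# Apply mean() to hdl within each level of gender;
# extra arguments such as na.rm are passed on to mean()
group_means = tapply(toy$hdl, toy$gender, mean, na.rm = TRUE)
group_means
```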
(e) Create a scatter plot of total cholesterol (y-axis) vs weight (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
Solution:
plot(chol ~ weight, data = diabetes,
xlab = "Weight (Pounds)",
ylab = "Total Cholesterol (mg/dL)",
main = "Total Cholesterol vs Weight",
pch = 20,
cex = 2,
col = "darkorange")
Overall, we see very little trend. Average total cholesterol seems nearly constant for different weights.
(f) Create side-by-side boxplots for HDL by gender. Use non-default colors for the plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does there seem to be a difference in HDL level between the genders? Briefly explain.
Solution:
boxplot(hdl ~ gender, data = diabetes,
xlab = "Gender",
ylab = "High-Density Lipoprotein (mg/dL)",
main = "HDL vs Gender",
pch = 20,
cex = 2,
col = "darkorange",
border = "dodgerblue")
Aside from slightly less variation among females, there seems to be very little difference in HDL level between the genders.
For this exercise, we will use the data stored in nutrition.csv. It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA. It is a cleaned version totaling 5138 observations and is current as of September 2015.
The variables in the dataset are:
ID
Desc - Short description of food
Water - in grams
Calories - in kcal
Protein - in grams
Fat - in grams
Carbs - Carbohydrates, in grams
Fiber - in grams
Sugar - in grams
Calcium - in milligrams
Potassium - in milligrams
Sodium - in milligrams
VitaminC - Vitamin C, in milligrams
Chol - Cholesterol, in milligrams
Portion - Description of standard serving size used in analysis
(a) Create a histogram of Calories. Do not modify R’s default bin selection. Make the plot presentable. Describe the shape of the histogram. Do you notice anything unusual?
Solution:
library(readr)
nutrition = read_csv("nutrition.csv")
hist(nutrition$Calories,
xlab = "Calories (kcal)",
main = "Histogram of Calories for Various Foods",
border = "dodgerblue",
col = "darkorange")
The distribution of Calories is right-skewed. There are two odd spikes, one around 400 kcal and one past 800 kcal. Perhaps some foods are being rounded to 400, or portion sizes are created with 400 kcal in mind. Also, perhaps there is an upper limit, and portion sizes are created to keep calories close to 900 but not above.
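To check spikes like these numerically rather than by eye, one option is to ask hist() for the bin counts without drawing the plot. The following is a minimal sketch on a small made-up vector. The values are hypothetical and do not come from nutrition.csv, and the name bins is used only for illustration.

```r
# Hypothetical calorie values, for illustration only (not from nutrition.csv)
calories = c(20, 45, 80, 120, 150, 210, 390, 400, 400, 410, 880, 890, 900)

# plot = FALSE returns the histogram object without drawing it
h = hist(calories, plot = FALSE)

# One row per bin: left edge, right edge, and number of values in the bin
bins = data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
bins
```

Applied to nutrition$Calories, unusually tall bins in a table like this would correspond to the spikes seen in the plot.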
(b) Create a scatter plot of Calories (y-axis) vs 4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber (x-axis). Make the plot presentable. You will either need to add a new variable to the data frame, or use the I() function in your formula in the call to plot(). If you are at all familiar with nutrition, you may realize that this formula calculates the calorie count based on the protein, carbohydrate, and fat values. You’d expect then that the result here is a straight line. Is it? If not, can you think of any reasons why it is not?
Solution:
plot(Calories ~ I(4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber), data = nutrition,
xlab = "4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber (kcal)",
ylab = "Calories (kcal)",
main = "Calories vs Calories Estimated from Macronutrients",
pch = 20,
cex = 1,
col = "darkorange")
The result is not a straight line. There could be any number of reasons:
For each of the following parts, use the following vectors:
a = 1:10
b = 10:1
c = rep(1, times = 10)
d = 2 ^ (1:10)
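(a) Write a function called sum_of_squares.
Arguments: a vector of numeric data x.
Output: the sum of the squares of the elements of the vector, $\sum_{i = 1}^{n} x_i^2$.
Provide your function, as well as the result of running the following code: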
sum_of_squares(x = a)
sum_of_squares(x = c(c, d))
Solution:
sum_of_squares = function(x) {
sum(x ^ 2)
}
sum_of_squares(x = a)
## [1] 385
sum_of_squares(x = c(c, d))
## [1] 1398110
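Both results can be cross-checked against closed-form sums. The sketch below redefines the function so that it is self-contained. The names n, closed_a, and closed_cd are introduced here only for illustration. It uses the standard identities $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$ and, since $(2^k)^2 = 4^k$, the geometric series $\sum_{k=1}^{n} 4^k = (4^{n+1} - 4)/3$.

```r
sum_of_squares = function(x) {
  sum(x ^ 2)
}

n = 10

# Closed form for 1^2 + 2^2 + ... + n^2
closed_a = n * (n + 1) * (2 * n + 1) / 6

# rep(1, times = n) contributes n; squaring 2^k gives 4^k, a geometric series
closed_cd = n + (4 ^ (n + 1) - 4) / 3

closed_a   # 385
closed_cd  # 1398110

# Compare with the function applied to the same vectors
all.equal(closed_a, sum_of_squares(1:n))
all.equal(closed_cd, sum_of_squares(c(rep(1, times = n), 2 ^ (1:n))))
```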
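(b) Write a function called rms_diff.
Arguments: a vector of numeric data x and a vector of numeric data y.
Output: the root mean square of the differences, $\sqrt{\frac{1}{n} \sum_{i = 1}^{n} (x_i - y_i)^2}$.
If the vectors have different lengths, the shorter vector should be repeated until it matches the length of the longer vector.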
Provide your function, as well as the result of running the following code:
rms_diff(x = a, y = b)
rms_diff(x = d, y = c)
rms_diff(x = d, y = 1)
rms_diff(x = a, y = 0) ^ 2 * length(a)
Solution:
rms_diff = function(x, y) {
sqrt(mean((x - y) ^ 2))
}
rms_diff(x = a, y = b)
## [1] 5.744563
rms_diff(x = d, y = c)
## [1] 373.3655
rms_diff(x = d, y = 1)
## [1] 373.3655
rms_diff(x = a, y = 0) ^ 2 * length(a)
## [1] 385
Notice the value 385 appears again!
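The repeated value follows from R's vector recycling. In x - y, a length-one y is repeated to the length of x, so with y = 0 the squared differences are just the squares of a. Squaring the RMS and multiplying by the length then undoes the square root and the mean. A minimal sketch, redefining the functions so it runs on its own (c_vec and d_vec are stand-in names for the vectors c and d above, chosen to avoid masking the c() function):

```r
sum_of_squares = function(x) {
  sum(x ^ 2)
}

rms_diff = function(x, y) {
  sqrt(mean((x - y) ^ 2))
}

a     = 1:10
c_vec = rep(1, times = 10)
d_vec = 2 ^ (1:10)

# y = 0 is recycled to length(a), so (a - 0)^2 is a^2 and the mean is sum(a^2) / length(a)
all.equal(rms_diff(x = a, y = 0) ^ 2 * length(a), sum_of_squares(a))

# The scalar 1 is recycled to ten 1s, which is exactly c_vec
all.equal(rms_diff(x = d_vec, y = 1), rms_diff(x = d_vec, y = c_vec))
```

Note that R recycles silently only when the longer length is a multiple of the shorter one. Otherwise the arithmetic still runs but emits a warning.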