Data Manipulation
Consider the msleep data from the ggplot2 package. Do the following:
- Calculate the mean hours of REM sleep of individuals in this dataset.
- Calculate the standard deviation of brain weight of individuals in this dataset.
- Calculate the average body weight of carnivores in this dataset.
- Create a new dataset that only contains observations from the original dataset that have no missing values.
- Create a new dataset that only contains observations from the original dataset that are herbivores.
- Create a new dataset that contains all the original variables except for conservation.
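One possible way to approach the tasks above is sketched here in base R. It assumes the ggplot2 package is installed, that the relevant columns are sleep_rem, brainwt, bodywt, vore, and conservation, and that carnivores and herbivores are coded as "carni" and "herbi" in vore. The object names (carni, msleep_complete, msleep_herbi, msleep_no_cons) are made up for illustration.

```r
library(ggplot2)  # provides the msleep data

# Mean hours of REM sleep; na.rm = TRUE because sleep_rem has missing values
mean_rem = mean(msleep$sleep_rem, na.rm = TRUE)

# Standard deviation of brain weight, again skipping missing values
sd_brain = sd(msleep$brainwt, na.rm = TRUE)

# Average body weight of carnivores; which() drops rows where vore is NA
carni = msleep[which(msleep$vore == "carni"), ]
mean_carni_bodywt = mean(carni$bodywt, na.rm = TRUE)

# Keep only observations with no missing values in any column
msleep_complete = na.omit(msleep)

# Keep only herbivores; subset() also drops rows where vore is NA
msleep_herbi = subset(msleep, vore == "herbi")

# Keep every variable except conservation
msleep_no_cons = subset(msleep, select = -conservation)
```

The same steps could be written with dplyr (filter(), select(), summarise()) if that is the style used elsewhere; base R is used here only to keep the dependencies small.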
Plotting
Again consider the birthwt data from the MASS package. Do the following:
- Create a histogram of birth weights. Change the plot from its default appearance.
- Create a scatter plot of birth weight (y-axis) vs mother’s weight before pregnancy (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables?
- Create a scatter plot of birth weight (y-axis) vs mother’s age (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
- Create side-by-side boxplots for birth weight grouped by smoking status. Use non-default colors for the plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does there seem to be a difference in birth weight for mothers who smoked?
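The plots above could be drawn with base R graphics along the following lines. This is only a sketch: it assumes that bwt (birth weight in grams), lwt (mother's weight in pounds at last menstrual period), age, and smoke (0 or 1) are the intended variables, and the colors, titles, and bin count are arbitrary choices.

```r
library(MASS)  # provides the birthwt data

# Histogram of birth weight with non-default color, border, and bin count
hist(birthwt$bwt, breaks = 20, col = "dodgerblue", border = "white",
     main = "Distribution of Birth Weight",
     xlab = "Birth weight (grams)")

# Birth weight versus mother's weight
plot(bwt ~ lwt, data = birthwt, col = "darkorange", pch = 20,
     main = "Birth Weight vs Mother's Weight",
     xlab = "Mother's weight at last menstrual period (pounds)",
     ylab = "Birth weight (grams)")

# Birth weight versus mother's age
plot(bwt ~ age, data = birthwt, col = "purple", pch = 20,
     main = "Birth Weight vs Mother's Age",
     xlab = "Mother's age (years)",
     ylab = "Birth weight (grams)")

# Side-by-side boxplots by smoking status (smoke is coded 0 = no, 1 = yes)
bp = boxplot(bwt ~ smoke, data = birthwt,
             names = c("Non-smoker", "Smoker"),
             col = c("lightblue", "salmon"),
             main = "Birth Weight by Smoking Status",
             xlab = "Smoking status during pregnancy",
             ylab = "Birth weight (grams)")
```

The interpretation questions still need to be answered by looking at the resulting plots; the code only produces them.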
Modeling and Inference
Consider the Auto data from the ISLR package. Do the following:
- Coerce the cylinders variable to be a factor variable. (Think about why this is a reasonable thing to do. Do you think it is?)
- Fit a multiple linear regression model with mpg as the response and cylinders, horsepower, and weight as predictors.
- Test for significance of cylinders.
- Use your fitted model to create a 99% confidence interval for the mean fuel efficiency of a car that has 4 cylinders, 100 horsepower, and weighs 2700 pounds.
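A minimal sketch of the steps above follows. It assumes the ISLR package is installed and interprets "test for significance of cylinders" as a nested-model F-test via anova(), which is one reasonable choice rather than the only one. The names auto, fit_full, fit_null, and new_car are made up for illustration.

```r
library(ISLR)  # provides the Auto data

# Work on a copy so the original Auto data is left untouched
auto = Auto
auto$cylinders = as.factor(auto$cylinders)

# Model with cylinders (as a factor), horsepower, and weight
fit_full = lm(mpg ~ cylinders + horsepower + weight, data = auto)

# Model without cylinders, for comparison
fit_null = lm(mpg ~ horsepower + weight, data = auto)

# F-test of whether the cylinders terms improve on the smaller model
cyl_test = anova(fit_null, fit_full)
cyl_test

# New observation; the factor must use the same levels as the training data
new_car = data.frame(
  cylinders  = factor(4, levels = levels(auto$cylinders)),
  horsepower = 100,
  weight     = 2700
)

# 99% confidence interval for the mean mpg of such cars
ci = predict(fit_full, newdata = new_car,
             interval = "confidence", level = 0.99)
ci
```

Note that interval = "confidence" targets the mean response, which is what the last task asks for; interval = "prediction" would instead give an interval for a single new car.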
Modeling and Prediction
Consider the Boston data from the MASS package. (Have you seen this terribly boring and overused dataset before?) Do the following:
- Run the code provided below to obtain a test-train split of the data.
- Fit a multiple linear regression model using the training data with medv as the response and all other variables used as predictors.
- Use your trained model to predict the value of medv for each observation in the test dataset.
- Evaluate how well your model predicts by:
  - Calculating the RMSE in the test data.
  - Plotting actual versus predicted values for the test data.
- Bonus: Repeat the above, but use a random forest instead of a linear model.
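library(MASS)  # provides the Boston data
set.seed(42)   # fix the seed so the split is reproducible
# Sample 250 row indices for training; the remaining rows form the test set
boston_idx = sample(1:nrow(Boston), size = 250)
trn_boston = Boston[boston_idx, ]
tst_boston = Boston[-boston_idx, ]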
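Building on that split, the fitting and evaluation steps might look like the sketch below. The split is repeated so the block runs on its own. The helper calc_rmse and the object names are hypothetical, and the bonus part assumes the randomForest package, which is skipped if it is not installed.

```r
library(MASS)

# Same test-train split as provided in the exercise
set.seed(42)
boston_idx = sample(1:nrow(Boston), size = 250)
trn_boston = Boston[boston_idx, ]
tst_boston = Boston[-boston_idx, ]

# Hypothetical helper: root mean squared error
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Linear model with medv as the response and all other variables as predictors
fit_lm = lm(medv ~ ., data = trn_boston)

# Predictions and RMSE on the held-out test data
pred_lm = predict(fit_lm, newdata = tst_boston)
rmse_lm = calc_rmse(tst_boston$medv, pred_lm)
rmse_lm

# Actual versus predicted; points near the y = x line are predicted well
plot(pred_lm, tst_boston$medv, col = "dodgerblue", pch = 20,
     main = "Boston Test Data: Actual vs Predicted",
     xlab = "Predicted medv", ylab = "Actual medv")
abline(0, 1, col = "darkorange", lwd = 2)

# Bonus: random forest with default settings, only if the package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_rf  = randomForest::randomForest(medv ~ ., data = trn_boston)
  pred_rf = predict(fit_rf, newdata = tst_boston)
  rmse_rf = calc_rmse(tst_boston$medv, pred_rf)
  print(rmse_rf)
}
```

Comparing rmse_lm with rmse_rf on the same test set gives a rough sense of which model predicts better here, though a single split is only a noisy estimate.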