Data Manipulation

Consider the msleep data from the ggplot2 package. Do the following:

Calculate the mean hours of REM sleep of individuals in this dataset.
Calculate the standard deviation of brain weight of individuals in this dataset.
Calculate the average body weight of carnivores in this dataset.
Create a new dataset that only contains observations from the original dataset that have no missing values.
Create a new dataset that only contains observations from the original dataset that are herbivores.
Create a new dataset that contains all the original variables except for conservation.
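One possible approach to the tasks above, sketched in base R. It assumes the ggplot2 package is installed so that msleep can be loaded, and the object names (mean_rem, msleep_complete, and so on) are illustrative choices, not part of the exercise.

```r
# Assumes ggplot2 is installed; msleep ships with that package.
data(msleep, package = "ggplot2")

# Mean hours of REM sleep; sleep_rem has missing values, so drop them.
mean_rem = mean(msleep$sleep_rem, na.rm = TRUE)

# Standard deviation of brain weight, again ignoring missing values.
sd_brain = sd(msleep$brainwt, na.rm = TRUE)

# Average body weight of carnivores. vore is missing for some rows,
# so which() is used to keep only rows where the comparison is TRUE.
carni_idx = which(msleep$vore == "carni")
mean_carni_bodywt = mean(msleep$bodywt[carni_idx])

# Observations with no missing values in any column.
msleep_complete = na.omit(msleep)

# Herbivores only; subset() drops rows where the condition is NA.
msleep_herbi = subset(msleep, vore == "herbi")

# All original variables except conservation.
msleep_no_cons = subset(msleep, select = -conservation)
```

The same results could be obtained with dplyr verbs such as filter() and select(); base R is used here only to keep the dependencies small.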

Plotting

Consider the birthwt data from the MASS package. Do the following:

Create a histogram of birth weights. Change the plot from its default appearance.
Create a scatter plot of birth weight (y-axis) vs mother’s weight before pregnancy (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables?
Create a scatter plot of birth weight (y-axis) vs mother’s age (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
Create side-by-side boxplots for birth weight grouped by smoking status. Use non-default colors for the plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does there seem to be a difference in birth weight for mothers who smoked?
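The plots above could be drawn with base R graphics along the following lines. This is a sketch only: the colors, bin count, titles, and axis labels are arbitrary choices, and it assumes that lwt (mother's weight in pounds at last menstrual period) is the variable meant by "mother's weight before pregnancy".

```r
library(MASS)  # provides the birthwt data

# Histogram of birth weight (grams) with non-default bins and colors.
hist(birthwt$bwt, breaks = 20, col = "darkorange", border = "white",
     main = "Birth Weights", xlab = "Birth Weight (grams)")

# Birth weight vs mother's weight (lwt, assumed to be the intended variable).
plot(bwt ~ lwt, data = birthwt, col = "dodgerblue", pch = 20,
     main = "Birth Weight vs Mother's Weight",
     xlab = "Mother's Weight (pounds)", ylab = "Birth Weight (grams)")

# Birth weight vs mother's age.
plot(bwt ~ age, data = birthwt, col = "firebrick", pch = 20,
     main = "Birth Weight vs Mother's Age",
     xlab = "Mother's Age (years)", ylab = "Birth Weight (grams)")

# Side-by-side boxplots by smoking status (smoke: 0 = no, 1 = yes).
boxplot(bwt ~ smoke, data = birthwt, col = c("grey70", "firebrick"),
        names = c("Non-smoker", "Smoker"),
        main = "Birth Weight by Smoking Status",
        xlab = "Smoking Status During Pregnancy",
        ylab = "Birth Weight (grams)")
```

The code only produces the plots; judging whether a relationship or difference is visible is left to the reader, as the exercise intends.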

Consider the Auto data from the ISLR package. Do the following:

Coerce the cylinders variable to be a factor variable. (Think about why this is a reasonable thing to do. Do you think it is?)
Fit a multiple linear regression model with mpg as the response and cylinders, horsepower, and weight as predictors.
Test for significance of cylinders.
Use your fitted model to create a 99% confidence interval for the mean fuel efficiency of a car that has 4 cylinders, 100 horsepower, and weighs 2700 pounds.
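A minimal sketch of the Auto tasks, assuming the ISLR package is installed. The names auto, mpg_mod, mpg_mod_null, and new_car are illustrative. Because cylinders enters the model as several dummy variables once it is a factor, the sketch tests it with a partial F-test that compares the models with and without it; other testing approaches may also be acceptable.

```r
library(ISLR)  # provides the Auto data; assumed to be installed

# Work on a copy so the packaged data is left untouched.
auto = Auto
auto$cylinders = as.factor(auto$cylinders)

# Additive model with mpg as the response.
mpg_mod = lm(mpg ~ cylinders + horsepower + weight, data = auto)

# Partial F-test for cylinders: compare against the model without it.
mpg_mod_null = lm(mpg ~ horsepower + weight, data = auto)
cyl_test = anova(mpg_mod_null, mpg_mod)
cyl_test

# New observation; cylinders must be a factor with the same levels
# as in the data used to fit the model.
new_car = data.frame(
  cylinders  = factor("4", levels = levels(auto$cylinders)),
  horsepower = 100,
  weight     = 2700
)

# 99% confidence interval for the mean response at new_car.
mpg_ci = predict(mpg_mod, newdata = new_car,
                 interval = "confidence", level = 0.99)
mpg_ci
```

Note that interval = "confidence" gives an interval for the mean response, which is what the exercise asks for; interval = "prediction" would instead give a wider interval for a single new car.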

Consider the Boston data from the MASS package. (Have you seen this terribly boring and overused dataset before?) Do the following:

Run the code provided below to obtain a test-train split of the data.
Fit a multiple linear regression model using the training data with medv as the response and all other variables used as predictors.
Use your trained model to predict the value of medv for each observation in the test dataset.
Evaluate how well your model predicts by…
- Calculating the RMSE in the test data.
- Plotting actual versus predicted values for the test data.
Bonus: Repeat the above, but use a random forest instead of a linear model.

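library(MASS)

# Fix the seed so the split is reproducible.
set.seed(42)
# Randomly choose 250 row indices for training; the remaining rows form the test set.
boston_idx = sample(1:nrow(Boston), size = 250)
trn_boston = Boston[boston_idx, ]
tst_boston = Boston[-boston_idx, ]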
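Building on the split, the fitting and evaluation steps might look like the sketch below. The split is repeated so the block runs on its own; boston_mod, tst_pred, and tst_rmse are illustrative names, and the plot styling is an arbitrary choice.

```r
library(MASS)

# Same split as the provided code, repeated so this sketch is self-contained.
set.seed(42)
boston_idx = sample(1:nrow(Boston), size = 250)
trn_boston = Boston[boston_idx, ]
tst_boston = Boston[-boston_idx, ]

# Linear model with medv as the response and all other variables as predictors.
boston_mod = lm(medv ~ ., data = trn_boston)

# Predicted medv for each observation in the test data.
tst_pred = predict(boston_mod, newdata = tst_boston)

# Test RMSE: square root of the mean squared prediction error.
tst_rmse = sqrt(mean((tst_boston$medv - tst_pred) ^ 2))
tst_rmse

# Actual versus predicted values; points near the line y = x are predicted well.
plot(tst_pred, tst_boston$medv, col = "dodgerblue", pch = 20,
     main = "Boston: Actual vs Predicted (Test Data)",
     xlab = "Predicted medv", ylab = "Actual medv")
abline(a = 0, b = 1, col = "darkorange", lwd = 2)
```

For the bonus, the same steps should carry over by replacing lm() with randomForest() from the randomForest package (assumed to be installed), since it accepts the same formula and data arguments and also has a predict() method.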