# Overview

What college attributes predict the number of applications the college receives?

## Details

The dataset was compiled from an issue of US News and World Report summarizing statistics on colleges in 1995. The data are available as part of the ISLR R package, but for this project, please load the edited version `Data_2L_2M.RData`. This loads two data sets: one called `Train`, which you will use to build models, and one called `Test`, which you will use to evaluate models.

## Data Description

| Variable | Description |
|----------|-------------|
| Private | A factor with levels No and Yes indicating private or public university |
| Top10perc | Pct. new students from top 10% of H.S. class |
| Top25perc | Pct. new students from top 25% of H.S. class |
| Outstate | Out-of-state tuition |
| Room.Board | Room and board costs |
| Books | Estimated book costs |
| Personal | Estimated personal spending |
| PhD | Pct. of faculty with Ph.D.'s |
| Terminal | Pct. of faculty with terminal degree |
| S.F.Ratio | Student/faculty ratio |
| perc.alumni | Pct. alumni who donate |
| Expend | Instructional expenditure per student |
| Grad.Rate | Graduation rate |
| AcceptanceRate | Number of applications accepted / number of applications received |
| EnrollmentRate | Number of students enrolled / number of applications accepted |
| FT.Proportion | Proportion of undergraduates enrolled full-time |
| log10Apps | Log10-transformed number of applications received |

## Objectives

As stated above, the overall objective is to identify variables that predict the number of applications a college receives in a linear model, and to evaluate these variables' ability to minimize prediction error. To do this, you will build models in the training data (`Train`) and evaluate their performance in the testing data (`Test`).

Specifically, suppose we try to predict the number of applications received from the percent of new students coming from the top 10% of their high school class. We may build the model in the Training set and apply it to the Testing set as follows:

```r
load("Data/Data_2L_2M.RData")

# Fit a univariate model in the Training data
example_mod = lm(log10Apps ~ Top10perc, data = Train)
# summary(example_mod)

# Predict in the Testing data, then undo the log10 transform
# to put predictions back on the scale of application counts
example_mod_test_pred = predict(example_mod, newdata = Test)
10 ^ example_mod_test_pred[1:10]
##     Agnes Scott College       Albertson College Albertus Magnus College 
##                3491.768                2094.404                1285.777 
##     Anderson University      Andrews University      Antioch University 
##                1346.934                1199.209                1548.416 
##        Augsburg College Baldwin-Wallace College          Beaver College 
##                1144.760                1739.156                1478.111 
##       Bethel College KS 
##                1346.934
```

We may then calculate the root mean squared error (RMSE) in the Test data by taking the square root of the average (squared) distance between those predictions and the true (log-transformed) number of applications:
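In symbols, writing $\hat{y}_i$ for the predicted and $y_i$ for the observed log10Apps of the $n$ colleges in the Test data:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$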

```r
# what are the units of this statistic?
sqrt(mean((example_mod_test_pred - Test$log10Apps) ^ 2))
## [1] 0.4528126
```

Ideally, we would like to choose the model that minimizes the RMSE in the Test data, but without simply evaluating every model there. That is, we would like a criterion, computed in the Training data alone, that leads us to select a model which ultimately has low RMSE in the Test data. In what follows, restrict consideration to the first eight variables listed above: Private, Top10perc, Top25perc, Outstate, Room.Board, Books, Personal, PhD.
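For convenience in the sketches that follow the task list, these eight candidates can be collected in a character vector (`predictors` is a name introduced here for illustration, not an object supplied with the data):

```r
# The eight candidate predictors considered in this project
predictors <- c("Private", "Top10perc", "Top25perc", "Outstate",
                "Room.Board", "Books", "Personal", "PhD")
```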

  1. Consider a few variables one at a time as predictors for the (log-transformed) number of applications received, looking only at the Training data set. Based on p-values / coefficient magnitudes, which variables appear to be important predictors in these univariate models? Provide some graphical and numerical summaries, as appropriate. (A starting point is sketched in the first code block after this list.)
  2. Choose one variable (or more than one) of particular interest and describe in more detail a linear model using it to predict the (log-transformed) number of applications received. How well does the model fit? Are there any outliers? Choose a few schools of interest: how close is the model-predicted number of applications to the actual number of applications? Provide graphical and numerical summaries, as appropriate. (See the second sketch below.)
  3. For a couple of variables, use the Training data set to estimate the RMSE (i.e., with the above commands, setting newdata = Train) for a univariate model predicting the (log-transformed) number of applications received. Also calculate the RMSE in the Testing data for each variable, and compare these graphically. Does the RMSE calculated in the Training data tend to underestimate the RMSE in the Testing data? (See the third sketch below.)
  4. Learn about cross-validation, and use cross-validation to better estimate the RMSE from the Training data alone. Compare the cross-validation-based estimate of the RMSE to the "actual" RMSE in the Testing data. How do these compare? A good resource for this is the book An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani, Chapter 5, pages 176-186. The book is free to download: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf The companion website http://www-bcf.usc.edu/~gareth/ISL/ also has many useful resources, including a lab which performs some of these analyses (on a different data set): http://www-bcf.usc.edu/~gareth/ISL/Chapter%205%20Lab.txt (See the fourth sketch below.)
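For item 1, one possible starting point is a loop over the candidates that tabulates the univariate results; this is a sketch only, assuming the `predictors` vector defined above:

```r
# Fit one univariate model per candidate predictor in the Training data,
# then collect the slope p-value and R-squared from each fit.
uni_fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "log10Apps"), data = Train)
})
uni_summary <- data.frame(
  variable  = predictors,
  # row 2 of the coefficient table is the slope
  # (for Private, it is the PrivateYes contrast)
  p.value   = sapply(uni_fits, function(m) coef(summary(m))[2, "Pr(>|t|)"]),
  r.squared = sapply(uni_fits, function(m) summary(m)$r.squared)
)
uni_summary[order(uni_summary$p.value), ]
```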
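For item 2, a sketch of a closer look at one model. Top10perc and the named schools are illustrative choices, and the indexing assumes (as the output above suggests) that the row names of Test are the school names:

```r
# Examine one univariate model in more detail
mod <- lm(log10Apps ~ Top10perc, data = Train)
summary(mod)            # coefficients, residual SE, R-squared

plot(log10Apps ~ Top10perc, data = Train)  # does a line look reasonable?
abline(mod)
plot(mod, which = 1)    # residuals vs. fitted values: any outliers?

# Predicted vs. actual application counts for a few schools of interest
schools <- c("Agnes Scott College", "Andrews University")
data.frame(
  predicted = round(10 ^ predict(mod, newdata = Test[schools, ])),
  actual    = round(10 ^ Test[schools, "log10Apps"])
)
```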
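For item 3, a sketch that computes the RMSE of each univariate model in both data sets and compares them graphically; `rmse()` is a hypothetical helper written here, not part of any package:

```r
# RMSE of a univariate model fit in Train, evaluated in `data`
rmse <- function(variable, data) {
  m <- lm(reformulate(variable, response = "log10Apps"), data = Train)
  sqrt(mean((predict(m, newdata = data) - data$log10Apps) ^ 2))
}
rmse_train <- sapply(predictors, rmse, data = Train)
rmse_test  <- sapply(predictors, rmse, data = Test)

# Points above the 45-degree line have higher RMSE in Test than in Train
plot(rmse_train, rmse_test, xlab = "RMSE in Train", ylab = "RMSE in Test")
abline(0, 1)
```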
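For item 4, a sketch of 10-fold cross-validation using cv.glm() from the boot package, in the style of the ISLR Chapter 5 lab. Note that cv.glm() reports mean squared error, so take the square root to get an RMSE:

```r
library(boot)   # provides cv.glm()

set.seed(1)     # the fold assignment is random
# glm() with the default gaussian family fits the same model as lm()
cv_fit  <- glm(log10Apps ~ Top10perc, data = Train)
cv_rmse <- sqrt(cv.glm(Train, cv_fit, K = 10)$delta[1])
cv_rmse

# "Actual" RMSE of the same model in the Testing data, for comparison
sqrt(mean((predict(cv_fit, newdata = Test) - Test$log10Apps) ^ 2))
```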