# Overview
What college attributes predict the number of applications the college receives?
The dataset was compiled from an issue of US News and World Report summarizing statistics on colleges in 1995. The data are available as part of the ISLR R package, but for this project, please load in an edited version, `Data_2L_2M.RData`. This will load two data sets: one called `Train`, which you will use to build models, and one called `Test`, which you will use to evaluate models.
| Variable | Description |
|---|---|
| `Private` | A factor with levels No and Yes indicating private or public university |
| `Top10perc` | Pct. new students from top 10% of H.S. class |
| `Top25perc` | Pct. new students from top 25% of H.S. class |
| `Outstate` | Out-of-state tuition |
| `Room.Board` | Room and board costs |
| `Books` | Estimated book costs |
| `Personal` | Estimated personal spending |
| `PhD` | Pct. of faculty with Ph.D.'s |
| `Terminal` | Pct. of faculty with terminal degree |
| `S.F.Ratio` | Student/faculty ratio |
| `perc.alumni` | Pct. alumni who donate |
| `Expend` | Instructional expenditure per student |
| `Grad.Rate` | Graduation rate |
| `AcceptanceRate` | Number of applications accepted / Number of applications received |
| `EnrollmentRate` | Number of students enrolled / Number of applications accepted |
| `FT.Proportion` | Proportion of undergraduates enrolled full-time |
| `log10Apps` | Log10-transformed number of applications received |
As stated above, the overall objective is to identify variables that predict the number of applications a college receives in a linear model, and to evaluate these variables' ability to minimize prediction error. To do this, you will consider models built in the training data (`Train`) and evaluate their performance in the testing data (`Test`).
Specifically, suppose we try to predict the number of applications received from the percent of new students coming from the top 10% of their high school class. We may build the model in the Training set and apply it to the Testing set as follows:
```r
load("Data/Data_2L_2M.RData")
example_mod = lm(log10Apps ~ Top10perc, data = Train)
# summary(example_mod)
example_mod_test_pred = predict(example_mod, newdata = Test)
10 ^ example_mod_test_pred[1:10]
```

```
##     Agnes Scott College       Albertson College Albertus Magnus College
##                3491.768                2094.404                1285.777
##     Anderson University      Andrews University      Antioch University
##                1346.934                1199.209                1548.416
##        Augsburg College Baldwin-Wallace College          Beaver College
##                1144.760                1739.156                1478.111
##        Bethel College KS
##                1346.934
```
We may then calculate the root mean squared error (RMSE) in the `Test` data by taking the square root of the average squared distance between those predictions and the true values of `log10Apps` (the log10-transformed number of applications):
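In symbols, if $\hat{y}_i$ denotes the model's prediction of `log10Apps` for the $i$-th college in the `Test` data and $y_i$ the observed value, then over the $n$ Test colleges:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$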
```r
# what are the units of this statistic?
sqrt(mean((example_mod_test_pred - Test$log10Apps) ^ 2))
```

```
## [1] 0.4528126
```
Ideally, we would like to choose the model that minimizes the RMSE in the `Test` data, but without simply evaluating every model in the `Test` data. That is, we would like a criterion, evaluated in the Training data, that leads us to select a model which ultimately has low RMSE in the `Test` data. In what follows, restrict consideration to the first eight listed variables: `Private`, `Top10perc`, `Top25perc`, `Outstate`, `Room.Board`, `Books`, `Personal`, `PhD`.
For each of these eight variables, calculate the RMSE in the Training data (i.e., predicting with `newdata = Train`) for a univariate model predicting the (log-transformed) number of applications received. Also calculate the RMSE in the Testing data for each variable, and compare these graphically. Does the RMSE calculated in the Training data tend to underestimate the RMSE in the Testing data?
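One way to organize this comparison is a loop over the candidate predictors, computing the Training and Testing RMSE for each univariate model. The sketch below is illustrative only: it builds small synthetic stand-ins for `Train` and `Test` with made-up columns (the real ones come from `Data_2L_2M.RData`, and you would loop over all eight variables), so the numbers it produces are meaningless, but the same pattern applies once the real data are loaded.

```r
set.seed(1)

# Synthetic stand-ins for the real Train/Test data (illustration only;
# replace with load("Data/Data_2L_2M.RData") in the actual project)
make_df <- function(n) {
  data.frame(
    Top10perc = runif(n, 0, 60),
    Outstate  = runif(n, 2000, 20000),
    log10Apps = rnorm(n, mean = 3, sd = 0.5)
  )
}
Train <- make_df(100)
Test  <- make_df(100)

# Root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# With the real data, list all eight candidate variables here
vars <- c("Top10perc", "Outstate")

# Fit a univariate model per variable; record Train and Test RMSE
results <- sapply(vars, function(v) {
  mod <- lm(reformulate(v, response = "log10Apps"), data = Train)
  c(train = rmse(predict(mod, newdata = Train), Train$log10Apps),
    test  = rmse(predict(mod, newdata = Test),  Test$log10Apps))
})
round(t(results), 3)  # one row per variable, columns: train, test
```

From here, a simple scatterplot of Training RMSE against Testing RMSE (one point per variable) makes the comparison in the question straightforward.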