# Overview

What college attributes predict the number of applications the college receives?

## Details

The dataset was compiled from an issue of US News and World Report summarizing statistics on colleges in 1995. The data are available as part of the ISLR R package, but for this project, please load the edited version `Data_2L_2M.RData`. This loads two data sets: one called `Train`, which you will use to build models, and one called `Test`, which you will use to evaluate models.

## Data Description

| Variable | Description |
|----------|-------------|
| Private | A factor with levels No and Yes indicating private or public university |
| Top10perc | Pct. new students from top 10% of H.S. class |
| Top25perc | Pct. new students from top 25% of H.S. class |
| Outstate | Out-of-state tuition |
| Room.Board | Room and board costs |
| Books | Estimated book costs |
| Personal | Estimated personal spending |
| PhD | Pct. of faculty with Ph.D.'s |
| Terminal | Pct. of faculty with terminal degree |
| S.F.Ratio | Student/faculty ratio |
| perc.alumni | Pct. alumni who donate |
| Expend | Instructional expenditure per student |
| Grad.Rate | Graduation rate |
| AcceptanceRate | Number of applications accepted / number of applications received |
| EnrollmentRate | Number of students enrolled / number of applications accepted |
| FT.Proportion | Proportion of undergraduates enrolled full-time |
| log10Apps | Log10-transformed number of applications received |

## Objectives

As stated above, the overall objective is to identify variables that predict the number of applications a college receives in a linear model, and to evaluate these variables' ability to minimize prediction error. To do this, you will build models in the training data (`Train`) and evaluate their performance in the testing data (`Test`).

Specifically, suppose we try to predict the number of applications received from the percent of new students coming from the top 10% of their high school class. We may build the model in the Training set and apply it to the Testing set as follows:

```r
load("Data/Data_2L_2M.RData")

# Fit a univariate model in the Training data
example_mod = lm(log10Apps ~ Top10perc, data = Train)
# summary(example_mod)

# Predict in the Testing data, then undo the log10 transform
# to put predictions back on the scale of application counts
example_mod_test_pred = predict(example_mod, newdata = Test)
10 ^ example_mod_test_pred[1:10]
##     Agnes Scott College       Albertson College Albertus Magnus College 
##                3491.768                2094.404                1285.777 
##     Anderson University      Andrews University      Antioch University 
##                1346.934                1199.209                1548.416 
##        Augsburg College Baldwin-Wallace College          Beaver College 
##                1144.760                1739.156                1478.111 
##       Bethel College KS 
##                1346.934
```

We may then calculate the root mean squared error (RMSE) in the Test data by taking the square root of the average (squared) distance between those predictions and the true (log-transformed) number of applications:
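In symbols, writing $\hat{y}_i$ for the predicted and $y_i$ for the observed log10Apps of the $n$ colleges in the Test data:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$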

```r
# what are the units of this statistic?
sqrt(mean((example_mod_test_pred - Test$log10Apps) ^ 2))
## [1] 0.4528126
```

Ideally, we would like to choose the model that minimizes the RMSE in the Test data, but without simply evaluating every model there. That is, we would like a criterion, computed in the Training data alone, that leads us to select a model which ultimately has low RMSE in the Test data. In what follows, restrict consideration to the first eight variables listed above: Private, Top10perc, Top25perc, Outstate, Room.Board, Books, Personal, PhD.
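For convenience in the sketches that follow the task list, these eight candidates can be collected in a character vector (`predictors` is a name introduced here for illustration, not an object supplied with the data):

```r
# The eight candidate predictors considered in this project
predictors <- c("Private", "Top10perc", "Top25perc", "Outstate",
                "Room.Board", "Books", "Personal", "PhD")
```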

  1. Consider a few variables one at a time as predictors for the (log-transformed) number of applications received, looking only at the Training data set. Based on p-values / coefficient magnitudes, which variables appear to be important predictors in these univariate models? Provide some graphical and numerical summaries, as appropriate. (A starting point is sketched in the first code block after this list.)
  2. Choose one variable (or more than one) of particular interest and describe in more detail a linear model using it to predict the (log-transformed) number of applications received. How well does the model fit? Are there any outliers? Choose a few schools of interest: how close is the model-predicted number of applications to the actual number of applications? Provide graphical and numerical summaries, as appropriate. (See the second sketch below.)
  3. For a couple of variables, use the Training data set to estimate the RMSE (i.e., with the above commands, setting newdata = Train) for a univariate model predicting the (log-transformed) number of applications received. Also calculate the RMSE in the Testing data for each variable, and compare these graphically. Does the RMSE calculated in the Training data tend to underestimate the RMSE in the Testing data? (See the third sketch below.)
  4. Learn about cross-validation, and use cross-validation to better estimate the RMSE from the Training data alone. Compare the cross-validation-based estimate of the RMSE to the "actual" RMSE in the Testing data. How do these compare? A good resource for this is the book An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani, Chapter 5, pages 176-186. The book is free to download: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf The companion website http://www-bcf.usc.edu/~gareth/ISL/ also has many useful resources, including a lab which performs some of these analyses (on a different data set): http://www-bcf.usc.edu/~gareth/ISL/Chapter%205%20Lab.txt (See the fourth sketch below.)
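For item 1, one possible starting point is a loop over the candidates that tabulates the univariate results; this is a sketch only, assuming the `predictors` vector defined above:

```r
# Fit one univariate model per candidate predictor in the Training data,
# then collect the slope p-value and R-squared from each fit.
uni_fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "log10Apps"), data = Train)
})
uni_summary <- data.frame(
  variable  = predictors,
  # row 2 of the coefficient table is the slope
  # (for Private, it is the PrivateYes contrast)
  p.value   = sapply(uni_fits, function(m) coef(summary(m))[2, "Pr(>|t|)"]),
  r.squared = sapply(uni_fits, function(m) summary(m)$r.squared)
)
uni_summary[order(uni_summary$p.value), ]
```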
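For item 2, a sketch of a closer look at one model. Top10perc and the named schools are illustrative choices, and the indexing assumes (as the output above suggests) that the row names of Test are the school names:

```r
# Examine one univariate model in more detail
mod <- lm(log10Apps ~ Top10perc, data = Train)
summary(mod)            # coefficients, residual SE, R-squared

plot(log10Apps ~ Top10perc, data = Train)  # does a line look reasonable?
abline(mod)
plot(mod, which = 1)    # residuals vs. fitted values: any outliers?

# Predicted vs. actual application counts for a few schools of interest
schools <- c("Agnes Scott College", "Andrews University")
data.frame(
  predicted = round(10 ^ predict(mod, newdata = Test[schools, ])),
  actual    = round(10 ^ Test[schools, "log10Apps"])
)
```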
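For item 3, a sketch that computes the RMSE of each univariate model in both data sets and compares them graphically; `rmse()` is a hypothetical helper written here, not part of any package:

```r
# RMSE of a univariate model fit in Train, evaluated in `data`
rmse <- function(variable, data) {
  m <- lm(reformulate(variable, response = "log10Apps"), data = Train)
  sqrt(mean((predict(m, newdata = data) - data$log10Apps) ^ 2))
}
rmse_train <- sapply(predictors, rmse, data = Train)
rmse_test  <- sapply(predictors, rmse, data = Test)

# Points above the 45-degree line have higher RMSE in Test than in Train
plot(rmse_train, rmse_test, xlab = "RMSE in Train", ylab = "RMSE in Test")
abline(0, 1)
```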
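For item 4, a sketch of 10-fold cross-validation using cv.glm() from the boot package, in the style of the ISLR Chapter 5 lab. Note that cv.glm() reports mean squared error, so take the square root to get an RMSE:

```r
library(boot)   # provides cv.glm()

set.seed(1)     # the fold assignment is random
# glm() with the default gaussian family fits the same model as lm()
cv_fit  <- glm(log10Apps ~ Top10perc, data = Train)
cv_rmse <- sqrt(cv.glm(Train, cv_fit, K = 10)$delta[1])
cv_rmse

# "Actual" RMSE of the same model in the Testing data, for comparison
sqrt(mean((predict(cv_fit, newdata = Test) - Test$log10Apps) ^ 2))
```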