Overview

What player attributes predict player salary in Major League Baseball?

Details

The dataset was compiled in 1987 from Sports Illustrated (April 20, 1987) and the 1987 Baseball Encyclopedia Update. The goal of this study is to identify the player attributes (from the 1986 season and from each player's career up to 1986) that predict salary in 1987. The data are available as part of the ISLR R package. To access them, run:

# install.packages("ISLR")
library("ISLR")
head(Hitters)

Data Description

Variable    Description
AtBat       Number of times at bat in 1986
Hits        Number of hits in 1986
HmRun       Number of home runs in 1986
Runs        Number of runs in 1986
RBI         Number of runs batted in in 1986
Walks       Number of walks in 1986
Years       Number of years in the major leagues
CAtBat      Number of times at bat during his career
CHits       Number of hits during his career
CHmRun      Number of home runs during his career
CRuns       Number of runs during his career
CRBI        Number of runs batted in during his career
CWalks      Number of walks during his career
League      A factor with levels A and N indicating player’s league at the end of 1986
Division    A factor with levels E and W indicating player’s division at the end of 1986
PutOuts     Number of put outs in 1986
Assists     Number of assists in 1986
Errors      Number of errors in 1986
Salary      1987 annual salary on opening day in thousands of dollars
NewLeague   A factor with levels A and N indicating player’s league at the beginning of 1987
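
Before modelling, it can help to confirm the variable types listed above and to look for missing values; in the ISLR version of the data, Salary is missing for some players. The following is a minimal sketch that assumes the ISLR package is installed. The helper name `count_missing` is hypothetical and is not part of ISLR.

```r
# Hypothetical helper: number of missing values in each column of a data frame.
count_missing <- function(data) {
  colSums(is.na(data))
}

# Only runs if ISLR is installed; otherwise the block is skipped.
if (requireNamespace("ISLR", quietly = TRUE)) {
  hitters <- ISLR::Hitters
  str(hitters)                      # numeric counts vs. factors (League, Division, NewLeague)
  print(count_missing(hitters))     # missing values per variable
  print(table(hitters$NewLeague))   # players per league at the beginning of 1987
}
```

How to handle players with a missing Salary (for example, dropping them before fitting) is an analysis choice that should be stated explicitly.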

Objectives

As stated above, the overall objective is to identify variables that predict 1987 salary in a linear model. Restrict the analysis to the 176 players who played in the American League in 1987. Specifically:

  1. Consider some variables of interest one at a time as predictors for Salary. Which variables appear to be important predictors (in these univariate models)? Provide some graphical and numerical summaries, as appropriate.
  2. Choose one variable (or more than one) of particular interest and describe in more detail a linear model using it to predict salary. How well does the model fit? Choose a few players of interest – how close is the model-predicted salary to the actual salary? Does the prediction interval from the model include the true salary? Does it help to first log-transform salary or log-transform the predictor? Provide graphical and numerical summaries, as appropriate.
  3. Learn about some model selection strategies (such as best-subset, forward stepwise, or backward stepwise selection) as well as a model fit assessment measure (such as the adjusted \(R^2\)). Apply these methods to these data to identify a good model for salary prediction. Do different model selection strategies yield different models? Are the suggested models consistent with what you found in the univariate analyses? What model would you recommend using?
     A good resource for this is the book An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani, Chapter 6, pages 203-214 and 244-251. The book is free to download: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
     The companion website http://www-bcf.usc.edu/~gareth/ISL/ also has many useful resources, including a lab which performs some of these analyses on this exact data set: http://www-bcf.usc.edu/~gareth/ISL/Chapter%206%20Labs.txt
     You do not need to do exactly what they do in that lab, nor do you need to purposely deviate from it. However, make sure to explain your methods and justify all analysis choices.
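
The three objectives above can be sketched in base R as follows. This is one possible starting point under stated assumptions, not a required approach: the helper names `univariate_r2` and `forward_stepwise` are hypothetical, "playing in the American League in 1987" is read as `NewLeague == "A"`, players with a missing Salary are dropped, only numeric predictors are used, and `CRBI` is an arbitrary choice of single predictor for illustration. The forward stepwise helper adds, at each step, the predictor that gives the lowest residual sum of squares, and then picks the model size with the highest adjusted \(R^2\).

```r
# Objective 1 (hypothetical helper): R^2 of each one-predictor model,
# response ~ predictor, sorted from largest to smallest.
univariate_r2 <- function(data, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# Objective 3 (hypothetical helper): forward stepwise selection.
# Assumes `data` has no missing values, so every fit uses the same rows.
forward_stepwise <- function(data, response, predictors) {
  selected <- character(0)
  remaining <- predictors
  path <- list()
  adj_r2 <- numeric(0)
  while (length(remaining) > 0) {
    # Residual sum of squares after adding each remaining candidate
    rss <- sapply(remaining, function(p) {
      fit <- lm(reformulate(c(selected, p), response = response), data = data)
      sum(residuals(fit)^2)
    })
    best <- remaining[which.min(rss)]
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
    fit <- lm(reformulate(selected, response = response), data = data)
    path[[length(selected)]] <- selected
    adj_r2 <- c(adj_r2, summary(fit)$adj.r.squared)
  }
  # Among the nested models, keep the one with the highest adjusted R^2
  list(path = path, adj_r2 = adj_r2, best = path[[which.max(adj_r2)]])
}

# Usage on the Hitters data; only runs if ISLR is installed.
if (requireNamespace("ISLR", quietly = TRUE)) {
  # Assumption: American League in 1987 means NewLeague == "A".
  al <- na.omit(subset(ISLR::Hitters, NewLeague == "A"))

  # Simplification: numeric predictors only. NewLeague is constant in this
  # subset; League and Division could be added back by name.
  predictors <- setdiff(names(al)[sapply(al, is.numeric)], "Salary")

  print(univariate_r2(al, "Salary", predictors))

  # Objective 2: one-predictor model with 95% prediction intervals
  # for the first three players, next to their actual salaries.
  fit <- lm(Salary ~ CRBI, data = al)
  print(predict(fit, newdata = head(al, 3), interval = "prediction"))
  print(head(al$Salary, 3))

  print(forward_stepwise(al, "Salary", predictors)$best)
}
```

Writing the helpers against a generic data frame keeps the selection logic separate from the Hitters-specific choices, so the same functions can be rerun after log-transforming Salary or changing the set of candidate predictors.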