Chapter 25 Elastic Net

We again use the Hitters dataset from the ISLR package to explore another shrinkage method, elastic net, which combines the ridge and lasso methods from the previous chapter.

data(Hitters, package = "ISLR")
Hitters = na.omit(Hitters)

We again remove the missing data, which was all in the response variable, Salary.

tibble::as_tibble(Hitters)
## # A tibble: 263 x 20
##    AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns
##  * <int> <int> <int> <int> <int> <int> <int>  <int> <int>  <int> <int>
##  1   315    81     7    24    38    39    14   3449   835     69   321
##  2   479   130    18    66    72    76     3   1624   457     63   224
##  3   496   141    20    65    78    37    11   5628  1575    225   828
##  4   321    87    10    39    42    30     2    396   101     12    48
##  5   594   169     4    74    51    35    11   4408  1133     19   501
##  6   185    37     1    23     8    21     2    214    42      1    30
##  7   298    73     0    24    24     7     3    509   108      0    41
##  8   323    81     6    26    32     8     2    341    86      6    32
##  9   401    92    17    49    66    65    13   5206  1332    253   784
## 10   574   159    21   107    75    59    10   4631  1300     90   702
## # ... with 253 more rows, and 9 more variables: CRBI <int>, CWalks <int>,
## #   League <fct>, Division <fct>, PutOuts <int>, Assists <int>,
## #   Errors <int>, Salary <dbl>, NewLeague <fct>
dim(Hitters)
## [1] 263  20

Because this dataset isn’t particularly large, we will forego a test-train split, and simply use all of the data as training data.

library(caret)
library(glmnet)

Since he have loaded caret, we also have access to the lattice package which has a nice histogram function.

histogram(Hitters$Salary, xlab = "Salary, $1000s", 
          main = "Baseball Salaries, 1986 - 1987")

25.1 Regression

Like ridge and lasso, we again attempt to minimize the residual sum of squares plus some penalty term.

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda\left[(1-\alpha)||\beta||_2^2/2 + \alpha ||\beta||_1\right] \]

Here, \(||\beta||_1\) is called the \(l_1\) norm.

\[ ||\beta||_1 = \sum_{j=1}^{p} |\beta_j| \] Similarly, \(||\beta||_2\) is called the \(l_2\), or Euclidean norm.

\[ ||\beta||_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2} \]

These both quantify how “large” the coefficients are. Like lasso and ridge, the intercept is not penalized and glment takes care of standardization internally. Also reported coefficients are on the original scale.

The new penalty is \(\frac{\lambda \cdot (1-\alpha)}{2}\) times the ridge penalty plus \(\lambda \cdot \alpha\) times the lasso lasso penalty. (Dividing the ridge penalty by 2 is a mathematical convenience for optimization.) Essentially, with the correct choice of \(\lambda\) and \(\alpha\) these two “penalty coefficients” can be any positive numbers.

Often it is more useful to simply think of \(\alpha\) as controlling the mixing between the two penalties and \(\lambda\) controlling the amount of penalization. \(\alpha\) takes values between 0 and 1. Using \(\alpha = 1\) gives the lasso that we have seen before. Similarly, \(\alpha = 0\) gives ridge. We used these two before with glmnet() to specify which to method we wanted. Now we also allow for \(\alpha\) values in between.

set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)

We first setup our cross-validation strategy, which will be 5 fold. We then use train() with method = "glmnet" which is actually fitting the elastic net.

hit_elnet = train(
  Salary ~ ., data = Hitters,
  method = "glmnet",
  trControl = cv_5
)

First, note that since we are using caret() directly, it is taking care of dummy variable creation. So unlike before when we used glmnet(), we do not need to manually create a model matrix.

Also note that we have allowed caret to choose the tuning parameters for us.

hit_elnet
## glmnet 
## 
## 263 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 209, 211, 211, 212, 209 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE   Rsquared  MAE  
##   0.10    0.5106  327.7  0.4869    231.6
##   0.10    5.1056  327.4  0.4894    230.7
##   0.10   51.0564  334.3  0.4734    229.4
##   0.55    0.5106  327.6  0.4873    231.5
##   0.55    5.1056  328.1  0.4895    229.1
##   0.55   51.0564  338.7  0.4749    233.2
##   1.00    0.5106  327.6  0.4877    231.3
##   1.00    5.1056  331.3  0.4818    229.3
##   1.00   51.0564  348.8  0.4663    242.2
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 5.106.

Notice a few things with these results. First, we have tried three \(\alpha\) values, 0.10, 0.55, and 1. It is not entirely clear why caret doesn’t use 0. It likely uses 0.10 to fit a model close to ridge, but with some potential for sparsity.

Here, the best result uses \(\alpha = 0.10\), so this result is somewhere between ridge and lasso, but closer to ridge.

hit_elnet_int = train(
  Salary ~ . ^ 2, data = Hitters,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 10
)

Now we try a much larger model search. First, we’re expanding the feature space to include all interactions. Since we are using penalized regression, we don’t have to worry as much about overfitting. If many of the added variables are not useful, we will likely use a model close to lasso which makes many of them 0.

We’re also using a larger tuning grid. By setting tuneLength = 10, we will search 10 \(\alpha\) values and 10 \(\lambda\) values for each. Because of this larger tuning grid, the results will be very large.

To deal with this, we write a quick helper function to extract the row with the best tuning parameters.

get_best_result = function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}

We then call this function on the trained object.

get_best_result(hit_elnet_int)
##   alpha lambda  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD
## 1     1  4.135 313.5     0.56 206.1  70.83     0.1254 24.37

We see that the best result uses \(\alpha = 1\), which makes since. With \(\alpha = 1\), many of the added interaction coefficients are likely set to zero. (Unfortunately, obtaining this information after using caret with glmnet isn’t easy. The two don’t actually play very nice together. We’ll use cv.glmnet() with the expanded feature space to explore this.)

Also, this CV-RMSE is better than the lasso and ridge from the previous chapter that did not use the expanded feature space.

We also perform a quick analysis using cv.glmnet() instead. Due in part to randomness in cross validation, and differences in how cv.glmnet() and train() search for \(\lambda\), the results are slightly different.

set.seed(42)
X = model.matrix(Salary ~ . ^ 2, Hitters)[, -1]
y = Hitters$Salary

fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min]) # CV-RMSE minimum
## [1] 305

The commented line is not run, since it produces a lot of output, but if run, it will show that the fast majority of the coefficients are zero! Also, you’ll notice that cv.glmnet() does not respect the usual predictor hierarchy. Not a problem for prediction, but a massive interpretation issue!

#coef(fit_lasso_cv)
sum(coef(fit_lasso_cv) != 0)
## [1] 5
sum(coef(fit_lasso_cv) == 0)
## [1] 186

25.2 Classification

Above, we have performed a regression task. But like lasso and ridge, elastic net can also be used for classification by using the deviance instead of the residual sum of squares. This essentially happens automatically in caret if the response variable is a factor.

We’ll test this using the familiar Default dataset, which we first test-train split.

data(Default, package = "ISLR")
set.seed(42)
default_idx = createDataPartition(Default$default, p = 0.75, list = FALSE)
default_trn = Default[default_idx, ]
default_tst = Default[-default_idx, ]

We then fit an elastic net with a default tuning grid.

def_elnet = train(
  default ~ ., data = default_trn,
  method = "glmnet",
  trControl = cv_5
)
def_elnet
## glmnet 
## 
## 7501 samples
##    3 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 6001, 6001, 6001, 6000, 6001 
## Resampling results across tuning parameters:
## 
##   alpha  lambda     Accuracy  Kappa 
##   0.10   0.0001242  0.9731    0.4125
##   0.10   0.0012422  0.9731    0.3878
##   0.10   0.0124220  0.9679    0.0796
##   0.55   0.0001242  0.9731    0.4125
##   0.55   0.0012422  0.9731    0.3909
##   0.55   0.0124220  0.9684    0.1144
##   1.00   0.0001242  0.9731    0.4125
##   1.00   0.0012422  0.9732    0.4104
##   1.00   0.0124220  0.9693    0.1661
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.001242.

Since the best model used \(\alpha = 1\), this is a lasso model.

We also try an expanded feature space, and a larger tuning grid.

def_elnet_int = train(
  default ~ . ^ 2, data = default_trn,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 10
)

Since the result here will return 100 models, we again use are helper function to simply extract the best result.

get_best_result(def_elnet_int)
##   alpha   lambda Accuracy  Kappa AccuracySD KappaSD
## 1   0.1 0.001888   0.9732 0.3887   0.001843 0.07165

Here we see \(\alpha = 0.1\), which is a mix, but close to ridge.

calc_acc = function(actual, predicted) {
  mean(actual == predicted)
}

Evaluating the test accuracy of this model, we obtain one of the highest accuracies for this dataset of all methods we have tried.

# test acc
calc_acc(actual = default_tst$default,
         predicted = predict(def_elnet_int, newdata = default_tst))
## [1] 0.9728

25.4 rmarkdown

The rmarkdown file for this chapter can be found here. The file was created using R version 3.5.1. The following packages (and their dependencies) were loaded when knitting this file:

## [1] "glmnet"  "foreach" "Matrix"  "caret"   "ggplot2" "lattice"