Chapter 24 Regularization

Chapter Status: Currently this chapter is very sparse. It essentially only expands upon an example discussed in ISL, and thus only illustrates usage of the methods. Mathematical and conceptual details of the methods will be added later. More comments on using glmnet with caret will also be added.

We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.

data(Hitters, package = "ISLR")

This dataset has some missing data in the response Salary. We use the na.omit() function to clean the dataset.

sum(is.na(Hitters))
## [1] 59
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
## [1] 0

The predictor variables are offensive and defensive statistics for a number of baseball players.

names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"

We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.

library(glmnet)

Unfortunately, the glmnet function does not allow the use of model formulas, so we set up the data for ease of use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regression with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly.

X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary

First, we fit an ordinary linear regression and note the size of the coefficients, as measured by the sum of their absolute values and the sum of their squares. (These are the two penalties we will use.)

fit = lm(Salary ~ ., Hitters)
coef(fit)
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##  163.1035878   -1.9798729    7.5007675    4.3308829   -2.3762100 
##          RBI        Walks        Years       CAtBat        CHits 
##   -1.0449620    6.2312863   -3.4890543   -0.1713405    0.1339910 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##   -0.1728611    1.4543049    0.8077088   -0.8115709   62.5994230 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -116.8492456    0.2818925    0.3710692   -3.3607605  -24.7623251
sum(abs(coef(fit)[-1]))
## [1] 238.7295
sum(coef(fit)[-1] ^ 2)
## [1] 18337.3

24.1 Ridge Regression

We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]

Notice that the intercept is not penalized. Also, note that ridge regression is not scale invariant, unlike the usual unpenalized regression. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors for fitting, then reports the fitted coefficients on the original scale.
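
As a quick sanity check of this scale handling (a small sketch, not part of the original analysis), rescaling a single predictor, say by a factor of 1000, leaves the fit unchanged and simply rescales the reported coefficient in return. The column index 50 is an arbitrary point on the \(\lambda\) path.

# rescale one predictor, e.g. a change of units, and compare coefficients
X_rescaled = X
X_rescaled[, "AtBat"] = X_rescaled[, "AtBat"] * 1000
coef_orig = coef(glmnet(X, y, alpha = 0))["AtBat", 50]
coef_resc = coef(glmnet(X_rescaled, y, alpha = 0))["AtBat", 50]
coef_orig / coef_resc  # approximately 1000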

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge)
plot(fit_ridge, xvar = "lambda", label = TRUE)

We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot shows the cross-validated MSE for each \(\lambda\) considered. Two vertical lines are drawn: the first marks the \(\lambda\) that gives the smallest MSE, and the second marks the largest \(\lambda\) whose MSE is within one standard error of the smallest.

fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)
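
The chosen \(\lambda\) values depend on the random assignment of observations to folds, so they will vary slightly from run to run; placing a set.seed() call before cv.glmnet() makes them reproducible, and the number of folds can be changed with the nfolds argument. The two selected values are stored on the returned object. (Output omitted, since it depends on the folds.)

# the two lambda values marked by the vertical lines in the plot
fit_ridge_cv$lambda.min
fit_ridge_cv$lambda.1se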

The cv.glmnet() function returns several details of the fit for both \(\lambda\) values in the plot. Notice that the penalty terms are smaller than for the full linear regression, as we would expect.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 199.418113624
## AtBat         0.093426871
## Hits          0.389767263
## HmRun         1.212875007
## Runs          0.623229048
## RBI           0.618547529
## Walks         0.810467707
## Years         2.544170910
## CAtBat        0.007897059
## CHits         0.030554662
## CHmRun        0.226545984
## CRuns         0.061265846
## CRBI          0.063384832
## CWalks        0.060720300
## LeagueN       3.743295031
## DivisionW   -23.545192292
## PutOuts       0.056202373
## Assists       0.007879196
## Errors       -0.164203267
## NewLeagueN    3.313773161
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  7.645824e+01
## AtBat       -6.315180e-01
## Hits         2.642160e+00
## HmRun       -1.388233e+00
## Runs         1.045729e+00
## RBI          7.315713e-01
## Walks        3.278001e+00
## Years       -8.723734e+00
## CAtBat       1.256355e-04
## CHits        1.318975e-01
## CHmRun       6.895578e-01
## CRuns        2.830055e-01
## CRBI         2.514905e-01
## CWalks      -2.599851e-01
## LeagueN      5.233720e+01
## DivisionW   -1.224170e+02
## PutOuts      2.623667e-01
## Assists      1.629044e-01
## Errors      -3.644002e+00
## NewLeagueN  -1.702598e+01
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
## [1] 18126.85
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 199.418113624
## AtBat         0.093426871
## Hits          0.389767263
## HmRun         1.212875007
## Runs          0.623229048
## RBI           0.618547529
## Walks         0.810467707
## Years         2.544170910
## CAtBat        0.007897059
## CHits         0.030554662
## CHmRun        0.226545984
## CRuns         0.061265846
## CRBI          0.063384832
## CWalks        0.060720300
## LeagueN       3.743295031
## DivisionW   -23.545192292
## PutOuts       0.056202373
## Assists       0.007879196
## Errors       -0.164203267
## NewLeagueN    3.313773161
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
## [1] 588.9958
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
## [1] 130404.9
# CV-RMSEs
sqrt(fit_ridge_cv$cvm)
##  [1] 452.2547 450.3308 449.7317 449.4633 449.1699 448.8493 448.4991
##  [8] 448.1168 447.6996 447.2446 446.7487 446.2084 445.6204 444.9808
## [15] 444.2857 443.5309 442.7123 441.8254 440.8657 439.8284 438.7090
## [22] 437.5027 436.2049 434.8113 433.3177 431.7201 430.0153 428.2005
## [29] 426.2737 424.2336 422.0802 419.8144 417.4386 414.9565 412.3736
## [36] 409.6966 406.9345 404.0971 401.1964 398.2459 395.2604 392.2558
## [43] 389.2487 386.2566 383.2967 380.3863 377.5418 374.7787 372.1111
## [50] 369.5511 367.1087 364.7927 362.6095 360.5628 358.6545 356.8843
## [57] 355.2502 353.7487 352.3749 351.1228 349.9858 348.9564 348.0273
## [64] 347.1905 346.4412 345.7720 345.1713 344.6349 344.1564 343.7331
## [71] 343.3561 343.0168 342.7153 342.4464 342.2042 341.9833 341.7841
## [78] 341.6001 341.4285 341.2666 341.1123 340.9622 340.8155 340.6706
## [85] 340.5242 340.3767 340.2276 340.0746 339.9185 339.7584 339.5955
## [92] 339.4296 339.2595 339.0879 338.9155 338.7416 338.5701 338.3979
## [99] 338.2327
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
## [1] 338.2327
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se]) 
## [1] 367.1087

24.2 Lasso

We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]

Like ridge, lasso is not scale invariant.

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso)
plot(fit_lasso, xvar = "lambda", label = TRUE)

Again, to actually pick a \(\lambda\), we will use cross-validation. The plot is similar to the ridge plot. Notice that the numbers along the top give the number of nonzero coefficients in the model, which, unlike in the ridge plot, changes as \(\lambda\) varies.

fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)

cv.glmnet() returns several details of the fit for both \(\lambda\) values in the plot. Notice that the penalty terms are again smaller than for the full linear regression, as we would expect, and that some coefficients are now exactly 0.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 167.91202846
## AtBat         .         
## Hits          1.29269756
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.39817511
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.14167760
## CRBI          0.32192558
## CWalks        .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.04675463
## Assists       .         
## Errors        .         
## NewLeagueN    .
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  152.14662519
## AtBat         -1.91654285
## Hits           6.73000705
## HmRun          1.17822438
## Runs          -0.90695019
## RBI            .         
## Walks          5.56971500
## Years         -7.87904952
## CAtBat        -0.06113127
## CHits          .         
## CHmRun         0.25341467
## CRuns          1.09501343
## CRBI           0.55708869
## CWalks        -0.71999199
## LeagueN       45.71090646
## DivisionW   -116.86156291
## PutOuts        0.28175946
## Assists        0.28387742
## Errors        -2.82783302
## NewLeagueN    -9.44315086
# penalty term using minimum lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
## [1] 15989.82
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 167.91202846
## AtBat         .         
## Hits          1.29269756
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.39817511
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.14167760
## CRBI          0.32192558
## CWalks        .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.04675463
## Assists       .         
## Errors        .         
## NewLeagueN    .
# penalty term using 1-SE rule lambda
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
## [1] 3.751855
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
## [1] 123931.3
# CV-RMSEs
sqrt(fit_lasso_cv$cvm)
##  [1] 450.3831 442.4307 433.3239 425.6752 419.1445 412.1042 404.5872
##  [8] 397.6298 391.5574 386.6781 382.6658 379.3272 376.1864 373.2915
## [15] 370.4030 367.1251 364.2269 361.8179 359.8162 358.1759 356.8397
## [22] 355.7518 354.8749 354.1790 353.7305 353.4205 353.2287 353.1599
## [29] 353.1391 353.1504 353.2208 353.3826 353.8481 354.5664 355.3187
## [36] 356.3082 357.4151 358.2083 358.6335 358.7864 358.7272 358.3778
## [43] 357.4574 356.5203 355.7000 355.0084 354.4012 353.8614 353.4295
## [50] 353.1217 352.8717 352.6211 352.4915 352.4452 352.4019 352.2639
## [57] 352.0210 351.8358 351.7627 351.7046 351.7571 351.8209 351.9515
## [64] 352.0902 352.1957 352.3144 352.4536 352.5651 352.6676 352.7647
## [71] 352.8491
# CV-RMSE using minimum lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min])
## [1] 351.7046
# CV-RMSE using 1-SE rule lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se]) 
## [1] 376.1864
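
Counting the nonzero coefficients directly makes the sparsity induced by the lasso explicit. (These counts simply tally the coefficients printed above.)

# number of nonzero (non-intercept) coefficients at each selected lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] != 0)
## [1] 17
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] != 0)
## [1] 5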

24.3 broom

Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.

library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
## # A tibble: 71 x 6
##    lambda estimate std.error conf.low conf.high nzero
##     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <int>
##  1   255.  202845.    27294.  175551.   230139.     0
##  2   233.  195745.    27946.  167799.   223691.     1
##  3   212.  187770.    27019.  160750.   214789.     2
##  4   193.  181199.    26004.  155195.   207203.     2
##  5   176.  175682.    25100.  150582.   200782.     3
##  6   160.  169830.    24237.  145593.   194066.     4
##  7   146.  163691.    23262.  140429.   186953.     4
##  8   133.  158109.    22211.  135898.   180321.     4
##  9   121.  153317.    21310.  132008.   174627.     4
## 10   111.  149520.    20551.  128969.   170071.     4
## # ... with 61 more rows
# the two lambda values of interest
glance(fit_lasso_cv) 
## # A tibble: 1 x 2
##   lambda.min lambda.1se
##        <dbl>      <dbl>
## 1       1.05       83.6
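
Because tidy() returns an ordinary data frame, the cross-validation results can be re-plotted however we like. Below is a minimal ggplot2 sketch of the same information shown by plot(fit_lasso_cv). (ggplot2 is loaded later as a dependency of caret.)

library(ggplot2)
cv_results = tidy(fit_lasso_cv)
ggplot(cv_results, aes(x = log(lambda), y = estimate)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), color = "grey") +
  geom_point(color = "red") +
  geom_vline(xintercept = log(glance(fit_lasso_cv)$lambda.min), linetype = "dashed") +
  geom_vline(xintercept = log(glance(fit_lasso_cv)$lambda.1se), linetype = "dashed") +
  labs(x = "log(lambda)", y = "cross-validated MSE")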

24.4 Simulated Data, \(p > n\)

Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case. To illustrate, we first simulate a classification example where \(p > n\).

set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))

Note that ordinary (unpenalized) logistic regression fails here: with \(p > n\), glm() will not converge.

# glm(y ~ X, family = "binomial")
# will not converge

We then use a lasso penalty to fit penalized logistic regression. This minimizes

\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]

where \(L\) is the appropriate negative log-likelihood.

library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)

head(coef(fit_cv), n = 10)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 0.02397452
## V1          0.59674958
## V2          0.56251761
## V3          0.60065105
## V4          .         
## V5          .         
## V6          .         
## V7          .         
## V8          .         
## V9          .
fit_cv$nzero
##  s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 
##   0   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 
##   3   3   3   3   3   3   3   3   3   3   3   3   4   6   7  10  18  24 
## s36 s37 s38 s39 s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 
##  35  54  65  75  86 100 110 129 147 168 187 202 221 241 254 269 283 298 
## s54 s55 s56 s57 s58 s59 s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 
## 310 324 333 350 364 375 387 400 411 429 435 445 453 455 462 466 475 481 
## s72 s73 s74 s75 s76 s77 s78 s79 s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 
## 487 491 496 498 502 504 512 518 523 526 528 536 543 550 559 561 563 566 
## s90 s91 s92 s93 s94 s95 s96 s97 s98 
## 570 571 576 582 586 590 596 596 600

Notice that only the first three of the generated predictors are truly relevant, and that is exactly what the suggested model finds.

fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
## [1] 1 2 3

We can also see in the following plots that the three relevant features enter the model well ahead of the irrelevant features.

par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")

We can extract the two relevant \(\lambda\) values.

fit_cv$lambda.min
## [1] 0.03087158
fit_cv$lambda.1se
## [1] 0.0514969
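
Predictions from the fitted classifier are available directly via predict() with type = "class". As a rough check (a sketch, not a cross-validated estimate), the accuracy on the training data can be computed as follows; output is omitted since it depends on the folds used above.

# "train" accuracy at the 1-SE lambda (not cross-validated)
pred_class = predict(fit_cv, newx = X, type = "class", s = "lambda.1se")
mean(as.vector(pred_class) == as.character(y))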

By default, cv.glmnet() reports deviance rather than classification accuracy, so to obtain a cross-validated accuracy we take the two \(\lambda\) values and create a grid for caret to search with train(). We fix \(\alpha = 1\) in this grid, although caret's glmnet method can also tune over \(\alpha\). (More on that later.)

Note that we have to force y to be a factor so that train() recognizes that we want a binomial response. The train() function in caret uses the type of y to determine whether to use family = "binomial" or family = "gaussian".

library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1, 
                         lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
##   alpha     lambda
## 1     1 0.03087158
## 2     1 0.05149690
sim_data = data.frame(y, X)
fit_lasso = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneGrid = lasso_grid
)
fit_lasso$results
##   alpha     lambda  Accuracy     Kappa AccuracySD    KappaSD
## 1     1 0.03087158 0.7679304 0.5358028 0.03430230 0.06844656
## 2     1 0.05149690 0.7689003 0.5377583 0.02806941 0.05596114
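
The row of this table that caret ultimately selects (by default, the one with the highest accuracy) is stored in the fitted object; given the results above, that would be the larger \(\lambda\) here. (Output omitted, since it depends on the random folds.)

# the (alpha, lambda) combination selected by 5-fold CV accuracy
fit_lasso$bestTune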

The interaction between the glmnet and caret packages is sometimes frustrating, but as we have seen, obtaining results for particular values of \(\lambda\) is straightforward. More on this in the next chapter.

24.6 rmarkdown

The rmarkdown file for this chapter can be found here. The file was created using R version 3.5.1. The following packages (and their dependencies) were loaded when knitting this file:

## [1] "caret"   "ggplot2" "lattice" "broom"   "glmnet"  "foreach" "Matrix"