Chapter 24 Regularization

Chapter Status: Currently this chapter is very sparse. It essentially only expands upon an example discussed in ISL, and thus only illustrates usage of the methods. Mathematical and conceptual details of the methods will be added later. Also, more comments on using glmnet with caret will be added.

We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.

data(Hitters, package = "ISLR")

This dataset has some missing data in the response Salary. We use the na.omit() function to clean the dataset.

sum(is.na(Hitters))
## [1] 59
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
## [1] 0

The predictor variables are offensive and defensive statistics for a number of baseball players.

names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"

We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.

library(glmnet)

Unfortunately, the glmnet function does not allow the use of model formulas, so we set up the data for ease of use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regression with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly.

X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
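
As a quick sanity check (output omitted), the design matrix should now have 19 columns: the numeric predictors plus dummy variables for the factors League, Division, and NewLeague, with the intercept column dropped by the [, -1].

# inspect the design matrix created by model.matrix()
dim(X)
colnames(X)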

First, we fit an ordinary linear regression and note the sum of the absolute values of the coefficients, as well as the sum of the squared coefficients. (These are the two penalty terms we will use.)

fit = lm(Salary ~ ., Hitters)
coef(fit)
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##  163.1035878   -1.9798729    7.5007675    4.3308829   -2.3762100 
##          RBI        Walks        Years       CAtBat        CHits 
##   -1.0449620    6.2312863   -3.4890543   -0.1713405    0.1339910 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##   -0.1728611    1.4543049    0.8077088   -0.8115709   62.5994230 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -116.8492456    0.2818925    0.3710692   -3.3607605  -24.7623251
sum(abs(coef(fit)[-1]))
## [1] 238.7295
sum(coef(fit)[-1] ^ 2)
## [1] 18337.3

24.1 Ridge Regression

We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]

Notice that the intercept is not penalized. Also, note that, unlike ordinary unpenalized regression, ridge regression is not scale invariant. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors before fitting, then reports the fitted coefficients on the original scale.
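
Standardization is controlled by the standardize argument to glmnet(), which defaults to TRUE, so the call below is equivalent to the fit in the next chunk; it is shown only to make the standardization explicit.

# equivalent to the default behavior; standardize = TRUE is the default
fit_ridge_std = glmnet(X, y, alpha = 0, standardize = TRUE)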

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge)
plot(fit_ridge, xvar = "lambda", label = TRUE)

We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot illustrates the cross-validated MSE for the \(\lambda\) values considered. Two vertical lines are drawn. The first is at the \(\lambda\) that gives the smallest MSE; the second is at the largest \(\lambda\) whose MSE is within one standard error of the smallest.

fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)
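
The two \(\lambda\) values marked by the vertical lines are stored in the fitted object and can be extracted directly. (Values not shown, since they depend on the random fold assignment.)

# lambda values corresponding to the two vertical lines in the plot
fit_ridge_cv$lambda.min
fit_ridge_cv$lambda.1se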

The fitted cv.glmnet() object stores the results for both \(\lambda\) values marked in the plot. Notice the penalty terms are smaller than those of the full linear regression, as we would expect.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 226.844379891
## AtBat         0.086613902
## Hits          0.352962515
## HmRun         1.144213851
## Runs          0.569353372
## RBI           0.570074066
## Walks         0.735072618
## Years         2.397356090
## CAtBat        0.007295083
## CHits         0.027995153
## CHmRun        0.208112349
## CRuns         0.056146219
## CRBI          0.058060281
## CWalks        0.056586702
## LeagueN       2.850306094
## DivisionW   -20.329125634
## PutOuts       0.049296951
## Assists       0.007063169
## Errors       -0.128066380
## NewLeagueN    2.654025549
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  7.645824e+01
## AtBat       -6.315180e-01
## Hits         2.642160e+00
## HmRun       -1.388233e+00
## Runs         1.045729e+00
## RBI          7.315713e-01
## Walks        3.278001e+00
## Years       -8.723734e+00
## CAtBat       1.256355e-04
## CHits        1.318975e-01
## CHmRun       6.895578e-01
## CRuns        2.830055e-01
## CRBI         2.514905e-01
## CWalks      -2.599851e-01
## LeagueN      5.233720e+01
## DivisionW   -1.224170e+02
## PutOuts      2.623667e-01
## Assists      1.629044e-01
## Errors      -3.644002e+00
## NewLeagueN  -1.702598e+01
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
## [1] 18126.85
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 226.844379891
## AtBat         0.086613902
## Hits          0.352962515
## HmRun         1.144213851
## Runs          0.569353372
## RBI           0.570074066
## Walks         0.735072618
## Years         2.397356090
## CAtBat        0.007295083
## CHits         0.027995153
## CHmRun        0.208112349
## CRuns         0.056146219
## CRBI          0.058060281
## CWalks        0.056586702
## LeagueN       2.850306094
## DivisionW   -20.329125634
## PutOuts       0.049296951
## Assists       0.007063169
## Errors       -0.128066380
## NewLeagueN    2.654025549
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
## [1] 436.8923
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
## [1] 134397.5
# CV-RMSEs
sqrt(fit_ridge_cv$cvm)
##  [1] 451.1772 449.5218 448.7861 448.5162 448.2212 447.8988 447.5467
##  [8] 447.1623 446.7428 446.2853 445.7867 445.2435 444.6522 444.0091
## [15] 443.3103 442.5514 441.7284 440.8367 439.8718 438.8289 437.7034
## [22] 436.4907 435.1860 433.7849 432.2833 430.6772 428.9634 427.1390
## [29] 425.2020 423.1511 420.9863 418.7086 416.3203 413.8252 411.2286
## [36] 408.5376 405.7608 402.9083 399.9921 397.0258 394.0242 391.0032
## [43] 387.9798 384.9712 381.9951 379.0686 376.2085 373.4305 370.7489
## [50] 368.1761 365.7223 363.3970 361.2066 359.1553 357.2453 355.4771
## [57] 353.8490 352.3580 350.9996 349.7684 348.6581 347.6619 346.7727
## [64] 345.9822 345.2844 344.6755 344.1461 343.6856 343.2892 342.9523
## [71] 342.6699 342.4358 342.2454 342.0888 341.9642 341.8679 341.7953
## [78] 341.7394 341.6991 341.6695 341.6482 341.6297 341.6146 341.5953
## [85] 341.5748 341.5476 341.5157 341.4718 341.4199 341.3587 341.2850
## [92] 341.2006 341.1064 341.0006 340.8855 340.7584 340.6254 340.4818
## [99] 340.3364
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
## [1] 340.3364
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se]) 
## [1] 370.7489

24.2 Lasso

We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]

Like ridge, lasso is not scale invariant.

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso)
plot(fit_lasso, xvar = "lambda", label = TRUE)

Again, to actually pick a \(\lambda\), we will use cross-validation. The plot is similar to the ridge plot, but notice that the counts along the top, which give the number of features in the model, now change across the plot.

fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
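
The counts along the top of the plot are stored in the nzero component of the fitted object. For example, we can look up the number of nonzero coefficients at the two \(\lambda\) values of interest. (Values not shown.)

# number of nonzero coefficients at the two lambda values of interest
fit_lasso_cv$nzero[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min]
fit_lasso_cv$nzero[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se]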

Again, cv.glmnet() stores the results for both \(\lambda\) values in the plot. Notice the penalty terms are again smaller than those of the full linear regression, as we would expect, and that some coefficients are set to exactly 0.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 167.91202846
## AtBat         .         
## Hits          1.29269756
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.39817511
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.14167760
## CRBI          0.32192558
## CWalks        .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.04675463
## Assists       .         
## Errors        .         
## NewLeagueN    .
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  129.4155569
## AtBat         -1.6130155
## Hits           5.8058915
## HmRun          .        
## Runs           .        
## RBI            .        
## Walks          4.8469340
## Years         -9.9724045
## CAtBat         .        
## CHits          .        
## CHmRun         0.5374550
## CRuns          0.6811938
## CRBI           0.3903563
## CWalks        -0.5560143
## LeagueN       32.4646094
## DivisionW   -119.3480842
## PutOuts        0.2741895
## Assists        0.1855978
## Errors        -2.1650837
## NewLeagueN     .
# penalty term using minimum lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
## [1] 15463.18
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 167.91202846
## AtBat         .         
## Hits          1.29269756
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.39817511
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.14167760
## CRBI          0.32192558
## CWalks        .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.04675463
## Assists       .         
## Errors        .         
## NewLeagueN    .
# penalty term using 1-SE rule lambda
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
## [1] 3.751855
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
## [1] 123931.3
# CV-RMSEs
sqrt(fit_lasso_cv$cvm)
##  [1] 449.5476 439.8301 430.0707 421.8891 413.7590 405.2809 396.7882
##  [8] 389.3514 383.2383 378.3132 374.3378 371.1364 368.4413 366.2982
## [15] 364.4505 362.8489 361.2297 359.4252 357.5374 355.9585 354.6605
## [22] 353.5969 352.7915 352.2175 351.7586 351.4219 351.1697 351.0200
## [29] 350.9449 350.9065 350.9106 350.9508 350.9949 351.1107 351.3967
## [36] 351.6997 352.1176 352.3968 352.2284 351.4513 350.5703 349.5480
## [43] 348.4454 347.4367 346.4896 345.6987 345.0589 344.5373 344.0981
## [50] 343.7666 343.6808 343.7934 343.8787 343.9644 344.0379 344.1172
## [57] 344.2081 344.3154 344.4026 344.3953 344.3500 344.2605 344.2293
## [64] 344.2134 344.2067 344.2084 344.2437 344.2762 344.3076 344.4098
## [71] 344.5598 344.6825
# CV-RMSE using minimum lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min])
## [1] 343.6808
# CV-RMSE using 1-SE rule lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se]) 
## [1] 368.4413

24.3 broom

Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.

library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
## # A tibble: 72 x 6
##    lambda estimate std.error conf.low conf.high nzero
##     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <int>
##  1   255.  202093.    25322.  176771.   227415.     0
##  2   233.  193451.    25113.  168337.   218564.     1
##  3   212.  184961.    24639.  160322.   209600.     2
##  4   193.  177990.    24206.  153785.   202196.     2
##  5   176.  171197.    23792.  147404.   194989.     3
##  6   160.  164253.    23466.  140787.   187718.     4
##  7   146.  157441.    23212.  134229.   180652.     4
##  8   133.  151595.    23037.  128558.   174631.     4
##  9   121.  146872.    22909.  123963.   169780.     4
## 10   111.  143121.    22782.  120339.   165903.     4
## # … with 62 more rows
# the two lambda values of interest
glance(fit_lasso_cv) 
## # A tibble: 1 x 2
##   lambda.min lambda.1se
##        <dbl>      <dbl>
## 1       2.44       83.6
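
The tidy output also makes it easy to build custom plots. As a quick sketch, we can recreate the cross-validation plot using ggplot2 (loaded here explicitly; it is also attached later with caret).

# rebuild the cross-validation plot from the tidy() output
library(ggplot2)
ggplot(tidy(fit_lasso_cv), aes(x = log(lambda), y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  labs(x = "log(lambda)", y = "Cross-validated MSE")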

24.4 Simulated Data, \(p > n\)

Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case.

set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))

Above, we simulated a classification example where \(p > n\). An ordinary (unpenalized) logistic regression fit will not converge here:

# glm(y ~ X, family = "binomial")
# will not converge

We then use a lasso penalty to fit a penalized logistic regression, which minimizes

\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]

where \(L\) is the appropriate negative log-likelihood.
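
For a 0/1 response \(y_i\), as simulated above, this per-observation loss can be written as

\[ L\left(y_i, \eta_i\right) = -\left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right], \qquad \eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} . \]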

library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)

head(coef(fit_cv), n = 10)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 0.02397452
## V1          0.59674958
## V2          0.56251761
## V3          0.60065105
## V4          .         
## V5          .         
## V6          .         
## V7          .         
## V8          .         
## V9          .
fit_cv$nzero
##  s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 
##   0   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 
##   3   3   3   3   3   3   3   3   3   3   3   3   4   6   7  10  18  24 
## s36 s37 s38 s39 s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 
##  35  54  65  75  86 100 110 129 147 168 187 202 221 241 254 269 283 298 
## s54 s55 s56 s57 s58 s59 s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 
## 310 324 333 350 364 375 387 400 411 429 435 445 453 455 462 466 475 481 
## s72 s73 s74 s75 s76 s77 s78 s79 s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 
## 487 491 496 498 502 504 512 518 523 526 528 536 543 550 559 561 563 566 
## s90 s91 s92 s93 s94 s95 s96 s97 s98 
## 570 571 576 582 586 590 596 596 600

Notice that only the first three predictors were simulated with nonzero coefficients, and the model suggested by the one-standard-error rule selects exactly those three.

fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
## [1] 1 2 3

We can also see in the following plots that the three relevant features enter the model well ahead of the irrelevant features.

par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")

We can extract the two relevant \(\lambda\) values.

fit_cv$lambda.min
## [1] 0.03087158
fit_cv$lambda.1se
## [1] 0.0514969

Since cv.glmnet() does not calculate prediction accuracy for classification, we take the \(\lambda\) values and create a grid for caret to search in order to obtain prediction accuracy with train(). We set \(\alpha = 1\) in this grid, as glmnet can actually also tune over the \(\alpha\) parameter. (More on that later.)

Note that we have to force y to be a factor so that train() recognizes we want a binomial response. The train() function in caret uses the type of the response variable y to determine whether to use family = "binomial" or family = "gaussian".

library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1, 
                         lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
##   alpha     lambda
## 1     1 0.03087158
## 2     1 0.05149690
sim_data = data.frame(y, X)
fit_lasso = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneGrid = lasso_grid
)
fit_lasso$results
##   alpha     lambda  Accuracy     Kappa AccuracySD    KappaSD
## 1     1 0.03087158 0.7679304 0.5358028 0.03430230 0.06844656
## 2     1 0.05149690 0.7689003 0.5377583 0.02806941 0.05596114

The interaction between the glmnet and caret packages is sometimes frustrating, but as we see here, obtaining results for particular values of \(\lambda\) is straightforward. More on this in the next chapter.
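
As a quick usage sketch, the tuned model predicts like any other train() object; here we simply reuse the training data for illustration. (Output not shown.)

# class predictions from the tuned model (training data reused only for illustration)
head(predict(fit_lasso, newdata = sim_data))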

24.5 rmarkdown

The rmarkdown file for this chapter can be found here. The file was created using R version 3.5.2. The following packages (and their dependencies) were loaded when knitting this file:

## [1] "caret"   "ggplot2" "lattice" "broom"   "glmnet"  "foreach" "Matrix"