```{r, message = FALSE, warning = FALSE}
library(MASS)
library(randomForest)
library(caret)
```

# `R` Packages

In this document, we will compare Random Forests and a similar method called **Extremely Randomized Trees** which can be found in the `R` package `extraTrees`. The `extraTrees` package uses Java in the background and sometimes has memory issues. The command below modifies the Java back-end to be given more memory by default. (By default the Java Virtual Machine is allocated 512 MB, which we change to 4 GB.) This must be done before loading the `extraTrees` package.

```{r, message = FALSE, warning = FALSE}
options(java.parameters = "-Xmx4g")
library(extraTrees)
```

Details on the `extraTrees` package can be found in its [vignette](https://cran.r-project.org/web/packages/extraTrees/vignettes/extraTrees.pdf).

We will also discuss `ranger`, an alternative package for fitting a random forest, as well as `xgboost`, an alternative boosting package.

# Extremely Randomized Trees

Extremely Randomized Trees (ERT) are very similar to Random Forests (RF). There are essentially two main differences:

- ERT do not resample observations when building a tree. (They do not perform bagging.)
- ERT do not use the "best split."
    - Like a RF, ERT select a random subset of predictors for each split. (A tuning parameter: `mtry`)
    - Instead of searching for the "best split" for each predictor, ERT draw a small number of randomly chosen split points for each of the selected predictors. In the original method, this value was 1. (A tuning parameter: `numRandomCuts`)
    - ERT then selects the "best split" from this small number of choices.
    
The resulting "forest" contains trees that are more variable, but less correlated than the trees in a Random Forest. Details of the method can be found in the [original paper](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf).
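To make the difference in split selection concrete, here is a minimal base `R` sketch for a single numeric predictor. It is not the code used by `randomForest` or `extraTrees`. The helper names `split_sse`, `best_split`, and `random_split` are hypothetical, the data are simulated, and drawing cut points uniformly over the range of the predictor is an assumption based on the original paper's description.

```r
# Sum of squared errors after splitting y into x < cut and x >= cut
split_sse = function(x, y, cut) {
  left  = y[x <  cut]
  right = y[x >= cut]
  sum((left - mean(left)) ^ 2) + sum((right - mean(right)) ^ 2)
}

# RF-style: evaluate every midpoint between adjacent unique values of x
best_split = function(x, y) {
  vals = sort(unique(x))
  cuts = (head(vals, -1) + tail(vals, -1)) / 2
  sse  = sapply(cuts, function(cut) split_sse(x, y, cut))
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

# ERT-style: draw a few cut points at random, keep the best of those
random_split = function(x, y, num_random_cuts = 1) {
  cuts = runif(num_random_cuts, min = min(x), max = max(x))
  sse  = sapply(cuts, function(cut) split_sse(x, y, cut))
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

# Simulated data with a true change point at x = 0.5
set.seed(42)
x = runif(100)
y = ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.5)

exhaustive = best_split(x, y)
random_one = random_split(x, y, num_random_cuts = 1)
random_ten = random_split(x, y, num_random_cuts = 10)

c(exhaustive = exhaustive$sse, one_cut = random_one$sse, ten_cuts = random_ten$sse)
```

The exhaustive search can never do worse than a random cut on the same data, since every random cut produces the same partition as one of the midpoints. Giving up that optimality at each split is what makes the individual trees more variable and less correlated.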

As in most papers that propose a new method, the authors claim that Extremely Randomized Trees are better than Random Forests. In practice, you will find this is certainly true sometimes, but not always. Remember, there is no free lunch.

ERT can be used for both classification and regression, much like a RF. We will evaluate a regression example in this document.

# Regression Example

We consider the regression case, using the `Boston` data from the `MASS` package. We will use RMSE as our metric, so we write a function which will help us along the way:

```{r}
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
```
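For reference, the function above computes the usual root mean squared error. Writing $y_i$ for the actual values, $\hat{y}_i$ for the predicted values, and $n$ for the number of observations (notation introduced here, not in the code):

$$
\text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$

Because each difference is squared, the result is the same whichever of the two vectors is passed as `actual` and which as `predicted`.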

As always, we split the data into training and test sets, here using half of the observations for each.

```{r}
set.seed(42)
boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
boston_trn = Boston[boston_idx,]
boston_tst = Boston[-boston_idx,]
```

Notice that this dataset contains 13 predictor variables.

## Random Forest

We first train a Random Forest model. For this example, we will use cross-validation to select a value of `mtry`, the tuning parameter for RF. Two reasons for this:

- OOB error calculations are not implemented for the `extraTrees` package. So we'll use CV for both to keep the comparison as similar as possible.
- Using CV allows us to create a nice plot of the results.

We set up both our cross-validation (5 fold) and a grid of `mtry` values. (Here, trying all possible values.)

```{r}
cv_5 = trainControl(method = "cv", number = 5)
rf_grid = expand.grid(mtry = 1:13)
```
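As a rough illustration of what 5-fold cross-validation does behind the scenes, the sketch below computes a cross-validated RMSE by hand. It is only a sketch under stated assumptions: a linear model stands in for the random forest so that it runs quickly with no extra packages, the full `Boston` data are used for simplicity, and the folds are assigned by simple random shuffling, which is not necessarily how `caret` builds its folds. The names `n_folds`, `fold_id`, `fold_rmse`, and `cv_rmse` are introduced here for illustration.

```r
library(MASS)  # for the Boston data

rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

set.seed(42)
n_folds = 5

# Randomly assign each row to one of 5 roughly equal-sized folds
fold_id = sample(rep(1:n_folds, length.out = nrow(Boston)))

fold_rmse = sapply(1:n_folds, function(k) {
  # Fit on the four folds that are not fold k
  fit = lm(medv ~ ., data = Boston[fold_id != k, ])
  # Evaluate on the held-out fold k
  held_out = Boston[fold_id == k, ]
  rmse(held_out$medv, predict(fit, held_out))
})

# Average the per-fold errors to get the cross-validated estimate
cv_rmse = mean(fold_rmse)
cv_rmse
```

When tuning, this whole procedure is repeated once for every row of the tuning grid, and the row with the lowest cross-validated RMSE is selected.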

We then train the model.

```{r}
set.seed(42)
rf_fit = train(medv ~ ., data = boston_trn,
               method = "rf",
               trControl = cv_5,
               tuneGrid = rf_grid)
```

We suppress the bulk of the output and only view the selected tuning parameters and a plot of our results.

```{r}
#rf_fit
rf_fit$bestTune
plot(rf_fit)
rmse(predict(rf_fit, boston_tst), boston_tst$medv)
```

We find the resulting test RMSE for our chosen RF model with `mtry` = `r as.numeric(rf_fit$bestTune)` to be **`r rmse(predict(rf_fit, boston_tst), boston_tst$medv)`.**


## Extremely Randomized Trees

We now try an Extremely Randomized Trees model. The ERT model has two parameters:

- `mtry`, which works the same way as in RF.
- `numRandomCuts` which determines the number of randomly chosen splits for each of the `mtry` predictors selected for each split. Lower values make trees more random.

When specifying the grid of values, we use only selected values of `mtry` and only "small" values of `numRandomCuts` to keep computation time reasonable. (Remember, we're cross-validating, which takes more time than using OOB samples.) In practice, `numRandomCuts` is usually kept smaller than these values, say `1:5`, but a wider range was chosen here for the plot below. A value of `1` corresponds to the originally specified ERT method.

```{r}
et_grid = expand.grid(mtry = 4:7, numRandomCuts = 1:10)
```
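To get a sense of the computational cost of this grid, a quick count may help. The sketch below rebuilds the same grid and assumes the 5-fold cross-validation used throughout this document. `n_combos` and `n_fits` are names introduced here for illustration.

```r
# Same grid as above: every combination of the two tuning parameters
et_grid = expand.grid(mtry = 4:7, numRandomCuts = 1:10)

# Number of candidate parameter combinations
n_combos = nrow(et_grid)   # 4 * 10 = 40

# Each combination is fit once per fold under 5-fold cross-validation
n_fits = n_combos * 5      # 200

c(combinations = n_combos, cv_fits = n_fits)
```

So tuning over this grid fits 200 forests during cross-validation, plus one final refit on the full training data with the selected parameters.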

We train the model using `caret` with `method = "extraTrees"`, which uses the `extraTrees` package. When training the model, we add one extra argument, `numThreads = 4`, which is passed through to `extraTrees` and tells it to use 4 threads in the Java Virtual Machine, speeding up computation.

```{r}
set.seed(42)
et_fit = train(medv ~ ., data = boston_trn,
               method = "extraTrees",
               trControl = cv_5,
               tuneGrid = et_grid,
               numThreads = 4)
```

Again, we suppress the bulk of the output and only view the selected tuning parameters and a plot of our results.

```{r}
#et_fit
et_fit$bestTune
plot(et_fit)
rmse(predict(et_fit, boston_tst), boston_tst$medv)
```

We find the resulting test RMSE for our chosen ERT model to be **`r rmse(predict(et_fit, boston_tst), boston_tst$medv)`.** So, for this example, Extremely Randomized Trees win over Random Forests, but remember this won't always be the case.


# `ranger`

The `ranger` package simply re-implements the random forest method. It has a number of speed advantages, including the ability to grow trees in parallel.

```{r, message = FALSE, warning = FALSE}
library(ranger)
```

```{r, message = FALSE, warning = FALSE}
set.seed(42)
system.time({ranger_fit = train(medv ~ ., data = boston_trn,
                                method = "ranger",
                                trControl = cv_5,
                                num.threads = 1,
                                tuneGrid = rf_grid)})
```

```{r, message = FALSE, warning = FALSE}
set.seed(42)
system.time({ranger_fit = train(medv ~ ., data = boston_trn,
                                method = "ranger",
                                trControl = cv_5,
                                num.threads = 4,
                                tuneGrid = rf_grid)})
```

```{r}
#ranger_fit
ranger_fit$bestTune
plot(ranger_fit)
rmse(predict(ranger_fit, boston_tst), boston_tst$medv)
```

Notice that, due to differences in implementation, the results are not identical to those from the original random forest fit. In general, the results should be similar.
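For comparison, `ranger` can also be called directly, without `caret`. The following is a minimal sketch rather than a tuned model. It recreates the same train-test split, and `mtry = 4` and `num.trees = 500` are arbitrary illustrative choices, not values selected by the cross-validation above. The names `ranger_direct`, `oob_rmse`, `tst_pred`, and `tst_rmse` are introduced here.

```r
library(MASS)    # Boston data
library(ranger)

# Recreate the train-test split used earlier
set.seed(42)
boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
boston_trn = Boston[boston_idx, ]
boston_tst = Boston[-boston_idx, ]

# Fit directly with ranger; mtry = 4 is an arbitrary illustrative value
ranger_direct = ranger(medv ~ ., data = boston_trn,
                       num.trees = 500, mtry = 4,
                       num.threads = 1, seed = 42)

# For regression, prediction.error is the out-of-bag mean squared error
oob_rmse = sqrt(ranger_direct$prediction.error)

# predict() for ranger takes `data` and returns a list;
# the predicted values are stored in $predictions
tst_pred = predict(ranger_direct, data = boston_tst)$predictions
tst_rmse = sqrt(mean((boston_tst$medv - tst_pred) ^ 2))

c(oob = oob_rmse, test = tst_rmse)
```

Unlike `extraTrees`, `ranger` reports an out-of-bag error, so a quick estimate of performance is available without cross-validation.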

# `xgboost`

The `xgboost` package implements [eXtreme Gradient Boosting](http://xgboost.readthedocs.io/en/latest/model.html), which is similar to the methods found in `gbm`. When tuned well, `xgboost` can obtain excellent results and is frequently used to win Kaggle competitions. (In this example it beats `gbm`, but not the random forest based methods.)

```{r, message = FALSE, warning = FALSE}
library(gbm)
library(xgboost)
```


```{r, message = FALSE, warning = FALSE}
set.seed(42)
gbm_fit = train(medv ~ ., data = boston_trn,
                method = "gbm",
                trControl = cv_5,
                verbose = FALSE,
                tuneLength = 10)
```

```{r}
#gbm_fit
gbm_fit$bestTune
plot(gbm_fit)
rmse(predict(gbm_fit, boston_tst), boston_tst$medv)
```

```{r}
set.seed(42)
xgb_fit = train(medv ~ ., data = boston_trn,
                method = "xgbTree",
                trControl = cv_5,
                verbose = FALSE,
                tuneLength = 10)
```

```{r}
#xgb_fit
xgb_fit$bestTune
plot(xgb_fit)
rmse(predict(xgb_fit, boston_tst), boston_tst$medv)
```