4 Random Forests

4.1 Decision trees - Limitation

To capture a complex decision boundary we need to use a deep tree

In-class explanation

4.2 Bias-Variance Trade off

A deep decision tree has low bias and high variance.

4.3 Bagging (Bootstrap Aggregation)

Technique for reducing the variance of an estimated predicted function
Works well for high-variance, low-bias procedures, such as trees

4.4 Ensemble Methods

Combines several base models
Bagging (Bootstrap Aggregation) is an ensemble method

4.5 Ensemble Methods

“Ensemble learning gives credence to the idea of the “wisdom of crowds,” which suggests that the decision-making of a larger group of people is typically better than that of an individual expert.”

Source: https://www.ibm.com/cloud/learn/boosting

4.6 Bootstrap

Generate multiple samples of training data, via bootstrapping

Example

Training data: \(\{(y_1, x_1), (y_2, x_2), (y_3, x_3), (y_4, x_4)\}\)

Three samples generated from bootstrapping

Sample 1 = \(\{(y_1, x_1), (y_2, x_2), (y_3, x_3), (y_4, x_4)\}\)

Sample 2 = \(\{(y_1, x_1), (y_1, x_1), (y_1, x_1), (y_4, x_4)\}\)

Sample 3 = \(\{(y_1, x_1), (y_2, x_2), (y_1, x_1), (y_4, x_4)\}\)

4.7 Aggregation

Train a decision tree on each bootstrap sample of data without pruning.
Aggregate prediction using either voting or averaging

4.8 Bagging - in class diagram

4.9 Bagging

Pros

Ease of implementation
Reduction of variance

Cons

Loss of interpretability
Computationally expensive

4.10 Bagging

Bootstrapped subsamples are created
A Decision Tree is formed on each bootstrapped sample.
The results of each tree are aggregated

4.11 Random Forests: Improving on Bagging

The ensembles of trees in Bagging tend to be highly correlated.
All of the bagged trees will look quite similar to each other. Hence, the predictions from the bagged trees will be highly correlated.

4.12 Random Forests

Bootstrap samples
At each split, randomly select a set of predictors from the full set of predictors
From the selected predictors we select the optimal predictor and the optimal corresponding threshold for the split.
Grow multiple trees and aggregate

4.13 Random Forests - Hyper parameters

Number of variables randomly sampled as candidates at each split
Number of trees to grow
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time).

Note: In theory, each tree in the random forest is full (not pruned), but in practice this can be computationally expensive,thus, imposing a minimum node size is not unusual.

4.14 Random Forests

Bagging ensemble method
Gives final prediction by aggregating the predictions of bootstrapped decision tree samples.
Trees in a random forest are independent of each other.

4.15 Random Forests

Pros

Accuracy

Cons

Speed
Interpretability
Overfitting

4.16 Out-of-bag error

With ensemble methods, we get a new metric for assessing the predictive performance of the model, the out-of-bag error

4.17 Random Forests

4.18 Random Forests

4.19 Out-of-Bag (OOB) Samples

4.20 Out-of-Bag (OOB) Samples

4.21 Predictions based on OOB observations

4.22 Predictions based on OOB observations

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

4.23 Variable Importance in Random Forest

contribution to predictive accuracy

Permutation-based variable importance
Mean decrease in Gini coefficient

4.24 Permutation-based variable importance

the OOB samples are passed down the tree, and the prediction accuracy is recorded
the values for the \(j^{th}\) variable are randomly permuted in the OOB samples, and the accuracy is again computed.
the decrease in accuracy as a result of this permuting is averaged over all trees, and is used as a measure of the importance of variable \(j\) in the random forests

4.25 Mean decrease in Gini coefficient

Measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest
The higher the value of mean decrease accuracy or mean decrease Gini score, the higher the importance of the variable in the model

4.26 Boosting

Bagging and boosting are two main types of ensemble learning methods.
The main difference between bagging and boosting is the way in which they are trained.
In bagging, weak learners (decision trees) are trained in parallel, but in boosting, they learn sequentially.

4.27 Boosting

Fit a single tree
Draw a sample that gives higher selection probabilities to misclassified records
Fit a tree to the new sample
Repeat Steps 2 and 3 multiple times
Use weighted voting to classify records, with heavier weights for later trees

4.28 Boosting

Iterative process.
Each tree is dependent on the previous one. Hence, it is hard to parallelize the training process of boosting algorithms.
The training time will be higher. This is the main drawback of boosting algorithms.

4.29 Boosting Algorithms

Adaptive boosting or AdaBoost
Gradient boosting
Extreme gradient boosting or XGBoost