9 Decision Trees and Random Forests

9.1 Classification and Regression Trees

9.1.1 Construct a decision tree using R

9.1.2 Packages

library(tidymodels)
library(palmerpenguins)
library(rpart)
library(skimr)
library(rpart.plot)
library(GGally)

9.1.3 Data

data(penguins)
skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

9.1.4 Exploratory Data Analysis

Your turn: Perform an exploratory data analysis
What are the limitations of the following chart? Could you suggest ways to reduce clutter and enhance its clarity?

ggpairs(penguins, aes(colour = species), progress = FALSE) +
    theme_bw()

9.1.5 Remove missing values

penguins <- drop_na(penguins)

9.1.6 Split data

library(rsample)
set.seed(123)
penguin_split <- initial_split(penguins, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

9.1.7 Display the dimensions of the training dataset

dim(penguin_test)

[1] 84  8

dim(penguin_train)

[1] 249   8

9.1.8 Build Decision Tree

tree1 <- rpart(species ~ ., penguin_train,  cp = 0.1)
rpart.plot(tree1, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.1.9 Make predictions

head(predict(tree1, penguin_test))

     Adelie  Chinstrap Gentoo
1 0.9626168 0.03738318      0
2 0.9626168 0.03738318      0
3 0.9626168 0.03738318      0
4 0.9626168 0.03738318      0
5 0.9626168 0.03738318      0
6 0.9626168 0.03738318      0

head(predict(tree1, penguin_test,  type = "class"))

     1      2      3      4      5      6 
Adelie Adelie Adelie Adelie Adelie Adelie 
Levels: Adelie Chinstrap Gentoo

9.1.10 Evaluate accuracy over the test set

library(caret)

Loading required package: lattice


Attaching package: 'caret'

The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity

The following object is masked from 'package:purrr':

    lift

penguin_test$predict <- predict(tree1, penguin_test,  type = "class")
confusionMatrix(data = penguin_test$predict,
                reference = penguin_test$species)

Confusion Matrix and Statistics

           Reference
Prediction  Adelie Chinstrap Gentoo
  Adelie        35         0      0
  Chinstrap      2        14      0
  Gentoo         0         3     30

Overall Statistics
                                          
               Accuracy : 0.9405          
                 95% CI : (0.8665, 0.9804)
    No Information Rate : 0.4405          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9066          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Adelie Class: Chinstrap Class: Gentoo
Sensitivity                 0.9459           0.8235        1.0000
Specificity                 1.0000           0.9701        0.9444
Pos Pred Value              1.0000           0.8750        0.9091
Neg Pred Value              0.9592           0.9559        1.0000
Prevalence                  0.4405           0.2024        0.3571
Detection Rate              0.4167           0.1667        0.3571
Detection Prevalence        0.4167           0.1905        0.3929
Balanced Accuracy           0.9730           0.8968        0.9722

9.2 Random forests

Bagging ensemble method
Gives final prediction by aggregating the predictions of bootstrapped decision tree samples.
Trees in a random forest are independent of each other.
With ensemble methods, we get a new metric for assessing the predictive performance of the model, the out-of-bag error

9.3 Illustration of how random forests algorithm works

9.3.1 Boostrap Sample for the first tree

9.3.2 Out-of-bag sample for the first tree

9.3.3 Out-of-bag sample predictions

9.3.4 Variable importance measures in random forest

Permutation-based variable importance
Mean decrease in Gini coefficient

9.3.5 Random forest using R

set.seed(123)
library(randomForest)
rf <- randomForest(species ~ ., penguin_train)

9.3.6 Make predictions and evaluate accuracy over the test set

penguin_test$predictrf <- predict(rf, penguin_test,  type = "class")
confusionMatrix(data = penguin_test$predictrf,
                reference = penguin_test$species)

Confusion Matrix and Statistics

           Reference
Prediction  Adelie Chinstrap Gentoo
  Adelie        37         0      0
  Chinstrap      0        17      0
  Gentoo         0         0     30

Overall Statistics
                                    
               Accuracy : 1         
                 95% CI : (0.957, 1)
    No Information Rate : 0.4405    
    P-Value [Acc > NIR] : < 2.2e-16 
                                    
                  Kappa : 1         
                                    
 Mcnemar's Test P-Value : NA        

Statistics by Class:

                     Class: Adelie Class: Chinstrap Class: Gentoo
Sensitivity                 1.0000           1.0000        1.0000
Specificity                 1.0000           1.0000        1.0000
Pos Pred Value              1.0000           1.0000        1.0000
Neg Pred Value              1.0000           1.0000        1.0000
Prevalence                  0.4405           0.2024        0.3571
Detection Rate              0.4405           0.2024        0.3571
Detection Prevalence        0.4405           0.2024        0.3571
Balanced Accuracy           1.0000           1.0000        1.0000

9.3.7 Receiver Operating Characteristic (ROC) curves

Graphical plot that illustrates the performance of a binary classifier model.

9.3.8 When to Use an ROC Curve?

Binary Classification: It is particularly useful when you have two classes (e.g., yes/no, positive/negative).
Imbalanced Datasets: It’s beneficial for imbalanced classes since it doesn’t rely on accuracy, which can be biased when one class is significantly more frequent than the other.
Comparing Models: It’s a great tool for comparing multiple classifiers on the same dataset to see which performs better across thresholds.

9.4 Simulation illustrating ROC curve calculations

set.seed(123)  # For reproducibility

# Simulate true labels (0 or 1) for 100 observations
true_labels <- sample(c(0, 1), 100, replace = TRUE)

# Simulate predicted probabilities for the positive class (between 0 and 1)
predicted_probs <- runif(100)
head(predicted_probs)

[1] 0.5999890 0.3328235 0.4886130 0.9544738 0.4829024 0.8903502

# Define thresholds from 0 to 1
thresholds <- seq(0, 1, by = 0.01)
head(thresholds)

[1] 0.00 0.01 0.02 0.03 0.04 0.05

# Initialize vectors to store TPR and FPR values
tpr_values <- numeric(length(thresholds))
fpr_values <- numeric(length(thresholds))

# Calculate TPR and FPR at each threshold
for (i in 1:length(thresholds)) {
  threshold <- thresholds[i]
  
  # Classify samples based on the threshold
  predicted_class <- ifelse(predicted_probs >= threshold, 1, 0)
  
  # Calculate confusion matrix components
  TP <- sum((predicted_class == 1) & (true_labels == 1))  # True Positives
  FP <- sum((predicted_class == 1) & (true_labels == 0))  # False Positives
  TN <- sum((predicted_class == 0) & (true_labels == 0))  # True Negatives
  FN <- sum((predicted_class == 0) & (true_labels == 1))  # False Negatives
  
  # Calculate TPR and FPR
  tpr_values[i] <- TP / (TP + FN)
  fpr_values[i] <- FP / (FP + TN)
}

# View first few TPR and FPR values
df <- data.frame(thresholds, tpr_values, fpr_values)
head(df)

  thresholds tpr_values fpr_values
1       0.00  1.0000000          1
2       0.01  1.0000000          1
3       0.02  0.9767442          1
4       0.03  0.9767442          1
5       0.04  0.9767442          1
6       0.05  0.9767442          1

# Plot ROC curve
ggplot(df, aes(x=fpr_values, y=tpr_values)) + geom_line()