7  Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is a machine learning algorithm based on gradient-boosted decision trees: it builds an ensemble of trees sequentially, with each new tree fitted to correct the errors of the trees before it. It is widely used because of its strong predictive accuracy and its fast, scalable implementation.

A key reason for XGBoost's accuracy is that it explicitly controls overfitting during training. It does this through a regularized objective function, which has two parts:

  1. Loss function – measures how far the model’s predictions are from the actual values.

  2. Regularization – controls the complexity of the model, helping to prevent overfitting.

By combining these, XGBoost builds a model that is both accurate and generalizes well to new data.
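In the standard XGBoost formulation, these two parts are combined into a single objective (a sketch using the usual notation, where l is the loss, f_k is the k-th tree, T is its number of leaves and w its vector of leaf weights):

    Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k),    where    Ω(f) = γT + (1/2) λ ‖w‖²

The first sum is the loss function from point 1; the second sum penalizes tree complexity as in point 2, so a deeper tree with larger leaf weights is kept only if it reduces the loss by more than the penalty it incurs. The parameters γ (gamma) and λ (lambda) tune how strong this penalty is.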

7.1 Without cross-validation

# Install packages if not already installed
# install.packages("xgboost")
# install.packages("caret")
# install.packages("Matrix")

library(xgboost)
library(caret)
library(Matrix)

# Example dataset: Iris (binary classification: setosa vs others)
data(iris)
iris$label <- ifelse(iris$Species == "setosa", 1, 0)  # binary label
iris$Species <- NULL  # remove original factor

# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(iris$label, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Convert data to matrix form
train_matrix <- as.matrix(train_data[, -ncol(train_data)])
train_label <- train_data$label
test_matrix <- as.matrix(test_data[, -ncol(test_data)])
test_label <- test_data$label

# Create DMatrix objects (XGBoost format)
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

# Set parameters
params <- list(
  booster = "gbtree",
  objective = "binary:logistic",
  eval_metric = "logloss",  # or "error"
  eta = 0.1,  # learning rate
  max_depth = 3,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train the model
set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10,
  print_every_n = 10
)
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 10 rounds.

[1] train-logloss:0.552090  test-logloss:0.552090 
[11]    train-logloss:0.179963  test-logloss:0.179392 
[21]    train-logloss:0.074191  test-logloss:0.073715 
[31]    train-logloss:0.036053  test-logloss:0.035705 
[41]    train-logloss:0.021509  test-logloss:0.021298 
[51]    train-logloss:0.019402  test-logloss:0.019212 
Stopping. Best iteration:
[56]    train-logloss:0.019402  test-logloss:0.019212

[56]    train-logloss:0.019402  test-logloss:0.019212 

# Make predictions
pred_prob <- predict(xgb_model, dtest)
pred_label <- ifelse(pred_prob > 0.5, 1, 0)

# Evaluate
confusionMatrix(factor(pred_label), factor(test_label))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 20  0
         1  0 10
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.6667     
    P-Value [Acc > NIR] : 5.215e-06  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.6667     
         Detection Rate : 0.6667     
   Detection Prevalence : 0.6667     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     

Note that confusionMatrix() has treated 0 as the positive class (by default it uses the first factor level); if you want sensitivity and specificity reported with respect to the setosa class, pass positive = "1" to confusionMatrix().
7.2 With cross-validation