Extreme Gradient Boosting (XGBoost) is a machine learning algorithm that builds an ensemble of gradient boosted decision trees. It is an efficient, regularized implementation of gradient boosting, widely used for its strong predictive accuracy on tabular data.
XGBoost improves generalization mainly by controlling overfitting during training. This is achieved through its objective function, which has two parts:
Loss function – measures how far the model’s predictions are from the actual values.
Regularization – controls the complexity of the model, helping to prevent overfitting.
By combining these, XGBoost builds a model that is both accurate and generalizes well to new data.
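In symbols (the standard formulation of the XGBoost objective, stated here for reference rather than taken from the code below), the objective sums a loss over predictions and a regularization penalty over every tree in the ensemble:

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}
```

Here \(l\) is the loss function (log loss in the example below), \(f_k\) is the \(k\)-th tree, \(T\) is its number of leaves, \(w\) its vector of leaf weights, and \(\gamma\), \(\lambda\) control how strongly model complexity is penalized.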
7.1 Without cross-validation
# Install packages if not already installed
# install.packages("xgboost")
# install.packages("caret")
# install.packages("Matrix")

library(xgboost)
library(caret)
library(Matrix)

# Example dataset: Iris (binary classification: setosa vs others)
data(iris)
iris$label <- ifelse(iris$Species == "setosa", 1, 0)  # binary label
iris$Species <- NULL                                  # remove original factor

# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(iris$label, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Convert data to matrix form
train_matrix <- as.matrix(train_data[, -ncol(train_data)])
train_label <- train_data$label
test_matrix <- as.matrix(test_data[, -ncol(test_data)])
test_label <- test_data$label

# Create DMatrix objects (XGBoost format)
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

# Set parameters
params <- list(
  booster = "gbtree",
  objective = "binary:logistic",
  eval_metric = "logloss",  # or "error"
  eta = 0.1,                # learning rate
  max_depth = 3,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train the model
set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10,
  print_every_n = 10
)
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 10 rounds.
[1] train-logloss:0.552090 test-logloss:0.552090
[11] train-logloss:0.179963 test-logloss:0.179392
[21] train-logloss:0.074191 test-logloss:0.073715
[31] train-logloss:0.036053 test-logloss:0.035705
[41] train-logloss:0.021509 test-logloss:0.021298
[51] train-logloss:0.019402 test-logloss:0.019212
Stopping. Best iteration:
[56] train-logloss:0.019402 test-logloss:0.019212
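As a natural next step (not part of the original listing), the fitted model can be evaluated on the held-out test set. This sketch assumes the objects `xgb_model`, `dtest`, and `test_label` created above; with `objective = "binary:logistic"`, `predict()` returns class-1 probabilities, which are thresholded at 0.5:

```r
# Predict probabilities on the test set
pred_prob <- predict(xgb_model, dtest)

# Convert probabilities to class labels (threshold at 0.5)
pred_label <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix and accuracy via caret
confusionMatrix(factor(pred_label, levels = c(0, 1)),
                factor(test_label, levels = c(0, 1)))
```

Given the very low test log loss reached during training, the confusion matrix should show near-perfect separation of setosa from the other species.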