Using {xgboost} and {SLmetrics} in regression tasks

In this section, a gradient boosting machine (GBM) is trained on the obesity dataset and evaluated using {SLmetrics}. The GBM trained here is an extreme gradient boosting machine from {xgboost}.¹

# 1) attach {SLmetrics}
# and load the data
library(SLmetrics)
data("obesity", package = "SLmetrics")

Data preparation

# 1.1) define the features
# and outcomes
outcome  <- obesity$target$regression
features <- obesity$features

# 2) split data in training
# and test

# 2.1) set seed
# for reproducibility
set.seed(1903)

# 2.2) extract
# indices with a simple
# 95/5 split
index <- sample(1:nrow(features), size = 0.95 * nrow(features))

# 2.3) extract training
# data and construct
# as xgb.DMatrix
train <- features[index,]
dtrain <- xgboost::xgb.DMatrix(
    data  = data.matrix(train),
    label = outcome[index]
)

# 2.4) extract test
# data
test <- features[-index,]

# 2.4.1) extract actual
# values as a numeric
# vector for {SLmetrics}
# methods
actual <- outcome[-index]

# 2.4.2) construct as xgb.DMatrix
# for predict method
dtest <- xgboost::xgb.DMatrix(
    data  = data.matrix(test),
    label = actual
)
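
The index-based split used above can be checked on toy data. The following is a minimal sketch in base R, assuming a hypothetical data frame `toy` that stands in for `obesity$features`; the sample size is rounded explicitly here, which the code above does not do.

```r
# hypothetical toy data standing in for obesity$features
set.seed(1903)
toy <- data.frame(
    id = seq_len(100),
    x  = rnorm(100)
)

# draw 95% of the row indices without replacement;
# round() makes the integer sample size explicit
index <- sample(
    seq_len(nrow(toy)),
    size = round(0.95 * nrow(toy))
)

# positive indexing keeps the sampled rows,
# negative indexing keeps the remaining rows
train <- toy[index, ]
test  <- toy[-index, ]

# the two subsets partition the data:
# no shared rows, and together they cover all rows
shared_rows <- intersect(train$id, test$id)
total_rows  <- nrow(train) + nrow(test)
```

Because `-index` drops exactly the rows that `index` selects, no observation ends up in both the training and the test set.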

Training the GBM

Evaluation function

# 1) define the custom
# evaluation metric
eval_rrse <- function(
    preds, 
    dtrain) {

        # 1) extract values
        actual    <- xgboost::getinfo(dtrain, "label")
        predicted <- preds
        value     <- rrse(
            actual    = actual,
            predicted = predicted
        )

        # 2) construct output
        # list
        list(
            metric = "RRMSE",
            value  = value
        )
    
}
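
To make the metric concrete, here is a minimal base R sketch of the root relative squared error under its common textbook definition: the model's squared error relative to that of a mean-only predictor. The helper `rrse_sketch()` is hypothetical and is not the {SLmetrics} implementation, which may differ in details such as weighting.

```r
# hypothetical helper for illustration;
# not the {SLmetrics} implementation
rrse_sketch <- function(actual, predicted) {

    # squared error of the model
    model_error <- sum((actual - predicted)^2)

    # squared error of a baseline that
    # always predicts the mean of actual
    baseline_error <- sum((actual - mean(actual))^2)

    sqrt(model_error / baseline_error)
}

# toy values
y <- c(1, 2, 3, 4)

# predicting the mean everywhere matches
# the baseline, so the ratio is 1
rrse_mean <- rrse_sketch(y, rep(mean(y), length(y)))

# perfect predictions give 0
rrse_perfect <- rrse_sketch(y, y)
```

Under this definition, values below 1 indicate that the model improves on the mean-only baseline, which is why the metric is minimized during training.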

Training the GBM

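We train the model using the xgb.train()-function, passing eval_rrse as the custom evaluation metric and monitoring both the training and the test set: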

# 1) model training
model <- xgboost::xgb.train(
    data    = dtrain,
    nrounds = 10L,
    verbose = 0,
    feval   = eval_rrse,
    watchlist = list(
        train = dtrain,
        test  = dtest
    ),
    maximize = FALSE
)

Performance Evaluation

We extract the out-of-sample predictions using the predict()-function:

# 1) out of sample
# prediction
predicted <- predict(
    model,
    newdata = dtest
)

We summarize the performance using the root relative squared error (computed with rrse() and labeled RRMSE in the output), the root mean squared error (RMSE), and the concordance correlation coefficient (CCC):

# 1) summarize all
# performance measures 
# in data.frame
data.frame(
    RRMSE  = rrse(actual, predicted), 
    RMSE   = rmse(actual, predicted),
    CCC    = ccc(actual, predicted)
)

#>       RRMSE     RMSE       CCC
#> 1 0.4115731 10.76499 0.9062945
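
For readers who want to see what the other two measures compute, the following base R sketch implements RMSE and Lin's concordance correlation coefficient with population (divide-by-n) moments. The helpers `rmse_sketch()` and `ccc_sketch()` are hypothetical; the {SLmetrics} functions may use different defaults, for example regarding bias correction.

```r
# hypothetical helpers for illustration;
# not the {SLmetrics} implementations
rmse_sketch <- function(actual, predicted) {
    sqrt(mean((actual - predicted)^2))
}

ccc_sketch <- function(actual, predicted) {

    # means
    mu_a <- mean(actual)
    mu_p <- mean(predicted)

    # population variances and covariance
    # (divide by n; an assumption of this sketch)
    var_a  <- mean((actual - mu_a)^2)
    var_p  <- mean((predicted - mu_p)^2)
    cov_ap <- mean((actual - mu_a) * (predicted - mu_p))

    # Lin's concordance correlation coefficient
    2 * cov_ap / (var_a + var_p + (mu_a - mu_p)^2)
}

# toy values
y     <- c(10, 20, 30, 40)
y_hat <- c(12, 18, 33, 39)

toy_rmse <- rmse_sketch(y, y_hat)
toy_ccc  <- ccc_sketch(y, y_hat)

# a constant shift leaves the correlation at 1
# but lowers the concordance
shifted_ccc <- ccc_sketch(y, y + 5)
```

Unlike the Pearson correlation, the CCC penalizes systematic shifts between actual and predicted values, so it only reaches 1 when the predictions lie on the identity line.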

¹ The obesity dataset comes (almost) ready for analysis. See the repo for more details on the data-manipulation steps taken.