library(mlr3)
library(mlr3viz)
library(mlr3learners)
library(mlr3data)
library(ggplot2)
library(patchwork)
library(data.table)
options(datatable.print.nrows = 20)
Introduction
I am attempting to learn how to use {mlr3} (Lang et al. 2019), by reading through the book Applied Machine Learning Using mlr3 in R (Bischl et al. 2024).
My previous posts include:
- Part one:
  - Create a classification tree model to predict diabetes.
  - Look at the confusion matrix and create measures without using {mlr3measures}.
  - Change the thresholds in the model.
In this second blog post, I am going through the exercises given in Section 3 (Casalicchio and Burk 2024). This involves applying a repeated cross-validation resampling strategy, using a custom resampling strategy, and writing a function that produces a ROC curve.
Prerequisites
Exercises
- Apply a repeated cross-validation resampling strategy on tsk("mtcars") and evaluate the performance of lrn("regr.rpart").
  - Use five repeats of three folds each.
  - Calculate the MSE for each iteration and visualize the result.
  - Finally, calculate the aggregated performance score.
- Use tsk("spam") and five-fold CV to benchmark lrn("classif.ranger"), lrn("classif.log_reg"), and lrn("classif.xgboost", nrounds = 100) with respect to AUC.
  - Which learner appears to perform best?
  - How confident are you in your conclusion?
  - Think about the stability of results and investigate this by re-running the experiment with different seeds.
  - What can be done to improve this?
- A colleague reports a 93.1% classification accuracy using lrn("classif.rpart") on tsk("penguins_simple").
  - You want to reproduce their results and ask them about their resampling strategy.
  - They said they used a custom three-fold CV with folds assigned as factor(task$row_ids %% 3).
  - See if you can reproduce their results.
- (*) Program your own ROC plotting function without using mlr3’s autoplot() function.
  - The signature of your function should be my_roc_plot(task, learner, train_indices, test_indices).
  - Your function should use the $set_threshold() method of Prediction, as well as mlr3measures.
First, let’s suppress all messaging unless it’s a warning:1
1 See Section 10.3 of the tutorial for more information about mlr3 logging output.
lgr::get_logger("mlr3")$set_threshold("warn")
Question 1
Apply a repeated cross-validation resampling strategy on tsk("mtcars") and evaluate the performance of lrn("regr.rpart").
- Use five repeats of three folds each.
- Calculate the MSE for each iteration and visualize the result.
- Finally, calculate the aggregated performance score.
Answer
First, I’ll load the Task, Learner, and create the rsmp() object.
tsk_mtcars <- tsk("mtcars")
tsk_mtcars
── <TaskRegr> (32x11): Motor Trends ────────────────────────────────────────────
• Target: mpg
• Properties: -
• Features (10):
• dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
# load learner
lrn_rpart <- lrn("regr.rpart")
lrn_rpart
── <LearnerRegrRpart> (regr.rpart): Regression Tree ────────────────────────────
• Model: -
• Parameters: xval=0
• Packages: mlr3 and rpart
• Predict Types: [response]
• Feature Types: logical, integer, numeric, factor, and ordered
• Encapsulation: none (fallback: -)
• Properties: importance, missings, selected_features, and weights
• Other settings: use_weights = 'use'
# load resampling method: 5 lots of three-fold CV
rcv53 = rsmp("repeated_cv", repeats = 5, folds = 3)
rcv53
── <ResamplingRepeatedCV> : Repeated Cross-Validation ──────────────────────────
• Iterations: 15
• Instantiated: FALSE
• Parameters: folds=3, repeats=5
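As an aside, the folds themselves are only fixed once the resampling object is instantiated on a task; resample() does this automatically, but it can also be done by hand if you want to peek at the splits first. A small sketch:
# a sketch: instantiate manually to inspect the splits before resampling
rcv53$instantiate(tsk_mtcars)
rcv53$train_set(1)  # row ids used for training in iteration 1
rcv53$test_set(1)   # row ids held out in iteration 1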
Now, I’ll use the resample() function to run the resampling strategy.
rr <- resample(tsk_mtcars, lrn_rpart, rcv53)
rr
── <ResampleResult> with 15 resampling iterations ──────────────────────────────
task_id learner_id resampling_id iteration prediction_test warnings errors
mtcars regr.rpart repeated_cv 1 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 2 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 3 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 4 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 5 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 6 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 7 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 8 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 9 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 10 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 11 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 12 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 13 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 14 <PredictionRegr> 0 0
mtcars regr.rpart repeated_cv 15 <PredictionRegr> 0 0
Calculating the MSE for each iteration requires running $score().
rr_mse <- rr$score(msr("regr.mse"))
rr_mse
    task_id learner_id resampling_id iteration regr.mse
<char> <char> <char> <int> <num>
1: mtcars regr.rpart repeated_cv 1 36.40191
2: mtcars regr.rpart repeated_cv 2 11.71977
3: mtcars regr.rpart repeated_cv 3 23.26404
4: mtcars regr.rpart repeated_cv 4 19.87484
5: mtcars regr.rpart repeated_cv 5 25.32809
6: mtcars regr.rpart repeated_cv 6 23.96217
7: mtcars regr.rpart repeated_cv 7 20.17025
8: mtcars regr.rpart repeated_cv 8 14.55125
9: mtcars regr.rpart repeated_cv 9 23.05587
10: mtcars regr.rpart repeated_cv 10 25.35146
11: mtcars regr.rpart repeated_cv 11 18.06417
12: mtcars regr.rpart repeated_cv 12 12.90876
13: mtcars regr.rpart repeated_cv 13 23.81826
14: mtcars regr.rpart repeated_cv 14 42.01748
15: mtcars regr.rpart repeated_cv 15 19.90676
Hidden columns: task, learner, resampling, prediction_test
Let’s plot this.
autoplot(rr, measure = msr("regr.mse"), type = "boxplot") +
  autoplot(rr, measure = msr("regr.mse"), type = "histogram")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
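If more control over the plot is wanted, the same per-iteration scores can be plotted directly with {ggplot2}; a minimal sketch using the rr_mse table from above:
# a sketch: plot the per-iteration MSE manually instead of via autoplot()
ggplot(rr_mse, aes(x = factor(iteration), y = regr.mse)) +
  geom_col() +
  labs(x = "Resampling iteration", y = "MSE") +
  theme_minimal()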

Aggregating the MSE scores (using macro aggregation) gives:
rr$aggregate(msr("regr.mse"))
regr.mse
22.69301
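Since regr.mse uses macro averaging by default, this aggregate should simply be the unweighted mean of the fifteen per-iteration scores; a quick sanity check (assuming the default averaging):
# macro aggregation: the mean of the per-iteration MSE values
mean(rr_mse$regr.mse)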
Question 2
Use tsk("spam") and five-fold CV to benchmark lrn("classif.ranger"), lrn("classif.log_reg"), and lrn("classif.xgboost", nrounds = 100) with respect to AUC.
- Which learner appears to perform best?
- How confident are you in your conclusion?
- Think about the stability of results and investigate this by re-running the experiment with different seeds.
- What can be done to improve this?
Answer
Let’s load the task, learners and resampling method.
tsk_spam <- tsk("spam")
tsk_spam
── <TaskClassif> (4601x58): HP Spam Detection ──────────────────────────────────
• Target: type
• Target classes: spam (positive class, 39%), nonspam (61%)
• Properties: twoclass
• Features (57):
• dbl (57): address, addresses, all, business, capitalAve, capitalLong,
capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
money, num000, num1999, num3d, num415, num650, num85, num857, order,
original, our, over, parts, people, pm, project, re, receive, remove, report,
table, technology, telnet, will, you, your
# set up learners
# first set up the 'lrns()' then modify the xgboost 'nrounds' argument
learners <- lrns(c("classif.ranger", "classif.log_reg", "classif.xgboost"),
predict_type = "prob")
# adjust 'nrounds' argument for xgboost
learners$classif.xgboost$param_set$values$nrounds <- 100
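As an aside, the same hyperparameter could have been set directly when constructing the learner, which avoids the second step; a sketch of the equivalent call:
# a sketch: equivalent construction with nrounds set up front
lrn_xgb <- lrn("classif.xgboost", nrounds = 100, predict_type = "prob")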
learners
$classif.ranger
── <LearnerClassifRanger> (classif.ranger): Random Forest ──────────────────────
• Model: -
• Parameters: num.threads=1
• Packages: mlr3, mlr3learners, and ranger
• Predict Types: response and [prob]
• Feature Types: logical, integer, numeric, character, factor, and ordered
• Encapsulation: none (fallback: -)
• Properties: hotstart_backward, importance, missings, multiclass, oob_error,
selected_features, twoclass, and weights
• Other settings: use_weights = 'use'
$classif.log_reg
── <LearnerClassifLogReg> (classif.log_reg): Logistic Regression ───────────────
• Model: -
• Parameters: use_pred_offset=TRUE
• Packages: mlr3, mlr3learners, and stats
• Predict Types: response and [prob]
• Feature Types: logical, integer, numeric, character, factor, and ordered
• Encapsulation: none (fallback: -)
• Properties: offset, twoclass, and weights
• Other settings: use_weights = 'use'
$classif.xgboost
── <LearnerClassifXgboost> (classif.xgboost): Extreme Gradient Boosting ────────
• Model: -
• Parameters: nrounds=100, nthread=1, verbose=0
• Validate: NULL
• Packages: mlr3, mlr3learners, and xgboost
• Predict Types: response and [prob]
• Feature Types: logical, integer, and numeric
• Encapsulation: none (fallback: -)
• Properties: hotstart_forward, importance, internal_tuning, missings,
multiclass, offset, twoclass, validation, and weights
• Other settings: use_weights = 'use'
# set up resampling
cv5 <- rsmp("cv", folds = 5)
cv5
── <ResamplingCV> : Cross-Validation ───────────────────────────────────────────
• Iterations: 5
• Instantiated: FALSE
• Parameters: folds=5
Now we can set up the benchmark grid.
set.seed(1)
design <- benchmark_grid(tsk_spam, learners, cv5)
design
     task         learner resampling
<char> <char> <char>
1: spam classif.ranger cv
2: spam classif.log_reg cv
3: spam classif.xgboost cv
Now, let’s see how well these perform in terms of AUC.2
2 Recall: AUC can be interpreted as the probability that a randomly chosen positive instance has a higher predicted probability of belonging to the positive class than a randomly chosen negative instance
bmr <- benchmark(design)
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
bmr$score(msr("classif.auc"))[, .(learner_id, iteration, classif.auc)]
         learner_id iteration classif.auc
<char> <int> <num>
1: classif.ranger 1 0.9905328
2: classif.ranger 2 0.9807452
3: classif.ranger 3 0.9861848
4: classif.ranger 4 0.9818771
5: classif.ranger 5 0.9914047
6: classif.log_reg 1 0.9729649
7: classif.log_reg 2 0.9604636
8: classif.log_reg 3 0.9831002
9: classif.log_reg 4 0.9646964
10: classif.log_reg 5 0.9757900
11: classif.xgboost 1 0.9896168
12: classif.xgboost 2 0.9844034
13: classif.xgboost 3 0.9909272
14: classif.xgboost 4 0.9853751
15: classif.xgboost 5 0.9919197
And let’s aggregate by Learner.
bmr$aggregate(msr("classif.auc")) nr task_id learner_id resampling_id iters classif.auc
<int> <char> <char> <char> <int> <num>
1: 1 spam classif.ranger cv 5 0.9861489
2: 2 spam classif.log_reg cv 5 0.9714030
3: 3 spam classif.xgboost cv 5 0.9884484
Hidden columns: resample_result
autoplot(bmr, measure = msr("classif.auc"))
So, from a naive look at this, it appears that the XGBoost model performs best (highest AUC). However, the results from all three models are very similar, and in this case I might prefer the “simpler” model (here, the logistic regression) over the more flexible ones.
Let’s re-run the benchmark five times with different seeds and see how the AUC varies.
bmr_auc <- rbindlist(lapply(seq_len(5), function(i) {
tmp_seed <- i * 100
set.seed(tmp_seed)
design <- benchmark_grid(tsk_spam, learners, cv5)
bmr <- benchmark(design)
data.table(
seed = tmp_seed,
auc = bmr$aggregate(msr("classif.auc"))
)
})
)
bmr_auc[, .(seed, auc.learner_id, auc.classif.auc)]
     seed  auc.learner_id auc.classif.auc
<num> <char> <num>
1: 100 classif.ranger 0.9866242
2: 100 classif.log_reg 0.9708618
3: 100 classif.xgboost 0.9883961
4: 200 classif.ranger 0.9864776
5: 200 classif.log_reg 0.9705954
6: 200 classif.xgboost 0.9879994
7: 300 classif.ranger 0.9855379
8: 300 classif.log_reg 0.9710461
9: 300 classif.xgboost 0.9878387
10: 400 classif.ranger 0.9866107
11: 400 classif.log_reg 0.9697749
12: 400 classif.xgboost 0.9888800
13: 500 classif.ranger 0.9862768
14: 500 classif.log_reg 0.9702488
15: 500 classif.xgboost 0.9879613
# some summary stats
bmr_auc[, as.list(summary(auc.classif.auc)), by = auc.learner_id]
    auc.learner_id      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
<char> <num> <num> <num> <num> <num> <num>
1: classif.ranger 0.9855379 0.9862768 0.9864776 0.9863055 0.9866107 0.9866242
2: classif.log_reg 0.9697749 0.9702488 0.9705954 0.9705054 0.9708618 0.9710461
3: classif.xgboost 0.9878387 0.9879613 0.9879994 0.9882151 0.9883961 0.9888800
Although XGBoost achieved the highest AUC on average, the difference compared to ranger was minimal across repeated runs (although XGBoost very slightly outperformed the random forest model after aggregation in every run). Confidence intervals or additional repeats could provide better insight into whether the observed difference is meaningful; one option is sketched below. Ultimately, the choice of model will depend on how much that small difference in AUC matters relative to model complexity.
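One concrete option is to use more resampling iterations, for example repeated cross-validation, so that the aggregated AUC is less sensitive to any single set of folds; a sketch (not run here, as it is considerably slower):
# a sketch: repeated CV gives 50 resampling iterations per learner
rcv_10x5 <- rsmp("repeated_cv", repeats = 10, folds = 5)
design_rcv <- benchmark_grid(tsk_spam, learners, rcv_10x5)
# bmr_rcv <- benchmark(design_rcv)
# bmr_rcv$aggregate(msr("classif.auc"))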
Question 3
A colleague reports a \(93.1\%\) classification accuracy using lrn("classif.rpart") on tsk("penguins_simple").
- You want to reproduce their results and ask them about their resampling strategy.
- They said they used a custom three-fold CV with folds assigned as factor(task$row_ids %% 3).
- See if you can reproduce their results.
Answer
Let’s have a look at the Task. This task doesn’t seem to be included in the default {mlr3} package, but is referenced in the {mlr3data} (Becker 2024) docs.
tsk_penguins <- tsk("penguins_simple")
tsk_penguins
── <TaskClassif> (333x11): Simplified Palmer Penguins ──────────────────────────
• Target: species
• Target classes: Adelie (44%), Gentoo (36%), Chinstrap (20%)
• Properties: multiclass
• Features (10):
• dbl (7): bill_depth, bill_length, island.Biscoe, island.Dream,
island.Torgersen, sex.female, sex.male
• int (3): body_mass, flipper_length, year
OK, so this is a multi-class classification task, using 10 features to predict the species of the penguin.
They said they used a custom three-fold CV, so let’s try and reproduce this. By looking at factor(tsk_penguins$row_ids %% 3), we can see that the CV is putting every third observation into the same fold. This feels weird and wrong, but fine.3
3 This folding strategy does not ensure class balance within each fold and may lead to biased performance estimates, particularly in smaller datasets.
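A quick way to see what that fold assignment actually does (a small sketch, just for illustration):
# every third row gets the same fold label: 1, 2, 0, 1, 2, 0, ...
fold_labels <- factor(tsk("penguins_simple")$row_ids %% 3)
head(fold_labels, 9)
table(fold_labels)  # 333 rows split into three folds of 111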
# load learner
lrn_rpart <- lrn("classif.rpart")
lrn_rpart
── <LearnerClassifRpart> (classif.rpart): Classification Tree ──────────────────
• Model: -
• Parameters: xval=0
• Packages: mlr3 and rpart
• Predict Types: [response] and prob
• Feature Types: logical, integer, numeric, factor, and ordered
• Encapsulation: none (fallback: -)
• Properties: importance, missings, multiclass, selected_features, twoclass,
and weights
• Other settings: use_weights = 'use'
# create custom resampling strategy
rsmp_custom = rsmp("custom_cv")
folds <- factor(tsk_penguins$row_ids %% 3)
rsmp_custom$instantiate(tsk_penguins, f = folds)
rr <- resample(tsk_penguins, lrn_rpart, rsmp_custom)
rr$predictions()
[[1]]
── <PredictionClassif> for 111 observations: ───────────────────────────────────
row_ids truth response
3 Adelie Adelie
6 Adelie Adelie
9 Adelie Adelie
--- --- ---
327 Chinstrap Chinstrap
330 Chinstrap Adelie
333 Chinstrap Chinstrap
[[2]]
── <PredictionClassif> for 111 observations: ───────────────────────────────────
row_ids truth response
1 Adelie Adelie
4 Adelie Adelie
7 Adelie Adelie
--- --- ---
325 Chinstrap Chinstrap
328 Chinstrap Chinstrap
331 Chinstrap Chinstrap
[[3]]
── <PredictionClassif> for 111 observations: ───────────────────────────────────
row_ids truth response
2 Adelie Adelie
5 Adelie Adelie
8 Adelie Adelie
--- --- ---
326 Chinstrap Gentoo
329 Chinstrap Gentoo
332 Chinstrap Gentoo
rr$score(msr("classif.acc")) task_id learner_id resampling_id iteration classif.acc
<char> <char> <char> <int> <num>
1: penguins classif.rpart custom_cv 1 0.9369369
2: penguins classif.rpart custom_cv 2 0.9189189
3: penguins classif.rpart custom_cv 3 0.9369369
Hidden columns: task, learner, resampling, prediction_test
rr$aggregate(msr("classif.acc"))classif.acc
0.9309309
So, we get a model with \(93.1\%\) accuracy, as required.4
4 Would the results change much with, for example, grouped resampling? I should look at this at some point.
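For a quick comparison, one could also check how an ordinary random three-fold CV behaves on the same task and learner; a sketch (not run here):
# a sketch: ordinary three-fold CV instead of the custom fold assignment
set.seed(42)
rr_cv3 <- resample(tsk_penguins, lrn_rpart, rsmp("cv", folds = 3))
rr_cv3$aggregate(msr("classif.acc"))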
Question 4
(*) Program your own ROC plotting function without using mlr3’s autoplot() function.
- The signature of your function should be my_roc_plot(task, learner, train_indices, test_indices).
- Your function should use the $set_threshold() method of Prediction, as well as mlr3measures.
Answer
Let’s first have a look at the output from using autoplot(). I’ll use the german_credit task.
tsk_german = tsk("german_credit")
tsk_german
── <TaskClassif> (1000x21): German Credit ──────────────────────────────────────
• Target: credit_risk
• Target classes: good (positive class, 70%), bad (30%)
• Properties: twoclass
• Features (20):
• fct (14): credit_history, employment_duration, foreign_worker, housing,
job, other_debtors, other_installment_plans, people_liable,
personal_status_sex, property, purpose, savings, status, telephone
• int (3): age, amount, duration
• ord (3): installment_rate, number_credits, present_residence
lrn_ranger = lrn("classif.ranger", predict_type = "prob")
splits = partition(tsk_german, ratio = 0.8)
lrn_ranger$train(tsk_german, splits$train)
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")
prediction = lrn_ranger$predict(tsk_german, splits$test)
autoplot(prediction, type = "roc")
First, I’ll do all the steps to create the ROC, then I’ll wrap this in a function, my_roc_plot().
Creating the ROC
OK – so I need to use $set_threshold() to obtain predictions over the range of thresholds. Then, I need to use {mlr3measures} (Lang, Becker, and Koers 2024) to compute the TPR (Sensitivity) and FPR (\(1 -\) Specificity) and plot these all on a lovely graph.
- Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive.
- Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative.
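In confusion-matrix terms these are TP / (TP + FN) and FP / (FP + TN) respectively; a tiny example with made-up counts, purely to make the definitions concrete:
# hypothetical counts, for illustration only
tp <- 90; fn <- 10   # true positives / false negatives
fp <- 20; tn <- 80   # false positives / true negatives
tp / (tp + fn)   # sensitivity (TPR) = 0.9
fp / (fp + tn)   # 1 - specificity (FPR) = 0.2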
I’ll first check to see which is the positive outcome in the Task.
tsk_german$positive
[1] "good"
# also, by looking at the help file 'prediction$help()'
# can see that the positive class is the first level of '$truth', i.e.
levels(prediction$truth)[1]
[1] "good"
So having good credit is the positive outcome here.
Now, I’ll create a vector of thresholds5 and then obtain predictions and calculate the measures.
5 Thresholds were discussed in Section 2.5.4 of the mlr3 tutorial, and I looked at them in Question 3 of my previous post
positive_class <- levels(prediction$truth)[1]
thresholds <- seq(0, 1, length = 101)
tsk_german_measures <- rbindlist(
lapply(thresholds, function(j) {
prediction$set_threshold(j)
tpr_tmp <- mlr3measures::tpr(truth = prediction$truth,
response = prediction$response,
positive = positive_class)
fpr_tmp <- mlr3measures::fpr(truth = prediction$truth,
response = prediction$response,
positive = positive_class)
data.table(threshold = j,
tpr = tpr_tmp,
fpr = fpr_tmp)
}
)
)
# order by increasing fpr, and tpr
# s.t. the step function avoids spikes
# spikes are happening as seed not set in $set_threshold(),
# so possible to get non-monotonic tpr/ fpr
# also put them in descending threshold order, just to make the data look nicer.
tsk_german_measures <- tsk_german_measures[order(fpr, tpr, -threshold)]
tsk_german_measures
     threshold         tpr   fpr
<num> <num> <num>
1: 1.00 0.000000000 0
2: 0.99 0.000000000 0
3: 0.98 0.006896552 0
4: 0.97 0.013793103 0
5: 0.96 0.013793103 0
---
97: 0.04 1.000000000 1
98: 0.03 1.000000000 1
99: 0.02 1.000000000 1
100: 0.01 1.000000000 1
101: 0.00 1.000000000 1
OK, I think I’ve got everything required to plot the ROC.
ggplot(tsk_german_measures, aes(x = fpr, y = tpr)) +
geom_step() +
geom_abline(intercept = 0, slope = 1,
linetype = "dotted", colour = "grey") +
labs(x = "1 - Specificity",
y = "Sensitivity") +
theme_minimal()
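As a rough cross-check (a sketch), the area under this step curve can be approximated from the same table with the trapezoidal rule and compared against mlr3’s AUC measure on the same prediction object; it won’t match exactly, since the threshold grid only has 101 points:
# trapezoidal approximation of the AUC from the (fpr, tpr) pairs
roc_dt <- tsk_german_measures[order(fpr, tpr)]
sum(diff(roc_dt$fpr) * (head(roc_dt$tpr, -1) + tail(roc_dt$tpr, -1)) / 2)
# exact AUC computed from the predicted probabilities
prediction$score(msr("classif.auc"))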
Making the function my_roc_plot()
my_roc_plot <- function(task, learner, train_indices, test_indices) {
# task: a 'Task' object
# learner: a 'Learner' object
# train the learner on the task
learner$train(task, row_ids = train_indices)
# create the prediction object
prediction <- learner$predict(task, row_ids = test_indices)
# find TPR and FPR over a seq of thresholds
positive_class <- levels(prediction$truth)[1]
thresholds <- seq(0, 1, length = 101)
tpr_fpr_thresholds <- rbindlist(
lapply(thresholds, function(j) {
prediction$set_threshold(j)
tpr_tmp <- mlr3measures::tpr(truth = prediction$truth,
response = prediction$response,
positive = positive_class)
fpr_tmp <- mlr3measures::fpr(truth = prediction$truth,
response = prediction$response,
positive = positive_class)
data.table(threshold = j,
tpr = tpr_tmp,
fpr = fpr_tmp)
}
)
)
tpr_fpr_thresholds <- tpr_fpr_thresholds[order(fpr, tpr, -threshold)]
# and plot
ggplot(tpr_fpr_thresholds, aes(x = fpr, y = tpr)) +
geom_step() +
geom_abline(intercept = 0, slope = 1,
linetype = "dotted", colour = "grey") +
labs(x = "1 - Specificity",
y = "Sensitivity") +
theme_minimal()
}
Let’s test it:
my_roc_plot(task = tsk_german,
learner = lrn("classif.ranger", predict_type = "prob"),
train_indices = splits$train,
test_indices = splits$test)
Warning in .__Task__weights(self = self, private = private, super = super, : 'Task$weights' is deprecated.
Use 'Task$weights_learner' instead.
See help("Deprecated")

Cool, looks good!
Fin
Citation
@online{smith2025,
  author = {Smith, Paul},
title = {Getting {Started} with \{Mlr3\}},
date = {2025-03-17},
url = {https://pws3141.github.io/blog/posts/08-mlr3_evaluation_benchmarking/},
langid = {en}
}