
add chapter on validation and internal tuning #829

Merged
merged 52 commits on Nov 7, 2024
ff2b761  ...  (sebffischer, Aug 16, 2024)
1bef3af  ...  (sebffischer, Aug 16, 2024)
096fbfd  ...  (sebffischer, Aug 16, 2024)
2cc3919  ...  (sebffischer, Aug 16, 2024)
1b08540  add warning  (sebffischer, Aug 16, 2024)
369f16d  iterate  (sebffischer, Aug 16, 2024)
2cc0c64  ...  (sebffischer, Aug 16, 2024)
242df3f  ...  (berndbischl, Aug 17, 2024)
11b08ae  ...  (sebffischer, Aug 18, 2024)
93074a6  ...  (sebffischer, Aug 18, 2024)
31b3295  ...  (sebffischer, Aug 18, 2024)
2d07b48  ...  (sebffischer, Aug 18, 2024)
5ba2103  ...  (sebffischer, Aug 18, 2024)
92f46c4  ...  (sebffischer, Aug 18, 2024)
2a1013f  typo  (sebffischer, Aug 18, 2024)
ca033de  ...  (berndbischl, Aug 22, 2024)
3b24896  ...  (sebffischer, Sep 3, 2024)
c52d82b  ...  (sebffischer, Sep 3, 2024)
f238ae9  ...  (sebffischer, Sep 3, 2024)
f073db5  ...  (sebffischer, Sep 3, 2024)
58fb88b  ...  (sebffischer, Sep 3, 2024)
4209382  ...  (sebffischer, Sep 3, 2024)
23bcb04  ...  (sebffischer, Sep 3, 2024)
7787673  update pipelines  (sebffischer, Sep 4, 2024)
57eebe8  update measures  (sebffischer, Sep 4, 2024)
386ff8f  typos  (sebffischer, Sep 4, 2024)
3a45430  let's hope  (sebffischer, Sep 4, 2024)
c1b39a0  does it render now?  (sebffischer, Oct 7, 2024)
2a5834f  internal tuning in manual search space  (sebffischer, Oct 7, 2024)
e9e4b88  ...  (sebffischer, Oct 16, 2024)
bda53da  ...  (sebffischer, Oct 17, 2024)
ff3a62f  ...  (sebffischer, Oct 17, 2024)
342e428  ...  (sebffischer, Oct 18, 2024)
ab4f51f  advanced tuning chapter  (be-marc, Oct 18, 2024)
ed75951  advanced technical text  (be-marc, Oct 18, 2024)
1b54a50  ...  (be-marc, Oct 27, 2024)
37cdb58  ...  (be-marc, Oct 27, 2024)
0e5ce01  errata  (be-marc, Oct 27, 2024)
ab47aee  Merge branch 'main' into validation  (be-marc, Oct 27, 2024)
d6c714e  update mlr3mbo  (be-marc, Oct 27, 2024)
6032a62  Merge branch 'validation' of github.com:mlr-org/mlr3book into validation  (be-marc, Oct 27, 2024)
d7e2f23  tuning  (be-marc, Oct 27, 2024)
a9a9195  ...  (sebffischer, Oct 28, 2024)
725a23e  ...  (sebffischer, Oct 28, 2024)
f55a805  ...  (sebffischer, Oct 28, 2024)
20e1b94  ...  (sebffischer, Oct 28, 2024)
7104176  ...  (sebffischer, Oct 31, 2024)
056f545  ...  (sebffischer, Oct 31, 2024)
fa9b089  ...  (sebffischer, Nov 3, 2024)
01f5f42  ...  (sebffischer, Nov 4, 2024)
58220a2  BB final edits  (berndbischl, Nov 7, 2024)
92ae51c  ...  (sebffischer, Nov 7, 2024)
1 change: 1 addition & 0 deletions book/_quarto.yml
@@ -44,6 +44,7 @@ book:
- chapters/chapter12/model_interpretation.qmd
- chapters/chapter13/beyond_regression_and_classification.qmd
- chapters/chapter14/algorithmic_fairness.qmd
- chapters/chapter15/predsets_valid_inttune.qmd
- chapters/references.qmd
appendices:
- chapters/appendices/solutions.qmd # online only
35 changes: 31 additions & 4 deletions book/chapters/appendices/errata.qmd
@@ -11,19 +11,27 @@ aliases:

This appendix lists changes made to the online version of this book, relative to the chapters included in the first edition.

## 1. Introduction and Overview

* Add note that chapters added after the release of the printed version are marked with a '+'.

## Data and Basic Modeling

## 2. Data and Basic Modeling

* Replaced reference to `Param` with `Domain`.

## Hyperparameter Optimization
## 3. Evaluation and Benchmarking

* Use `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 4. Hyperparameter Optimization

* Renamed `TuningInstanceSingleCrit` to `TuningInstanceBatchSingleCrit`.
* Renamed `TuningInstanceMultiCrit` to `TuningInstanceBatchMultiCrit`.
* Renamed `Tuner` to `TunerBatch`.
* Replaced reference to `Param` with `Domain`.

## Advanced Tuning Methods and Black Box Optimization
## 5. Advanced Tuning Methods and Black Box Optimization

* Renamed `TuningInstanceSingleCrit` to `TuningInstanceBatchSingleCrit`.
* Renamed `TuningInstanceMultiCrit` to `TuningInstanceBatchMultiCrit`.
@@ -33,10 +41,29 @@ This appendix lists changes to the online version of this book to chapters inclu
* Renamed `Optimizer` to `OptimizerBatch`.
* Replaced `OptimInstanceSingleCrit$new()` with `oi()`.
* Add `oi()` to the table about important functions.
* Use `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## Feature Selection
## 6. Feature Selection

* Renamed `FSelectInstanceSingleCrit` to `FSelectInstanceBatchSingleCrit`.
* Renamed `FSelectInstanceMultiCrit` to `FSelectInstanceBatchMultiCrit`.
* Renamed `FeatureSelector` to `FeatureSelectorBatch`.
* Add `fsi()` to the table about important functions.

## 8. Non-sequential Pipelines and Tuning

* Use `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 10. Advanced Technical Aspects of mlr3

* Use `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 11. Large-Scale Benchmarking

* Use `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 12. Model Interpretation

* Subset task to row 127 instead of 35 for the local surrogate model.
* Add `as.data.frame()` to "Correctly Interpreting Shapley Values" section.

199 changes: 197 additions & 2 deletions book/chapters/appendices/solutions.qmd
@@ -1711,9 +1711,9 @@ First, we create the learner that we want to tune, mark the relevant parameter f

```{r}
lrn_debug = lrn("classif.debug",
error_train = to_tune(0, 1),
fallback = lrn("classif.rpart")
error_train = to_tune(0, 1)
)
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.rpart"))
lrn_debug
```

@@ -2171,4 +2171,199 @@ prediction$score(msr_3, adult_subset)
We can see that among women the discrepancy is even bigger than among men.

* The bias mitigation strategies we employed do not optimize for the *false omission rate* metric, but other metrics instead. It might therefore be better to try to achieve fairness via other strategies, using different or more powerful models or tuning hyperparameters.

## Solutions to @sec-predsets-valid-inttune

1. Manually `$train()` a LightGBM classifier from `r ref_pkg("mlr3extralearners")` on the pima task using $1/3$ of the training data for validation.
As the pima task has missing values, select a method from `r ref_pkg("mlr3pipelines")` to impute them.
Explicitly set the evaluation metric to logloss (`"binary_logloss"`), the maximum number of boosting iterations to 1000, the patience parameter to 10, and the step size to 0.01.
After training the learner, inspect the final validation scores as well as the early stopped number of iterations.

We start by loading the packages and creating the task.

```{r}
library(mlr3)
library(mlr3extralearners)
library(mlr3pipelines)

tsk_pima = tsk("pima")
tsk_pima
```

Below, we see that the task has five features with missing values.

```{r}
tsk_pima$missings()
```

Next, we create the LightGBM classifier, but don't specify the validation data yet.
We handle the missing values using a simple median imputation.

```{r}
lrn_lgbm = lrn("classif.lightgbm",
num_iterations = 1000,
early_stopping_rounds = 10,
learning_rate = 0.01,
eval = "binary_logloss"
)

glrn = as_learner(po("imputemedian") %>>% lrn_lgbm)
glrn$id = "lgbm"
```

After constructing the graphlearner, we now configure the validation data using `r ref("set_validate()")`.
The call below sets the `$validate` field of the LightGBM pipeop to `"predefined"` and of the graphlearner to `0.3`.
Recall that only the graphlearner itself can specify *how* the validation data is generated.
The individual pipeops can either use it (`"predefined"`) or not (`NULL`).

```{r}
set_validate(glrn, validate = 0.3, ids = "classif.lightgbm")
glrn$validate
glrn$graph$pipeops$classif.lightgbm$validate
```

Finally, we train the learner and inspect the validation scores and internally tuned parameters.

```{r}
glrn$train(tsk_pima)

glrn$internal_tuned_values
glrn$internal_valid_scores
```

2. Wrap the learner from exercise 1) in an `AutoTuner` using a three-fold CV for the tuning.
Also change the rule for aggregating the different boosting iterations from averaging to taking the maximum across the folds.
Don't tune any parameters other than the number of boosting iterations (`num_iterations`), which can be done using `tnr("internal")`.
Use the internal validation metric as the tuning measure.
Compare this learner with a `lrn("classif.rpart")` using a 10-fold outer cross-validation with respect to classification accuracy.

We start by setting the number of boosting iterations to an internal tune token, where the upper bound is 1000 and the aggregation function takes the maximum across the folds.
Note that the input to the aggregation function is a list of integer values (the early stopped values for the different resampling iterations), so we need to `unlist()` it first before taking the maximum.

```{r}
library(mlr3tuning)

glrn$param_set$set_values(
classif.lightgbm.num_iterations = to_tune(
upper = 1000, internal = TRUE, aggr = function(x) max(unlist(x))
)
)
```

Now, we change the validation data from `0.3` to `"test"`, where we can omit the `ids` specification as LightGBM is the base learner.

```{r}
set_validate(glrn, validate = "test")
```

Next, we create the autotuner using the configuration given in the instructions.
As the internal validation measures are calculated by `lightgbm` and not `mlr3`, we need to specify whether the metric should be minimized.

```{r}
at_lgbm = auto_tuner(
learner = glrn,
tuner = tnr("internal"),
resampling = rsmp("cv", folds = 3),
measure = msr("internal_valid_score",
select = "classif.lightgbm.binary_logloss", minimize = TRUE)
)
at_lgbm$id = "at_lgbm"
```

Finally, we set up the benchmark design, run it, and evaluate the learners in terms of their classification accuracy.

```{r}
design = benchmark_grid(
task = tsk_pima,
learners = list(at_lgbm, lrn("classif.rpart")),
resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(design)

bmr$aggregate(msr("classif.acc"))
```

3. Consider the code below:

```{r}
branch_lrn = as_learner(
ppl("branch", list(
lrn("classif.ranger"),
lrn("classif.xgboost",
early_stopping_rounds = 10,
eval_metric = "error",
eta = to_tune(0.001, 0.1, logscale = TRUE),
nrounds = to_tune(upper = 1000, internal = TRUE)))))

set_validate(branch_lrn, validate = "test", ids = "classif.xgboost")
branch_lrn$param_set$set_values(branch.selection = to_tune())

at = auto_tuner(
tuner = tnr("grid_search"),
learner = branch_lrn,
resampling = rsmp("holdout", ratio = 0.8),
# cannot use internal validation score because ranger does not have one
measure = msr("classif.ce"),
term_evals = 10L,
store_models = TRUE
)

tsk_sonar = tsk("sonar")$filter(1:100)

rr = resample(
tsk_sonar, at, rsmp("holdout", ratio = 0.8), store_models = TRUE
)
```

Answer the following questions (ideally without running the code):

3.1 During the hyperparameter optimization, how many observations are used to train the XGBoost algorithm (excluding validation data) and how many for the random forest?
Hint: learners that cannot make use of validation data ignore it.

The outer resampling already removes 20 observations from the data (the outer test set), leaving only 80 data points (the outer train set) for the inner resampling.
Then 16 (0.2 * 80; the test set of the inner holdout resampling) observations are used to evaluate the hyperparameter configurations.
This leaves 64 (80 - 16) observations for training.
For XGBoost, the 16 observations that make up the inner test set are also used for validation, so no more observations from the 64 training points are removed.
Because the random forest does not support validation, the 16 observations from the inner test set will only be used for evaluating the hyperparameter configuration, but not simultaneously for internal validation.
Therefore, both the random forest and XGBoost models use 64 observations for training.
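
The arithmetic can be double-checked with a quick back-of-the-envelope calculation; the snippet below is purely illustrative (it just applies the holdout ratios from the code above to the 100 filtered observations) and is not part of the original exercise.

```{r}
n_total       = 100                            # tsk("sonar")$filter(1:100)
n_outer_train = 0.8 * n_total                  # outer holdout with ratio 0.8 -> 80
n_inner_test  = 0.2 * n_outer_train            # inner holdout test set -> 16
n_inner_train = n_outer_train - n_inner_test   # observations left for training -> 64
c(outer_train = n_outer_train, inner_test = n_inner_test, inner_train = n_inner_train)
```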

3.2 How many observations would be used to train the final model if XGBoost was selected? What if the random forest was chosen?

In both cases, all 80 observations (the train set from the outer resampling) would be used.
This is because during the final model fit no validation data is generated.

3.3 How would the answers to the last two questions change if we had set the `$validate` field of the graphlearner to `0.25` instead of `"test"`?

In this case, the validation data is no longer identical to the inner resampling test set.
Instead, it is split from the 64 observations that make up the inner training set.
Because this happens before the task enters the graphlearner, both the XGBoost model *and* the random forest only have access to 48 ((1 - 0.25) * 64) observations, and the remaining 16 are used to create the validation data.
Note that the random forest will again ignore the validation data as it does not have the 'validation' property and therefore cannot use it.
Also, the autotuner would now use different sets for evaluating the step size configurations (the inner test set) and for internally tuning the number of boosting iterations (the validation data), which coincidentally both have size 16.
Therefore, the answer to question 3.1 would be 48 instead of 64.

However, this does not change the answer to 3.2, as, again, no validation is performed during the final model fit.

Note that we would normally recommend setting the validation data to `"test"` when tuning, so this should be thought of as an illustrative example.
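
The same arithmetic with `$validate` set to `0.25`, again as a purely illustrative sketch:

```{r}
n_inner_train = 64                       # inner training set from above
n_valid       = 0.25 * n_inner_train     # 16 observations split off for validation
n_train       = n_inner_train - n_valid  # 48 observations remain for training
c(valid = n_valid, train = n_train)
```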


4. Look at the (failing) code below:

```{r, error = TRUE}
tsk_sonar = tsk("sonar")
glrn = as_learner(
po("pca") %>>% lrn("classif.xgboost", validate = 0.3)
)
```

Can you explain *why* the code fails?
Hint: Should the data that xgboost uses for validation be preprocessed according to the *train* or *predict* logic?

If we set the `$validate` field of the XGBoost classifier to `0.3`, the validation data would be generated from the output task of `PipeOpPCA`.
However, this task has been exclusively preprocessed using the train logic, because `PipeOpPCA` does not 'know' that the XGBoost classifier wants to do validation.
Because validation performance is intended to measure how well a model would perform during prediction, the validation data should be preprocessed according to the predict logic.
For this reason, splitting off 30% of the output of `PipeOpPCA` to use as validation data for the XGBoost classifier would be invalid.
Therefore, it is not possible to set the `$validate` field of a `PipeOp` to values other than `"predefined"` or `NULL`.
Only the `GraphLearner` itself can dictate *how* the validation data is created *before* it enters the `Graph`, so the validation data is then preprocessed according to the predict logic.
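
By contrast, a minimal sketch of the configuration that does work, reusing the `set_validate()` pattern from the solution to exercise 1 (illustrative only): the split is configured on the `GraphLearner`, so it happens before the task enters the `Graph`, and the XGBoost pipeop merely consumes the predefined validation task.

```{r}
glrn = as_learner(po("pca") %>>% lrn("classif.xgboost"))
# the GraphLearner itself splits off 30% of the data before it enters the graph
set_validate(glrn, validate = 0.3, ids = "classif.xgboost")
glrn$validate                                # 0.3
glrn$graph$pipeops$classif.xgboost$validate  # "predefined"
```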

:::
Original file line number Diff line number Diff line change
@@ -104,14 +104,14 @@ lrn_ranger = as_learner(
po("learner", lrn("regr.ranger"))
)
lrn_ranger$id = "ranger"
lrn_ranger$fallback = lrn("regr.featureless")
lrn_ranger$encapsulate("evaluate", fallback = lrn("regr.featureless"))

lrn_rpart = as_learner(
ppl("robustify", learner = lrn("regr.rpart")) %>>%
po("learner", lrn("regr.rpart"))
)
lrn_rpart$id = "rpart"
lrn_rpart$fallback = lrn("regr.featureless")
lrn_rpart$encapsulate("evaluate", fallback = lrn("regr.featureless"))

learners = list(lrn_ranger, lrn_rpart)
```
2 changes: 2 additions & 0 deletions book/chapters/chapter1/introduction_and_overview.qmd
@@ -27,6 +27,8 @@ Before we can show you the full power of `mlr3`, we recommend installing the `r
install.packages("mlr3verse")
```

Chapters that were added after the release of the printed version of this book are marked with a '+'.

## Installation Guidelines {#installguide}

There are many packages in the `mlr3` ecosystem that you may want to use as you work through this book.
18 changes: 7 additions & 11 deletions book/chapters/chapter10/advanced_technical_aspects_of_mlr3.qmd
@@ -530,21 +530,18 @@ This means that models can be used for fitting and predicting and any conditions
However, the result of the experiment will be a missing model and/or predictions, depending on where the error occurs.
In @sec-fallback, we will discuss fallback learners to replace missing models and/or predictions.

Each `r ref("Learner")` contains the field `r index("$encapsulate", parent = "Learner", aside = TRUE, code = TRUE)` to control how the train or predict steps are wrapped.
Each `r ref("Learner")` has the method `r index("$encapsulate()", parent = "Learner", aside = TRUE, code = TRUE)` to control how the train or predict steps are wrapped.
The first way to encapsulate the execution is provided by the package `r ref_pkg("evaluate")`, which evaluates R expressions and captures and tracks conditions (outputs, messages, warnings or errors) without letting them stop the process (see documentation of `r ref("mlr3misc::encapsulate()")` for full details):

```{r technical-017}
# trigger warning and error in training
lrn_debug = lrn("classif.debug", warning_train = 1, error_train = 1)

# enable encapsulation for train() and predict()
lrn_debug$encapsulate = c(train = "evaluate", predict = "evaluate")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))
lrn_debug$train(tsk_penguins)
```

Note how we passed `"evaluate"` to `train` and `predict` to enable encapsulation in both training and predicting.
However, we could have only set encapsulation for one of these stages by instead passing `c(train = "evaluate", predict = "none")` or `c(train = "none", predict = "evaluate")`.

Note that encapsulation captures all output written to the standard output (stdout) and standard error (stderr) streams and stores them in the learner's log.
However, in some computational setups, the calling process needs to operate on the log output, such as the `r ref_pkg("batchtools")` package in @sec-large-benchmarking.
In this case, use the encapsulation method `"try"` instead, which catches signaled conditions but does not suppress the output.
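
A brief sketch of switching to the `"try"` method, assuming the same `$encapsulate()` signature used above (illustrative, not part of the chapter's code):

```{r}
# conditions are still caught, but the output is written to stdout/stderr
# so that callers such as batchtools can collect it in their logs
lrn_debug$encapsulate("try", fallback = lrn("classif.featureless"))
lrn_debug$train(tsk_penguins)$errors
```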
@@ -563,7 +560,7 @@ This guards the calling session against segmentation faults which otherwise woul
On the downside, starting new processes comes with comparably more computational overhead.

```{r technical-019}
lrn_debug$encapsulate = c(train = "callr", predict = "callr")
lrn_debug$encapsulate("callr", fallback = lrn("classif.featureless"))
# set segfault_train and remove warning_train and error_train
lrn_debug$param_set$values = list(segfault_train = 1)
lrn_debug$train(task = tsk_penguins)$errors
@@ -613,13 +610,12 @@ Say an error has occurred when training a model in one or more iterations during
We strongly recommend the final option, which is statistically sound and can be easily used in any practical experiment.
`mlr3` includes two baseline learners: `lrn("classif.featureless")`, which, in its default configuration, always predicts the majority class, and `lrn("regr.featureless")`, which predicts the average response by default.

To make this procedure convenient during resampling and benchmarking, we support fitting a baseline (though in theory you could use any `Learner`) as a `r index('fallback learner')` by passing a `r ref("Learner")` to `r index('$fallback', parent = "Learner", aside = TRUE, code = TRUE)`.
To make this procedure convenient during resampling and benchmarking, we support fitting a baseline (though in theory you could use any `Learner`) as a `r index('fallback learner')` by passing a `r ref("Learner")` to `r index('$encapsulate()', parent = "Learner", aside = TRUE, code = TRUE)`.
In the next example, we add a classification baseline to our debug learner, so that when the debug learner errors, `mlr3` falls back to the predictions of the featureless learner internally.
Note that while encapsulation is not enabled explicitly, it is automatically enabled and set to `"evaluate"` if a fallback learner is added.

```{r technical-022}
lrn_debug = lrn("classif.debug", error_train = 1)
lrn_debug$fallback = lrn("classif.featureless")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))

lrn_debug$train(tsk_penguins)
lrn_debug
@@ -639,7 +635,7 @@ We re-parametrize the debug learner to fail in roughly 50% of the resampling ite

```{r technical-024}
lrn_debug = lrn("classif.debug", error_train = 0.5)
lrn_debug$fallback = lrn("classif.featureless")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))

aggr = benchmark(benchmark_grid(
tsk_penguins,
@@ -970,7 +966,7 @@ For an overview of available DBMS in R, see the CRAN task view on databases at `
| - | `r ref("future::plan()")` | - |
| - | `r ref("set_threads()")` | - |
| - | `r ref("future::tweak()")` | - |
| `Learner` | `lrn()` | `$encapsulate`; `$fallback`; `$timeout`; `$parallel_predict`; `$log` |
| `Learner` | `lrn()` | `$encapsulate()`; `$timeout`; `$parallel_predict`; `$log` |
| `r ref("lgr::Logger")` | `r ref("lgr::get_logger")` | `$set_threshold()` |
| `r ref("mlr3db::DataBackendDplyr")` | `r ref("mlr3::as_data_backend")` | - |
| `r ref("mlr3db::DataBackendDuckDB")` | `r ref("as_duckdb_backend")` | - |
6 changes: 2 additions & 4 deletions book/chapters/chapter11/large-scale_benchmarking.qmd
@@ -49,15 +49,13 @@ lrn_baseline = lrn("classif.featureless", id = "featureless")
lrn_lr = lrn("classif.log_reg")
lrn_lr = as_learner(ppl("robustify", learner = lrn_lr) %>>% lrn_lr)
lrn_lr$id = "logreg"
lrn_lr$fallback = lrn_baseline
lrn_lr$encapsulate = c(train = "try", predict = "try")
lrn_lr$encapsulate("try", fallback = lrn_baseline)

# random forest pipeline
lrn_rf = lrn("classif.ranger")
lrn_rf = as_learner(ppl("robustify", learner = lrn_rf) %>>% lrn_rf)
lrn_rf$id = "ranger"
lrn_rf$fallback = lrn_baseline
lrn_rf$encapsulate = c(train = "try", predict = "try")
lrn_rf$encapsulate("try", fallback = lrn_baseline)

learners = list(lrn_lr, lrn_rf, lrn_baseline)
```