Replies: 1 comment
We are working towards a proposed solution, but remain open to other views and ideas. In the meantime, we do require that AutoML frameworks open-source their meta-learning by:
Note: this was originally issue #18, but it has been moved to the discussions board as the more appropriate venue.
We encourage everyone to partake in this discussion, as it is an important but difficult issue to address.
The Problem
Meta-learning is the act of learning across datasets in order to generalize to new problems. AutoML frameworks which use meta-learning to aid their optimization gain an unfair advantage if the benchmark problem they are asked to solve was present in their meta-learning data.
For instance, auto-sklearn uses the results of experiments on a plethora of datasets to warm-start its optimization. It characterizes new problems, relates them to those meta-learning datasets, and proposes candidate solutions which worked well for those older problems. If auto-sklearn is asked to solve a problem it has seen before, it obviously benefits, as it has access to experiment data on that very dataset. This is the unfair advantage.
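To make the mechanism concrete, here is a minimal, hypothetical sketch of meta-learning-based warm-starting. It is not auto-sklearn's actual code; the `MetaRecord` structure, the choice of meta-features, and the distance measure are illustrative assumptions only.

```python
# Hypothetical sketch of meta-learning warm-starting (not auto-sklearn's code):
# characterize a new dataset with meta-features, find the most similar
# previously seen datasets, and seed the optimizer with the configurations
# that worked best on those datasets.
from dataclasses import dataclass
from math import dist


@dataclass
class MetaRecord:
    dataset_id: int                 # e.g. an OpenML dataset id
    meta_features: list[float]      # e.g. n_instances, n_features, class entropy, ...
    best_configs: list[dict]        # configurations that performed well on this dataset


def warm_start_candidates(new_meta_features: list[float],
                          meta_db: list[MetaRecord],
                          k: int = 5) -> list[dict]:
    """Return candidate configurations taken from the k most similar prior datasets."""
    # Rank prior datasets by Euclidean distance in meta-feature space.
    ranked = sorted(meta_db, key=lambda r: dist(r.meta_features, new_meta_features))
    candidates = []
    for record in ranked[:k]:
        # If the "new" dataset is itself in meta_db, its own record sits at
        # distance zero and its best configurations are proposed first --
        # exactly the unfair advantage described above.
        candidates.extend(record.best_configs)
    return candidates
```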
Discussion
The cleanest solution would be to not allow AutoML frameworks to use any of the selected benchmark datasets in their meta-models (or, depending on where we place the burden, to only select benchmark datasets which were not used in any framework's meta-learning process).
Both stances share the same problem: both parties are very interested in using the same data. Excluding these datasets from meta-learning would make the AutoML tools worse, while excluding them from the benchmark would make the benchmark worse.
In our ICML 2019 AutoML workshop paper we did not have a satisfying solution; we merely indicated where auto-sklearn had seen the dataset during meta-learning.
Any proposed solution should take into account that the datasets in this benchmark change over time. This means that a tool whose meta-learning datasets currently have no overlap with the benchmark may have overlap after datasets are added to the benchmark.
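As an illustration, a tiny sketch of such an overlap check is shown below. It is not part of the benchmark's codebase, and the dataset ids are made up; the point is that it would have to be re-run whenever the benchmark's dataset list changes.

```python
# Hypothetical overlap check between the benchmark's dataset ids and the ids a
# framework used for meta-learning. Re-running it after each benchmark update
# catches overlap that only appears once new datasets are added.
def meta_learning_overlap(benchmark_ids: set[int], meta_learning_ids: set[int]) -> set[int]:
    """Return the benchmark dataset ids that also appear in the meta-learning data."""
    return benchmark_ids & meta_learning_ids


if __name__ == "__main__":
    benchmark_v1 = {3, 31, 1067}            # illustrative dataset ids only
    framework_meta = {31, 1461, 1489}
    print(meta_learning_overlap(benchmark_v1, framework_meta))    # {31}

    # After the benchmark adds a dataset, a previously clean framework may overlap.
    benchmark_v2 = benchmark_v1 | {1461}
    print(meta_learning_overlap(benchmark_v2, framework_meta))    # {31, 1461}
```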
Forbidding meta-learning in the benchmark is also not an option. We want to evaluate the frameworks as a user would use them. Meta-learning has been shown to provide significant improvements (e.g. auto-sklearn, OBOE), and we hope to see further improvements from it in the future. These improvements should also be accurately reflected in our benchmark.
Solutions
I will try to keep this section up-to-date as the discussion grows.
One solution could be to require that AutoML frameworks which use meta-learning expose an interface to easily exclude a specific dataset (referenced by dataset id) from the meta-model.
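A rough sketch of what such an interface could look like is given below. The class and method names (`MetaLearningAutoML`, `exclude_from_meta_learning`) are hypothetical and not part of any existing framework; they only illustrate the kind of hook the benchmark would call before fitting.

```python
# Hypothetical interface sketch: the benchmark passes the id of the dataset
# under evaluation, and the framework must leave it out of its meta-model.
from abc import ABC, abstractmethod


class MetaLearningAutoML(ABC):
    @abstractmethod
    def exclude_from_meta_learning(self, dataset_ids: set[int]) -> None:
        """Remove the given dataset ids (e.g. OpenML ids) from the meta-model
        before fitting, so results on those datasets are not warm-started
        from experiments on the very same datasets."""


# The benchmark could then call, for each task it evaluates:
#     framework.exclude_from_meta_learning({task.dataset_id})
#     framework.fit(X_train, y_train)
```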