
[WIP] Trainer supporting evaluation on multiple datasets #19158

Merged · 9 commits · Sep 23, 2022

Conversation

timbmg (Contributor) commented Sep 22, 2022

What does this PR do?

With this PR, Trainer and Seq2SeqTrainer support evaluation on multiple datasets. For this, the eval_dataset and compute_metrics parameters have been updated: to evaluate on multiple datasets, eval_dataset should be a dict mapping each dataset name to a Dataset. In _maybe_log_save_evaluate we then loop over the dict, calling evaluate on each Dataset, and the metric prefix is extended to include the dataset name. Furthermore, each eval dataset can optionally have its own compute_metrics function; for this, compute_metrics should be a dict whose keys match those of eval_dataset.
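A rough sketch (not the verbatim Trainer source) of the dict handling inside _maybe_log_save_evaluate described above; the optional per-dataset compute_metrics dispatch is omitted here, since that part is discussed further down in this thread:

```python
# Rough sketch of the evaluation branch in _maybe_log_save_evaluate.
if isinstance(self.eval_dataset, dict):
    metrics = {}
    for eval_dataset_name, eval_dataset in self.eval_dataset.items():
        # Each dataset is evaluated separately and its metrics are logged
        # under the prefix "eval_<dataset_name>_".
        dataset_metrics = self.evaluate(
            eval_dataset=eval_dataset,
            ignore_keys=ignore_keys_for_eval,
            metric_key_prefix=f"eval_{eval_dataset_name}",
        )
        metrics.update(dataset_metrics)
else:
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
```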

Fixes #15857

Who can review?

@sgugger

HuggingFaceDocBuilderDev commented Sep 22, 2022

The documentation is not available anymore as the PR was closed or merged.

timbmg (Contributor, Author) commented Sep 22, 2022

Hey @sgugger, I mostly followed your suggestion in #15857, except that instead of having a list of eval_datasets and another training arg, I solved it by passing a dict of eval_datasets. I thought a dict would work better because we also need multiple compute_metrics functions; this way everything is lined up and less error-prone. However, let me know if you think otherwise.

Also, could you suggest what tests to write for this PR? I am not really sure, since the major change is in _maybe_log_save_evaluate and I didn't find a test for that.

sgugger (Collaborator) left a comment

Thanks for your PR! Having the multiple datasets as a dict solves the problem of distinguishing a single dataset that is a list from a list of datasets. So I like this part.

However, I didn't see anything in the issue regarding using several compute_metrics functions. If there is a need for different metrics, it probably means different Trainers should be built, as they represent different tasks/problems. That change should be reverted, as should the part where compute_metrics can be passed along to the evaluate/predict functions.

timbmg (Contributor, Author) commented Sep 22, 2022

Thanks for checking it so quickly!

In my case, I am training a seq2seq QA model and evaluating it on multiple datasets that have different formats (e.g. extractive QA like SQuAD, or multiple-choice QA like CommonsenseQA). Using a seq2seq model across multiple formats has been proposed, for example, in the UnifiedQA paper. Having multiple trainers has the limitation that I could only train on a single dataset at a time, not on several at once. Note, however, that if you pass multiple eval_datasets as a dict but only a single compute_metrics callable, that same compute_metrics function is applied to every eval_dataset; that is what this if statement does. So the original scenario described in the issue is also covered; a usage sketch follows below.
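A minimal usage sketch of the scenario above: several eval datasets passed as a dict, with one compute_metrics callable shared by all of them. The model, datasets, and metric body below are placeholders, not part of the PR.

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    # Single metric function reused for every eval dataset (placeholder body).
    predictions, labels = eval_pred
    return {"exact_match": 0.0}

trainer = Seq2SeqTrainer(
    model=model,                      # placeholder seq2seq model
    args=Seq2SeqTrainingArguments(output_dir="out", evaluation_strategy="epoch"),
    train_dataset=train_dataset,      # placeholder mixed-format training set
    eval_dataset={
        "squad": squad_eval,          # extractive QA (placeholder)
        "commonsense_qa": cqa_eval,   # multiple-choice QA (placeholder)
    },
    compute_metrics=compute_metrics,
)
# Metrics are logged per dataset, e.g. eval_squad_loss and eval_commonsense_qa_loss.
trainer.train()
```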

sgugger (Collaborator) commented Sep 22, 2022

It's too niche a use case to support, especially when we have other tools, like Accelerate, that easily let you write more customizable training/evaluation loops.

timbmg marked this pull request as ready for review on September 23, 2022, 07:51
timbmg (Contributor, Author) commented Sep 23, 2022

Alright, I have reverted the change. Let me know if there's anything else :)

sgugger (Collaborator) left a comment

Looking good, thanks a lot for amending the PR!

sgugger merged commit 905635f into huggingface:main on Sep 23, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request on Sep 26, 2022 (…#19158):

* support for multiple eval datasets

* support multiple datasets in seq2seq trainer

* add documentation

* update documentation

* make fixup

* revert option for multiple compute_metrics

* revert option for multiple compute_metrics

* revert added empty line
jurrr commented Feb 14, 2023

I'm trying to take advantage of the feature to include multiple eval_datasets in the trainer. Maybe I'm misreading the documentation; I've tried several ways to pass the eval_dataset, but I keep getting a KeyError whenever I supply a DatasetDict / dict of datasets for the eval_dataset parameter. Am I doing something wrong? Do I need to specify compute_metrics differently? I couldn't find anything on that.

Here's an example notebook reproducing the error: https://colab.research.google.com/drive/1yLo9iqY4Cz9_h8BtAvcYRCtK5O_xa5jP?usp=sharing
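One thing that might be relevant here (an assumption, not verified against the notebook above): with a dict of eval datasets, metric keys are prefixed with the dataset name, so any argument that references a metric by name has to include that prefix.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    # With eval_dataset={"validation": ..., "test": ...}, accuracy is logged as
    # "eval_validation_accuracy" rather than "eval_accuracy"; referencing the
    # unprefixed name makes best-model selection fail with a KeyError.
    metric_for_best_model="eval_validation_accuracy",
)
```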

alexcoca commented Mar 1, 2023

> Thanks for your PR! Having the multiple datasets as a dict solves the problem of distinguishing a single dataset that is a list from a list of datasets. So I like this part.
>
> However, I didn't see anything in the issue regarding using several compute_metrics functions. If there is a need for different metrics, it probably means different Trainers should be built, as they represent different tasks/problems. That change should be reverted, as should the part where compute_metrics can be passed along to the evaluate/predict functions.

@sgugger passing multiple compute_metrics functions for evaluation is actually a more general use case than stated by @timbmg. For example, suppose we are doing multi-task training and wish to evaluate on the same tasks we train on, or on held-out tasks. This is common in recent research publications (e.g. FLAN-T5). Would you accept support for multiple compute_metrics functions? Or would your advice be to not use the Trainer at all and look towards Accelerate? My worry is that using Accelerate for training is a big step back towards writing a lot of the boilerplate that the Trainer saves us from.

sgugger (Collaborator) commented Mar 1, 2023

I'd recommend using Accelerate instead of the Trainer for this use case.

ZQ-Dev8 commented May 18, 2023

@sgugger Any examples out there of using Accelerate for this? I would also like to evaluate on multiple datasets while training. Thanks!

sieu-n commented Sep 6, 2023

FYI @dcruiz01, I guess this can be implemented by overriding the Trainer class along the lines of the initial commit, with some mixin over multiple trainers, or even by monkey-patching. Another workaround that comes to mind is tagging each eval example with an additional field identifying its dataset and splitting on that field inside the compute_metrics function. I'll try to share whatever actually works; both feel really hacky, though. (A rough sketch of the subclass approach is appended after the list below.)

Though I agree this is less common for standard fine-tuning, I want to add some use cases where we want the model to perform well on multiple tasks in a zero-shot manner:

  • fine-tuning LLMs, where we want to measure performance on many different tasks / metrics / datasets;
  • meta-benchmarks such as MTEB.
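Here is a rough sketch of the subclass workaround; MultiMetricTrainer and compute_metrics_per_dataset are hypothetical names, not part of transformers. It assumes the merged behaviour where the Trainer calls evaluate() with metric_key_prefix="eval_<dataset_name>" for each entry of a dict-valued eval_dataset.

```python
from transformers import Trainer

class MultiMetricTrainer(Trainer):
    def __init__(self, *args, compute_metrics_per_dataset=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Maps an eval_dataset key (e.g. "squad") to its own metrics function.
        self.compute_metrics_per_dataset = compute_metrics_per_dataset or {}

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # Recover the dataset name from the prefix and temporarily swap in the
        # matching compute_metrics, falling back to the default one otherwise.
        name = metric_key_prefix.removeprefix("eval_")
        default_compute_metrics = self.compute_metrics
        if name in self.compute_metrics_per_dataset:
            self.compute_metrics = self.compute_metrics_per_dataset[name]
        try:
            return super().evaluate(
                eval_dataset=eval_dataset,
                ignore_keys=ignore_keys,
                metric_key_prefix=metric_key_prefix,
            )
        finally:
            self.compute_metrics = default_compute_metrics
```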

ZQ-Dev8 commented Sep 19, 2023

@sieu-n Any luck with the experiments you mentioned?

lhallee commented Jan 19, 2024

@sgugger @timbmg it looks like the eval_dataset change to support a dict was merged, but the compute_metrics changes were not. Will that change still be made, or not?

0seba commented Feb 2, 2024

Hey, I think an additional feature to use separate data collators would be useful.

Successfully merging this pull request may close these issues.

Supporting multiple evaluation datasets in Trainer and Seq2seqTrainer