
[WIP] Trainer supporting evaluation on multiple datasets #19158

Merged · 9 commits · Sep 23, 2022

Conversation

timbmg (Contributor) commented Sep 22, 2022

What does this PR do?

With this PR, Trainer and Seq2SeqTrainer support evaluation on multiple datasets. For this, the eval_dataset and compute_metrics parameters have been updated: to evaluate on multiple datasets, eval_dataset should be a dict mapping each dataset name to a Dataset. In _maybe_log_save_evaluate we then loop over the dict, calling evaluate on each Dataset, and the metric prefix is extended to include the dataset name. Furthermore, each eval dataset can optionally have its own compute_metrics function; for this, compute_metrics should be a dict whose keys match those of eval_dataset.
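A rough sketch (not the verbatim Trainer source) of the dict handling inside _maybe_log_save_evaluate described above; the optional per-dataset compute_metrics dispatch is omitted here, since that part is discussed further down in this thread:

```python
# Rough sketch of the evaluation branch in _maybe_log_save_evaluate.
if isinstance(self.eval_dataset, dict):
    metrics = {}
    for eval_dataset_name, eval_dataset in self.eval_dataset.items():
        # Each dataset is evaluated separately and its metrics are logged
        # under the prefix "eval_<dataset_name>_".
        dataset_metrics = self.evaluate(
            eval_dataset=eval_dataset,
            ignore_keys=ignore_keys_for_eval,
            metric_key_prefix=f"eval_{eval_dataset_name}",
        )
        metrics.update(dataset_metrics)
else:
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
```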

Fixes #15857

Who can review?

@sgugger

HuggingFaceDocBuilderDev commented Sep 22, 2022

The documentation is not available anymore as the PR was closed or merged.

timbmg (Contributor, Author) commented Sep 22, 2022

Hey @sgugger, I mostly followed your suggestion in #15857, except that instead of having a list of eval_datasets and another training arg, I solved it by passing a dict of eval_datasets. I thought a dict would work better because we also need multiple compute_metrics functions; this way everything is lined up and less error-prone. However, let me know if you think otherwise.

Also, could you suggest what tests to write for this PR? I am not really sure, since the major change is in _maybe_log_save_evaluate and I didn't find a test for that.

sgugger (Collaborator) left a comment

Thanks for your PR! Having the multiple datasets as a dict solves the problem of distinguishing a single dataset that is a list from a list of datasets. So I like this part.

However, I didn't see anything in the issue regarding using several compute_metrics functions. If there is a need for different metrics, it probably means different Trainers should be built, as they represent different tasks/problems. That change should be reverted, as should the part where compute_metrics can be passed along to the evaluate/predict functions.

timbmg (Contributor, Author) commented Sep 22, 2022

Thanks for checking it so quickly!

In my case, I am training a seq2seq QA model and evaluating it on multiple datasets that have different formats (e.g. extractive QA like SQuAD, or multiple-choice QA like CommonsenseQA). Using a seq2seq model across multiple formats has been proposed, for example, in the UnifiedQA paper. Having multiple trainers has the limitation that I could only train on a single dataset at a time, not on several at once. Note, however, that if you pass multiple eval_datasets as a dict but only a single compute_metrics callable, that same compute_metrics function is applied to every eval_dataset; that is what this if statement does. So the original scenario described in the issue is also covered; a usage sketch follows below.
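A minimal usage sketch of the scenario above: several eval datasets passed as a dict, with one compute_metrics callable shared by all of them. The model, datasets, and metric body below are placeholders, not part of the PR.

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    # Single metric function reused for every eval dataset (placeholder body).
    predictions, labels = eval_pred
    return {"exact_match": 0.0}

trainer = Seq2SeqTrainer(
    model=model,                      # placeholder seq2seq model
    args=Seq2SeqTrainingArguments(output_dir="out", evaluation_strategy="epoch"),
    train_dataset=train_dataset,      # placeholder mixed-format training set
    eval_dataset={
        "squad": squad_eval,          # extractive QA (placeholder)
        "commonsense_qa": cqa_eval,   # multiple-choice QA (placeholder)
    },
    compute_metrics=compute_metrics,
)
# Metrics are logged per dataset, e.g. eval_squad_loss and eval_commonsense_qa_loss.
trainer.train()
```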

sgugger (Collaborator) commented Sep 22, 2022

It's too niche a use case to support, especially when we have other tools, like Accelerate, that easily let you write more customizable training/evaluation loops.

timbmg marked this pull request as ready for review on September 23, 2022, 07:51
timbmg (Contributor, Author) commented Sep 23, 2022

Alright, I have reverted the change. Let me know if there's anything else :)

sgugger (Collaborator) left a comment

Looking good, thanks a lot for amending the PR!

sgugger merged commit 905635f into huggingface:main on Sep 23, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request on Sep 26, 2022 (…#19158):

* support for multiple eval datasets

* support multiple datasets in seq2seq trainer

* add documentation

* update documentation

* make fixup

* revert option for multiple compute_metrics

* revert option for multiple compute_metrics

* revert added empty line
jurrr commented Feb 14, 2023

I'm trying to take advantage of the feature to include multiple eval_datasets in the trainer. Maybe I'm misreading the documentation; I've tried several ways to pass the eval_dataset, but I keep getting a KeyError whenever I supply a DatasetDict / dict of datasets for the eval_dataset parameter. Am I doing something wrong? Do I need to specify compute_metrics differently? I couldn't find anything on that.

Here's an example notebook reproducing the error: https://colab.research.google.com/drive/1yLo9iqY4Cz9_h8BtAvcYRCtK5O_xa5jP?usp=sharing
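One thing that might be relevant here (an assumption, not verified against the notebook above): with a dict of eval datasets, metric keys are prefixed with the dataset name, so any argument that references a metric by name has to include that prefix.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    # With eval_dataset={"validation": ..., "test": ...}, accuracy is logged as
    # "eval_validation_accuracy" rather than "eval_accuracy"; referencing the
    # unprefixed name makes best-model selection fail with a KeyError.
    metric_for_best_model="eval_validation_accuracy",
)
```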

alexcoca commented Mar 1, 2023

> Thanks for your PR! Having the multiple datasets as a dict solves the problem of distinguishing a single dataset that is a list from a list of datasets. So I like this part.
>
> However, I didn't see anything in the issue regarding using several compute_metrics functions. If there is a need for different metrics, it probably means different Trainers should be built, as they represent different tasks/problems. That change should be reverted, as should the part where compute_metrics can be passed along to the evaluate/predict functions.

@sgugger passing multiple compute_metrics functions for evaluation is actually a more general use case than stated by @timbmg. For example, suppose we are doing multi-task training and wish to evaluate on the same tasks we train on, or on held-out tasks. This is common in recent research publications (e.g. FLAN-T5). Would you accept support for multiple compute_metrics functions? Or would your advice be to not use the Trainer at all and look towards Accelerate? My worry is that using Accelerate for training is a big step back towards writing a lot of the boilerplate that the Trainer saves us from.

sgugger (Collaborator) commented Mar 1, 2023

I'd recommend using Accelerate instead of the Trainer for this use case.

ZQ-Dev8 commented May 18, 2023

@sgugger Any examples out there of using Accelerate for this? I would also like to evaluate on multiple datasets while training. Thanks!

sieu-n commented Sep 6, 2023

FYI @dcruiz01, I guess this can be implemented by overriding the Trainer class along the lines of the initial commit, with some mixin over multiple trainers, or even by monkey-patching. Another workaround that comes to mind is tagging each eval example with an additional field identifying its dataset and splitting on that field inside the compute_metrics function. I'll try to share whatever actually works; both feel really hacky, though. (A rough sketch of the subclass approach is appended after the list below.)

Though I agree this is less common for standard fine-tuning, I want to add some use cases where we want the model to perform well on multiple tasks in a zero-shot manner:

  • fine-tuning LLMs, where we want to measure performance on many different tasks / metrics / datasets;
  • meta-benchmarks such as MTEB.
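Here is a rough sketch of the subclass workaround; MultiMetricTrainer and compute_metrics_per_dataset are hypothetical names, not part of transformers. It assumes the merged behaviour where the Trainer calls evaluate() with metric_key_prefix="eval_<dataset_name>" for each entry of a dict-valued eval_dataset.

```python
from transformers import Trainer

class MultiMetricTrainer(Trainer):
    def __init__(self, *args, compute_metrics_per_dataset=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Maps an eval_dataset key (e.g. "squad") to its own metrics function.
        self.compute_metrics_per_dataset = compute_metrics_per_dataset or {}

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # Recover the dataset name from the prefix and temporarily swap in the
        # matching compute_metrics, falling back to the default one otherwise.
        name = metric_key_prefix.removeprefix("eval_")
        default_compute_metrics = self.compute_metrics
        if name in self.compute_metrics_per_dataset:
            self.compute_metrics = self.compute_metrics_per_dataset[name]
        try:
            return super().evaluate(
                eval_dataset=eval_dataset,
                ignore_keys=ignore_keys,
                metric_key_prefix=metric_key_prefix,
            )
        finally:
            self.compute_metrics = default_compute_metrics
```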

ZQ-Dev8 commented Sep 19, 2023

@sieu-n Any luck with the experiments you mentioned?

lhallee commented Jan 19, 2024

@sgugger @timbmg it looks like the eval_dataset change to support a dict was merged, but the compute_metrics changes were not. Will that change still be made, or not?

0seba commented Feb 2, 2024

Hey, I think an additional feature to use separate data collators would be useful.

Successfully merging this pull request may close these issues.

Supporting multiple evaluation datasets in Trainer and Seq2seqTrainer