[WIP] Trainer supporting evaluation on multiple datasets #19158
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hey @sgugger, I mostly followed your suggestion in #15857, except that instead of having a list of eval_datasets plus another training arg, I solved it by passing a dict of eval_datasets. I thought a dict would work better because we also need multiple compute_metrics functions. This way everything is lined up and less error-prone. However, let me know if you think otherwise. Also, could you suggest what tests to write for this PR? I am not really sure, since the major change is in …
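A rough sketch of the interface proposed here (the model, datasets, and metric functions are placeholders, not code from this PR):

```python
from transformers import Trainer, TrainingArguments

# Hypothetical illustration of the proposal: eval datasets and metric functions
# share the same keys, so each dataset is paired with its own compute_metrics.
trainer = Trainer(
    model=model,  # model, datasets, and metric functions assumed to be defined elsewhere
    args=TrainingArguments(output_dir="out", evaluation_strategy="epoch"),
    train_dataset=train_ds,
    eval_dataset={"squad": squad_eval_ds, "commonsense_qa": cqa_eval_ds},
    compute_metrics={"squad": squad_metrics_fn, "commonsense_qa": cqa_metrics_fn},
)
```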
Thanks for your PR! Having the multiple datasets as a dict solves the problem of distinguishing a single dataset that is a list or a list of datasets. So I like this part.
However, I didn't see anything in the issue regarding using several `compute_metrics` functions. If there is a need for different metrics, it probably means different Trainers should be built, as they represent different tasks/problems. That change should be reverted, as well as the part where `compute_metrics` can be passed along to the `evaluate`/`predict` functions.
Thanks for checking it so quickly! In my case, I am training a seq2seq QA model and evaluating it on multiple datasets. However, they have different formats (e.g. extractive QA like SQuAD, or multiple-choice QA like CommonsenseQA). Using a seq2seq model for multiple formats has been proposed, for example, in the UnifiedQA paper. Having multiple trainers has the limitation that I could only train on a single dataset at a time, not on multiple ones at the same time. However, note that if you pass multiple eval_datasets as a dict but only a single compute_metrics callable, the same compute_metrics function will be called on all the eval_datasets. That's what this if statement is doing (sketched below). So the original scenario described in the issue is also solved.
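A minimal sketch of the fallback described above (illustrative, not the exact code in the PR):

```python
# If compute_metrics is a dict, look up the function for this eval dataset;
# otherwise reuse the single callable for every dataset in eval_dataset.
if isinstance(compute_metrics, dict):
    metrics_fn = compute_metrics[dataset_name]
else:
    metrics_fn = compute_metrics
```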
It's too niche of a use case to support, especially when we have other tools, like Accelerate, that easily let you build more customizable training/evaluation loops.
Alright, I have reverted the change. Let me know if there's anything else :)
Looking good, thanks a lot for amending the PR!
…#19158)
* support for multiple eval datasets
* support multiple datasets in seq2seq trainer
* add documentation
* update documentation
* make fixup
* revert option for multiple compute_metrics
* revert option for multiple compute_metrics
* revert added empty line
I'm trying to take advantage of the feature to include multiple eval_datasets in the trainer. Maybe I'm misreading the documentation; I've tried several ways to present the eval_dataset, but keep getting a KeyError when I include a DatasetDict / dict with datasets for the eval_dataset parameter. Am I doing something wrong? Do I need to specify the compute_metrics differently? I couldn't find anything on that. Here's an example notebook resulting in the error: https://colab.research.google.com/drive/1yLo9iqY4Cz9_h8BtAvcYRCtK5O_xa5jP?usp=sharing
@sgugger passing multiple …
I'd recommend using Accelerate instead of the Trainer for this use case.
@sgugger Any examples out there of using Accelerate for this? I would also like to evaluate on multiple datasets while training. Thanks!
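In case it helps, here is a minimal sketch of a custom Accelerate loop that evaluates on several datasets during training; the model, optimizer, dataloaders, and num_epochs are assumed to be defined elsewhere.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# model, optimizer, train_dataloader, eval_dataloaders (a dict of name -> DataLoader)
# and num_epochs are placeholders assumed to exist in the surrounding script.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
eval_dataloaders = {name: accelerator.prepare(dl) for name, dl in eval_dataloaders.items()}

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # Evaluate on every dataset after each epoch, logging one loss per dataset.
    model.eval()
    for name, eval_dataloader in eval_dataloaders.items():
        losses = []
        for batch in eval_dataloader:
            with torch.no_grad():
                loss = model(**batch).loss
            losses.append(accelerator.gather(loss.repeat(batch["input_ids"].shape[0])))
        accelerator.print(f"epoch {epoch} - {name} eval loss: {torch.cat(losses).mean().item():.4f}")
```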
fyi, @dcruiz01 I guess this can be implemented by overriding the … (see the hypothetical sketch below). Though I agree this is less common for standard fine-tuning, I want to add use cases where we want the model to perform well on multiple tasks in a zero-shot manner.
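For what it's worth, a hypothetical sketch of that overriding idea: a Trainer subclass that swaps in a per-dataset metric function before delegating to the parent evaluate. The class name, the metric_fns argument, and the prefix handling are illustrative assumptions, not part of the library or this PR.

```python
from transformers import Trainer

class MultiMetricTrainer(Trainer):
    """Illustrative subclass: choose a compute_metrics function per eval dataset.

    metric_fns maps a dataset name (the same keys as in eval_dataset) to a
    compute_metrics callable.
    """

    def __init__(self, *args, metric_fns=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.metric_fns = metric_fns or {}

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # With a dict eval_dataset, evaluate is called with a prefix like
        # "eval_<name>"; strip "eval_" to recover the name and pick its metric fn.
        name = metric_key_prefix[len("eval_"):] if metric_key_prefix.startswith("eval_") else None
        if name in self.metric_fns:
            self.compute_metrics = self.metric_fns[name]
        return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)
```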
@sieu-n Any luck with the experiments you mentioned?
Hey, I think an additional feature to use separate data collators per dataset would be useful.
What does this PR do?
With this PR, `Trainer` and `Seq2SeqTrainer` support evaluating on multiple datasets. For this, the `eval_dataset` and `compute_metrics` parameters have been updated. In order to evaluate on multiple datasets, `eval_dataset` should be a dict mapping a dataset name to a Dataset. In `_maybe_log_save_evaluate` we then loop over the dict, calling `evaluate` with each Dataset. The metric prefix is also updated to contain the dataset name. Furthermore, each eval dataset can optionally have its own `compute_metrics` function. For this, `compute_metrics` should be a dict whose keys match those of `eval_dataset`.

Fixes #15857
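For illustration, evaluating on multiple datasets with this change could look roughly like the following (the model, datasets, and metric function are placeholders; note that the per-dataset `compute_metrics` option described above was reverted during review, so a single callable is shown):

```python
from transformers import Trainer, TrainingArguments

# Placeholders: model, train_ds, squad_eval_ds, cqa_eval_ds and compute_metrics
# are assumed to be defined elsewhere.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", evaluation_strategy="steps", eval_steps=500),
    train_dataset=train_ds,
    eval_dataset={"squad": squad_eval_ds, "cqa": cqa_eval_ds},
    compute_metrics=compute_metrics,  # a single callable, applied to every eval dataset
)
trainer.train()
# Each dataset's metrics are logged under a prefix containing its name,
# e.g. eval_squad_loss and eval_cqa_loss.
```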
Who can review?
@sgugger