Add predict step accumulation #7767
Conversation
Wow, this is really cool! Great job!
For the TPU perf script, should we add that to our TPU CI to make sure perf doesn’t regress?
Very nice, way cleaner!
You can add your TPU tests to test_xla_examples.py for now, and let's refactor later to have a better suite for TPU.
src/transformers/trainer_pt_utils.py
If our dataset has 15 samples with a batch size of 2 on 3 processes and we gather then transfer on
CPU at every step, our sampler will generate the following indices:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2]
Aren't we computing the results twice on samples 0, 1 and 2?
Yes we are. The distributed sampler used makes sure all processes get the same length of dataset by circling through the indices. Though for my example, 15 is a multiple of 3, so I should have used 16 elements... That's why `finalize` has a truncation operation, to drop those extra elements we added.
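A simplified sketch of the padding-and-truncation behavior discussed above (this is an illustration, not the actual sampler or gatherer code from the PR): indices are circled so every process sees the same number of batches, and the duplicated predictions are dropped at the end.

```python
# Illustration only: pad the index list by wrapping around, then truncate
# the gathered predictions back to the real dataset size at the end.
num_samples, world_size, batch_size = 15, 3, 2

indices = list(range(num_samples))
step = batch_size * world_size  # samples processed per gather step
while len(indices) % step != 0:
    indices.append(indices[len(indices) - num_samples])  # circle back: 0, 1, 2, ...

print(indices)
# [0, 1, ..., 14, 0, 1, 2]  -> samples 0, 1 and 2 are evaluated twice

# After all predictions are gathered, a finalize-style truncation drops the
# extra entries so the result matches the real dataset size.
predictions = [f"pred_{i}" for i in indices]  # stand-in for model outputs
final_predictions = predictions[:num_samples]
assert len(final_predictions) == num_samples
```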
Great, thanks for the explanation.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
LGTM, thanks @sgugger
Thanks so much @sgugger - you're a legend! 🙇
* Add eval_accumulation_step and clean distributed eval
* Add TPU test
* Add TPU stuff
* Fix arg name
* Fix Seq2SeqTrainer
* Fix total_size
* Update src/transformers/trainer_pt_utils.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Doc and add test to TPU
* Add unit test
* Adapt name

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This reverts commit cc03605.
Thanks so much @sgugger
What does this PR do?
Currently, the Trainer accumulates all predictions on the host (GPU or TPU) before gathering across all hosts (in case of distributed training) and moving back to the CPU. This can result in OOM errors when users have a big dataset (the model is already taking up a lot of space on the host), as highlighted in #7232. However, moving the predictions to the CPU at each prediction step is also inefficient (particularly on TPU).
This PR aims at fixing the OOM problem while retaining efficiency by introducing a new training argument called `eval_accumulation_step`. If left untouched, the behavior is the same as right now (all predictions accumulated on the host and moved at the end of the prediction loop). If set to an int, the predictions are gathered and moved every `eval_accumulation_step` steps. This required some clever reorganization of the predictions (see the docstring of `DistributedTensorGatherer` for more details).

In passing, I cleaned up the code related to gathering tensors across multiple hosts, fixed the issue of the `loss.item()` call (a big slowdown to do that at every step on TPUs), and accumulated the losses the same way predictions and labels are. This still works for any number of outputs/labels of the model.

To check those changes did not break anything, I ran `test_trainer_distributed.py` on my local setup and created an equivalent for TPUs that I also ran (they both pass).

This slightly changes `Seq2SeqTrainer` (since we don't want the `loss.item()`), so cc @patil-suraj - I don't think this should break anything in it.

Fixes #7232
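For reference, a rough sketch of the accumulation pattern described above. The function and variable names here are hypothetical and the real logic lives in the Trainer's prediction loop and `DistributedTensorGatherer`; this only shows the offload-every-N-steps idea under the assumption that the model returns a tensor of predictions.

```python
import torch


def prediction_loop_sketch(model, dataloader, eval_accumulation_step=None):
    """Schematic version of the behavior described above, not the real Trainer code."""
    cpu_chunks = []   # prediction chunks already moved to CPU
    host_preds = []   # predictions still sitting on the device (GPU/TPU)
    for step, batch in enumerate(dataloader, start=1):
        with torch.no_grad():
            preds = model(batch)  # assume the model returns a tensor of predictions
        host_preds.append(preds)

        # With eval_accumulation_step set, offload every N steps; with None,
        # everything stays on the device until the end (the previous behavior).
        if eval_accumulation_step is not None and step % eval_accumulation_step == 0:
            # In a distributed run, this is also where predictions would be
            # gathered across processes before moving to CPU.
            cpu_chunks.append(torch.cat(host_preds).cpu())
            host_preds = []

    if host_preds:  # offload whatever is left after the last step
        cpu_chunks.append(torch.cat(host_preds).cpu())
    return torch.cat(cpu_chunks)
```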