
Add predict step accumulation #7767

Merged: sgugger merged 10 commits into master from predict_step_accumulation on Oct 14, 2020

Conversation

@sgugger (Collaborator) commented Oct 13, 2020

What does this PR do?

Currently, the Trainer accumulates all predictions on the host (GPU or TPU) before gathering them across all hosts (in the case of distributed training) and moving them back to the CPU. This can result in OOM errors when users have a big dataset (the model already takes up a lot of space on the host), as highlighted in #7232. However, moving the predictions to the CPU at each prediction step is also inefficient (particularly on TPU).

This PR aims to fix the OOM problem while retaining efficiency by introducing a new training argument called eval_accumulation_step. If left untouched, the behavior is the same as it is now (all predictions accumulated on the host and moved at the end of the prediction loop). If set to an int, the predictions are gathered and moved to the CPU every eval_accumulation_step steps. This required some clever reorganization of the predictions (see the docstring of DistributedTensorGatherer for more details).
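A rough usage sketch (the output_dir and batch size below are placeholders; the argument ultimately shipped as eval_accumulation_steps, plural, as pointed out later in this thread):

```python
# Hedged sketch: gather predictions and move them to the CPU every 20 eval
# steps instead of keeping everything on the GPU/TPU until the end.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",              # placeholder
    per_device_eval_batch_size=8,  # placeholder
    eval_accumulation_steps=20,    # the merged (plural) name of the new argument
)
# trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
# metrics = trainer.evaluate()
```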

In passing, I cleaned up the code related to gathering tensors across multiple hosts and fixed the loss.item() issue (calling it at every step is a big slowdown on TPUs) by accumulating the losses the same way predictions and labels are. This still works for any number of outputs/labels from the model.
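For illustration only (not the Trainer's actual code), the idea behind dropping the per-step loss.item() is to keep the per-step losses as device tensors and do a single transfer at the end of the accumulation window:

```python
import torch

losses = []
for step in range(8):
    step_loss = torch.rand(())                    # stand-in for one eval step's loss
    losses.append(step_loss.detach().reshape(1))  # stays on device, no host sync
all_losses = torch.cat(losses).cpu()              # one device-to-host transfer
mean_loss = all_losses.mean().item()              # a single sync at the very end
```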

To check that these changes did not break anything, I ran test_trainer_distributed.py on my local setup and created an equivalent for TPUs, which I also ran (both pass).

This slightly changes Seq2SeqTrainer (since we don't want the loss.item()), so cc @patil-suraj; I don't think this should break anything in it.

Fixes #7232

@julien-c (Member) left a comment

Wow, this is really cool! Great job!

For the TPU perf script, should we add that to our TPU CI to make sure perf doesn’t regress?

@LysandreJik (Member) left a comment

Very nice, way cleaner!

You can add your TPU tests to test_xla_examples.py for now, and let's refactor later to have a better suite for TPUs.

src/transformers/trainer_pt_utils.py
If our dataset has 15 samples with a batch size of 2 on 3 processes and we gather then transfer on
CPU at every step, our sampler will generate the following indices:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2]
Member:

Aren't we computing the results twice on samples 0, 1 and 2?

sgugger (Collaborator, Author):

Yes, we are: the distributed sampler makes sure all processes get the same number of samples by circling back through the indices. Though since 15 is already a multiple of 3, I should have used 16 elements in my example...
That's why finalize has a truncation operation, to drop those extra elements we added.
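A small offline sketch of that reordering and truncation, assuming the DistributedTensorGatherer interface described in its docstring (world_size, num_samples, add_arrays, finalize) and using the corrected 16-sample example:

```python
import numpy as np
from transformers.trainer_pt_utils import DistributedTensorGatherer

# 16 samples, 3 processes, batch size 2: each process gets a contiguous chunk
# of 6 indices; the last two on process 2 are the circled-back duplicates 0, 1.
chunks = [
    np.arange(0, 6),
    np.arange(6, 12),
    np.concatenate([np.arange(12, 16), [0, 1]]),
]

gatherer = DistributedTensorGatherer(world_size=3, num_samples=16)
for step in range(3):
    # What a cross-process gather would yield after each batch of 2.
    gathered = np.concatenate([chunk[2 * step:2 * step + 2] for chunk in chunks])
    gatherer.add_arrays(gathered)

preds = gatherer.finalize()  # back in dataset order, padded duplicates dropped
assert list(preds) == list(range(16))
```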

Member:

Great, thanks for the explanation.

@LysandreJik (Member) left a comment

LGTM, thanks @sgugger

@sgugger merged commit a1d1b33 into master on Oct 14, 2020
@sgugger deleted the predict_step_accumulation branch on October 14, 2020 at 15:41
@eugeneware

Thanks so much @sgugger - you're a legend! 🙇

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* Add eval_accumulation_step and clean distributed eval

* Add TPU test

* Add TPU stuff

* Fix arg name

* Fix Seq2SeqTrainer

* Fix total_size

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Doc and add test to TPU

* Add unit test

* Adapt name

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
@Godwinh19 commented May 11, 2022

(quoting the PR description above)

Thanks so much @sgugger
Note that the argument name is eval_accumulation_steps, not eval_accumulation_step 😉

Successfully merging this pull request may close these issues.

trainer.evaluate() aggregates predictions on GPU and causes CUDA out of memory issues for large datasets