Add predict step accumulation #7767
Conversation
Wow, this is really cool! Great job!
For the TPU perf script, should we add that to our TPU CI to make sure perf doesn’t regress?
Very nice, way cleaner!
You can add your TPU tests to test_xla_examples.py for now, and let's refactor later to have a better suite for TPU.
src/transformers/trainer_pt_utils.py
If our dataset has 15 samples with a batch size of 2 on 3 processes and we gather then transfer on
CPU at every step, our sampler will generate the following indices:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2]
Aren't we computing the results twice on samples 0, 1 and 2?
Yes we are. The distributed sampler used makes sure all processes get the same length of dataset by circling through the indices. Though for my example, 15 is a multiple of 3, so I should have used 16 elements... That's why `finalize` has a truncation operation, to drop those extra elements we added.
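A simplified sketch of the padding-and-truncation behavior discussed above (this is an illustration, not the actual sampler or gatherer code from the PR): indices are circled so every process sees the same number of batches, and the duplicated predictions are dropped at the end.

```python
# Illustration only: pad the index list by wrapping around, then truncate
# the gathered predictions back to the real dataset size at the end.
num_samples, world_size, batch_size = 15, 3, 2

indices = list(range(num_samples))
step = batch_size * world_size  # samples processed per gather step
while len(indices) % step != 0:
    indices.append(indices[len(indices) - num_samples])  # circle back: 0, 1, 2, ...

print(indices)
# [0, 1, ..., 14, 0, 1, 2]  -> samples 0, 1 and 2 are evaluated twice

# After all predictions are gathered, a finalize-style truncation drops the
# extra entries so the result matches the real dataset size.
predictions = [f"pred_{i}" for i in indices]  # stand-in for model outputs
final_predictions = predictions[:num_samples]
assert len(final_predictions) == num_samples
```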
Great, thanks for the explanation.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
LGTM, thanks @sgugger
Thanks so much @sgugger - you're a legend! 🙇
* Add eval_accumulation_step and clean distributed eval
* Add TPU test
* Add TPU stuff
* Fix arg name
* Fix Seq2SeqTrainer
* Fix total_size
* Update src/transformers/trainer_pt_utils.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Doc and add test to TPU
* Add unit test
* Adapt name

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This reverts commit cc03605.
Thanks so much @sgugger
What does this PR do?
Currently, the Trainer accumulates all predictions on the host (GPU or TPU) before gathering across all hosts (in case of distributed training) and moving back to the CPU. This can result in OOM errors when users have a big dataset (the model is already taking up a lot of space on the host), as highlighted in #7232. However, moving the predictions to the CPU at each prediction step is also inefficient (particularly on TPU).
This PR aims at fixing the OOM problem while retaining efficiency by introducing a new training argument called `eval_accumulation_step`. If left untouched, the behavior is the same as right now (all predictions accumulated on the host and moved at the end of the prediction loop). If set to an int, the predictions are gathered and moved every `eval_accumulation_step` steps. This required some clever reorganization of the predictions (see the docstring of `DistributedTensorGatherer` for more details).

In passing, I cleaned up the code related to gathering tensors across multiple hosts, fixed the issue of the `loss.item()` call (a big slowdown to do that at every step on TPUs), and accumulated the losses the same way predictions and labels are. This still works for any number of outputs/labels of the model.

To check those changes did not break anything, I ran `test_trainer_distributed.py` on my local setup and created an equivalent for TPUs that I also ran (they both pass).

This slightly changes `Seq2SeqTrainer` (since we don't want the `loss.item()`), so cc @patil-suraj - I don't think this should break anything in it.

Fixes #7232
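For reference, a rough sketch of the accumulation pattern described above. The function and variable names here are hypothetical and the real logic lives in the Trainer's prediction loop and `DistributedTensorGatherer`; this only shows the offload-every-N-steps idea under the assumption that the model returns a tensor of predictions.

```python
import torch


def prediction_loop_sketch(model, dataloader, eval_accumulation_step=None):
    """Schematic version of the behavior described above, not the real Trainer code."""
    cpu_chunks = []   # prediction chunks already moved to CPU
    host_preds = []   # predictions still sitting on the device (GPU/TPU)
    for step, batch in enumerate(dataloader, start=1):
        with torch.no_grad():
            preds = model(batch)  # assume the model returns a tensor of predictions
        host_preds.append(preds)

        # With eval_accumulation_step set, offload every N steps; with None,
        # everything stays on the device until the end (the previous behavior).
        if eval_accumulation_step is not None and step % eval_accumulation_step == 0:
            # In a distributed run, this is also where predictions would be
            # gathered across processes before moving to CPU.
            cpu_chunks.append(torch.cat(host_preds).cpu())
            host_preds = []

    if host_preds:  # offload whatever is left after the last step
        cpu_chunks.append(torch.cat(host_preds).cpu())
    return torch.cat(cpu_chunks)
```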