[zero3] fix reference counting in backward over multiple forwards #1227

Merged 2 commits into microsoft:master from fix-prefetch-with-repeat-layer on Jul 14, 2021

Conversation

@stas00 (Collaborator) commented Jul 14, 2021

Models like Albert run the same layer's forward multiple times in a loop before doing backward. The current implementation can't handle that because it assumes forward/backward pairs and runs the prefetch hooks only once; the subsequent backward passes then operate on the partitioned / ungathered param, which breaks with:

Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
    main()
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 501, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1275, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1784, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1191, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage3.py", line 2972, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function LinearFunctionForZeroStage3Backward returned an invalid gradient at index 0 - got [2, 512] but expected shape compatible with [2, 512, 256]
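To make the failure mode concrete, here is a minimal, hypothetical sketch (not Albert itself) of the usage pattern that triggers it: a single shared layer whose forward runs several times in a loop before one backward, so a single backward pass traverses the same parameter repeatedly.

```python
import torch
import torch.nn as nn

# Minimal sketch: one shared layer invoked repeatedly before a single backward,
# the pattern Albert uses with its shared transformer layer. Under ZeRO-3 each
# traversal needs the gathered (unpartitioned) parameter to be available.
class SharedLayerLoop(nn.Module):
    def __init__(self, hidden=256, num_repeats=3):
        super().__init__()
        self.layer = nn.Linear(hidden, hidden)  # one set of weights, reused
        self.num_repeats = num_repeats

    def forward(self, x):
        for _ in range(self.num_repeats):  # same layer, multiple forwards
            x = self.layer(x)
        return x

model = SharedLayerLoop()
out = model(torch.randn(2, 512, 256))
out.sum().backward()  # one backward walks through the layer num_repeats times
```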

This PR

  • switches to reference counting instead of an on/off flag, which solves the problem (see the sketch after this list)
  • adds tests
  • makes some other small improvements to the tests and code
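The idea behind the fix, in a hedged sketch (the class and method names below are illustrative, not DeepSpeed's actual API): a boolean "gathered" flag gets cleared on the first backward hook, so later backward passes through the same layer see only the partitioned param. A counter that is incremented once per forward and decremented once per backward keeps the gathered param alive until the last pending backward has run.

```python
# Hypothetical sketch of reference counting vs. an on/off flag for a
# ZeRO-3-style partitioned parameter. Placeholder methods stand in for the
# real all-gather / re-partition operations.
class ParamState:
    def __init__(self):
        self.ref_count = 0  # replaces a single boolean "available" flag

    def on_forward(self):
        # each forward through the layer registers one pending backward
        self.ref_count += 1
        self.gather_if_needed()

    def on_backward(self):
        # only release (re-partition) once every pending backward has run
        self.ref_count -= 1
        if self.ref_count == 0:
            self.release()

    def gather_if_needed(self):
        pass  # placeholder: all-gather the partitioned parameter

    def release(self):
        pass  # placeholder: re-partition / free the gathered parameter
```

With the flag approach, the second backward through the shared layer would find the parameter already released; with the counter, release only happens after the final backward, matching the number of forwards.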

Thank you @tjruwase and @samyam for helping to diagnose and fix this problem.

@stas00 stas00 changed the title [WIP] [zero3] fix reference counting in backward over multiple forwards [zero3] fix reference counting in backward over multiple forwards Jul 14, 2021
tests/unit/test_zero.py (review comment resolved)
@tjruwase tjruwase merged commit 3fa2420 into microsoft:master Jul 14, 2021
@stas00 stas00 deleted the fix-prefetch-with-repeat-layer branch July 14, 2021 16:13