[zero3] fix reference counting in backward over multiple forwards #1227

Merged 2 commits into microsoft:master from fix-prefetch-with-repeat-layer on Jul 14, 2021

Conversation

@stas00 (Collaborator) commented Jul 14, 2021

Models like Albert run the same layer's forward multiple times in a loop before doing backward. The current implementation can't handle that because it assumes forward/backward pairs and runs the prefetch hooks only once; the subsequent backward passes then operate on the partitioned / ungathered param, which breaks with:

Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
    main()
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 501, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1275, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1784, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1191, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage3.py", line 2972, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function LinearFunctionForZeroStage3Backward returned an invalid gradient at index 0 - got [2, 512] but expected shape compatible with [2, 512, 256]
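To make the failure mode concrete, here is a minimal, hypothetical sketch (not Albert itself) of the usage pattern that triggers it: a single shared layer whose forward runs several times in a loop before one backward, so a single backward pass traverses the same parameter repeatedly.

```python
import torch
import torch.nn as nn

# Minimal sketch: one shared layer invoked repeatedly before a single backward,
# the pattern Albert uses with its shared transformer layer. Under ZeRO-3 each
# traversal needs the gathered (unpartitioned) parameter to be available.
class SharedLayerLoop(nn.Module):
    def __init__(self, hidden=256, num_repeats=3):
        super().__init__()
        self.layer = nn.Linear(hidden, hidden)  # one set of weights, reused
        self.num_repeats = num_repeats

    def forward(self, x):
        for _ in range(self.num_repeats):  # same layer, multiple forwards
            x = self.layer(x)
        return x

model = SharedLayerLoop()
out = model(torch.randn(2, 512, 256))
out.sum().backward()  # one backward walks through the layer num_repeats times
```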

This PR

  • switches to reference counting instead of an on/off flag, which solves the problem (see the sketch after this list)
  • adds tests
  • makes some other small improvements to the tests and code
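The idea behind the fix, in a hedged sketch (the class and method names below are illustrative, not DeepSpeed's actual API): a boolean "gathered" flag gets cleared on the first backward hook, so later backward passes through the same layer see only the partitioned param. A counter that is incremented once per forward and decremented once per backward keeps the gathered param alive until the last pending backward has run.

```python
# Hypothetical sketch of reference counting vs. an on/off flag for a
# ZeRO-3-style partitioned parameter. Placeholder methods stand in for the
# real all-gather / re-partition operations.
class ParamState:
    def __init__(self):
        self.ref_count = 0  # replaces a single boolean "available" flag

    def on_forward(self):
        # each forward through the layer registers one pending backward
        self.ref_count += 1
        self.gather_if_needed()

    def on_backward(self):
        # only release (re-partition) once every pending backward has run
        self.ref_count -= 1
        if self.ref_count == 0:
            self.release()

    def gather_if_needed(self):
        pass  # placeholder: all-gather the partitioned parameter

    def release(self):
        pass  # placeholder: re-partition / free the gathered parameter
```

With the flag approach, the second backward through the shared layer would find the parameter already released; with the counter, release only happens after the final backward, matching the number of forwards.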

Thank you @tjruwase and @samyam for helping to diagnose and fix this problem.

@stas00 stas00 changed the title [WIP] [zero3] fix reference counting in backward over multiple forwards [zero3] fix reference counting in backward over multiple forwards Jul 14, 2021
tests/unit/test_zero.py (review comment resolved)
@tjruwase tjruwase merged commit 3fa2420 into microsoft:master Jul 14, 2021
@stas00 stas00 deleted the fix-prefetch-with-repeat-layer branch July 14, 2021 16:13