Sharded Multi-GPU MT5 training with the Seq2SeqTrainer fails (4.21.0) #18410
Comments
It still fails when I install … Here's the traceback:
Related issue: https://discuss.pytorch.org/t/multi-gpu-model-parallelism-device-error/117854/9
This issue seems to be related to how DDP is set up in a constructor somewhere, probably in the trainer's constructor when DDP is added.
Hello @shermansiu, I am unable to reproduce the error with the transformers==4.22.0.dev0 main branch and fairscale==0.4.6; instead, I get the following traceback:

Traceback (most recent call last):
File "script.py", line 109, in <module>
main()
File "script.py", line 103, in main
trainer.train()
File "/home/sourab/transformers/src/transformers/trainer.py", line 1502, in train
return inner_training_loop(
File "/home/sourab/transformers/src/transformers/trainer.py", line 1744, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/sourab/transformers/src/transformers/trainer.py", line 2492, in training_step
loss.backward()
File "/home/sourab/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backw
ard
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [582401280] but expected shape compatible with [291200640]

Also, if you want to leverage Fully Sharded Data Parallelism, you can use the production-focused PyTorch FSDP integration in transformers by passing the following args:

args = Seq2SeqTrainingArguments(
"script_debug",
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
fp16=False,
- sharded_ddp=["zero_dp_3"],
+ fsdp=["full_shard", "auto_wrap"],
+ fsdp_transformer_layer_cls_to_wrap="T5Block",
max_steps=100,
logging_steps=5000,
save_steps=5000
)

which gives the output below (a minimal end-to-end sketch of this setup is included after the log and model printout):

***** Running training *****
Num examples = 500
Num Epochs = 2
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 100
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
...
100%|█████████████████████████████████████████████████████████████| 100/100 [00:26<00:00, 3.72it/s]
Training completed. Do not forget to share your model on huggingface.co/models =)
FullyShardedDataParallel(
(_fsdp_wrapped_module): FlattenParamsWrapper(
(_fpw_module): MT5ForConditionalGeneration(
(shared): Embedding(250112, 768)
(encoder): T5Stack(
(embed_tokens): Embedding(250112, 768)
(block): ModuleList(
(0): FullyShardedDataParallel(
(_fsdp_wrapped_module): FlattenParamsWrapper(
(_fpw_module): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
(relative_attention_bias): Embedding(32, 12)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseGatedActDense(
...
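For completeness, here is a rough, self-contained sketch of how these FSDP arguments plug into a Seq2SeqTrainer run. The checkpoint name (google/mt5-base), the toy dataset, and the tokenization settings are illustrative assumptions, not the exact script from this issue:

# Hedged sketch only: the model checkpoint, toy dataset, and sequence length below
# are assumptions for illustration, not the original reproduction script.
import torch
from transformers import (
    AutoTokenizer,
    MT5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-base"  # assumed checkpoint (d_model=768 matches the printout above)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Toy copy task with 500 examples, mirroring "Num examples = 500" in the log.
texts = [f"example sentence number {i}" for i in range(500)]
enc = tokenizer(texts, truncation=True, padding="max_length", max_length=64)
labels = tokenizer(texts, truncation=True, padding="max_length", max_length=64)["input_ids"]
# (For a real run, pad token ids in the labels are usually replaced with -100.)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = Seq2SeqTrainingArguments(
    "script_debug",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False,
    fsdp=["full_shard", "auto_wrap"],
    fsdp_transformer_layer_cls_to_wrap="T5Block",
    max_steps=100,
    logging_steps=5000,
    save_steps=5000,
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ToyDataset(enc, labels))
trainer.train()

Launched on two or more GPUs with torchrun, the Trainer wraps each T5Block in its own FSDP unit, which is what the printed module tree above shows.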
On transformers[deepspeed]==4.20.1, I don't see the issue, as you mentioned. I will look into it further this week or next.
Thanks! The weird thing is that changing the fairscale version doesn't affect whether the bug appears. As you just said, I can make the bug appear by first running …
I was able to reproduce your … I think it's still the same bug, because running … After upgrading PyTorch to 1.12.0, I applied your FSDP patch and the code started to work. Thanks!
(FSDP is only available for PyTorch versions 1.12 and later.)
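As a small, hedged illustration (not code from this thread), a script can guard the FSDP arguments behind a version check so it falls back gracefully on older PyTorch; packaging is assumed to be available, as transformers already depends on it:

# Sketch of a version guard; only enable the Trainer's FSDP flags on torch >= 1.12.
import torch
from packaging import version

if version.parse(torch.__version__.split("+")[0]) >= version.parse("1.12.0"):
    fsdp_kwargs = {
        "fsdp": ["full_shard", "auto_wrap"],
        "fsdp_transformer_layer_cls_to_wrap": "T5Block",
    }
else:
    fsdp_kwargs = {}  # fall back to the default (non-FSDP) distributed setup

# Seq2SeqTrainingArguments("script_debug", ..., **fsdp_kwargs) then only requests
# sharding where it is supported.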
Hello @shermansiu, I found the bug and raised the PR above, which should fix it. Can you try it and confirm?
Yes.
Post applying the PR, the output logs are:

100%|█████████████████████████████████████████████████████████████| 100/100 [00:25<00:00, 3.93it/s]
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 26.4257, 'train_samples_per_second': 30.274, 'train_steps_per_second': 3.784, 'tra
in_loss': 17.26375, 'epoch': 1.59}
FullyShardedDataParallel(
world_size=2, flatten_parameters=True, mixed_precision=False,
(_fsdp_wrapped_module): FlattenParamsWrapper(
(_fpw_module): MT5ForConditionalGeneration(
(shared): Embedding(250112, 768)
(encoder): T5Stack(
(embed_tokens): Embedding(250112, 768)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
...
Yes, I can confirm that it works!
I guess I don't need to file a FairScale issue after all!
Wait... am I supposed to keep the issue open until the PR is merged?
Probably, I suppose.
System Info
transformers version: 4.21.0
Platform: Linux
Python version: 3.7.6
Huggingface_hub version: 0.8.1
PyTorch version (GPU?): 1.10.2 (Yes)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes (2+ Tesla V100)
Using distributed or parallel set-up in script?: Yes
When trying to fine-tune an MT5ForConditionalGeneration model using a Seq2SeqTrainer while using multiple GPUs, I get an InternalAssert error. I am running the script using torchrun --nproc=$NUM_GPUS script.py. The issue appears when $NUM_GPUS is greater than 1. Also, it only appears when the argument sharded_ddp: ["zero_dp_3"] is passed to the trainer.

The issue occurs on transformers[deepspeed]==4.21.0, but there are no issues on transformers[deepspeed]==4.20.1. The versions of DeepSpeed and Fairscale are deepspeed==0.6.5 or deepspeed==0.6.7 and fairscale==0.4.6, and this code was run on a Linux machine.

Who can help?
@sgugger
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
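The original reproduction script is not included here. As a rough approximation (not the author's exact code), the failing configuration described above boils down to passing sharded_ddp to the training arguments and launching with torchrun on more than one GPU:

# Approximate failing configuration, reconstructed from the description above;
# the crash is reported with sharded_ddp=["zero_dp_3"] on 2+ GPUs with 4.21.0.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "script_debug",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False,
    sharded_ddp=["zero_dp_3"],  # the argument that triggers the InternalAssert error
    max_steps=100,
    logging_steps=5000,
    save_steps=5000,
)
# Launched with: torchrun --nproc=$NUM_GPUS script.py  (as quoted in the report)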
Expected behavior
The training code should not crash.