Sharded Multi-GPU MT5 training with the Seq2SeqTrainer fails (4.21.0) #18410

Closed
2 of 4 tasks
shermansiu opened this issue Aug 1, 2022 · 12 comments · Fixed by #18435

Comments

@shermansiu
Contributor

shermansiu commented Aug 1, 2022

System Info

transformers version: 4.21.0
Platform: Linux
Python version: 3.7.6
Huggingface_hub version: 0.8.1
PyTorch version (GPU?): 1.10.2 (Yes)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes (2+ Tesla V100)
Using distributed or parallel set-up in script?: Yes

When trying to fine-tune an MT5ForConditionalGeneration model with a Seq2SeqTrainer on multiple GPUs, I get an INTERNAL ASSERT error. I am running the script using torchrun --nproc_per_node=$NUM_GPUS script.py. The issue only appears when $NUM_GPUS is greater than 1, and only when sharded_ddp=["zero_dp_3"] is passed in the training arguments.

  Traceback (most recent call last):
  File "script.py", line 475, in <module>
    fire.Fire(main)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "script.py", line 447, in main
    train_model(model, tokenizer, cli_arguments)
  File "script.py", line 357, in train_model
    trainer.train()
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1502, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2488, in training_step
    loss.backward()
  File "/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1640811797118/work/torch/csrc/distributed/c10d/reducer.cpp":328, please report a bug to PyTorch. 

  0%|          | 0/100000 [00:06<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 660 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 662 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 663 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 661) of binary: /miniconda/bin/python
Traceback (most recent call last):
  File "/miniconda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2', 'console_scripts', 'torchrun')())
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
script.py FAILED
------------------------------------------------------------

The issue occurs on transformers[deepspeed]==4.21.0 but not on transformers[deepspeed]==4.20.1. It reproduces with deepspeed==0.6.5 or deepspeed==0.6.7 and fairscale==0.4.6, and the code was run on a Linux machine.

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# The simplified contents of script.py
# Running torchrun --nproc_per_node=1 script.py should work
# Running torchrun --nproc_per_node=4 script.py should fail with a RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED error.

from __future__ import annotations
import functools
import typing as tp
import datasets
import transformers
from transformers import (
    DataCollatorForSeq2Seq,
    PreTrainedTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)


increment_en = [
    {"input": "One", "target": "Two"},
    {"input": "Three", "target": "Four"},
    {"input": "Five", "target": "Six"},
    {"input": "Seven", "target": "Eight"},
    {"input": "Nine", "target": "Ten"},
]
increment_en = increment_en * 100


def lod_to_dol(list_of_dicts: tp.List[tp.Dict[str, tp.Any]]) -> tp.Dict[str, list]:
    dict_of_lists = {
        key: [dct[key] for dct in list_of_dicts] for key in list_of_dicts[0]
    }
    return dict_of_lists


increment_en = lod_to_dol(increment_en)


def preprocess_function_(
    examples,
    tokenizer: PreTrainedTokenizer,
    max_input_length: int,
    max_target_length: int,
):
    inputs = examples["input"]
    targets = examples["target"]

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def main():
    tokenizer = transformers.MT5Tokenizer.from_pretrained("google/mt5-base")
    model = transformers.MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

    args = Seq2SeqTrainingArguments(
        "script_debug",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        fp16=False,
        push_to_hub=False,
        sharded_ddp=["zero_dp_3"],
        max_steps=10000,
        logging_steps=5000,
        save_steps=5000
    )

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

    dataset = datasets.DatasetDict(
        {
            "train": datasets.Dataset.from_dict(increment_en),
            "test": datasets.Dataset.from_dict(increment_en),
        }
    )

    preprocess_function = functools.partial(
        preprocess_function_,
        tokenizer=tokenizer,
        max_input_length=512,
        max_target_length=512
    )

    processed_ds = dataset.map(preprocess_function, batched=True)
    processed_ds.set_format(
        type="torch", columns=["input_ids", "attention_mask", "labels"]
    )

    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=processed_ds["train"],
        eval_dataset=processed_ds["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()

Expected behavior

The training code should not crash.

@shermansiu shermansiu added the bug label Aug 1, 2022
@shermansiu shermansiu changed the title Sharded Multi-GPU MT5 training fails to Sharded Multi-GPU MT5 training fails (4.21.0) Aug 1, 2022
@shermansiu shermansiu changed the title Sharded Multi-GPU MT5 training fails (4.21.0) to Sharded Multi-GPU MT5 training with the Seq2SeqTrainer fails (4.21.0) Aug 1, 2022
@shermansiu
Contributor Author

It still fails when I install transformers directly from the GitHub repository (as of today).

Here's the traceback:

Traceback (most recent call last):
  File "script.py", line 102, in <module>
    main()
  File "script.py", line 98, in main
    trainer.train()
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 1506, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 1744, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 2492, in training_step
    loss.backward()
  File "/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1640811797118/work/torch/csrc/distributed/c10d/reducer.cpp":328, please report a bug to PyTorch. 
  0%|                              | 0/10000 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48181) of binary: /miniconda/bin/python
Traceback (most recent call last):
  File "/miniconda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2', 'console_scripts', 'torchrun')())
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-02_15:26:28
  host      : bolt-imq45r3c3y-8dfzr73qqa.bolt-pods.turi-bolt.svc.int.usmsc39.applecloud.io
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 48182)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-02_15:26:28
  host      : bolt-imq45r3c3y-8dfzr73qqa.bolt-pods.turi-bolt.svc.int.usmsc39.applecloud.io
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 48181)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@shermansiu
Contributor Author

Related issue: https://discuss.pytorch.org/t/multi-gpu-model-parallelism-device-error/117854/9

This issue seems to be related to how DDP is set up, most likely in the Trainer's constructor where the model is wrapped for distributed training (a rough sketch of that wrapping follows).
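
For context, here is a minimal sketch, assuming fairscale==0.4.6 and that torch.distributed has already been initialized by torchrun, of how a model gets wrapped when sharded_ddp=["zero_dp_3"] is set. This is not the Trainer's exact code, just an illustration of the wrapping step:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP


def wrap_for_zero_dp_3(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # ZeRO stage 3: parameters, gradients and optimizer state are sharded across
    # ranks, and full parameters are re-gathered on the fly during forward/backward.
    return FullyShardedDDP(
        model,
        reshard_after_forward=True,   # stage-3 behaviour; stage 2 keeps params gathered after forward
        mixed_precision=False,        # matches fp16=False in the reproduction script
    ).to(device)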

@pacman100
Contributor

pacman100 commented Aug 2, 2022

Hello @shermansiu, I am unable to reproduce the error with the transformers==4.22.0.dev0 main branch and fairscale==0.4.6. Note that sharded_ddp has nothing to do with DeepSpeed. I get another error, and it is unrelated to the integration, so please open the issue with Fairscale and follow it there. The error I see, which is different from yours, is below:

Traceback (most recent call last):                                                                  
  File "script.py", line 109, in <module>                                                           
    main()
  File "script.py", line 103, in main
    trainer.train()
  File "/home/sourab/transformers/src/transformers/trainer.py", line 1502, in train
    return inner_training_loop(
  File "/home/sourab/transformers/src/transformers/trainer.py", line 1744, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/sourab/transformers/src/transformers/trainer.py", line 2492, in training_step
    loss.backward()
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backw
ard
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [582401
280] but expected shape compatible with [291200640]

Also, if you want to leverage Fully Sharded Data Parallelism, you can use the production-focused PyTorch FSDP integration in transformers with the following args:

args = Seq2SeqTrainingArguments(
        "script_debug",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        fp16=False,
-      sharded_ddp=["zero_dp_3"],
+      fsdp=["full_shard", "auto_wrap"],
+      fsdp_transformer_layer_cls_to_wrap="T5Block",
        max_steps=100,
        logging_steps=5000,
        save_steps=5000
    )

which gives the output below:

***** Running training *****
  Num examples = 500
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 100
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"

...


100%|█████████████████████████████████████████████████████████████| 100/100 [00:26<00:00,  3.72it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


FullyShardedDataParallel( 
  (_fsdp_wrapped_module): FlattenParamsWrapper(
    (_fpw_module): MT5ForConditionalGeneration(
      (shared): Embedding(250112, 768)
      (encoder): T5Stack( 
        (embed_tokens): Embedding(250112, 768)
        (block): ModuleList(
          (0): FullyShardedDataParallel(
            (_fsdp_wrapped_module): FlattenParamsWrapper(
              (_fpw_module): T5Block(
                (layer): ModuleList(
                  (0): T5LayerSelfAttention(
                    (SelfAttention): T5Attention(
                      (q): Linear(in_features=768, out_features=768, bias=False)
                      (k): Linear(in_features=768, out_features=768, bias=False)
                      (v): Linear(in_features=768, out_features=768, bias=False)
                      (o): Linear(in_features=768, out_features=768, bias=False)
                      (relative_attention_bias): Embedding(32, 12)
                    )
                    (layer_norm): T5LayerNorm()
                    (dropout): Dropout(p=0.1, inplace=False)
                  )
                  (1): T5LayerFF(
                    (DenseReluDense): T5DenseGatedActDense(

...

On transformers[deepspeed]==4.20.1, I don't see the issue, as you mentioned. I will look into it further this week or next.
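
For reference, here is the suggested change applied in full to the reproduction script's training arguments (a sketch; it assumes the fsdp and fsdp_transformer_layer_cls_to_wrap arguments shown in the diff above exist in the installed transformers version, and PyTorch >= 1.12 as noted further down):

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "script_debug",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False,
    push_to_hub=False,
    fsdp=["full_shard", "auto_wrap"],              # PyTorch FSDP replaces sharded_ddp
    fsdp_transformer_layer_cls_to_wrap="T5Block",  # auto-wrap each T5Block
    max_steps=100,
    logging_steps=5000,
    save_steps=5000,
)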

@shermansiu
Contributor Author

shermansiu commented Aug 2, 2022

Thanks! The weird thing is that changing the fairscale version doesn't affect whether the bug appears.

As you said, I can make the bug appear by running pip install transformers==4.21.0 and make it disappear by running pip install transformers==4.20.1. I'll file a bug report in the FairScale repository anyway.

@shermansiu
Contributor Author

I was able to reproduce your RuntimeError: Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [582401280] but expected shape compatible with [145600320] error by upgrading PyTorch (cudatoolkit=11.3) from 1.10.2 to 1.12.0.

I think it's still the same bug because running torchrun --nproc_per_node=1 script.py with pytorch==1.12.0 works.

After upgrading PyTorch to 1.12.0, I applied your FSDP patch and the code started to work. Thanks!

@shermansiu
Contributor Author

(FSDP is only available for PyTorch versions 1.12 and later)
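
One way to make that constraint explicit in a training script is a small guard like the following (an illustrative sketch, not part of the reproduction script above):

import torch
from packaging import version

if version.parse(torch.__version__) < version.parse("1.12.0"):
    raise RuntimeError("The fsdp training arguments require PyTorch >= 1.12.")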

@pacman100
Contributor

Hello @shermansiu, I found the bug and raised the PR above, which should fix it. Can you try the PR and confirm?

@pacman100
Contributor

(FSDP is only available for PyTorch versions 1.12 and later)

Yes

@pacman100
Contributor

After applying the PR, these are the output logs for sharded_ddp:

100%|█████████████████████████████████████████████████████████████| 100/100 [00:25<00:00,  3.93it/s]
                                                                                                    
Training completed. Do not forget to share your model on huggingface.co/models =)                   
                                                                                                    
                                                                                                    
                                                                                                    
{'train_runtime': 26.4257, 'train_samples_per_second': 30.274, 'train_steps_per_second': 3.784, 'train_loss': 17.26375, 'epoch': 1.59}
FullyShardedDataParallel(                                                                           
  world_size=2, flatten_parameters=True, mixed_precision=False,                                     
  (_fsdp_wrapped_module): FlattenParamsWrapper(                                                     
    (_fpw_module): MT5ForConditionalGeneration(                                                     
      (shared): Embedding(250112, 768)
      (encoder): T5Stack( 
        (embed_tokens): Embedding(250112, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(

...

@shermansiu
Contributor Author

shermansiu commented Aug 2, 2022

Yes, I can confirm that it works!

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 48.4985, 'train_samples_per_second': 32.991, 'train_steps_per_second': 2.062, 'train_loss': 18.418689575195312, 'epoch': 3.12}
100%|██████████████████████| 100/100 [00:48<00:00,  2.06it/s]

I guess I don't need to file a FairScale issue after all!

@shermansiu
Contributor Author

Wait... am I supposed to keep the issue open until the PR is merged?

@shermansiu
Contributor Author

Probably, I suppose.

pacman100 linked a pull request 1 hour ago that will close this issue
