Sharded Multi-GPU MT5 training with the Seq2SeqTrainer fails (4.21.0) #18410

Closed
2 of 4 tasks
shermansiu opened this issue Aug 1, 2022 · 12 comments · Fixed by #18435

Comments

@shermansiu
Contributor

shermansiu commented Aug 1, 2022

System Info

transformers version: 4.21.0
Platform: Linux
Python version: 3.7.6
Huggingface_hub version: 0.8.1
PyTorch version (GPU?): 1.10.2 (Yes)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes (2+ Tesla V100)
Using distributed or parallel set-up in script?: Yes

When trying to fine-tune an MT5ForConditionalGeneration model with a Seq2SeqTrainer on multiple GPUs, I get an INTERNAL ASSERT error. I am running the script using torchrun --nproc_per_node=$NUM_GPUS script.py. The issue only appears when $NUM_GPUS is greater than 1, and only when sharded_ddp=["zero_dp_3"] is passed in the training arguments.

  Traceback (most recent call last):
  File "script.py", line 475, in <module>
    fire.Fire(main)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/miniconda/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "script.py", line 447, in main
    train_model(model, tokenizer, cli_arguments)
  File "script.py", line 357, in train_model
    trainer.train()
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1502, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2488, in training_step
    loss.backward()
  File "/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1640811797118/work/torch/csrc/distributed/c10d/reducer.cpp":328, please report a bug to PyTorch. 

  0%|          | 0/100000 [00:06<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 660 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 662 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 663 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 661) of binary: /miniconda/bin/python
Traceback (most recent call last):
  File "/miniconda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2', 'console_scripts', 'torchrun')())
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
script.py FAILED
------------------------------------------------------------

The issue occurs on transformers[deepspeed]==4.21.0 but not on transformers[deepspeed]==4.20.1. It reproduces with deepspeed==0.6.5 or deepspeed==0.6.7 and fairscale==0.4.6, and the code was run on a Linux machine.

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# The simplified contents of script.py
# Running torchrun --nproc_per_node=1 script.py should work
# Running torchrun --nproc_per_node=4 script.py should fail with a RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED error.

from __future__ import annotations
import functools
import typing as tp
import datasets
import transformers
from transformers import (
    DataCollatorForSeq2Seq,
    PreTrainedTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)


increment_en = [
    {"input": "One", "target": "Two"},
    {"input": "Three", "target": "Four"},
    {"input": "Five", "target": "Six"},
    {"input": "Seven", "target": "Eight"},
    {"input": "Nine", "target": "Ten"},
]
increment_en = increment_en * 100


def lod_to_dol(list_of_dicts: tp.List[tp.Dict[str, tp.Any]]) -> tp.Dict[str, list]:
    dict_of_lists = {
        key: [dct[key] for dct in list_of_dicts] for key in list_of_dicts[0]
    }
    return dict_of_lists


increment_en = lod_to_dol(increment_en)


def preprocess_function_(
    examples,
    tokenizer: PreTrainedTokenizer,
    max_input_length: int,
    max_target_length: int,
):
    inputs = examples["input"]
    targets = examples["target"]

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def main():
    tokenizer = transformers.MT5Tokenizer.from_pretrained("google/mt5-base")
    model = transformers.MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

    args = Seq2SeqTrainingArguments(
        "script_debug",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        fp16=False,
        push_to_hub=False,
        sharded_ddp=["zero_dp_3"],
        max_steps=10000,
        logging_steps=5000,
        save_steps=5000
    )

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

    dataset = datasets.DatasetDict(
        {
            "train": datasets.Dataset.from_dict(increment_en),
            "test": datasets.Dataset.from_dict(increment_en),
        }
    )

    preprocess_function = functools.partial(
        preprocess_function_,
        tokenizer=tokenizer,
        max_input_length=512,
        max_target_length=512
    )

    processed_ds = dataset.map(preprocess_function, batched=True)
    processed_ds.set_format(
        type="torch", columns=["input_ids", "attention_mask", "labels"]
    )

    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=processed_ds["train"],
        eval_dataset=processed_ds["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()

Expected behavior

The training code should not crash.

@shermansiu shermansiu added the bug label Aug 1, 2022
@shermansiu shermansiu changed the title Sharded Multi-GPU MT5 training fails to Sharded Multi-GPU MT5 training fails (4.21.0) Aug 1, 2022
@shermansiu shermansiu changed the title Sharded Multi-GPU MT5 training fails (4.21.0) to Sharded Multi-GPU MT5 training with the Seq2SeqTrainer fails (4.21.0) Aug 1, 2022
@shermansiu
Contributor Author

It still fails when I install transformers directly from the GitHub repository (as of today).

Here's the traceback:

Traceback (most recent call last):
  File "script.py", line 102, in <module>
    main()
  File "script.py", line 98, in main
    trainer.train()
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 1506, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 1744, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/task_runtime/transformers/src/transformers/trainer.py", line 2492, in training_step
    loss.backward()
  File "/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: grad.numel() == bucket_view.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1640811797118/work/torch/csrc/distributed/c10d/reducer.cpp":328, please report a bug to PyTorch. 
  0%|                              | 0/10000 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48181) of binary: /miniconda/bin/python
Traceback (most recent call last):
  File "/miniconda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2', 'console_scripts', 'torchrun')())
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-02_15:26:28
  host      : bolt-imq45r3c3y-8dfzr73qqa.bolt-pods.turi-bolt.svc.int.usmsc39.applecloud.io
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 48182)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-02_15:26:28
  host      : bolt-imq45r3c3y-8dfzr73qqa.bolt-pods.turi-bolt.svc.int.usmsc39.applecloud.io
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 48181)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@shermansiu
Contributor Author

Related issue: https://discuss.pytorch.org/t/multi-gpu-model-parallelism-device-error/117854/9

This issue seems to be related to how DDP is set up, most likely in the Trainer's constructor where the model is wrapped for distributed training (a rough sketch of that wrapping follows).
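
For context, here is a minimal sketch, assuming fairscale==0.4.6 and that torch.distributed has already been initialized by torchrun, of how a model gets wrapped when sharded_ddp=["zero_dp_3"] is set. This is not the Trainer's exact code, just an illustration of the wrapping step:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP


def wrap_for_zero_dp_3(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # ZeRO stage 3: parameters, gradients and optimizer state are sharded across
    # ranks, and full parameters are re-gathered on the fly during forward/backward.
    return FullyShardedDDP(
        model,
        reshard_after_forward=True,   # stage-3 behaviour; stage 2 keeps params gathered after forward
        mixed_precision=False,        # matches fp16=False in the reproduction script
    ).to(device)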

@pacman100
Contributor

pacman100 commented Aug 2, 2022

Hello @shermansiu, I am unable to reproduce the error with the transformers==4.22.0.dev0 main branch and fairscale==0.4.6. Note that sharded_ddp has nothing to do with DeepSpeed. I get another error, and it is unrelated to the integration, so please open the issue with Fairscale and follow it there. The error I see, which is different from yours, is below:

Traceback (most recent call last):                                                                  
  File "script.py", line 109, in <module>                                                           
    main()
  File "script.py", line 103, in main
    trainer.train()
  File "/home/sourab/transformers/src/transformers/trainer.py", line 1502, in train
    return inner_training_loop(
  File "/home/sourab/transformers/src/transformers/trainer.py", line 1744, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/sourab/transformers/src/transformers/trainer.py", line 2492, in training_step
    loss.backward()
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backw
ard
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [582401
280] but expected shape compatible with [291200640]

Also, if you want to leverage Fully Sharded Data Parallelism, you can use the production-focused PyTorch FSDP integration in transformers with the following args:

args = Seq2SeqTrainingArguments(
        "script_debug",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        fp16=False,
-      sharded_ddp=["zero_dp_3"],
+      fsdp=["full_shard", "auto_wrap"],
+      fsdp_transformer_layer_cls_to_wrap="T5Block",
        max_steps=100,
        logging_steps=5000,
        save_steps=5000
    )

which gives the output below:

***** Running training *****
  Num examples = 500
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 100
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"

...


100%|█████████████████████████████████████████████████████████████| 100/100 [00:26<00:00,  3.72it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


FullyShardedDataParallel( 
  (_fsdp_wrapped_module): FlattenParamsWrapper(
    (_fpw_module): MT5ForConditionalGeneration(
      (shared): Embedding(250112, 768)
      (encoder): T5Stack( 
        (embed_tokens): Embedding(250112, 768)
        (block): ModuleList(
          (0): FullyShardedDataParallel(
            (_fsdp_wrapped_module): FlattenParamsWrapper(
              (_fpw_module): T5Block(
                (layer): ModuleList(
                  (0): T5LayerSelfAttention(
                    (SelfAttention): T5Attention(
                      (q): Linear(in_features=768, out_features=768, bias=False)
                      (k): Linear(in_features=768, out_features=768, bias=False)
                      (v): Linear(in_features=768, out_features=768, bias=False)
                      (o): Linear(in_features=768, out_features=768, bias=False)
                      (relative_attention_bias): Embedding(32, 12)
                    )
                    (layer_norm): T5LayerNorm()
                    (dropout): Dropout(p=0.1, inplace=False)
                  )
                  (1): T5LayerFF(
                    (DenseReluDense): T5DenseGatedActDense(

...

On transformers[deepspeed]==4.20.1, I don't see the issue, as you mentioned. I will look into it further this week or next.
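
For reference, here is the suggested change applied in full to the reproduction script's training arguments (a sketch; it assumes the fsdp and fsdp_transformer_layer_cls_to_wrap arguments shown in the diff above exist in the installed transformers version, and PyTorch >= 1.12 as noted further down):

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "script_debug",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False,
    push_to_hub=False,
    fsdp=["full_shard", "auto_wrap"],              # PyTorch FSDP replaces sharded_ddp
    fsdp_transformer_layer_cls_to_wrap="T5Block",  # auto-wrap each T5Block
    max_steps=100,
    logging_steps=5000,
    save_steps=5000,
)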

@shermansiu
Contributor Author

shermansiu commented Aug 2, 2022

Thanks! The weird thing is that changing the fairscale version doesn't affect whether the bug appears.

As you said, I can make the bug appear by running pip install transformers==4.21.0 and make it disappear by running pip install transformers==4.20.1. I'll file a bug report in the FairScale repository anyway.

@shermansiu
Contributor Author

I was able to reproduce your RuntimeError: Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [582401280] but expected shape compatible with [145600320] error by upgrading PyTorch (cudatoolkit=11.3) from 1.10.2 to 1.12.0.

I think it's still the same bug because running torchrun --nproc_per_node=1 script.py with pytorch==1.12.0 works.

After upgrading PyTorch to 1.12.0, I applied your FSDP patch and the code started to work. Thanks!

@shermansiu
Contributor Author

(FSDP is only available for PyTorch versions 1.12 and later)
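
One way to make that constraint explicit in a training script is a small guard like the following (an illustrative sketch, not part of the reproduction script above):

import torch
from packaging import version

if version.parse(torch.__version__) < version.parse("1.12.0"):
    raise RuntimeError("The fsdp training arguments require PyTorch >= 1.12.")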

@pacman100
Contributor

Hello @shermansiu, I found the bug and raised the PR above, which should fix it. Can you try the PR and confirm?

@pacman100
Contributor

(FSDP is only available for PyTorch versions 1.12 and later)

Yes

@pacman100
Contributor

After applying the PR, these are the output logs for sharded_ddp:

100%|█████████████████████████████████████████████████████████████| 100/100 [00:25<00:00,  3.93it/s]
                                                                                                    
Training completed. Do not forget to share your model on huggingface.co/models =)                   
                                                                                                    
                                                                                                    
                                                                                                    
{'train_runtime': 26.4257, 'train_samples_per_second': 30.274, 'train_steps_per_second': 3.784, 'train_loss': 17.26375, 'epoch': 1.59}
FullyShardedDataParallel(                                                                           
  world_size=2, flatten_parameters=True, mixed_precision=False,                                     
  (_fsdp_wrapped_module): FlattenParamsWrapper(                                                     
    (_fpw_module): MT5ForConditionalGeneration(                                                     
      (shared): Embedding(250112, 768)
      (encoder): T5Stack( 
        (embed_tokens): Embedding(250112, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(

...

@shermansiu
Contributor Author

shermansiu commented Aug 2, 2022

Yes, I can confirm that it works!

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 48.4985, 'train_samples_per_second': 32.991, 'train_steps_per_second': 2.062, 'train_loss': 18.418689575195312, 'epoch': 3.12}
100%|██████████████████████| 100/100 [00:48<00:00,  2.06it/s]

I guess I don't need to file a FairScale issue after all!

@shermansiu
Contributor Author

Wait... am I supposed to keep the issue open until the PR is merged?

@shermansiu
Contributor Author

Probably, I suppose.

pacman100 linked a pull request 1 hour ago that will close this issue
