
Multi GPU dataparallel crash #1779

Closed
devroy73 opened this issue Nov 10, 2019 · 4 comments

devroy73 commented Nov 10, 2019

🐛 Bug

Model I am using (Bert, XLNet....): GPT2

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • the official example scripts: run_lm_finetuning
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: I am finetuning the GPT2 model with a dataset that I have used in the past to fine-tune the BERT model among others

To Reproduce

Steps to reproduce the behavior:

7587/7588 [1:42:01<00:00, 1.26it/s]
Traceback (most recent call last):
File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 2, 12, 1024, 64], but expected [2, 3, 12, 1024, 64] (gather at /opt/conda/conda-bld/pytorch_1565272279342/work/torch/csrc/cuda/comm.cpp:226)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f47a13d5e37 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional) + 0x3c7 (0x7f4720c61327 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x5fa742 (0x7f47a420f742 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1c8316 (0x7f47a3ddd316 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #14: THPFunction_apply(_object*, _object*) + 0x98f (0x7f47a40024bf in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Expected behavior

It crashes at the last batch every time; I expect training to move on to the next epoch.

Environment

  • OS:
  • Python version: 3.6
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU? 4 × Quadro 8000
  • Distributed or parallel setup? Parallel (DataParallel)
  • Any other relevant information:

Additional context

@anandhperumal

I'm facing this issue as well. @LysandreJik @thomwolf, can you give some input on this? All inputs are of the same length, yet the issue still occurs.
@devroy73, in the meantime you can set drop_last=True in the data loader, as in the sketch below.
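The size mismatch in the traceback comes from the final, smaller batch: DataParallel splits it unevenly across the four GPUs, so the per-GPU outputs (here the GPT-2 past key/value tensors) can no longer be gathered into a single tensor of one shape. A minimal sketch of the drop_last workaround with a plain PyTorch DataLoader; the dummy TensorDataset and sizes below are placeholders, not what run_lm_finetuning.py actually builds:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the training set: 10 samples of 1024 token ids,
# which does not divide evenly by a batch size of 4.
inputs = torch.randint(0, 50257, (10, 1024))
dataset = TensorDataset(inputs)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    drop_last=True,  # discard the incomplete final batch so DataParallel
                     # can split every batch evenly across the GPUs
)

for (batch,) in loader:
    print(batch.shape)  # always torch.Size([4, 1024]); the 2-sample remainder is dropped
```

The trade-off is that up to batch_size - 1 samples per epoch are never seen; with 7588 steps per epoch as in the log above, that loss is negligible.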

@devroy73 (Author)

Hey @anandhperumal, thanks for that; it solved my crashing issue.


zbloss commented Jan 23, 2020

I tried setting drop_last=True, but it did not fix the issue for me.

@aohan237

Maybe we should add a drop_last parameter to Trainer, or at least document this workaround for others; a hedged sketch follows.
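For what it's worth, later transformers releases expose a dataloader_drop_last flag on TrainingArguments; whether the version discussed in this issue already has it would need to be checked against its docs, so treat the sketch below as an assumption rather than a confirmed fix:

```python
from transformers import TrainingArguments

# Assumes a transformers version whose TrainingArguments accepts
# dataloader_drop_last; older releases may not have this field.
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",   # hypothetical output path
    per_device_train_batch_size=2,
    num_train_epochs=1,
    dataloader_drop_last=True,       # drop the uneven final batch before DataParallel sees it
)
```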
