
Multi GPU dataparallel crash #1779

Closed
devroy73 opened this issue Nov 10, 2019 · 4 comments

devroy73 commented Nov 10, 2019

🐛 Bug

Model I am using (Bert, XLNet....): GPT2

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • the official example scripts: run_lm_finetuning
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: I am finetuning the GPT2 model with a dataset that I have used in the past to fine-tune the BERT model among others

To Reproduce

Steps to reproduce the behavior:

7587/7588 [1:42:01<00:00, 1.26it/s]
Traceback (most recent call last):
File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 2, 12, 1024, 64], but expected [2, 3, 12, 1024, 64] (gather at /opt/conda/conda-bld/pytorch_1565272279342/work/torch/csrc/cuda/comm.cpp:226)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f47a13d5e37 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional) + 0x3c7 (0x7f4720c61327 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x5fa742 (0x7f47a420f742 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1c8316 (0x7f47a3ddd316 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #14: THPFunction_apply(_object*, _object*) + 0x98f (0x7f47a40024bf in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Expected behavior

It crashes at the last batch every time; I expect training to move on to the next epoch.

Environment

  • OS:
  • Python version: 3.6
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU? 4 × Quadro 8000
  • Distributed or parallel setup? Parallel (DataParallel)
  • Any other relevant information:

Additional context

@anandhperumal

I'm facing this issue as well. @LysandreJik @thomwolf, can you give some input on this? All inputs are of the same length, yet the issue still occurs.
@devroy73, in the meantime you can set drop_last=True in the data loader, as in the sketch below.
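The size mismatch in the traceback comes from the final, smaller batch: DataParallel splits it unevenly across the four GPUs, so the per-GPU outputs (here the GPT-2 past key/value tensors) can no longer be gathered into a single tensor of one shape. A minimal sketch of the drop_last workaround with a plain PyTorch DataLoader; the dummy TensorDataset and sizes below are placeholders, not what run_lm_finetuning.py actually builds:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the training set: 10 samples of 1024 token ids,
# which does not divide evenly by a batch size of 4.
inputs = torch.randint(0, 50257, (10, 1024))
dataset = TensorDataset(inputs)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    drop_last=True,  # discard the incomplete final batch so DataParallel
                     # can split every batch evenly across the GPUs
)

for (batch,) in loader:
    print(batch.shape)  # always torch.Size([4, 1024]); the 2-sample remainder is dropped
```

The trade-off is that up to batch_size - 1 samples per epoch are never seen; with 7588 steps per epoch as in the log above, that loss is negligible.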

@devroy73 (Author)

Hey @anandhperumal, thanks for that; it solved my crashing issue.


zbloss commented Jan 23, 2020

I tried setting drop_last=True, but it did not fix the issue for me.

@aohan237

Maybe we should add a drop_last parameter to Trainer, or at least document this workaround for others; a hedged sketch follows.
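For what it's worth, later transformers releases expose a dataloader_drop_last flag on TrainingArguments; whether the version discussed in this issue already has it would need to be checked against its docs, so treat the sketch below as an assumption rather than a confirmed fix:

```python
from transformers import TrainingArguments

# Assumes a transformers version whose TrainingArguments accepts
# dataloader_drop_last; older releases may not have this field.
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",   # hypothetical output path
    per_device_train_batch_size=2,
    num_train_epochs=1,
    dataloader_drop_last=True,       # drop the uneven final batch before DataParallel sees it
)
```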
