Multi GPU dataparallel crash #1779
Comments
I'm facing this issue as well. @LysandreJik @thomwolf, can you offer some input on this?
Hey @anandhperumal, thanks for that; it solved my crashing issue.
I tried setting drop_last=True, but it did not fix the issue for me.
Maybe we should add a parameter to Trainer.
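For reference, the drop_last workaround discussed above amounts to the following. This is a minimal sketch; the toy list dataset and batch size are placeholders for illustration, not the actual values from the script:

```python
from torch.utils.data import DataLoader

# Toy dataset: 10 examples with a batch size of 4 would normally yield
# batches of 4, 4, and 2; the uneven final batch is what DataParallel
# fails to gather across GPUs. drop_last=True simply discards it.
data = list(range(10))
loader = DataLoader(data, batch_size=4, drop_last=True)

batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)  # only the two full batches of 4 remain
```

The trade-off is that up to batch_size - 1 examples are skipped each epoch, which is usually negligible for large datasets.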
🐛 Bug
Model I am using (Bert, XLNet....): GPT2
Language I am using the model on (English, Chinese....): English
The problem arises when using:
The task I am working on is:
To Reproduce
Steps to reproduce the behavior:
Traceback (most recent call last):
File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 2, 12, 1024, 64], but expected [2, 3, 12, 1024, 64] (gather at /opt/conda/conda-bld/pytorch_1565272279342/work/torch/csrc/cuda/comm.cpp:226)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f47a13d5e37 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRef&lt;at::Tensor&gt;, long, c10::optional) + 0x3c7 (0x7f4720c61327 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x5fa742 (0x7f47a420f742 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1c8316 (0x7f47a3ddd316 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #14: THPFunction_apply(_object*, _object*) + 0x98f (0x7f47a40024bf in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Expected behavior
It crashes at the last batch every time. I expect it to move on to the next epoch.
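The shapes in the error message are consistent with an uneven final batch: DataParallel gives each GPU a chunk of roughly ceil(batch / num_gpus) examples, so a leftover batch of 5 on 2 GPUs is split 3/2, and the gathered past-state tensors then disagree in their batch dimension ([2, 3, 12, 1024, 64] vs [2, 2, 12, 1024, 64]). A small pure-Python illustration of that chunk arithmetic (an approximation for intuition, not the actual torch scatter code):

```python
import math

def scatter_chunk_sizes(batch_size: int, num_gpus: int) -> list:
    """Approximate how torch.nn.DataParallel splits a batch across devices:
    each GPU receives up to ceil(batch_size / num_gpus) examples until the
    batch is exhausted."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes

# A full batch of 6 on 2 GPUs splits evenly:
print(scatter_chunk_sizes(6, 2))  # [3, 3]
# A leftover final batch of 5 splits unevenly, matching the mismatched
# batch dimensions (3 vs 2) in the traceback's tensor shapes:
print(scatter_chunk_sizes(5, 2))  # [3, 2]
```

When the per-GPU outputs have different batch dimensions, the gather step cannot stack them, which is why the crash only appears on the final, incomplete batch.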
Environment
Additional context