Multi-GPU autograd error with Pytorch 0.4 #7092
Comments
Urr. This seems like a bug. Can you try to come up with a minimal working example please? Thanks for reporting! |
The problem is flattening the parameters. If I don't use it, everything works, with a warning suggesting it.
import torch
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.gru = torch.nn.GRU(128, 64, 1, batch_first=True, bidirectional=True)

    def forward(self, inp):
        # calling flatten_parameters() inside forward is what triggers the error
        self.gru.flatten_parameters()
        out, _ = self.gru(inp)
        return out
net = Net()
inp = torch.rand(32, 8, 128, requires_grad=True)
net = torch.nn.DataParallel(net)
inp = inp.cuda()
net = net.cuda()
out = net.forward(inp) |
@erogol Thanks Eren. This is very helpful. We'll look into it. |
@erogol What was the last version of PyTorch you used where the code worked? The code crashes on 0.3.1 for me |
@zou3519 0.3.0 |
Same bug when using the code here: https://github.com/NVIDIA/tacotron2 with PyTorch 0.4 |
I also have the same problem here. Has the bug been fixed? Or do we have any solution now? |
@BangLiu the workaround is to remove the flatten_parameters() call. |
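For reference, a minimal sketch of that workaround applied to the repro above; the only change is dropping the flatten_parameters() call from forward, so you get the contiguous-memory warning mentioned earlier instead of the crash:

```python
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.gru = torch.nn.GRU(128, 64, 1, batch_first=True, bidirectional=True)

    def forward(self, inp):
        # no self.gru.flatten_parameters() here; invoking it inside forward
        # is what breaks autograd under DataParallel in 0.4
        out, _ = self.gru(inp)
        return out

net = torch.nn.DataParallel(Net()).cuda()
inp = torch.rand(32, 8, 128, requires_grad=True).cuda()
out = net(inp)
```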
@colesbury do you think that would make any performance difference, calling it in forward()? |
@erogol yes, calling it in forward is detrimental to performance. If you are using an RNN, it is better to use DistributedDataParallel (and 1 process per GPU) than DataParallel. It has the benefits of being faster, and you don't have to restructure your batching (your code is as if it is just using 1 GPU). See https://pytorch.org/docs/stable/distributed.html#launch-utility |
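To make that suggestion concrete, here is a rough sketch (not an official recipe) of the one-process-per-GPU layout, assuming the script is started with the launch utility so that --local_rank and the env:// variables are set for each process; the GRU sizes just mirror the repro above and everything else is illustrative:

```python
# launched as: python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# one process per GPU: initialize the process group and pin this process to its GPU
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

model = torch.nn.GRU(128, 64, 1, batch_first=True, bidirectional=True).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

# each process runs the usual single-GPU code on its own share of the data
inp = torch.rand(32, 8, 128).cuda()
out, _ = model(inp)
```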
+1. I met the same bug. |
@liqing-ustc see #7092 (comment) for answer |
I met the same problem with PyTorch 0.4.0. I was wondering if this bug has been solved? I found that if I deleted the flatten_parameters() call, the error went away. |
any good solution to this problem? |
only suggested solution so far #7092 (comment) |
Can anyone give tips on how to use DistributedDataParallel? I am trying to use DistributedDataParallel and it gives the error:
|
@eriche2016 I had the same issue and then quit trying. However, I think the forum is the right place to raise this. |
You're not even initializing the distributed module. Use torch.distributed.init_process_group() before constructing DistributedDataParallel. |
Any chance applying distributed to training will be as easy as calling a single function? Folks, btw, here is a good example of applying distributed to PyTorch: https://github.com/pytorch/examples/blob/master/imagenet/main.py It's a bit cumbersome but doable. |
@erogol @PetrochukM @eriche2016 the distributed launcher page cleanly describes (in 4 steps) what you have to do to your code to make it use distributed. https://pytorch.org/docs/stable/distributed.html#launch-utility |
@soumith I read through the tutorial and the launch utility; the distributed API has many options that make it really powerful. Amazing! Following up on this comment: "If you are using RNN, it is better to use DistributedDataParallel (and 1 process per GPU) than using DataParallel." @soumith For a single machine with 2 - 8 GPUs running RNNs, what are the best parameters?
Sorry, I do not have much experience with this, and after reading the article it was not obviously clear! |
Okay... running through this process, here are some things to watch out for:
Got distributed running and the error went away but I would not recommend this approach. This is due to:
Other notes:
|
Hi @PetrochukM, to answer your questions:
|
@ailzhang Thanks for your thorough reply, learned something new :) Responding to a couple of your points:
|
Hi @PetrochukM,
|
Hi!
|
One simple way to only execute particular code on one particular "master" worker is to have a simple if conditioned on the rank of the process.
It's a common style in MPI-style distributed code. |
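A tiny sketch of that pattern, assuming the process group has already been initialized (e.g. via the launcher setup sketched earlier):

```python
import torch.distributed as dist

# run side effects (logging, checkpointing, etc.) only on the rank-0 "master" process
if dist.get_rank() == 0:
    print("this runs once, on the master worker only")
```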
Still have the same bug for multi-GPU with PyTorch 0.4.0. Another problem with DataParallel is that the hidden state of the RNN is batched on the second dimension, while DataParallel scatters and gathers along the first dimension by default. |
@Aspire1Inspire2 could you provide a simple script to repro the RNN bug and expected behavior? |
If you google "pytorch dataparallel rnn" or "pytorch dataparallel lstm", you will find several dozen examples of people complaining that the hidden state is not parallelizable. Here is a simple toy example. To comment one more time: neither dim=1 nor batch_first=True works; it is simply the hidden state that gets messed up by DataParallel. Thank you for your response. |
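Since the original toy code was not preserved in this thread, the following is a hypothetical sketch of the kind of script being described: a module (here called RNNWithState, with made-up sizes) that returns the RNN hidden state, whose batch dimension (dim 1) gets mangled by DataParallel's default dim-0 scatter/gather:

```python
import torch

class RNNWithState(torch.nn.Module):
    def __init__(self):
        super(RNNWithState, self).__init__()
        self.lstm = torch.nn.LSTM(16, 32, num_layers=2, batch_first=True)

    def forward(self, x):
        out, (h, c) = self.lstm(x)
        # out is (batch, seq, hidden) thanks to batch_first=True,
        # but h is always (num_layers, batch, hidden): the batch lives on dim 1
        return out, h

net = torch.nn.DataParallel(RNNWithState()).cuda()
x = torch.rand(8, 10, 16).cuda()   # (batch, seq, feature)
out, h = net(x)
# out is gathered correctly along dim 0: (8, 10, 32)
# h is also gathered along dim 0, so on e.g. 4 GPUs it comes back as
# (2 * 4, 2, 32) instead of (2, 8, 32); the layer and batch dims get mixed up
print(out.shape, h.shape)
```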
Having this problem since we switched to 0.4, and it's been a while now. Any chance we will get a solution soon? Also, is there a performance comparison between models that do flatten_parameters() and those that don't? |
@ALL, same here running MLperf/speech_recognition on a 4x GPU system: |
Same here |
Closing due to age |
After updating to PyTorch 0.4, I am getting the following error when I try to train my model here: https://github.com/mozilla/TTS with multi-GPUs. I have no idea what it means, unfortunately. A bug, or just a problem that I need some feedback on? Thx.