Training stuck with multiple calls of the forward function #46
Comments
You can either add checkpoints/do debug printing in the main forward function: https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L78, or provide a minimal script that reproduces the issue, so that I can take a look.
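A minimal sketch of the kind of debug printing suggested here, done by wrapping the forward of `_SynchronizedBatchNorm` (the class name is taken from the linked batchnorm.py; the wrapping approach itself is just one possible way to add the prints):

```python
import threading
from sync_batchnorm.batchnorm import _SynchronizedBatchNorm

# Keep a reference to the original forward and wrap it with entry/exit prints
# so a hang can be localized to a specific device/thread.
_orig_forward = _SynchronizedBatchNorm.forward

def _traced_forward(self, input):
    dev = input.get_device() if input.is_cuda else -1
    tid = threading.current_thread().name
    print(f'[sync_bn] enter forward on device {dev} (thread {tid})', flush=True)
    out = _orig_forward(self, input)
    print(f'[sync_bn] exit  forward on device {dev} (thread {tid})', flush=True)
    return out

_SynchronizedBatchNorm.forward = _traced_forward
```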
Hi, I am able to run the code after reducing the network size. I tested on two 12GB Titan X cards. If my network is large, let's say 20GB, then the sync BN gets stuck at some point without reporting any error. If I slightly shrink the model to 16GB, then I get the OOM error. Further reducing the model size works well for me. I quickly looked at your source code, but didn't figure out why this happens. Do you have any ideas?
Interesting findings! Can you add additional prints before and after https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L133 and L136 to check whether these are the lines where the code gets stuck? A little background: to make the same module run on multiple GPUs, PyTorch actually duplicates the module N times, each copy running on a separate thread. To make a synchronized batch normalization, we need to add barriers. These barriers require transmitting data among GPUs and thus need additional malloc operations on the GPUs. It is possible that a deadlock is happening: one replica gets stuck in a GPU memory allocation needed for the barrier (because memory is nearly exhausted), while the other replicas are blocked waiting at that barrier, so none of them can make progress.
Unfortunately, there are currently no simple ways to check whether such a deadlock is happening, as we cannot directly inspect PyTorch's memory management from Python.
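One thing that can at least narrow the problem down is logging per-GPU memory right before the suspected sync points (batchnorm.py#L133 / #L136), using standard PyTorch utilities; a small sketch:

```python
import torch

def log_cuda_memory(tag):
    # Report how much memory the caching allocator currently holds on this GPU,
    # plus the peak so far, to see whether the hang coincides with memory running out.
    dev = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(dev) / 1024 ** 3
    peak = torch.cuda.max_memory_allocated(dev) / 1024 ** 3
    print(f'[{tag}] cuda:{dev} allocated={allocated:.2f} GiB peak={peak:.2f} GiB', flush=True)

# e.g. call log_cuda_memory('before barrier') / log_cuda_memory('after barrier')
# around the lines referenced above.
```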
Hi, thanks for your reply. It turns out that it was not stuck in these two lines. My finding is that, for this function: https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/7553990fb9a917cddd9342e89b6dc12a70573f5b/sync_batchnorm/batchnorm.py#L78 the master process executes this function one time fewer than the slave processes, so the slave processes just keep waiting for a message from the master.
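One possible way to verify this call-count claim is to keep a per-device counter in a patched forward (again assuming the `_SynchronizedBatchNorm` class name from the linked file; the counter itself is just an illustrative sketch):

```python
import collections
import threading
from sync_batchnorm.batchnorm import _SynchronizedBatchNorm

_call_counts = collections.Counter()   # device id -> number of forward calls
_lock = threading.Lock()
_orig_forward = _SynchronizedBatchNorm.forward

def _counted_forward(self, input):
    dev = input.get_device() if input.is_cuda else -1
    with _lock:
        _call_counts[dev] += 1
        print(f'[sync_bn] forward #{_call_counts[dev]} on cuda:{dev}', flush=True)
    return _orig_forward(self, input)

_SynchronizedBatchNorm.forward = _counted_forward
```

Comparing the printed counts on cuda:0 (master) against the other devices would show directly whether the master is one call behind.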
This is weird. Why is that? At the very least, the forward function should be called the same number of times on each process (master/slave).
Not sure about it... But it works fine when there is no OOM issue, so I guess the training code is correct. My code is modified from this repo: https://github.com/swabhs/open-sesame
Can you be more specific about how you came to the conclusion that the forward function on the master process gets called one time fewer than on the slave? Any code snippets showing how you modified this repo would be very helpful.
Hi,
Thank you for the great code. I have looked at the related issues, but it turns out they don't help in my case. I have a network using your sync BN. I call the forward pass of the model 4 times and sum over all 4 outputs, and it gets stuck in the last forward call. If I reduce the number of calls to 3, everything works fine. I am sure that I do the same thing on different GPUs.
Besides, if I don't do the sum, my code also works well. It is really weird, so I would like to ask if you have any suggestions? Thanks!
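For reference, a minimal sketch of the pattern described here (not the reporter's actual code; the imports follow the usage shown in this repo's README):

```python
import torch
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

# Small model using the repo's synchronized BN, replicated across two GPUs.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    SynchronizedBatchNorm2d(64),
    nn.ReLU(inplace=True),
).cuda()
model = DataParallelWithCallback(model, device_ids=[0, 1])

x = torch.randn(8, 3, 64, 64, device='cuda')

# Four forward calls whose (scalar-reduced) outputs are summed before backward;
# the hang is reported to occur inside the last of these forward calls.
total = sum(model(x).mean() for _ in range(4))
total.backward()
```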