torch.nn.DataParallel with OperatorModule wrapping 'astra_cuda'-RayTransform malfunctioning #1545
Comments
That's an interesting problem, but hard to diagnose without looking into the implementation details. My first guess, though, would be that PyTorch somehow has to know which operations to perform on each GPU, and that requires making copies of the (relevant parts of the) neural net. If that neural net has a reference to the ODL operator, that operator gets copied along with it. It may also have to do with the ASTRA-managed memory that the ODL operator handles through ASTRA IDs. Finally, I'm not sure how ASTRA handles different GPU IDs. I think you can specify which one to use, but that's likely a global setting. Since `DataParallel` uses several GPUs at once, that could clash. Maybe it's best if you ask the same question on the ASTRA issue tracker.
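For context, a minimal sketch of what a process-global GPU setting on the ASTRA side would look like; this assumes the standard `astra` Python bindings and is not taken from the thread:

```python
# Sketch only: as far as I understand, astra.set_gpu_index sets a
# process-wide default GPU for subsequent CUDA algorithms rather than a
# per-call choice, so two DataParallel replica threads cannot simply pick
# different GPUs through this setting without extra coordination.
import astra

astra.set_gpu_index(1)  # following astra_cuda work defaults to GPU 1
```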
That said, if you find out anything we could do here, e.g., to better support data parallelism in `OperatorModule`, please let us know.
Okay, I couldn't help looking myself. PyTorch replicates the module for each GPU (see `torch.nn.parallel.replicate`, which `DataParallel` uses internally). The default implementation does what one might expect: copy the module, one replica per GPU.
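To illustrate, here is a toy sketch (not ODL or PyTorch internals; `OperatorLayer` and `DummyOperator` are invented for the example, and it assumes `OperatorModule` stores the ODL operator as a plain Python attribute) that checks what happens to such an attribute when a module is replicated the way `DataParallel` does it:

```python
import torch
import torch.nn as nn


class DummyOperator:
    """Hypothetical stand-in for an ODL RayTransform."""
    def __call__(self, x):
        return x


class OperatorLayer(nn.Module):
    """Toy module that stores a wrapped operator as a plain attribute."""
    def __init__(self, operator):
        super().__init__()
        self.operator = operator  # not a Parameter, buffer or sub-Module

    def forward(self, x):
        return self.operator(x)


if torch.cuda.device_count() >= 2:
    layer = OperatorLayer(DummyOperator()).cuda()
    # replicate() is what DataParallel uses internally to build per-GPU copies.
    replicas = torch.nn.parallel.replicate(layer, devices=[0, 1])
    # Plain attributes seem to be copied shallowly, so both replicas
    # reference the *same* operator object and call it from parallel
    # threads during the DataParallel forward pass.
    print(replicas[0].operator is replicas[1].operator)  # expected: True
```

If that is the case, all replicas call into one and the same ASTRA-backed operator from parallel worker threads, which is where thread safety starts to matter.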
Thank you very much for looking into this! I will try to make the operator ready for data parallel this way.
Thanks for that! I hope it will solve the issue.
I now think the current status should be working, since there already is a mutex lock: https://github.com/odlgroup/odl/blob/c16033c304cfd7f013b8bcfe4ed4a6b36b9a87ee/odl/tomo/backends/astra_cuda.py However, I made PR #1546, which would allow auto-distributing the `RayTransform` onto the same GPU used by torch for each replica. This can in some cases boost performance slightly, because all GPUs are then loaded equally. Note that all arrays are copied to the CPU anyway at the moment, so it is not that important where ASTRA runs after all.
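For completeness, a sketch of the serializing-lock idea in user code (a hypothetical wrapper, not ODL's implementation; the real mutex lives inside ODL's `astra_cuda` backend as linked above):

```python
import threading

import torch.nn as nn


class LockedOperatorModule(nn.Module):
    """Hypothetical wrapper that serializes calls into a wrapped module.

    Because DataParallel replicas share plain Python attributes, the single
    lock instance is (presumably) shared by all replicas, so only one
    replica at a time enters the wrapped operator.
    """

    def __init__(self, op_module):
        super().__init__()
        self.op_module = op_module          # e.g. an OperatorModule instance
        self._call_lock = threading.Lock()  # shared across replicas

    def forward(self, x):
        with self._call_lock:
            return self.op_module(x)
```

Serializing the calls gives up some parallelism, which is why running each replica's `RayTransform` on that replica's own GPU, as in PR #1546, can be slightly faster.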
There seem to be problems when using the `'astra_cuda'` backend wrapped by a `RayTransform` operator, again wrapped by `odl.contrib.torch.operator.OperatorModule`, when trying to distribute the model on multiple GPUs using `torch.nn.DataParallel`. I don't have a specific error message at the moment, but it led to some kernel panics on different servers.

My guess would be that it is related to the copying performed by `DataParallel`, which I imagine could result in problems like conflicting shared memory usage or double freeing. Does someone have more knowledge on why this is happening or how to make it work?