
torch.nn.DataParallel with OperatorModule wrapping 'astra_cuda'-RayTransform malfunctioning #1545

Open
jleuschn opened this issue Mar 17, 2020 · 6 comments

Comments

@jleuschn
Contributor

There seem to be problems when a RayTransform operator using the 'astra_cuda' backend, wrapped in odl.contrib.torch.operator.OperatorModule, is distributed over multiple GPUs using torch.nn.DataParallel.
I don't have a specific error message at the moment, but it has led to kernel panics on several servers.
My guess would be that it is related to the copying performed by DataParallel, which I imagine could result in problems like conflicting shared memory usage or double freeing.
Does someone have more insight into why this is happening or how to make it work?
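For reference, here is a minimal sketch of the kind of setup I mean (the space, geometry and shapes are only illustrative):

```python
import odl
import torch
from odl.contrib.torch import OperatorModule

# Ray transform with the ASTRA CUDA backend (illustrative space/geometry)
space = odl.uniform_discr([-20, -20], [20, 20], (128, 128), dtype='float32')
geometry = odl.tomo.parallel_beam_geometry(space, num_angles=60)
ray_trafo = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

# Wrap the operator as a torch module and distribute it over all visible GPUs
model = torch.nn.DataParallel(OperatorModule(ray_trafo)).cuda()

x = torch.rand(4, 128, 128, device='cuda')  # batch of inputs
y = model(x)  # DataParallel scatters the batch and replicates the module
```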

@kohr-h
Member

kohr-h commented Mar 17, 2020

That's an interesting problem, but hard to diagnose without looking into the implementation of the DataParallel class.

My first guess, though, would be that PyTorch somehow has to know which operations to perform on each GPU, and that requires making copies of (the relevant parts of) the neural net. If that neural net has a RayTransform in it, that can be problematic, because copying an operator is not really supported and may be done in a shallow way.

It may also have to do with the ASTRA-managed memory that the ODL operator handles through ASTRA IDs.

Finally, I'm not sure how ASTRA handles different GPU IDs. I think you can specify which one to use, but that's likely a global setting. Since DataParallel doesn't know that it has to do that, what may happen is that all your Torch CUDA tensors are distributed, but all ASTRA calls still go to the same GPU via CPU memory.
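To illustrate what I mean (assuming I remember the ASTRA Python API correctly), the GPU selection looks like a process-wide switch rather than something per replica:

```python
import astra

# Process-wide setting: subsequent ASTRA CUDA calls use this GPU,
# no matter which device DataParallel assigned to a particular replica.
astra.set_gpu_index(0)
```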

Maybe it's best if you ask the same question on the ASTRA issue tracker.

@kohr-h
Member

kohr-h commented Mar 17, 2020

That said, if you find out anything we could do here, e.g., better support for data parallelism in OperatorModule, please let us know.

@kohr-h
Member

kohr-h commented Mar 17, 2020

Okay, I couldn't help looking myself. PyTorch DataParallel relies on the layer's _replicate_for_data_parallel() method:

https://github.com/pytorch/pytorch/blob/a0b7a39a92bf0fa286ce79a853d591c990befa08/torch/nn/parallel/replicate.py#L120

The default implementation does what one might expect: copy the __dict__, copy parameters, and duplicate some other stuff. We could customize that one in OperatorModule to switch ASTRA to the correct GPU. I'm not sure, though, whether we can truly run ASTRA in multiple threads.
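Roughly, the customization point would look like this (only a sketch; the class name and the body are hypothetical):

```python
from odl.contrib.torch import OperatorModule


class ReplicableOperatorModule(OperatorModule):
    """Hypothetical OperatorModule with an explicit replication hook."""

    def _replicate_for_data_parallel(self):
        # nn.Module's default makes a shallow copy of the module and lets
        # replicate() fill in the per-GPU parameter copies afterwards.
        replica = super()._replicate_for_data_parallel()
        # Here one could prepare the wrapped ASTRA-backed operator for the
        # replica's GPU (e.g. deep-copy it or re-create the ASTRA state).
        return replica
```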

@jhnnslschnr

Thank you very much for looking into this! I will try to make the operator ready for data parallelism this way.

@kohr-h
Member

kohr-h commented Mar 17, 2020

Thanks for that! I hope it will solve the issue.

@jleuschn
Contributor Author

I now think the current state should already work, since there is a mutex lock in the backend: https://github.com/odlgroup/odl/blob/c16033c304cfd7f013b8bcfe4ed4a6b36b9a87ee/odl/tomo/backends/astra_cuda.py

However, I made PR #1546, which would allow the RayTransform to be automatically placed on the same GPU that torch uses for each replica. In some cases this can boost performance slightly, because the load is then spread evenly over all GPUs. Note that all arrays are currently copied to the CPU anyway, so in the end it is not that important where ASTRA runs.
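The basic idea is roughly the following (just a sketch of the mechanism, not the code of the PR; it assumes astra.set_gpu_index controls which GPU ASTRA uses):

```python
import astra
from odl.contrib.torch import OperatorModule


class GPUMatchingOperatorModule(OperatorModule):
    """Hypothetical module that points ASTRA at the replica's GPU."""

    def forward(self, x):
        if x.is_cuda:
            # Each DataParallel replica runs forward() on its own device,
            # so the input's device index identifies the replica's GPU.
            astra.set_gpu_index(x.device.index)
        return super().forward(x)
```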
