
torch.nn.DataParallel with OperatorModule wrapping 'astra_cuda'-RayTransform malfunctioning #1545

Open
jleuschn opened this issue Mar 17, 2020 · 6 comments

Comments

@jleuschn
Contributor

There seem to be problems when a RayTransform operator using the 'astra_cuda' backend, wrapped in odl.contrib.torch.operator.OperatorModule, is distributed over multiple GPUs using torch.nn.DataParallel.
I don't have a specific error message at the moment, but it has led to kernel panics on several servers.
My guess would be that it is related to the copying performed by DataParallel, which I imagine could result in problems like conflicting shared memory usage or double freeing.
Does someone have more insight into why this is happening or how to make it work?
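For reference, here is a minimal sketch of the kind of setup I mean (the space, geometry and shapes are only illustrative):

```python
import odl
import torch
from odl.contrib.torch import OperatorModule

# Ray transform with the ASTRA CUDA backend (illustrative space/geometry)
space = odl.uniform_discr([-20, -20], [20, 20], (128, 128), dtype='float32')
geometry = odl.tomo.parallel_beam_geometry(space, num_angles=60)
ray_trafo = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

# Wrap the operator as a torch module and distribute it over all visible GPUs
model = torch.nn.DataParallel(OperatorModule(ray_trafo)).cuda()

x = torch.rand(4, 128, 128, device='cuda')  # batch of inputs
y = model(x)  # DataParallel scatters the batch and replicates the module
```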

@kohr-h
Member

kohr-h commented Mar 17, 2020

That's an interesting problem, but hard to diagnose without looking into the implementation of the DataParallel class.

My first guess, though, would be that PyTorch somehow has to know which operations to perform on each GPU, and that requires making copies of (the relevant parts of) the neural net. If that neural net has a RayTransform in it, that can be problematic, because copying an operator is not really supported and may be done in a shallow way.

It may also have to do with the ASTRA-managed memory that the ODL operator handles through ASTRA IDs.

Finally, I'm not sure how ASTRA handles different GPU IDs. I think you can specify which one to use, but that's likely a global setting. Since DataParallel doesn't know that it has to do that, what may happen is that all your Torch CUDA tensors are distributed, but all ASTRA calls still go to the same GPU via CPU memory.
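To illustrate what I mean (assuming I remember the ASTRA Python API correctly), the GPU selection looks like a process-wide switch rather than something per replica:

```python
import astra

# Process-wide setting: subsequent ASTRA CUDA calls use this GPU,
# no matter which device DataParallel assigned to a particular replica.
astra.set_gpu_index(0)
```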

Maybe it's best if you ask the same question on the ASTRA issue tracker.

@kohr-h
Member

kohr-h commented Mar 17, 2020

That said, if you find out anything we could do here, e.g., better support for data parallelism in OperatorModule, please let us know.

@kohr-h
Member

kohr-h commented Mar 17, 2020

Okay, I couldn't help looking myself. PyTorch DataParallel relies on the layer's _replicate_for_data_parallel() method:

https://github.com/pytorch/pytorch/blob/a0b7a39a92bf0fa286ce79a853d591c990befa08/torch/nn/parallel/replicate.py#L120

The default implementation does what one might expect: copy the __dict__, copy parameters, and duplicate some other stuff. We could customize that one in OperatorModule to switch ASTRA to the correct GPU. I'm not sure, though, whether we can truly run ASTRA in multiple threads.
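Roughly, the customization point would look like this (only a sketch; the class name and the body are hypothetical):

```python
from odl.contrib.torch import OperatorModule


class ReplicableOperatorModule(OperatorModule):
    """Hypothetical OperatorModule with an explicit replication hook."""

    def _replicate_for_data_parallel(self):
        # nn.Module's default makes a shallow copy of the module and lets
        # replicate() fill in the per-GPU parameter copies afterwards.
        replica = super()._replicate_for_data_parallel()
        # Here one could prepare the wrapped ASTRA-backed operator for the
        # replica's GPU (e.g. deep-copy it or re-create the ASTRA state).
        return replica
```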

@jhnnslschnr

Thank you very much for looking into this! I will try to make the operator ready for data parallelism this way.

@kohr-h
Member

kohr-h commented Mar 17, 2020

Thanks for that! I hope it will solve the issue.

@jleuschn
Contributor Author

I now think the current state should already work, since there is a mutex lock in the backend: https://github.com/odlgroup/odl/blob/c16033c304cfd7f013b8bcfe4ed4a6b36b9a87ee/odl/tomo/backends/astra_cuda.py

However, I made PR #1546, which would allow the RayTransform to be automatically placed on the same GPU that torch uses for each replica. In some cases this can boost performance slightly, because the load is then spread evenly over all GPUs. Note that all arrays are currently copied to the CPU anyway, so in the end it is not that important where ASTRA runs.
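The basic idea is roughly the following (just a sketch of the mechanism, not the code of the PR; it assumes astra.set_gpu_index controls which GPU ASTRA uses):

```python
import astra
from odl.contrib.torch import OperatorModule


class GPUMatchingOperatorModule(OperatorModule):
    """Hypothetical module that points ASTRA at the replica's GPU."""

    def forward(self, x):
        if x.is_cuda:
            # Each DataParallel replica runs forward() on its own device,
            # so the input's device index identifies the replica's GPU.
            astra.set_gpu_index(x.device.index)
        return super().forward(x)
```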
