Support for multiple GPUs #19

Open
Selozhd opened this issue Jan 17, 2023 · 7 comments
@Selozhd

Selozhd commented Jan 17, 2023

I am planning to run the model on multiple GPUs. However, looking at the way optimize_mesh() is written, it is not immediately clear how to implement it. In nvdiffrec, multi-GPU support used to be implemented through a Trainer class.
Is there any particular reason why you removed it?

@jmunkberg
Collaborator

Hello,

We removed it from the public repo to make the code a bit easier to read and support. You can likely do something similar to the nvdiffrec mGPU setup.

@Selozhd
Author

Selozhd commented Jan 20, 2023

Hello again,

I have a working first implementation, but I am running into some problems with GPU memory. For example, I still get CUDA out-of-memory errors when I increase the batch size by one, despite effectively having 4x the memory. This leads me to suspect that some processes are shared between the GPUs. Maybe you have some insights that could help me here?

I have also noticed that the DMTetGeometry and DLMesh implementations differ between nvdiffrec and nvdiffrecmc (even though they are very similar algorithmically); for example, they no longer inherit from torch's nn.Module. Is there any specific reason for this?
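
For context, a minimal sketch of what registering such a class as an nn.Module could look like, so that DistributedDataParallel can discover and synchronize its parameters. The class and attribute names below are purely illustrative, not the actual nvdiffrecmc implementation:

```python
import torch
import torch.nn as nn

class DMTetGeometryModule(nn.Module):
    """Illustrative sketch only: expose the optimizable SDF and
    vertex-deformation tensors as registered parameters so that
    DistributedDataParallel can find and all-reduce their gradients."""

    def __init__(self, num_verts: int):
        super().__init__()
        # nn.Parameter instead of plain tensors with requires_grad=True,
        # so the values show up in .parameters() / .state_dict().
        self.sdf = nn.Parameter(torch.randn(num_verts))
        self.deform = nn.Parameter(torch.zeros(num_verts, 3))

    def forward(self):
        # The actual marching-tetrahedra extraction and rendering would
        # live here; this sketch just returns the raw parameters.
        return self.sdf, self.deform
```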

@iraj465

iraj465 commented Mar 15, 2023

Hey, were you able to get the multi-GPU setup working?

@Selozhd
Author

Selozhd commented Apr 7, 2023

Yes, partially. I had to do a few hacky things in the data processing to get it to work, but in the end I could process a batch across multiple GPUs.
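
One common way to divide batches across GPUs in the data pipeline (an assumption here, not necessarily what was actually done) is to give each rank its own shard of the dataset via DistributedSampler:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size_per_gpu: int) -> DataLoader:
    # Assumes torch.distributed has already been initialized.
    # Each rank iterates over a disjoint subset of the dataset, so the
    # effective global batch size is batch_size_per_gpu * world_size.
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset,
                      batch_size=batch_size_per_gpu,
                      sampler=sampler,
                      num_workers=4,
                      pin_memory=True)
```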

@iraj465

iraj465 commented Apr 7, 2023

I tried DistributedDataParallel and partially switching to PyTorch Lightning, but I get a segfault even with pretty low-resolution images and small batch sizes. How did you resolve it? It would be nice to discuss further.

@Selozhd
Author

Selozhd commented Apr 12, 2023

> I tried DistributedDataParallel and partially switching to PyTorch Lightning, but I get a segfault even with pretty low-resolution images and small batch sizes. How did you resolve it? It would be nice to discuss further.

I never got a segfault. How are you trying to implement the parallelism? I think you can only expect to divide the batches across the GPUs. Here is briefly what I have done:

  • Rewrote the classes for light, material, dmtet, and dlmesh as `nn.Module` subclasses.
  • Changed some parts of `optimize_mesh()` to handle the training parameters and re-added the `Trainer` class. The old nvdiffrec code is a good reference here.
  • Finally, added the boilerplate for torch to handle the rest; I used torch's `DistributedDataParallel` (a rough sketch is below).
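
A rough sketch of what that DistributedDataParallel boilerplate could look like, assuming the light/material/geometry objects have been bundled into a single nn.Module-style Trainer (function and variable names here are illustrative; the old nvdiffrec Trainer is the actual reference):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(trainer: torch.nn.Module, dataset, batch_size_per_gpu: int, num_iters: int):
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # trainer bundles geometry, material and light as submodules; DDP then
    # averages their gradients across ranks on every backward pass.
    trainer = DDP(trainer.to(local_rank), device_ids=[local_rank],
                  find_unused_parameters=True)

    loader = make_loader(dataset, batch_size_per_gpu)  # e.g. the DistributedSampler sketch above
    optimizer = torch.optim.Adam(trainer.parameters(), lr=1e-3)

    it = 0
    while it < num_iters:
        loader.sampler.set_epoch(it)  # reshuffle the shards each pass
        for batch in loader:
            loss = trainer(batch)     # forward is assumed to return the scalar loss
            optimizer.zero_grad()
            loss.backward()           # gradient all-reduce happens here
            optimizer.step()
            it += 1
            if it >= num_iters:
                break

    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=4 train.py`, so one process is created per GPU.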

@VLadImirluren

@Selozhd Thanks! Could you share the code for reference?
Best wishes!
