Support for multiple GPUs #19

Open
Selozhd opened this issue Jan 17, 2023 · 7 comments
@Selozhd

Selozhd commented Jan 17, 2023

I am planning to run the model on multiple GPUs. However, looking at the way optimize_mesh() is written, it is not immediately clear how to implement it. In nvdiffrec, multi-GPU support used to be implemented through a Trainer class.
Is there any particular reason why you removed it?

@jmunkberg
Collaborator

Hello,

We removed it from the public repo to make the code a bit easier to read and support. You can likely do something similar to the nvdiffrec mGPU setup.

@Selozhd
Author

Selozhd commented Jan 20, 2023

Hello again,

I have a working first implementation, but I am running into some problems with GPU memory. For example, I still get CUDA out-of-memory errors when I increase the batch size by one, despite effectively having 4x the memory. This leads me to suspect that some processes are shared between the GPUs. Maybe you have some insights that could help me here?

I have also noticed that the DMTetGeometry and DLMesh implementations differ between nvdiffrec and nvdiffrecmc (even though they are very similar algorithmically); for example, they no longer inherit from torch's nn.Module. Is there any specific reason for this?
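
For context, a minimal sketch of what registering such a class as an nn.Module could look like, so that DistributedDataParallel can discover and synchronize its parameters. The class and attribute names below are purely illustrative, not the actual nvdiffrecmc implementation:

```python
import torch
import torch.nn as nn

class DMTetGeometryModule(nn.Module):
    """Illustrative sketch only: expose the optimizable SDF and
    vertex-deformation tensors as registered parameters so that
    DistributedDataParallel can find and all-reduce their gradients."""

    def __init__(self, num_verts: int):
        super().__init__()
        # nn.Parameter instead of plain tensors with requires_grad=True,
        # so the values show up in .parameters() / .state_dict().
        self.sdf = nn.Parameter(torch.randn(num_verts))
        self.deform = nn.Parameter(torch.zeros(num_verts, 3))

    def forward(self):
        # The actual marching-tetrahedra extraction and rendering would
        # live here; this sketch just returns the raw parameters.
        return self.sdf, self.deform
```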

@iraj465

iraj465 commented Mar 15, 2023

Hey, were you able to get the multi-GPU setup working?

@Selozhd
Author

Selozhd commented Apr 7, 2023

Yes, partially. I had to do a few hacky things in the data processing to get it to work, but in the end I could process a batch across multiple GPUs.
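
One common way to divide batches across GPUs in the data pipeline (an assumption here, not necessarily what was actually done) is to give each rank its own shard of the dataset via DistributedSampler:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size_per_gpu: int) -> DataLoader:
    # Assumes torch.distributed has already been initialized.
    # Each rank iterates over a disjoint subset of the dataset, so the
    # effective global batch size is batch_size_per_gpu * world_size.
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset,
                      batch_size=batch_size_per_gpu,
                      sampler=sampler,
                      num_workers=4,
                      pin_memory=True)
```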

@iraj465

iraj465 commented Apr 7, 2023

I tried DistributedDataParallel and partially switching to PyTorch Lightning, but I get a segfault even with pretty low-resolution images and small batch sizes. How did you resolve it? It would be nice to discuss further.

@Selozhd
Author

Selozhd commented Apr 12, 2023

> I tried DistributedDataParallel and partially switching to PyTorch Lightning, but I get a segfault even with pretty low-resolution images and small batch sizes. How did you resolve it? It would be nice to discuss further.

I never got a segfault. How are you trying to implement the parallelism? I think you can only expect to divide the batches across the GPUs. Here is briefly what I have done:

  • Rewrote the classes for light, material, dmtet, and dlmesh as `nn.Module` subclasses.
  • Changed some parts of `optimize_mesh()` to handle the training parameters and re-added the `Trainer` class. The old nvdiffrec code is a good reference here.
  • Finally, added the boilerplate for torch to handle the rest; I used torch's `DistributedDataParallel` (a rough sketch is below).
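
A rough sketch of what that DistributedDataParallel boilerplate could look like, assuming the light/material/geometry objects have been bundled into a single nn.Module-style Trainer (function and variable names here are illustrative; the old nvdiffrec Trainer is the actual reference):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(trainer: torch.nn.Module, dataset, batch_size_per_gpu: int, num_iters: int):
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # trainer bundles geometry, material and light as submodules; DDP then
    # averages their gradients across ranks on every backward pass.
    trainer = DDP(trainer.to(local_rank), device_ids=[local_rank],
                  find_unused_parameters=True)

    loader = make_loader(dataset, batch_size_per_gpu)  # e.g. the DistributedSampler sketch above
    optimizer = torch.optim.Adam(trainer.parameters(), lr=1e-3)

    it = 0
    while it < num_iters:
        loader.sampler.set_epoch(it)  # reshuffle the shards each pass
        for batch in loader:
            loss = trainer(batch)     # forward is assumed to return the scalar loss
            optimizer.zero_grad()
            loss.backward()           # gradient all-reduce happens here
            optimizer.step()
            it += 1
            if it >= num_iters:
                break

    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=4 train.py`, so one process is created per GPU.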

@VLadImirluren

@Selozhd Thanks! Could you share the code for reference?
Best wishes!
