Multi-GPU training #9

alexdemartos · 2020-06-11T19:32:46Z

Hi, thanks for sharing your work!

Do you have any idea how to get multi-GPU training working? I looked at how it is implemented in fatchord's original repo, but it doesn't seem to work well:

    # Parallelize model onto GPUS using workaround due to python bug
    if device.type == 'cuda' and torch.cuda.device_count() > 1:
        m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m)
    else:
        m1_hat, m2_hat, attention = model(x, m)

Thanks in advance!

cschaefer26 · 2020-06-12T06:26:59Z

Hi, I believe the multi-GPU issue has been fixed since torch==1.4.0, but I am not 100% sure (see pytorch/pytorch#15716). If you want to try multi-GPU training, you could try the nn.DataParallel wrapper for the model. I have not bothered with multi-GPU training yet, as training time is much shorter than with the original Tacotron. Keep me posted if you try it!

alexdemartos · 2020-06-12T09:45:21Z

Hi,

I managed to get multi-GPU training working using nn.DataParallel, but I must be missing something. Performance dropped from ~4.0 steps/s on a single GPU to 0.46 steps/s on 2 GPUs (same batch size, just split in 2).

I basically added this to train_tacotron.py:

    if device.type == 'cuda' and torch.cuda.device_count() > 1:
        print("Using", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.DataParallel(model)
        model.get_step = model.module.get_step
        model.reset_step = model.module.reset_step
        model.log = model.module.log
        model.load = model.module.load
        model.save = model.module.save
        model.num_params = model.module.num_params

It also uses significantly more GPU memory (especially on GPU 1).
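For reference, the method-by-method copying above can be done generically. This is a minimal sketch, assuming the wrapped model exposes custom methods such as get_step. The names DelegatingDataParallel and TinyModel are hypothetical and not part of this repo. The subclass forwards any attribute it cannot find on the wrapper to the underlying module. It only tidies the wrapper and does not address the slowdown. nn.DataParallel replicates the model and scatters and gathers tensors on every forward pass, collecting outputs on one device, which may explain part of the overhead and the uneven memory use. The PyTorch documentation recommends DistributedDataParallel over DataParallel.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the Tacotron model; only get_step mirrors the thread."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        # Step counter kept as a buffer (an assumption made for this sketch).
        self.register_buffer("step", torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        return self.linear(x)

    def get_step(self):
        return int(self.step.item())


class DelegatingDataParallel(nn.DataParallel):
    """DataParallel that falls back to the wrapped module for unknown attributes."""

    def __getattr__(self, name):
        try:
            # Parameters, buffers and submodules (including 'module') live here.
            return super().__getattr__(name)
        except AttributeError:
            if name == "module":
                # Avoid infinite recursion if the wrapper is not initialised yet.
                raise
            return getattr(self.module, name)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DelegatingDataParallel(TinyModel().to(device))

out = model(torch.randn(8, 4, device=device))  # batch is split across GPUs if present
step = model.get_step()                        # resolved on model.module
```

With this wrapper, calls such as model.get_step() or model.save(...) reach the wrapped model without assigning each method by hand.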

alexdemartos · 2020-06-12T17:40:24Z

If my implementation is correct, I think multi-GPU training introduces a big overhead. Unfortunately, I have no further leads. I have gone for gradient accumulation instead, which allows larger effective batch sizes on small GPUs.

Thanks for your help :)
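The gradient accumulation mentioned above can be sketched as follows. This is a minimal illustration under assumptions, not the code used in this repo: the toy nn.Linear model, the MSELoss, and the function name train_with_accumulation are all hypothetical. Each micro-batch loss is divided by accum_steps so that the summed gradients match the average over one larger batch, and the optimizer steps once every accum_steps micro-batches.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, batches, accum_steps):
    """Run one pass over `batches`, updating weights every `accum_steps` micro-batches."""
    loss_fn = nn.MSELoss()
    optimizer.zero_grad()
    updates = 0
    for i, (x, y) in enumerate(batches, start=1):
        # Scale the loss so the accumulated gradient is an average, not a sum.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()  # gradients add up in .grad across micro-batches
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            updates += 1
    return updates


torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 8 micro-batches of 8 samples each (random data, for illustration only).
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]
num_updates = train_with_accumulation(model, optimizer, batches, accum_steps=4)
```

With 8 micro-batches of 8 samples and accum_steps=4, this performs 2 optimizer updates, each with an effective batch size of 32.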

alexdemartos closed this as completed Jun 12, 2020

prajwaljpj mentioned this issue Aug 27, 2020

non-empty TensorList? #26

Open

ghost mentioned this issue Mar 2, 2021

Can't train on two GPU's CorentinJ/Real-Time-Voice-Cloning#664

Open
