
multiple GPUs do not reduce training time #89

Closed

nicolabertoldi opened this issue Jul 4, 2017 · 4 comments

@nicolabertoldi
I am trying to use multiple GPUs for training, but I am not able to reduce the training time.

I have a machine with 3 GPUs (GeForce GTX 1080), and I train a network (details below).
I tried different numbers of GPUs (1, 2, or 3) and different batch sizes (64, 128, 192, 248).
Here is a table reporting the time per epoch:

batch_size    1 GPU    2 GPUs    3 GPUs
64            43 s     78 s      94 s
128           35 s     51 s      60 s
192           32 s     43 s      50 s
248           30 s     40 s      44 s

I also notice that GPU utilization is quite low when multiple GPUs are used:
with 1 GPU: 80-90% utilization
with 2 GPUs: 45-55% utilization
with 3 GPUs: 35-45% utilization

I am using these settings (gpus and batch_size vary across the experiments):
Namespace(batch_size=128, brnn=False, brnn_merge='concat', context_gate=None, curriculum=False, data='debugging/model.train.pt', dropout=0.3, encoder_type='text', epochs=13, extra_shuffle=False, gpus=[0], input_feed=1, layers=2, learning_rate=1.0, learning_rate_decay=0.5, log_interval=50, max_generator_batches=32, max_grad_norm=5, optim='sgd', param_init=0.1, pre_word_vecs_dec=None, pre_word_vecs_enc=None, rnn_size=500, rnn_type='LSTM', save_model='debugging/model', seed=-1, start_decay_at=8, start_epoch=1, train_from='', train_from_state_dict='', word_vec_size=500)

and this commit 58c8b52

Why doesn't training speed scale with the number of GPUs? If anything, adding GPUs seems to slow training down.

Have you already noticed this behavior? Am I making a mistake somewhere?

Any comment is welcome.

@srush (Contributor) commented Jul 5, 2017

This is bad. We'll look into it.

@jekbradbury

This has long been the case and is basically related to the architecture of nn.DataParallel, which uses Python threads and is limited by both the GIL and CUDA synchronization. So it's fine for networks like typical convnets that have few to no sync points and relatively few kernel launches (each of which is fairly large), but it doesn't work very well for NLP models with lots of tiny kernels.
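
For context, a minimal sketch of the nn.DataParallel pattern being discussed (the LSTM here is a placeholder roughly matching the issue's settings, not the actual OpenNMT-py code):

```python
import torch
import torch.nn as nn

# Placeholder model approximating the issue's settings (rnn_size=500, layers=2).
model = nn.LSTM(input_size=500, hidden_size=500, num_layers=2, batch_first=True)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across device_ids, replicates the
    # module on every GPU, runs the replicas in Python threads, and gathers
    # the outputs back on device_ids[0]. For RNNs, the many small kernel
    # launches per timestep plus GIL contention between those threads is
    # exactly the overhead described above.
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

model = model.cuda()
x = torch.randn(192, 50, 500).cuda()  # (batch, seq_len, features)
output, _ = model(x)
```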

It's also worth trying DistributedDataParallel from torch.distributed, which is already available in master. When running on a single machine, it uses Python multiprocessing rather than threads, so it should at least avoid the GIL thrashing if you can get it to run.
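
A minimal single-machine sketch of that DistributedDataParallel approach, written against the current torch.distributed API (which has changed since the 2017 master this comment refers to), again with a placeholder model:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU: each process owns a single device, so there is no
    # GIL contention between devices as with nn.DataParallel's threads.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model; in practice this would be the real training model.
    model = nn.LSTM(input_size=500, hidden_size=500, num_layers=2,
                    batch_first=True).cuda(rank)
    model = DDP(model, device_ids=[rank])

    x = torch.randn(64, 50, 500, device=f"cuda:{rank}")  # per-process batch
    output, _ = model(x)
    output.sum().backward()  # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```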

@srush (Contributor) commented Jul 5, 2017

Thanks, we'll give this a try, or we'd take a PR, @nicolabertoldi, if you are interested.

@vince62s (Member) commented Aug 2, 2018

Closing: multi-GPU training is now implemented, with roughly a 3x speedup on 4 GPUs.

vince62s closed this as completed Aug 2, 2018