
multiple GPUs do not reduce training time #89

Closed

nicolabertoldi opened this issue Jul 4, 2017 · 4 comments

@nicolabertoldi
I am trying to use multiple GPUs for training, but I am not able to reduce the training time.

I have a machine with 3 GPUs (GeForce GTX 1080), and I train a network (details below).
I tried different numbers of GPUs (1, 2, or 3) and different batch sizes (64, 128, 192, 248).
Here is a table reporting the time per epoch:

batch_size    1 GPU    2 GPUs    3 GPUs
64            43 s     78 s      94 s
128           35 s     51 s      60 s
192           32 s     43 s      50 s
248           30 s     40 s      44 s

I also notice that GPU utilization is quite low when multiple GPUs are used:
with 1 GPU: 80-90% utilization
with 2 GPUs: 45-55% utilization
with 3 GPUs: 35-45% utilization

I am using these settings (gpus and batch_size vary across the experiments):
Namespace(batch_size=128, brnn=False, brnn_merge='concat', context_gate=None, curriculum=False, data='debugging/model.train.pt', dropout=0.3, encoder_type='text', epochs=13, extra_shuffle=False, gpus=[0], input_feed=1, layers=2, learning_rate=1.0, learning_rate_decay=0.5, log_interval=50, max_generator_batches=32, max_grad_norm=5, optim='sgd', param_init=0.1, pre_word_vecs_dec=None, pre_word_vecs_enc=None, rnn_size=500, rnn_type='LSTM', save_model='debugging/model', seed=-1, start_decay_at=8, start_epoch=1, train_from='', train_from_state_dict='', word_vec_size=500)

and this commit 58c8b52

Why doesn't training speed scale with the number of GPUs? If anything, adding GPUs seems to slow training down.

Have you already noticed this behavior? Am I making a mistake somewhere?

Any comment is welcome.

@srush (Contributor) commented Jul 5, 2017

This is bad. We'll look into it.

@jekbradbury

This has long been the case and is basically related to the architecture of nn.DataParallel, which uses Python threads and is limited by both the GIL and CUDA synchronization. So it's fine for networks like typical convnets that have few to no sync points and relatively few kernel launches (each of which is fairly large), but it doesn't work very well for NLP models with lots of tiny kernels.
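
For context, a minimal sketch of the nn.DataParallel pattern being discussed (the LSTM here is a placeholder roughly matching the issue's settings, not the actual OpenNMT-py code):

```python
import torch
import torch.nn as nn

# Placeholder model approximating the issue's settings (rnn_size=500, layers=2).
model = nn.LSTM(input_size=500, hidden_size=500, num_layers=2, batch_first=True)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across device_ids, replicates the
    # module on every GPU, runs the replicas in Python threads, and gathers
    # the outputs back on device_ids[0]. For RNNs, the many small kernel
    # launches per timestep plus GIL contention between those threads is
    # exactly the overhead described above.
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

model = model.cuda()
x = torch.randn(192, 50, 500).cuda()  # (batch, seq_len, features)
output, _ = model(x)
```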

It's also worth trying DistributedDataParallel from torch.distributed, which is already available in master. When running on a single machine, it uses Python multiprocessing rather than threads, so it should at least avoid the GIL thrashing if you can get it to run.
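
A minimal single-machine sketch of that DistributedDataParallel approach, written against the current torch.distributed API (which has changed since the 2017 master this comment refers to), again with a placeholder model:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU: each process owns a single device, so there is no
    # GIL contention between devices as with nn.DataParallel's threads.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model; in practice this would be the real training model.
    model = nn.LSTM(input_size=500, hidden_size=500, num_layers=2,
                    batch_first=True).cuda(rank)
    model = DDP(model, device_ids=[rank])

    x = torch.randn(64, 50, 500, device=f"cuda:{rank}")  # per-process batch
    output, _ = model(x)
    output.sum().backward()  # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```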

@srush (Contributor) commented Jul 5, 2017

Thanks, we'll give this a try, or we'd take a PR, @nicolabertoldi, if you are interested.

@vince62s (Member) commented Aug 2, 2018

Closing: multi-GPU training is now implemented, with roughly a 3x speedup on 4 GPUs.

vince62s closed this as completed Aug 2, 2018