MultiGPU Support #211
I've also seen the same behavior and am unsure why this is the case; hopefully I'll get more time soon to try to figure this out...
Relevant: OpenNMT/OpenNMT-py#89 (comment). Should probably experiment with DistributedDataParallel.
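For anyone new to the distinction being suggested here, below is a rough sketch (not the repo's actual training code) of the two wrappers; the model, init_method, rank, and world_size values are placeholders:

```python
# Minimal sketch contrasting the two multi-GPU wrappers in PyTorch.
import torch
import torch.nn as nn
import torch.distributed as dist

model = nn.Linear(161, 29).cuda()  # stand-in for the real acoustic model

# Option 1: single process, one worker thread per GPU (subject to GIL contention).
dp_model = nn.DataParallel(model)

# Option 2: one process per GPU; each process joins a process group first.
dist.init_process_group(backend='nccl',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=1, rank=0)
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[0])
```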
@ryanleary this is concerning... I'll try to mimic fairseq's implementation and then do benchmark runs. I'm not sure why the hell DataParallel uses threads... @alugupta I noticed you made a post here some time ago, where you say:
Isn't that an acceptable speed increase? Doubling the batch size kept the speeds consistent when using multiple GPUs?
No, not unless he ran for double the number of epochs in the latter case.
Oh yeah, my bad, I interpreted that wrong. Thanks @ryanleary
Right, what @ryanleary said :) I didn't double the number of epochs in the latter case, so there was effectively no speedup (reduction in training time). I've only tried with 2 GPUs so far; perhaps it scales once you get to 4 or 8 GPUs. Will try to give this a spin soon!
I suspect that if it's the same speed with 2, it'll be as slow or worse (because of GIL contention) with 4 or more.
I'm not seeing this issue anymore using PyTorch 0.3 and CUDA 9 on a G3 instance from AWS with 2 GPUs. AN4 epoch times:
Will try scaling up further and check if the benefits disappear.
So upon further benchmarking, I'm sure the results you got were due to AN4 being very small. On larger datasets I do see scaling, albeit not as fast as I'd like. I'm going to keep this ticket open because I want to provide benchmarks around V100s using NCCL2, etc.
Hi! Thanks for this. I'll try to run some experiments on the larger datasets and see if I can see some scaling. Thanks!
Did some benchmarking on librispeech_clean_100 (100 hours of LibriSpeech), using the single-GPU epoch time as the baseline to compare 2/4/8 GPU times. I used PyTorch 0.3 with CUDA 9.1. Below are the graphs using data parallel, and then distributed data parallel. From this it's clear that to get the speedup on p3.16xlarge instances (V100 cards) we need to use distributed data parallel. Any more thoughts on this, please let me know! If someone knows of a nice way to launch N copies of the training script automatically, please let me know, since this is needed for distributed PyTorch to work.
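One simple way to launch N copies of a training script (a sketch of the general idea, not the script that later landed in the repo; train.py, --rank, and --world-size are assumed names) is a small wrapper that spawns one subprocess per GPU:

```python
# Hypothetical launcher: spawn one copy of the training script per visible GPU.
import subprocess
import sys
import torch

def main():
    world_size = torch.cuda.device_count()
    procs = []
    for rank in range(world_size):
        cmd = [sys.executable, 'train.py',
               '--rank', str(rank),
               '--world-size', str(world_size)] + sys.argv[1:]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()  # block until every training process has finished

if __name__ == '__main__':
    main()
```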
It's a known fact; NVIDIA has already published numbers similar to those (although theirs are worse, likely because they used 0.2 and a lot has been improved since then). This is a nice starting point (also courtesy of NVIDIA) for a script that lets you start multiple DDP processes quite easily. We're planning on integrating a similar version into mainline PyTorch.
@SeanNaren I'd like to share the 8x P100 scalability data with the default model on librispeech_clean_100, and I got the following result:
From this performance data, we get about 51.3% scalability with P100, which matches your p3.16xlarge result.
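Reading "scalability" here as parallel efficiency (measured speedup divided by ideal speedup) is an assumption on my part; the original epoch times aren't reproduced above, so the snippet below only shows the formula, not the measurements:

```python
# Parallel efficiency: how close the measured speedup gets to the ideal N-times speedup.
def parallel_efficiency(t_1gpu, t_ngpu, n_gpus):
    speedup = t_1gpu / t_ngpu
    return speedup / n_gpus

# 51.3% efficiency on 8 GPUs corresponds to roughly a 0.513 * 8 ≈ 4.1x speedup over 1 GPU.
```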
Hey @xhzhao, there is a branch called distributed; using the multiproc.py script you can scale training onto all GPUs, with a separate process per GPU. Currently away from my PC for a few days; once I'm back I can give better instructions!
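Roughly speaking, each process started by a multiproc-style launcher sets up its own GPU and process group before training. The sketch below shows the general pattern only; the argument names, TCP address, and stand-in model are assumptions, not the distributed branch's exact interface:

```python
# Per-process setup that a multiproc-style launcher relies on (generic sketch).
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--rank', type=int, default=0)
parser.add_argument('--world-size', type=int, default=1)
args = parser.parse_args()

torch.cuda.set_device(args.rank)  # one GPU per process
dist.init_process_group(backend='nccl',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=args.world_size, rank=args.rank)

model = torch.nn.Linear(161, 29).cuda()  # stand-in for the real acoustic model
model = DistributedDataParallel(model, device_ids=[args.rank])
# ...usual training loop; gradients are all-reduced across processes automatically.
```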
I've just merged a branch using the distributed wrapper for multi-GPU. Not sure if you're still using the package, but @alugupta it would be nice for you to retry! Again, AN4 is a small dataset; I would suggest something like LibriSpeech for a nice comparison.
Hi,
I was wondering if anyone had tried using multiple GPUs with the DeepSpeech models and what their experience was. Currently I am seeing that there is little difference in training time between using 1 or 2 GPUs (maybe a 10% improvement, if that). When running nvidia-smi I can see multiple GPUs being used, so that is not the problem (DataParallel handles this automatically).
Is there something I should look out for in terms of multi-GPU training? I did increase the batch size when running on multiple GPUs so that the utilization for each GPU is comparable to using 1 GPU in isolation.
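For context, the setup described above is roughly the standard nn.DataParallel pattern with the loader's batch size scaled by the number of GPUs, so each card sees a comparable per-GPU batch. This is a generic sketch under that assumption, not this repo's train.py:

```python
# Generic nn.DataParallel sketch with the batch size scaled by GPU count.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

n_gpus = max(torch.cuda.device_count(), 1)
per_gpu_batch = 20                                   # illustrative value
dataset = TensorDataset(torch.randn(640, 161))       # stand-in for real features
loader = DataLoader(dataset, batch_size=per_gpu_batch * n_gpus, shuffle=True)

model = nn.DataParallel(nn.Linear(161, 29).cuda())   # stand-in for the acoustic model

for (features,) in loader:
    out = model(features.cuda())  # DataParallel scatters the batch across GPUs along dim 0
    break
```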
Thanks!
Udit