Why not data_parallel? #34
I wonder why you implemented the multi-GPU training using a custom event loop instead of torch.nn.DataParallel. I suppose it is for performance reasons? If so, what is the main bottleneck in data_parallel that prevents you from using it? Do you have an estimate of how big the speedup is compared to the (simpler) DataParallel solution?

Comments
Yes, it's for performance reasons. DataParallel relies on Python threading, which is slow due to the GIL [1][2]. When we initially tried nn.DataParallel, we saw a negative speedup with multiple GPUs (e.g., training on one GPU was faster than training on four GPUs). The custom event loop in fairseq-py uses multiprocessing (i.e., one Process per GPU), which gets around the GIL and gives much better multi-GPU performance. We typically see a ~5.5-6x speedup with 8 GPUs.
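For reference, here is a minimal sketch of the one-process-per-GPU pattern described above. This is not fairseq-py's actual event loop: the toy model, data, and step count are placeholders, and real code would also synchronize gradients and model state between the processes.

```python
import torch
import torch.multiprocessing as mp


def train_worker(rank):
    # Each worker is a separate OS process that owns exactly one GPU, so the
    # GIL never serializes the per-GPU forward/backward work the way it does
    # with nn.DataParallel's threads.
    device = torch.device("cuda", rank)
    model = torch.nn.Linear(1024, 1024).to(device)       # toy stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                                   # toy training loop
        x = torch.randn(32, 1024, device=device)          # stand-in for this worker's data shard
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        # A real trainer would average gradients across the workers here
        # before stepping; coordinating that is what the event loop does.
        optimizer.step()


if __name__ == "__main__":
    mp.set_start_method("spawn")                          # required when worker processes use CUDA
    workers = [mp.Process(target=train_worker, args=(rank,))
               for rank in range(torch.cuda.device_count())]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```

With nn.DataParallel everything stays in one process and the GPUs are driven from Python threads, which is where the GIL contention comes from.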
Okay, I see. Thanks for the prompt reply. Have you tried DistributedDataParallel?
I haven't tried DistributedDataParallel.
It's definitely worked for our use cases, including speech and MT. I think it's ultimately very similar to the implementation you built into fairseq, except that the user must explicitly launch N copies of the script, and each copy should have its own data loader or data loader shard.
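As a rough illustration of that workflow, here is a minimal sketch using the current torch.distributed / DistributedDataParallel API (not necessarily what was available or used at the time); the script name, launcher command, and toy model are assumptions.

```python
# Hypothetical stand-alone script (train_ddp.py); launch one copy per GPU with
# a launcher such as:  torchrun --nproc_per_node=8 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def main():
    # The launcher starts N copies of this script and sets RANK, WORLD_SIZE,
    # and LOCAL_RANK in each copy's environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()            # toy stand-in model
    model = DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                                    # toy training loop
        # Each copy of the script would read its own data loader shard here;
        # random tensors stand in for that shard.
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                                     # DDP all-reduces gradients across copies
        optimizer.step()


if __name__ == "__main__":
    main()
```

The key similarity to the approach above is that each copy is its own process, so there is no GIL contention and only gradients cross process boundaries.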