[REQUEST] Distributed data parallel training #2

Currently, Betty only supports `torch.nn.DataParallel`. Compared to `torch.nn.parallel.DistributedDataParallel`, `torch.nn.DataParallel` is much slower even in single-machine multi-GPU settings. Therefore, we need to replace `torch.nn.DataParallel` with `torch.nn.parallel.DistributedDataParallel` for better training speed and multi-machine multi-GPU support.

Comments
Can't agree more!
The main issue is that the highly efficient gradient synchronization (all-reduce) of `torch.nn.parallel.DistributedDataParallel` is tied to `loss.backward()`, while Betty computes gradients with `torch.autograd.grad`, which does not trigger it.

An ad-hoc solution is manually performing gradient synchronization; however, this may degrade throughput, as we can't add any tricks like computation-communication overlapping.

Therefore, I may implement this ad-hoc option soon, and once PyTorch supports efficient synchronization for `torch.autograd.grad`, I will adopt it.

Best,
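For concreteness, here is a minimal sketch of what such ad-hoc manual synchronization could look like. It is an illustration, not Betty's implementation: the `sync_gradients` helper and the commented usage are hypothetical, and a process group is assumed to have been initialized already.

```python
import torch
import torch.distributed as dist


def sync_gradients(grads):
    """Average a list of gradient tensors across all processes.

    Hypothetical helper illustrating the ad-hoc approach: gradients returned
    by torch.autograd.grad are all-reduced by hand, so none of DDP's
    bucketing or computation-communication overlap applies. Assumes
    torch.distributed.init_process_group(...) has already been called.
    """
    world_size = dist.get_world_size()
    for grad in grads:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        grad.div_(world_size)
    return grads


# Usage sketch: compute gradients functionally, then synchronize them.
# loss = compute_loss(model, batch)                      # hypothetical
# grads = torch.autograd.grad(loss, model.parameters())
# grads = sync_gradients(list(grads))
```

Because every all-reduce here happens only after all gradients are computed, the communication cannot be hidden behind computation, which is exactly the throughput concern mentioned above.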
I understand the difficulty of implementing this feature, but in my opinion, this is the single most important feature that Betty should have, and one that similar repos still lack (see facebookresearch/higher#116). To make GML/MLO more impactful, it should be applied to large-scale problems in the wild. Distributed training is a must-have for real applications. The ad-hoc option looks good to me; can't wait to try it 👍
Hello @gongbudaizhe, I apologize for the late reply! I finally implemented the distributed training feature for the multi-node/multi-GPU setting. You can install the latest version from source:

```bash
git clone https://github.com/leopard-ai/betty.git
cd betty
pip install .
```

In detail, you can enable distributed training by 1) setting `distributed=True` in `EngineConfig`, and 2) launching the training script with a PyTorch distributed launcher such as `torchrun` (or `python -m torch.distributed.launch`).

In the future, we plan to simplify the launching procedure as in Microsoft's DeepSpeed or HuggingFace's Accelerate. To do this, we need to write custom launcher code. If you are willing to contribute to this, that is very welcome!

I am sorry again for the delay, and let me know if you have any questions!

Best,
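As a rough illustration of the two steps above, here is a minimal sketch of a training script. The `distributed=True` flag, the import paths, and the `Engine` call follow my reading of Betty's documented API and are assumptions; check the current README for the exact names.

```python
# Sketch only: enabling distributed training in a Betty script (assumed API).
from betty.configs import EngineConfig
from betty.engine import Engine

engine_config = EngineConfig(
    train_iters=10000,
    distributed=True,  # assumed flag corresponding to step 1) above
)

# Problems and their dependencies are defined exactly as in the
# non-distributed case; only the engine config changes.
# engine = Engine(config=engine_config, problems=problems, dependencies=dependencies)
# engine.run()

# Step 2): launch with a PyTorch distributed launcher, e.g.
#   torchrun --nproc_per_node=4 train.py
```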