Does this support torch.nn.parallel.DistributedDataParallel? #1

Open
acgtyrant opened this issue Apr 10, 2018 · 18 comments
Labels
enhancement New feature or request

Comments

@acgtyrant
Contributor

No description provided.

@vacancy
Owner

vacancy commented Apr 10, 2018

Currently not.

The implementation is designed for multi-GPU BatchNorm, which is commonly used for Computer Vision tasks. Thus, it uses NCCL for multi-GPU broadcasting and reduction.

A distributed version needs other synchronization primitives (e.g., via shared memory or cross-machine synchronization).

Contributions will be highly appreciated!

@vacancy added the enhancement and help wanted labels on Apr 10, 2018

@zhanghang1989

zhanghang1989 commented Apr 13, 2018

Just to let you know, a PyTorch-compatible Synchronized Batch Norm is provided here: http://hangzh.com/PyTorch-Encoding/index.html
See the example here.

@acgtyrant
Contributor Author

@zhanghang1989 Does this support torch.nn.parallel.DistributedDataParallel?

@acgtyrant
Contributor Author

@zhanghang1989 Excuse me, I only see that you use DataParallel, not DistributedDataParallel. If you are sure it works, I will try it with DistributedDataParallel myself later.

@acgtyrant
Contributor Author

BTW, the Python notebook link returns a 404.

@vacancy
Owner

vacancy commented Apr 13, 2018

@zhanghang1989 Hi Hang. Thanks for the introduction. This repo aims to provide a standalone and easy-to-use version of sync_bn that can be easily integrated into any existing framework. Its implementation also differs from your previous implementation.

For an example use of the sync_bn, please check:
https://github.com/CSAILVision/semantic-segmentation-pytorch

As for the DistributedDataParallel, currently, I have no plan for supporting it. @zhanghang1989 Have you tested your implementation in the distributed setting?

@zhanghang1989

Thanks @vacancy for the introduction! Nice work. I plan to support distributed training, since a recent paper uses it for object detection.
@acgtyrant The only thing that needs to be considered for distributed training is making sure the number of GPUs is set correctly: https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/syncbn.py#L196

@vacancy
Owner

vacancy commented Apr 13, 2018

@zhanghang1989 I am not an expert on this, but it seems to me that DistributedDataParallel uses a different implementation from DataParallel for broadcasting and reduction. In a single-process, multi-threaded setting (DataParallel), one can simply use NCCL's broadcast and reduce. For tensors shared across multiple processes or even multiple machines, we need a special implementation, which is defined in the torch.distributed package.

@zhanghang1989

@vacancy Thanks for the information.

@acgtyrant
Contributor Author

I read the source code of DistributedDataParallel and found that, unlike DataParallel, it does not broadcast the parameters. It only all-reduces the gradients, so that all model replicas use the same gradients to optimize the same model, and all model replicas use the same buffers, which are broadcast from the model on device 0 of the rank 0 node.

@zhanghang1989 I think your syncbn uses only comm.broadcast_coalesced and comm.reduce_add_coalesced, which only support cross-GPU communication. They do not support cross-node communication, which is what the distributed setting requires.
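
As a rough illustration of the behavior described above, the following sketch averages gradients with all_reduce and broadcasts buffers from rank 0. It is a simplified stand-in, not DistributedDataParallel's actual code (which buckets gradients and overlaps communication with the backward pass). The helper names `average_gradients` and `broadcast_buffers` are hypothetical, and the sketch assumes a process group has already been set up with `torch.distributed.init_process_group`.

```python
import torch
import torch.distributed as dist


def average_gradients(model):
    """All-reduce every parameter gradient so all replicas step identically.

    Assumes backward() has already run on each process.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum the gradient over all processes, then divide to average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)


def broadcast_buffers(model, src=0):
    """Copy buffers (e.g. BatchNorm running stats) from rank `src` to all."""
    for buf in model.buffers():
        dist.broadcast(buf, src=src)
```

Note that nothing here synchronizes the batch statistics inside the forward pass, which is why a plain BatchNorm under DistributedDataParallel still normalizes with per-process statistics.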

@vacancy removed the help wanted label on Aug 19, 2018

@vacancy
Owner

vacancy commented Oct 30, 2018

Hi @acgtyrant

What exactly do you mean by "SynchronizedBatchNorm2d is not numerical stable only"?

@acgtyrant
Contributor Author

acgtyrant commented Oct 30, 2018

My tests show that my implementation of SynchronizedBatchNorm2d is not as numerically stable as your bn; there is no other error. Since your bn works, mine should work too. But it does not, and I do not know why... So I need some help now.

@acgtyrant
Contributor Author

After fixing the wrong view of the output, I am retraining DRN now, and it seems to work as expected.

[Screenshot: 2018-11-02-152419_1914x956_scrot]

@Cadene

Cadene commented Nov 4, 2018

@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?

Thank you all for this beautiful work!

@acgtyrant
Contributor Author

acgtyrant commented Nov 7, 2018

My distributed synced bn works with DistributedDataParallel, but because of confidentiality rules from my boss, I deleted the post that contained the source of the distributed synced bn, sorry.

However, it is easy to implement: torch.distributed.all_reduce synchronizes automatically, so just wrap it as an autograd function and use it to all-reduce the sum and sum_of_square, then use this function in your own synced bn module.
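
A minimal sketch of that idea follows. It is not the deleted implementation; the names `AllReduceSum` and `global_batch_stats` are hypothetical, and it assumes an initialized process group, 4D input of shape (N, C, H, W), and the same number of elements per channel on every process. The backward pass also all-reduces, because each local input contributes to the statistics used by every process.

```python
import torch
import torch.distributed as dist


class AllReduceSum(torch.autograd.Function):
    """Sum a tensor over all processes, with a matching summed backward."""

    @staticmethod
    def forward(ctx, tensor):
        out = tensor.clone()
        dist.all_reduce(out, op=dist.ReduceOp.SUM)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        # Every process used the reduced value, so gradients are summed too.
        grad = grad_output.contiguous().clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad


def global_batch_stats(x):
    """Per-channel mean and biased variance over the batch on all processes."""
    # Elements per channel, counted over every process (assumed equal sizes).
    count = x.numel() // x.size(1) * dist.get_world_size()
    # Stack sum and sum of squares so one all_reduce call covers both.
    local = torch.stack([x.sum(dim=(0, 2, 3)), (x * x).sum(dim=(0, 2, 3))])
    total_sum, total_sq = AllReduceSum.apply(local)
    mean = total_sum / count
    var = total_sq / count - mean * mean
    return mean, var
```

A synced bn module would then normalize with these statistics in training mode, e.g. `(x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)`, and update its running averages from them.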

@Spritea

Spritea commented Jan 9, 2019

> @acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?
>
> Thank you all for this beautiful work!

FYI, there is an open-source repo created by NVIDIA, https://github.com/NVIDIA/apex, which supports SyncBN with DistributedDataParallel.

In fact, so far it only supports SyncBN with DistributedDataParallel and doesn't support SyncBN with DataParallel; see this issue: NVIDIA/apex#115

@qchenclaire

How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 GPUs? Does this repo work?

@zhanghang1989

> How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 gpus? Does this repo work?

For torch.nn.parallel.DistributedDataParallel, please use torch.nn.SyncBatchNorm.
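
For reference, a short sketch of that suggestion, using a small made-up model for illustration. `torch.nn.SyncBatchNorm.convert_sync_batchnorm` replaces the BatchNorm layers of an existing model. The DistributedDataParallel wrapping is left as a comment because it needs an initialized process group and one GPU per process, and `local_rank` there is an assumed variable.

```python
import torch.nn as nn

# A toy model with an ordinary BatchNorm layer (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d layer with SyncBatchNorm, keeping its parameters.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In each training process, after torch.distributed.init_process_group(...):
#   model = model.cuda(local_rank)
#   model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```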

6 participants