Add interface to launch parallel dygraph by multiprocessing #26044
Conversation
… dygraph/add_multiprocess_run_interface
It would be best to have a performance comparison for the spawn mode?
LGTM
ParallelStrategy = core.ParallelStrategy

def init_parallel_env(backend='nccl'):
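For context, a minimal sketch of what an init function along these lines typically does, not the PR's actual implementation: each worker reads the per-process environment variables exported by the launcher and derives its rank and world size from them (the variable names and helper name below are assumptions, not taken from the diff):

```python
import os

def init_parallel_env_sketch():
    # Hypothetical sketch, not the PR's implementation: a parallel-env
    # init typically reads per-process environment variables exported by
    # the launcher (launch or spawn) for each worker.
    rank = int(os.getenv("PADDLE_TRAINER_ID", "0"))           # this worker's index
    world_size = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))   # total worker count
    return {"rank": rank, "world_size": world_size}
```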
NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and use another communication library, such as Gloo. I highly recommend removing the backend argument for now, for simplicity of usage.
Thanks, I think it is okay to remove it; we can discuss removing this argument via cherry-pick.
please remove the backend argument for simplicity
Thanks for the suggestion; we should indeed have one. Can I publish a report as a follow-up? The development window for this interface was short: for roughly the past week we have been discussing and iterating on the interface design, and it has to ship with the 2.0-beta release, so we only verified correctness and have not yet had time to run a performance comparison. In theory this interface is no different from launch; it only changes how the multiple processes are started and adds no extra implementation, so there should be no performance difference. It is also just an optional start method and does not affect the existing usage of launch.
LGTM
lgtm
LGTM
PR types
New features
PR changes
APIs
Describe
This PR adds the multiprocessing start methods start_processes and spawn for dygraph data parallel training.

1. Start method difference

launch:
python -m paddle.distributed.launch --selected_gpus=0,1 train.py

spawn:
python train.py
and call spawn in the __main__ method, for example:

2. Simple example
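As a hedged sketch of the mechanics behind the spawn start method, using only the Python standard library rather than the PR's spawn API: the parent script creates the worker processes itself and hands each one a rank, which is why the entry point must be guarded by __main__ (spawned children re-import the main module and must not start workers recursively):

```python
import multiprocessing as mp

def _worker(rank, queue):
    # Stand-in for a real train() function: each spawned worker
    # simply reports its rank back to the parent process.
    queue.put(rank)

def run_spawn(nprocs=2):
    # Use the "spawn" start method, the same start method the new API
    # is named after: each child starts a fresh interpreter.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(rank, queue))
             for rank in range(nprocs)]
    for p in procs:
        p.start()
    ranks = sorted(queue.get() for _ in range(nprocs))
    for p in procs:
        p.join()
    return ranks

if __name__ == "__main__":
    # The __main__ guard is mandatory for spawn: child processes
    # re-import this module and must not spawn workers again.
    print(run_spawn(2))  # prints [0, 1]
```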
3. API change

Add 4 new APIs:
- paddle.distributed.spawn: start multi-process training by the spawn method
- paddle.distributed.init_parallel_env: init parallel environment variables & get the parallel strategy
- paddle.distributed.get_rank: get the current process rank
- paddle.distributed.get_world_size: get the current world size

Move 2 old APIs:
- paddle.prepare_context (fluid.dygraph.prepare_context) -> paddle.distributed.prepare_context
- paddle.ParallelEnv (fluid.dygraph.ParallelEnv) -> paddle.distributed.ParallelEnv

Refine 1 old API:
- paddle.DataParallel (fluid.dygraph.DataParallel): set strategy as an optional argument

Deprecate 1 old API:
- paddle.distributed.prepare_context (fluid.dygraph.prepare_context): to be replaced by paddle.distributed.init_parallel_env later

4. Correctness
Verify the correctness of the interface in the following models:
- test_parallel_dygraph_mnist.py
- test_parallel_dygraph_se_resnext.py
- test_parallel_dygraph_transformer.py
5. Related docs