
new group #31682

Merged: 3 commits merged into PaddlePaddle:develop on Apr 1, 2021
Conversation

kuizhiqing (Member) commented Mar 17, 2021

PR types

New features

PR changes

APIs

Describe

  • add python api new_group
  • add python api wait
  • rewrite op CSyncCalcStreamOp with kernel
  • rewrite op CSyncCommStreamOp with kernel
  • add function NCCLParallelContext::InitWithRingID, allowing creation of a Communicator with a preset ring_id/group_id (a usage sketch of the new Python APIs follows the test list below)

Unit tests covered by:

  • test_fleet_sharding_meta_optimizer.py
  • test_new_group_api.py
  • test_new_group.sh
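
A hedged usage sketch of the two new Python APIs (paddle.distributed.new_group and paddle.distributed.wait), not the PR's own example code. It assumes a multi-GPU job started with paddle.distributed.launch; the tensor shape, the ranks list, and the launcher flags are illustrative.

# Run under `python -m paddle.distributed.launch` on multiple GPUs
# (launcher flags vary by Paddle version).
import paddle
import paddle.distributed as dist

dist.init_parallel_env()

# create a new communication group from a list of global ranks
group = dist.new_group(ranks=[0, 1])

tindata = paddle.rand([10, 1000])
if dist.get_rank() in [0, 1]:
    # launch the collective on the communication stream of this group ...
    dist.all_reduce(tindata, group=group, use_calc_stream=False)
    # ... then block until the communication on `tindata` has finished
    dist.wait(tindata, group=group, use_calc_stream=False)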

CLAassistant commented Mar 17, 2021

CLA assistant check
All committers have signed the CLA.

paddle-bot-old bot commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

paddle-bot-old bot commented Mar 17, 2021

✅ This PR's description meets the template requirements!
Please wait for other CI results.

gongweibao (Contributor) left a comment:

Please fine-tune your title and fix your CI first!

Review threads (resolved): paddle/fluid/imperative/nccl_context.cc, python/paddle/distributed/collective.py, paddle/fluid/operators/collective/c_sync_comm_stream_op.cc, paddle/fluid/operators/collective/c_sync_calc_stream_op.cc

wangxicoding previously approved these changes Mar 24, 2021

wangxicoding (Contributor) left a comment:

LGTM

Review threads (resolved): paddle/fluid/imperative/bkcl_context.cc, python/paddle/distributed/collective.py
kuizhiqing marked this pull request as draft on March 26, 2021 12:15
kuizhiqing marked this pull request as ready for review on March 29, 2021 07:41

gongweibao (Contributor) left a comment:

LGTM

qili93 (Contributor) left a comment:

LGTM


def wait(tensor, group=None, use_calc_stream=True):

Contributor:

A couple of questions:

  1. What is the use_calc_stream parameter used for? What does setting it to True versus False do, respectively?
  2. Might control over other streams be added in the future?

kuizhiqing (Member, Author):

  1. Paddle logically abstracts GPU streams into a calculation stream and a communication stream, which correspond to different channels;
  2. No. Beyond this abstraction there are multiple comm streams, distinguished by id and bound to groups.
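
A minimal sketch of the distinction described above, assuming the semantics outlined in this thread: use_calc_stream=True targets the calculation stream, use_calc_stream=False targets the communication stream, and wait() synchronizes whichever stream is selected. It assumes a multi-GPU paddle.distributed.launch job.

import paddle
import paddle.distributed as dist

dist.init_parallel_env()
tindata = paddle.rand([10, 1000])

# asynchronous style: issue the all_reduce on the communication stream ...
dist.all_reduce(tindata, use_calc_stream=False)
# ... and synchronize the communication stream before using the result
dist.wait(tindata, use_calc_stream=False)

# synchronous style: run the collective on the calculation stream directly
dist.all_reduce(tindata, use_calc_stream=True)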

attrs={'ring_id': ring_id}, )


def broadcast(tensor, src, group=None, use_calc_stream=True):

Contributor:

Does changing group=0 to group=None affect backward compatibility? For example, is there any existing code that sets group=1?

kuizhiqing (Member, Author):

I have confirmed with colleagues that group=None has no impact: before this PR it was not possible to create a group, so the group=1 case does not exist, and explicit calls with group=0 have already been ruled out in the code.
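
For illustration, a hedged sketch of what the new default means in practice, assuming group=None falls back to the default global group so pre-existing call sites keep working unchanged:

import paddle
import paddle.distributed as dist

dist.init_parallel_env()
data = paddle.rand([4, 8])

# same effect as the old implicit group=0: broadcast within the global group
dist.broadcast(data, src=0)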

XiaoguangHu01 (Contributor) left a comment:

LGTM for the API

def new_group(ranks=None, backend=None):
"""

Creates a new distributed comminication group.

Contributor:

Typo: should be "communication".

backend (str): The backend used to create group, only nccl is supported now.

Returns:
Group: The group instance. Nerver return None.

Contributor:

"Nerver"? Do you mean "Never"?

import paddle

paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')

Contributor:

Create the Tensor with paddle.random instead, then numpy is not needed.

Examples:
.. code-block:: python

import numpy as np

Contributor:

This can be removed.

import paddle

paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')

Contributor:

Same as above.

jzhang533 (Contributor) left a comment:

see inline comments


paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
tindata = paddle.to_tensor(tindata)

Contributor:

You can use paddle.rand directly.
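
A sketch of this suggestion, assuming the docstring example only needs a random float32 tensor: paddle.rand produces one directly, so the numpy import and the paddle.to_tensor conversion can both be dropped.

import paddle

paddle.distributed.init_parallel_env()
# random float32 tensor, no numpy needed (float32 is also the default dtype)
tindata = paddle.rand([10, 1000], dtype='float32')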


Args:
ranks (list): The global ranks of group members, list as sorted.
backend (str): The backend used to create group, only nccl is supported now.

Contributor:

When backend is left at its default of None, the current behavior is to simply use nccl. Is there a plan to change this default behavior in the future?

place = core.CUDAPlace(genv.device_id)
core.NCCLParallelContext(strategy, place).init_with_ring_id(ring_id)
else:
assert False

Contributor:

Please raise a proper error message here instead.
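
A minimal, self-contained sketch of this point: surface a descriptive error instead of a bare assert False. The helper function and message below are hypothetical illustrations, not the PR's actual code.

# Hypothetical illustration of "raise a proper error instead of `assert False`".
def _check_backend(backend):
    if backend == 'nccl':
        return  # supported: go on to create the NCCL parallel context
    # descriptive failure instead of a bare `assert False`
    raise ValueError(
        "new_group currently supports only the 'nccl' backend, got '%s'." % backend)

_check_backend('nccl')   # passes
# _check_backend('gloo') # would raise ValueError with a clear message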

Creates a new distributed comminication group.

Args:
ranks (list): The global ranks of group members, list as sorted.

Contributor:

I don't quite understand what "list as sorted" means. Is there a requirement on the ordering of the values in ranks?

Examples:
.. code-block:: python

import numpy as np

Contributor:

No need to import numpy.


paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
tindata = paddle.to_tensor(tindata)

Contributor:

Same as above; numpy is not needed.

@@ -163,7 +371,9 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=0):
tensor (Tensor): The input Tensor. It also works as the output Tensor. Its data type
should be float16, float32, float64, int32 or int64.
op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.

Contributor:

The docs do not state what the default value is.

@@ -238,7 +454,9 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=0):
should be float16, float32, float64, int32 or int64.
dst (int): The destination rank id.
op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.

Contributor:

The docs do not state what the default value is.

@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type
should be float16, float32, float64, int32 or int64.
src (int): The source rank id.

Contributor:

Please document the default value.

@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type

Contributor:

Please document the default value.

TCChenlong (Contributor) left a comment:

LGTM
TODO: fix docs bug

jzhang533 (Contributor) left a comment:

LGTM
will fix the doc problems in a follow-up PR

seiriosPlus merged commit 0774159 into PaddlePaddle:develop on Apr 1, 2021
kuizhiqing mentioned this pull request on Apr 1, 2021