new group #31682
Conversation
Thanks for your contribution!
✅ This PR's description meets the template requirements!
Please fine-tune your title and fix your CI first!
LGTM
LGTM
LGTM
def wait(tensor, group=None, use_calc_stream=True):
A question:
- What is the use_calc_stream parameter for? What happens when it is set to True versus False?
- Might control over other streams be added in the future?
- It is Paddle's logical abstraction over GPU streams: the calculation stream and the communication stream, which correspond to different channels;
- No. Besides this abstraction there are multiple comm streams, distinguished by id and bound to a group.
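A rough sketch of how the two streams might be used together, assuming the wait/broadcast signatures shown in this diff and the semantics described above (use_calc_stream selects whether the op is placed on the calculation stream or the communication stream); it is meant to be run under paddle.distributed.launch with multiple processes:

    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()
    data = paddle.rand([10, 1000])

    # Place the broadcast on the communication stream so it can overlap
    # with work on the calculation stream (assumed semantics).
    dist.broadcast(data, src=0, use_calc_stream=False)

    # ... independent computation could be scheduled here ...

    # Block until the communication on `data` has finished before the
    # calculation stream consumes it.
    dist.wait(data, use_calc_stream=True)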
attrs={'ring_id': ring_id}, )

def broadcast(tensor, src, group=None, use_calc_stream=True):
Does changing group=0 to group=None affect backward compatibility?
For example, is there any code that sets group=1?
Confirmed with colleagues that group=None has no impact. Before this PR it was impossible to create a group, so the group=1 case does not exist; in addition, explicit calls with group=0 have already been ruled out in the code.
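For illustration only, a minimal sketch of the compatibility story under the assumptions above (group=None falls back to the default global group; a Group returned by new_group can be passed explicitly); run with paddle.distributed.launch on at least two processes:

    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()
    data = paddle.rand([10, 1000])

    # Existing user code keeps working: no group argument, the default/global group is used.
    dist.all_reduce(data)

    # New in this PR: a group created with new_group can be passed explicitly.
    gp = dist.new_group(ranks=[0, 1])
    dist.all_reduce(data, group=gp)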
LG API
def new_group(ranks=None, backend=None):
    """

    Creates a new distributed comminication group.
communication
backend (str): The backend used to create group, only nccl is supported now.

Returns:
    Group: The group instance. Nerver return None.
Nerver? Never?
import paddle

paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
Create the Tensor with paddle.random, then there is no need to use numpy. A sketch of the suggestion is below.
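For example, the docstring example could drop numpy roughly like this (a sketch of the suggestion using paddle.rand):

    import paddle

    paddle.distributed.init_parallel_env()
    tindata = paddle.rand([10, 1000], dtype='float32')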
Examples:
    .. code-block:: python

        import numpy as np
This can be removed.
import paddle

paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
Same as above.
see inline comments
paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
tindata = paddle.to_tensor(tindata)
paddle.rand can be used directly.
Args:
    ranks (list): The global ranks of group members, list as sorted.
    backend (str): The backend used to create group, only nccl is supported now.
When backend keeps its default value of None, the current behavior is to simply use nccl. Is there any plan to change this default behavior in the future?
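In other words, a hypothetical illustration of the behavior being asked about, assuming backend=None is resolved to nccl as the docstring states:

    import paddle.distributed as dist

    g1 = dist.new_group(ranks=[0, 1])                  # backend defaults to None -> treated as 'nccl' today
    g2 = dist.new_group(ranks=[0, 1], backend='nccl')  # explicit, currently equivalent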
place = core.CUDAPlace(genv.device_id)
core.NCCLParallelContext(strategy, place).init_with_ring_id(ring_id)
else:
    assert False
Please just raise an error with a proper message.
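For example, a sketch of the suggestion; the surrounding condition is assumed here for illustration and is not the exact code in this PR:

    if backend == 'nccl':
        place = core.CUDAPlace(genv.device_id)
        core.NCCLParallelContext(strategy, place).init_with_ring_id(ring_id)
    else:
        # Raise with an explicit message instead of `assert False`.
        raise NotImplementedError(
            "Only the 'nccl' backend is supported for now, got: {}".format(backend))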
Creates a new distributed comminication group.

Args:
    ranks (list): The global ranks of group members, list as sorted.
I don't quite understand "list as sorted". What does it mean? Is there a requirement on the order of the values in ranks?
Examples:
    .. code-block:: python

        import numpy as np
There is no need to import numpy.
paddle.distributed.init_parallel_env()
tindata = np.random.random([10, 1000]).astype('float32')
tindata = paddle.to_tensor(tindata)
Same as above, numpy is not needed.
@@ -163,7 +371,9 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=0):
    tensor (Tensor): The input Tensor. It also works as the output Tensor. Its data type
    should be float16, float32, float64, int32 or int64.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
The documentation doesn't say what the default value is.
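For instance, the Args line could be amended roughly as follows (assuming ReduceOp.SUM is the default, as the signature in the hunk header suggests):

    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.MIN|ReduceOp.PROD): Optional. The operation used.
        Default value is ReduceOp.SUM.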
@@ -238,7 +454,9 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=0):
    should be float16, float32, float64, int32 or int64.
    dst (int): The destination rank id.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
The documentation doesn't say what the default value is.
@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
    tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type
    should be float16, float32, float64, int32 or int64.
    src (int): The source rank id.
Please describe the default value.
@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
    tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type
Please describe the default value.
LGTM
TODO: fix docs bug
lgtm
Will fix the doc problems in a following PR.
PR types
New features
PR changes
APIs
Describe
unit test covered by