
add the basic apis for auto_parallel #33804

Merged

sandyhouse merged 60 commits into PaddlePaddle:develop from auto_parallel_basic on Aug 11, 2021

Conversation


@sandyhouse sandyhouse commented Jun 28, 2021

PR types

New features

PR changes

Others

Describe

  1. add the basic directory for auto_parallel (python/paddle/distributed/auto_parallel)
  2. add the following APIs:
  • ProcessMesh
  • shard_tensor
  • shard_op
  • set_pipeline_stage
  • set_offload_device
  • set_shard_mask

Usage:

  • ProcessMesh
    import numpy as np
    import paddle
    import paddle.distributed as dist
    
    paddle.enable_static()
    
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    assert mesh.parent is None
    assert mesh.topology == [2, 3]
    assert mesh.process_group == [2, 4, 5, 0, 1, 3]
    mesh.set_placement([0, 1, 2, 3, 4, 5])
  • shard_tensor
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    dist.shard_tensor(x, mesh, [0, -1])
  • shard_op
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    y = paddle.zeros([4, 6])
    kwargs = {'x': x, 'y': y}
    dist.shard_op(paddle.add, mesh, None, **kwargs)
  • set_pipeline_stage
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    dist.set_pipeline_stage(1)
  • set_offload_device
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    x = paddle.ones([4, 6])
    dist.set_offload_device(x, 'cpu')
  • set_shard_mask
    import numpy as np
    import paddle
    import paddle.distributed as dist
    paddle.enable_static()
    mesh = dist.ProcessMesh(np.array([[2, 4, 5], [0, 1, 3]]))
    x = paddle.ones([4, 6])
    # mask has the same shape as the mesh topology ([2, 3]); the values are illustrative
    mask = np.array([[1, 0, 1], [0, 1, 0]])
    dist.set_shard_mask(x, mask)

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@sandyhouse sandyhouse changed the title add the basic directory for auto_parallel [WIP] add the basic directory for auto_parallel Jul 1, 2021
@paddle-bot-old

Sorry to inform you that the CIs for bf24fb7 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@sandyhouse sandyhouse changed the title [WIP] add the basic directory for auto_parallel add the basic directory and related apis for auto_parallel Aug 5, 2021
@sandyhouse sandyhouse changed the title add the basic directory and related apis for auto_parallel add the basic apis for auto_parallel Aug 6, 2021
fuyinno4 previously approved these changes Aug 6, 2021
PangHua previously approved these changes Aug 6, 2021
And the first logical process is the one with id=2.

Args:
mesh (numpy.ndarray): an N-dimensional array that describes the topology
Contributor

What is the reason for using numpy.ndarray as the parameter type here? Judging from the example code, wouldn't a Python list be enough?

Author

Changed to use a Python list.
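For reference, a minimal sketch of the updated usage, assuming ProcessMesh now accepts a nested Python list in place of the numpy.ndarray shown in the examples above (signature otherwise unchanged):

    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    # Assumption: after this change the topology can be given as a nested Python list.
    mesh = dist.ProcessMesh([[2, 4, 5], [0, 1, 3]])
    assert mesh.topology == [2, 3]
    assert mesh.process_group == [2, 4, 5, 0, 1, 3]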


Args:
x (Tensor): the tensor to process.
mask (numpy.ndarray): the shape of `mask` must be the same as the ProcessMesh belonging to
Contributor

Could the mask here also just be a Python list?

Author

Changed to use a Python list.
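A minimal sketch of the corresponding set_shard_mask usage, assuming the mask can now be a nested Python list whose shape matches the mesh topology ([2, 3] here); the mask values below are purely illustrative:

    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    mesh = dist.ProcessMesh([[2, 4, 5], [0, 1, 3]])
    x = paddle.ones([4, 6])
    dist.shard_tensor(x, mesh, [0, -1])
    # Assumption: the mask is a nested Python list with the same shape as the mesh topology.
    mask = [[1, 0, 1], [0, 1, 0]]
    dist.set_shard_mask(x, mask)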


Args:
x (tensor): the tensor to process.
device (str): the device that the tensor `x` will be put on, e.g., 'gpu:0', 'cpu'.
Contributor

In which cases would set_offload_device need to be set to 'gpu:0', and what would that mean?

Author

In practical use cases, the need is to offload a specified tensor to the CPU, so 'gpu:0' has been removed here.

@@ -175,6 +191,7 @@ message VarDesc {
optional bool need_check_feed = 4 [ default = false ];
optional bool is_parameter = 5 [ default = false ];
optional bool stop_gradient = 6 [ default = false ];
repeated Attr attrs = 7;
Contributor

Will these newly added fields be persisted when the model is saved?
From the example code, the fields are added as soon as the model is defined; if the model is saved right after it is defined, will all of these fields be saved as well? And when are they removed?

Author

Auto parallel mainly involves the following steps: 1. annotate key tensors or ops with the auto-parallel APIs; 2. auto-completion: complete the distributed attributes of all tensors and ops; 3. logical partition; 4. physical mapping; 5. training. Steps 1-3 use the fields added here, so the fields are removed once steps 1-3 are finished, and this removal is transparent to the user.

In the normal workflow, the model is saved after part or all of the training has finished, and by then the added fields have already been removed completely.

There is one special case: the user saves the model immediately after building the network, in which case the related fields would be saved. We consider this case invalid, however, because saving a model right after building the network is pointless.
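As a rough, non-authoritative illustration of step 1 (annotation) using only the APIs added in this PR, with the mesh given as a Python list per the change discussed above; auto-completion, partition, physical mapping, and model saving are not shown:

    import paddle
    import paddle.distributed as dist

    paddle.enable_static()

    # Step 1: annotate key tensors/ops. The distributed attributes recorded by these
    # calls are what the new attrs fields hold until steps 1-3 have finished.
    mesh = dist.ProcessMesh([[2, 4, 5], [0, 1, 3]])
    x = paddle.ones([4, 6])
    dist.shard_tensor(x, mesh, [0, -1])
    dist.set_offload_device(x, 'cpu')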

mesh_id = self.attr(mesh_attr_name)
return _g_process_mesh_map[mesh_id]

def dims_mapping(self, name):
Contributor

Use "dimension" for a tensor's overall dimensionality, usually counted from 1 (1-D tensor, 2-D tensor).
Use "axis"/"axes" for a particular dimension of a tensor, usually indexed from 0 (the tensor's first axis, its second axis).
This looks like the overall-dimensionality concept, so the singular dim_mapping is recommended.

Author

done.

@sandyhouse sandyhouse dismissed stale reviews from PangHua and fuyinno4 via 773516b August 10, 2021 05:33
@sandyhouse sandyhouse requested review from XiaoguangHu01 and removed request for chenwhql August 10, 2021 10:02
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

@sandyhouse sandyhouse merged commit 3f962e7 into PaddlePaddle:develop Aug 11, 2021
@sandyhouse sandyhouse deleted the auto_parallel_basic branch March 8, 2022 10:01