[AutoParallel] Support pipeline parallelism backward non-computation clip. #58609

GhostScreaming · 2023-11-02T06:39:26Z

PR types

Bug fixes

PR changes

Others

Description

Pcard-73145

Fixes the compilation conflict between PR 58449 and PR 58506.

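Supports clipping computation on non-computation ranks in the backward pass of pipeline parallelism. For the forward-pass counterpart, see PR 58126; for the PR that lets paddle::distributed::reshard build forward and backward nodes, see PR 58238. The main change is special handling of uninitialized Tensors when the backward graph is created.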
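To make the idea of clipping computation on non-computation ranks concrete, here is a minimal single-process sketch in Python. It is a toy model, not Paddle code: `ToyTensor`, `run_op`, `toy_reshard`, `forward_on_rank`, and the shared `channel` dict are all hypothetical names, and the cross-mesh transfer is modeled with a dictionary instead of real send/recv. It assumes a two-stage pipeline where rank 0 owns stage 0 and rank 1 owns stage 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ToyTensor:
    """Hypothetical stand-in for one rank's local view of a distributed tensor."""
    value: Optional[float] = None  # None models an uninitialized tensor

    @property
    def initialized(self) -> bool:
        return self.value is not None


def run_op(rank: int, mesh: Tuple[int, ...],
           fn: Callable[[float], float], x: ToyTensor) -> ToyTensor:
    """Run fn only on ranks inside the op's mesh; other ranks skip the work."""
    if rank not in mesh:
        return ToyTensor()  # clipped: nothing is computed on this rank
    assert x.initialized, "a rank inside the mesh must hold real input data"
    return ToyTensor(fn(x.value))


def toy_reshard(rank: int, src_mesh: Tuple[int, ...], dst_mesh: Tuple[int, ...],
                x: ToyTensor, channel: Dict[str, float]) -> ToyTensor:
    """Model a cross-mesh transfer with a shared dict (assumption, no real comm)."""
    if rank in src_mesh:
        channel["activation"] = x.value  # "send"
    if rank in dst_mesh:
        return ToyTensor(channel["activation"])  # "recv"
    return ToyTensor()  # ranks outside the destination mesh hold nothing


def forward_on_rank(rank: int, x: float, channel: Dict[str, float]) -> ToyTensor:
    mesh0, mesh1 = (0,), (1,)  # stage 0 on rank 0, stage 1 on rank 1
    inp = ToyTensor(x) if rank in mesh0 else ToyTensor()
    h = run_op(rank, mesh0, lambda v: v * 2.0, inp)
    h = toy_reshard(rank, mesh0, mesh1, h, channel)
    return run_op(rank, mesh1, lambda v: v + 1.0, h)


channel: Dict[str, float] = {}
out0 = forward_on_rank(0, 3.0, channel)  # rank 0 runs first so the "send" exists
out1 = forward_on_rank(1, 3.0, channel)
print(out0.initialized, out1.value)  # False 7.0
```

Every rank executes the same program, but each op does real work only on the ranks in its mesh. The uninitialized placeholders left on the other ranks are the tensors that need special care when the backward graph is built.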

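When IsRunAutoParallel() is true, skip the FillZeroForEmptyGradInput step.
SetGradInMeta handles the pipeline-parallel (PP) case specially.
GradTensorHolder::add handles the PP case specially, so that edges between backward nodes are not left unconnected.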
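The three points above can be illustrated with a small Python toy. This is only a sketch of the described behavior under stated assumptions, not the C++ implementation in Paddle: `ToyGradHolder` and `fill_zero_for_empty_grads` are hypothetical names, `None` stands for an uninitialized gradient, and the exact rules in the real `GradTensorHolder::add` and `FillZeroForEmptyGradInput` may differ.

```python
from typing import List, Optional


class ToyGradHolder:
    """Hypothetical gradient buffer for one backward node (not Paddle's class)."""

    def __init__(self, num_slots: int) -> None:
        self.grads: List[Optional[float]] = [None] * num_slots
        self.edge_seen: List[bool] = [False] * num_slots

    def add(self, slot: int, grad: Optional[float], auto_parallel: bool) -> None:
        if grad is None:
            # Uninitialized gradient: under the assumed auto-parallel rule the
            # edge is still recorded, so the next backward node stays reachable.
            if auto_parallel:
                self.edge_seen[slot] = True
            return
        self.edge_seen[slot] = True
        current = self.grads[slot]
        # Accumulate when several upstream nodes feed the same slot.
        self.grads[slot] = grad if current is None else current + grad


def fill_zero_for_empty_grads(grads: List[Optional[float]],
                              auto_parallel: bool) -> List[Optional[float]]:
    """Replace missing gradients with zeros, except in auto-parallel mode."""
    if auto_parallel:
        return list(grads)  # skipped: the rank may legitimately hold no data
    return [0.0 if g is None else g for g in grads]


holder = ToyGradHolder(num_slots=2)
holder.add(0, None, auto_parallel=True)  # placeholder from a clipped rank
holder.add(1, 1.5, auto_parallel=True)
holder.add(1, 0.5, auto_parallel=True)

print(holder.edge_seen)                                  # [True, True]
print(fill_zero_for_empty_grads(holder.grads, True))     # [None, 2.0]
print(fill_zero_for_empty_grads(holder.grads, False))    # [0.0, 2.0]
```

In this toy, an uninitialized gradient does not contribute a value, but it still marks its edge as visited, and the zero-fill step leaves it untouched in auto-parallel mode instead of materializing a tensor on a rank that does not own the data.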


GhostScreaming added 27 commits October 19, 2023 16:33

c95797b  [AutoParallel] Support paddle.distributed.reshard construct GradNode, which is needed for pipeline parallel.
243d0d3  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into support_reshard_backward
b241620  Fix problem of CI, and fix pp testcase as review comments advising.
057114d  Fix including files problem.
9f5c0a7  Polish paddle.distributed.reshard implementation according to review comments.
7e71a8e  Fix some problems.
1ba8ae8  Polish code.
bc3db47  Fix problem of failed testcase.
343e6c0  Move reshard function to tensor_utils.h, as files in phi/core is not allowed to include files in phi/api.
53c3d19  Add forgetting file.
58004ef  Fix some compilation problem.
23cdf20  Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.
0544d20  Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.
5c0b70e  Fix problem of WITH_PYTHON=OFF compilation option.
1d7e5ad  Fix bug of conditional compilation.
083e48e  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into support_reshard_backward
c3929fb  [AutoParallel] Support pipeline parallel backward. Both pp single strategy and dp-mp-pp hybrid strategy are verified. As CI machine only has 2 cards and dp-mp-pp strategy needs 9 GPU cards, such case will be added in testcase later.
3dd0d3e  Polish pipeline parallel backward implementation.
eec1be9  Remove useless modification.
5c9fadf  Add MLP dp-mp-pp hybrid strategy testcase, it can't be run on CI Machine now as it needs 8 gpus.
a024e6a  Remove useless modification.
624648a  Fix problem of Tensor double free and polish code.
bfa3d52  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into support_reshard_backward
fa792bd  Fix problem of ReshardOutputPartialAxisToReplicated.
d0e6633  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into support_reshard_backward
f1aab16  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into support_reshard_backward
ce6a6e1  Revert "Revert "[AutoParallel] Support pipeline parallelism backward non-computation clip. (PaddlePaddle#58449)" (PaddlePaddle#58601)" This reverts commit 79e24ec.

chenwhql approved these changes Nov 2, 2023

GhostScreaming merged commit 3b44f88 into PaddlePaddle:develop Nov 2, 2023
