
[Reshard] Support reshard s to s on same placement #57210

Merged: 3 commits into PaddlePaddle:develop from dev_s_to_s on Sep 15, 2023

Conversation

@LiYuRio (Contributor) commented Sep 12, 2023

PR types

New features

PR changes

Others

Description

Pcard-73145

How shard-to-shard conversion on a 1-D mesh works:

Take a tensor of shape (a, b, c) being re-sharded across a 1-D mesh of n processes as an example (a runnable simulation follows this list).

  • Input dims_mapping is [-1, -1, 0], output dims_mapping is [0, -1, -1]: each process must go from (a, b, c/n) to (a/n, b, c).
    • First, do an all_to_all; the shape stays the same while values are exchanged. At this point the values along axis 0 have been exchanged with the other processes.
    • Second, do a reshape; the shape goes from (a, b, c/n) to (n, a/n, b, c/n).
    • Then, do a transpose; the shape goes from (n, a/n, b, c/n) to (a/n, b, n, c/n).
    • Finally, do a reshape; the shape goes from (a/n, b, n, c/n) to (a/n, b, c).
  • Input dims_mapping is [0, -1, -1], output dims_mapping is [-1, -1, 0]: each process must go from (a/n, b, c) to (a, b, c/n).
    • First, do a reshape; the shape goes from (a/n, b, c) to (a/n, b, n, c/n).
    • Second, do a transpose; the shape goes from (a/n, b, n, c/n) to (n, a/n, b, c/n).
    • Then, do a reshape; the shape goes from (n, a/n, b, c/n) to (a, b, c/n).
    • Finally, do an all_to_all; the shape stays the same while values are exchanged. At this point the values along axis 0 have been exchanged with the other processes, filling in the data that was missing along axis 0.
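To make the data movement concrete, here is a minimal single-process C++ simulation of the first case above, modeling the n ranks as plain std::vector buffers. This is an illustrative sketch only, not the Paddle kernel; the names local, recv, and block and the fixed sizes are invented for the example.

// Simulate s-to-s reshard on a 1-D mesh: shard on the last axis -> shard
// on the first axis, following the all_to_all / reshape / transpose /
// reshape steps described above.
#include <cassert>
#include <cstdio>
#include <vector>

int main() {
  const int n = 2, a = 4, b = 3, c = 6;  // assumes a % n == 0 && c % n == 0
  const int as = a / n, cs = c / n;      // per-rank shard sizes

  // Global tensor (a, b, c) with value == linear index, sharded on the last
  // axis: rank r holds columns [r*cs, (r+1)*cs), local shape (a, b, cs).
  std::vector<std::vector<int>> local(n, std::vector<int>(a * b * cs));
  for (int r = 0; r < n; ++r)
    for (int i = 0; i < a; ++i)
      for (int j = 0; j < b; ++j)
        for (int k = 0; k < cs; ++k)
          local[r][(i * b + j) * cs + k] = (i * b + j) * c + r * cs + k;

  // Step 1: all_to_all. Each rank splits its (a, b, cs) buffer into n
  // row blocks of (as, b, cs) and sends block p to rank p; the local
  // shape is unchanged, only values are exchanged.
  std::vector<std::vector<int>> recv(n, std::vector<int>(a * b * cs));
  const int block = as * b * cs;
  for (int r = 0; r < n; ++r)
    for (int p = 0; p < n; ++p)
      for (int t = 0; t < block; ++t)
        recv[r][p * block + t] = local[p][r * block + t];

  // Steps 2-4 on each rank: reshape (a, b, cs) -> (n, as, b, cs),
  // transpose to (as, b, n, cs), reshape to (as, b, c). The reshapes are
  // pure re-views of the flat buffer; only the transpose permutes data.
  for (int r = 0; r < n; ++r) {
    std::vector<int> out(as * b * c);
    for (int p = 0; p < n; ++p)
      for (int i = 0; i < as; ++i)
        for (int j = 0; j < b; ++j)
          for (int k = 0; k < cs; ++k)
            out[((i * b + j) * n + p) * cs + k] =
                recv[r][((p * as + i) * b + j) * cs + k];
    // Check: rank r now holds rows [r*as, (r+1)*as) of the global tensor.
    for (int i = 0; i < as; ++i)
      for (int j = 0; j < b; ++j)
        for (int k = 0; k < c; ++k)
          assert(out[(i * b + j) * c + k] == ((r * as + i) * b + j) * c + k);
  }
  std::printf("s-to-s reshard simulation passed\n");
  return 0;
}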

TODO:

  • Add a hard check against uneven sharding.
  • all_to_all currently has no CPU support; a corresponding CPU kernel needs to be written.

@paddle-bot (bot) commented Sep 12, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@LiYuRio force-pushed the dev_s_to_s branch 2 times, most recently from 17f53e7 to 0316a9c on September 12, 2023 03:49

#include "paddle/phi/core/distributed/auto_parallel/s_to_s_reshard_function.h"

#include "glog/logging.h"
Contributor: This header does not seem to be used.

const auto& logical_ddim = in.dims();
int64_t nranks = in_process_ids.size();
int64_t in_split_axis =
GetSplitAxisWithDimsMapping(in.dist_attr().dims_mapping()).begin()->first;
Contributor: Could there still be non -1 dimensions after the first one?

Author (Contributor): Not in this scenario. The mesh is restricted to one dimension, so only a single shard axis can appear here.
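For intuition, here is a hedged sketch of what a helper like GetSplitAxisWithDimsMapping conceptually returns: a map from tensor axis to mesh dimension for every non -1 entry. The body below is an assumption for illustration, not Paddle's implementation; on a 1-D mesh the map has at most one entry, so begin()->first is the lone shard axis.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical stand-in for GetSplitAxisWithDimsMapping:
// maps tensor axis -> mesh dim for every sharded axis.
std::map<int64_t, int64_t> SplitAxisSketch(
    const std::vector<int64_t>& dims_mapping) {
  std::map<int64_t, int64_t> split_axis;
  for (size_t i = 0; i < dims_mapping.size(); ++i)
    if (dims_mapping[i] != -1) split_axis[i] = dims_mapping[i];
  return split_axis;
}

int main() {
  auto m = SplitAxisSketch({-1, -1, 0});  // dims_mapping from the example
  std::printf("shard axis = %lld\n", (long long)m.begin()->first);  // prints 2
  return 0;
}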


DenseTensor out_reshape1;
RESHARD_FUNCTOR(
    dev_ctx, Reshape, dtype, in.value(), pre_shape_vec, &out_reshape1);
Contributor: Could this reshape introduce an unnecessary copy? Would it be better to pass in a shallow copy of the input?

Author (Contributor): The reshape may change the ordering of the values, so for now we avoid any modification that could touch the input's underlying DenseTensor. This can be optimized further later if needed.
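A generic illustration of the trade-off being discussed (plain C++, not Paddle code): writing into a fresh buffer leaves the source untouched, while a shallow view aliases the same storage and mutations show through.

#include <cassert>
#include <vector>

int main() {
  std::vector<float> src = {0, 1, 2, 3, 4, 5};  // logically a (2, 3) tensor

  std::vector<float> copy = src;  // out-of-place: a new buffer
  copy[0] = 42.f;
  assert(src[0] == 0.f);          // the source is untouched

  float* view = src.data();       // shallow view: shares the storage
  view[0] = 42.f;
  assert(src[0] == 42.f);         // the source is mutated through the view
  return 0;
}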

dev_ctx, Reshape, dtype, in.value(), pre_shape_vec, &out_reshape1);

// 1.2 calc the desired axis and transpose
std::vector<int> axis;
Contributor: Could the int64_t-to-int conversion truncate in extreme cases? How about just using vector<int64_t>?

Author (Contributor): Since axis values never exceed the 32-bit limit, these have all been changed to vector<int>.
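For context, a small hedged illustration of what the axis vector expresses in the forward case: the permutation {1, 2, 0, 3} moves the rank dimension next to the sharded axis, mapping (n, a/n, b, c/n) to (a/n, b, n, c/n). int indices suffice because they index tensor axes, not element counts; the shape values below are invented for the example.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int64_t> shape = {4, 2, 3, 5};  // (n, a/n, b, c/n) with n = 4
  std::vector<int> axis = {1, 2, 0, 3};       // desired transpose permutation
  std::vector<int64_t> out;
  for (int ax : axis) out.push_back(shape[ax]);
  for (int64_t d : out) std::printf("%lld ", (long long)d);  // prints 2 3 4 5
  std::printf("\n");
  return 0;
}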

@chenwhql (Contributor) left a comment: LGTM

@XieYunshen (Contributor) left a comment: LGTM for set_tests_properties(test_reshard_s_to_s PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 100)

@LiYuRio LiYuRio merged commit da98117 into PaddlePaddle:develop Sep 15, 2023
@LiYuRio LiYuRio deleted the dev_s_to_s branch September 15, 2023 03:05
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
* support reshard s to s

* refine, remove useless code

* reduce changable file