Add ProcessGroupNCCL for distributed training #39737
e92b6f1 to ff84be7
@@ -1,5 +1,6 @@
if(NOT WITH_PSCORE)
  add_subdirectory(fleet_executor)
  add_subdirectory(collective)
Could this statement simply be added on the first line? That way the line below could be omitted.
OK, done.
namespace paddle {
namespace distributed {

ProcessGroup::Task::Task(int rank, const std::vector<Tensor>& inputTensors,
What is the purpose of the rank parameter? Is it unused?
It is meant for later debugging, to print rank information. I'll note it down as a TODO.
if (FLAGS_nccl_blocking_wait) {
  // NOTE(shenliang03): It will block host for sync
  while (!IsCompleted()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
Make 10 a constexpr so that it is easy to modify.
OK, done.
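For readers following this thread, here is a minimal sketch of what the suggested change could look like. It is not the code merged in this PR: `kWaitBlockInterval` and `FakeTask` are hypothetical names, and completion is modeled with an atomic flag rather than the NCCL/CUDA state the real `IsCompleted()` would query.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical name: the thread does not show what the constant was called.
// Naming the interval keeps the polling period in one easily edited place.
constexpr std::chrono::milliseconds kWaitBlockInterval{10};

// Simplified stand-in for ProcessGroup::Task (assumption for illustration).
class FakeTask {
 public:
  bool IsCompleted() const { return done_.load(std::memory_order_acquire); }
  void MarkCompleted() { done_.store(true, std::memory_order_release); }

  // Blocks the calling host thread until the task completes, polling at
  // kWaitBlockInterval. Returns how many times it slept.
  int BlockingWait() const {
    int polls = 0;
    while (!IsCompleted()) {
      std::this_thread::sleep_for(kWaitBlockInterval);
      ++polls;
    }
    return polls;
  }

 private:
  std::atomic<bool> done_{false};
};
```

Changing the polling period then only requires editing the one constant, which is the point of the review comment.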
LGTM
b8176b3
LGTM
LGTM for ‘set_tests_properties(test_collective_process_group PROPERTIES TIMEOUT 120)’
PR types
New features
PR changes
Others
Describe
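Adds the ProcessGroup and ProcessGroupNCCL concepts. They hide the communication-stream and compute-stream concepts exposed by the current communication library. All communication is asynchronous by default.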
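As a rough illustration of the asynchronous-by-default idea in this description, the sketch below has a collective call return a task handle immediately, with the caller waiting only when the result is needed. Everything here is an assumption for illustration: `SketchProcessGroup` and `SketchTask` are hypothetical, the all-reduce is simulated inside one process with `std::async`, and no NCCL or stream handling is involved.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical handle for an in-flight collective (not the PR's class).
class SketchTask {
 public:
  explicit SketchTask(std::future<std::vector<float>> result)
      : result_(std::move(result)) {}

  // Non-blocking completion check.
  bool IsCompleted() const {
    return result_.wait_for(std::chrono::seconds(0)) ==
           std::future_status::ready;
  }

  // Blocks until the result is available. May be called only once,
  // because std::future::get() consumes the stored value.
  std::vector<float> Wait() { return result_.get(); }

 private:
  std::future<std::vector<float>> result_;
};

// Hypothetical process group that simulates all ranks in one process.
class SketchProcessGroup {
 public:
  // Starts an element-wise sum over the per-rank buffers and returns at once.
  // Assumes every buffer has the same length.
  std::shared_ptr<SketchTask> AllReduce(
      std::vector<std::vector<float>> per_rank) {
    auto fut = std::async(std::launch::async, [inputs = std::move(per_rank)] {
      std::vector<float> out(inputs.empty() ? 0 : inputs[0].size(), 0.0f);
      for (const auto& buf : inputs) {
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += buf[i];
      }
      return out;
    });
    return std::make_shared<SketchTask>(std::move(fut));
  }
};
```

In this sketch the caller never touches a stream: it receives a handle from `AllReduce`, can keep doing other work, and calls `Wait()` at the point where the reduced values are actually required.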
Overall design diagram:
TODO: