add all ccl op for Ascendrc #31437
Conversation
…cendrc merge hcom_destory and other hcom tests
…cendrc merge all reduce sum change
Ascendrc
Thanks for your contribution!
…o ascendrc merge elements_add_test
cmake/operators.cmake (Outdated)
@@ -19,7 +19,7 @@ function(op_library TARGET)
   set(MKLDNN_FILE)
   set(op_common_deps operator op_registry math_function layer common_infer_shape_functions)
   if (WITH_ASCEND_CL)
-    set(op_common_deps ${op_common_deps} npu_op_runner)
+    set(op_common_deps ${op_common_deps} npu_op_runner memory)
It seems it is not necessary.

Good comment!
@@ -30,10 +30,16 @@ if(WITH_XPU_BKCL)
 endif()

 if(WITH_ASCEND_CL)
-  set(COLLECTIVE_DEPS ${COLLECTIVE_DEPS} collective_helper)
+  set(COLLECTIVE_DEPS ${COLLECTIVE_DEPS} collective_helper memory memcpy)
Same as above.

Good comment!
 endif()

-set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COLLECTIVE_DEPS} PARENT_SCOPE)
+set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COLLECTIVE_DEPS} memory memcpy PARENT_SCOPE)
Same as above.

Good comment!
@@ -42,6 +42,11 @@ class CAllGatherOpMaker : public framework::OpProtoAndCheckerMaker {
     AddOutput("Out", "(Tensor) the allgather result");
     AddAttr<int>("ring_id", "(int default 0) communication ring id.")
         .SetDefault(0);
+#if defined(PADDLE_WITH_ASCEND_CL)
+#pragma message("tag")
Remove the #pragma, which is used for debugging.

Good comment!
@@ -0,0 +1,89 @@
+/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2019 -> 2021
#include <memory>

#if defined(PADDLE_WITH_ASCEND_CL)
No need.

Good comment!
namespace operators {

template <typename T>
class CAllGatherOpASCENDKernel : public framework::OpKernel<T> {
Suggest CAllGatherOpASCENDKernel -> AllGatherOpNPUKernel; remove the prefix "C", since it reads like the word "CAll".

We'd better keep this style, because the other ops under this path start with 'C'. Maybe it means these ops were implemented in C/C++.
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
#if defined(PADDLE_WITH_ASCEND_CL)
#pragma message("compile CAllGatherOpASCENDKernel")
Same as above.

Good comment!
    // const T* send_buff = in->data<T>();
    // T* recv_buff = out->data<T>();
Remove the unused code.

Good comment!
#else
    PADDLE_THROW(platform::errors::PreconditionNotMet(
        "PaddlePaddle should compile with GPU."));
GPU -> NPU

Good comment!
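A minimal sketch of the fixed fallback branch after this review, assuming the kernel keeps the same PADDLE_THROW pattern (only the device name in the message changes):

```cpp
#else
    // This kernel requires an Ascend build; name the right device in the error.
    PADDLE_THROW(platform::errors::PreconditionNotMet(
        "PaddlePaddle should compile with NPU."));
#endif
```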
cc_test(c_allgather_op_npu_test SRCS c_allgather_op_npu_test.cc DEPS op_registry c_broadcast_op c_allreduce_sum_op c_allgather_op c_reducescatter_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} memory memcpy ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
cc_test(send_v2_op_npu_test SRCS send_v2_op_npu_test.cc DEPS op_registry send_v2_op recv_v2_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
cc_test(recv_v2_op_npu_test SRCS recv_v2_op_npu_test.cc DEPS op_registry send_v2_op recv_v2_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
endif()
Maybe add a newline after the last line, and the missing-newline symbol will disappear.

Good! Simple code can be friendlier.
    std::string group = std::string(HCOM_GROUP_PREFIX) + std::to_string(ring_id);
    std::string tag = ctx.Attr<std::string>("tag");
    auto place = ctx.GetPlace();
    auto comm = platform::HCCLCommContext::Instance().Get(ring_id, place);
Suggest adding HCCLCommContext to NPUDeviceContext in the future.

Nice!
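A self-contained sketch of what that future refactor could look like; every type and name here (HcclCommHandle, NpuDeviceContextSketch, hccl_comm()) is a hypothetical stand-in, not Paddle's real API:

```cpp
#include <cstdint>

// Hypothetical stand-in types illustrating the suggestion: let the device
// context own the communicator, so kernels read it from the context instead
// of querying the HCCLCommContext::Instance() singleton by ring_id/place.
struct HcclCommHandle {
  std::int32_t ring_id;  // which communication ring this comm belongs to
};

class NpuDeviceContextSketch {
 public:
  explicit NpuDeviceContextSketch(HcclCommHandle comm) : comm_(comm) {}
  const HcclCommHandle& hccl_comm() const { return comm_; }  // kernel-facing accessor
 private:
  HcclCommHandle comm_;  // set once at comm-init time, owned by the context
};

int main() {
  NpuDeviceContextSketch dev_ctx(HcclCommHandle{0});
  return dev_ctx.hccl_comm().ring_id;  // a kernel would fetch the comm like this
}
```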
    out->mutable_data<T>(out_dims, place);

    int64_t send_numel = in->numel();
    void* send_buff = reinterpret_cast<void*>(const_cast<T*>(in->data<T>()));
Is const_cast necessary?

Just delete it.

We need it in some places, but most of the time we do not; I have deleted most of the unneeded casts.
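A minimal, runnable sketch of why the cast shows up at all, assuming (as the HCCL C interface used by this PR does) that the collective call takes a non-const void* even for the send buffer; hcom_all_gather_stub is an illustrative stand-in, not the real symbol:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a C collective API whose send-buffer parameter is a
// non-const void*. Illustrative only; not the real hcom_all_gather.
extern "C" int hcom_all_gather_stub(void* send_buff, void* recv_buff,
                                    std::uint64_t count) {
  (void)send_buff; (void)recv_buff; (void)count;
  return 0;
}

int main() {
  const std::vector<float> in(8, 1.0f);  // kernel inputs are logically const
  std::vector<float> out(8);
  // The non-const void* parameter is what forces the const_cast; wherever
  // the API accepts const void*, the cast can simply be deleted.
  return hcom_all_gather_stub(const_cast<float*>(in.data()), out.data(),
                              static_cast<std::uint64_t>(in.size()));
}
```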
USE_OP_DEVICE_KERNEL(c_allreduce_max, NPU);

template <typename T>
void PrintDebugInfo(std::string preStr, std::vector<T> &data){
Suggested change:
- void PrintDebugInfo(std::string preStr, std::vector<T> &data){
+ void PrintDebugInfo(const std::string& pre_str, const std::vector<T> &data){

So precise, so good!
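A minimal runnable sketch of the helper with the suggested const-reference signature; the body is an assumption (the test file's actual implementation isn't shown in this thread):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Debug helper with the reviewed signature: both parameters are taken by
// const reference, since the function only reads them.
template <typename T>
void PrintDebugInfo(const std::string& pre_str, const std::vector<T>& data) {
  std::ostringstream oss;
  for (const auto& v : data) oss << v << ",";  // join values into one log line
  std::cout << pre_str << " = " << oss.str() << std::endl;
}

int main() {
  PrintDebugInfo("input data", std::vector<float>{1.0f, 2.0f, 3.0f});
  return 0;
}
```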
<< ", tag is " << tag; | ||
|
||
PADDLE_ENFORCE_NPU_SUCCESS(platform::dynload::hcom_all_gather( | ||
tag.c_str(), send_buff, recv_buff, (u64)send_numel, dtype, |
Is u64 uint64_t?

It is defined by Huawei in paddle/fluid/platform/dynload/hcom_type.h as typedef unsigned long long u64;.
@@ -115,36 +117,47 @@ class CAllReduceOpASCENDKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
#if defined(PADDLE_WITH_ASCEND_CL)
+#define PRE_MALLOC_SIZE_BYTES 512
Please add some comments on this; otherwise, other RDs may get confused, since it is very tricky.

Sounds good!
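A sketch of the kind of comment being requested. The stated rationale is an assumption inferred from the surrounding code (tmp_numel = numel + pre_tmp_size * 2 in the diff below), not wording from the PR:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed rationale: scratch padding reserved at both ends of the temporary
// allreduce buffers, keeping the real payload away from allocation boundaries.
#define PRE_MALLOC_SIZE_BYTES 512

// Mirrors the kernel's sizing arithmetic shown in the diff below.
std::int64_t PaddedNumel(std::int64_t numel, std::size_t elem_size) {
  const auto pre_tmp_size =
      static_cast<std::int64_t>(PRE_MALLOC_SIZE_BYTES / elem_size);
  return numel + pre_tmp_size * 2;  // payload plus padding on each side
}

int main() { return PaddedNumel(1024, sizeof(float)) > 0 ? 0 : 1; }
```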
    int64_t tmp_numel = numel + pre_tmp_size * 2;

    paddle::framework::LoDTensor tmp_in, tmp_out;
    tmp_in.Resize({1, tmp_numel});
Suggested change:
- tmp_in.Resize({1, tmp_numel});
+ tmp_in.Resize({tmp_numel});

Yeah, maybe sometimes less is more.
    auto comm = paddle::platform::HCCLCommContext::Instance().Get(ring_id, place);

    aclrtStream stream = nullptr;
    auto dev_ctx = platform::DeviceContextPool::Instance().Get(place);
Better to get dev_ctx from the argument ctx.

Yes.
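A sketch of the suggested change; it assumes, as in other NPU kernels of this period, that ExecutionContext exposes the device context directly and that NPUDeviceContext carries the ACL stream:

```cpp
// Take the device context from the kernel's ExecutionContext instead of
// querying DeviceContextPool by place (the stream() accessor is assumed,
// matching how other NPU kernels obtain their aclrtStream).
auto& dev_ctx =
    ctx.template device_context<paddle::platform::NPUDeviceContext>();
aclrtStream stream = dev_ctx.stream();
```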
        npu_place, reinterpret_cast<void*>(const_cast<T*>(in->data<T>())),
        numel * sizeof(T),
        stream);
    dev_ctx->Wait();
Maybe this wait is not needed; you can test that.

Yeah, it runs more smoothly after removing it.
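A sketch of the copy with the Wait dropped; tmp_in_ptr is an illustrative name for the destination buffer, and the reasoning is that both the memcpy and the later collective are enqueued on the same aclrtStream, which already serializes them:

```cpp
// Stream order guarantees the copy completes before any later op enqueued
// on `stream` reads the buffer, so no host-side dev_ctx->Wait() is needed.
memory::Copy(npu_place, tmp_in_ptr,  // tmp_in_ptr: illustrative destination
             npu_place,
             reinterpret_cast<void*>(const_cast<T*>(in->data<T>())),
             numel * sizeof(T), stream);
// (previously: dev_ctx->Wait(); -- removed after testing, per this review)
```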
paddle/fluid/platform/npu_info.cc (Outdated)
@@ -268,6 +268,7 @@ class RecordedNPUMallocHelper {
  }

  NPUDeviceGuard guard(dev_id_);
+  // auto result = aclrtMalloc(ptr, size, ACL_MEM_MALLOC_NORMAL_ONLY);
Better to remove it.

Nice.
LGTM

LGTM. Please add Python tests in the next PR.
* add allreduce and broadcast without test (#31024)
* Refactor HCCLCommContext to be compatible with Paddle (#31359)
* [NPU] add npu kernel for communication op (#31437)
  * add allreduce and broadcast without test
  * add c_broadcast_test case
  * build c_comm_init and c_create_group operators
  * make the whole thing compile
  * add broadcast and init op test case but run failed
  * make unit test compile
  * fix broadcast test bug and change into hcom for ccl
  * change c_comm_init and c_create_group ops accordingly
  * make tests compile
  * transfer code to 27
  * compiled successfully in 28, but run failed
  * test broadcast in 28, but failed
  * make hcom primitives work
  * change hccl data type for base.h
  * fix broadcast bug
  * make attributes work
  * fix group name bug
  * add allreduce but test failed
  * allreduce bug for qiuliang
  * allreduce finished
  * add allgather and reducescatter
  * merge all op code
  * add allgather test
  * finish run all ccl op test exclude send/recv
  * all all op and test exclude send/recv
  * send_v2_npu.cc recv_v2_npiu.cc compiled
  * fix ccl core dump bug and test allgather, reducescatter, broadcast op
  * fix allreduce bug just for test
  * hcom send&recv test pass, without hcom_destroy
  * for qiuliang test
  * Ascend Send&Recv Test Pass
  * all op (ex send/recv) ok
  * fix bug
  * merge all ccl op
  * style merge to PaddlePaddle
  * merge style
  * new merge style
  * merge style 2
  * insert an empty at the end
  * disable ctest for hcom to pass ci
  Co-authored-by: void-main <voidmain1313113@gmail.com>
  Co-authored-by: f2hkop <f2huestc@outlook.com>
* Add auto-increasing tag id for Hcom OPs (#31702)
* add c_reduce_sum op (#31793)
* update Ascendrc hccl to 20.3 (#32126)
* fix merge code
* change cmake.txt1
* [NPU] Support npu kernel for c sync stream op (#31386)
  * sync stream npu op
  * add with_ascend_acl
  * update c++ unittest
  * compile all failed
  * try to pre commit
  * after pre commit
  * merge&compile&test hccl successfully!
  * fix code style
  * fix code style
  * fix bugs about hccl
  * fix some bugs
  * fix code style
  * fix style
  * fix style
  * fix
  * fixed
  * merge develop

Co-authored-by: lw921014 <liuwei921014@yeah.net>
Co-authored-by: Void Main <voidmain1313113@gmail.com>
Co-authored-by: f2hkop <f2huestc@outlook.com>
Co-authored-by: xiayanming <41795079@qq.com>
PR types
New features
PR changes
OPs
Describe
add ccl op