
add all ccl op for Ascendrc #31437

Merged

merged 46 commits into PaddlePaddle:ascendrc on Mar 8, 2021
Conversation

@lw921014 (Contributor) commented Mar 4, 2021

PR types

New features

PR changes

OPs

Describe

add ccl op

@paddle-bot-old (bot) commented Mar 4, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@@ -19,7 +19,7 @@ function(op_library TARGET)
 set(MKLDNN_FILE)
 set(op_common_deps operator op_registry math_function layer common_infer_shape_functions)
 if (WITH_ASCEND_CL)
-set(op_common_deps ${op_common_deps} npu_op_runner)
+set(op_common_deps ${op_common_deps} npu_op_runner memory)
Contributor

It seems it is not necessary

Contributor Author

good comment!

@@ -30,10 +30,16 @@ if(WITH_XPU_BKCL)
 endif()

 if(WITH_ASCEND_CL)
-set(COLLECTIVE_DEPS ${COLLECTIVE_DEPS} collective_helper)
+set(COLLECTIVE_DEPS ${COLLECTIVE_DEPS} collective_helper memory memcpy)
Contributor

Same as above.

Contributor Author

good comment!

 endif()

-set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COLLECTIVE_DEPS} PARENT_SCOPE)
+set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COLLECTIVE_DEPS} memory memcpy PARENT_SCOPE)
Contributor

Same as above.

Contributor Author

good comment!

@@ -42,6 +42,11 @@ class CAllGatherOpMaker : public framework::OpProtoAndCheckerMaker {
 AddOutput("Out", "(Tensor) the allgather result");
 AddAttr<int>("ring_id", "(int default 0) communication ring id.")
 .SetDefault(0);
+#if defined(PADDLE_WITH_ASCEND_CL)
+#pragma message("tag")
Contributor

Remove the #pragma, which is used for debugging.

Contributor Author

good comment!

@@ -0,0 +1,89 @@
+/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Contributor

2019 -> 2021


#include <memory>

#if defined(PADDLE_WITH_ASCEND_CL)
Contributor

No need.

Contributor Author

good comment!

namespace operators {

template <typename T>
class CAllGatherOpASCENDKernel : public framework::OpKernel<T> {
Contributor

Suggest CAllGatherOpASCENDKernel -> AllGatherOpNPUKernel, removing the prefix "C", since it reads like the word "CAll".

Contributor Author

We'd better keep this style, because the other ops under this path start with 'C'. Maybe it means these ops were implemented in C/C++.

public:
void Compute(const framework::ExecutionContext& ctx) const override {
#if defined(PADDLE_WITH_ASCEND_CL)
#pragma message("compile CAllGatherOpASCENDKernel")
Contributor

Same as above.

Contributor Author

good comment!

Comment on lines 52 to 53
// const T* send_buff = in->data<T>();
// T* recv_buff = out->data<T>();
Contributor

Remove unused code.

Contributor Author

good comment!


#else
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
Contributor

GPU -> NPU

Contributor Author

good comment!
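For reference, this is roughly how the fallback branch reads after the fix (only the device name in the message changes); the exact wording in the merged code may differ.

#else
  PADDLE_THROW(platform::errors::PreconditionNotMet(
      "PaddlePaddle should compile with NPU."));
#endif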

cc_test(c_allgather_op_npu_test SRCS c_allgather_op_npu_test.cc DEPS op_registry c_broadcast_op c_allreduce_sum_op c_allgather_op c_reducescatter_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} memory memcpy ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
cc_test(send_v2_op_npu_test SRCS send_v2_op_npu_test.cc DEPS op_registry send_v2_op recv_v2_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
cc_test(recv_v2_op_npu_test SRCS recv_v2_op_npu_test.cc DEPS op_registry send_v2_op recv_v2_op c_comm_init_hcom_op ${COLLECTIVE_DEPS} ascend_hccl dynamic_loader dynload_warpctc scope device_context enforce executor)
endif()
Contributor

Maybe add a newline after the last line, and the symbol will disappear.

Contributor Author

Good! Simple code can be more friendly.

std::string group = std::string(HCOM_GROUP_PREFIX) + std::to_string(ring_id);
std::string tag = ctx.Attr<std::string>("tag");
auto place = ctx.GetPlace();
auto comm = platform::HCCLCommContext::Instance().Get(ring_id, place);
Contributor

Suggest adding HCCLCommContext to NPUDeviceContext in the future.

Contributor Author

handsome!

out->mutable_data<T>(out_dims, place);

int64_t send_numel = in->numel();
void *send_buff = reinterpret_cast<void*>(const_cast<T*>(in->data<T>()));
Contributor

Is const_cast necessary?

Contributor Author

just delete it

Contributor Author

We need it in some places, but not most of the time; I have deleted most of the unneeded cases.

USE_OP_DEVICE_KERNEL(c_allreduce_max, NPU);

template<typename T>
void PrintDebugInfo(std::string preStr, std::vector<T> &data){
Contributor

Suggested change
-void PrintDebugInfo(std::string preStr, std::vector<T> &data){
+void PrintDebugInfo(const std::string& pre_str, const std::vector<T> &data){

Contributor Author

so precise and so good!
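For illustration, a self-contained sketch of the helper with the suggested const-reference signature; the body here is an assumption (only the signature appears in this thread) and simply prints the prefix followed by the values.

#include <iostream>
#include <string>
#include <vector>

// Hedged sketch of the debug helper after applying the suggested signature;
// the printing logic below is assumed, not taken from the PR.
template <typename T>
void PrintDebugInfo(const std::string& pre_str, const std::vector<T>& data) {
  std::cout << pre_str;
  for (const auto& value : data) {
    std::cout << " " << value;  // print each element after the prefix
  }
  std::cout << std::endl;
}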

<< ", tag is " << tag;

PADDLE_ENFORCE_NPU_SUCCESS(platform::dynload::hcom_all_gather(
tag.c_str(), send_buff, recv_buff, (u64)send_numel, dtype,
Contributor

u64 is uint64_t ?

Contributor Author

typedef unsigned long long u64;
It is defined by Huawei in paddle/fluid/platform/dynload/hcom_type.h.

@@ -115,36 +117,47 @@ class CAllReduceOpASCENDKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
#if defined(PADDLE_WITH_ASCEND_CL)
#define PRE_MALLOC_SIZE_BYTES 512
Contributor

Please add some comments on this; otherwise, other RDs may get confused about it, since it is very tricky.

Contributor Author

sounds good!
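As a hedged sketch of the kind of comment being requested (the rationale stated below is an assumption; the diff only shows that the temporary tensors are padded on both sides of the real payload):

// Assumed rationale, to be replaced with the real one when documenting:
// reserve PRE_MALLOC_SIZE_BYTES of extra space before and after the payload
// in the temporary buffers, so the collective call has slack on both sides
// of the user data.
#define PRE_MALLOC_SIZE_BYTES 512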

int64_t tmp_numel = numel + pre_tmp_size * 2;

paddle::framework::LoDTensor tmp_in, tmp_out;
tmp_in.Resize({1, tmp_numel});
Contributor

Suggested change
-tmp_in.Resize({1, tmp_numel});
+tmp_in.Resize({tmp_numel});

Contributor Author

yeah, maybe sometimes less is more.

auto comm = paddle::platform::HCCLCommContext::Instance().Get(ring_id, place);

aclrtStream stream = nullptr;
auto dev_ctx = platform::DeviceContextPool::Instance().Get(place);
Contributor

Better to get dev_ctx from the argument ctx.

Contributor Author

yes
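For illustration, a minimal sketch of the reviewer's suggestion, assuming the templated ExecutionContext accessor and that NPUDeviceContext exposes stream() on this branch; the merged code may differ in detail.

// Hedged sketch: obtain the device context from the kernel's ExecutionContext
// instead of going through DeviceContextPool (names assumed from the usual
// fluid operator API).
auto& dev_ctx =
    ctx.template device_context<paddle::platform::NPUDeviceContext>();
aclrtStream stream = dev_ctx.stream();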

npu_place, reinterpret_cast<void*>(const_cast<T*>(in->data<T>())),
numel * sizeof(T),
stream);
dev_ctx->Wait();
Contributor

Maybe this wait is not needed, you can test that.

Contributor Author

Yeah, it runs more smoothly after removing that.

@@ -268,6 +268,7 @@ class RecordedNPUMallocHelper {
 }

 NPUDeviceGuard guard(dev_id_);
+// auto result = aclrtMalloc(ptr, size, ACL_MEM_MALLOC_NORMAL_ONLY);
Contributor

Better remove it

Contributor Author

nice.

@zhiqiu (Contributor) left a comment

LGTM

@zhiqiu (Contributor) left a comment

LGTM.
Please add python tests in next PR.

@zhiqiu merged commit 15823bb into PaddlePaddle:ascendrc on Mar 8, 2021
frankwhzhang added a commit that referenced this pull request Apr 21, 2021
* add allreduce and broadcast without test (#31024)

add allreduce and broadcast without test

* Refactor HCCLCommContext to be compatible with Paddle (#31359)

Refactor HCCLCommContext to be compatible with Paddle (#31359)

* [NPU] add npu kernel for communication op (#31437)

* add allreduce and broadcast without test

* add c_broadcast_test case

* build c_comm_init and c_create_group operators

* make the whole thing compile

* add broadcast and init op test case but run failed

* make unit test compile

* fix broadcast test bug and change into hcom for ccl

* change c_comm_init and c_create_group ops accordingly

* make tests compile

* transfer code to 27

* compiled successfully in 28, but run failed

* test broadcast in 28, but failed

* make hcom primitives work

* change hccl data type for base.h

* fix broadcast bug

* make attributes work

* fix group name bug

* add allreduce but test failed

* allreduce bug for qiuliang

* allreduce finished

* add allgather and reducescatter

* merge all op code

* add allgather test

* finish run all ccl op test exclude send/recv

* all all op and test exclude send/recv

* send_v2_npu.cc recv_v2_npiu.cc compiled

* fix ccl core dump bug and test allgather, reducescatter, broadcast op

* fix allreduce bug just for test

* hcom send&recv test pass, without hcom_destroy

* for qiuliang test

* Ascend Send&Recv Test Pass

* all op (ex send/recv) ok

* fix bug

* merge all ccl op

* style merge to PaddlePaddle

* merge style

* new merge style

* merge style 2

* insert an empty at the end

* disable ctest for hcom to pass ci

Co-authored-by: void-main <voidmain1313113@gmail.com>
Co-authored-by: f2hkop <f2huestc@outlook.com>

* Add auto-increasing tag id for Hcom OPs (#31702)

* add c_reduce_sum op (#31793)

add c_reduce_sum op

* update Ascendrc hccl to 20.3 (#32126)

update Ascendrc hccl to 20.3 (#32126)

* fix merge code

* change cmake.txt1

* [NPU] Support npu kernel for c sync stream op (#31386)

* sync stream npu op

* add with_ascend_acl

* update c++ unittest

* compile all failed

* try to pre commit

* after pre commit

* merge&compile&test hccl successfully!

* fix code style

* fix code style

* fix bugs about hccl

* fix some bugs

* fix code style

* fix style

* fix style

* fix

* fixed

* merge develop

Co-authored-by: lw921014 <liuwei921014@yeah.net>
Co-authored-by: Void Main <voidmain1313113@gmail.com>
Co-authored-by: f2hkop <f2huestc@outlook.com>
Co-authored-by: xiayanming <41795079@qq.com>