
solved some npu bugs #32793

Merged 2 commits into PaddlePaddle:develop on May 13, 2021

Conversation

@Baibaifan (Contributor) commented May 7, 2021

PR types

Bug fixes

PR changes

Others

Describe

The shard_index API now supports two index data types, int32 and int64.
Input indices may have data type int64 or int32; their last dimension must be 1.
situation 1 (int64 indices):

    import paddle

    label = paddle.to_tensor([[16], [1]], "int64")
    shard_label = paddle.shard_index(input=label,
                                     index_num=20,
                                     nshards=2,
                                     shard_id=0)
    print(shard_label)

situation 2 (int32 indices):

    import paddle

    label = paddle.to_tensor([[16], [1]], "int32")
    shard_label = paddle.shard_index(input=label,
                                     index_num=20,
                                     nshards=2,
                                     shard_id=0)
    print(shard_label)
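
For readers unfamiliar with the op, here is a minimal pure-Python sketch of the remapping that shard_index performs, based on its documented semantics (the reference function name and the default ignore value of -1 are illustrative assumptions, not part of this PR):

    # Reference sketch only; the real op runs as a Paddle kernel on CPU/GPU/NPU.
    def shard_index_ref(indices, index_num, nshards, shard_id, ignore_value=-1):
        shard_size = (index_num + nshards - 1) // nshards
        out = []
        for (v,) in indices:                  # the last dimension must be 1
            if v // shard_size == shard_id:
                out.append([v % shard_size])  # local offset inside this shard
            else:
                out.append([ignore_value])
        return out

    # For the inputs above: shard_size = 10, so 16 falls in shard 1 (ignored
    # when shard_id=0) and 1 maps to local offset 1.
    print(shard_index_ref([[16], [1]], index_num=20, nshards=2, shard_id=0))
    # [[-1], [1]]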

solved some npu bugs

paddle-bot-old bot commented May 7, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@Baibaifan changed the title from "Mode some npu bugs" to "solved some npu bugs" on May 7, 2021

     for idx, op in enumerate(block.ops):
         if op.type == "check_finite_and_unscale":
             return idx

-    raise ValueError("check_finite_and_unscale does not exist in block")
+    if raise_error:
+        raise ValueError("check_finite_and_unscale does not exist in block")
Contributor:
the error message should be:
"amp is turn on but check_finite_and_unscale op does not exist in main block"

Contributor Author:
ok
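
Putting this thread together, a sketch of what the amended helper plausibly looks like after the review (the function name, the -1 fallback, and the exact message wording are assumptions drawn from the diff and the comments, not copied from the merged code):

    def get_check_finite_and_unscale_op_idx(block, raise_error=True):
        # Return the index of the first check_finite_and_unscale op in the
        # block; treat its absence as an error only when the caller asks,
        # i.e. when amp is actually enabled.
        for idx, op in enumerate(block.ops):
            if op.type == "check_finite_and_unscale":
                return idx
        if raise_error:
            raise ValueError(
                "amp is turned on but check_finite_and_unscale op does not "
                "exist in main block")
        return -1

The call site shown below then passes raise_error=self.user_defined_strategy.amp, so a run without AMP no longer aborts just because the op is absent.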

            accumulated_grad_names,
            core.op_proto_and_checker_maker.OpRole.Optimize,
            use_calc_stream=True)

            main_block, raise_error=self.user_defined_strategy.amp)
Contributor:
What is the reason for this modification? For the npu hang bug?

Contributor Author:
To solve npu hang bugs!

@JZ-LIANG (Contributor) previously approved these changes on May 11, 2021 and left a comment:
LGTM for Sharding

@@ -28,6 +28,7 @@ class CRecvOpASCENDKernel : public framework::OpKernel<T> {
   void Compute(const framework::ExecutionContext& ctx) const override {
 #if defined(PADDLE_WITH_ASCEND_CL)
     auto x = ctx.Output<framework::LoDTensor>("Out");
Contributor:
Suggested change
-    auto x = ctx.Output<framework::LoDTensor>("Out");
+    auto out = ctx.Output<framework::LoDTensor>("Out");

@@ -1467,7 +1467,7 @@ def linear(x, weight, bias=None, name=None):
         }
         tmp = helper.create_variable_for_type_inference(dtype)
         helper.append_op(
-            type='matmul', inputs=inputs, outputs={'Out': tmp}, attrs=attrs)
+            type='matmul_v2', inputs=inputs, outputs={'Out': tmp}, attrs=attrs)
Contributor:
Why change here? Is it consistent?
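
For context, a small usage sketch of the API whose static-graph lowering this line changes (shapes are arbitrary and chosen only for illustration; this is not taken from the PR's tests):

    import paddle
    import paddle.nn.functional as F

    x = paddle.randn([4, 8], dtype="float32")        # batch of 4, 8 features
    weight = paddle.randn([8, 16], dtype="float32")  # projects 8 -> 16
    bias = paddle.zeros([16], dtype="float32")

    # F.linear behaves the same in dynamic graph mode; the diff above only
    # changes which op the static-graph branch appends (matmul -> matmul_v2).
    y = F.linear(x, weight, bias)
    print(y.shape)  # [4, 16]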

raise ValueError("check_finite_and_unscale does not exist in block")
if raise_error:
raise ValueError(
"amp is turn on but check_finite_and_unscale op does not exist in main block"
Contributor:
turn -> turned

Comment on lines +1304 to +1308
if (type_ != "reshape2" && type_ != "reshape2_grad") {
original_tensor->Resize(original_dims);
}
Contributor:
As discussed, changing it here is only a temporary solution; please add some notes.

@@ -251,6 +251,8 @@ def set_use_var(self, var_list):
                 slot_var.type = "float"
             elif var.dtype == core.VarDesc.VarType.INT64:
                 slot_var.type = "uint64"
+            elif var.dtype == core.VarDesc.VarType.INT32:
+                slot_var.type = "uint32"
             else:
                 raise ValueError(
                     "Currently, fluid.dataset only supports dtype=float32 and dtype=int64"

Add INT32?
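
For illustration, a minimal sketch of a set_use_var call that would exercise the new INT32 branch (it assumes the usual fluid.DatasetFactory / InMemoryDataset path this method belongs to; the snippet is not taken from the PR's tests):

    import paddle
    import paddle.fluid as fluid

    paddle.enable_static()

    # An int32 variable can now be registered as a slot; previously only
    # float32 and int64 passed the dtype check in set_use_var.
    label32 = fluid.data(name="label32", shape=[-1, 1], dtype="int32")

    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    dataset.set_use_var([label32])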

"for pipeline parallelism.")
assert dev_type == "gpu" or dev_type == 'npu', (
"Now only gpu and npu devices are supported "
"for pipeline parallelism.")

How to deal with npu:all?
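
To make the question concrete, a toy sketch of the kind of placement-string parsing involved (the helper name and logic are illustrative only, not the actual pipeline code):

    def split_device(device):
        # A placement string is normally "type:id", e.g. "gpu:0" or "npu:3".
        dev_type, _, dev_id = device.partition(":")
        assert dev_type == "gpu" or dev_type == "npu", (
            "Now only gpu and npu devices are supported "
            "for pipeline parallelism.")
        return dev_type, dev_id

    print(split_device("npu:0"))    # ('npu', '0')
    # "npu:all" parses to dev_id == 'all', which is not an integer index --
    # the case this question is about.
    print(split_device("npu:all"))  # ('npu', 'all')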

@zhiqiu previously approved these changes on May 12, 2021
    HcclDataType dtype = platform::ToHCCLDataType(x->type());
    auto out = ctx.Output<framework::LoDTensor>("Out");
    out->mutable_data<T>(out->dims(), ctx.GetPlace());
    void* ptr = reinterpret_cast<void*>(const_cast<T*>(out->data<T>()));
Contributor:

Not important.

Suggested change
-    void* ptr = reinterpret_cast<void*>(const_cast<T*>(out->data<T>()));
+    void* ptr = out->data<void>();

@zhiqiu (Contributor) left a comment:
LGTM

@jzhang533 (Contributor) left a comment:
I see that shard_index carries a deprecation note. So for maintaining this API, should the fix be made in place under fluid, or does the API need to be migrated first?

@jzhang533 (Contributor) left a comment:
lgtm

@sandyhouse left a comment:
LGTM for pp

@wangxicoding merged commit c3ae0d4 into PaddlePaddle:develop on May 13, 2021