
solved some npu bugs #32793

Merged 2 commits into PaddlePaddle:develop on May 13, 2021

Conversation

@Baibaifan (Contributor) commented May 7, 2021

PR types

Bug fixes

PR changes

Others

Describe

The shard_index API now supports two index data types, int32 and int64.
Input indices may have data type int64 or int32; their last dimension must be 1.
situation 1 (int64 indices):

    import paddle

    label = paddle.to_tensor([[16], [1]], "int64")
    shard_label = paddle.shard_index(input=label,
                                     index_num=20,
                                     nshards=2,
                                     shard_id=0)
    print(shard_label)

situation 2 (int32 indices):

    import paddle

    label = paddle.to_tensor([[16], [1]], "int32")
    shard_label = paddle.shard_index(input=label,
                                     index_num=20,
                                     nshards=2,
                                     shard_id=0)
    print(shard_label)
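
For readers unfamiliar with the op, here is a minimal pure-Python sketch of the remapping that shard_index performs, based on its documented semantics (the reference function name and the default ignore value of -1 are illustrative assumptions, not part of this PR):

    # Reference sketch only; the real op runs as a Paddle kernel on CPU/GPU/NPU.
    def shard_index_ref(indices, index_num, nshards, shard_id, ignore_value=-1):
        shard_size = (index_num + nshards - 1) // nshards
        out = []
        for (v,) in indices:                  # the last dimension must be 1
            if v // shard_size == shard_id:
                out.append([v % shard_size])  # local offset inside this shard
            else:
                out.append([ignore_value])
        return out

    # For the inputs above: shard_size = 10, so 16 falls in shard 1 (ignored
    # when shard_id=0) and 1 maps to local offset 1.
    print(shard_index_ref([[16], [1]], index_num=20, nshards=2, shard_id=0))
    # [[-1], [1]]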

solved some npu bugs

paddle-bot-old bot commented May 7, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@Baibaifan changed the title from "Mode some npu bugs" to "solved some npu bugs" on May 7, 2021

     for idx, op in enumerate(block.ops):
         if op.type == "check_finite_and_unscale":
             return idx

-    raise ValueError("check_finite_and_unscale does not exist in block")
+    if raise_error:
+        raise ValueError("check_finite_and_unscale does not exist in block")
Contributor:
the error message should be:
"amp is turn on but check_finite_and_unscale op does not exist in main block"

Contributor Author:
ok
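
Putting this thread together, a sketch of what the amended helper plausibly looks like after the review (the function name, the -1 fallback, and the exact message wording are assumptions drawn from the diff and the comments, not copied from the merged code):

    def get_check_finite_and_unscale_op_idx(block, raise_error=True):
        # Return the index of the first check_finite_and_unscale op in the
        # block; treat its absence as an error only when the caller asks,
        # i.e. when amp is actually enabled.
        for idx, op in enumerate(block.ops):
            if op.type == "check_finite_and_unscale":
                return idx
        if raise_error:
            raise ValueError(
                "amp is turned on but check_finite_and_unscale op does not "
                "exist in main block")
        return -1

The call site shown below then passes raise_error=self.user_defined_strategy.amp, so a run without AMP no longer aborts just because the op is absent.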

            accumulated_grad_names,
            core.op_proto_and_checker_maker.OpRole.Optimize,
            use_calc_stream=True)

            main_block, raise_error=self.user_defined_strategy.amp)
Contributor:
What is the reason for this modification? For the npu hang bug?

Contributor Author:
To solve npu hang bugs!

@JZ-LIANG (Contributor) previously approved these changes on May 11, 2021 and left a comment:
LGTM for Sharding

@@ -28,6 +28,7 @@ class CRecvOpASCENDKernel : public framework::OpKernel<T> {
   void Compute(const framework::ExecutionContext& ctx) const override {
 #if defined(PADDLE_WITH_ASCEND_CL)
     auto x = ctx.Output<framework::LoDTensor>("Out");
Contributor:
Suggested change
-    auto x = ctx.Output<framework::LoDTensor>("Out");
+    auto out = ctx.Output<framework::LoDTensor>("Out");

@@ -1467,7 +1467,7 @@ def linear(x, weight, bias=None, name=None):
         }
         tmp = helper.create_variable_for_type_inference(dtype)
         helper.append_op(
-            type='matmul', inputs=inputs, outputs={'Out': tmp}, attrs=attrs)
+            type='matmul_v2', inputs=inputs, outputs={'Out': tmp}, attrs=attrs)
Contributor:
Why change here? Is it consistent?
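
For context, a small usage sketch of the API whose static-graph lowering this line changes (shapes are arbitrary and chosen only for illustration; this is not taken from the PR's tests):

    import paddle
    import paddle.nn.functional as F

    x = paddle.randn([4, 8], dtype="float32")        # batch of 4, 8 features
    weight = paddle.randn([8, 16], dtype="float32")  # projects 8 -> 16
    bias = paddle.zeros([16], dtype="float32")

    # F.linear behaves the same in dynamic graph mode; the diff above only
    # changes which op the static-graph branch appends (matmul -> matmul_v2).
    y = F.linear(x, weight, bias)
    print(y.shape)  # [4, 16]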

raise ValueError("check_finite_and_unscale does not exist in block")
if raise_error:
raise ValueError(
"amp is turn on but check_finite_and_unscale op does not exist in main block"
Contributor:
turn -> turned

Comment on lines +1304 to +1308
if (type_ != "reshape2" && type_ != "reshape2_grad") {
original_tensor->Resize(original_dims);
}
Contributor:
As discussed, changing it here is only a temporary solution; please add some notes.

@@ -251,6 +251,8 @@ def set_use_var(self, var_list):
                 slot_var.type = "float"
             elif var.dtype == core.VarDesc.VarType.INT64:
                 slot_var.type = "uint64"
+            elif var.dtype == core.VarDesc.VarType.INT32:
+                slot_var.type = "uint32"
             else:
                 raise ValueError(
                     "Currently, fluid.dataset only supports dtype=float32 and dtype=int64"

Add INT32?
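
For illustration, a minimal sketch of a set_use_var call that would exercise the new INT32 branch (it assumes the usual fluid.DatasetFactory / InMemoryDataset path this method belongs to; the snippet is not taken from the PR's tests):

    import paddle
    import paddle.fluid as fluid

    paddle.enable_static()

    # An int32 variable can now be registered as a slot; previously only
    # float32 and int64 passed the dtype check in set_use_var.
    label32 = fluid.data(name="label32", shape=[-1, 1], dtype="int32")

    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    dataset.set_use_var([label32])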

"for pipeline parallelism.")
assert dev_type == "gpu" or dev_type == 'npu', (
"Now only gpu and npu devices are supported "
"for pipeline parallelism.")

How to deal with npu:all?
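
To make the question concrete, a toy sketch of the kind of placement-string parsing involved (the helper name and logic are illustrative only, not the actual pipeline code):

    def split_device(device):
        # A placement string is normally "type:id", e.g. "gpu:0" or "npu:3".
        dev_type, _, dev_id = device.partition(":")
        assert dev_type == "gpu" or dev_type == "npu", (
            "Now only gpu and npu devices are supported "
            "for pipeline parallelism.")
        return dev_type, dev_id

    print(split_device("npu:0"))    # ('npu', '0')
    # "npu:all" parses to dev_id == 'all', which is not an integer index --
    # the case this question is about.
    print(split_device("npu:all"))  # ('npu', 'all')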

@zhiqiu previously approved these changes on May 12, 2021
    HcclDataType dtype = platform::ToHCCLDataType(x->type());
    auto out = ctx.Output<framework::LoDTensor>("Out");
    out->mutable_data<T>(out->dims(), ctx.GetPlace());
    void* ptr = reinterpret_cast<void*>(const_cast<T*>(out->data<T>()));
Contributor:

Not important.

Suggested change
-    void* ptr = reinterpret_cast<void*>(const_cast<T*>(out->data<T>()));
+    void* ptr = out->data<void>();

@zhiqiu (Contributor) left a comment:
LGTM

@jzhang533 (Contributor) left a comment:
I see that shard_index carries a deprecation note. So for maintaining this API, should the fix be made in place under fluid, or does the API need to be migrated first?

@jzhang533 (Contributor) left a comment:
lgtm

@sandyhouse left a comment:
LGTM for pp

@wangxicoding merged commit c3ae0d4 into PaddlePaddle:develop on May 13, 2021