[oneDNN] Fix to issue #34554 #34623

Merged
merged 18 commits into from
Aug 11, 2021

Conversation

Contributor

@jczaja jczaja commented Aug 4, 2021

PR types

Bug fixes

PR changes

OPs

Describe

This fixes #34554, where the affected models use inplace execution. Inplace execution does not work well with the caching mechanism, so this PR disables caching for the inplace-capable ops: softmax, elementwise_*** and activation.
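
The description does not spell out how caching and inplace execution collide, so the following is only a toy illustration of one way a cache can go stale once an op is run inplace. Every name here (`ToyHandler`, `ToyCache`, `RunDoubleCached`, `RunDoubleNoCache`) is hypothetical and none of it is taken from Paddle or oneDNN.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a cached handler: it remembers the buffers it
// was created with. This is NOT a Paddle or oneDNN type.
struct ToyHandler {
  const float* src;
  float* dst;
};

// Hypothetical cache keyed only by a string (for example, a shape).
using ToyCache = std::map<std::string, ToyHandler>;

// Doubles every element, reusing a cached handler when the key matches.
void RunDoubleCached(ToyCache& cache, const std::string& key,
                     const float* src, float* dst, std::size_t n) {
  auto it = cache.find(key);
  if (it == cache.end()) {
    it = cache.emplace(key, ToyHandler{src, dst}).first;
  }
  // Problem being illustrated: on a cache hit the buffers stored at creation
  // time are used, not the buffers passed to this call.
  for (std::size_t i = 0; i < n; ++i) {
    it->second.dst[i] = 2.0f * it->second.src[i];
  }
}

// Same operation without a cache: always works on the buffers passed in,
// so src == dst (inplace) is handled correctly.
void RunDoubleNoCache(const float* src, float* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = 2.0f * src[i];
  }
}
```

In this sketch, a first out-of-place call populates the cache, and a later inplace call with the same key silently operates on the old buffers. Dropping the cache for such ops avoids that class of problem at the cost of rebuilding the handler on every call.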

@jczaja jczaja changed the title from "[WIP] Fix to issue #34554" to "[oneDNN] Fix to issue #34554" on Aug 5, 2021
Contributor Author

jczaja commented Aug 9, 2021

@chenwhql Could you please approve PR-CI-APPROVAL? Its failure is a false positive: the mechanism that checks PADDLE_ENFORCE_** only looks at added lines of code. If we modify an existing PADDLE_ENFORCE_** call that spans multiple lines, the CI mechanism sees only part of the call, and the pattern check for a VALID call may fail.

@jczaja jczaja requested a review from wozna August 9, 2021 16:36
@lidanqing-intel
Contributor

@OliverLPH Could you please review it?

Contributor

@lidanqing-intel lidanqing-intel left a comment

LGTM.

PADDLE_ENFORCE_NOT_NULL(
bwd_w_pd_,
platform::errors::Unavailable(
"Error: BWD_W_PD should be set when getting BWD grad of weights."));
Contributor

The "Error:" prefix here is unnecessary. The same applies to the other similar cases.

Contributor Author

ok


std::shared_ptr<TBackward_params> AcquireBackwardWeightsPrimitive() {
PADDLE_ENFORCE_NOT_NULL(bwd_w_pd_, platform::errors::Unavailable(
"Error: BWD_PD should be set when "
Contributor

I am not sure about this one. If a user sees this error message, what changes do they need to make? Would the user understand BWD_PD?

Contributor Author

The user of this API is a developer of a oneDNN PaddlePaddle operator. The error should be understandable to them, as it implies that some other calls of this API must be made before this one when implementing an operator.

mkldnn_engine, cpu_place) {
PADDLE_ENFORCE_EQ(
out_grad->dims(), in_x_grad->dims(),
platform::errors::InvalidArgument("The shape of softmax_grad's input "
Contributor

We recommend also reporting the expected value and the actual value in the error message.

Contributor Author

ok
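
As a rough sketch of the reviewer's suggestion, an error message for a shape mismatch can carry both shapes. The helper names (`DimsToString`, `ShapeMismatchMessage`) and the message wording below are assumptions made for illustration. They are not Paddle APIs and this is not the text that was merged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Formats a shape such as {2, 3} as "[2, 3]". Hypothetical helper.
std::string DimsToString(const std::vector<std::int64_t>& dims) {
  std::ostringstream os;
  os << "[";
  for (std::size_t i = 0; i < dims.size(); ++i) {
    if (i > 0) os << ", ";
    os << dims[i];
  }
  os << "]";
  return os.str();
}

// Builds a message that names both the expected and the received shape, so
// the reader can see what differs without attaching a debugger.
std::string ShapeMismatchMessage(const std::vector<std::int64_t>& expected,
                                 const std::vector<std::int64_t>& actual) {
  return "Shapes must match: expected " + DimsToString(expected) +
         ", but received " + DimsToString(actual) + ".";
}
```

A string built this way could be passed as the message of an enforce-style check, which is the pattern the reviewer asks for here and in the layout check that follows.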

: platform::MKLDNNHandlerNoCachingT<T, dnnl::binary>(engine, cpu_place) {
PADDLE_ENFORCE_EQ(
x->layout(), DataLayout::kMKLDNN,
platform::errors::InvalidArgument("Wrong layout set for X tensor."));
Contributor

We recommend giving the expected value and the actual value.

Contributor Author

ok

@jczaja jczaja requested a review from sfraczek August 10, 2021 11:14
Comment on lines 56 to 59
// For Inplace src and and dst are the same memory object
// (jczaja) UT mechanics is testing inplace for this op
// regardless shapes, which is wrong when X is to be broadcasted as output
// is of bigger shape that X.
Contributor

Could you clarify this comment, please?

Contributor Author

Sure. The comment has been updated.
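
The point of the code comment under review can be sketched as follows: if X has to be broadcast, the output is larger than X, so X's buffer cannot hold the result and inplace execution on X is invalid. This sketch assumes NumPy-style broadcasting with trailing-dimension alignment and compatible shapes, which is a simplification and not necessarily Paddle's rule. `Shape`, `BroadcastShape` and `CanRunInplaceOnX` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Computes the broadcast shape of x and y, aligning dimensions from the
// right. Assumes the shapes are compatible (each aligned pair of dimensions
// is equal or one of them is 1); no validation is done in this sketch.
Shape BroadcastShape(const Shape& x, const Shape& y) {
  const std::size_t rank = std::max(x.size(), y.size());
  Shape out(rank, 1);
  for (std::size_t i = 0; i < rank; ++i) {
    // Missing leading dimensions are treated as 1.
    const std::int64_t dx = i < x.size() ? x[x.size() - 1 - i] : 1;
    const std::int64_t dy = i < y.size() ? y[y.size() - 1 - i] : 1;
    out[rank - 1 - i] = std::max(dx, dy);
  }
  return out;
}

// Inplace on X is only meaningful when the output has exactly X's shape,
// that is, when X itself does not need to be broadcast.
bool CanRunInplaceOnX(const Shape& x, const Shape& y) {
  return BroadcastShape(x, y) == x;
}
```

For example, with X of shape [2, 3] and Y of shape [3] the output fits in X, while with X of shape [3] and Y of shape [2, 3] it does not.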

Contributor

@sfraczek sfraczek left a comment

LGTM

Contributor Author

jczaja commented Aug 10, 2021

@chenwhql Please continue your review

Contributor

@chenwhql chenwhql left a comment

LGTM for PADDLE_ENFORCE

Contributor Author

jczaja commented Aug 10, 2021

@chenwhql Please approve PR-CI-APPROVAL

PR-CI-APPROVAL is failing because I modified the first and third lines of this PADDLE_ENFORCE_EQ call:
PADDLE_ENFORCE_EQ(ct.Analyze(9), true,
platform::errors::InvalidArgument(
"Invalid number of cached oneDNN objects"));

The problem is that the script evaluating this change receives only the modified lines, so it will not receive:
platform::errors::InvalidArgument(
The script that checks whether the PADDLE_ENFORCE call is correct then fails, because it cannot find the missing line.
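
The false positive described above can be reproduced with a small sketch. The check below is a hypothetical simplification (the real CI script is not shown in this conversation): it accepts a PADDLE_ENFORCE_* call only if the text it is given also contains an error type from `platform::errors::`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical line-based check, NOT the actual CI script: any text that
// mentions PADDLE_ENFORCE_ must also mention a platform::errors:: type.
bool LooksLikeValidEnforce(const std::vector<std::string>& lines) {
  bool has_enforce = false;
  bool has_error_type = false;
  for (const std::string& line : lines) {
    if (line.find("PADDLE_ENFORCE_") != std::string::npos) {
      has_enforce = true;
    }
    if (line.find("platform::errors::") != std::string::npos) {
      has_error_type = true;
    }
  }
  // Text without any PADDLE_ENFORCE_ call has nothing to validate.
  return !has_enforce || has_error_type;
}
```

Fed the full three-line call, such a check passes. Fed only the first and third lines, which is what a diff of added lines would contain here, it fails even though the code in the file is well formed.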

Contributor

@lidanqing-intel lidanqing-intel left a comment

LGTM

Contributor

OliverLPH commented Aug 11, 2021

Hi @jczaja, I compiled this PR and tested #34554; some models still crash.

When test_groups equals 3, 4, 5, 8 or 9, a Segmentation Fault occurs:

./build/model_test --single_thread=false --single_instance=false --test_groups="3"

Could you please re-test these models in your environment?

@jczaja jczaja merged commit 0a5c99e into PaddlePaddle:develop Aug 11, 2021
@jczaja jczaja deleted the prv-34554-fix branch August 11, 2021 12:49
Contributor Author

jczaja commented Aug 11, 2021

@OliverLPH Hi, I will test the given command line and fix any crash I find.

@chenwhql
Contributor

@jczaja This commit causes the Paddle build to fail. CI may have passed because of ccache, so we are reverting this PR temporarily.

chenwhql added a commit that referenced this pull request Aug 12, 2021
lidanqing-intel added a commit to lidanqing-intel/Paddle that referenced this pull request Aug 12, 2021
Contributor Author

jczaja commented Aug 12, 2021

@chenwhql Could you please share a full log of this build that failed?

@lidanqing-intel
Contributor

@jczaja @chenwhql Hi, I have the log and will share it with you, @jczaja.
