CPU inference model fails with an error after enabling MKLDNN #25537

Closed
ggyuan opened this issue Jul 15, 2020 · 14 comments · Fixed by #27546

@ggyuan

ggyuan commented Jul 15, 2020

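  • Version and environment information:
       1) PaddlePaddle version: 1.8.1
       2) CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, MKLDNN
       3) GPU: none
       4) System environment: CentOS release 6.3 (Final), Python 2.7.15

  • Reproduction information:
    Raised internally at Baidu. For the reproduction code and model, please contact guguiyuan on Hi.

  • Problem description:
    Models that use the pool2d op fail with an error when MKLDNN is enabled; models that do not use it run normally.
    The error message is as follows:
    image
    image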

@yaoxuefeng6
Contributor

yaoxuefeng6 commented Jul 15, 2020

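It looks like the error is raised in the elementwise op: the dimensions of its two inputs do not match. As the error message says, the two shapes must be exactly equal or broadcastable. You can use the fluid.layers.Print() API to print one of the outputs and check whether it matches expectations.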
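
The shape rule mentioned above can be sketched in plain Python. This is only an illustration under an assumption: it uses NumPy-style trailing-dimension broadcasting, and `broadcastable` is a hypothetical helper, not a Paddle API. Paddle 1.8's elementwise ops also take an `axis` attribute, which this sketch does not model.

```python
def broadcastable(shape_x, shape_y):
    """Return True if two shapes are equal or broadcastable.

    Assumption: NumPy-style rule, comparing dimensions from the last
    axis backwards; a dimension of 1 stretches to match the other side.
    """
    for dx, dy in zip(reversed(shape_x), reversed(shape_y)):
        if dx != dy and dx != 1 and dy != 1:
            return False
    # Any leading dimensions left over on the longer shape are fine.
    return True


# Illustrative shapes only; they are not taken from the reported model.
print(broadcastable((8, 16, 4, 4), (8, 16, 4, 4)))  # True: identical shapes
print(broadcastable((8, 16, 4, 4), (16, 1, 1)))     # True: size-1 dims stretch
print(broadcastable((8, 16, 4, 4), (8, 4, 4, 16)))  # False: 4 vs 16 on the last axis
```

Running such a check on the two input shapes printed at the failing op would show quickly whether the inputs themselves disagree.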

@luotao1
Contributor

luotao1 commented Jul 22, 2020

@jczaja Could you help look into this issue? The model, data, and code have already been emailed to @lidanqing-intel.

@jczaja
Contributor

jczaja commented Jul 22, 2020

@luotao1 Please forward the model and code to me, as @lidanqing-intel is out of the office.

@luotao1
Contributor

luotao1 commented Aug 6, 2020

@lidanqing-intel Could this issue be fixed in 1.8.5 (next month)?

@lidanqing-intel
Contributor

It is related to the fact that some tensors have data_layout set to NHWC, while the model is NCHW. Status: investigating.
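
To illustrate why the two layouts cannot be mixed silently, here is a minimal sketch in plain Python. The helper names and the sizes are hypothetical and chosen only for illustration; nothing here is Paddle or oneDNN code. It shows that the same logical element sits at a different linear offset in each layout, and that the shape tuple is permuted.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of logical element (n, c, h, w) in an NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset of the same logical element in an NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_shape_to_nhwc(shape):
    """Permute a 4-D NCHW shape tuple into NHWC order."""
    n, c, h, w = shape
    return (n, h, w, c)


# Illustrative sizes only: 3 channels, 2x2 spatial extent.
C, H, W = 3, 2, 2
print(nchw_shape_to_nhwc((1, C, H, W)))  # (1, 2, 2, 3)
print(offset_nchw(0, 1, 0, 1, C, H, W))  # 5
print(offset_nhwc(0, 1, 0, 1, C, H, W))  # 4
```

An op that reads a buffer using the wrong layout therefore sees both a different shape and differently ordered data.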

@jczaja
Contributor

jczaja commented Aug 10, 2020

@luotao1 I'm sorry for the late response; I was away from the office. I have just resumed investigating this problem, and we will do our best to have it fixed.

@jczaja
Contributor

jczaja commented Aug 11, 2020

@luotao1 I would like to share some findings and ask for advice. The crash happens because this model has pool2d ops, and those ops have an attribute, data_format, which indicates whether the model works on NCHW or NHWC data. In the past it was agreed that either all data_format attributes are set to NCHW or all are set to NHWC; there should be no scenario where some operators work on data in NCHW and others on data arranged in NHWC. The problem is that in this model the pool2d ops have different data_format values. For example, pool2d(id=139) has NHWC:
faulty_pool
While pool2d(id=134) has data_format set to kNCHW:
baidu_ok

The PaddlePaddle oneDNN integration does not support the situation where some operators work on NCHW and others on NHWC. Is it intentional that two different pool2d ops use different data_format values?
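
The inconsistency described above can be illustrated with a small check in plain Python. This is a sketch only: the list of dictionaries is a hypothetical, simplified stand-in for the model's ops, not Paddle's real program format, and `group_by_data_format` is an illustrative helper. The two pool2d ids and their formats come from the findings above; the third entry is a made-up placeholder for an op without the attribute.

```python
from collections import defaultdict


def group_by_data_format(ops):
    """Map each data_format value to the ids of the ops that declare it."""
    groups = defaultdict(list)
    for op in ops:
        fmt = op.get("data_format")  # some ops carry no such attribute
        if fmt is not None:
            groups[fmt].append(op["id"])
    return dict(groups)


# Hypothetical, simplified op descriptions.
ops = [
    {"id": 134, "type": "pool2d", "data_format": "NCHW"},
    {"id": 139, "type": "pool2d", "data_format": "NHWC"},
    {"id": 999, "type": "placeholder_op"},  # invented entry, no data_format
]

groups = group_by_data_format(ops)
if len(groups) > 1:
    print("mixed data_format values:", groups)
```

More than one key in the result corresponds to the unsupported situation: ops in the same model declaring different layouts.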

@luotao1
Contributor

luotao1 commented Aug 12, 2020

Is this intentional that two different pool2d ops are using different data_format

I discussed this with @phlrain: it is not reasonable for two different pool2d ops to use different data_format values. @phlrain and @wanghuancoder will first look into the training logic.

Thanks to @jczaja for the analysis!

@luotao1
Contributor

luotao1 commented Sep 14, 2020

@jczaja @lidanqing-intel We made the two pool2d ops use the same data_format and trained a new model for inference.
image
Inference succeeds with MKL but fails with MKLDNN. The error is:
image
I will email the new inference model to you.

@jczaja
Contributor

jczaja commented Sep 17, 2020

@luotao1 I have a question regarding this situation. The model is to be executed on NHWC input data, as the pool2d ops have data_format=kNHWC set. The other ops in the model do not have a data_format attribute, so they should work the same way whether the data is NCHW or NHWC. The thing is that this model has a matmul op. Matmul does not have data_format, so it should work properly regardless of whether NCHW or NHWC is used. Could you please confirm that matmul should NOT have data_format and that its implementation will work properly regardless of the NCHW or NHWC arrangement used?

@luotao1
Contributor

luotao1 commented Sep 22, 2020

Could you please confirm that matmul should NOT have data_format and that its implementation will work properly regardless of the NCHW or NHWC arrangement used

Yes. matmul should not have a data_format attribute, and its implementation will work properly regardless of the data_format used.

jczaja added a commit to jczaja/Paddle that referenced this issue Sep 24, 2020
@jczaja
Contributor

jczaja commented Sep 24, 2020

@luotao1 I have made a fix (on develop) for this issue: #27546. Please test it and let us know whether it works for you. Also, please tell me if you need the fix merged into 1.8 as well.

@luotao1
Contributor

luotao1 commented Sep 24, 2020

@OliverLPH Could you help test it?
@jczaja We tagged 1.8.5 three hours ago.

@OliverLPH
Contributor

@jczaja @luotao1
I have verified that #27546 fixes this issue. Please merge this PR.

luotao1 pushed a commit that referenced this issue Oct 1, 2020
* - candidate fix to issue #25537

test=develop

* - UT for transpose NHWC

test=develop