
Deploying models with multi-thread will raise MKLDNN error #34554

Closed · OliverLPH opened this issue Aug 2, 2021 · 31 comments

@OliverLPH (Contributor)

System information
- PaddlePaddle version: develop (393a0b1) or release/2.1
- CPU: MKLDNN v2.2.1
- GPU: None
- OS Platform: Ubuntu 16.04
- Python version: None
Note: you can get most of this information by running summary_env.py.
To Reproduce
The reproduction procedure is the same as in #31992.

cmake command:

cmake .. -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_INFERENCE_API_TEST=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_AVX=ON -DWITH_DISTRIBUTE=OFF -DWITH_STRIP=ON -DWITH_PYTHON=OFF

Describe your current behavior

  • The first test
    Run ./build/model_test --single_thread=true --single_instance=false --test_groups="1"; it does not cause an error.

  • The second test
    Run ./build/model_test --single_thread=true --single_instance=true --test_groups="1"; it does not cause an error.

  • The third test
    Run ./build/model_test --single_thread=false --single_instance=true --test_groups="1"; it raises an error as shown below:
    (error screenshot)

This error might be caused by https://github.com/PaddlePaddle/Paddle/pull/33549/files#diff-3363ff47a1f22a111386eeff77a8572f6618d7d0602ffe71e92ecf8682faa84eR2343
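For context, a minimal sketch of the multi-threaded inference pattern this test exercises, using the public paddle_infer C++ API; the model path, input shape, and thread count are illustrative assumptions, not the actual model_test.cc:

#include <thread>
#include <vector>
#include "paddle_inference_api.h"  // shipped with the paddle_inference package

int main() {
  paddle_infer::Config config;
  config.SetModel("./model_dir");  // illustrative model path
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);

  // One main predictor; each worker thread runs on its own Clone().
  auto main_predictor = paddle_infer::CreatePredictor(config);

  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {  // 4 worker threads is an assumption
    workers.emplace_back([main_predictor] {
      auto predictor = main_predictor->Clone();
      auto input = predictor->GetInputHandle(predictor->GetInputNames()[0]);
      std::vector<float> data(1 * 3 * 224 * 224, 0.0f);  // dummy input
      input->Reshape({1, 3, 224, 224});
      input->CopyFromCpu(data.data());
      predictor->Run();
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}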


@paddle-bot-old

paddle-bot-old bot commented Aug 2, 2021

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment and version info, and error messages. You may also check the official API docs, FAQ, issue history, and the AI community for answers. Have a nice day!

jczaja added a commit that referenced this issue Aug 11, 2021
* Added softmax without caching
* Binary is no longer manually cached
* Activation onednn caching removed
* Removed manual caching of activation
* Modified UT
* Fixes to building and to the UT, a faulty-UT workaround, and an approval workaround
* Fixes after several rounds of review, compilation fixes, and lint fixes
@lidanqing-intel (Contributor)

Hi @OliverLPH, could you gather all the models used in this test and upload them to dubox? When Juncai filed this issue last time, only two models were provided and the rest were commented out in the code, so Jacek only has two models and cannot test test_groups 3~9.
Since there are several models and the total size is probably large, please upload them directly to dubox and share the link, so the Poland team can download them directly and I don't have to download, re-upload, and transfer them. Thanks a lot!

@baoachun (Contributor)

@jczaja @lidanqing-intel Hi, I have re-executed the test program, and I found that dnnl errors still occur when config.mkldnn_enabled() is 0.
(two screenshots of the dnnl errors)
Here is the log: without_mkldnn.log.

@lidanqing-intel (Contributor)

Baoachun: with MKLDNN turned on, some models still crash (core dump), but no dnnl::error-type errors show up.
Jacek: with MKLDNN turned on, all models work well with no crash. Only when MKLDNN is off do some int8 models fail, because int8 models should not run with MKLDNN off.
@lidanqing-intel to investigate the gap.

@baoachun (Contributor)

> Baoachun: with MKLDNN turned on, some models still crash (core dump), but no dnnl::error-type errors show up.
> Jacek: with MKLDNN turned on, all models work well with no crash. Only when MKLDNN is off do some int8 models fail, because int8 models should not run with MKLDNN off.
> @lidanqing-intel to investigate the gap.

This is the crash log when MKLDNN is turned on:
Link: https://dubox.com/s/1-I4LjktKo2n1r0ieS91Skg  Password: cbak

@lidanqing-intel (Contributor)

lidanqing-intel commented Aug 25, 2021

Hi, I have tested with the newest develop (f609ca379173de0be1e288a1832fd074b6c61587).

Note 1: if you turn on MKLDNN, you should also turn on IR optimization, as noted in the running log. So to enable MKLDNN for all 10 models, the last two params should both be 1.

Note 2: there is a bug in the test source code: '<' was mistakenly written as ','. You need to change line 318
From:

 for (int i = 0; i , test_model_params.size(); i++) {

To:

 for (int i = 0; i < test_model_params.size(); i++) {

The following commands all passed:
./build/model_test --single_thread=true --single_instance=false --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="-1"

@baoachun you can test now, thanks!

@lidanqing-intel (Contributor)

lidanqing-intel commented Aug 26, 2021

@OliverLPH on my side this issue no longer reproduces.
There is a bug in the test source code: '<' was mistakenly written as ','. You need to change model_test.cc line 318
From:

 for (int i = 0; i , test_model_params.size(); i++) {

To:

 for (int i = 0; i < test_model_params.size(); i++) {

The following commands all passed:
./build/model_test --single_thread=true --single_instance=false --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="-1"

lidanqing-intel added this to the v2.2 milestone Sep 6, 2021
@jczaja (Contributor)

jczaja commented Sep 14, 2021

@OliverLPH I got a core dump from you. To use it I need to know the SHA (commit ID), the exact cmake flags, and the compiler version.

Please also tell me whether the core dump is for MKLDNN enabled (enable_mkldnn() 0 or 1)?

@OliverLPH (Contributor, Author)

@jczaja hi, the core dump was created with the following version, and it is for MKLDNN enabled (mkldnn=1).
commit: 0043fa8
version.txt:

GIT COMMIT ID: 0043fa8c2b36f63152797fe08fcfe8684f1448e0
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: OFF
WITH_ROCM: OFF
CXX compiler version: 8.2.0

I used this docker image to build Paddle: paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82
cmake flags:

cmake .. -DPYTHON_EXECUTABLE:FILEPATH=/usr/local/bin/python3.7 -DPYTHON_INCLUDE_DIR:PATH=/usr/local/include/python3.7m -DPYTHON_LIBRARIES:FILEPATH=/usr/local/lib/libpython3.7m.so -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_INFERENCE_API_TEST=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_AVX=ON -DWITH_DISTRIBUTE=OFF -DWITH_STRIP=ON

I uploaded my test Paddle lib to baidu-bos; you can download it from this link:

https://paddle-qa.bj.bcebos.com/paddle-pipeline/Master_Cpu_Avx512_LinuxUbuntu_Gcc82_Mkl_Py37_Compile_H/0043fa8c2b36f63152797fe08fcfe8684f1448e0/paddle_inference.tgz

@lidanqing-intel (Contributor)

lidanqing-intel commented Sep 15, 2021

Reproducing with the docker image Peihan suggested:

log onto the Shanghai 6271 machine with sudo
service docker start
# You may need to run `docker rm paddle-test` first
docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82 /bin/bash
cd ../paddle/Isse_31992/ocr_demo_1/
bash build.sh
./build/model_test --single_thread=false --single_instance=false --test_groups=-1

@OliverLPH Hi, using the docker image and Paddle inference lib you provided, I hit this error. Is it the same error you saw?

(error screenshot)

@OliverLPH (Contributor, Author)

@lidanqing-intel the problem I hit is a segmentation fault.
Your log looks like the test data were corrupted?
(screenshot)

@jczaja (Contributor)

jczaja commented Sep 17, 2021

@OliverLPH Using the SHA you gave, I was able to reproduce a problem (occasionally), and it is a segmentation fault. So I'm trying to fix it.

@baoachun (Contributor)

baoachun commented Sep 19, 2021

@jczaja
Hi, this issue needs to be fixed by the evening of the 21st; PRs from the 22nd onward will not be able to merge into the 2.2 branch.

jczaja added a commit to jczaja/Paddle that referenced this issue Sep 21, 2021
@baoachun (Contributor)

@jczaja @lidanqing-intel
Hi, this problem may have to be carried into version 2.2. Is there an approximate time by which this issue could be solved?

@jczaja (Contributor)

jczaja commented Sep 23, 2021

@baoachun No estimate yet (I'm discovering new problems and fixing them one by one).

@jczaja (Contributor)

jczaja commented Sep 23, 2021

@baoachun I just have a fix for this issue (tested on my local setup). Please test whether #35884 resolves it.

@baoachun (Contributor)

> @baoachun I just have a fix for this issue (tested on my local setup). Please test whether #35884 resolves it.

@OliverLPH Peihan, could you help test this?

@OliverLPH (Contributor, Author)

OliverLPH commented Sep 24, 2021

@jczaja
Hi, I have verified PR #35884 on the develop branch, and the issue still exists.
But I found some new behavior:

For model group 4, with MKLDNN enabled plus multi-thread and multi-instance, if you change line 188 (screenshot; a hypothetical sketch of this repeat loop follows this comment):

  1. if each sample is predicted only once, it doesn't crash and works well
  2. if each sample is predicted 10 times repeatedly, it hits the error below
    (error screenshot)
  3. if each sample is predicted 100 times repeatedly, it gets a segmentation fault

But for model groups 5 and 6, no matter whether the repeat count is set to 1 or 10, it still gets a segmentation fault.

Here is my test commit log:
(screenshot)
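For illustration, a hypothetical sketch of the kind of repeat loop being varied above; the helper name and handles are assumptions, since model_test.cc line 188 is only shown as a screenshot:

#include <vector>
#include "paddle_inference_api.h"

// Hypothetical repeat-predict helper mirroring the change at model_test.cc:188.
void RepeatPredict(paddle_infer::Predictor* predictor,
                   const std::vector<float>& sample, int repeat) {
  auto input = predictor->GetInputHandle(predictor->GetInputNames()[0]);
  input->Reshape({1, static_cast<int>(sample.size())});
  for (int r = 0; r < repeat; ++r) {  // repeat = 1 passes; 10 and 100 crash
    input->CopyFromCpu(sample.data());
    predictor->Run();
  }
}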

jczaja added a commit that referenced this issue Sep 24, 2021
* Candidate fix
* More fixes to #34554
* Another inconsistent fix to key
* Removed unneeded line
* Matching the cache behaviour to other ops
@jczaja (Contributor)

jczaja commented Sep 24, 2021

@OliverLPH I wasn't able to reproduce your new findings so far. I'm using 485b387 (the merged PR mentioned earlier). So please write:

  1. What is the platform (CPU) you are using where the issue occurs?
  2. Please attach your model_test.cc file, so I can check for any differences.

@OliverLPH (Contributor, Author)

@jczaja
Hi, I will also try 485b387 in my environment, since my last test was done by merging #35884 locally.

  1. CPU info: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
  2. Here is the model_test.cc I used: model_test.txt

@lidanqing-intel (Contributor)

lidanqing-intel commented Sep 27, 2021

@baoachun Hi, since reproducing it even once takes a lot of time, no timeline has been decided for this issue yet.

@jczaja (Contributor)

jczaja commented Sep 27, 2021

@baoachun @lidanqing-intel To be on the same page: I cannot reproduce on my setups after 485b387 (the fixes done so far for this issue), so a timeline will be provided once I reproduce a problem.

@jczaja (Contributor)

jczaja commented Sep 27, 2021

@OliverLPH I understand that you mainly use a Release build of PaddlePaddle. Does the issue also reproduce with a Debug build?

@OliverLPH (Contributor, Author)

@jczaja
I think Paddle didn't support debug mode before? I will try building it in debug mode and reproducing this issue.

@jczaja (Contributor)

jczaja commented Sep 28, 2021

@OliverLPH By debug mode I mean -DCMAKE_BUILD_TYPE=Debug on the cmake command line, with -DWITH_STRIP=ON removed.

@OliverLPH (Contributor, Author)

> @OliverLPH By debug mode I mean -DCMAKE_BUILD_TYPE=Debug on the cmake command line, with -DWITH_STRIP=ON removed.

OK, I will try this cmake command.

AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this issue Sep 29, 2021
* Candidate fix
* More fixes to PaddlePaddle#34554
* Another inconsistent fix to key
* Removed unneeded line
* Matching the cache behaviour to other ops
@jczaja (Contributor)

jczaja commented Sep 29, 2021

@OliverLPH I just want to let you know that I got a stable reproduction of the issue with test_groups="4", so I can investigate the problem. Thanks for the hints.

@jczaja (Contributor)

jczaja commented Oct 8, 2021

@OliverLPH Just to let you know where we are with this: the problem has been diagnosed and I'm implementing a solution (WIP). To temporarily avoid the problem, you can increase the PaddlePaddle cache capacity by adding (inside CreatePredictor, around line 110):
config.SetMkldnnCacheCapacity(10);
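A minimal sketch of where this workaround sits, assuming a typical predictor-creation helper built on the paddle_infer C++ API; the helper name and surrounding config calls are illustrative, not the actual model_test.cc:

#include <memory>
#include <string>
#include "paddle_inference_api.h"

std::shared_ptr<paddle_infer::Predictor> MakePredictor(const std::string& model_dir) {
  paddle_infer::Config config;
  config.SetModel(model_dir);  // illustrative model path
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);
  // Workaround: enlarge the oneDNN cache (keyed by input shape) so it is not
  // cleared and rebuilt while other threads still hold cached primitives.
  config.SetMkldnnCacheCapacity(10);
  return paddle_infer::CreatePredictor(config);
}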

@OliverLPH (Contributor, Author)

@jczaja
Many thanks for your suggestion.
I set config.SetMkldnnCacheCapacity(110) and all models now run fine with no core dump.
Looking forward to your new solution.

@yaomichael

Notes from the 5/20 meeting:

We need to monitor any potential side effects of the large cache, so keep this issue open for a while.

paddle-bot bot added the status/close 已关闭 label Jan 11, 2023