
Deploying models with multi-thread will raise MKLDNN error #34554

Closed · OliverLPH opened this issue Aug 2, 2021 · 31 comments

@OliverLPH (Contributor)

System information
- PaddlePaddle version: develop (393a0b1) or release/2.1
- CPU: MKLDNN v2.2.1
- GPU: None
- OS Platform: Ubuntu 16.04
- Python version: None
Note: you can get most of this information by running summary_env.py.
To Reproduce
The reproduction procedure is the same as in #31992.

cmake command:

cmake .. -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_INFERENCE_API_TEST=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_AVX=ON -DWITH_DISTRIBUTE=OFF -DWITH_STRIP=ON -DWITH_PYTHON=OFF

Describe your current behavior

  • The first test
    Run ./build/model_test --single_thread=true --single_instance=false --test_groups="1"; it does not cause an error.

  • The second test
    Run ./build/model_test --single_thread=true --single_instance=true --test_groups="1"; it does not cause an error.

  • The third test
    Run ./build/model_test --single_thread=false --single_instance=true --test_groups="1"; it raises an error as shown below:
    (error screenshot)

This error might be caused by https://github.com/PaddlePaddle/Paddle/pull/33549/files#diff-3363ff47a1f22a111386eeff77a8572f6618d7d0602ffe71e92ecf8682faa84eR2343
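For context, a minimal sketch of the multi-threaded inference pattern this test exercises, using the public paddle_infer C++ API; the model path, input shape, and thread count are illustrative assumptions, not the actual model_test.cc:

#include <thread>
#include <vector>
#include "paddle_inference_api.h"  // shipped with the paddle_inference package

int main() {
  paddle_infer::Config config;
  config.SetModel("./model_dir");  // illustrative model path
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);

  // One main predictor; each worker thread runs on its own Clone().
  auto main_predictor = paddle_infer::CreatePredictor(config);

  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {  // 4 worker threads is an assumption
    workers.emplace_back([main_predictor] {
      auto predictor = main_predictor->Clone();
      auto input = predictor->GetInputHandle(predictor->GetInputNames()[0]);
      std::vector<float> data(1 * 3 * 224 * 224, 0.0f);  // dummy input
      input->Reshape({1, 3, 224, 224});
      input->CopyFromCpu(data.data());
      predictor->Run();
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}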


@paddle-bot-old

paddle-bot-old bot commented Aug 2, 2021

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment and version info, and error messages. You may also check the official API docs, FAQ, issue history, and the AI community for answers. Have a nice day!

jczaja added a commit that referenced this issue Aug 11, 2021
* Added softmax without caching
* Binary is no longer manually cached
* Activation onednn caching removed
* Removed manual caching of activation
* Modified UT
* Fixes to building and to the UT, a faulty-UT workaround, and an approval workaround
* Fixes after several rounds of review, compilation fixes, and lint fixes
@lidanqing-intel (Contributor)

Hi @OliverLPH, could you gather all the models used in this test and upload them to dubox? When Juncai filed this issue last time, only two models were provided and the rest were commented out in the code, so Jacek only has two models and cannot test test_groups 3~9.
Since there are several models and the total size is probably large, please upload them directly to dubox and share the link, so the Poland team can download them directly and I don't have to download, re-upload, and transfer them. Thanks a lot!

@baoachun (Contributor)

@jczaja @lidanqing-intel Hi, I have re-executed the test program, and I found that dnnl errors still occur when config.mkldnn_enabled() is 0.
(two screenshots of the dnnl errors)
Here is the log: without_mkldnn.log.

@lidanqing-intel (Contributor)

Baoachun: with MKLDNN turned on, some models still crash (core dump), but no dnnl::error-type errors show up.
Jacek: with MKLDNN turned on, all models work well with no crash. Only when MKLDNN is off do some int8 models fail, because int8 models should not run with MKLDNN off.
@lidanqing-intel to investigate the gap.

@baoachun (Contributor)

> Baoachun: with MKLDNN turned on, some models still crash (core dump), but no dnnl::error-type errors show up.
> Jacek: with MKLDNN turned on, all models work well with no crash. Only when MKLDNN is off do some int8 models fail, because int8 models should not run with MKLDNN off.
> @lidanqing-intel to investigate the gap.

This is the crash log when MKLDNN is turned on:
Link: https://dubox.com/s/1-I4LjktKo2n1r0ieS91Skg  Password: cbak

@lidanqing-intel (Contributor)

lidanqing-intel commented Aug 25, 2021

Hi, I have tested with the newest develop (f609ca379173de0be1e288a1832fd074b6c61587).

Note 1: if you turn on MKLDNN, you should also turn on IR optimization, as noted in the running log. So to enable MKLDNN for all 10 models, the last two params should both be 1.

Note 2: there is a bug in the test source code: '<' was mistakenly written as ','. You need to change line 318
From:

 for (int i = 0; i , test_model_params.size(); i++) {

To:

 for (int i = 0; i < test_model_params.size(); i++) {

The following commands all passed:
./build/model_test --single_thread=true --single_instance=false --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="-1"

@baoachun you can test now, thanks!

@lidanqing-intel (Contributor)

lidanqing-intel commented Aug 26, 2021

@OliverLPH on my side this issue no longer reproduces.
There is a bug in the test source code: '<' was mistakenly written as ','. You need to change model_test.cc line 318
From:

 for (int i = 0; i , test_model_params.size(); i++) {

To:

 for (int i = 0; i < test_model_params.size(); i++) {

The following commands all passed:
./build/model_test --single_thread=true --single_instance=false --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="1"
./build/model_test --single_thread=false --single_instance=true --test_groups="-1"

lidanqing-intel added this to the v2.2 milestone Sep 6, 2021
@jczaja (Contributor)

jczaja commented Sep 14, 2021

@OliverLPH I got a core dump from you. To use it I need to know the SHA (commit ID), the exact cmake flags, and the compiler version.

Please also tell me whether the core dump is for MKLDNN enabled (enable_mkldnn() 0 or 1)?

@OliverLPH (Contributor, Author)

@jczaja hi, the core dump was created with the following version, and it is for MKLDNN enabled (mkldnn=1).
commit: 0043fa8
version.txt:

GIT COMMIT ID: 0043fa8c2b36f63152797fe08fcfe8684f1448e0
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: OFF
WITH_ROCM: OFF
CXX compiler version: 8.2.0

I used this docker image to build Paddle: paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82
cmake flags:

cmake .. -DPYTHON_EXECUTABLE:FILEPATH=/usr/local/bin/python3.7 -DPYTHON_INCLUDE_DIR:PATH=/usr/local/include/python3.7m -DPYTHON_LIBRARIES:FILEPATH=/usr/local/lib/libpython3.7m.so -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_INFERENCE_API_TEST=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_AVX=ON -DWITH_DISTRIBUTE=OFF -DWITH_STRIP=ON

I uploaded my test Paddle lib to baidu-bos; you can download it from this link:

https://paddle-qa.bj.bcebos.com/paddle-pipeline/Master_Cpu_Avx512_LinuxUbuntu_Gcc82_Mkl_Py37_Compile_H/0043fa8c2b36f63152797fe08fcfe8684f1448e0/paddle_inference.tgz

@lidanqing-intel (Contributor)

lidanqing-intel commented Sep 15, 2021

Reproducing with the docker image Peihan suggested:

log onto the Shanghai 6271 machine with sudo
service docker start
# You may need to run `docker rm paddle-test` first
docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82 /bin/bash
cd ../paddle/Isse_31992/ocr_demo_1/
bash build.sh
./build/model_test --single_thread=false --single_instance=false --test_groups=-1

@OliverLPH Hi, using the docker image and Paddle inference lib you provided, I hit this error. Is it the same error you saw?

(error screenshot)

@OliverLPH (Contributor, Author)

@lidanqing-intel the problem I hit is a segmentation fault.
Your log looks like the test data were corrupted?
(screenshot)

@jczaja (Contributor)

jczaja commented Sep 17, 2021

@OliverLPH Using the SHA you gave, I was able to reproduce a problem (occasionally), and it is a segmentation fault. So I'm trying to fix it.

@baoachun (Contributor)

baoachun commented Sep 19, 2021

@jczaja
Hi, this issue needs to be fixed by the evening of the 21st; PRs from the 22nd onward will not be able to merge into the 2.2 branch.

jczaja added a commit to jczaja/Paddle that referenced this issue Sep 21, 2021
@baoachun (Contributor)

@jczaja @lidanqing-intel
Hi, this problem may have to be carried into version 2.2. Is there an approximate time by which this issue could be solved?

@jczaja (Contributor)

jczaja commented Sep 23, 2021

@baoachun No estimate yet (I'm discovering new problems and fixing them one by one).

@jczaja (Contributor)

jczaja commented Sep 23, 2021

@baoachun I just have a fix for this issue (tested on my local setup). Please test whether #35884 resolves it.

@baoachun (Contributor)

> @baoachun I just have a fix for this issue (tested on my local setup). Please test whether #35884 resolves it.

@OliverLPH Peihan, could you help test this?

@OliverLPH (Contributor, Author)

OliverLPH commented Sep 24, 2021

@jczaja
Hi, I have verified PR #35884 on the develop branch, and the issue still exists.
But I found some new behavior:

For model group 4, with MKLDNN enabled plus multi-thread and multi-instance, if you change line 188 (screenshot; a hypothetical sketch of this repeat loop follows this comment):

  1. if each sample is predicted only once, it doesn't crash and works well
  2. if each sample is predicted 10 times repeatedly, it hits the error below
    (error screenshot)
  3. if each sample is predicted 100 times repeatedly, it gets a segmentation fault

But for model groups 5 and 6, no matter whether the repeat count is set to 1 or 10, it still gets a segmentation fault.

Here is my test commit log:
(screenshot)
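For illustration, a hypothetical sketch of the kind of repeat loop being varied above; the helper name and handles are assumptions, since model_test.cc line 188 is only shown as a screenshot:

#include <vector>
#include "paddle_inference_api.h"

// Hypothetical repeat-predict helper mirroring the change at model_test.cc:188.
void RepeatPredict(paddle_infer::Predictor* predictor,
                   const std::vector<float>& sample, int repeat) {
  auto input = predictor->GetInputHandle(predictor->GetInputNames()[0]);
  input->Reshape({1, static_cast<int>(sample.size())});
  for (int r = 0; r < repeat; ++r) {  // repeat = 1 passes; 10 and 100 crash
    input->CopyFromCpu(sample.data());
    predictor->Run();
  }
}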

jczaja added a commit that referenced this issue Sep 24, 2021
* Candidate fix
* More fixes to #34554
* Another inconsistent fix to key
* Removed unneeded line
* Matching the cache behaviour to other ops
@jczaja (Contributor)

jczaja commented Sep 24, 2021

@OliverLPH I wasn't able to reproduce your new findings so far. I'm using 485b387 (the merged PR mentioned earlier). So please write:

  1. What is the platform (CPU) you are using where the issue occurs?
  2. Please attach your model_test.cc file, so I can check for any differences.

@OliverLPH (Contributor, Author)

@jczaja
Hi, I will also try 485b387 in my environment, since my last test was done by merging #35884 locally.

  1. CPU info: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
  2. Here is the model_test.cc I used: model_test.txt

@lidanqing-intel (Contributor)

lidanqing-intel commented Sep 27, 2021

@baoachun Hi, since reproducing it even once takes a lot of time, no timeline has been decided for this issue yet.

@jczaja (Contributor)

jczaja commented Sep 27, 2021

@baoachun @lidanqing-intel To be on the same page: I cannot reproduce on my setups after 485b387 (the fixes done so far for this issue), so a timeline will be provided once I reproduce a problem.

@jczaja (Contributor)

jczaja commented Sep 27, 2021

@OliverLPH I understand that you mainly use a Release build of PaddlePaddle. Does the issue also reproduce with a Debug build?

@OliverLPH (Contributor, Author)

@jczaja
I think Paddle didn't support debug mode before? I will try building it in debug mode and reproducing this issue.

@jczaja (Contributor)

jczaja commented Sep 28, 2021

@OliverLPH By debug mode I mean -DCMAKE_BUILD_TYPE=Debug on the cmake command line, with -DWITH_STRIP=ON removed.

@OliverLPH (Contributor, Author)

> @OliverLPH By debug mode I mean -DCMAKE_BUILD_TYPE=Debug on the cmake command line, with -DWITH_STRIP=ON removed.

OK, I will try this cmake command.

AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this issue Sep 29, 2021
* Candidate fix
* More fixes to PaddlePaddle#34554
* Another inconsistent fix to key
* Removed unneeded line
* Matching the cache behaviour to other ops
@jczaja (Contributor)

jczaja commented Sep 29, 2021

@OliverLPH I just want to let you know that I got a stable reproduction of the issue with test_groups="4", so I can investigate the problem. Thanks for the hints.

@jczaja (Contributor)

jczaja commented Oct 8, 2021

@OliverLPH Just to let you know where we are with this: the problem has been diagnosed and I'm implementing a solution (WIP). To temporarily avoid the problem, you can increase the PaddlePaddle cache capacity by adding (inside CreatePredictor, around line 110):
config.SetMkldnnCacheCapacity(10);
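A minimal sketch of where this workaround sits, assuming a typical predictor-creation helper built on the paddle_infer C++ API; the helper name and surrounding config calls are illustrative, not the actual model_test.cc:

#include <memory>
#include <string>
#include "paddle_inference_api.h"

std::shared_ptr<paddle_infer::Predictor> MakePredictor(const std::string& model_dir) {
  paddle_infer::Config config;
  config.SetModel(model_dir);  // illustrative model path
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);
  // Workaround: enlarge the oneDNN cache (keyed by input shape) so it is not
  // cleared and rebuilt while other threads still hold cached primitives.
  config.SetMkldnnCacheCapacity(10);
  return paddle_infer::CreatePredictor(config);
}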

@OliverLPH (Contributor, Author)

@jczaja
Many thanks for your suggestion.
I set config.SetMkldnnCacheCapacity(110) and all models now run fine with no core dump.
Looking forward to your new solution.

@yaomichael

Notes from the 5/20 meeting:

We need to monitor any potential side effects of the large cache, so keep this issue open for a while.

paddle-bot bot added the status/close 已关闭 label Jan 11, 2023