Deploying models with multi-thread will raise MKLDNN error #34554
Referenced commit:
- Added softmax without caching
- Binary is no longer manually cached
- Activation onednn caching removed
- Removed manual caching of activation
- Modified UT
- Fix
- Fix
- Fixes to building
- Fix
- Fix
- Fix to UT
- Faulty UT workaround
- Approval workaround
- Fixes after review
- Compilation fixes
- More lint fixes
- More fixes after review
- Fixes after another round of review
Hi @OliverLPH, could you gather all the models used in this test and upload them to dubox? When Juncai filed this issue last time, only two models were provided and the rest were commented out in the code. Jacek only has two models, so he cannot run test_group 3~9.
This reverts commit 0a5c99e.
@jczaja @lidanqing-intel Hi, I have re-executed the test program and found that dnnl errors still occur even when config.mkldnn_enabled() is 0.
Baoachun: with MKLDNN turned on, some models still crash with a core dump, but no dnnl::error-type errors show up.
This is the crash log when mkldnn is turned on.
Hi, I have tested with the newest develop branch.
Note 1: if you turn on mkldnn, you should also turn on IR optim, as noted in the running log. So to enable mkldnn on all 10 models, the last two params should both be 1.
Note 2: there is a bug in the test source code; change it
To:
The following commands all passed. @baoachun you can test it now, thanks!
@OliverLPH On my side this issue no longer reproduces.
@OliverLPH I got a core dump from you. To use it I need to know the SHA (commit identification), the exact cmake flags, and the compiler version. Please also tell me whether the core dump is for enabled mkldnn (enable_mkldnn() 0 or 1)?
@jczaja hi, the core dump was created with the following build, and it is for enabled mkldnn=1:
GIT COMMIT ID: 0043fa8c2b36f63152797fe08fcfe8684f1448e0
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: OFF
WITH_ROCM: OFF
CXX compiler version: 8.2.0
I used a docker image to build paddle: paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82
And I uploaded my test paddle lib to baidu-bos; you may download it from this link: https://paddle-qa.bj.bcebos.com/paddle-pipeline/Master_Cpu_Avx512_LinuxUbuntu_Gcc82_Mkl_Py37_Compile_H/0043fa8c2b36f63152797fe08fcfe8684f1448e0/paddle_inference.tgz
Reproduce with the docker image Peihan suggested.
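For reference, a typical way to enter that dev image could look like the sketch below; the mount point and working directory are assumptions, not taken from the thread.

```bash
# Illustrative only: mount point and container paths are assumptions.
docker pull paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82
docker run -it --rm -v "$PWD":/workspace -w /workspace \
    paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82 /bin/bash
```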
@OliverLPH Hi, using the docker image and paddle inference lib you provided, I met this error. Is it the same error you saw?
@lidanqing-intel the problem I met is a segmentation fault.
@OliverLPH Using the SHA you gave, I was able to reproduce a problem (occasionally), and it is a segmentation fault. So I'm trying to fix it.
@jczaja
@jczaja @lidanqing-intel
@baoachun No estimate yet (I'm discovering new problems and fixing them one by one).
@OliverLPH Peihan, could you help test this?
@jczaja for model group 4, when enabling mkldnn with multi-thread and multi-instance, if you change
but for model groups 5 and 6
@OliverLPH I wasn't able to reproduce your new findings so far. I'm using 485b387 (the merged PR mentioned earlier). So please write:
|
@jczaja
|
@baoachun Hi, since reproducing it once takes a lot of time, no timeline for this issue has been decided yet.
@baoachun, @lidanqing-intel To be on the same page: I do not have a reproduction on my setups after 485b387 (the fixes done so far for this issue), so a timeline will be provided once I reproduce the problem.
@OliverLPH I understand that you mainly use a Release build of PaddlePaddle. Does the issue reproduce when using a Debug build?
@jczaja |
@OliverLPH By debug mode I mean passing -DCMAKE_BUILD_TYPE=Debug on the cmake command line and removing -DWITH_STRIP=ON.
OK, I will try this cmake command.
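For reference, a Debug configuration along those lines might look like the sketch below, reusing only the flags already mentioned in this thread (the build directory layout is an assumption):

```bash
# Debug build: keep symbols, and do not pass -DWITH_STRIP=ON.
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DWITH_MKL=ON \
         -DWITH_MKLDNN=ON \
         -DWITH_GPU=OFF \
         -DWITH_ROCM=OFF
make -j"$(nproc)"
```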
Referenced commit:
- Candidate fix
- More fixes to PaddlePaddle#34554
- Another inconsistent fix to key
- Removed unneeded line
- Matching the cache behaviour to other ops
@OliverLPH I just want to let you know that I now have a stable reproduction of the issue with test_groups="4", so I can investigate the problem. Thanks for the hints.
@OliverLPH Just to let you know where we are with this: the problem was diagnosed and I'm implementing a solution (WIP). To temporarily avoid the problem you can increase the PaddlePaddle cache capacity by adding (inside CreatePredictor, ~line 110):
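A minimal sketch of that workaround, assuming the paddle_infer C++ Config API; the model paths and the capacity value are hypothetical, not from the thread:

```cpp
#include <memory>
#include "paddle_inference_api.h"

std::shared_ptr<paddle_infer::Predictor> CreatePredictor() {
  paddle_infer::Config config;
  // Hypothetical model files; replace with the real ones.
  config.SetModel("./model/__model__", "./model/__params__");
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);
  // Workaround: enlarge the oneDNN cache so cached primitives are not
  // evicted while other threads still use them (capacity value is a guess;
  // it is the number of different input shapes to keep cached).
  config.SetMkldnnCacheCapacity(10);
  return paddle_infer::CreatePredictor(config);
}
```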
@jczaja |
Notes from the 5/20 meeting: we need to monitor any potential side effects of the large cache, so keep this issue open for a while.
System information
- PaddlePaddle version: develop (393a0b1) or release/2.1
- CPU: MKLDNN v2.2.1
- GPU: None
- OS Platform: ubuntu1604
- Python version: None
- Cmake orders
- C++version.txt
- API information

Note: You can get most of the information by running summary_env.py.
To Reproduce
The reproduce procedure is the same as in #31992.
cmake commands
Describe your current behavior
The first test: run
./build/model_test --single_thread=true --single_instance=false --test_groups="1"
and it does not cause an error.
The second test: run
./build/model_test --single_thread=true --single_instance=true --test_groups="1"
and it does not cause an error.
The third test: run
./build/model_test --single_thread=false --single_instance=true --test_groups="1"
and it raises an error as follows. This error might be caused by https://github.com/PaddlePaddle/Paddle/pull/33549/files#diff-3363ff47a1f22a111386eeff77a8572f6618d7d0602ffe71e92ecf8682faa84eR2343
Code to reproduce the issue
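As an illustration only (not the issue's actual model_test source), a minimal sketch of the deployment pattern under test, one predictor handle per thread with MKLDNN enabled; the model paths and thread count are assumptions:

```cpp
#include <thread>
#include <vector>
#include "paddle_inference_api.h"

int main() {
  paddle_infer::Config config;
  config.SetModel("./model/__model__", "./model/__params__");  // assumed paths
  config.EnableMKLDNN();
  config.SwitchIrOptim(true);
  auto main_predictor = paddle_infer::CreatePredictor(config);

  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {  // thread count is arbitrary here
    workers.emplace_back([&main_predictor] {
      // Each thread gets its own handle instead of sharing one predictor.
      auto predictor = main_predictor->Clone();
      // ... prepare input tensors and call predictor->Run() ...
    });
  }
  for (auto &t : workers) t.join();
  return 0;
}
```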
Other info / logs