[dev] Memory leak in MKL under multithreading #22827
Comments
@lidanqing-intel Could you help take a look? We verified that this problem occurs on E5-2620 and E5-2650 on the develop branch.
We are looking at it.
@jiweibo, @GaoWei8
@wojtuss We verified this issue on both E5-2620 and E5-2650, but memory usage did not change on CLX 6148. Could you reproduce this issue on E5-2650?
MKL-DNN is not enabled in the reproduction environment; we only used MKLML. Could you reproduce this issue on E5-2650 or E5-2620?
@jiweibo, could you please show the lscpu output, then export MKLDNN_VERBOSE=1 and print the log as below? Best regards, yhu5@clx01:~/baidu_ml/mul$ ./build/mul_demo --model_dir=mul_model --thread_num=3 --num=-1
@yinghu5 MKL-DNN is OFF and the machine is an AVX2 machine. Could you please reproduce the issue on an AVX2 machine?
lscpu:
MKL-DNN was not enabled when I compiled Paddle. I ran export MKLDNN_VERBOSE=1 and printed the log as below: ./build/mul_demo --model_dir=mul_model/ --thread_num=3 --num=3
I have reproduced the memory leak in Docker images of Ubuntu 16 and CentOS 6.10. If you need other environment information, please feel free to leave a message. Thank you very much. @yinghu5
Hi @jiweibo, 1. multi-thread 2. single-thread lscpu:
@jiweibo @xw-github Thank you for reproducing the issue. I can reproduce it too and have confirmed that there is a memory leak. From a first round of debugging, the problem seems to be in libiomp5.so, which allocates memory but does not release it. We are checking the details in our local lab. Thanks. #12 scalable_aligned_malloc (size=0, alignment=1048576) at ../../src/tbbmalloc/frontend.cpp:3058
@jiweibo @GaoWei8 @lidanqing-intel Is it possible to build Paddle without libiomp5? For example, by linking libmklml_gnu.so instead of libmklml_intel.so (and perhaps removing the -liomp5 option when building MKL-DNN). I investigated and found that the Paddle library links a mix of OpenMP libraries, libiomp5.so and libgomp. Mixing threading runtimes like this can cause unexpected behavior. On the other hand, most libiomp5 versions already integrate scalable_aligned_malloc, so we cannot remove it by changing the libiomp5 version. So one approach I can suggest is to remove libiomp5.so and keep only the default GNU OpenMP. (base) [yhu5@hsw-ep01 build]$ ldd ~/baidu_mklml/paddle/paddle/lib/libpaddle_fluid.so
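The check described above (looking for more than one OpenMP runtime among a library's dependencies) can be sketched in a few lines of Python. This is only an illustration, not something posted in the thread: the function names are hypothetical, and the sample `ldd` output below is made up for the example (the paths and addresses are placeholders, not the real output for libpaddle_fluid.so).

```python
import subprocess

# Name prefixes of common OpenMP runtimes: Intel (libiomp5), GNU (libgomp), LLVM (libomp).
OPENMP_RUNTIMES = ("libiomp5", "libgomp", "libomp")


def find_openmp_runtimes(ldd_output: str) -> set:
    """Return the OpenMP runtimes named in the text printed by `ldd`."""
    found = set()
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # The first field is the soname (or a path), e.g. "libgomp.so.1".
        soname = parts[0].rsplit("/", 1)[-1]
        for runtime in OPENMP_RUNTIMES:
            if soname.startswith(runtime + ".so"):
                found.add(runtime)
    return found


def has_mixed_openmp(ldd_output: str) -> bool:
    """True when more than one OpenMP runtime is linked."""
    return len(find_openmp_runtimes(ldd_output)) > 1


def run_ldd(library_path: str) -> str:
    """Run `ldd` on a shared library and return its standard output."""
    result = subprocess.run(
        ["ldd", library_path], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up sample for illustration only; not the actual output from this issue.
SAMPLE_LDD_OUTPUT = """\
    linux-vdso.so.1 (0x00007ffd00000000)
    libiomp5.so => /opt/example/lib/libiomp5.so (0x00007f0000000000)
    libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f0000400000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f0000800000)
"""
```

On a real build, `has_mixed_openmp(run_ldd("/path/to/libpaddle_fluid.so"))` would report whether both runtimes are pulled in; the path here is a placeholder.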
Hi @GaoWei8, I asked @yinghu5. The suggestion is to change the related CMake files so that the Intel libiomp5 library is removed and GNU libgomp is kept. This should not reduce performance elsewhere because libgomp is still present, but some benchmarks are necessary to make sure.
@GaoWei8 Thank you very much for testing. I noticed the linked libraries. Could you please check the produced paddle_fluid.so? Thanks.
The memory leak has been located and resolved. Thanks to everyone for their contributions.
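There is a memory leak in MKL under multithreading.
Test model (contains only a single mul)
Memory curve after running repeatedly under multithreading for 1 hour
We tried switching mklml to version 2019.3 and the leak is still present; after switching to openblas the problem does not occur.
This needs follow-up from Intel.
Reproduction files:
Reproduction steps:
unzip mul.zip
cd mul
Build, following https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html#a-name-c-c-a
sh run_impl.sh fluid_inference_install_dir mul_demo
nohup ./build/mul_demo --model_dir=mul_model --thread_num=3 --num=-1 &
Monitor memory
nohup sh mem_use.sh &
Watch in real time with top
top
After running for 1 hour, kill the process
Check memlog.txt for the change in memory
CPU models: 2620, 2650
mul.zip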
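The contents of mem_use.sh and the format of memlog.txt are not shown in this issue, so the following is not that script. It is a minimal Python sketch of the same idea, assuming a Linux host: sample the resident set size of a process from /proc and estimate how fast it grows. All function names are hypothetical.

```python
import time


def rss_kib(pid: int) -> int:
    """Resident set size of `pid` in KiB, read from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # the kernel reports this value in kB
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def monitor(pid: int, interval_s: float = 10.0, count: int = 360) -> list:
    """Collect `count` samples of (elapsed seconds, RSS in KiB), `interval_s` apart."""
    samples = []
    start = time.monotonic()
    for _ in range(count):
        samples.append((time.monotonic() - start, rss_kib(pid)))
        time.sleep(interval_s)
    return samples


def growth_kib_per_hour(samples) -> float:
    """Least-squares slope of the (seconds, KiB) samples, scaled to KiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var * 3600.0
```

With the default arguments, `monitor(pid)` covers roughly the one-hour run described above. A slope from `growth_kib_per_hour` that stays clearly positive over the whole run is consistent with a leak, while a value near zero suggests stable memory use.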