Deadlock in saxpy_ #883

Closed
vadimkantorov opened this issue May 18, 2016 · 12 comments

@vadimkantorov

vadimkantorov commented May 18, 2016

Hi,

I am using Torch7 to train conv nets, and in some configurations OpenBLAS hangs indefinitely with the stack trace below. The hang is almost always reproducible, but it requires a heavy setup, and even minor changes to the code, such as extra debug prints around neural net modules, make the bug disappear:

#0  0x00007f7822225737 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f782052b705 in exec_blas_async_wait () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#2  0x00007f782052b7f6 in exec_blas () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#3  0x00007f782052be28 in blas_level1_thread () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#4  0x00007f782031e625 in saxpy_ () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#5  0x00007f7821713a05 in THFloatBlas_axpy () from /home/kantorov/.wigwam/prefix/lib/libTH.so
#6  0x00007f782164ed74 in THFloatTensor_cadd () from /home/kantorov/.wigwam/prefix/lib/libTH.so
#7  0x00007f7821bd1506 in torch_FloatTensorOperator___sub__ () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libtorch.so
#8  0x000000000048ae6a in lj_BC_FUNCC ()
#9  0x00007f7821962c42 in luaT_mt(short, bool __restrict) () from /home/kantorov/.wigwam/prefix/lib/libluaT.so
#10 0x000000000048ae6a in lj_BC_FUNCC ()
#11 0x000000000047c0c0 in lj_cf_dofile ()
#12 0x000000000048ae6a in lj_BC_FUNCC ()
#13 0x000000000047a6dd in lua_pcall ()
#14 0x000000000041131f in pmain ()
#15 0x000000000048ae6a in lj_BC_FUNCC ()
#16 0x000000000047a757 in lua_cpcall ()
#17 0x000000000040f234 in main ()

OpenBLAS 0.2.18 was built with:
make -j32 FC=gfortran

@brada4
Contributor

brada4 commented May 18, 2016

What if you set a breakpoint in gdb at saxpy_ to see the call parameters?

@vadimkantorov
Author

That would be hard to do: plenty of calls succeed before it hangs. Should I try compiling OpenBLAS/LuaJIT with symbols?

@brada4
Contributor

brada4 commented May 19, 2016

You already have enough symbols. Can you run a backtrace on all threads?
gdb

attach <pid>
thread apply all backtrace
detach

The current backtrace shows that the last BLAS call on that particular thread was saxpy and that it ended well.
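
The interactive session above can also be run non-interactively. The following is a minimal sketch, assuming gdb is installed and the caller is allowed to attach to the process; the helper names `gdb_backtrace_argv` and `dump_all_backtraces` are made up for illustration. It relies on gdb's `-batch` mode, which runs the given `-ex` commands and then exits, and which also turns off gdb's pager so a long dump is not interrupted by prompts.

```python
import subprocess

def gdb_backtrace_argv(pid):
    """Build a gdb command line that dumps every thread's backtrace and exits."""
    return [
        "gdb",
        "-p", str(pid),                       # attach to the running process
        "-batch",                             # run the -ex commands, then exit
        "-ex", "thread apply all backtrace",  # same command as in the session above
    ]

def dump_all_backtraces(pid):
    """Run gdb and return its standard output as text.

    Requires gdb on PATH and permission to ptrace the target process.
    """
    completed = subprocess.run(
        gdb_backtrace_argv(pid), capture_output=True, text=True
    )
    return completed.stdout
```

Calling `dump_all_backtraces(pid)` with the PID of the hung process returns the same kind of per-thread listing in one piece, which is easier to paste into an issue.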

@martin-frbg
Collaborator

Actually it looks as if the thread that last called saxpy is spinning in sched_yield, waiting for someone else to release a lock on something. There was some fundamental criticism in #731 of (still) using sched_yield instead of spinlocks on modern systems, but "only" a less invasive solution (not spawning additional threads for small matrix sizes) was implemented there. If the Torch framework is itself multithreaded, it may make more sense to use a single-threaded OpenBLAS, as stated in the FAQ:

If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following.
export OPENBLAS_NUM_THREADS=1 in the environment variables. Or
Call openblas_set_num_threads(1) in the application on runtime. Or
Build OpenBLAS single thread version, e.g. make USE_THREAD=0
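
As a minimal sketch of the first FAQ option, the variable can be set for a single child process from a launcher script instead of being exported in the shell. The helper name `run_single_threaded` is hypothetical; only the `OPENBLAS_NUM_THREADS` variable comes from the FAQ. The child below is a stand-in Python process that just echoes the variable; in practice it would be the program that links against OpenBLAS.

```python
import os
import subprocess
import sys

def run_single_threaded(argv):
    """Run argv with OPENBLAS_NUM_THREADS=1 set in the child's environment only."""
    env = dict(os.environ)             # copy, so the parent's environment is untouched
    env["OPENBLAS_NUM_THREADS"] = "1"  # the variable named in the FAQ
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)

# Stand-in child process: prints the value it inherited.
child = [sys.executable, "-c", "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]
result = run_single_threaded(child)
print(result.stdout.strip())
```

Setting the variable per child keeps other programs started from the same shell on their default thread count.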

@vadimkantorov
Author

@martin-frbg I think that for matrix operations Torch relies on the underlying BLAS to do things in parallel.
Switching off multi-threading would probably help in my particular case, though in any case any small change breaks the unlucky timing.

But regardless of the inefficiencies of sched_yield discussed in #731, a deadlock is something even worse :) If you can suggest things to check, I could try to investigate this case further.

@vadimkantorov
Author

It seems that building with NO_AFFINITY=1 solves the issue, so it could be related to thread affinity.

@vadimkantorov
Author

@brada4 here's the full one:
(gdb) thread apply all backtrace

Thread 49 (Thread 0x7f934d969700 (LWP 41915)):
#0  0x00007f9378e8684d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f92f096919b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f92f0330931 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f92f0969918 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 48 (Thread 0x7f934e16a700 (LWP 437)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 47 (Thread 0x7f934e96b700 (LWP 438)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 46 (Thread 0x7f934f16c700 (LWP 439)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 45 (Thread 0x7f92a8a84700 (LWP 440)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 44 (Thread 0x7f92a8283700 (LWP 441)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 43 (Thread 0x7f927b4ce700 (LWP 442)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 42 (Thread 0x7f927accd700 (LWP 443)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 41 (Thread 0x7f927a4cc700 (LWP 444)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 40 (Thread 0x7f9279ccb700 (LWP 445)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 39 (Thread 0x7f92794ca700 (LWP 446)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 38 (Thread 0x7f9278cc9700 (LWP 447)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 37 (Thread 0x7f925bfff700 (LWP 451)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 36 (Thread 0x7f925b7fe700 (LWP 452)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 35 (Thread 0x7f925affd700 (LWP 453)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 34 (Thread 0x7f925a7fc700 (LWP 454)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 33 (Thread 0x7f92518bb700 (LWP 455)):
#0  0x00007f937937e250 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f92a8c8b289 in THCondition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#2  0x00007f92a8c896a6 in condition_wait () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libthreads.so
#3  0x000000000048ae6a in lj_BC_FUNCC ()
#4  0x000000000047a6dd in lua_pcall ()
#5  0x00007f92a8a859d8 in THThread_main () from /home/kantorov/.wigwam/prefix/lib/libthreadsmain.so
#6  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 32 (Thread 0x7f9287fa0700 (LWP 719)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 31 (Thread 0x7f928779f700 (LWP 721)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 30 (Thread 0x7f9286f9e700 (LWP 722)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 29 (Thread 0x7f928679d700 (LWP 723)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 28 (Thread 0x7f9285f9c700 (LWP 724)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 27 (Thread 0x7f928579b700 (LWP 725)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 26 (Thread 0x7f9284f9a700 (LWP 726)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 25 (Thread 0x7f9284799700 (LWP 727)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 24 (Thread 0x7f9283f98700 (LWP 728)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 23 (Thread 0x7f9283797700 (LWP 729)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 22 (Thread 0x7f9282f96700 (LWP 730)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 21 (Thread 0x7f9282795700 (LWP 731)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 20 (Thread 0x7f9281f94700 (LWP 732)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 19 (Thread 0x7f9281793700 (LWP 733)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 18 (Thread 0x7f9280f92700 (LWP 734)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 17 (Thread 0x7f9280791700 (LWP 735)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 16 (Thread 0x7f927ff90700 (LWP 736)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 15 (Thread 0x7f927f78f700 (LWP 738)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 14 (Thread 0x7f927ef8e700 (LWP 739)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 13 (Thread 0x7f927e78d700 (LWP 740)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 12 (Thread 0x7f927df8c700 (LWP 741)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 11 (Thread 0x7f927d78b700 (LWP 742)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 10 (Thread 0x7f927cf8a700 (LWP 743)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7f927c789700 (LWP 744)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 8 (Thread 0x7f927bf88700 (LWP 745)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7f92598fb700 (LWP 746)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7f92590fa700 (LWP 747)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7f92588f9700 (LWP 748)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7f92580f8700 (LWP 749)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f92578f7700 (LWP 750)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f92570f6700 (LWP 751)):
#0  0x00007f9377f1d29e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007f9377f1abb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007f937937a0a5 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f9378e90cfd in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f9379c96780 (LWP 40015)):
#0  0x00007f9378e75737 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f937717b705 in exec_blas_async_wait () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#2  0x00007f937717b7f6 in exec_blas () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#3  0x00007f937717be28 in blas_level1_thread () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#4  0x00007f9376f6e625 in saxpy_ () from /home/kantorov/.wigwam/prefix/lib/../lib/libopenblas.so.0
#5  0x00007f9378363a05 in THFloatBlas_axpy () from /home/kantorov/.wigwam/prefix/lib/libTH.so
#6  0x00007f937829ed74 in THFloatTensor_cadd () from /home/kantorov/.wigwam/prefix/lib/libTH.so
#7  0x00007f9378821506 in torch_FloatTensorOperator___sub__ () from /home/kantorov/.wigwam/prefix/lib/lua/5.1/libtorch.so
#8  0x000000000048ae6a in lj_BC_FUNCC ()
#9  0x00007f93785b2c42 in luaT_mt(short, bool __restrict) () from /home/kantorov/.wigwam/prefix/lib/libluaT.so
#10 0x000000000048ae6a in lj_BC_FUNCC ()
#11 0x000000000047c0c0 in lj_cf_dofile ()
#12 0x000000000048ae6a in lj_BC_FUNCC ()
#13 0x000000000047a6dd in lua_pcall ()
#14 0x000000000041131f in pmain ()
#15 0x000000000048ae6a in lj_BC_FUNCC ()
#16 0x000000000047a757 in lua_cpcall ()
#17 0x000000000040f234 in main ()

@vadimkantorov
Author

After reviewing the code carefully: yes, it is possible that my Torch script runs some BLAS operations in parallel from two threads. Using NO_AFFINITY=1 seems to solve the issue, but it's still strange.

@jeromerobert
Contributor

It looks like a duplicate of #660.

This is not really a deadlock; it's a race condition. OpenBLAS is not thread safe when built with default parameters. When built with USE_OPENMP=1, OpenBLAS is thread safe, but without it the thread_status global variable in driver/others/blas_server.c falls victim to race conditions. The internal while loop in exec_blas_async_wait then never ends and keeps calling sched_yield.
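
To make the failure mode concrete, here is a toy model in Python. It is not OpenBLAS code and does not reproduce its data structures: the names `slot`, `post`, `worker` and `spin_wait` are all invented for illustration. It only shows the general pattern, assuming one unprotected global shared by two callers: the second caller overwrites the first one's request, so the first caller's yield loop never sees its completion. The demo adds a deadline so that it terminates; a loop without one would spin indefinitely.

```python
import os
import threading
import time

# Toy model only: a single unprotected global slot shared by every caller.
slot = {"job": None}
done = set()

def post(job):
    """Publish a job with no locking; a later post silently overwrites an earlier one."""
    slot["job"] = job

def worker():
    """Take whatever job is currently in the slot and mark it finished."""
    job = slot["job"]
    slot["job"] = None
    if job is not None:
        done.add(job)

def spin_wait(job, timeout_s):
    """Yield the CPU until job is done; return False if the deadline passes first."""
    yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))
    deadline = time.monotonic() + timeout_s
    while job not in done:
        if time.monotonic() >= deadline:
            return False  # without this check the loop would never exit
        yield_cpu()
    return True

def demo():
    done.clear()
    post("A")
    post("B")  # overwrites A before the worker has looked at the slot
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return spin_wait("A", 0.2), spin_wait("B", 0.2)
```

In this model `demo()` reports that the wait for `"A"` timed out while the wait for `"B"` succeeded. Serializing access to the shared slot, or giving each caller its own state, removes the lost update.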

USE_OPENMP=1 is not the default because it's slower.

I do not understand why NO_AFFINITY=1 hides this issue.

@martin-frbg
Collaborator

Perhaps just subtle differences in timing if threads are allowed to be scheduled on any available core?
(According to a mailing list link in the README, at least part of the motivation for the NO_AFFINITY flag appears to have been Python-related cases where OpenBLAS' setting of the CPU affinity mask for its subthreads during the "import" of the relevant module led to all threads of that program executing on the same CPU.)

@brada4
Contributor

brada4 commented May 20, 2016

CUDA juggles affinity too.

@brada4
Contributor

brada4 commented May 23, 2016

Setting pthread affinity on the main program thread makes threads subsequently created from it inherit the affinity mask, essentially creating a huge zoo on a single CPU core.
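
The inheritance described here can be observed directly on Linux. This is a minimal sketch, assuming a Linux system where Python exposes `os.sched_setaffinity`; the function name `child_affinity_after_pinning` is hypothetical. It pins the calling thread to one CPU, starts a new thread, and records the mask that thread sees. On platforms without the call it returns `None`.

```python
import os
import threading

def child_affinity_after_pinning():
    """Pin the calling thread to one CPU and report the mask a new thread inherits.

    Returns (pinned_mask, child_mask), or None where the affinity calls are missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    original = os.sched_getaffinity(0)  # pid 0 means the calling thread
    one_cpu = {min(original)}           # pick a CPU we are already allowed to use
    seen = []
    try:
        os.sched_setaffinity(0, one_cpu)
        # The new thread is created while the creator is pinned to one CPU.
        t = threading.Thread(target=lambda: seen.append(os.sched_getaffinity(0)))
        t.start()
        t.join()
    finally:
        os.sched_setaffinity(0, original)  # restore the caller's original mask
    return one_cpu, seen[0]
```

On Linux the two returned sets are equal, which is why a library that pins the thread it is loaded from can end up confining every thread the application creates afterwards to the same core.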
