Deadlock with ThreadedEngine #18090
Comments
@ruro this is an unfortunate issue. The working hypothesis is that there is a bug in one of the numpy operators that causes some later test to hang. @szha is helping to switch CI to run all unit tests via the naive engine instead of the threaded engine, which may help prevent a buggy operator from causing a later test to hang. There is also the idea of timing out the job if there is no new output for 30 minutes, but nobody owns implementing this yet AFAIK. Any help would be appreciated.
@josephevans is looking into a 30m auto timeout.
The 3 hours you see is the global timeout we are setting for jobs, so that abort is intentional.
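The "no new output for 30 minutes" idea above can be sketched roughly as follows. This is only an illustration, not the CI implementation: `run_with_idle_timeout` and its parameters are hypothetical names, and the sketch assumes the job is a single child process whose combined stdout/stderr is the only progress signal.

```python
import os
import subprocess
import threading
import time


def run_with_idle_timeout(cmd, idle_timeout, poll_interval=0.1, env=None):
    """Run cmd and kill it if it produces no output for idle_timeout seconds.

    Hypothetical helper. Returns (returncode, timed_out, output_bytes).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=env
    )
    chunks = []
    # One-element list so the reader thread can update the timestamp in place.
    last_output = [time.monotonic()]

    def pump():
        # Read raw bytes so progress dots without a newline also count as output.
        fd = proc.stdout.fileno()
        while True:
            data = os.read(fd, 4096)
            if not data:  # EOF: the child closed its end of the pipe
                break
            chunks.append(data)
            last_output[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > idle_timeout:
            timed_out = True
            proc.kill()  # the job is considered hung
            break
        time.sleep(poll_interval)

    returncode = proc.wait()
    reader.join(timeout=5)
    proc.stdout.close()
    return returncode, timed_out, b"".join(chunks)
```

A CI wrapper could call this with `idle_timeout=30 * 60` around the test command. The `env` argument is there so that a caller could, for example, set MXNet's `MXNET_ENGINE_TYPE` environment variable to `NaiveEngine` for the child process; whether that combination is what the CI should use is a separate decision.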
Okay! I think I was able to track down the source of this problem. I noticed that the last 3 times the pipeline froze for me, the last output was After ducking around with It didn't happen every time and setting PRNG seeds didn't make it fail deterministically, so the issue is probably thread-related. Also, it definitely doesn't happen with After making a reproducible setup, I spent some time bisecting the

import mxnet as mx
with mx.Context(mx.gpu()):
    while True:
        for _ in range(100):
            mx_data = mx.np.array(0)
            mx_x = mx.np.empty_like(mx_data)
        print(end='.', flush=True)

The above piece of code executes the
@ruro Thank you for investigating the issue! Unfortunately I can't repro based on your script. Can you provide more details on your environment (mxnet version, cuda version, gpu architecture, etc)?
However, interrupting (
You need some lucky timing to get this backtrace, so try a few times if you can't see it.
Here is my environment:
My mxnet install was built from source from the latest master a couple of days ago. The CUDA version is 10.2 and the GPU is a GTX 1050. The time it takes for the lockup to happen can vary quite a lot. I'd recommend starting a couple of copies of this script and leaving them running for a minute or so to be absolutely sure. Interrupting the script before it hangs indeed sometimes results in a Also, here is the feature list for my mxnet install
Update: some further testing revealed that
Also, interestingly enough, all the CPU pipelines set
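Since the lockup is timing-dependent, the advice to start a couple of copies and leave them running can be automated. The following is a minimal sketch under stated assumptions: `find_stalled_copies` is a hypothetical helper, and it assumes the script keeps printing progress (as the reproducer does with its dots), so a copy that is still alive but whose output stopped growing is treated as hung.

```python
import os
import subprocess
import tempfile
import time


def find_stalled_copies(cmd, copies=3, warmup=60.0, idle=10.0):
    """Start several copies of cmd and report which ones look hung.

    Hypothetical helper. A copy counts as stalled if it is still running
    but wrote no new output during the final `idle` seconds.
    Returns the list of stalled copy indices.
    """
    procs = []
    for _ in range(copies):
        log = tempfile.TemporaryFile()  # each copy gets its own output file
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        procs.append((proc, log))
    try:
        time.sleep(warmup)  # give the race a chance to happen
        before = [os.fstat(log.fileno()).st_size for _, log in procs]
        time.sleep(idle)
        after = [os.fstat(log.fileno()).st_size for _, log in procs]
        return [
            i
            for i, (proc, _) in enumerate(procs)
            if proc.poll() is None and after[i] == before[i]
        ]
    finally:
        # Always clean up, including copies that are stuck.
        for proc, log in procs:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
            log.close()
```

For the reproducer in this thread, something like `find_stalled_copies([sys.executable, "repro.py"], copies=4)` would be the intended use, where `repro.py` is a placeholder file name for the script above.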
@leezu can you try to reproduce with a recent mxnet version that is built with
While we are waiting for somebody to reproduce, I've tried playing with the stuck process in gdb.
Thread 16 (Thread 0x7ffef27fc700 (LWP 1388714)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc64fa7d1 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d5d620) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 15 (Thread 0x7ffef2ffd700 (LWP 1388687)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc64fa7d1 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d6a8a0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 14 (Thread 0x7ffef37fe700 (LWP 1388625)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc64fa7d1 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d3f910) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 13 (Thread 0x7ffef3fff700 (LWP 1388624)):
#0 0x00007ffff79c401a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff7b020c3 in PyEval_RestoreThread () at /usr/lib/libpython3.8.so.1.0
#2 0x00007ffff7aa57d7 in () at /usr/lib/libpython3.8.so.1.0
#3 0x00007ffff72ff168 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so
#4 0x00007ffff727d8c2 in () at /usr/lib/libffi.so.7
#5 0x00007ffff727dc20 in () at /usr/lib/libffi.so.7
#6 0x00007fffc6bf3782 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#7 0x00007fffc64840b7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#8 0x00007fffc6bf433e in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#9 0x00007fffc6bf8eca in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#10 0x00007fffc64fa794 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#11 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557c4f8b0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#12 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#13 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 12 (Thread 0x7fff10ff9700 (LWP 1388623)):
#0 0x00007ffff79c401a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff7b020c3 in PyEval_RestoreThread () at /usr/lib/libpython3.8.so.1.0
#2 0x00007ffff7300f24 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so
#3 0x00007ffff730570d in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so
#4 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0
#5 0x00007ffff7bc9c51 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#6 0x00007ffff7bb6a9d in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#7 0x00007ffff7bc558e in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#8 0x00007ffff7bb6a9d in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#9 0x00007ffff7b5acda in () at /usr/lib/libpython3.8.so.1.0
#10 0x00007ffff7b5bc78 in () at /usr/lib/libpython3.8.so.1.0
#11 0x00007ffff7bc5e1c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#12 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#13 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#14 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#15 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#16 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#17 0x00007ffff7b12508 in PyObject_Call () at /usr/lib/libpython3.8.so.1.0
#18 0x00007ffff7bc70c4 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#19 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#20 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#21 0x00007ffff7b12508 in PyObject_Call () at /usr/lib/libpython3.8.so.1.0
#22 0x00007ffff7bc70c4 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#23 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#24 0x00007ffff7bb72c2 in () at /usr/lib/libpython3.8.so.1.0
#25 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#26 0x00007ffff7bb6154 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#27 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#28 0x00007ffff7b123fd in PyObject_Call () at /usr/lib/libpython3.8.so.1.0
#29 0x00007ffff72ff2a0 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so
#30 0x00007ffff727d8c2 in () at /usr/lib/libffi.so.7
#31 0x00007ffff727dc20 in () at /usr/lib/libffi.so.7
#32 0x00007fffc6bf3c74 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#33 0x00007fffc6bfa4d6 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#34 0x00007fffc64fa748 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#35 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557bf09e0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#36 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#37 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 11 (Thread 0x7fff117fa700 (LWP 1388622)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc64fa7d1 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d0b310) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 10 (Thread 0x7fff11ffb700 (LWP 1388621)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc64fa7d1 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557b63970) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 9 (Thread 0x7fff127fc700 (LWP 1388620)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc65d9181 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65d76ba in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d093c0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#6 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#7 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 8 (Thread 0x7fff12ffd700 (LWP 1388619)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue<mxnet::engine::OprBlock*, (dmlc::ConcurrentQueueType)1>::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65dbe99 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fffc65d76ba in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555556ca9940) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 7 (Thread 0x7fff137fe700 (LWP 1388618)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue<mxnet::engine::OprBlock*, (dmlc::ConcurrentQueueType)1>::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65dbe99 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fffc65d76ba in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555556ca9a50) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 6 (Thread 0x7fff13fff700 (LWP 1388617)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue<mxnet::engine::OprBlock*, (dmlc::ConcurrentQueueType)1>::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65dbe99 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fffc65d76ba in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557b9fea0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 5 (Thread 0x7fff188cf700 (LWP 1388616)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue<mxnet::engine::OprBlock*, (dmlc::ConcurrentQueueType)1>::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65dbe99 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fffc65d76ba in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d34010) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 4 (Thread 0x7fff7a60c880 (LWP 1388611)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 3 (Thread 0x7fff7ae0e800 (LWP 1388610)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 2 (Thread 0x7fff7b610780 (LWP 1388609)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so
#7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0
#8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
Thread 1 (Thread 0x7ffff785f740 (LWP 1388524)):
#0 0x00007ffff79c74cf in __lll_lock_wait () at /usr/lib/libpthread.so.0
#1 0x00007ffff79bfe03 in pthread_mutex_lock () at /usr/lib/libpthread.so.0
#2 0x00007fffc6bf9b4e in mxnet::op::custom::AttrParser(nnvm::NodeAttrs*) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#3 0x00007fffc65053db in MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc65061da in MXImperativeInvokeEx () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fff79dc143b in __pyx_pf_5mxnet_4_cy3_7ndarray_2_imperative_invoke (__pyx_self=<optimized out>, __pyx_v_output_is_list=<optimized out>, __pyx_v_is_np_op=<optimized out>, __pyx_v_out=<optimized out>, __pyx_v_vals=<optimized out>, __pyx_v_keys=<optimized out>, __pyx_v_ndargs=<optimized out>, __pyx_v_handle=<optimized out>) at /usr/include/c++/9.3.0/bits/stl_vector.h:915
#6 __pyx_pw_5mxnet_4_cy3_7ndarray_3_imperative_invoke(PyObject*, PyObject*, PyObject*) (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at mxnet/cython/ndarray.cpp:5334
#7 0x00007ffff7b18ff6 in PyCFunction_Call () at /usr/lib/libpython3.8.so.1.0
#8 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0
#9 0x00007ffff7bc979c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#10 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#11 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#12 0x00007ffff7b0ef60 in _PyObject_FastCallDict () at /usr/lib/libpython3.8.so.1.0
#13 0x00007ffff7c2e63d in () at /usr/lib/libpython3.8.so.1.0
#14 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0
#15 0x00007ffff7bc9e8b in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#16 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#17 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#18 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#19 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#20 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0
#21 0x00007ffff7bc980a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0
#22 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0
#23 0x00007ffff7c3cd73 in PyEval_EvalCode () at /usr/lib/libpython3.8.so.1.0
#24 0x00007ffff7c3cdc8 in () at /usr/lib/libpython3.8.so.1.0
#25 0x00007ffff7c41063 in () at /usr/lib/libpython3.8.so.1.0
#26 0x00007ffff7adbdf0 in PyRun_FileExFlags () at /usr/lib/libpython3.8.so.1.0
#27 0x00007ffff7ae5aa4 in PyRun_SimpleFileExFlags () at /usr/lib/libpython3.8.so.1.0
#28 0x00007ffff7c4d81e in Py_RunMain () at /usr/lib/libpython3.8.so.1.0
#29 0x00007ffff7c4d909 in Py_BytesMain () at /usr/lib/libpython3.8.so.1.0
#30 0x00007ffff7dbd023 in __libc_start_main () at /usr/lib/libc.so.6
#31 0x000055555555505e in _start ()
As you can see, all the threads seem to be either in
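With this many threads, it can help to reduce the dump to the innermost frame of each thread before reading it. The sketch below is an illustration only: the helper names are hypothetical, the regular expressions assume the exact gdb output format shown in this thread, and `SAMPLE` is an abridged excerpt of the dump above.

```python
import re
from collections import Counter

# Patterns assume the gdb layout seen in this thread.
THREAD_RE = re.compile(r"^Thread (\d+) \(")
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+)")


def innermost_frames(backtrace_text):
    """Map each thread number to the function name in its frame #0."""
    frames = {}
    current = None
    for raw in backtrace_text.splitlines():
        line = raw.strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and frame.group(1) == "0":
            # Drop symbol version suffixes such as "@@GLIBC_2.3.2".
            frames[current] = frame.group(2).split("@@")[0]
    return frames


def summarize(backtrace_text):
    """Count how many threads are blocked in each innermost function."""
    return Counter(innermost_frames(backtrace_text).values())


# Abridged excerpt of the backtrace posted above.
SAMPLE = """\
Thread 16 (Thread 0x7ffef27fc700 (LWP 1388714)):
#0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
Thread 13 (Thread 0x7ffef3fff700 (LWP 1388624)):
#0 0x00007ffff79c401a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff7b020c3 in PyEval_RestoreThread () at /usr/lib/libpython3.8.so.1.0
Thread 1 (Thread 0x7ffff785f740 (LWP 1388524)):
#0 0x00007ffff79c74cf in __lll_lock_wait () at /usr/lib/libpthread.so.0
#1 0x00007ffff79bfe03 in pthread_mutex_lock () at /usr/lib/libpthread.so.0
"""

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).most_common():
        print(count, name)
```

Grouping by frame #0 only shows where each thread is parked, not which lock it holds, so the full frames are still needed to work out who is waiting on whom.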
#18014 appears to be due to wrong mutex handling and may be related.
@ruro with regard to reproducing, we should first fix the
Are you sure that the backtrace you got after interrupting is in any way related to this issue? It doesn't seem likely to me.
Also, I don't quite understand, even if there are unhandled errors somewhere in
@ruro, @reminisce helped point out an error in your reproducer script. You use a zero-shape tensor but didn't enable numpy shape semantics. It should be
This explains the error in #18090 (comment). You're right, it shouldn't cause a deadlock.
I can reproduce the hang with the
Ah, I see. I had the Also, the hang happens with or without the

import mxnet as mx
mx.npx.set_np()
while True:
    for _ in range(100):
        mx_data = mx.np.array(0)
        mx_x = mx.np.empty_like(mx_data)
    print(end='.', flush=True)
Any action items to unblock the CI now?
With respect to the numpy op triggering the issue: we can replace those based on CustomOp with native operators. There are only 4 in this file. For CustomOp itself, we may want to rewrite it based on the PackedFunc or drop it.
Are we sure that this is 100% a Also, dropping Also, @zhreshold the CI is technically not blocked, since the test is just unstable and not completely impossible to pass. This issue has been around for a while, and until now the solution was just to restart Also, @leezu if fixing CI is a priority, we could temporarily disable all the CI tests that use the By the way, the official tutorial for
A little further investigation in GDB with some debug symbols revealed the following information: Thread 1 (main thread) is in Thread 13 is in I am really not sure how Thread 13 could arrive at the
GDB logs with evidence
(gdb) run test.py
(gdb) backtrace
#0 0x00007ffff79a14cf in __lll_lock_wait () at /usr/lib/libpthread.so.0
#1 0x00007ffff7999e03 in pthread_mutex_lock () at /usr/lib/libpthread.so.0
#2 0x00007fffc69c0b4e in mxnet::op::custom::AttrParser(nnvm::NodeAttrs*) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#3 0x00007fffc62cc3db in MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#4 0x00007fffc62cd1da in MXImperativeInvokeEx () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#5 0x00007fff79b8f43b in __pyx_pf_5mxnet_4_cy3_7ndarray_2_imperative_invoke (__pyx_self=<optimized out>, __pyx_v_output_is_list=<optimized out>, __pyx_v_is_np_op=<optimized out>, __pyx_v_out=<optimized out>, __pyx_v_vals=<optimized out>, __pyx_v_keys=<optimized out>, __pyx_v_ndargs=<optimized out>, __pyx_v_handle=<optimized out>) at /usr/include/c++/9.3.0/bits/stl_vector.h:915
#6 __pyx_pw_5mxnet_4_cy3_7ndarray_3_imperative_invoke(PyObject*, PyObject*, PyObject*) (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at mxnet/cython/ndarray.cpp:5334
#7 0x00007ffff7b59968 in cfunction_call_varargs (kwargs=0x0, args=(93825001767984, [<ndarray at remote 0x7ffef526bb90>], ['op_type', 'dtype', 'order', 'subok', 'shape'], ['empty_like_fallback', 'None', 'C', False, None], None, True, False), func=<built-in function _imperative_invoke>) at Objects/call.c:742
#8 PyCFunction_Call (func=<built-in function _imperative_invoke>, args=(93825001767984, [<ndarray at remote 0x7ffef526bb90>], ['op_type', 'dtype', 'order', 'subok', 'shape'], ['empty_like_fallback', 'None', 'C', False, None], None, True, False), kwargs=0x0) at Objects/call.c:772
#9 0x00007ffff7b1325f in _PyObject_MakeTpCall (callable=<built-in function _imperative_invoke>, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at Objects/call.c:168
#10 0x00007ffff7b7e4a8 in _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775815, args=0x555557e486f0, callable=<optimized out>) at Python/ceval.c:3493
#11 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555557e5f0) at Python/ceval.c:4987
#12 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3500
#13 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=<code at remote 0x7fff78d621e0>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x7fff142e60e8, kwcount=<optimized out>, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='Custom', qualname='Custom') at Python/ceval.c:4298
#14 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fff142e60e0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:435
#15 0x00007ffff7b476da in _PyObject_FastCallDict (callable=<function at remote 0x7fff78d68eb0>, args=0x7ffef55f16f8, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:104
#16 0x00007ffff7bf2fbe in partial_fastcall (pto=<optimized out>, pto=<optimized out>, kwargs=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, nargs=1, args=<optimized out>) at ./Modules/_functoolsmodule.c:169
#17 partial_call (pto=0x7fff17ec30c0, args=<optimized out>, kwargs=<optimized out>) at ./Modules/_functoolsmodule.c:224
#18 0x00007ffff7b1325f in _PyObject_MakeTpCall (callable=<functools.partial at remote 0x7fff17ec30c0>, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at Objects/call.c:168
#19 0x00007ffff7b83429 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x555557e4ce90, callable=<functools.partial at remote 0x7fff17ec30c0>) at Python/ceval.c:1555
#20 call_function (kwnames=<optimized out>, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555557e5f0) at Python/ceval.c:4987
#21 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3515
#22 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=<code at remote 0x7fff78f97520>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x7fff17ed41f0, kwcount=<optimized out>, kwstep=1, defs=0x7fff78fb2488, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298
#23 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fff17ed41e8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:435
#24 0x00007ffff7b7ecb9 in _PyObject_Vectorcall (kwnames=('dtype', 'order', 'subok', 'shape'), nargsf=<optimized out>, args=0x7fff17ed41e8, callable=<function at remote 0x7fff78fab690>) at ./Include/cpython/abstract.h:92
#25 call_function (kwnames=('dtype', 'order', 'subok', 'shape'), oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555557e5f0) at Python/ceval.c:4987
#26 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3515
#27 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=<code at remote 0x7fff78bd5ba0>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x5555555e49e8, kwcount=<optimized out>, kwstep=1, defs=0x7fff78be8a88, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298
#28 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x5555555e49e0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:435
#29 0x00007ffff7b7e2ea in _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775809, args=0x5555555e49e0, callable=<function at remote 0x7fff6be6d050>) at ./Include/cpython/abstract.h:92
#30 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555557e5f0) at Python/ceval.c:4987
#31 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3469
#32 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=<code at remote 0x7ffff73952b0>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x0, kwcount=<optimized out>, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4298
#33 0x00007ffff7be52da in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4327
#34 0x00007ffff7be52fc in PyEval_EvalCode (co=co@entry=<code at remote 0x7ffff73952b0>, globals=globals@entry=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, locals=locals@entry=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
) at Python/ceval.c:718
#35 0x00007ffff7be53aa in run_eval_code_obj (co=0x7ffff73952b0, globals=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, locals=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
) at Python/pythonrun.c:1125
#36 0x00007ffff7c2a134 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, locals=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, flags=<optimized out>, arena=<optimized out>) at Python/pythonrun.c:1147
#37 0x00007ffff7af1a95 in PyRun_FileExFlags (fp=fp@entry=0x555555621020, filename_str=filename_str@entry=0x7ffff742aee0 "test.py", start=start@entry=257, globals=globals@entry=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, locals=locals@entry=Python Exception <class 'gdb.error'> Attempt to extract a component of a value that is not a struct/class/union.:
, closeit=closeit@entry=1, flags=0x7ffffffe0578) at Python/pythonrun.c:1063
#38 0x00007ffff7af4600 in PyRun_SimpleFileExFlags (fp=fp@entry=0x555555621020, filename=<optimized out>, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:428
#39 0x00007ffff7af4995 in PyRun_AnyFileExFlags (fp=fp@entry=0x555555621020, filename=0x0, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:86
#40 0x00007ffff7c2c796 in pymain_run_file (cf=0x7ffffffe0578, config=0x55555557d8e0) at Modules/main.c:381
#41 pymain_run_python (exitcode=0x7ffffffe0570) at Modules/main.c:565
#42 Py_RunMain () at Modules/main.c:644
#43 0x00007ffff7c2c889 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:698
#44 0x00007ffff7dbd023 in __libc_start_main () at /usr/lib/libc.so.6
#45 0x000055555555505e in _start ()
(gdb) print (**(pthread_mutex_t **)($sp+0x10)).__data.__owner
2252813
(gdb) info thread
Id Target Id Frame
...
13 Thread 0x7ffef77fe700 (LWP 2252813) "python" 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 13
(gdb) backtrace
#0 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff7b0a9fb in PyCOND_TIMEDWAIT (us=<optimized out>, mut=0x7ffff7d759b0 <_PyRuntime+1232>, cond=0x7ffff7d75980 <_PyRuntime+1184>) at Python/condvar.h:73
#2 take_gil (tstate=0x7ffee80022b0, ceval=0x7ffff7d75728 <_PyRuntime+584>) at Python/ceval_gil.h:206
#3 PyEval_RestoreThread (tstate=0x7ffee80022b0) at Python/ceval.c:399
#4 0x00007ffff7a30b8d in PyGILState_Ensure () at Python/pystate.c:1298
#5 0x00007ffff7217d4c in _CallPythonObject (pArgs=0x7ffef77fdc50, flags=<optimized out>, converters=(<_ctypes.PyCSimpleType at remote 0x5555556405f0>,), callable=<function at remote 0x7ffef568e050>, setfunc=<optimized out>, restype=0x7ffff725b638, mem=0x7ffef77fddd0) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:145
#6 closure_fcn (cif=<optimized out>, resp=0x7ffef77fddd0, args=0x7ffef77fdc50, userdata=<optimized out>) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:297
#7 0x00007ffff71938c2 in () at /usr/lib/libffi.so.7
#8 0x00007ffff7193c20 in () at /usr/lib/libffi.so.7
#9 0x00007fffc69ba782 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#10 0x00007fffc624b0b7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#11 0x00007fffc69bb33e in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#12 0x00007fffc69bfeca in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#13 0x00007fffc62c1794 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#14 0x00007fff7df19b24 in std::execute_native_thread_routine(void*) (__p=0x555557d8b890) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#15 0x00007ffff799746f in start_thread () at /usr/lib/libpthread.so.0
#16 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
(gdb) print *(PyFunctionObject *)0x7ffef568e050
{
...
func_doc = 'C Callback for CustomOp::del',
func_name = 'delete_entry',
func_dict = 0x0,
func_weakreflist = 0x0,
func_module = 'mxnet.operator',
func_annotations = 0x0,
func_qualname = 'register.<locals>.do_register.<locals>.creator.<locals>.create_operator_entry.<locals>.delete_entry',
vectorcall = 0x7ffff7b4c750 <_PyFunction_Vectorcall>
}
(gdb) print/x ((PyThreadState*)_PyRuntime.ceval.gil.last_holder)->thread_id
0x7ffff7839740
(gdb) info thread
Id Target Id Frame
1 Thread 0x7ffff7839740 (LWP 2252671) "python" 0x00007ffff79a14cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
... |
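Reading the gdb session above: thread 1 is reported as the last GIL holder and sits in `__lll_lock_wait` on a mutex whose owner is LWP 2252813 (thread 13), while thread 13 is stuck in `take_gil` via `PyGILState_Ensure` inside the ctypes callback `delete_entry`. That looks like a classic lock-order inversion. Below is a minimal pure-Python sketch of the same pattern. The two `threading.Lock` objects, the names `gil` and `native_mutex`, and the helper `hold_then_want` are illustrative stand-ins, not MXNet or CPython code. The timeouts exist only so the demo terminates instead of hanging.

```python
import threading


def run_demo(timeout=0.2):
    gil = threading.Lock()           # hypothetical stand-in for the Python GIL
    native_mutex = threading.Lock()  # hypothetical stand-in for the mutex owned by LWP 2252813
    sync = threading.Barrier(2)      # forces the unlucky interleaving on every run
    results = {}

    def hold_then_want(name, held, wanted):
        # Take the first lock, then try to take the second one while still holding it.
        with held:
            sync.wait()  # both threads now hold their first lock
            got = wanted.acquire(timeout=timeout)
            results[name] = got
            sync.wait()  # keep holding until the other attempt has finished too
            if got:
                wanted.release()

    threads = [
        # Mirrors thread 1: holds the "GIL", wants the native mutex.
        threading.Thread(target=hold_then_want, args=("main", gil, native_mutex)),
        # Mirrors thread 13: holds the native mutex, wants the "GIL".
        threading.Thread(target=hold_then_want, args=("worker", native_mutex, gil)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_demo()
print(results)
```

Neither thread can get its second lock while the other one still holds it, so both attempts time out. Without the timeouts both threads would block forever, which matches the observed hang. The usual ways out of this pattern are to release the first lock before taking the second, or to always acquire the locks in one consistent order.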
Just FYI, I am currently not working on a fix for this. Unfortunately, I am very busy and will probably remain so for quite a while, and I am also not very familiar with the relevant code. I just wanted to point that out in case I created a false expectation that I am working on this issue and that nobody else needs to come in and fix it. |
) These tests are prone to triggering a deadlock. See apache#18090 apache#18144
Description
There is currently some weird behaviour with unix-gpu CI jobs, where the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs.
Until today, I thought that this was caused by limited available GPU executors, and that the jobs were getting aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).
However, I've noticed a few weird, consistent things about these aborted jobs, so I wanted to make sure that the current behaviour is intentional.
unix-gpu getting aborted, it's almost always in a situation where all the build steps and all the other tests were completed, but there is just a single Python 3: GPU or Python 3: GPU (TVM_OP OFF) test step that was aborted.
"test_operator_gpu.test_np_empty ... ok" at 14:58 and "Sending interrupt signal to process" at 17:33.
Occurrences
Here are the 2 examples from the screenshots:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline
and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline
I am aware that we can just restart the job via mxnet-bot, but this is annoying since the job takes a long time to complete even without this issue. Can somebody clarify whether unix-gpu CI jobs getting aborted is intentional (and what the current policy on aborting CI jobs is)?
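Since the aborted step only shows the last passing test and then the interrupt hours later, it can help to make a hung test process report where it is stuck. One possible approach, sketched here under the assumption that the test process is plain Python, is the standard library's `faulthandler.dump_traceback_later`, which dumps the tracebacks of all threads after a timeout from a watchdog thread. The child script and the one-second timeout are made up for the demo; a real CI setup would pick its own values.

```python
import subprocess
import sys
import textwrap

# Hypothetical child script: arms the watchdog, then blocks forever to simulate a hang.
child_source = textwrap.dedent("""
    import faulthandler
    import threading

    # After 1 second without cancellation, dump all thread tracebacks to stderr and exit.
    faulthandler.dump_traceback_later(1, exit=True)
    threading.Event().wait()  # never set, so this blocks until the watchdog fires
""")

proc = subprocess.run(
    [sys.executable, "-c", child_source],
    capture_output=True,
    text=True,
    timeout=30,  # safety net for the demo itself
)
print(proc.returncode)
print(proc.stderr)
```

The dump only covers Python-level frames, so for a hang inside native code it narrows down the calling test but does not replace a gdb session like the one in the comments.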