Description
The TSVConnFd and polite_hook_wait tests frequently fail in CI because ATS initialization hangs. I reproduced this manually and observed that the hang is a deadlock between DbgCtl and PCRE, which appears to acquire the dl_load_lock as part of its implementation. I'm attaching the full backtrace across all threads: TSVConnFd_deadlock.txt
The relevant threads seem to be the following:
Thread 10: holds DbgCtl::Registry::_mtx, awaits dl_load_lock
```
Thread 10 (Thread 0x7ffbf52646c0 (LWP 55868) "[LOG_FLUSH]"):
#0 futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7ffbf73a3a08 <_rtld_local+2568>, private=0) at lowlevellock.c:49
#2 0x00007ffbf65b0e77 in lll_mutex_lock_optimized (mutex=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=mutex@entry=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:128
#4 0x00007ffbf6560daa in __cxa_thread_atexit_impl (func=<optimized out>, obj=0x7ffbf52646a0, dso_symbol=0xd48008) at cxa_thread_atexit_impl.c:117
#5 0x00007ffbf68b95dd in __cxxabiv1::__cxa_thread_atexit (dtor=<optimized out>, obj=<optimized out>, dso_handle=<optimized out>) at ../../../../libstdc++-v3/libsupc++/atexit_thread.cc:69
#6 0x0000000000d15a20 in get_jit_stack (data=0x0) at Regex.cc:43
#7 0x00007ffbf7121965 in _pcre_jit_exec (offset_count=2, offsets=0x7ffbf5262f10, options=0, start_offset=0, length=12, subject=0xdb58d7 "log-periodic", extra_data=0x2a44150) at /usr/src/debug/pcre-8.45-1.fc38.3.x86_64/pcre_jit_compile.c:11650
#8 pcre_exec (argument_re=0x2a440e0, extra_data=0x2a44150, subject=<optimized out>, length=12, start_offset=0, options=<optimized out>, offsets=0x7ffbf5262f10, offsetcount=30) at /usr/src/debug/pcre-8.45-1.fc38.3.x86_64/pcre_exec.c:6489
#9 0x0000000000d15c64 in Regex::exec (this=0x2a44300, str="log-periodic", ovector=0x7ffbf5262f10, ovecsize=30) at Regex.cc:119
#10 0x0000000000d15bfe in Regex::exec (this=0x2a44300, str="log-periodic") at Regex.cc:111
#11 0x0000000000d15fae in DFA::match (this=0x2a440c0, str="log-periodic") at Regex.cc:188
#12 0x0000000000ceb76d in Diags::tag_activated (this=0x2933e00, tag=0xdb58d7 "log-periodic", mode=DiagsTagType_Debug) at Diags.cc:379
#13 0x0000000000ce9632 in DbgCtl::_new_reference (tag=0xdb58d7 "log-periodic") at DbgCtl.cc:158
#14 0x000000000085b76c in DbgCtl::DbgCtl (this=0x1001cf0 <Log::periodic_tasks(long)::Debug_ctl>, tag=0xdb58d7 "log-periodic") at ../../include/tscore/DbgCtl.h:36
#15 0x0000000000a63984 in Log::periodic_tasks (time_now=1690907172) at Log.cc:224
#16 0x0000000000a6ed19 in Log::flush_thread_main () at Log.cc:1415
#17 0x0000000000a6f95d in LoggingFlushContinuation::mainEvent (this=0x3294f80) at Log.cc:284
#18 0x0000000000863ac9 in Continuation::handleEvent (this=0x3294f80, event=1, data=0x3294fe0) at /var/tmp/trafficserver/iocore/eventsystem/I_Continuation.h:228
#19 0x0000000000cdf621 in EThread::execute (this=0x3295050) at UnixEThread.cc:318
#20 0x0000000000cddf66 in spawn_thread_internal (a=0x33a2040) at Thread.cc:79
#21 0x00007ffbf65ad907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#22 0x00007ffbf6633774 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
```
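The frame that drags dl_load_lock into this is #6: get_jit_stack (Regex.cc:43) goes through __cxxabiv1::__cxa_thread_atexit, which is what the compiler emits when a thread_local object with a non-trivial destructor is first constructed, and glibc's __cxa_thread_atexit_impl takes dl_load_lock. A minimal sketch of that pattern (not the actual Regex.cc code; the wrapper type and stack sizes here are made up for illustration):

```cpp
// Sketch only: illustrates how a function-local thread_local with a destructor
// ends up calling __cxa_thread_atexit (and thus touching dl_load_lock).
#include <pcre.h>

namespace
{
// Hypothetical holder type; only the non-trivial destructor matters here.
struct JitStackHolder {
  pcre_jit_stack *stack = pcre_jit_stack_alloc(8192, 64 * 1024);
  ~JitStackHolder()
  {
    if (stack) {
      pcre_jit_stack_free(stack);
    }
  }
};
} // namespace

pcre_jit_stack *
get_jit_stack(void * /* data */)
{
  // First construction of this thread_local registers ~JitStackHolder via
  // __cxa_thread_atexit, which in glibc acquires dl_load_lock.
  thread_local JitStackHolder holder;
  return holder.stack;
}
```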
Thread 1: holds dl_load_lock, awaits DbgCtl::Registry::_mtx
```
Thread 1 (Thread 0x7ffbf651d340 (LWP 55858) "traffic_server"):
#0 futex_wait (private=0, expected=2, futex_word=0x28c29b0) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x28c29b0, private=0) at lowlevellock.c:49
#2 0x00007ffbf65b0e21 in lll_mutex_lock_optimized (mutex=0x28c29b0) at pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x28c29b0) at pthread_mutex_lock.c:93
#4 0x000000000086e5e9 in __gthread_mutex_lock (__mutex=0x28c29b0) at /usr/include/c++/13/x86_64-redhat-linux/bits/gthr-default.h:749
#5 0x000000000086e868 in std::mutex::lock (this=0x28c29b0) at /usr/include/c++/13/bits/std_mutex.h:113
#6 0x0000000000ce9aad in DbgCtl::_RegistryAccessor::_RegistryAccessor (this=0x7fff091f43bf) at DbgCtl.cc:84
#7 0x0000000000ce953f in DbgCtl::_new_reference (tag=0x7ffbf5163828 "TSVConnFd") at DbgCtl.cc:141
#8 0x000000000089ccb8 in TSDbgCtlCreate (tag=0x7ffbf5163828 "TSVConnFd") at traffic_server/InkAPI.cc:10136
#9 0x00007ffbf515b8d4 in __static_initialization_and_destruction_0 () at gold_tests/pluginTest/TSVConnFd/TSVConnFd.cc:449
#10 0x00007ffbf515b9f2 in _GLOBAL__sub_I_TSVConnFd.cc(void) () at gold_tests/pluginTest/TSVConnFd/TSVConnFd.cc:864
#11 0x00007ffbf737517f in call_init (env=0x28c12b0, argv=0x7fff091f67d8, argc=2, l=<optimized out>) at dl-init.c:70
#12 call_init (l=<optimized out>, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-init.c:26
#13 0x00007ffbf737527d in _dl_init (main_map=0x33a2560, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-init.c:117
#14 0x00007ffbf73715c2 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x7ffbf737bea0 <call_dl_init>, args=args@entry=0x7fff091f4610) at dl-catch.c:211
#15 0x00007ffbf737be3c in dl_open_worker (a=a@entry=0x7fff091f47c0) at dl-open.c:808
#16 0x00007ffbf7371523 in __GI__dl_catch_exception (exception=exception@entry=0x7fff091f47a0, operate=operate@entry=0x7ffbf737bda0 <dl_open_worker>, args=args@entry=0x7fff091f47c0) at dl-catch.c:237
#17 0x00007ffbf737c1b4 in _dl_open (file=0x7fff091f4b70 "/tmp/sb/TSVConnFd/ts/plugin/TSVConnFd.so", mode=<optimized out>, caller_dlopen=0xbcc915 <plugin_dso_load(char const*, void*&, void*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+44>, nsid=<optimized out>, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-open.c:884
#18 0x00007ffbf65a96d4 in dlopen_doit (a=a@entry=0x7fff091f4a70) at dlopen.c:56
#19 0x00007ffbf7371523 in __GI__dl_catch_exception (exception=exception@entry=0x7fff091f49b0, operate=0x7ffbf65a9670 <dlopen_doit>, args=0x7fff091f4a70) at dl-catch.c:237
#20 0x00007ffbf7371679 in _dl_catch_error (objname=0x7fff091f4a18, errstring=0x7fff091f4a20, mallocedp=0x7fff091f4a17, operate=<optimized out>, args=<optimized out>) at dl-catch.c:256
#21 0x00007ffbf65a91b3 in _dlerror_run (operate=operate@entry=0x7ffbf65a9670 <dlopen_doit>, args=args@entry=0x7fff091f4a70) at dlerror.c:138
#22 0x00007ffbf65a978f in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at dlopen.c:71
#23 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#24 0x0000000000bcc915 in plugin_dso_load (path=0x7fff091f4b70 "/tmp/sb/TSVConnFd/ts/plugin/TSVConnFd.so", handle=@0x7fff091f4b48: 0xd1159e <Layout::relative[abi:cxx11](std::basic_string_view<char, std::char_traits<char> >)+80>, init=@0x7fff091f4b40: 0x0, error="") at Plugin.cc:105
#25 0x0000000000bcce9c in single_plugin_init (argc=3, argv=0x7fff091f5dd0, validateOnly=false) at Plugin.cc:153
#26 0x0000000000bcdbe3 in plugin_init (validateOnly=false) at Plugin.cc:351
#27 0x00000000008c3e07 in main (argv=0x7fff091f67d8) at traffic_server/traffic_server.cc:2160
```
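Frames #8-#10 show that the plugin's static initializers call TSDbgCtlCreate from inside _dl_init, i.e. while the dlopen in plugin_dso_load still holds dl_load_lock. A simplified sketch of the plugin-side pattern (not the literal TSVConnFd.cc code; names here are illustrative):

```cpp
// Sketch only: a namespace-scope variable initialized via TSDbgCtlCreate().
// Its initializer runs from _GLOBAL__sub_I_* during _dl_init(), so it needs
// DbgCtl::Registry::_mtx while dlopen() still holds dl_load_lock.
#include <ts/ts.h>

namespace
{
TSDbgCtl const *const dbg_ctl = TSDbgCtlCreate("TSVConnFd");
} // namespace

void
TSPluginInit(int /* argc */, const char ** /* argv */)
{
  TSDbg(dbg_ctl, "plugin loaded");
}
```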
This Stack Overflow answer describes a helpful way to find which thread is holding a mutex that another thread is waiting upon:
https://stackoverflow.com/a/3491304/629530
Thus, thread 10 above is waiting upon some lock. Who owns it?
```
(gdb) thread 10
[Switching to thread 10 (Thread 0x7ffbf52646c0 (LWP 55868))]
#0 futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
146 int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb) f 3
#3 ___pthread_mutex_lock (mutex=mutex@entry=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:128
128 LLL_MUTEX_LOCK_OPTIMIZED (mutex);
(gdb) p *mutex
$3 = {__data = {__lock = 2, __count = 1, __owner = 55858, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\001\000\000\0002\332\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
(gdb) thread find 55858
Thread 1 has target id 'Thread 0x7ffbf651d340 (LWP 55858)'
(gdb)
```
Thus, thread 1 (LWP 55858, matching the __owner field in the output above) holds the mutex that thread 10 is waiting upon. Thread 1 is doing the dlopen of the TSVConnFd plugin and is processing DbgCtl. Inspecting thread 10 further, the lock it is waiting upon is the dl_load_lock:
```
(gdb) thread 10
[Switching to thread 10 (Thread 0x7ffbf52646c0 (LWP 55868))]
#0 futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
146 int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb) f 4
#4 0x00007ffbf6560daa in __cxa_thread_atexit_impl (func=<optimized out>, obj=0x7ffbf52646a0, dso_symbol=0xd48008) at cxa_thread_atexit_impl.c:117
117 __rtld_lock_lock_recursive (GL(dl_load_lock)); <-----------------
```
dl_load_lock doesn't seem to be a publicly documented part of dlopen, but this thread mentions it as a lock associated with dlopen. I'm pretty sure we're seeing a deadlock between thread 10 and the dlopen that loads the TSVConnFd plugin in thread 1.
To be explicit, this seems to be what is happening:
- Thread 1 grabs the `dl_load_lock` implicitly via dlopen while loading the `TSVConnFd.so` plugin at traffic_server.cc#L2160.
- Thread 10 grabs the `DbgCtl::Registry::_mtx` in the periodic log flushing thread while logging a debug message at Log.cc#L224.
- Thread 1 tries to grab the `DbgCtl::Registry::_mtx` to initialize `TSVConnFd`'s DbgCtl static object at TSVConnFd.cc#L449. This lock will never be released because it is held by thread 10.
- Thread 10 tries to grab the `dl_load_lock` while initializing the PCRE JIT stack at Regex.cc#L43. This will block forever because thread 1 holds it until thread 10 releases the `DbgCtl::Registry::_mtx`.
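Stripped of the ATS and glibc specifics, this is a classic lock-order inversion. A standalone sketch (plain std::mutex objects standing in for dl_load_lock and DbgCtl::Registry::_mtx, with contrived sleeps to force the interleaving) hangs the same way:

```cpp
// Standalone illustration of the lock-order inversion described above.
// mtx_a stands in for glibc's dl_load_lock, mtx_b for DbgCtl::Registry::_mtx.
// Running this will (almost always) hang, just like traffic_server does.
#include <chrono>
#include <mutex>
#include <thread>

std::mutex mtx_a; // "dl_load_lock"
std::mutex mtx_b; // "DbgCtl::Registry::_mtx"

int
main()
{
  // Thread 1 analogue: dlopen() takes dl_load_lock, then the plugin's static
  // initializer tries to take the DbgCtl registry mutex.
  std::thread t1([] {
    std::lock_guard<std::mutex> a(mtx_a);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> b(mtx_b); // blocks forever
  });

  // Thread 10 analogue: DbgCtl takes the registry mutex, then the thread_local
  // PCRE JIT stack registration tries to take dl_load_lock.
  std::thread t10([] {
    std::lock_guard<std::mutex> b(mtx_b);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> a(mtx_a); // blocks forever
  });

  t1.join();
  t10.join();
}
```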