Skip to content

Deadlock between DbgCtl and PCRE #10129

@bneradt

Description

@bneradt

The TSVConnFd and polite_hook_wait CI tests frequently fail in CI due to ATS initialization hanging. I reproduced this manually and observe that the hang is due to a deadlock between DbgCtl and PCRE which seems to acquire the dl_load_lock as a part of its implementation. I'm attaching the full backtrace across the threads: TSVConnFd_deadlock.txt

The relevant threads seem to be the following:

Thread 10: Holds DbgCtl::Registry::_mtx, awaits dl_load_lock

  Thread 10 (Thread 0x7ffbf52646c0 (LWP 55868) "[LOG_FLUSH]"):
  #0  futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
  #1  __GI___lll_lock_wait (futex=futex@entry=0x7ffbf73a3a08 <_rtld_local+2568>, private=0) at lowlevellock.c:49
  #2  0x00007ffbf65b0e77 in lll_mutex_lock_optimized (mutex=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:48
  #3  ___pthread_mutex_lock (mutex=mutex@entry=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:128
  #4  0x00007ffbf6560daa in __cxa_thread_atexit_impl (func=<optimized out>, obj=0x7ffbf52646a0, dso_symbol=0xd48008) at cxa_thread_atexit_impl.c:117
  #5  0x00007ffbf68b95dd in __cxxabiv1::__cxa_thread_atexit (dtor=<optimized out>, obj=<optimized out>, dso_handle=<optimized out>) at ../../../../libstdc++-v3/libsupc++/atexit_thread.cc:69
  #6  0x0000000000d15a20 in get_jit_stack (data=0x0) at Regex.cc:43                                      
  #7  0x00007ffbf7121965 in _pcre_jit_exec (offset_count=2, offsets=0x7ffbf5262f10, options=0, start_offset=0, length=12, subject=0xdb58d7 "log-periodic", extra_data=0x2a44150) at /usr/src/debug/pcre-8.45-1.fc3
  8.3.x86_64/pcre_jit_compile.c:11650                                                                    
  #8  pcre_exec (argument_re=0x2a440e0, extra_data=0x2a44150, subject=<optimized out>, length=12, start_offset=0, options=<optimized out>, offsets=0x7ffbf5262f10, offsetcount=30) at /usr/src/debug/pcre-8.45-1.f
  c38.3.x86_64/pcre_exec.c:6489                                                                                              
  #9  0x0000000000d15c64 in Regex::exec (this=0x2a44300, str="log-periodic", ovector=0x7ffbf5262f10, ovecsize=30) at Regex.cc:119                                          
  #10 0x0000000000d15bfe in Regex::exec (this=0x2a44300, str="log-periodic") at Regex.cc:111                               
  #11 0x0000000000d15fae in DFA::match (this=0x2a440c0, str="log-periodic") at Regex.cc:188
  #12 0x0000000000ceb76d in Diags::tag_activated (this=0x2933e00, tag=0xdb58d7 "log-periodic", mode=DiagsTagType_Debug) at Diags.cc:379
  #13 0x0000000000ce9632 in DbgCtl::_new_reference (tag=0xdb58d7 "log-periodic") at DbgCtl.cc:158
  #14 0x000000000085b76c in DbgCtl::DbgCtl (this=0x1001cf0 <Log::periodic_tasks(long)::Debug_ctl>, tag=0xdb58d7 "log-periodic") at ../../include/tscore/DbgCtl.h:36
  #15 0x0000000000a63984 in Log::periodic_tasks (time_now=1690907172) at Log.cc:224  
  #16 0x0000000000a6ed19 in Log::flush_thread_main () at Log.cc:1415
  #17 0x0000000000a6f95d in LoggingFlushContinuation::mainEvent (this=0x3294f80) at Log.cc:284
  #18 0x0000000000863ac9 in Continuation::handleEvent (this=0x3294f80, event=1, data=0x3294fe0) at /var/tmp/trafficserver/iocore/eventsystem/I_Continuation.h:228
  #19 0x0000000000cdf621 in EThread::execute (this=0x3295050) at UnixEThread.cc:318                                                                                                                      
  #20 0x0000000000cddf66 in spawn_thread_internal (a=0x33a2040) at Thread.cc:79                                              
  #21 0x00007ffbf65ad907 in start_thread (arg=<optimized out>) at pthread_create.c:444                                                                                     
  #22 0x00007ffbf6633774 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100                                      

Thread 1: holds dl_load_lock, awaits DbgCtl::Registry::_mtx

  Thread 1 (Thread 0x7ffbf651d340 (LWP 55858) "traffic_server"):                                                                                                  
  #0  futex_wait (private=0, expected=2, futex_word=0x28c29b0) at ../sysdeps/nptl/futex-internal.h:146        
  #1  __GI___lll_lock_wait (futex=futex@entry=0x28c29b0, private=0) at lowlevellock.c:49
  #2  0x00007ffbf65b0e21 in lll_mutex_lock_optimized (mutex=0x28c29b0) at pthread_mutex_lock.c:48                                      
  #3  ___pthread_mutex_lock (mutex=0x28c29b0) at pthread_mutex_lock.c:93                                                                                                                   
  #4  0x000000000086e5e9 in __gthread_mutex_lock (__mutex=0x28c29b0) at /usr/include/c++/13/x86_64-redhat-linux/bits/gthr-default.h:749 
  #5  0x000000000086e868 in std::mutex::lock (this=0x28c29b0) at /usr/include/c++/13/bits/std_mutex.h:113
  #6  0x0000000000ce9aad in DbgCtl::_RegistryAccessor::_RegistryAccessor (this=0x7fff091f43bf) at DbgCtl.cc:84
  #7  0x0000000000ce953f in DbgCtl::_new_reference (tag=0x7ffbf5163828 "TSVConnFd") at DbgCtl.cc:141                        
  #8  0x000000000089ccb8 in TSDbgCtlCreate (tag=0x7ffbf5163828 "TSVConnFd") at traffic_server/InkAPI.cc:10136                                                         
  #9  0x00007ffbf515b8d4 in __static_initialization_and_destruction_0 () at gold_tests/pluginTest/TSVConnFd/TSVConnFd.cc:449                                         
  #10 0x00007ffbf515b9f2 in _GLOBAL__sub_I_TSVConnFd.cc(void) () at gold_tests/pluginTest/TSVConnFd/TSVConnFd.cc:864                                                    
  #11 0x00007ffbf737517f in call_init (env=0x28c12b0, argv=0x7fff091f67d8, argc=2, l=<optimized out>) at dl-init.c:70                                        
  #12 call_init (l=<optimized out>, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-init.c:26                                                                                            
  #13 0x00007ffbf737527d in _dl_init (main_map=0x33a2560, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-init.c:117                                                                
  #14 0x00007ffbf73715c2 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x7ffbf737bea0 <call_dl_init>, args=args@entry=0x7fff091f4610) at dl-catch.c:211                     
  #15 0x00007ffbf737be3c in dl_open_worker (a=a@entry=0x7fff091f47c0) at dl-open.c:808                                                 
  #16 0x00007ffbf7371523 in __GI__dl_catch_exception (exception=exception@entry=0x7fff091f47a0, operate=operate@entry=0x7ffbf737bda0 <dl_open_worker>, args=args@entry=0x7fff091f47c0) at dl-catch.c:237
  #17 0x00007ffbf737c1b4 in _dl_open (file=0x7fff091f4b70 "/tmp/sb/TSVConnFd/ts/plugin/TSVConnFd.so", mode=<optimized out>, caller_dlopen=0xbcc915 <plugin_dso_load(char const*, void*&, void*&, std::__cxx11::bas
  ic_string<char, std::char_traits<char>, std::allocator<char> >&)+44>, nsid=<optimized out>, argc=2, argv=0x7fff091f67d8, env=0x28c12b0) at dl-open.c:884                  
  #18 0x00007ffbf65a96d4 in dlopen_doit (a=a@entry=0x7fff091f4a70) at dlopen.c:56                                                                                                        
  #19 0x00007ffbf7371523 in __GI__dl_catch_exception (exception=exception@entry=0x7fff091f49b0, operate=0x7ffbf65a9670 <dlopen_doit>, args=0x7fff091f4a70) at dl-catch.c:237
  #20 0x00007ffbf7371679 in _dl_catch_error (objname=0x7fff091f4a18, errstring=0x7fff091f4a20, mallocedp=0x7fff091f4a17, operate=<optimized out>, args=<optimized out>) at dl-catch.c:256
  #21 0x00007ffbf65a91b3 in _dlerror_run (operate=operate@entry=0x7ffbf65a9670 <dlopen_doit>, args=args@entry=0x7fff091f4a70) at dlerror.c:138
  #22 0x00007ffbf65a978f in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at dlopen.c:71
  #23 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81                                                                                                 
  #24 0x0000000000bcc915 in plugin_dso_load (path=0x7fff091f4b70 "/tmp/sb/TSVConnFd/ts/plugin/TSVConnFd.so", handle=@0x7fff091f4b48: 0xd1159e <Layout::relative[abi:cxx11](std::basic_string_view<char, std::char_
  traits<char> >)+80>, init=@0x7fff091f4b40: 0x0, error="") at Plugin.cc:105 
  #25 0x0000000000bcce9c in single_plugin_init (argc=3, argv=0x7fff091f5dd0, validateOnly=false) at Plugin.cc:153
  #26 0x0000000000bcdbe3 in plugin_init (validateOnly=false) at Plugin.cc:351
  #27 0x00000000008c3e07 in main (argv=0x7fff091f67d8) at traffic_server/traffic_server.cc:2160                                                             

This has a helpful way to find which thread is holding a mutex that another thread is waiting upon:
https://stackoverflow.com/a/3491304/629530

Thus, thread 10 above is waiting upon some lock. Who owns it?

(gdb) thread 10
[Switching to thread 10 (Thread 0x7ffbf52646c0 (LWP 55868))]
#0  futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb) f 3
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7ffbf73a3a08 <_rtld_local+2568>) at pthread_mutex_lock.c:128
128           LLL_MUTEX_LOCK_OPTIMIZED (mutex);
(gdb) p *mutex
$3 = {__data = {__lock = 2, __count = 1, __owner = 55858, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\001\000\000\0002\332\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
(gdb) thread find 55858
Thread 1 has target id 'Thread 0x7ffbf651d340 (LWP 55858)'
(gdb) 

Thus, thread 1 has the mutex which thread 10 is waiting upon. Thread 1 is doing the dlopen of the TSVConnFd plugin and is processing DbgCtl. Inspecting thread 10 further, it looks like the lock it is waiting upon is the dl_load_lock:

(gdb) thread 10

[Switching to thread 10 (Thread 0x7ffbf52646c0 (LWP 55868))]
#0  futex_wait (private=0, expected=2, futex_word=0x7ffbf73a3a08 <_rtld_local+2568>) at ../sysdeps/nptl/futex-internal.h:146
146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb) f 4

#4  0x00007ffbf6560daa in __cxa_thread_atexit_impl (func=<optimized out>, obj=0x7ffbf52646a0, dso_symbol=0xd48008) at cxa_thread_atexit_impl.c:117
117       __rtld_lock_lock_recursive (GL(dl_load_lock));   <-----------------

dl_load_lock doesn't seem to be a publicly published part of dlopen, but this thread mentions it as a lock associated with dlopen. I'm pretty sure we're seeing a deadlock in thread 10 with the dlopen associated with loading the TSVConnFD plugin in thread 1.

To be explicit, this seems to be what is happening:

  1. Thread 1 grabs a dl_load_lock implicitly via dlopen while loading the TSVConnFd.so plugin at traffic_server.cc#L2160
  2. Thread 10 grabs the DbgCtl::Registry::_mtx while processing the periodic flushing thread to log a debug message at Log.cc#L224
  3. Thread 1 tries to grab DbgCtl::Registry::_mtx to initialize TSVConnFd's DbgCtl static object at TSVConnFd.cc#L449. This DbgCtl lock will never be released because it is held by thread 10.
  4. Thread 10 tries to grab the dl_load_lock while initializing the PCRE JIT stack at Regex.cc#L43. This will block forever because thread 1 holds it until thread 10 releases the DbgCtl::Registry::_mtx.

Metadata

Metadata

Assignees

Labels

DebugSupport for system debugging

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions