Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

derekbruening · 2017-12-18T19:35:31Z

Quoting from #2762

I have a unit test where there's a small chance that the thread doing
detach will send the SUSPEND_SIGNAL to a newly created thread before its
dcontext's signal_field is fully_initialized; the new thread discards the
signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its
signal_field, but the thread doing detach currently never retries to send
the signal. So, this is kind of a workaround for a specific instance of
#26 that results in DR deadlocking on itself.

Detach gets its list of threads from DR's internal list, and a new thread
only adds itself while holding thread_initexit_lock in the thread init
sequence where it initializes everything else, including the signal
field. But: signal_thread_inherit is split off and called later, after the
thread_initexit_lock has been released, and that's where the signal_field
is fully_initialized. This leads to this race.

This issue covers trying to solve the race, either by pulling back the
final signal init under the lock, or through some other means.

Xref #2270

derekbruening · 2018-05-21T19:51:29Z

The timeout added in #2762 breaks the assumptions of the UNIX suspend process:

    /* If suspend_count is 0, we are not trying to suspend this thread
     * (os_thread_resume() may have already decremented suspend_count to 0, but
     * os_thread_suspend() will not send a signal until this thread unsets
     * ostd->suspended, so not having a lock around the suspend_count read is
     * ok), so pass signal to app.
     * If we are trying or have already suspended this thread, our own
     * os_thread_suspend() will not send a 2nd suspend signal until we are
     * completely resumed, so we can distinguish app uses of SUSPEND_SIGNAL.  We
     * can't have a race between the read and write of suspended_sigcxt b/c
     * signals are blocked.  It's fine to have a race and reorder the app's
     * signal w/ DR's.
     */

The "will not send a 2nd suspend signal" is no longer true now that we have
a timeout in os_thread_suspend. The timeout makes everything racy and
can cause all kinds of problems without accompanying structural changes
in the suspend process.

The simplest fix looks like solving the race covered by this issue by refactoring
the code to pull signal_thread_inherite() back into signal_thread_init() and then
removing the timeout.

derekbruening · 2018-05-23T01:55:49Z

Another issue is that handle_suspend_signal() relaxes its signal mask too early, which can result in a hang like so:

suspender 124572 timed out suspending 124575

Thread 7 (Thread 0x7f15e5643700 (LWP 124575)):
#2  0x0000556b60db85a4 in ksynch_wait (futex=0x556b637490c0 <thread_initexit_lock>, mustbe=1,
    timeout_ms=0) at core/unix/ksynch_linux.c:125
#3  0x0000556b60dee0c2 in mutex_wait_contended_lock (lock=0x556b637490c0 <thread_initexit_lock>,
    mc=0x0) at core/unix/os.c:9540
#4  0x0000556b60e1f799 in mutex_lock_app (lock=0x556b637490c0 <thread_initexit_lock>, mc=0x0)
    at core/utils.c:904
#5  0x0000556b60e1d199 in mutex_lock (lock=0x556b637490c0 <thread_initexit_lock>)
    at core/utils.c:917
#6  0x0000556b60e0f18b in translate_sigcontext (dcontext=0x556b3e8ef440, uc=0x556b3e972140,
    avoid_failure=0 '\000', f=0x0) at core/unix/signal.c:2476
#7  0x0000556b60e0411b in record_pending_signal (dcontext=0x556b3e8ef440, sig=27,
    ucxt=0x556b3e972140, frame=0x556b3e972138, forged=0 '\000', access_address=0x0)
    at core/unix/signal.c:4160
#8  0x0000556b60dfe415 in master_signal_handler_C (sig=27, siginfo=0x556b3e972270,
    ucxt=0x556b3e972140, xsp=0x556b3e972138 "\363;\357`kU")
    at core/unix/signal.c:5074
#9  0x0000556b60ef3bf3 in xfer_to_new_libdr () at base/internal/logging.h:853
#10 0x0000000000000007 in ?? ()

(gdb) x/4i ucxt->uc_mcontext.rip-2
   0x556b60ef4078 <syscall_ready+3>:    syscall
   0x556b60ef407a <syscall_ready+5>:    pop    %rbx

    rbp = 0x556b3e972730,

(gdb) dps ucxt->uc_mcontext.rsp ucxt->uc_mcontext.rsp+320
0x0000556b3e972700  0x0000000000000002  No symbol matches (void *)$retaddr.
0x0000556b3e972708  0x0000556b60df2fb7  sigprocmask_syscall + 55
0x0000556b3e972710  0x0000000000000008  No symbol matches (void *)$retaddr.
0x0000556b3e972718  0x0000556b3e9727a0  No symbol matches (void *)$retaddr.
0x0000556b3e972720  0x0000556b3e972be8  No symbol matches (void *)$retaddr.
0x0000556b3e972728  0x000000023e972790  No symbol matches (void *)$retaddr.
0x0000556b3e972730  0x0000556b3e9727d0  No symbol matches (void *)$retaddr.
0x0000556b3e972738  0x0000556b60e05b8d  handle_suspend_signal + 973
0x0000556b3e972740  0x0000000000000000  No symbol matches (void *)$retaddr.
(gdb) x/3i 0x0000556b60e05b8d-5
   0x556b60e05b88 <handle_suspend_signal+968>:  callq  0x556b60df2f80 <sigprocmask_syscall>

Refactors signal_thread_inherit() to be called from a new routine os_thread_init_finalize() which is invoked while holding thread_initexit_lock yet after synch_thread_init(). This eliminates races with suspend signals arriving in newly half-initialized threads, which then drop the signals. The refactoring rearranges several thread initialization sequences to pass the clone record through dynamo_thread_init(). This refactoring allows us to revert the os_thread_suspend timeout from commit 972cddf PR #2762 which added a timeout to os_thread_suspend that turns out to not be safe on UNIX as the suspend model assumes there is no retry. Delays mask relaxing in handle_suspend_signal() to avoid timeout on suspend due to an intervening signal. Includes tweaks to an i#3020-related assert and i#2993-related alarm lock retry which got in the way of testing the final solution here. Tested by running thread creating apps that attach and detach many times, similar to the static burst tests in our suite. Issue: #3020, #2993 Fixes: #2779

derekbruening added Bug-Hang OpSys-UNIX labels Dec 18, 2017

derekbruening mentioned this issue May 21, 2018

i#2157 re-attach: Add timeout to os_thread_suspend #2762

Merged

derekbruening self-assigned this May 21, 2018

derekbruening mentioned this issue May 23, 2018

i#2779: fix thread creation signal race #3021

Merged

derekbruening closed this as completed in #3021 May 23, 2018

derekbruening mentioned this issue May 24, 2018

hang during detach due to xl8 of signal arriving while suspended #3026

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

derekbruening commented Dec 18, 2017

derekbruening commented May 21, 2018

derekbruening commented May 23, 2018

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

Comments

derekbruening commented Dec 18, 2017

derekbruening commented May 21, 2018

derekbruening commented May 23, 2018