Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

Closed
derekbruening opened this issue Dec 18, 2017 · 2 comments
Closed

Comments

@derekbruening
Copy link
Contributor

Quoting from #2762

I have a unit test where there's a small chance that the thread doing
detach will send the SUSPEND_SIGNAL to a newly created thread before its
dcontext's signal_field is fully_initialized; the new thread discards the
signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its
signal_field, but the thread doing detach currently never retries to send
the signal. So, this is kind of a workaround for a specific instance of
#26 that results in DR deadlocking on itself.

Detach gets its list of threads from DR's internal list, and a new thread
only adds itself while holding thread_initexit_lock in the thread init
sequence where it initializes everything else, including the signal
field. But: signal_thread_inherit is split off and called later, after the
thread_initexit_lock has been released, and that's where the signal_field
is fully_initialized. This leads to this race.

This issue covers trying to solve the race, either by pulling back the
final signal init under the lock, or through some other means.

Xref #2270

@derekbruening
Copy link
Contributor Author

The timeout added in #2762 breaks the assumptions of the UNIX suspend process:

    /* If suspend_count is 0, we are not trying to suspend this thread
     * (os_thread_resume() may have already decremented suspend_count to 0, but
     * os_thread_suspend() will not send a signal until this thread unsets
     * ostd->suspended, so not having a lock around the suspend_count read is
     * ok), so pass signal to app.
     * If we are trying or have already suspended this thread, our own
     * os_thread_suspend() will not send a 2nd suspend signal until we are
     * completely resumed, so we can distinguish app uses of SUSPEND_SIGNAL.  We
     * can't have a race between the read and write of suspended_sigcxt b/c
     * signals are blocked.  It's fine to have a race and reorder the app's
     * signal w/ DR's.
     */

The "will not send a 2nd suspend signal" is no longer true now that we have
a timeout in os_thread_suspend. The timeout makes everything racy and
can cause all kinds of problems without accompanying structural changes
in the suspend process.

The simplest fix looks like solving the race covered by this issue by refactoring
the code to pull signal_thread_inherite() back into signal_thread_init() and then
removing the timeout.

@derekbruening
Copy link
Contributor Author

Another issue is that handle_suspend_signal() relaxes its signal mask too early, which can result in a hang like so:

suspender 124572 timed out suspending 124575

Thread 7 (Thread 0x7f15e5643700 (LWP 124575)):
#2  0x0000556b60db85a4 in ksynch_wait (futex=0x556b637490c0 <thread_initexit_lock>, mustbe=1,
    timeout_ms=0) at core/unix/ksynch_linux.c:125
#3  0x0000556b60dee0c2 in mutex_wait_contended_lock (lock=0x556b637490c0 <thread_initexit_lock>,
    mc=0x0) at core/unix/os.c:9540
#4  0x0000556b60e1f799 in mutex_lock_app (lock=0x556b637490c0 <thread_initexit_lock>, mc=0x0)
    at core/utils.c:904
#5  0x0000556b60e1d199 in mutex_lock (lock=0x556b637490c0 <thread_initexit_lock>)
    at core/utils.c:917
#6  0x0000556b60e0f18b in translate_sigcontext (dcontext=0x556b3e8ef440, uc=0x556b3e972140,
    avoid_failure=0 '\000', f=0x0) at core/unix/signal.c:2476
#7  0x0000556b60e0411b in record_pending_signal (dcontext=0x556b3e8ef440, sig=27,
    ucxt=0x556b3e972140, frame=0x556b3e972138, forged=0 '\000', access_address=0x0)
    at core/unix/signal.c:4160
#8  0x0000556b60dfe415 in master_signal_handler_C (sig=27, siginfo=0x556b3e972270,
    ucxt=0x556b3e972140, xsp=0x556b3e972138 "\363;\357`kU")
    at core/unix/signal.c:5074
#9  0x0000556b60ef3bf3 in xfer_to_new_libdr () at base/internal/logging.h:853
#10 0x0000000000000007 in ?? ()

(gdb) x/4i ucxt->uc_mcontext.rip-2
   0x556b60ef4078 <syscall_ready+3>:    syscall
   0x556b60ef407a <syscall_ready+5>:    pop    %rbx

    rbp = 0x556b3e972730,

(gdb) dps ucxt->uc_mcontext.rsp ucxt->uc_mcontext.rsp+320
0x0000556b3e972700  0x0000000000000002  No symbol matches (void *)$retaddr.
0x0000556b3e972708  0x0000556b60df2fb7  sigprocmask_syscall + 55
0x0000556b3e972710  0x0000000000000008  No symbol matches (void *)$retaddr.
0x0000556b3e972718  0x0000556b3e9727a0  No symbol matches (void *)$retaddr.
0x0000556b3e972720  0x0000556b3e972be8  No symbol matches (void *)$retaddr.
0x0000556b3e972728  0x000000023e972790  No symbol matches (void *)$retaddr.
0x0000556b3e972730  0x0000556b3e9727d0  No symbol matches (void *)$retaddr.
0x0000556b3e972738  0x0000556b60e05b8d  handle_suspend_signal + 973
0x0000556b3e972740  0x0000000000000000  No symbol matches (void *)$retaddr.
(gdb) x/3i 0x0000556b60e05b8d-5
   0x556b60e05b88 <handle_suspend_signal+968>:  callq  0x556b60df2f80 <sigprocmask_syscall>

derekbruening added a commit that referenced this issue May 23, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
derekbruening added a commit that referenced this issue May 23, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
fhahn pushed a commit that referenced this issue Jun 18, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant