Detach race with new thread on UNIX b/c of late signal init w/o lock #2779
Comments
The timeout added in #2762 breaks the assumptions of the UNIX suspend process:
The "will not send a 2nd suspend signal" is no longer true now that we have …
The simplest fix looks like solving the race covered by this issue by refactoring …
Another issue is that handle_suspend_signal() relaxes its signal mask too early, which can result in a hang like so:
Refactors signal_thread_inherit() to be called from a new routine os_thread_init_finalize() which is invoked while holding thread_initexit_lock yet after synch_thread_init(). This eliminates races with suspend signals arriving in newly half-initialized threads, which then drop the signals. The refactoring rearranges several thread initialization sequences to pass the clone record through dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout from commit 972cddf (PR #2762), which added a timeout to os_thread_suspend that turns out to not be safe on UNIX as the suspend model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm lock retry which got in the way of testing the final solution here.

Tested by running thread-creating apps that attach and detach many times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
Quoting from #2762:
Detach gets its list of threads from DR's internal list, and a new thread
only adds itself while holding thread_initexit_lock in the thread init
sequence where it initializes everything else, including the signal
field. But: signal_thread_inherit is split off and called later, after the
thread_initexit_lock has been released, and that's where the signal_field
is fully initialized. This leads to this race.
This issue covers trying to solve the race, either by pulling back the
final signal init under the lock, or through some other means.
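
The ordering problem above, and the "pull the final signal init back under the lock" option, can be sketched with a small model. This is only an illustration and not DynamoRIO code: `thread_rec_t`, `initexit_lock`, `thread_init_locked()`, `thread_init_late()` and `count_unready_threads()` are hypothetical names, and a plain boolean stands in for the fully initialized signal field.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical per-thread record; not DynamoRIO's real data structure. */
typedef struct {
    bool signal_ready; /* stands in for the fully initialized signal field */
} thread_rec_t;

/* Hypothetical stand-ins for thread_initexit_lock and the internal thread list. */
static pthread_mutex_t initexit_lock = PTHREAD_MUTEX_INITIALIZER;
static thread_rec_t *thread_list[MAX_THREADS];
static size_t thread_count;

/* Step 1 of thread init: add the thread to the list while holding the lock.
 * If finalize_under_lock is true, the signal state is also completed before
 * the lock is released, so no listed thread is ever half-initialized.
 */
static void
thread_init_locked(thread_rec_t *rec, bool finalize_under_lock)
{
    pthread_mutex_lock(&initexit_lock);
    assert(thread_count < MAX_THREADS);
    rec->signal_ready = false;
    thread_list[thread_count++] = rec;
    if (finalize_under_lock)
        rec->signal_ready = true;
    pthread_mutex_unlock(&initexit_lock);
}

/* Step 2 in the racy ordering: the signal init finishes only after the lock
 * has been released, leaving a window where the thread is listed but not ready.
 */
static void
thread_init_late(thread_rec_t *rec)
{
    rec->signal_ready = true;
}

/* What a detach-like walker sees under the lock: the number of listed threads
 * that could not yet handle a suspend signal.
 */
static size_t
count_unready_threads(void)
{
    size_t unready = 0;
    pthread_mutex_lock(&initexit_lock);
    for (size_t i = 0; i < thread_count; i++) {
        if (!thread_list[i]->signal_ready)
            unready++;
    }
    pthread_mutex_unlock(&initexit_lock);
    return unready;
}
```

In this model, calling `thread_init_locked(&rec, false)` and then walking the list before `thread_init_late(&rec)` runs finds a listed thread that is not ready, which corresponds to the window in which a suspend signal would be dropped. Passing `true` closes that window because the walker can only observe the list while holding the same lock.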
Xref #2270