i#2157 re-attach: Add timeout to os_thread_suspend #2762

Carrotman42 · 2017-12-13T01:12:35Z

After 5 seconds of waiting for a thread to acknowledge a received
signal, os_thread_suspend now returns false so that the caller can
retry.

I have a unit test where there's a small chance that the thread doing detach will send the SUSPEND_SIGNAL to a newly created thread before its dcontext's signal_field is fully_initialized; the new thread discards the signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its signal_field, but the thread doing detach currently never retries to send the signal. So, this is kind of a workaround for a specific instance of i#26 that results in DR deadlocking on itself.

Issue: #2157

After 5 seconds of waiting for a thread to acknowledge a received signal, os_thread_suspend now returns false so that the caller can retry. Issue #2157

derekbruening · 2017-12-13T02:13:06Z

Issue #2157

nit: the convention is with a colon after Issue as in https://github.com/DynamoRIO/dynamorio/wiki/Code-Reviews#commit-messages

derekbruening

Could you comment on the reasons for this: have you observed a thread failing to receive a suspend signal or this is just trying to be defensive?

derekbruening · 2017-12-13T02:30:45Z

core/synch.c

     */
    /* i#297: we only synch client threads after process exit event. */
-    uint flags = THREAD_SYNCH_SUSPEND_FAILURE_IGNORE | THREAD_SYNCH_SKIP_CLIENT_THREAD;
+    uint flags = THREAD_SYNCH_SUSPEND_FAILURE_RETRY | THREAD_SYNCH_SKIP_CLIENT_THREAD;


I am afraid this is going to cause problems on Windows where it is not uncommon to have injected threads (CTRL_SHUTDOWN, CTRL_LOGOFF, etc.) we have no privileges to suspend -- and thus retrying will just fail again, and with the new "synchall failure is fatal and should kill the process" approach it turns what used to work for us on Windows into process death.

How about exposing a flag option about whether or not this should be IGNORE, FATAL, or RETRY?

OK, I think the main thing for me is that it seems safest to keep Windows w/ a default of ignore.

OK, then I went with compile-time constants which keep windows functionality the same but updates Unix-based platforms.

derekbruening · 2017-12-13T02:34:07Z

core/unix/os.c

         */
-        ksynch_wait(&ostd->suspended, 0, 0);
+        if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {


This is shared UNIX code but this return value check is Linux-specific and will not match on Mac

Ah yes, thanks for the reminder to check. Implemented Mac OS (docs were so hard to find!) and added a documented the API in ksynch.h.

derekbruening · 2017-12-13T02:35:35Z

core/unix/os.c

         */
-        ksynch_wait(&ostd->suspended, 0, 0);
+        if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {


Use a named constant

Done (constant name ended up elsewhere)

derekbruening · 2017-12-13T02:39:44Z

core/unix/os.c

-        ksynch_wait(&ostd->suspended, 0, 0);
+        if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
+            mutex_lock(&ostd->suspend_lock);
+            ASSERT(ostd->suspend_count > 0);


I don't think this assert is valid: a resume could have happened after ksynch_wait returns but before the lock is acquired.

I believe the contract of os_thread_suspend/os_thread_resume is that you cannot call into os_thread_resume more times than os_thread_suspend has returned. Therefore, even if other os_thread_suspend/os_thread_resume calls went through and modified suspend_count between the ksynch_wait and mutex_lock, since this was the thread which incremented suspend_count and hasn't returned yet, it would be incorrect for any caller to call os_thread_resume and cause suspend_count to be decremented down to 0.

Please correct me if my logic or assumptions are incorrect.

Yes, this sounds right.

derekbruening · 2017-12-13T02:40:18Z

core/unix/os.c

+        if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
+            mutex_lock(&ostd->suspend_lock);
+            ASSERT(ostd->suspend_count > 0);
+            ostd->suspend_count--;


Similarly I don't think this is safe w/o some check

Discussed on another thread.

derekbruening · 2017-12-13T02:45:26Z

core/unix/os.c

+            ASSERT(ostd->suspend_count > 0);
+            ostd->suspend_count--;
+            mutex_unlock(&ostd->suspend_lock);
+            return false;


Hmm, I'm worried there's some use of os_thread_suspend() outside of synch_with_all_threads() (e.g., just direct synch_with_thread()) that is not prepared to retry, and that in some extreme circumstance, such as a heavily loaded machine or machine resuming from suspend-to-disk or something, where there will be a long delay for everything and now we'll fail and mess something up where before it worked. Do other uses exist? I know they do on Windows b/c of app syscalls that suspend threads.

This also splits the behavior of this cross-OS interface from Windows and should probably be mentioned in the header.

I've added documentation about how os_thread_suspend may or may not respect the new timeout_ms parameter, and now only supply a timeout parameter to the function when the synch_with_thread's block parameter is false. My logic is that if a caller requested that block is true they probably do not want the function to time-out (they expect it to "block" until the thread is synch'd), whereas if they requested a non-blocking call then a default timeout makes sense. Along with synch_with_thread's flags parameter, the caller can get all sorts of useful functionality (wait forever, timeout -> no error, timeout -> error).

I wasn't aware of the other sorts of use-cases you mention which would cause long synch times. Does my change to infer timeouts from the block parameter of synch_with_thread make things better?

Yes, this is better: all the other calls pass block=true.

* Add timeout parameter to os_suspend_thread. * Move decision about whether a thread suspend request times out to the caller of synch_with_thread, specifically its 'block' parameter. * Implemented correctly for macos * Changed code around so that Windows is unaffected.

Carrotman42 · 2017-12-14T21:59:32Z

PTAL

derekbruening

The code looks good but I have a question about this scenario:

I have a unit test where there's a small chance that the thread doing detach will send the SUSPEND_SIGNAL to a newly created thread before its dcontext's signal_field is fully_initialized; the new thread discards the signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its signal_field, but the thread doing detach currently never retries to send the signal. So, this is kind of a workaround for a specific instance of i#26 that results in DR deadlocking on itself.

I don't see how this can happen: detach gets its list of threads from DR's internal list, and a new thread only adds itself while holding thread_initexit_lock in the thread init sequence where it initializes everything else, including the signal field. Looking at the code -- ok I think I see: this is a regression, coming from the splitting away of signal_thread_inherit from signal_thread_init. Before that split this would be impossible. So the split made things worse by delaying some initialization until the lock was released. We should revisit the reason for the split and try to remedy the situation and avoid the problem entirely. Can you file an issue on this?

derekbruening · 2017-12-15T03:41:57Z

core/unix/ksynch.h

@@ -87,7 +87,8 @@ ksynch_set_value(mac_synch_t *synch, int new_val);
 KSYNCH_TYPE *
 mutex_get_contended_event(mutex_t *lock);

-/* These return 0 on success: */
+/* These return 0 on success and a negative value on failure. ksynch_wait
+ * returns -ETIMEDOUT if there was a timeout condition. */


style nit: prefer */ on own line

derekbruening · 2017-12-15T03:43:46Z

core/unix/os.c

+            ASSERT(ostd->suspend_count > 0);
+            ostd->suspend_count--;
+            mutex_unlock(&ostd->suspend_lock);
+            return false;


Yes, this is better: all the other calls pass block=true.

derekbruening · 2018-05-21T17:45:40Z

This timeout breaks the thread suspension model on UNIX: it introduces races between a suspender and suspendee. The model assumes the suspender will not bail, undo its state changes, and then come back. The new thread regression was filed by me as #2779 and I believe it should be addressed and then the timeout reverted.

Refactors signal_thread_inherit() to be called from a new routine os_thread_init_finalize() which is invoked while holding thread_initexit_lock yet after synch_thread_init(). This eliminates races with suspend signals arriving in newly half-initialized threads, which then drop the signals. The refactoring rearranges several thread initialization sequences to pass the clone record through dynamo_thread_init(). This refactoring allows us to revert the os_thread_suspend timeout from commit 972cddf PR #2762 which added a timeout to os_thread_suspend that turns out to not be safe on UNIX as the suspend model assumes there is no retry. Delays mask relaxing in handle_suspend_signal() to avoid timeout on suspend due to an intervening signal. Includes tweaks to an i#3020-related assert and i#2993-related alarm lock retry which got in the way of testing the final solution here. Tested by running thread creating apps that attach and detach many times, similar to the static burst tests in our suite. Issue: #3020, #2993 Fixes: #2779

i#2157 re-attach: Add timeout to os_thread_suspend

8d81ca6

After 5 seconds of waiting for a thread to acknowledge a received signal, os_thread_suspend now returns false so that the caller can retry. Issue #2157

Carrotman42 requested a review from derekbruening December 13, 2017 01:12

derekbruening reviewed Dec 13, 2017

View reviewed changes

Fix windows build

95b2a24

derekbruening approved these changes Dec 15, 2017

View reviewed changes

Carrotman42 added 2 commits December 15, 2017 07:30

Fix comment format

9594e90

Merge branch 'master' into i2157-retry-suspend-signal

0dfba39

Carrotman42 merged commit 972cddf into master Dec 15, 2017

Carrotman42 deleted the i2157-retry-suspend-signal branch December 15, 2017 15:59

derekbruening mentioned this pull request Dec 18, 2017

Detach race with new thread on UNIX b/c of late signal init w/o lock #2779

Closed

derekbruening mentioned this pull request May 23, 2018

i#2779: fix thread creation signal race #3021

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

i#2157 re-attach: Add timeout to os_thread_suspend #2762

i#2157 re-attach: Add timeout to os_thread_suspend #2762

Carrotman42 commented Dec 13, 2017 •

edited

Loading

derekbruening commented Dec 13, 2017

derekbruening left a comment

derekbruening Dec 13, 2017

Carrotman42 Dec 13, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 14, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 13, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 14, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 13, 2017

derekbruening Dec 13, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 14, 2017

derekbruening Dec 13, 2017

Carrotman42 Dec 14, 2017

derekbruening Dec 15, 2017

Carrotman42 commented Dec 14, 2017

derekbruening left a comment

derekbruening Dec 15, 2017

Carrotman42 Dec 15, 2017

derekbruening Dec 15, 2017

derekbruening commented May 21, 2018

i#2157 re-attach: Add timeout to os_thread_suspend #2762

i#2157 re-attach: Add timeout to os_thread_suspend #2762

Conversation

Carrotman42 commented Dec 13, 2017 • edited Loading

derekbruening commented Dec 13, 2017

derekbruening left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Carrotman42 commented Dec 14, 2017

derekbruening left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

derekbruening commented May 21, 2018

Carrotman42 commented Dec 13, 2017 •

edited

Loading