Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i#2157 re-attach: Add timeout to os_thread_suspend #2762

Merged
merged 5 commits into from
Dec 15, 2017

Conversation

Carrotman42
Copy link
Contributor

@Carrotman42 Carrotman42 commented Dec 13, 2017

After 5 seconds of waiting for a thread to acknowledge a received
signal, os_thread_suspend now returns false so that the caller can
retry.

I have a unit test where there's a small chance that the thread doing detach will send the SUSPEND_SIGNAL to a newly created thread before its dcontext's signal_field is fully_initialized; the new thread discards the signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its signal_field, but the thread doing detach currently never retries to send the signal. So, this is kind of a workaround for a specific instance of i#26 that results in DR deadlocking on itself.

Issue: #2157

After 5 seconds of waiting for a thread to acknowledge a received
signal, os_thread_suspend now returns false so that the caller can
retry.

Issue #2157
@derekbruening
Copy link
Contributor

Issue #2157

nit: the convention is with a colon after Issue as in https://github.com/DynamoRIO/dynamorio/wiki/Code-Reviews#commit-messages

Copy link
Contributor

@derekbruening derekbruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment on the reasons for this: have you observed a thread failing to receive a suspend signal or this is just trying to be defensive?

core/synch.c Outdated
*/
/* i#297: we only synch client threads after process exit event. */
uint flags = THREAD_SYNCH_SUSPEND_FAILURE_IGNORE | THREAD_SYNCH_SKIP_CLIENT_THREAD;
uint flags = THREAD_SYNCH_SUSPEND_FAILURE_RETRY | THREAD_SYNCH_SKIP_CLIENT_THREAD;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid this is going to cause problems on Windows where it is not uncommon to have injected threads (CTRL_SHUTDOWN, CTRL_LOGOFF, etc.) we have no privileges to suspend -- and thus retrying will just fail again, and with the new "synchall failure is fatal and should kill the process" approach it turns what used to work for us on Windows into process death.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about exposing a flag option about whether or not this should be IGNORE, FATAL, or RETRY?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think the main thing for me is that it seems safest to keep Windows w/ a default of ignore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then I went with compile-time constants which keep windows functionality the same but updates Unix-based platforms.

core/unix/os.c Outdated
*/
ksynch_wait(&ostd->suspended, 0, 0);
if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is shared UNIX code but this return value check is Linux-specific and will not match on Mac

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, thanks for the reminder to check. Implemented Mac OS (docs were so hard to find!) and added a documented the API in ksynch.h.

core/unix/os.c Outdated
*/
ksynch_wait(&ostd->suspended, 0, 0);
if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a named constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (constant name ended up elsewhere)

core/unix/os.c Outdated
ksynch_wait(&ostd->suspended, 0, 0);
if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
mutex_lock(&ostd->suspend_lock);
ASSERT(ostd->suspend_count > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this assert is valid: a resume could have happened after ksynch_wait returns but before the lock is acquired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the contract of os_thread_suspend/os_thread_resume is that you cannot call into os_thread_resume more times than os_thread_suspend has returned. Therefore, even if other os_thread_suspend/os_thread_resume calls went through and modified suspend_count between the ksynch_wait and mutex_lock, since this was the thread which incremented suspend_count and hasn't returned yet, it would be incorrect for any caller to call os_thread_resume and cause suspend_count to be decremented down to 0.

Please correct me if my logic or assumptions are incorrect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this sounds right.

core/unix/os.c Outdated
if (ksynch_wait(&ostd->suspended, 0, 5000) == -ETIMEDOUT) {
mutex_lock(&ostd->suspend_lock);
ASSERT(ostd->suspend_count > 0);
ostd->suspend_count--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly I don't think this is safe w/o some check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed on another thread.

ASSERT(ostd->suspend_count > 0);
ostd->suspend_count--;
mutex_unlock(&ostd->suspend_lock);
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm worried there's some use of os_thread_suspend() outside of synch_with_all_threads() (e.g., just direct synch_with_thread()) that is not prepared to retry, and that in some extreme circumstance, such as a heavily loaded machine or machine resuming from suspend-to-disk or something, where there will be a long delay for everything and now we'll fail and mess something up where before it worked. Do other uses exist? I know they do on Windows b/c of app syscalls that suspend threads.

This also splits the behavior of this cross-OS interface from Windows and should probably be mentioned in the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added documentation about how os_thread_suspend may or may not respect the new timeout_ms parameter, and now only supply a timeout parameter to the function when the synch_with_thread's block parameter is false. My logic is that if a caller requested that block is true they probably do not want the function to time-out (they expect it to "block" until the thread is synch'd), whereas if they requested a non-blocking call then a default timeout makes sense. Along with synch_with_thread's flags parameter, the caller can get all sorts of useful functionality (wait forever, timeout -> no error, timeout -> error).

I wasn't aware of the other sorts of use-cases you mention which would cause long synch times. Does my change to infer timeouts from the block parameter of synch_with_thread make things better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is better: all the other calls pass block=true.

* Add timeout parameter to os_suspend_thread.
* Move decision about whether a thread suspend request times out to the caller of synch_with_thread, specifically its 'block' parameter.
* Implemented correctly for macos
* Changed code around so that Windows is unaffected.
@Carrotman42
Copy link
Contributor Author

PTAL

Copy link
Contributor

@derekbruening derekbruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good but I have a question about this scenario:

I have a unit test where there's a small chance that the thread doing detach will send the SUSPEND_SIGNAL to a newly created thread before its dcontext's signal_field is fully_initialized; the new thread discards the signal (can_always_delay[SUSPEND_SIGNAL] == true) and later initializes its signal_field, but the thread doing detach currently never retries to send the signal. So, this is kind of a workaround for a specific instance of i#26 that results in DR deadlocking on itself.

I don't see how this can happen: detach gets its list of threads from DR's internal list, and a new thread only adds itself while holding thread_initexit_lock in the thread init sequence where it initializes everything else, including the signal field. Looking at the code -- ok I think I see: this is a regression, coming from the splitting away of signal_thread_inherit from signal_thread_init. Before that split this would be impossible. So the split made things worse by delaying some initialization until the lock was released. We should revisit the reason for the split and try to remedy the situation and avoid the problem entirely. Can you file an issue on this?

@@ -87,7 +87,8 @@ ksynch_set_value(mac_synch_t *synch, int new_val);
KSYNCH_TYPE *
mutex_get_contended_event(mutex_t *lock);

/* These return 0 on success: */
/* These return 0 on success and a negative value on failure. ksynch_wait
* returns -ETIMEDOUT if there was a timeout condition. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: prefer */ on own line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ASSERT(ostd->suspend_count > 0);
ostd->suspend_count--;
mutex_unlock(&ostd->suspend_lock);
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is better: all the other calls pass block=true.

@Carrotman42 Carrotman42 merged commit 972cddf into master Dec 15, 2017
@Carrotman42 Carrotman42 deleted the i2157-retry-suspend-signal branch December 15, 2017 15:59
@derekbruening
Copy link
Contributor

This timeout breaks the thread suspension model on UNIX: it introduces races between a suspender and suspendee. The model assumes the suspender will not bail, undo its state changes, and then come back. The new thread regression was filed by me as #2779 and I believe it should be addressed and then the timeout reverted.

derekbruening added a commit that referenced this pull request May 23, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
derekbruening added a commit that referenced this pull request May 23, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
fhahn pushed a commit that referenced this pull request Jun 18, 2018
Refactors signal_thread_inherit() to be called from a new routine
os_thread_init_finalize() which is invoked while holding
thread_initexit_lock yet after synch_thread_init().  This eliminates
races with suspend signals arriving in newly half-initialized threads,
which then drop the signals.  The refactoring rearranges several
thread initialization sequences to pass the clone record through
dynamo_thread_init().

This refactoring allows us to revert the os_thread_suspend timeout
from commit 972cddf PR #2762 which added a timeout to
os_thread_suspend that turns out to not be safe on UNIX as the suspend
model assumes there is no retry.

Delays mask relaxing in handle_suspend_signal() to avoid timeout on
suspend due to an intervening signal.

Includes tweaks to an i#3020-related assert and i#2993-related alarm
lock retry which got in the way of testing the final solution here.

Tested by running thread creating apps that attach and detach many
times, similar to the static burst tests in our suite.

Issue: #3020, #2993
Fixes: #2779
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants