CRASH during detach in now-native thread #2270
We can drop alarm signals, but what about other signals? Is this a fatal flaw in the #2089 safe_read_tls approach? Should we try to come up with some other solution? It is not an easy problem to solve, unfortunately, but #2089's approach is not ideal in other ways either: there are several faults on every new thread on attach (#2271).
Setting the 3 classic alarm signal handlers to SIG_IGN during detach will help with the common cases, just like for attach in #2161. There are still corner cases of non-timer signals, or non-classic-alarm signals used as timer signals via timer_create.
Sets the signal handler for the 3 alarm signals to SIG_IGN during attach and detach, to reduce problems with races while we try to coordinate taking over or sending native all of the app threads. This is not a panacea, as timer_create can send out any signal, not just the 3 classic alarm signals, on timer expiration. Plus, non-timer-related signals could arrive during attach or detach. However, this will help with the most typical cases. Adds an itimer to the api.static_signal test, though it is not easy to reproduce these problems in small applications. I'm considering this to fix the filed issues despite the above-mentioned remaining corner cases as I'm considering those to be pathological: Fixes #2161 Fixes #2270
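As a rough illustration of the idea in the commit message above, the following sketch sets SIGALRM, SIGVTALRM, and SIGPROF to SIG_IGN and later restores the saved dispositions. It is not DR's implementation: the helper names `ignore_alarm_signals` and `restore_alarm_signals` are hypothetical, and it uses plain POSIX `sigaction` rather than DR's internal signal machinery.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* The three classic alarm signals, one per setitimer() timer type. */
static const int alarm_signals[] = { SIGALRM, SIGVTALRM, SIGPROF };
enum { NUM_ALARM_SIGNALS = sizeof(alarm_signals) / sizeof(alarm_signals[0]) };

/* Hypothetical helper: set each alarm signal to SIG_IGN and save the prior
 * disposition in saved[]. Returns 0 on success, -1 on failure.
 */
static int
ignore_alarm_signals(struct sigaction saved[NUM_ALARM_SIGNALS])
{
    struct sigaction ign;
    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    for (int i = 0; i < NUM_ALARM_SIGNALS; i++) {
        if (sigaction(alarm_signals[i], &ign, &saved[i]) != 0)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: put back the dispositions saved above. */
static int
restore_alarm_signals(const struct sigaction saved[NUM_ALARM_SIGNALS])
{
    for (int i = 0; i < NUM_ALARM_SIGNALS; i++) {
        if (sigaction(alarm_signals[i], &saved[i], NULL) != 0)
            return -1;
    }
    return 0;
}
```

A caller would invoke `ignore_alarm_signals` before starting the attach or detach synchronization and `restore_alarm_signals` once it completes. As the commit message notes, this does nothing for timers created with timer_create that deliver other signals.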
The not-yet-handled pathological corner cases are now part of #26
I'm reopening this for one more race: when a DR or client itimer is in place (xref #140), I've seen crashes right after detach. I'm pretty sure it's an itimer signal arriving after detach: we mark alarms as ignore. My proposal is: on detach, if an alarm itimer is in place only for DR or a client and not for the app, and the app's handler is the default, we permanently leave the handler as ignore after detach. It seems a small transparency loss for a big robustness gain.
Adds ignoring of alarm signals post-detach if DR has an itimer and the app does not, to avoid crashing on a signal that arrives after detach. Adds testing to api.static_signal but it is not easy to reproduce this race. Fixes #2270
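The post-detach rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `alarm_state_t` bookkeeping struct and both helper names are hypothetical and do not come from DR's sources, and the final disposition is installed with plain POSIX `sigaction`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical bookkeeping for one alarm signal at detach time. */
typedef struct {
    bool tool_itimer_active;  /* DR or a client armed an itimer for this signal */
    bool app_itimer_active;   /* the app itself armed an itimer for this signal */
    void (*app_handler)(int); /* the app's disposition for this signal */
} alarm_state_t;

/* Leave the signal ignored only when the itimer belongs solely to the tool and
 * the app never installed its own handler: a late timer signal would otherwise
 * hit the default action and kill the now-native process.
 */
static bool
should_leave_ignored(const alarm_state_t *s)
{
    return s->tool_itimer_active && !s->app_itimer_active &&
        s->app_handler == SIG_DFL;
}

/* Install the final post-detach disposition for sig. Returns 0 on success. */
static int
finish_detach_for_signal(int sig, const alarm_state_t *s)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_handler = should_leave_ignored(s) ? SIG_IGN : s->app_handler;
    return sigaction(sig, &act, NULL);
}
```

In every other case the sketch restores the app's own disposition, which keeps the transparency loss limited to the single situation the proposal targets.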
If an alarm is received by a thread after it has blocked in check_wait_at_safe_spot but before the detaching thread sends the SUSPEND_SIGNAL, it is possible the fcache_unit_areas lock is being held in record_pending_signal when the SUSPEND_SIGNAL is received. Since the receiving thread was already marked as waiting at a safe spot, we synchronize with the thread and detach it, and the fcache_unit_areas lock is never unlocked. Issue #2270
…ads. (#3249) If an alarm is received by a thread after it has blocked in check_wait_at_safe_spot but before the detaching thread sends the SUSPEND_SIGNAL, it is possible the fcache_unit_areas lock is being held in record_pending_signal when the SUSPEND_SIGNAL is received. Since the receiving thread was already marked as waiting at a safe spot, we synchronize with the thread and detach it, and the fcache_unit_areas lock is never unlocked. Issue: #2270
An app with a statically-linked DR often crashes during detach:
It's just SIGPROF arriving at a random point in a thread that has been
detached and is now native. Our handler is still in place, and it calls
get_thread_private_dcontext().
Looking back down the stack at the SIGSEGV:
So it's the expected fault after we've removed our segment.
So why didn't the SIGSEGV just go to our safe_read_tls_magic check and from
there go to safe_read_tls_magic_recover?
Is it a race where we removed our handler before the SIGSEGV was delivered,
and that's why it went to the app? We remove it once we detach from the
final thread: actually once we also do thread exit from the detaching
thread, right?
It looks like dynamo_exit_post_detach() has run, though maybe the detacher
made further progress while the fault was being processed.
Proposal: check doing_detach in master_signal_handler_C and if true, and
it's some alarm signal, just drop it on the floor? Or try to invoke app
handler if it's not SIGUSR2 (or a fault?).
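A minimal sketch of the first half of that proposal, dropping alarm signals while a detach is in progress, might look like the following. The names here are stand-ins: `doing_detach_flag` plays the role of doing_detach, `sketch_signal_handler` plays the role of master_signal_handler_C, and the two counters exist only so the behavior can be observed. It does not attempt the second half (forwarding other signals to the app handler).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for doing_detach: set while a detach is in progress. */
static volatile sig_atomic_t doing_detach_flag;
/* Counters used only to observe what the handler did. */
static volatile sig_atomic_t dropped_count;
static volatile sig_atomic_t handled_count;

static bool
is_alarm_signal(int sig)
{
    return sig == SIGALRM || sig == SIGVTALRM || sig == SIGPROF;
}

/* Stand-in for the master handler: during detach, an alarm signal is dropped
 * before touching any per-thread state that may already have been torn down.
 */
static void
sketch_signal_handler(int sig)
{
    if (doing_detach_flag && is_alarm_signal(sig)) {
        dropped_count++;
        return;
    }
    handled_count++; /* normal handling would go here */
}

/* Install the sketch handler for sig. Returns 0 on success. */
static int
install_sketch_handler(int sig)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_handler = sketch_signal_handler;
    return sigaction(sig, &act, NULL);
}
```

The early return matters because the crash described above comes from the handler reaching for thread-private state after the segment has been removed; checking the flag first avoids that path entirely for alarm signals.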