CRASH during detach in now-native thread #2270

Closed
derekbruening opened this issue Mar 7, 2017 · 4 comments · Fixed by #2313

Comments

@derekbruening
Contributor

An app with a statically-linked DR often crashes during detach:

<Starting application myapp (595523)>
<Attached to 653 threads in application myapp (595523)>
<Detaching from application myapp (595523)>
SIGSEGV received by PID 595523 (TID 598205)
PC: @          0x4b3dc5d  (unknown)  safe_read_tls_magic
    @          0xfee3683        880  appSignalHandler()
    @     0x7f8574c47390       1440  (unknown)
    @          0x4bf5836        992  master_signal_handler_C
    @          0x4b3d883     321120  (unknown)
(gdb) thread find 598205                                                   
Thread 1 has target id 'LWP 598205'

(gdb) dps $rsp $rsp+8000
<...>
0x00007f76be49ddc8  0x000000000fee36d3  appSignalHandler
<...>
0x00007f76be49e6c8  0x0000000004be9559  get_thread_private_dcontext + 89 in section .text
0x00007f76be49e6d0  0x00007f76be49eab0  No symbol matches (void *)$retaddr.
0x00007f76be49e6d8  0x0000000004bf5836  master_signal_handler_C + 54 in section .text
0x00007f76be49e6e0  0x0000000000000000  No symbol matches (void *)$retaddr.
<...>
0x00007f76be49eab8  0x0000000004b3d883  dynamorio_sigreturn in section .text

    (gdb) dps 0x00007f76be49eab8 0x00007f76be49eab8+512
    0x00007f76be49eab8  0x0000000004b3d883  dynamorio_sigreturn in section .text
    0x00007f76be49eac0  0x0000000000000001  No symbol matches (void *)$retaddr.
    0x00007f76be49eac8  0x0000000000000000  No symbol matches (void *)$retaddr.
sp  0x00007f76be49ead0  0x00007f76be48f000  No symbol matches (void *)$retaddr.
    <...>
rsp 0x00007f76be49eb60  0x00007f76be4ecf50  No symbol matches (void *)$retaddr.
rip 0x00007f76be49eb68  0x00000000073fc4dc  appfunc
    <...>
sig 0x00007f76be49ebf0  0x000000000000001b  No symbol matches (void *)$retaddr.
    0x00007f76be49ebf8  0x0000000000000080  No symbol matches (void *)$retaddr.

This is just a SIGPROF arriving at a random point in a thread that has been
detached and is now native. Our handler is still in place, and it calls
get_thread_private_dcontext().

Looking back down the stack at the SIGSEGV:

rsp 0x00007f76be49e1e0  0x00007f76be49e6c8  No symbol matches (void *)$retaddr.
rip 0x00007f76be49e1e8  0x0000000004b3dc5d  safe_read_tls_magic in section .text
    <...>
sig 0x00007f76be49e270  0x000000000000000b  No symbol matches (void *)$retaddr.
    0x00007f76be49e278  0x0000000000000001  No symbol matches (void *)$retaddr.

(gdb) x/2i 0x0000000004b3dc5d
   0x4b3dc5d <safe_read_tls_magic>:     mov    %gs:0x60,%eax

So this is the expected fault after we've removed our segment.
Why, then, didn't the SIGSEGV just go to our safe_read_tls_magic check and
from there to safe_read_tls_magic_recover?
Is it a race where we removed our handler before the SIGSEGV was delivered,
and that's why it went to the app? We remove the handler once we detach from
the final thread: actually, only once the detaching thread also does its own
thread exit, right?

(gdb) p dynamo_exited
$2 = 0
(gdb) p doing_detach
$3 = 256
(gdb) p dynamo_detaching_flag
$4 = -1
(gdb) p dynamo_exited_and_cleaned
$5 = 0
(gdb) p num_known_threads
$6 = 0
(gdb) p dynamo_initialized
$7 = 0
(gdb) p do_once_generation
$8 = 2

It looks like dynamo_exit_post_detach() has run, though maybe the detacher
made further progress while the fault was being processed.

Proposal: check doing_detach in master_signal_handler_C, and if it is set
and this is some alarm signal, just drop it on the floor? Or try to invoke
the app's handler if it's not SIGUSR2 (or a fault?).
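
A hypothetical sketch of that check (the handler shape and signal set are
assumptions for illustration, not DR's actual master_signal_handler_C;
doing_detach is the flag printed in the gdb session above):

    #include <signal.h>

    extern int doing_detach; /* DR's detach-in-progress flag */

    static void
    master_signal_handler_sketch(int sig, siginfo_t *info, void *ucxt)
    {
        (void)info;
        (void)ucxt;
        if (doing_detach &&
            (sig == SIGALRM || sig == SIGVTALRM || sig == SIGPROF))
            return; /* mid-detach alarm for a now-native thread: drop it */
        /* ... otherwise fall through to normal handling, or deliver to
         * the app's handler if one is registered ... */
    }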

@derekbruening
Contributor Author

We can drop alarm signals, but what about other signals? Is this a fatal flaw in the #2089 safe_read_tls approach? Should we try to come up with some other solution? It is not an easy problem to solve, unfortunately, but #2089's approach is not ideal in other ways: there are several faults on every new thread on attach (#2271).

@derekbruening
Contributor Author

Setting the 3 classic alarm signal handlers to SIG_IGN during detach will help with the common cases, just as it did for attach in #2161. There are still corner cases: non-timer signals, or non-classic-alarm signals used as timer signals via timer_create.
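
The idea, sketched (illustrative names; the actual change is in the commit
below):

    #include <signal.h>
    #include <string.h>

    static const int alarm_sigs[] = { SIGALRM, SIGVTALRM, SIGPROF };
    static struct sigaction saved_act[3];

    /* Ignore the 3 classic alarm signals for the duration of detach,
     * saving the prior dispositions so they can be restored afterward. */
    static void
    ignore_alarms_for_detach(void)
    {
        struct sigaction ign;
        memset(&ign, 0, sizeof(ign));
        ign.sa_handler = SIG_IGN;
        for (int i = 0; i < 3; i++)
            sigaction(alarm_sigs[i], &ign, &saved_act[i]);
    }

    static void
    restore_alarms_after_detach(void)
    {
        for (int i = 0; i < 3; i++)
            sigaction(alarm_sigs[i], &saved_act[i], NULL);
    }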

derekbruening added a commit that referenced this issue Mar 28, 2017
Sets the signal handler for the 3 alarm signals to SIG_IGN during attach
and detach, to reduce problems with races while we try to coordinate taking
over or sending native all of the app threads.

This is not a panacea, as timer_create can send out any signal, not just
the 3 classic alarm signals, on timer expiration.  Plus, non-timer-related
signals could arrive during attach or detach.  However, this will help with
the most typical cases.

Adds an itimer to the api.static_signal test, though it is not easy to
reproduce these problems in small applications.

I'm treating this as fixing the filed issues despite the above-mentioned
remaining corner cases, as I consider those pathological:

Fixes #2161
Fixes #2270
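
The timer_create corner case is easy to see in isolation: a POSIX timer may
deliver any signal on expiration, so ignoring only the 3 classic alarm
signals does not cover it (illustrative sketch; link with -lrt on older
glibc):

    #include <signal.h>
    #include <string.h>
    #include <time.h>

    /* A POSIX timer that fires SIGUSR1, not a classic alarm signal. */
    static timer_t
    make_usr1_timer(void)
    {
        timer_t t;
        struct sigevent sev;
        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGUSR1; /* any signal number is allowed */
        timer_create(CLOCK_REALTIME, &sev, &t);
        return t;
    }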
@derekbruening
Contributor Author

The not-yet-handled pathological corner cases are now part of #26

@derekbruening
Contributor Author

I'm reopening this for one more race: when a DR or client itimer is in place (xref #140), I've seen crashes right after detach. I'm pretty sure it's an itimer signal arriving after detach: we mark alarms as ignored for the duration of detach, but then we restore the app's default handler.

My proposal: on detach, if an alarm itimer is in place only for DR or a client and not for the app, and the app's handler is the default, permanently leave the handler as SIG_IGN after detach. That seems a small transparency loss for a big robustness gain.
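
A sketch of that detach-time decision (all helper names here are
hypothetical, for illustration only):

    #include <signal.h>
    #include <stdbool.h>

    /* Hypothetical queries into DR/app state; not DR's actual API. */
    extern bool dr_or_client_itimer_active(int sig);
    extern bool app_itimer_active(int sig);
    extern bool app_handler_is_default(int sig);
    extern void restore_app_handler(int sig);

    static void
    set_alarm_disposition_at_detach(int sig)
    {
        if (dr_or_client_itimer_active(sig) && !app_itimer_active(sig) &&
            app_handler_is_default(sig)) {
            /* Only DR or a client armed this timer and the app never
             * installed a handler: leave the signal ignored so a
             * straggler alarm delivered after detach cannot kill the
             * now-native process. */
            signal(sig, SIG_IGN);
        } else
            restore_app_handler(sig);
    }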

derekbruening added a commit that referenced this issue Apr 19, 2018
Adds ignoring of alarm signals post-detach if DR has an itimer and the app
does not, to avoid crashing on a signal that arrives after detach.

Adds testing to api.static_signal but it is not easy to reproduce this
race.

Fixes #2270
Carrotman42 added a commit that referenced this issue Nov 6, 2018
…ads. (#3249)

If an alarm is received by a thread after it has blocked in
check_wait_at_safe_spot but before the detaching thread sends the
SUSPEND_SIGNAL, it is possible the fcache_unit_areas lock is being held in
record_pending_signal when the SUSPEND_SIGNAL is received. Since the
receiving thread was already marked as waiting at a safe spot, we
synchronize with the thread and detach it, and the fcache_unit_areas
lock is never unlocked.

Issue: #2270
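
The shape of that hang, distilled (stub lock and names; not DR's actual
code):

    #include <pthread.h>

    /* Stub standing in for DR's fcache_unit_areas lock. */
    static pthread_mutex_t fcache_unit_areas_stub = PTHREAD_MUTEX_INITIALIZER;

    /* Timeline: the thread blocks in check_wait_at_safe_spot and is
     * marked "at a safe spot"; an alarm then arrives and
     * record_pending_signal acquires the lock; the detacher's
     * SUSPEND_SIGNAL lands while the lock is held, the thread is treated
     * as already safe and is detached, so the unlock below never runs
     * and any later acquirer of the lock hangs. */
    static void
    record_pending_signal_shape(void)
    {
        pthread_mutex_lock(&fcache_unit_areas_stub);
        /* <-- SUSPEND_SIGNAL delivered here; thread detached */
        pthread_mutex_unlock(&fcache_unit_areas_stub); /* never reached */
    }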