CRASH during detach in now-native thread #2270

Closed
derekbruening opened this issue Mar 7, 2017 · 4 comments · Fixed by #2313

Comments

@derekbruening
Contributor

An app with a statically-linked DR often crashes during detach:

<Starting application myapp (595523)>
<Attached to 653 threads in application myapp (595523)>
<Detaching from application myapp (595523)>
SIGSEGV received by PID 595523 (TID 598205)
PC: @          0x4b3dc5d  (unknown)  safe_read_tls_magic
    @          0xfee3683        880  appSignalHandler()
    @     0x7f8574c47390       1440  (unknown)
    @          0x4bf5836        992  master_signal_handler_C
    @          0x4b3d883     321120  (unknown)
(gdb) thread find 598205                                                   
Thread 1 has target id 'LWP 598205'

(gdb) dps $rsp $rsp+8000
<...>
0x00007f76be49ddc8  0x000000000fee36d3  appSignalHandler
<...>
0x00007f76be49e6c8  0x0000000004be9559  get_thread_private_dcontext + 89 in section .text
0x00007f76be49e6d0  0x00007f76be49eab0  No symbol matches (void *)$retaddr.
0x00007f76be49e6d8  0x0000000004bf5836  master_signal_handler_C + 54 in section .text
0x00007f76be49e6e0  0x0000000000000000  No symbol matches (void *)$retaddr.
<...>
0x00007f76be49eab8  0x0000000004b3d883  dynamorio_sigreturn in section .text

    (gdb) dps 0x00007f76be49eab8 0x00007f76be49eab8+512
    0x00007f76be49eab8  0x0000000004b3d883  dynamorio_sigreturn in section .text
    0x00007f76be49eac0  0x0000000000000001  No symbol matches (void *)$retaddr.
    0x00007f76be49eac8  0x0000000000000000  No symbol matches (void *)$retaddr.
sp  0x00007f76be49ead0  0x00007f76be48f000  No symbol matches (void *)$retaddr.
    <...>
rsp 0x00007f76be49eb60  0x00007f76be4ecf50  No symbol matches (void *)$retaddr.
rip 0x00007f76be49eb68  0x00000000073fc4dc  appfunc
    <...>
sig 0x00007f76be49ebf0  0x000000000000001b  No symbol matches (void *)$retaddr.
    0x00007f76be49ebf8  0x0000000000000080  No symbol matches (void *)$retaddr.

This is just a SIGPROF arriving at a random point in a thread that has been
detached and is now native. Our handler is still in place, and it calls
get_thread_private_dcontext().

Looking back down the stack at the SIGSEGV:

rsp 0x00007f76be49e1e0  0x00007f76be49e6c8  No symbol matches (void *)$retaddr.
rip 0x00007f76be49e1e8  0x0000000004b3dc5d  safe_read_tls_magic in section .text
    <...>
sig 0x00007f76be49e270  0x000000000000000b  No symbol matches (void *)$retaddr.
    0x00007f76be49e278  0x0000000000000001  No symbol matches (void *)$retaddr.

(gdb) x/2i 0x0000000004b3dc5d
   0x4b3dc5d <safe_read_tls_magic>:     mov    %gs:0x60,%eax

So this is the expected fault after we've removed our segment.
Why, then, didn't the SIGSEGV just go to our safe_read_tls_magic check and
from there to safe_read_tls_magic_recover?
Is it a race where we removed our handler before the SIGSEGV was delivered,
and that's why it went to the app? We remove the handler once we detach from
the final thread: actually, only once the detaching thread also does its own
thread exit, right?

(gdb) p dynamo_exited
$2 = 0
(gdb) p doing_detach
$3 = 256
(gdb) p dynamo_detaching_flag
$4 = -1
(gdb) p dynamo_exited_and_cleaned
$5 = 0
(gdb) p num_known_threads
$6 = 0
(gdb) p dynamo_initialized
$7 = 0
(gdb) p do_once_generation
$8 = 2

It looks like dynamo_exit_post_detach() has run, though maybe the detacher
made further progress while the fault was being processed.

Proposal: check doing_detach in master_signal_handler_C, and if it is set
and this is some alarm signal, just drop it on the floor? Or try to invoke
the app's handler if it's not SIGUSR2 (or a fault?).
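
A hypothetical sketch of that check (the handler shape and signal set are
assumptions for illustration, not DR's actual master_signal_handler_C;
doing_detach is the flag printed in the gdb session above):

    #include <signal.h>

    extern int doing_detach; /* DR's detach-in-progress flag */

    static void
    master_signal_handler_sketch(int sig, siginfo_t *info, void *ucxt)
    {
        (void)info;
        (void)ucxt;
        if (doing_detach &&
            (sig == SIGALRM || sig == SIGVTALRM || sig == SIGPROF))
            return; /* mid-detach alarm for a now-native thread: drop it */
        /* ... otherwise fall through to normal handling, or deliver to
         * the app's handler if one is registered ... */
    }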

@derekbruening
Contributor Author

We can drop alarm signals, but what about other signals? Is this a fatal flaw in the #2089 safe_read_tls approach? Should we try to come up with some other solution? It is not an easy problem to solve, unfortunately, but #2089's approach is not ideal in other ways: there are several faults on every new thread on attach (#2271).

@derekbruening
Contributor Author

Setting the 3 classic alarm signal handlers to SIG_IGN during detach will help with the common cases, just as it did for attach in #2161. There are still corner cases: non-timer signals, or non-classic-alarm signals used as timer signals via timer_create.
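
The idea, sketched (illustrative names; the actual change is in the commit
below):

    #include <signal.h>
    #include <string.h>

    static const int alarm_sigs[] = { SIGALRM, SIGVTALRM, SIGPROF };
    static struct sigaction saved_act[3];

    /* Ignore the 3 classic alarm signals for the duration of detach,
     * saving the prior dispositions so they can be restored afterward. */
    static void
    ignore_alarms_for_detach(void)
    {
        struct sigaction ign;
        memset(&ign, 0, sizeof(ign));
        ign.sa_handler = SIG_IGN;
        for (int i = 0; i < 3; i++)
            sigaction(alarm_sigs[i], &ign, &saved_act[i]);
    }

    static void
    restore_alarms_after_detach(void)
    {
        for (int i = 0; i < 3; i++)
            sigaction(alarm_sigs[i], &saved_act[i], NULL);
    }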

derekbruening added a commit that referenced this issue Mar 28, 2017
Sets the signal handler for the 3 alarm signals to SIG_IGN during attach
and detach, to reduce problems with races while we try to coordinate taking
over or sending native all of the app threads.

This is not a panacea, as timer_create can send out any signal, not just
the 3 classic alarm signals, on timer expiration.  Plus, non-timer-related
signals could arrive during attach or detach.  However, this will help with
the most typical cases.

Adds an itimer to the api.static_signal test, though it is not easy to
reproduce these problems in small applications.

I'm treating this as fixing the filed issues despite the above-mentioned
remaining corner cases, as I consider those pathological:

Fixes #2161
Fixes #2270
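
The timer_create corner case is easy to see in isolation: a POSIX timer may
deliver any signal on expiration, so ignoring only the 3 classic alarm
signals does not cover it (illustrative sketch; link with -lrt on older
glibc):

    #include <signal.h>
    #include <string.h>
    #include <time.h>

    /* A POSIX timer that fires SIGUSR1, not a classic alarm signal. */
    static timer_t
    make_usr1_timer(void)
    {
        timer_t t;
        struct sigevent sev;
        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGUSR1; /* any signal number is allowed */
        timer_create(CLOCK_REALTIME, &sev, &t);
        return t;
    }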
@derekbruening
Contributor Author

The not-yet-handled pathological corner cases are now part of #26

@derekbruening
Contributor Author

I'm reopening this for one more race: when a DR or client itimer is in place (xref #140), I've seen crashes right after detach. I'm pretty sure it's an itimer signal arriving after detach: we mark alarms as ignored for the duration of detach, but then we restore the app's default handler.

My proposal: on detach, if an alarm itimer is in place only for DR or a client and not for the app, and the app's handler is the default, permanently leave the handler as SIG_IGN after detach. That seems a small transparency loss for a big robustness gain.
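
A sketch of that detach-time decision (all helper names here are
hypothetical, for illustration only):

    #include <signal.h>
    #include <stdbool.h>

    /* Hypothetical queries into DR/app state; not DR's actual API. */
    extern bool dr_or_client_itimer_active(int sig);
    extern bool app_itimer_active(int sig);
    extern bool app_handler_is_default(int sig);
    extern void restore_app_handler(int sig);

    static void
    set_alarm_disposition_at_detach(int sig)
    {
        if (dr_or_client_itimer_active(sig) && !app_itimer_active(sig) &&
            app_handler_is_default(sig)) {
            /* Only DR or a client armed this timer and the app never
             * installed a handler: leave the signal ignored so a
             * straggler alarm delivered after detach cannot kill the
             * now-native process. */
            signal(sig, SIG_IGN);
        } else
            restore_app_handler(sig);
    }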

derekbruening added a commit that referenced this issue Apr 19, 2018
Adds ignoring of alarm signals post-detach if DR has an itimer and the app
does not, to avoid crashing on a signal that arrives after detach.

Adds testing to api.static_signal but it is not easy to reproduce this
race.

Fixes #2270
Carrotman42 added a commit that referenced this issue Nov 6, 2018
…ads. (#3249)

If an alarm is received by a thread after it has blocked in
check_wait_at_safe_spot but before the detaching thread sends the
SUSPEND_SIGNAL, it is possible the fcache_unit_areas lock is being held in
record_pending_signal when the SUSPEND_SIGNAL is received. Since the
receiving thread was already marked as waiting at a safe spot, we
synchronize with the thread and detach it, and the fcache_unit_areas
lock is never unlocked.

Issue: #2270
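
The shape of that hang, distilled (stub lock and names; not DR's actual
code):

    #include <pthread.h>

    /* Stub standing in for DR's fcache_unit_areas lock. */
    static pthread_mutex_t fcache_unit_areas_stub = PTHREAD_MUTEX_INITIALIZER;

    /* Timeline: the thread blocks in check_wait_at_safe_spot and is
     * marked "at a safe spot"; an alarm then arrives and
     * record_pending_signal acquires the lock; the detacher's
     * SUSPEND_SIGNAL lands while the lock is held, the thread is treated
     * as already safe and is detached, so the unlock below never runs
     * and any later acquirer of the lock hangs. */
    static void
    record_pending_signal_shape(void)
    {
        pthread_mutex_lock(&fcache_unit_areas_stub);
        /* <-- SUSPEND_SIGNAL delivered here; thread detached */
        pthread_mutex_unlock(&fcache_unit_areas_stub); /* never reached */
    }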