Recording Firefox with Gecko Profiler causes crash #1930

hotsphink · 2016-12-28T20:11:08Z

For https://bugzilla.mozilla.org/show_bug.cgi?id=1322559 I was trying to record a --disable-profiling build of Firefox with the (new) Gecko Profiler enabled ( https://raw.githubusercontent.com/mstange/Gecko-Profiler-Addon/master/gecko_profiler.xpi ). I was seeing a crash in GeckoSampler::doNativeBacktrace, which is actually what I wanted to see and debug, but it appears that it is behaving differently when rr is recording so it isn't the crash I was looking for. (To be clear, this is not a problem of divergence between record and replay; this is the recording affecting the initial run.)

What appears to be happening is that when a SIGPROF signal handler gets invoked, the stack pointer stored in its context argument is the stack pointer for rr's syscall hooking code, which is in a completely different stack from the actual executing program:

(rr) bt 14
#0  0x00007f9ea85830f5 in GeckoSampler::doNativeBacktrace(ThreadProfile&, TickSample*) (this=this@entry=0x7f9e6fca3ca0, aProfile=..., aSample=aSample@entry=0x681ff6d8) at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1139
#1  0x00007f9ea8583694 in GeckoSampler::InplaceTick(TickSample*) (this=0x7f9e6fca3ca0, sample=0x681ff6d8)
    at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1219
#2  0x00007f9ea857f209 in (anonymous namespace)::ProfilerSignalHandler(int, siginfo_t*, void*) (signal=<optimized out>, info=<optimized out>, context=0x681ff740) at /home/sfink/src/mozilla/tools/profiler/core/platform-linux.cc:252
#3  0x00007f9eb6c09c30 in <signal handler called> () at /lib64/libpthread.so.0
#4  0x000000007000000e in  ()
#5  0x00007f9eb700e233 in _raw_syscall () at /home/sfink/src/rr/src/preload/raw_syscall.S:120
#6  0x00007f9eb700b082 in untraced_syscall_base (syscallno=syscallno@entry=7, a0=a0@entry=140319650562601, a1=a1@entry=1, a2=a2@entry=-1, a3=a3@entry=0, a4=a4@entry=0, a5=0, syscall_instruction=0x7000000c) at /home/sfink/src/rr/src/preload/preload.c:334
#7  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:1753
#8  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2402
#9  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2456
#10 0x00007f9eb700e26a in _syscall_hook_trampoline () at /home/sfink/src/rr/src/preload/syscall_hook.S:216
#11 0x00007f9eb700e293 in __morestack () at /home/sfink/src/rr/src/preload/syscall_hook.S:348
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
#13 0x00007f9eb5e82571 in poll () at /lib64/libc.so.6
#14 0x00007f9eaff93f80 in _xcb_conn_wait () at /lib64/libxcb.so.1

(rr) f 4
#4  0x000000007000000e in ?? ()
(rr) p/x $rsp
$14 = 0x681ffdf0
(rr) f 12
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
(rr) p/x $rsp
$17 = 0x681ffff0
(rr) f 13
#13 0x00007f9eb5e82571 in poll () from /lib64/libc.so.6
(rr) p/x $rsp
$18 = 0x7ffe68e53a90

doNativeBacktrace grabs a chunk of the stack to memcpy, and ends up biting off more than it can chew -- I mean, access. I guess what I'd like it to do is give the signal handler the register state as of the "call" to _syscall_hook_trampoline?

The text was updated successfully, but these errors were encountered:

Keno · 2016-12-28T20:33:19Z

Try running with syscallbuf disabled rr record -n? Futzing with register state for the signal handler may be possible, but I'm not entirely convinced it's a good idea.

rocallahan · 2016-12-28T20:57:24Z

For sanity we'd have to completely unwind all register values. We'd also want to handle changes to the register values. I think it's doable. @Keno, what worries you about it?

rocallahan · 2016-12-28T20:59:20Z

I guess there would be some weirdness where according to the signal handler's stack we're doing a plain syscall but in the syscallbuf code we might be doing a different syscall or not in the kernel at all. But that wouldn't break anything that isn't already broken.

rocallahan · 2016-12-28T21:01:13Z

I suppose that if a signal handler tries to manipulate registers to alter the syscall restart, it's going to be disappointed and angry, whereas today it just might work.

rocallahan · 2016-12-28T21:08:05Z

It might be possible to exit from the syscallbuf code to a trampoline which calls the signal handler with the right register state, and after sigreturn if it needs to restart the syscall, reenters the syscallbuf again. We'd need to have rr redirect the sigreturn back to the trampoline since we wouldn't want to return to the RIP in the signal frame.

rocallahan · 2016-12-28T21:10:30Z

It would be a pain to implement though since I think it would have to complicate replay.

Keno · 2016-12-29T01:05:24Z

Yes, I'm mostly worried about signal handlers that try to modify register values. There is also a concern about what happens if the signal handler longjmps out of there. Though I guess that doesn't work today either.

rocallahan · 2016-12-29T02:21:15Z

If we did what I suggested above, longjmp-out would actually work.

rocallahan · 2016-12-29T02:25:38Z

It could be kinda nice since it would mean we never have to worry about reentering the syscallbuf. No nested descheds, no worries about running user code on the syscallbuf alt-stack.

However, it would be a large change to a fragile part of the system.

hotsphink · 2016-12-29T20:13:41Z

Now that I know what's going on, I can't say this is a huge deal for me. It is rare that I'd want rr and the profiler running at the same time, and disabling the syscallbuf seems like a fine workaround. The main problem is that rr is just too good (too seamless) these days, so a discrepancy like this is unexpected and is therefore harder to diagnose and understand. I just wish it could be more obvious that something is up.

At least for this case, relying on replay to detect this would be fine. If replay knew that it was feeding bogus values to a replayed signal handler, it could... uh, do something. Transmit a warning through the gdb connection or replace them with 0xdeadbeef and then produce a different error message if divergence was detected or... ok, maybe I just jumped the shark.

hotsphink · 2017-03-29T21:56:22Z

Heh. I happen to be working on something that requires running with the profiler on, and of course, I had forgotten about this bug. Running rr record -n is fine, if but it would be nice if it could somehow tell me that they're incompatible.

hotsphink · 2024-01-30T00:48:19Z

I've run into this problem again, but this time -n wasn't good enough. I was trying to run firefox to debug a crash in JS stackwalking, but was getting the wrong crash when running without -n. With -n, it was very slow (expected), and was not able to load pages (unexpected), which was necessary for the reproduction steps. I tried running with -n -h in hopes that chaos mode would schedule things in a way that would allow progress, but no luck.

mstange mentioned this issue Jan 10, 2018

Crashing when starting the gecko profiler with rr #2158

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recording Firefox with Gecko Profiler causes crash #1930

Recording Firefox with Gecko Profiler causes crash #1930

hotsphink commented Dec 28, 2016 •

edited by rocallahan

Loading

Keno commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

Keno commented Dec 29, 2016

rocallahan commented Dec 29, 2016

rocallahan commented Dec 29, 2016

hotsphink commented Dec 29, 2016

hotsphink commented Mar 29, 2017

hotsphink commented Jan 30, 2024

Recording Firefox with Gecko Profiler causes crash #1930

Recording Firefox with Gecko Profiler causes crash #1930

Comments

hotsphink commented Dec 28, 2016 • edited by rocallahan Loading

Keno commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

rocallahan commented Dec 28, 2016

Keno commented Dec 29, 2016

rocallahan commented Dec 29, 2016

rocallahan commented Dec 29, 2016

hotsphink commented Dec 29, 2016

hotsphink commented Mar 29, 2017

hotsphink commented Jan 30, 2024

hotsphink commented Dec 28, 2016 •

edited by rocallahan

Loading