Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recording Firefox with Gecko Profiler causes crash #1930

Open
hotsphink opened this issue Dec 28, 2016 · 12 comments
Open

Recording Firefox with Gecko Profiler causes crash #1930

hotsphink opened this issue Dec 28, 2016 · 12 comments

Comments

@hotsphink
Copy link
Contributor

hotsphink commented Dec 28, 2016

For https://bugzilla.mozilla.org/show_bug.cgi?id=1322559 I was trying to record a --disable-profiling build of Firefox with the (new) Gecko Profiler enabled ( https://raw.githubusercontent.com/mstange/Gecko-Profiler-Addon/master/gecko_profiler.xpi ). I was seeing a crash in GeckoSampler::doNativeBacktrace, which is actually what I wanted to see and debug, but it appears that it is behaving differently when rr is recording so it isn't the crash I was looking for. (To be clear, this is not a problem of divergence between record and replay; this is the recording affecting the initial run.)

What appears to be happening is that when a SIGPROF signal handler gets invoked, the stack pointer stored in its context argument is the stack pointer for rr's syscall hooking code, which is in a completely different stack from the actual executing program:

(rr) bt 14
#0  0x00007f9ea85830f5 in GeckoSampler::doNativeBacktrace(ThreadProfile&, TickSample*) (this=this@entry=0x7f9e6fca3ca0, aProfile=..., aSample=aSample@entry=0x681ff6d8) at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1139
#1  0x00007f9ea8583694 in GeckoSampler::InplaceTick(TickSample*) (this=0x7f9e6fca3ca0, sample=0x681ff6d8)
    at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1219
#2  0x00007f9ea857f209 in (anonymous namespace)::ProfilerSignalHandler(int, siginfo_t*, void*) (signal=<optimized out>, info=<optimized out>, context=0x681ff740) at /home/sfink/src/mozilla/tools/profiler/core/platform-linux.cc:252
#3  0x00007f9eb6c09c30 in <signal handler called> () at /lib64/libpthread.so.0
#4  0x000000007000000e in  ()
#5  0x00007f9eb700e233 in _raw_syscall () at /home/sfink/src/rr/src/preload/raw_syscall.S:120
#6  0x00007f9eb700b082 in untraced_syscall_base (syscallno=syscallno@entry=7, a0=a0@entry=140319650562601, a1=a1@entry=1, a2=a2@entry=-1, a3=a3@entry=0, a4=a4@entry=0, a5=0, syscall_instruction=0x7000000c) at /home/sfink/src/rr/src/preload/preload.c:334
#7  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:1753
#8  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2402
#9  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2456
#10 0x00007f9eb700e26a in _syscall_hook_trampoline () at /home/sfink/src/rr/src/preload/syscall_hook.S:216
#11 0x00007f9eb700e293 in __morestack () at /home/sfink/src/rr/src/preload/syscall_hook.S:348
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
#13 0x00007f9eb5e82571 in poll () at /lib64/libc.so.6
#14 0x00007f9eaff93f80 in _xcb_conn_wait () at /lib64/libxcb.so.1

(rr) f 4
#4  0x000000007000000e in ?? ()
(rr) p/x $rsp
$14 = 0x681ffdf0
(rr) f 12
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
(rr) p/x $rsp
$17 = 0x681ffff0
(rr) f 13
#13 0x00007f9eb5e82571 in poll () from /lib64/libc.so.6
(rr) p/x $rsp
$18 = 0x7ffe68e53a90

doNativeBacktrace grabs a chunk of the stack to memcpy, and ends up biting off more than it can chew -- I mean, access. I guess what I'd like it to do is give the signal handler the register state as of the "call" to _syscall_hook_trampoline?

@Keno
Copy link
Member

Keno commented Dec 28, 2016

Try running with syscallbuf disabled rr record -n? Futzing with register state for the signal handler may be possible, but I'm not entirely convinced it's a good idea.

@rocallahan
Copy link
Collaborator

For sanity we'd have to completely unwind all register values. We'd also want to handle changes to the register values. I think it's doable. @Keno, what worries you about it?

@rocallahan
Copy link
Collaborator

I guess there would be some weirdness where according to the signal handler's stack we're doing a plain syscall but in the syscallbuf code we might be doing a different syscall or not in the kernel at all. But that wouldn't break anything that isn't already broken.

@rocallahan
Copy link
Collaborator

I suppose that if a signal handler tries to manipulate registers to alter the syscall restart, it's going to be disappointed and angry, whereas today it just might work.

@rocallahan
Copy link
Collaborator

It might be possible to exit from the syscallbuf code to a trampoline which calls the signal handler with the right register state, and after sigreturn if it needs to restart the syscall, reenters the syscallbuf again. We'd need to have rr redirect the sigreturn back to the trampoline since we wouldn't want to return to the RIP in the signal frame.

@rocallahan
Copy link
Collaborator

It would be a pain to implement though since I think it would have to complicate replay.

@Keno
Copy link
Member

Keno commented Dec 29, 2016

Yes, I'm mostly worried about signal handlers that try to modify register values. There is also a concern about what happens if the signal handler longjmps out of there. Though I guess that doesn't work today either.

@rocallahan
Copy link
Collaborator

If we did what I suggested above, longjmp-out would actually work.

@rocallahan
Copy link
Collaborator

It could be kinda nice since it would mean we never have to worry about reentering the syscallbuf. No nested descheds, no worries about running user code on the syscallbuf alt-stack.

However, it would be a large change to a fragile part of the system.

@hotsphink
Copy link
Contributor Author

Now that I know what's going on, I can't say this is a huge deal for me. It is rare that I'd want rr and the profiler running at the same time, and disabling the syscallbuf seems like a fine workaround. The main problem is that rr is just too good (too seamless) these days, so a discrepancy like this is unexpected and is therefore harder to diagnose and understand. I just wish it could be more obvious that something is up.

At least for this case, relying on replay to detect this would be fine. If replay knew that it was feeding bogus values to a replayed signal handler, it could... uh, do something. Transmit a warning through the gdb connection or replace them with 0xdeadbeef and then produce a different error message if divergence was detected or... ok, maybe I just jumped the shark.

@hotsphink
Copy link
Contributor Author

Heh. I happen to be working on something that requires running with the profiler on, and of course, I had forgotten about this bug. Running rr record -n is fine, if but it would be nice if it could somehow tell me that they're incompatible.

@hotsphink
Copy link
Contributor Author

I've run into this problem again, but this time -n wasn't good enough. I was trying to run firefox to debug a crash in JS stackwalking, but was getting the wrong crash when running without -n. With -n, it was very slow (expected), and was not able to load pages (unexpected), which was necessary for the reproduction steps. I tried running with -n -h in hopes that chaos mode would schedule things in a way that would allow progress, but no luck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants