Deadlock in infallible_munmap_syscall_if_alive #3807
Comments
I managed to reproduce this overnight on my laptop with After what I copied there is just 300GB of the same group of messages over and over again
I gather that |
Ah, I wonder if we need this:
Since EDIT: No, this doesn't work. It does kick us into the seccomp-based "slow-path" but it's still not receiving the SIGTRAP it's expecting:
|
This looks like it might be working:
But of course this means we leak the thread's scratch space in more circumstances. Which is better than hanging, I suppose :) |
After your patch for the singlestep path, is rr stuck in a loop in |
Seems that way:
Is If it's the former, perhaps the event gets triggered because of the EDIT: No, that doesn't make sense; the actual syscall itself isn't happening in a loop, I think. |
The latter.
|
I held There's a crapton of data there and I'm not yet sure what it means, but I'm going to try and find out today :) EDIT: https://gist.github.com/KJTsanaktsidis/c36b93dd169e0e8c952953aa237e7e06 might be a better trace.
|
So I think what's happening is the following sequence of events:
Importantly, the descheduling at step 10 is not required to enter this loop; it's just what happened to occur in this ftrace recording. If rr gets around to calling I'm now going to go for a long walk and think about a patch. I guess we potentially want to mask SYSCALLBUF_DESCHED_SIGNAL during |
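The idea of masking the desched signal across a critical section can be illustrated with ordinary POSIX signal masking. The sketch below is not rr code: it is plain Python, SIGUSR1 stands in for SYSCALLBUF_DESCHED_SIGNAL, and `run_with_signal_blocked` and `raise_sigusr1_at_self` are hypothetical helpers. It only shows the general property being relied on: a signal raised while it is blocked stays pending and is delivered once the mask is restored.

```python
import signal
import threading


def run_with_signal_blocked(sig, critical_section):
    """Run critical_section() with `sig` blocked in the calling thread.

    Hypothetical helper for illustration only; call it from the main
    thread, because signal.signal() requires that.
    Returns (pending_while_blocked, handled_while_blocked, handled_after).
    """
    seen = []
    old_handler = signal.signal(sig, lambda signum, frame: seen.append(signum))
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        critical_section()
        # The signal was raised but cannot be delivered yet.
        pending_while_blocked = sig in signal.sigpending()
        handled_while_blocked = bool(seen)
    finally:
        # Restoring the mask lets any pending instance of `sig` be delivered.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    handled_after = bool(seen)
    if old_handler is not None:
        signal.signal(sig, old_handler)
    return pending_while_blocked, handled_while_blocked, handled_after


def raise_sigusr1_at_self():
    # Direct the signal at this thread so that no other thread can take it.
    signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
```

Calling `run_with_signal_blocked(signal.SIGUSR1, raise_sigusr1_at_self)` shows the signal pending but unhandled inside the section, and handled only after the mask is restored. Whether that is the right trade-off inside rr's preload code is a separate question that this sketch does not answer.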
This still doesn't make any sense to me :(
I can't quite figure out how both of these things can be true at the same time... |
That's a great analysis effort, thanks! This is hard stuff. Sounds like you need to check whether the perf event is enabled or not. If it is, then why is it enabled even though it shouldn't be? If it isn't, what's up with your previous analysis? Thanks!!! |
I pored through some of the debug logs again and annotated some of the relevant ones with my observations here - https://gist.github.com/KJTsanaktsidis/c88e5e087ab57b0f5e38e8e31550465d. This covers the life of the thread which got conscripted into doing the syscallbuf-unmap-after-execve and then fell into an infinite loop. I think my summary of what happened to that thread is...
Whew. That was a lot. It seems to me there are a few different things we could do in order to make this very long, convoluted sequence of events not happen.
Just want to say thank you again for parsing some very dense walls of text in this issue, by the way!
^1 I didn't actually verify this, but I would imagine this can actually happen synchronously; the CPU would actually raise an interrupt (fault?) when the tick counter overflowed, interrupting the tracee process at some random userspace point, generating the signal, and then dispatching the signal (and thus signal-delivery-stopping) on return to userspace.
^2 Of course we can't see this in the logs, because the tracee doesn't log. My attempt to try and add logs in that section of librrpreload caused other unrelated deadlocks, unsurprisingly. So this is an educated guess on my part.
^3 This seems to be discussed here: Lines 366 to 369 in 21f051b
^4 I think this is what is being said here: Lines 344 to 349 in 21f051b
^5 "No matter which method caused the syscall-entry-stop, if the tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters syscall-exit-stop when the system call is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop; it happens after syscall-exit-stop.)" |
@rocallahan I think I have landed on a fix for this - doing both option 2 and 3 from ☝️ - #3826 |
Epic debugging and a thorough fix with tests --- thank you very much. |
I'm able to somewhat reliably trigger a deadlock in rr when recording the Ruby test suite. This is using the latest rr sources compiled from master.
rr itself is blocked in this stack:
If you ask the kernel what the rr process is doing, it's just blocked in do_wait:
The PID it's waiting on, though (you can see it in the arguments: __waitid (idtype=P_PID, id=2377, ...)), is ptrace-stopped!
The syscall it was running was the munmap syscall that was requested:
So it seems to me that the sequence of events is:
- … vfork(2), and then exec(2)
- … munmap, and then did PTRACE_SINGLESTEP on it
- … waitid(P_PID, pid) on it
- … waitid call never got notified?
Any thoughts? I'm wondering if perhaps a different wait(P_PID, -1) call accidentally stole the signal from this one somehow (although I can't possibly see how).
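To make the "stolen notification" hypothesis concrete, here is a minimal Python sketch of how a stop is reported only once per stop event. It does not reproduce the rr deadlock and makes no claim about its cause: it uses an ordinary SIGSTOP group-stop rather than a ptrace-stop, and `observe_single_stop_report` is a hypothetical helper name. It only shows that once one wait call has consumed the stop report, a later `waitid` on the same PID has nothing to return even though the child is still stopped.

```python
import os
import signal


def observe_single_stop_report():
    """Fork a child that stops itself, then wait on it twice.

    Returns (first_status, second_result). Illustration only: this is a
    plain SIGSTOP group-stop, not a ptrace-stop.
    """
    pid = os.fork()
    if pid == 0:
        # Child: stop ourselves; we stay stopped until the parent kills us.
        os.kill(os.getpid(), signal.SIGSTOP)
        os._exit(0)
    try:
        # First wait on this specific PID: blocks until the stop is reported.
        first = os.waitid(os.P_PID, pid, os.WSTOPPED)
        # Second wait: the stop report was already consumed, so WNOHANG
        # yields None. Without WNOHANG this call would block.
        second = os.waitid(os.P_PID, pid, os.WSTOPPED | os.WNOHANG)
        return first.si_status, second
    finally:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)  # reap the child
```

The first call returns the stopping signal in `si_status`; the second returns `None`. A blocking wait in the second position would hang in the same way as the `waitid` in the stack above, which is why a consumed notification looked like a plausible explanation at this point.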