Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock in infallible_munmap_syscall_if_alive #3807

Closed
KJTsanaktsidis opened this issue Sep 1, 2024 · 13 comments
Closed

Deadlock in infallible_munmap_syscall_if_alive #3807

KJTsanaktsidis opened this issue Sep 1, 2024 · 13 comments

Comments

@KJTsanaktsidis
Copy link
Contributor

KJTsanaktsidis commented Sep 1, 2024

I'm able to somewhat reliably trigger a deadlock in rr when recording the Ruby test suite. This is using the latest rr sources compiled from master.

rr itself is blocked in this stack:

#0  0x00007faec6912ebd in __waitid (idtype=P_PID, id=2377, infop=0x7ffc2569d4e0, options=1073741826) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0x00000000005e13ef in rr::WaitState::do_wait (this=<optimized out>, tid=2377, consume=<optimized out>, type=<optimized out>, block_seconds=<optimized out>, status=...) at /root/rr/src/WaitManager.cc:58
#2  0x00000000005a7843 in rr::WaitState::wait (this=<optimized out>, options=<synthetic pointer>..., type=2) at /root/rr/src/WaitManager.cc:114
#3  rr::WaitManager::wait_stop (options=<synthetic pointer>...) at /root/rr/src/WaitManager.cc:195
#4  rr::Task::wait (this=0x96b6690, interrupt_after_elapsed=<optimized out>) at /root/rr/src/Task.cc:2128
#5  0x00000000005a8470 in rr::Task::resume_execution (this=0x96b6690, how=rr::RESUME_SINGLESTEP, wait_how=rr::RESUME_WAIT_NO_EXIT, tick_period=<optimized out>, sig=0) at /root/rr/src/Task.cc:1674
#6  0x000000000046021a in rr::AutoRemoteSyscalls::syscall_base (this=this@entry=0x7ffc2569d950, syscallno=syscallno@entry=11, callregs=...) at /root/rr/src/AutoRemoteSyscalls.cc:394
#7  0x0000000000461df2 in rr::AutoRemoteSyscalls::syscall_helper<3> (this=0x7ffc2569d950, syscallno=11, callregs=...) at /root/rr/src/AutoRemoteSyscalls.h:305
#8  rr::AutoRemoteSyscalls::syscall_helper<2, unsigned long> (this=0x7ffc2569d950, syscallno=11, callregs=..., arg=<optimized out>) at /root/rr/src/AutoRemoteSyscalls.h:298
#9  rr::AutoRemoteSyscalls::syscall_helper<1, rr::remote_ptr<void>, unsigned long> (this=0x7ffc2569d950, syscallno=11, callregs=..., arg=...) at /root/rr/src/AutoRemoteSyscalls.h:298
#10 rr::AutoRemoteSyscalls::infallible_syscall_if_alive<rr::remote_ptr<void>, unsigned long> (this=0x7ffc2569d950, syscallno=11) at /root/rr/src/AutoRemoteSyscalls.h:174
#11 rr::AutoRemoteSyscalls::infallible_munmap_syscall_if_alive (this=0x7ffc2569d950, addr=..., length=<optimized out>) at /root/rr/src/AutoRemoteSyscalls.cc:806
#12 0x00000000005a4060 in rr::Task::unmap_buffers_for (this=0x96d9420, remote=..., other=0x96d9420, saved_syscallbuf_child=...) at /root/rr/src/Task.cc:458
#13 0x00000000005a9a27 in rr::Task::post_exec (this=0x96d9420, exe_file="/home/minipc-agent/workspace/ruby_ci-rr/ruby/build/ruby") at /root/rr/src/Task.cc:1074
#14 0x0000000000542216 in rr::RecordTask::post_exec (this=0x96d9420) at /root/rr/src/RecordTask.cc:367
#15 0x00000000004f88f1 in rr::RecordSession::handle_ptrace_event (this=0x9689030, t_ptr=0x7ffc2569dc70, step_state=0x7ffc2569dc6c, result=0x7ffc2569dde0, did_enter_syscall=<optimized out>)
    at /root/rr/src/RecordSession.cc:789
#16 0x0000000000501ece in rr::RecordSession::record_step (this=0x9689030) at /root/rr/src/RecordSession.cc:2660
#17 0x00000000004f4223 in rr::record (args=..., flags=...) at /root/rr/src/RecordCommand.cc:711
#18 rr::RecordCommand::run (this=<optimized out>, args=...) at /root/rr/src/RecordCommand.cc:874
#19 0x0000000000445c40 in main (argc=<optimized out>, argv=<optimized out>) at /root/rr/src/main.cc:278

If you ask the kernel what the RR process is doing, it's just blocked in do_wait:

% sudo cat /proc/3422661/task/3422661/stack
[<0>] do_wait+0x60/0x100
[<0>] kernel_waitid+0x6e/0xa0
[<0>] __do_sys_waitid+0xfb/0x120
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The PID it's waiting on though (you can see in the arguments - __waitid (idtype=P_PID, id=2377, ...) is ptrace-stopped!

% sudo cat /proc/3422694/task/3422696/stack
[<0>] ptrace_stop+0x127/0x330
[<0>] get_signal+0x47a/0x950
[<0>] arch_do_signal_or_restart+0x3b/0x240
[<0>] syscall_exit_to_user_mode+0x1bd/0x220
[<0>] do_syscall_64+0x8e/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

% sudo cat /proc/3422694/task/3422696/status
Name:   ruby
Umask:  0022
State:  t (tracing stop)
...

The syscall it was running was the munmap syscall that was requested:

% sudo cat /proc/3422694/task/3422696/syscall
11 0x7f384d800000 0x200000 0x8 0x0 0x0 0x0 0x7f38561ffd50 0x7000000f

So it seems to me, that the sequence of events is:

  • The tracee called vfork(2), and then exec(2)
  • The exec'd process looked for a ptrace-stopped process in the original address space, and found one
  • RR set set the IP reg to the syscall trampoline & syscall argument register to munmap, and then did PTRACE_SINGLESTEP on it
  • RR waited for the singlestep'd syscall to complete by calling waitid(P_PID, pid) on it
  • The munmap syscall did happen and complete,
  • And the kernel put the syscalling process into ptrace-stop afterwards,
  • But the waitid call never got notified?

Any thoughts? I'm wondering if perhaps a differnet wait(P_PID, -1) call accidently stole the signal from this one somehow? (although I can't possibly see how).

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Sep 2, 2024

I managed to reproduce this overnight on my laptop with --log=all:debug. This seems like the relevant bit: https://gist.github.com/KJTsanaktsidis/5102807d2c9aab383c02d906320db37a

After what I copied there is just 300GB of the same group of messages over and over again

[RecordTask] Set signal mask for 1142502 to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we have a stashed signal
[Task] resuming execution of 1142502 with PTRACE_SINGLESTEP tick_period -2 wait 2
[Task] Flushing registers for tid 1142502 { ip:0x7000000f args:(0x7fa9ca400000,0x200000,0x8,0,0,0) orig_syscall: 11 syscallno: 11 }
[Scheduler] Starting 1142502
[Task] going into blocking wait for 1142502 ...
[Task]   Task 1142502 changed status to 0x1e7f (STOP-SIGPWR)
[Task]   (refreshing register cache)
[Scheduler] Stopping 1142502
[Task] Requesting registers from tracee 1142502
[AutoRemoteSyscalls] Used singlestep path; status=0x1e7f (STOP-SIGPWR)

I gather that SIGPWR is used internally by rr so I guess this probably is an rr bug of some kind. I'll try find some time this week to investigate further.

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Sep 3, 2024

Ah, I wonder if we need this:

diff --git a/src/AutoRemoteSyscalls.cc b/src/AutoRemoteSyscalls.cc
index d08376d2..0bb6274b 100644
--- a/src/AutoRemoteSyscalls.cc
+++ b/src/AutoRemoteSyscalls.cc
@@ -116,7 +116,8 @@ AutoRemoteSyscalls::AutoRemoteSyscalls(Task* t,
   // Use the slow path if SIGTRAP is blocked or ignored because otherwise
   // the PTRACE_SINGLESTEP will cause the kernel to unblock it.
   setup_path(t->vm()->has_rr_page() && !running_under_rr() &&
-             is_SIGTRAP_default_and_unblocked(t));
+             is_SIGTRAP_default_and_unblocked(t) &&
+             !(t->session().is_recording() && dynamic_cast<RecordTask*>(t)->stashed_signals_blocking_more_signals));
   if (enable_mem_params == ENABLE_MEMORY_PARAMS) {
     maybe_fix_stack_pointer();
   }

Since stashed_signals_blocking_more_signals causes SIGTRAP to be blocked? I'll run my reproduction in a loop overnight again and see how it does.

EDIT: No, this doesn't work. It does kick us into the seccomp-based "slow-path" but it's still not receiving the SIGTRAP it's expecting:

[WARN post_exec()] Unmapping buffers using tid 1281553
[AutoRemoteSyscalls] syscall munmap { ip:0x70000000 args:(0x7f15fca00000,0x200000,0x2000,0,0,0) orig_syscall: 0 syscallno: -11 }
[RecordTask] Set signal mask for 1281553 to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we have a stashed signal
[Task] resuming execution of 1281553 with PTRACE_CONT tick_period -2 wait 2
[Task] Flushing registers for tid 1281553 { ip:0x70000000 args:(0x7f15fca00000,0x200000,0x2000,0,0,0) orig_syscall: 11 syscallno: 11 }
[Scheduler] Starting 1281553
[Task] going into blocking wait for 1281553 ...
[Task]   Task 1281553 changed status to 0x1e7f (STOP-SIGPWR)
[Task]   (refreshing register cache)
[Scheduler] Stopping 1281553
[Task] Requesting registers from tracee 1281553
[RecordTask] Set signal mask for 1281553 to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we have a stashed signal
[Task] resuming execution of 1281553 with PTRACE_CONT tick_period -2 wait 2
[Scheduler] Starting 1281553
[Task] going into blocking wait for 1281553 ...
[Task]   Task 1281553 changed status to 0x1e7f (STOP-SIGPWR)
[Task]   (refreshing register cache)
[Scheduler] Stopping 1281553
[Task] Requesting registers from tracee 1281553
[RecordTask] Set signal mask for 1281553 to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we have a stashed signal
[Task] resuming execution of 1281553 with PTRACE_CONT tick_period -2 wait 2
[Scheduler] Starting 1281553
[Task] going into blocking wait for 1281553 ...
[Task]   Task 1281553 changed status to 0x1e7f (STOP-SIGPWR)
[Task]   (refreshing register cache)
[Scheduler] Stopping 1281553
[Task] Requesting registers from tracee 1281553
[RecordTask] Set signal mask for 1281553 to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we have a stashed signal
[Task] resuming execution of 1281553 with PTRACE_CONT tick_period -2 wait 2
[Scheduler] Starting 1281553
[Task] going into blocking wait for 1281553 ...

@KJTsanaktsidis
Copy link
Contributor Author

This looks like it might be working:

diff --git a/src/Task.cc b/src/Task.cc
index ab457faa..825493db 100644
--- a/src/Task.cc
+++ b/src/Task.cc
@@ -1062,7 +1062,7 @@ void Task::post_exec(const string& exe_file) {
   for (Task* t : as->task_set()) {
     if (t != this) {
       other_task_in_address_space = true;
-      if (t->is_stopped_) {
+      if (t->is_stopped_ && !(t->session().is_recording() && dynamic_cast<RecordTask*>(t)->stashed_signals_blocking_more_signals)) {
         stopped_task_in_address_space = t;
         break;
       }

But of course this means we leak the thread's scratch space in more circumstances. Which is better than hanging, I suppose :)

@rocallahan
Copy link
Collaborator

SIGPWR is SYSCALLBUF_DESCHED_SIGNAL. I don't understand why it keeps being delivered here.

After your patch for the singlestep path, is rr stuck in a loop in Task::enter_syscall()?

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Sep 4, 2024

After your patch for the singlestep path, is rr stuck in a loop in Task::enter_syscall()?

Seems that way:

(gdb) bt
#0  __waitid (idtype=P_PID, id=2291930, infop=0x7ffe6faf71d0, options=1073741831) at ../sysdeps/unix/sysv/linux/waitid.c:26
#1  0x00000000009c76df in rr::WaitState::do_wait (this=0xbeebc0 <rr::wait_state()::static_state>, tid=2291930, consume=true, type=6, block_seconds=0, status=...)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/WaitManager.cc:58
#2  0x00000000009c7b66 in rr::WaitState::wait (this=0xbeebc0 <rr::wait_state()::static_state>, options=..., type=6) at /home/kjtsanaktsidis/upstream_sources/rr/src/WaitManager.cc:114
#3  0x00000000009c81f6 in rr::WaitManager::wait_stop_or_exit (options=...) at /home/kjtsanaktsidis/upstream_sources/rr/src/WaitManager.cc:218
#4  0x000000000097be5e in rr::Task::resume_execution (this=0x3e1ab8b0, how=rr::RESUME_CONT, wait_how=rr::RESUME_WAIT_NO_EXIT, tick_period=rr::RESUME_NO_TICKS, sig=0)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/Task.cc:1616
#5  0x00000000009791d2 in rr::Task::enter_syscall (this=0x3e1ab8b0, allow_exit=true) at /home/kjtsanaktsidis/upstream_sources/rr/src/Task.cc:948
#6  0x000000000073f21b in rr::AutoRemoteSyscalls::syscall_base (this=0x7ffe6faf7b80, syscallno=11, callregs=...) at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.cc:422
#7  0x00000000007364fc in rr::AutoRemoteSyscalls::syscall_helper<3> (this=0x7ffe6faf7b80, syscallno=11, callregs=...) at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.h:305
#8  0x000000000073179d in rr::AutoRemoteSyscalls::syscall_helper<2, unsigned long> (this=0x7ffe6faf7b80, syscallno=11, callregs=..., arg=2097152)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.h:298
#9  0x000000000072c46b in rr::AutoRemoteSyscalls::syscall_helper<1, rr::remote_ptr<void>, unsigned long> (this=0x7ffe6faf7b80, syscallno=11, callregs=..., arg=...)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.h:298
#10 0x000000000074792d in rr::AutoRemoteSyscalls::infallible_syscall_if_alive<rr::remote_ptr<void>, unsigned long> (this=0x7ffe6faf7b80, syscallno=11)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.h:174
#11 0x0000000000740a04 in rr::AutoRemoteSyscalls::infallible_munmap_syscall_if_alive (this=0x7ffe6faf7b80, addr=..., length=2097152) at /home/kjtsanaktsidis/upstream_sources/rr/src/AutoRemoteSyscalls.cc:807
#12 0x0000000000978a11 in rr::Task::unmap_buffers_for (this=0x3e1b80a0, remote=..., other=0x3e1b80a0, saved_syscallbuf_child=...) at /home/kjtsanaktsidis/upstream_sources/rr/src/Task.cc:458
#13 0x0000000000979bbb in rr::Task::post_exec (this=0x3e1b80a0, exe_file="/home/kjtsanaktsidis/ruby/build/ruby") at /home/kjtsanaktsidis/upstream_sources/rr/src/Task.cc:1074
#14 0x00000000008d74c2 in rr::RecordTask::post_exec (this=0x3e1b80a0) at /home/kjtsanaktsidis/upstream_sources/rr/src/RecordTask.cc:367
#15 0x00000000008546b6 in rr::RecordSession::handle_ptrace_event (this=0x3e17e530, t_ptr=0x7ffe6faf80b0, step_state=0x7ffe6faf80ac, result=0x7ffe6faf81d0, did_enter_syscall=0x7ffe6faf80ab)
    at /home/kjtsanaktsidis/upstream_sources/rr/src/RecordSession.cc:789
#16 0x000000000085ccd3 in rr::RecordSession::record_step (this=0x3e17e530) at /home/kjtsanaktsidis/upstream_sources/rr/src/RecordSession.cc:2660
#17 0x000000000084fd15 in rr::record (args=std::vector of length 5, capacity 32 = {...}, flags=...) at /home/kjtsanaktsidis/upstream_sources/rr/src/RecordCommand.cc:711
#18 0x000000000085098f in rr::RecordCommand::run (this=0xbecd10 <rr::RecordCommand::singleton>, args=std::vector of length 5, capacity 32 = {...})
    at /home/kjtsanaktsidis/upstream_sources/rr/src/RecordCommand.cc:874
#19 0x000000000081251d in main (argc=20, argv=0x7ffe6faf8588) at /home/kjtsanaktsidis/upstream_sources/rr/src/main.cc:278

SIGPWR is SYSCALLBUF_DESCHED_SIGNAL. I don't understand why it keeps being delivered here.

Is PERF_COUNT_SW_CONTEXT_SWITCHES supposed to count all user->kernel context switches (i.e. including doing a syscall), or just when a process is switched away from in order to execute a different process? (I'll have a look at the kernel sources to try and figure this out).

If it's the former, perhaps the event gets triggered because of the munmap syscall we're actually trying to execute, and the F_SETSIG signal gets delivered, and the ptrace-stop for that happens before the ptrace-stop for the seccomp filter?

EDIT: No that doesn't make sense, the actual syscall itself isn't happening in a loop, i think.

@rocallahan
Copy link
Collaborator

rocallahan commented Sep 4, 2024 via email

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Sep 7, 2024

I held rr in GDB breaking at waitid(2), and set up ftrace (with trace-cmd) to capture the kernel function graph of both the tracer (rr) and the tracee(ruby) as I cont'd a few times in GDB.

https://gist.githubusercontent.com/KJTsanaktsidis/8f323b34bf23f069fe0d1aca3ee2c9cd/raw/ffa516f59172830676bf20bae0d6e3b1630dabde/gistfile1.txt

There's a crapton of data there and I'm not yet sure what it means, but I'm going to try and find out today :)

EDIT: https://gist.github.com/KJTsanaktsidis/c36b93dd169e0e8c952953aa237e7e06 might be a better trace.

  • I removed the breakpoint in gdb and just allowed it to continue for a few iterations, since the breakpoints were confusing things a bit
  • I trimmed out all the write(2) syscalls for the debug logging, since they were just noise.

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Sep 7, 2024

So I think what's happening is the following sequence of events:

  1. The tracee (ruby) has a pending SIGPWR signal. When the task is resumed, it immediately goes into a ptrace-stop so that the tracer (rr) can make a decision about the signal:
ruby-2291930 [001] d..2. 595260.577718: funcgraph_entry: |  dequeue_signal() {
ruby-2291930 [001] d..2. 595260.577719: funcgraph_entry: |    collect_signal() {
ruby-2291930 [001] d..1. 595260.577727: funcgraph_exit:  |  } (ret=0x1e) (n.b. 0x1e == SIGPWR)
ruby-2291930 [001] d..2. 595260.577728: funcgraph_entry: |  ptrace_stop() {
  1. Because the tracee has entered a ptrace-stop, it gets descheduled:
ruby-2291930 [001] ...2. 595260.577930: funcgraph_entry: |    cgroup_enter_frozen() {
ruby-2291930 [001] ...1. 595260.577934: funcgraph_entry: |    schedule() {
ruby-2291930 [001] d..3. 595260.577939: funcgraph_entry: |      dequeue_task_fair() {
ruby-2291930 [001] d..3. 595260.578032: funcgraph_entry: |      psi_task_switch() {
  1. Because the task is being descheduled, an IRQ event handler thingy kind of thing is enqueued in order to dispatch the PERF_COUNT_SW_CONTEXT_SWITCHES
ruby-2291930 [001] d..3. 595260.578057: funcgraph_entry: |      ___perf_sw_event() {
ruby-2291930 [001] d..3. 595260.578060: funcgraph_entry: |        perf_swevent_event() {
ruby-2291930 [001] d..3. 595260.578061: funcgraph_entry: |          __perf_event_overflow() {
ruby-2291930 [001] d..4. 595260.578072: funcgraph_entry: |            arch_irq_work_raise() {
  1. The rr process is then scheduled to run.
rr-2291897 [001] d..3. 595260.578109: funcgraph_entry: |            __switch_to_xtra(); (ret=0x0)
rr-2291897 [001] d..3. 595260.578111: funcgraph_entry: |            finish_task_switch.isra.0() {
rr-2291897 [001] d..3. 595260.578112: funcgraph_entry: |              __perf_event_task_sched_in() {
  1. There's a pending IRQ to be worked on, so that gets run first
rr-2291897 [001] d.h2. 595260.578122: funcgraph_entry: |              __sysvec_irq_work() {
rr-2291897 [001] d.h2. 595260.578123: funcgraph_entry: |                perf_pending_irq() {
rr-2291897 [001] d.h2. 595260.578124: funcgraph_entry: |                  perf_event_wakeup() {
  1. That IRQ then processes the perf PERF_COUNT_SW_CONTEXT_SWITCHES event from step 4, including sending a signal to the ruby tracee process because of the F_SETSIG machinery. Because the previous SIGPWR signal was already dequeued in step 1, it gets delivered again and added to the process's pending set. We know this for sure, because execution got as far as __sigqueue_alloc() (https://elixir.bootlin.com/linux/v6.10.8/source/kernel/signal.c#L1119) and didn't bail out at https://elixir.bootlin.com/linux/v6.10.8/source/kernel/signal.c#L1095.
rr-2291897 [001] d.h2. 595260.578127: funcgraph_entry: |                    kill_fasync() {
rr-2291897 [001] d.h3. 595260.578130: funcgraph_entry: |                      send_sigio() {
rr-2291897 [001] d.h4. 595260.578133: funcgraph_entry: |                        send_sigio_to_task() {
rr-2291897 [001] d.h4. 595260.578151: funcgraph_entry: |                          do_send_sig_info() {
rr-2291897 [001] d.h5. 595260.578157: funcgraph_entry: |                              __send_signal_locked() {
rr-2291897 [001] d.h5. 595260.578157: funcgraph_entry: |                                prepare_signal(); (ret=0x1)
rr-2291897 [001] d.h5. 595260.578158: funcgraph_entry: |                                __sigqueue_alloc() {
  1. The rr process then returns from waitid(2) with the signal which was dequeued in step 1.
rr-2291897 [001] ...1. 595260.578318: funcgraph_entry: |          __do_wait() {
rr-2291897 [001] ..... 595260.578333: funcgraph_exit:  |        } (ret=0x22f8da) (n.b. == 2291930, pid of ruby tracee)
rr-2291897 [001] ...1. 595260.578337: funcgraph_entry:  |  syscall_exit_to_user_mode_prepare(); (ret=0x0)
  1. Task::enter_syscall() sees that it got a SIGPWR (i.e. the desched signal) here, and ignores it:

    rr/src/Task.cc

    Lines 976 to 978 in 4d027c6

    if (stop_sig() == session().as_record()->syscallbuf_desched_sig()) {
    continue;
    }

  2. So Task::enter_syscall() then calls PTRACE_CONTINUE again. This makes the tracee ruby runnable again and so it's enqueued to run.

rr-2291897 [001] ...1. 595260.580871: funcgraph_entry: |  x64_sys_call() {
rr-2291897 [001] ...1. 595260.580871: funcgraph_entry: |    __x64_sys_ptrace() {
rr-2291897 [001] ...1. 595260.580886: funcgraph_entry: |      arch_ptrace() {
rr-2291897 [001] ...1. 595260.580887: funcgraph_entry: |        ptrace_request() {
rr-2291897 [001] ...1. 595260.580887: funcgraph_entry: |          user_disable_single_step(); (ret=0x1c002)
rr-2291897 [001] d..2. 595260.580889: funcgraph_entry: |          wake_up_state() {
rr-2291897 [001] d..2. 595260.580890: funcgraph_entry: |            try_to_wake_up() {
rr-2291897 [001] d..5. 595260.580914: funcgraph_entry: |                enqueue_task_fair() {
  1. The kernel then actually takes this moment, after the PTRACE_CONTINUE syscall completes, to deschedule rr and run something else.
rr-2291897 [001] .N.1. 595260.581002: funcgraph_entry: |  schedule() {
rr-2291897 [001] dN.3. 595260.581007: funcgraph_entry: |    pick_next_task_fair() {
  1. The "something else" is the trace, ruby:
ruby-2291930 [001] d..3. 595260.581078: funcgraph_entry: |      finish_task_switch.isra.0() {
  1. But, before ruby can do anything, there's a pending signal - the SIGPWR enqueued in step 6! So the tracee again immediately goes into a signal-delivery ptrace-stop again.
ruby-2291930 [001] d..2. 595260.581149: funcgraph_entry: |  dequeue_signal() {
ruby-2291930 [001] d..1. 595260.581156: funcgraph_exit:  |  } (ret=0x1e)
ruby-2291930 [001] d..2. 595260.581156: funcgraph_entry: |  ptrace_stop() {
  1. We're basically back at step 4 here; Ruby will get descheduled, rr will get scheduled, the IRQ handlre will run and enqueue another SIGPWR to Ruby, and the whole thing happens all over again.

Importantly, the descheduling at step 10 is not required to enter this loop; it's just what happened to occur in this ftrace recording. If rr gets around to calling waitid(2) before it runs out of CPU time slice or what not, it will block itself at that point anyway.


I'm now going to go for a long walk and think about a patch. I guess we potentially want to mask SYSCALLBUF_DESCHED_SIGNAL during AutoRemoteSyscalls::syscall_base perhaps? This is in a way "hijacking" the process to do some random actual syscall, so the syscallbuf machinery shouldn't be getting touched during that.

@KJTsanaktsidis
Copy link
Contributor Author

This still doesn't make any sense to me :(

  • In order to be receiving SIGPWR, the desched signal needs to be armed, which means the thread needs to be blocked in a buffered syscall - because that gets enabled/disabled in start_commit_buffered_syscall and commit_raw_syscall
  • But, in order to be selected for use in unmapping the other thread's syscallbuf, the thread needs to be stopped and NOT blocked in a syscall.

I can't quite figure out how both of these things can be true at the same time...

@rocallahan
Copy link
Collaborator

That's a great analysis effort, thanks! This is hard stuff.

Sounds like you need to check whether the perf event is enabled or not. If it is, then why is it enabled even though it shouldn't be? If it isn't, what's up with your previous analysis?

Thanks!!!

@KJTsanaktsidis
Copy link
Contributor Author

I pored through some of the debug logs again and annotated some of the relevant ones with my observations here - https://gist.github.com/KJTsanaktsidis/c88e5e087ab57b0f5e38e8e31550465d. This covers the life of the thread which got conscripted into doing the syscallbuf-unmap-after-execve and then fell into an infinite loop.

I think my summary of what happened to that thread is...

  • It was in the syscallbuf code, about to call read(2)
  • It received a SIGSTKFLT (TIME_SLICE_SIGNAL) [^1]
  • But, since we were in the syscallbuf code, the signal had to be stashed.
  • The tracee enabled the desched signal in start_commit_buffered_syscall ^2
  • Then made an untraced read(2) syscall
  • But the untraced syscall instruction is actually now an int 3 brekpoint, because the signal-stashing process makes sure to add this trap.
  • We ignore a couple of SIGPWR's (SYSCALLBUF_DESCHED_SIGNAL) which were generated by the prevoius ptrace-stops themselves, I think ^3
  • But, we eventually disarm the desched event and PTRACE_SYSCALL through the read(2) syscall.
  • The first time we do this, it returns EINTR (because we interrupted it with a SIGPWR signal).
  • So, we must re-run the syscall.
  • We wouldn't normally need to have the desched signal enabled during this second syscall (we're executing it with PTRACE_SYSCALL, so we'll get a notification when the syscall exits, however it exits). But, we actually still have a stashed SIGSTKFLT (nothing has actually dealt with it yet, because we haven't left the syscallbuf region!). And this code (

    rr/src/RecordSession.cc

    Lines 1203 to 1204 in 21f051b

    LOG(debug) << "Rearming desched event so we'll get a chance to deliver "
    "stashed signal";
    ) is re-arming the desched signal when there's a stashed signal.
  • The read syscall is blocking, so the scheduler goes and runs another thread.
  • That means that the thread is going to immediately get a SIGPWR again, although we're not paying attention to it for the moment.
  • That thread is the vfork'd child which is going to call execve shortly
  • While doing that, the scheduler gets an event for the read(2) thread. The event it gets is NOT the SIGPWR, though, but the syscall-exit event. The man page for ptrace(2) says that signal-delivery-stop is always delivered AFTER syscall-exit-stop. ^5
  • But because we received the ptrace-stop for it, we've marked the task as stopped
  • That means that when we actually execve(2) out of the vfork-child, it sees this thread as one of the candidates that's stopped and can be used to unmap the syscallbuf out of the address space.
  • But nothing ever disarmed the desched signal, so we get the infinite loop when we try and use the thread for AutoRemoteSyscalls!

Whew. That was a lot


It seems to me there's a few different things we could do in order to make this very long convoluted sequence of events not happen.

  1. Should we be doing more checks on what state a thread can be in before we can borrow it to unmap a defunct threads syscallbuf?
  2. Should AutoRemoteSyscalls just be disarming/rearming the desched signal if it's armed?
  3. Should we actually just make a note of the fact that the unmapping needs to happen, and check to see if we should do this unmapping elsewhere in a more convenient place where we've got more guarantees about what state the thread is in?
  4. Should

    rr/src/RecordSession.cc

    Lines 1203 to 1204 in 21f051b

    LOG(debug) << "Rearming desched event so we'll get a chance to deliver "
    "stashed signal";
    be arming the desched signal like this if the stashed signal is itself a timeslice signal? I guess the idea is that for real signals, we want the syscall to fail with EINTR and to actually expose that to the tracee, but for the timeslice signal we want to hide it so we can never actually leave the syscallbuf region?

Just want to say thank you again for parsing some very dense walls of text in this issue by the way!


^1 I didn't actually verify this, but I would imagine this can actually happen synchronously; the CPU would actually raise an interrupt (fault?) when the tick counter overflowed, interrupting the tracee process at some random userspace point, generating the signal, and then dispatching the signal (and thus signal-delivery-stopping) on return to userspace).

^2 Of couse we can't see this in the logs, because the tracee doesn't log. My attempt to try and add logs in that section of librrpreload caused other unrelated deadlocks, unsurprisingly. So this is an educated guess on my part.

^3 This seems to be discussed here:

rr/src/record_signal.cc

Lines 366 to 369 in 21f051b

* An annoyance of the desched signal is that when the tracer
* is descheduled in interval (C) above, we see normally (see
* below) see *two* signals. The current theory of what's
* happening is

^4 I think this is what is being said here:

rr/src/record_signal.cc

Lines 344 to 349 in 21f051b

/* The desched event just fired. That implies that the
* arm-desched ioctl went into effect, and that the
* disarm-desched syscall didn't take effect. Since a signal
* is pending for the tracee, then if the tracee was in a
* syscall, linux has exited it with an -ERESTART* error code.
* That means the tracee is about to (re-)enter either

^5 " No matter which method caused the syscall-entry-stop, if the tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters syscall-exit-stop when the system call is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop; it happens after syscall-exit-stop.)"

@KJTsanaktsidis
Copy link
Contributor Author

@rocallahan I think I have landed on a fix for this - doing both option 2 and 3 from ☝️ - #3826

@rocallahan
Copy link
Collaborator

Epic debugging and a thorough fix with tests --- thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants