Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process: block SIGCHLD signal when spawning process #6953

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sheremetyev
Copy link

Motivation

I'm observing that waiting for tokio::process::Command occasionally hangs on macOS (reproduced in Jujutsu version control CLI executed from inside jj_tui). Issue #6770 looks similar.

Solution

Inspecting the code, there seems to be a race possible between registering signal handler for SIGCHLD and receiving signal. Standard solution is to block signal during forking: https://stackoverflow.com/a/6757240. This is implemented in the pull request.

With this change I cannot reproduce hanging anymore.

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-process Module: tokio/process M-signal Module: tokio/signal labels Nov 6, 2024
@Darksonn Darksonn requested a review from ipetkov November 6, 2024 09:58
@Darksonn
Copy link
Contributor

Darksonn commented Nov 6, 2024

@ipetkov what do you think?

At the very least, this code looks like it would break if called in parallel from two threads at once.

Copy link
Contributor

@notgull notgull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this cause child processes that complete while SIGCHLD is blocked to hang?

@ipetkov
Copy link
Member

ipetkov commented Nov 7, 2024

At the very least, this code looks like it would break if called in parallel from two threads at once.

Very much agree with this statement; this proposed change does not look safe in the general sense. Even if we were to introduce our own mutex here there's no guarantee that some other part of the process isn't trying to mess with signals.


I have two main questions:

  1. Are we sure this race is the real culprit or just suspected to impact things? Hard to tell without any real steps to reproduce (though maybe the "fix" changes the execution order enough to not fall afoul of some other race condition). It's worth noting that the current implementation tries to be resilient against missed signals since we'll effectively always poll the child at least once to see if it has exited on start up.
  2. If there really is a race between the child exiting before we've set up our signal handler, why do we need to block and unblock those signals instead of moving this line up to before spawning the child? This might be a much safer (and atomic) way of avoiding the race without introducing other races

@sheremetyev
Copy link
Author

Thank you for reviewing! Here is a bit more context in case it's useful.

Steps to reproduce the issue:

  • install Jujutsu (https://github.com/martinvonz/jj) version 0.22 or 0.23, or compile from master
  • install jj_tui (https://github.com/faldor20/jj_tui/releases/tag/v0.8.8)
  • create repository with jj git init and a few commits with jj commit
  • run jj_tui and keep pressing key "e" to edit commit - it should hang after a few attempts (stops reacting to any key)
  • ps aux | grep jj would show process jj edit ...
  • running jj st or some other Jujutsu command in another terminal hangs (because it's waiting for file lock)
  • attaching debugger to jj process would should it's waiting in mio::poll

(sorry I don't have a shorter way to reproduce!)

In order to test the fix in this PR, compile Jujutsu against custom version of Tokio


Are we sure this race is the real culprit or just suspected to impact things?

Not 100% sure.

...we'll effectively always poll the child at least once..

I observed that the ChildDropGuard: :poll function was called only 3 times in the hanging process. Then it was waiting in mio::poll - not entirely sure what's happening. Probably waiting for a signal wake up and run ChildDropGuard: :poll again?


... moving this line up to before spawning the child?

I did try installing signal before the spawn call - didn't help. IIRC installing signal handler is not guaranteed to have immediate effect (cannot find a reference at the moment).


Wouldn't this cause child processes that complete while SIGCHLD is blocked to hang?

Pending signals are delivered after unblocking (https://man7.org/linux/man-pages/man7/signal.7.html).


At the very least, this code looks like it would break if called in parallel from two threads at once.

Signal is blocked for all threads but unblocking it by one thread could affect concurrent calls to spawn_child indeed. I wonder if the current implementation installing signal handler safe for multithreading environments?

Playing with signals in a library is inherently fragile - it's unfortunate that macOS version is relying on signals IMHO. Linux implementation is using file descriptors.

@ipetkov
Copy link
Member

ipetkov commented Nov 8, 2024

I'd recommend running the test scenario under strace (or whatever is the equivalent on macOS). That should paint a clearer picture of what is happening, both from tokio's side as well as the rest of the application

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-process Module: tokio/process M-signal Module: tokio/signal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants