Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithreaded fork appears flaky on OSX #14232

Closed
alexcrichton opened this issue May 15, 2014 · 24 comments
Closed

Multithreaded fork appears flaky on OSX #14232

alexcrichton opened this issue May 15, 2014 · 24 comments
Assignees
Labels
P-medium Medium priority T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Copy link
Member

In the following program, a number of threads are made, and then each thread forks of a child that sleeps forever and then immediately kills it. I would expect this program to succeed continuously, but it wedges on OSX occasionally, reporting a successful signal delivery, but failing to actually deliver the signal apparently.

#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>

void *work(void *arg) {
#define M 100
  int i;
  for (i = 0; i < M; i++) {
    int a = fork();
    assert(a >= 0);
    if (a == 0) {
      sleep(1000000000);
      return arg;
    }
    assert(kill(a, SIGTERM) == 0);
    int ret;
    assert(waitpid(a, &ret, 0) == a);
    assert(!WIFEXITED(ret));
  }
  return arg;
}

int main() {
#define N 8
  pthread_t c[N];
  int i;
  for (i = 0; i < N; i++) {
    assert(pthread_create(&c[i], NULL, work, (void*) (size_t) i) == 0);
  }
  for (i = 0; i < N; i++) {
    assert(pthread_join(c[i], NULL) == 0);
  }
}

This is essentially how we fork() in libnative, and it's how we're using fork from libgreen. Trying to investigate a solution to this, but I'm starting to think that multithreaded fork is just fundamentally broken on basically all platforms except linux.

This issue has appeared as various forms of flakiness on the bots, which is why I started investigating.

@thestinger
Copy link
Contributor

It should really just use posix_spawn. It's significantly faster and has fewer issues on non-Linux platforms.

@steveklabnik
Copy link
Member

@alexcrichton with the runtime changes, is this still relevant?

@alexcrichton
Copy link
Member Author

Unfortunately, yes.

@alexcrichton
Copy link
Member Author

I... don't think there's anything we can do about this, so I think it's just something we're going to have to live with unfortunately.

@thestinger
Copy link
Contributor

Intentionally providing broken functionality rather than using posix_spawn? At least document that Rust's subprocess management doesn't work.

@aturon
Copy link
Member

aturon commented Mar 5, 2015

I'm uncomfortable just closing this outright without some further discussion; I'm going to re-open as needs-decision.

@aturon aturon reopened this Mar 5, 2015
@aturon
Copy link
Member

aturon commented Mar 5, 2015

triage: P-backcompat-libs (1.0 beta)

@aturon
Copy link
Member

aturon commented Mar 24, 2015

triage: P-backcompat-libs (1.0)

@pnkfelix
Copy link
Member

pnkfelix commented Apr 2, 2015

nominating for discussion at next week's triage.

@alexcrichton alexcrichton removed this from the 1.0 milestone Apr 2, 2015
@alexcrichton alexcrichton removed the A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows label Apr 9, 2015
@pnkfelix
Copy link
Member

pnkfelix commented Apr 9, 2015

There are many open questions here, and many ways that we might try to resolve or at least work around this in future versions of the stdlib.

One fundamental issue is that there are some usage patterns supported by separated fork followed by exec that are not supported out the box by posix_spawn.

We should however:

  • investigate where we can do a best-effort attempt to use posix_spawn when possible
  • document the current behavior, with its flakiness, along with our plans for potential changes in the future.

The former bullet need not happen for 1.0, but the latter bullet definitely should happen for 1.0.

@aturon
Copy link
Member

aturon commented Apr 9, 2015

(In particular, posix_spawn does not support setting the working directory for the child process)

@pnkfelix
Copy link
Member

pnkfelix commented Apr 9, 2015

assigning to @aturon to follow through on the 1.0 parts.

@pnkfelix pnkfelix assigned aturon and unassigned alexcrichton Apr 9, 2015
@pnkfelix pnkfelix added this to the 1.0 milestone Apr 9, 2015
@rprichard
Copy link
Contributor

FWIW: I modified the test program to call work directly from main, without creating any threads. It still hung fairly reliably on my OS X 10.10.2 laptop. I increased M from 100 to 20000. It also printed the (!WIFEXITED(ret)) assertion failure, but infrequently.

@alexcrichton
Copy link
Member Author

@rprichard fascinating! I have also been able to reproduce that, but I haven't gotten the assertion to trigger yet, it just deadlocks when the child doesn't receive the signal. I'm getting more and more suspicious of this over time...

@brson brson removed this from the 1.0 milestone Apr 24, 2015
@brson brson added P-medium Medium priority and removed P-backcompat-libs labels Apr 24, 2015
@brson
Copy link
Contributor

brson commented Apr 24, 2015

Hopefully this is a bug that can be fixed in the fullness of time, because the API is not changing now.

@pnkfelix
Copy link
Member

In the notes from triage, I noted that this should happen for 1.0:

document the current behavior, with its flakiness, along with our plans for potential changes in the future.

Has this documentation actually happened? Should we open a separate issue?

@alexcrichton
Copy link
Member Author

The documentation has not changed yet, but it's not something I'd want to block the release on, so I'm not sure a new bug is worth it (it would still be nice to do though)

@geofft
Copy link
Contributor

geofft commented May 15, 2015

I can reproduce the gist of @rprichard's results on an iMac13,2 ("27-inch, Late 2012") running 10.10.3: if I replace main with work(NULL), deadlocks are rarer, but certainly possible (especially if I run eight parallel shell loops running the program 500 times). So this isn't (solely) about threads: single-threaded programs can hit the same bug too.

Also, if I replace the fork call with a posix_spawn of sleep 100000, deadlocks are rarer, but still not impossible: I've observed at least one deadlock where sending a SIGTERM to the sleep process caused the app to terminate normally. So this isn't (solely) about fork: posix_spawn can hit the same bug too.

Looking at the xnu kernel source for kill (see bsd/kern/kern_sig.c), it's implemented via psignal / psignal_internal, both of which return void and have no error reporting. It's certainly possible there's something racy inside xnu itself between setting up a new process and returning to the caller of fork / posix_spawn. I don't fully understand the Mach-plus-BSD setup in xnu, but I could imagine that there's a period of time when the process exists enough to have a BSD pid, but not enough (missing a Mach task or thread?) to get signals.

If I add a 1-millisecond sleep in the parent, the original version of the program (threads and fork) no longer seems to deadlock at all, but it'd be useful if someone else can confirm that. Which brings to mind a bad solution: we could just sleep a millisecond or so in library code after forking on OS X. I wonder if that would help the issues with @alexcrichton's flaky bots.

Can we rename this issue to be something more like "signals immediately after process creation are racy on OS X", since it's not strictly related to multiple threads? I think the problem only appears worse there because more cores are used, which exercises races in the kernel more....

@alexcrichton
Copy link
Member Author

Wow, thanks for the thorough investigation @geofft! I think at this point it's pretty clear that there's not a whole lot we can do about this (if it's a kernel bug of some form), beyond your sleeping suggestion, but unfortunately even that wouldn't be rock-solid.

I'm going to nominate this bug for closure at this point, however, as the prospects for solving it seem to be getting bleaker every day.

@pnkfelix
Copy link
Member

This concern of mine (from an earlier comment) has not yet been addressed, AFAICT:

In the notes from triage, I noted that this should happen for 1.0:

document the current behavior, with its flakiness, along with our plans for potential changes in the future.

Has this documentation actually happened? Should we open a separate issue?

I would argue against closing this until that's addressed.

@geofft
Copy link
Contributor

geofft commented May 16, 2015

The thinking earlier in the thread was that this could be resolved by switching to a more reliable OS X API, but as far as I can tell, no such API exists: this is a fundamental bug in how OS X signal delivery works, regardless of whether you're using fork or posix_spawn or whatever. I think it might have made sense to document something like "Currently, Rust's subprocess mechanism uses a less-reliable API, but we're looking at switching to a more reliable one" (or "but you should use posix_spawn manually if you care", or something). But I don't think it's necessary to document straight-up platform bugs.

We should probably report this to Apple, though. I could do that.

@comex
Copy link
Contributor

comex commented Jun 4, 2015

I'm curious why a bug that supposedly only occurs when a new process is killed near-instantly after starting up is showing up as an issue in practice (e.g. on the bots) - why are you starting a process if you don't want it to do anything? Anyway, if it's a child process, a waitpid-with-WNOHANG and kill loop would probably work around any cases where the process doesn't actually die.

@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-docs and removed I-nominated labels Jun 9, 2015
alexcrichton added a commit to alexcrichton/rust that referenced this issue Aug 4, 2015
The investigation into rust-lang#14232 discovered that it's possible that signal delivery
to a newly spawned process is racy on OSX. This test has been failing spuriously
on the OSX bots for some time now, so ignore it as we don't currently know a
solution and it looks like it may be out of our control.
@steveklabnik
Copy link
Member

Can someone help me out with what exactly needs to be documented here, and what our plans are? It's been a long time since the discussion happened, and I don't want something incorrect.

@alexcrichton
Copy link
Member Author

OK, I'm going to close this in favor of #27537. With @geofft's investigation it's becoming clearer that there's basically not much we can do here, so the "documentation bug" should just be documenting what we're doing. I don't really want to add a clause to Command::spawn that says something along the lines of "we think there's a race in the OSX kernel". Rather I'd rather add "here's the syscalls we're calling to spawn a process!".

As a result, this is basically not-a-bug and documentation is covered by #27537

bors added a commit that referenced this issue Aug 10, 2015
The investigation into #14232 discovered that it's possible that signal delivery
to a newly spawned process is racy on OSX. This test has been failing spuriously
on the OSX bots for some time now, so ignore it as we don't currently know a
solution and it looks like it may be out of our control.
lnicola pushed a commit to lnicola/rust that referenced this issue Mar 13, 2023
MIR episode 2

This PR adds:
1. `need-mut` and `unused-mut` diagnostics
2. `View mir` command which shows MIR for the body under cursor, useful for debugging
3. MIR lowering for or-patterns and for-loops
flip1995 pushed a commit to flip1995/rust that referenced this issue Feb 20, 2025
Cleans up some changes from
rust-lang/rust-clippy#11421

I searched for any `.stderr` files where the number of errors changed
and reverted + manually added the annotations for them

Also fixes `tests/ui/asm_syntax_not_x86.rs`

r? @flip1995

changelog: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-medium Medium priority T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

10 participants