std::sync::Once can block forever in forked process

Under certain conditions, a bug in [`std::sync::Once`](https://doc.rust-lang.org/std/sync/struct.Once.html) causes a call to [`call_once`](https://doc.rust-lang.org/std/sync/struct.Once.html#method.call_once) to block forever in a process that was forked from another Rust process.

## How `Once` works

Simplifying a bit (ignoring poisoning), the algorithm employed by `Once` works as follows: The `Once` can be in one of three states: `INCOMPLETE`, `COMPLETE`, and `RUNNING`.
* The `Once` starts off in the `INCOMPLETE` state.
* When a call to `call_once` begins, the `Once` might be in any of the three states:
  * If the `Once` is in the `INCOMPLETE` state, then it is transitioned to the `RUNNING` state, and the function begins executing.
  * If the `Once` is in the `RUNNING` state, then some other call to `call_once` is executing the function, so this call puts itself on a list of waiters and goes to sleep. It will be woken back up once the function is done executing in whatever thread is executing it.
  * If the `Once` is in the `COMPLETE` state, then the function has already been executed, so `call_once` returns immediately without doing anything.

Finally, when the function's execution completes, the thread doing the execution transitions the `Once` into the `COMPLETE` state, and wakes up any waiters that accumulated while it was executing the function.
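
The state machine above can be sketched as follows. This is a simplified model, not the standard library's implementation: `SimpleOnce` is a hypothetical name, poisoning is ignored, and waiting threads spin with `thread::yield_now` instead of parking on a waiter list.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

pub const INCOMPLETE: usize = 0;
pub const RUNNING: usize = 1;
pub const COMPLETE: usize = 2;

/// Hypothetical, simplified model of `Once` (no poisoning, no waiter list).
pub struct SimpleOnce {
    state: AtomicUsize,
}

impl SimpleOnce {
    pub const fn new() -> Self {
        SimpleOnce { state: AtomicUsize::new(INCOMPLETE) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        loop {
            match self.state.compare_exchange(
                INCOMPLETE,
                RUNNING,
                Ordering::Acquire,
                Ordering::Acquire,
            ) {
                // We moved INCOMPLETE -> RUNNING, so we run the function.
                Ok(_) => {
                    f();
                    self.state.store(COMPLETE, Ordering::Release);
                    return;
                }
                // Already done: return without running anything.
                Err(COMPLETE) => return,
                // RUNNING: another caller is executing the function.
                // The real `Once` enqueues a waiter and sleeps here.
                Err(_) => thread::yield_now(),
            }
        }
    }

    pub fn is_completed(&self) -> bool {
        self.state.load(Ordering::Acquire) == COMPLETE
    }
}
```

Calling `call_once` twice on the same `SimpleOnce` runs only the first closure; the second call observes `COMPLETE` and returns immediately.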

## The issue

This algorithm is broken when forking. In particular, if a `Once` is in the `RUNNING` state when the process forks, the copy in the child's memory space (which, by default, is a copy-on-write copy of the parent's) will also be in the `RUNNING` state. However, in the child process, calls to `call_once` can then block forever, for two reasons:
* If the call happens while the function is still being executed, the waiter object that is enqueued will not be visible to the executor, because enqueueing it only affects the child's memory space, not the parent's. When the executor (a thread in the parent process) finishes, it will wake up all of the waiters in the parent process, blissfully ignorant that a thread in the child process is also waiting.
* If the function execution finishes first, the change of the `Once`'s state from `RUNNING` to `COMPLETE` will not be reflected in the child's memory space. Thus, a future call to `call_once` will spuriously find the `Once` still in the `RUNNING` state even though it isn't really in that state anymore.
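
The second failure mode can be illustrated without calling `fork` at all. The sketch below is only a simulation under a stated assumption: it models `fork` as copying the state word into a separate variable, which is what the child's copy-on-write memory amounts to for this purpose. The function names are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub const INCOMPLETE: usize = 0;
pub const RUNNING: usize = 1;
pub const COMPLETE: usize = 2;

/// Models `fork` for a single state word: the child gets its own copy,
/// and later writes by the parent do not reach it.
fn simulated_fork(parent_state: &AtomicUsize) -> AtomicUsize {
    AtomicUsize::new(parent_state.load(Ordering::SeqCst))
}

/// Returns `(parent_state, child_state)` after the parent finishes
/// running the function.
pub fn states_after_parent_completes() -> (usize, usize) {
    let parent = AtomicUsize::new(INCOMPLETE);

    // A parent thread starts executing the function.
    parent.store(RUNNING, Ordering::SeqCst);

    // The process forks while the function is still running.
    let child = simulated_fork(&parent);

    // The parent finishes; only its own copy is updated.
    parent.store(COMPLETE, Ordering::SeqCst);

    (parent.load(Ordering::SeqCst), child.load(Ordering::SeqCst))
}
```

In this model the parent ends in `COMPLETE` while the child's copy stays in `RUNNING`, so any later `call_once` in the child would wait for a completion that never arrives.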

These two problems can be seen in action in two proofs of concept that I wrote: [This one](https://play.rust-lang.org/?gist=c368c63c4ec27e02c894f1795fea22ed&version=stable) demonstrates the first issue, while [this one](https://play.rust-lang.org/?gist=5baeb000cb8e8feae574fdeff9fdb98c&version=stable) demonstrates the second.

## A proposed fix

Joint credit for this proposal goes to @ezrosent.

The idea behind this fix is to record the process's PID when transitioning a `Once` from `INCOMPLETE` to `RUNNING`, and to have future accesses that find the `Once` in the `RUNNING` state verify that it wasn't transitioned by a parent process. Unfortunately, this doesn't quite work because PIDs can be re-used: if process A spawns process B, then process A quits, then process B spawns process C, it's possible for A and C to have the same PID.

Instead, we introduce the idea of an "MPID" - a monotonically-increasing PID-like counter that is maintained by the process (e.g. . We increment it every time a process forks, and use it in the `Once` objects to record which process transitioned an object from `INCOMPLETE` to `RUNNING`.
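
A minimal sketch of such a counter, assuming a process-global atomic and a hypothetical hook `after_fork_in_child` that something must call in the child right after `fork` (how to arrange that call is not addressed here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-global MPID. `fork` copies its value into the child,
/// which then bumps its own copy.
static MPID: AtomicU64 = AtomicU64::new(0);

/// Returns the MPID of the current process.
pub fn current_mpid() -> u64 {
    MPID.load(Ordering::Relaxed)
}

/// Hypothetical hook: must run in the child immediately after `fork`,
/// before any code that touches a `Once`.
pub fn after_fork_in_child() {
    MPID.fetch_add(1, Ordering::Relaxed);
}
```

Because the child inherits the parent's value and then adds one, its MPID is strictly greater than that of every ancestor.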

More concretely, here are the components of the proposed solution:
* There is a process-global MPID variable (could be either `usize` or `u64`) that is initialized to 0 and is incremented immediately after `fork`. Note that this does _not_ guarantee that no two processes anywhere in the tree of processes forked from a particular process have the same MPID. In fact, all processes forked by a given process will have the same MPID. However, it does guarantee that a process will not share an MPID with any of its ancestors, and that is all we need.
* The `Once` object is modified to have another `mpid` field that is initialized to 0.
* Each waiter object is modified to have another `mpid` field that is initialized to the MPID of the current process when the object is created.
* A modified algorithm for `call_once` looks roughly like this:
  * Loop:
    * Load the current state. If it is `COMPLETE`, return.
    * If the state is `INCOMPLETE`, then load `mpid` and:
      * If `mpid` is equal to the current MPID, then try to CAS the state from `INCOMPLETE` to `RUNNING`. If it fails, retry the entire loop. If it succeeds, you're responsible for running the function, so do the original algorithm.
      * If `mpid` is not equal to the current MPID, then try to CAS it from its current value to the current MPID. If this succeeds, go to the previous step (where `mpid` is equal to the current MPID), and if it fails, retry the entire loop.
    * If the state is `RUNNING`, then load `mpid` and:
      * If `mpid` is equal to the current MPID, then the thread that transitioned the `Once` into the `RUNNING` state is in the current process, so do the normal algorithm: wait for it to be done (recording the current MPID in the waiter object).
      * If `mpid` is not equal to the current MPID, then the thread that transitioned the `Once` into the `RUNNING` state is in an ancestor process. Thus, attempt to CAS `mpid` to the current MPID. If it fails, repeat the entire loop. If it succeeds, then it is your responsibility to run the function, so continue as if you had transitioned the `Once` into the `RUNNING` state, with one exception: when waking up waiters, you need to check that they are not waiters in an ancestor process; do this by checking the waiter object's `mpid` field, and only waking waiters with an `mpid` field equal to the current MPID.
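
The modified loop can be sketched as below. This is a sketch under several assumptions, not a proposed patch: `ForkAwareOnce`, `call_once_as`, and `fork_snapshot` are hypothetical names; the caller passes the current MPID explicitly instead of reading a process global, so the logic can be exercised without a real `fork`; and waiters spin with `thread::yield_now`, so the per-waiter `mpid` check is not modeled.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

pub const INCOMPLETE: usize = 0;
pub const RUNNING: usize = 1;
pub const COMPLETE: usize = 2;

/// Hypothetical `Once` variant that records which process owns a run.
pub struct ForkAwareOnce {
    state: AtomicUsize,
    mpid: AtomicU64,
}

impl ForkAwareOnce {
    pub const fn new() -> Self {
        ForkAwareOnce { state: AtomicUsize::new(INCOMPLETE), mpid: AtomicU64::new(0) }
    }

    /// Models the copy of this object that a forked child would see.
    pub fn fork_snapshot(&self) -> Self {
        ForkAwareOnce {
            state: AtomicUsize::new(self.state.load(Ordering::Acquire)),
            mpid: AtomicU64::new(self.mpid.load(Ordering::Acquire)),
        }
    }

    pub fn state(&self) -> usize {
        self.state.load(Ordering::Acquire)
    }

    /// `current` stands in for the process-global MPID.
    pub fn call_once_as<F: FnOnce()>(&self, current: u64, f: F) {
        loop {
            let state = self.state.load(Ordering::Acquire);
            if state == COMPLETE {
                return;
            }
            let owner = self.mpid.load(Ordering::Acquire);
            match (state, owner == current) {
                (INCOMPLETE, true) => {
                    // Normal path: try to become the executor.
                    if self
                        .state
                        .compare_exchange(INCOMPLETE, RUNNING, Ordering::AcqRel, Ordering::Acquire)
                        .is_err()
                    {
                        continue;
                    }
                }
                (INCOMPLETE, false) => {
                    // Stale MPID: try to claim it, then retry. On the next
                    // iteration the branch above attempts the state CAS.
                    let _ = self.mpid.compare_exchange(
                        owner,
                        current,
                        Ordering::AcqRel,
                        Ordering::Acquire,
                    );
                    continue;
                }
                (RUNNING, true) => {
                    // The executor is in this process: wait for it.
                    thread::yield_now();
                    continue;
                }
                _ => {
                    // RUNNING was set by an ancestor. Whoever wins this CAS
                    // takes over responsibility for running the function.
                    if self
                        .mpid
                        .compare_exchange(owner, current, Ordering::AcqRel, Ordering::Acquire)
                        .is_err()
                    {
                        continue;
                    }
                }
            }
            f();
            self.state.store(COMPLETE, Ordering::Release);
            return;
        }
    }
}
```

To exercise the fork case, take a `fork_snapshot` from inside the closure of a call made with MPID 0, then call `call_once_as` on the snapshot with MPID 1. The snapshot starts in `RUNNING` with a stale `mpid`, so the first caller takes over and runs its closure once.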

One thing to note: It is safe to CAS `mpid` and then separately transition into `RUNNING`, even though the value of `mpid` needs to reflect the MPID of the thread that transitioned into `RUNNING`. A thread that successfully transitions a `Once` into the `RUNNING` state will have previously verified that `mpid` is correct, and `mpid` will never change again after that (at least, not in this process), since the MPID of a process never changes.

### Open question
One open question is how to ensure that code is run just after `fork` (to increment the global MPID variable) and, critically, before any other code runs (especially code that uses `Once`). `pthread` provides the [`pthread_atfork`](http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html) function to register callbacks that run before and after `fork` calls, but obviously this doesn't address Windows, and I also don't know if there's a good way to ensure that the necessary call to `pthread_atfork` is made at process initialization time.
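
On Unix, the registration itself could look roughly like the sketch below. It declares `pthread_atfork` by hand to stay dependency-free (the `libc` crate exposes the same function), and `register_fork_hook`, `bump_mpid_in_child`, and the `MPID` static are hypothetical names. It does not answer the question of who calls `register_fork_hook` at process start, and it does not cover Windows.

```rust
use std::os::raw::c_int;
use std::sync::atomic::{AtomicU64, Ordering};

// POSIX: int pthread_atfork(void (*prepare)(void),
//                           void (*parent)(void),
//                           void (*child)(void));
// Note: the 2024 edition requires this block to be written `unsafe extern "C"`.
extern "C" {
    fn pthread_atfork(
        prepare: Option<unsafe extern "C" fn()>,
        parent: Option<unsafe extern "C" fn()>,
        child: Option<unsafe extern "C" fn()>,
    ) -> c_int;
}

/// Hypothetical process-global MPID.
static MPID: AtomicU64 = AtomicU64::new(0);

/// Runs in the child after `fork`; it only performs an atomic increment.
unsafe extern "C" fn bump_mpid_in_child() {
    MPID.fetch_add(1, Ordering::Relaxed);
}

/// Registers the child-side handler. Returns `true` on success.
pub fn register_fork_hook() -> bool {
    // Only the `child` handler is needed; `prepare` and `parent` stay unset.
    unsafe { pthread_atfork(None, None, Some(bump_mpid_in_child)) == 0 }
}

pub fn current_mpid() -> u64 {
    MPID.load(Ordering::Relaxed)
}
```

Handlers registered this way only run for forks that go through the C library's `fork`; a process created with a raw system call would bypass them.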

## Prior art

There's some prior art here. In particular, jemalloc [has acknowledged a similar issue](https://github.com/jemalloc/jemalloc/blob/a9f7732d45c22ca7d22bed6ff2eaeb702356884e/src/jemalloc.c#L3204), and has a partial fix for it.
