Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740

Closed
59 of 63 tasks
m-ou-se opened this issue Feb 7, 2022 · 120 comments
Closed
59 of 63 tasks

Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740

m-ou-se opened this issue Feb 7, 2022 · 120 comments
Assignees
Labels
A-atomic Area: Atomics, barriers, and sync primitives A-concurrency Area: Concurrency C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@m-ou-se
Copy link
Member

m-ou-se commented Feb 7, 2022

A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve std::sync::Mutex, std::sync::Condvar, and std::sync::RwLock.

On most platforms, these structures are currently wrappers around their pthread equivalent, such as pthread_mutex_t. These types are not movable, however, forcing us to wrap them in a Box, resulting in an allocation and indirection for our lock types. This also gets in the way of a const constructor for these types, which makes static locks more complicated than necessary.

@Amanieu presented his library, parking_lot, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, this manages the queues in user space in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.

There has been an attempt at merging parking_lot into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.

One of the problems with replacing std's lock implementations by parking_lot is that parking_lot allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem where you can't allocate memory without locking, but you can't lock without first allocating the hash table.

A possible solution is to use a static table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such as mmap on Unix. (See this thread for some discussion.)

In our meeting, we discussed what's natively available on various operating systems/platforms:

  • On Windows, we have already switched to Windows SRW locks, which do not require allocation as they can be moved. They are small (32-bit), efficient, and const constructible. Just for Windows, there does not seem much advantage to switching to parking_lot, although there might be some (small) performance difference.

  • On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.

  • On macOS and some other Unixes there doesn't seem to be anything that's similar futex syscall. (parking_lot uses pthread for its thread parking operations on those platforms.)

  • Some smaller platforms that are (partially) supported by the standard library have their own non-trivial implementation of the standard locks, which are hard to maintain and have not gotten much validation.

After some discussion, the consensus was to providing the locks as 'thinnest possible wrapper' around the native lock APIs as long as they are still small, efficient, and const constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm. And for other platforms without suitable native APIs, a parking_lot-based implementation using one of the possible workarounds for the allocation problem.

This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.

It also means that porting the standard library to a new platform will be easier, and maintainance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.

Update: We've decided to not directly use parking lot's one-byte (two bit) locks, but instead use the equivalent of its internal WordLock or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.

Update 2: To maintain priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex unix platforms, at least for now. Using #97647, we've made it possible for the locks to have const fn new.


Primary goals:

Possible later goals:

  • Add a Condvar::wait_rwlock to make Condvar usable with RwLocks?
  • Allow Sending MutexGuards to other threads?

To do:

@m-ou-se m-ou-se added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 7, 2022
@hkratz
Copy link
Contributor

hkratz commented Feb 7, 2022

Did you investigate using os_unfair_lock (Apple Developer Docs) on macOS? It can't be moved once it is in use either but it is essentially just a struct with a u32 in it, initialized to 0 before use. But maybe I am missing some requirements here. If it seems possible, investigation of it could be added as a subtask.

It is available since macOS 10.12 on x86-64, but could be used unconditionally on aarch64.

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

@nbdd0121
Copy link
Contributor

nbdd0121 commented Feb 7, 2022

I don't think we need to use hash tables like parking lot for targets without futex support. We could just store the queues inside Mutex. I think the only downside is size, but that'll still be better than the current boxed pthread mutex though.

@Amanieu
Copy link
Member

Amanieu commented Feb 7, 2022

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

Also it doesn't support Condvar, which is required for Rust locks.

@thomcc
Copy link
Member

thomcc commented Feb 8, 2022

Oh. This is a subject extremely near-and-dear to my heart -- I even started on a pre-RFC for atomic wait/wake ops, like the ones that C++ got recently. That said, much of my motivation was hoping it would find a way to unstick us here, and I this is clearly a higher priority.

For futex APIs, I wrote a blog post on them at one point. https://shift.click/blog/futex-like-apis I wrote about some of the options here for system APIs, which might be helpful as reference.

That said, I got asked about Darwin's __ulock_* API in a Zulip DM, since I wrote that post as well as hte ulock-sys crate.


For Darwin, it I think it's quite unlikely we could use __ulock_{wait,wake} for a futex API directly without risking... many problems. The fact that it may break in the future is actually one of the less concerning issues, in comparison to the fact that that use of private APIs can cause rejection from app stores, which would be a problem for Rust code that on the iOS/Mac App Stores (for clarity: I'm just going on a hunch1 that could happen for __ulock_*, I do not know for certain, but I think libstd needs to favor caution here).

That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.

That is: libc++'s implementation of std::atomic<u32>::wait, notify_one, and notify_all will ultimately bottom out to calls to __ulock_wait/__ulock_wake, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons:

  1. It's unclear to me how much we promise about what we link to on various platforms, but it's possible we cannot do this, and likely that we do not want to... I don't know if we make promises about what we need to link against for libstd, but suddenly dynamically linking against the C++ standard library on some platforms... feels like it would be a change.

  2. Not to mention -- even if we are allowed to do this, it's still a pain in the butt: The version of libc++ that provides this this is much newer than our platform guarantees, so we'd need to weak-import2 the dubiously-public3 symbol that this stuff under the hood (although I suppose this is better than requiring a C++ compiler to build Rust code for these platforms).

So, maybe? Given that the OS provides libc++, it seems like it should technically be fine (unless we specifically say we won't). At least, so long as we manage to do it in a way that doesn't versions older than we care about. But maybe I've missed something (or I haven't, but it's still several bridges too far).

(If not, oh well, someday an API shaped mostly like that is bound to end up as part of the libc someday, after it's copied to C from C++ in C9Z 😛)

(EDIT: Thinking about it more, it's fairly likely we should have some justification for this approach, since looking at the libc++ code, enough happens between what our all would be that it could defeat the point. And the effort spent here to make it work would be better spent on improving our fallback)


All that said, there are a few options on various platforms that could probably help implement the fallback thread parker, in case there are regressions (which plausibly there wouldn't be if we use parking_lot_core as-is, but maybe there would be if we need to abandon the growth behavior). In particular:

  • A few of the futexless unices (NetBSD, the Solaris-likes, perhaps others?) have a lwp_park and lwp_unpark functions, which is basically just an OS primitive for thread parking and unparking.

  • On Darwin (yes, yes, more Darwin), I think we could probably use libdispatch's dispatch_semaphore instead, as it's been the target of a lot of optimization -- probably more than the currently-active verions of pthread_mutex_t / pthread_cond_t -- Not to mention, you'd expect that a one-piece semaphore can be implemented more efficiently than a condvar/mutex pair, at least... in theory4. At least, unless the unified semaphore doesn't have to uphold extra guarantees (as it does for POSIX's sem_t, I believe).

  • Hermit (and perhaps only hermit but perhaps others) only has a semaphore, and we implement other primitives on top of them. I don't have a vested interest in hermit, but I would really like any excuse to remove https://github.com/rust-lang/rust/blob/master/library/std/src/sys/hermit/condvar.rs which look like it's implementing the same algorithm it cites, and seems like it could be buggy5.

That said, a lot of this probably should be in the a tail of followups, and only if it's worthwhile.

Footnotes

  1. I mean, if nothing else they start with __ and don't appear in a single public header (they only appear in the headers inside apple source repos). So, even though these APIs don't seem to be in the blacklists that prevents you from even attempting app store submission (which is not a guarantee), it seems completely out of the question to use in libstd (even if I do think they're cool, and I do).

  2. Calls to dlopen/dlsym tend to be disallowed in the same situations, since tooling can't tell if you're using it to use private APIs. (We already do this all over libstd though, so maybe it's not the issue it once was)

  3. Which isn't API-compatible, but presumably has to stay around for ABI reasons. I mean, it must, right? C++ standard libraries are famously resilient to ABI issues, after all 😅...

  4. That is, at worst the single-piece semaphore could just be implemented in terms of a private condvar and mutex pair, resulting in the same performance. That said, relying too heavily on this style of logic is a good way to get unpleasant surprises, so who knows.

  5. For example, the load-then-sub in notify_one is seems like it has a race that could lead to a deadlock, but there may be a reason this is impossible -- I've never spent that long thinking about it, just seems dubious, and like the kinda code that'd be real nice to replace with something that (if nothing else) will get more attention.

@yaahc yaahc added the A-concurrency Area: Concurrency label Feb 8, 2022
@bjorn3
Copy link
Member

bjorn3 commented Feb 9, 2022

On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.

Please make sure you implement a fair or eventually fair (this is what parking lot does and needs the hashmap for) lock to prevent starvation.

@nbdd0121
Copy link
Contributor

nbdd0121 commented Feb 9, 2022

I don't think pthread mutexes are fair so the new implementation shouldn't be required to be fair either.

@thomcc
Copy link
Member

thomcc commented Feb 9, 2022

SRWLOCK is also unfair too, pthread_mutex_t depends (POSIX doesn't require it, apple has it, I believe glibc doesn't)... So this would mean we need to a parking lot everywhere, even if the OS has otherwise good locks (like with Window's SRWLOCK).

More generally: There are a bunch of locking features that would be nice to have in an ideal world, but I think they start becoming contradictory, or only achievable easily on one platform or another -- for example, I believe the parking-lot style of lock won't respect OS priorities (unless there's a clever trick I don't know about or something). Trying to solve all of this probably has no solution, and I'm not sure that we're in the right position decide which things we want to promise yet1... assuming we even want to make these sorts of promises (and I'm not sure we do).

Footnotes

  1. I feel like the place for that is afterwards we ship it, if/when bugs come in from users and such. Or at least when we're further in the implementation.

@hkratz
Copy link
Contributor

hkratz commented Feb 9, 2022

Apple switched the default from PTHREAD_MUTEX_POLICY_FAIRSHARE_NP to the newly introduced PTHREAD_MUTEX_POLICY_FIRSTFIT_NP in macOS 10.14 (iOS and tvOS 12.0, watchOS 5.0), so they are not fair for recent macOS either (by default).

@nagisa
Copy link
Member

nagisa commented Feb 12, 2022

Wouldn't implementing lock algorithms on top of futex et al. effectively mean that both the parking_lot based implementations and these other algorithms receive less test/use coverage overall? Or is there some plan to share the gnarly implementation details between parking_lot based implementations and the futex/etc. ones?

@m-ou-se
Copy link
Member Author

m-ou-se commented Feb 16, 2022

@nagisa Yes. That's an issue with all platform-specific code, I suppose. But we do plan to also have the CI test the generic implementation (parking lot) on Linux as well, so it gets test coverage.

@m-ou-se
Copy link
Member Author

m-ou-se commented Feb 16, 2022

I'm now reading through the various popular lock implementation to see how those are implemented.

Starting with musl.

Mutexes

Musl's pthread_mutex_t is 40 bytes (on 64-bit), although for regular/normal mutexes, only two 32-bit integers are relevant/used: the lock futex and the waiters counter. (Pthread mutexes support various attributes and different types which are all runtime settings, and thus stored inside the mutex itself. Depending on the type, other fields in the mutex are used. In Rust, however, different types are tracked in the type system, not dynamically inside the mutex.)

Its implementation is relatively simple. The lock integer contains just two bits: one to indicate whether the mutex is locked, and one to indicate whether it's contended. The contended bit can only be set when the lock bit it set as well. Locking tries to compare-and-swap 0 to 'locked' (spinning 100 times if necessary), and otherwise blocks on a 'futex wait' on the lock. Before waiting on the futex, the high bit of the lock is set and the waiter count is incremented. After waking (whether spurious or not) the waiter count is decremented.

When unlocking, a 0 is unconditionally written to the lock, and when the high bit was set or the waiter count was nonzero, a one thread is awoken through the futex.

It's not entirely clear to me how useful the waiter count is. This mechanism would also work correctly with only the 'contended'-bit. The waiter count is only used by the spin loop while locking, to avoid spinning and sleep directly when there's other waiters, to avoid stealing the lock in front of others. This makes things a little bit more 'fair'. See commit f5fb20b.

Condition variables

Musl's pthread_cond_t is 48 bytes (on 64-bit), although for 'normal' condition variables, only two pointers and a 32-bit lock/futex is used.

The lock protects a doubly linked list pointed at by the two pointers. This doubly linked list is the waiting queue, where each node is stored on the stack of a waiting thread. Each node contains a state (waiting, signalled, or leaving) and a futex that the thread is waiting on. pthread_cond_signal wakes up the first thread on the list that wasn't signalled yet. pthread_cond_broadcast marks all nodes as 'signalled', but only wakes up one thread. That thread, after woken up (and locking the mutex), will requeue the next signalled thread onto the mutex. That way, there's no need to store a pointer to the mutex in the condition variable, as pthread_cond_broadcast doesn't have to know about the mutex. It only requeues one thread. When that thread is woken up, that one will requeue the next one, and so on.

Reader-writer locks

Musl's pthread_rwlock_t is 56 bytes (on 64-bit), although it only uses two 32-bit integers in the normal case: one lock futex, and one waiters count. Its implementation is very similar to pthread_mutex_t, except 31 bits of the lock futex are used to count the number of active reader locks. The maximum value is used to indicate a writer lock. The other bit is used as the 'contended' bit, and is used for both readers and writers. It does not prioritize writers, allowing more readers to lock the lock even if there's writers waiting. The reason for that is explained in commit 50304f2: posix requires reader locks to be recursive. Proritizing writers could make a second reader lock on the same thread block.

Read-unlocking a contended lock will wake up one thread. Write-unlocking a contended lock will wake up all threads. The latter could cause another writer thread and many reader threads to wake up at the same time, potentially resulting in the writer thread winning the race and all the reader threads going back to sleep again.

The waiter counter has the same purpose as in pthread_mutex_t above.


In general it's interesting to see how many implementation details are relevant to Posix, but not to us here in Rust. For example, commit c68de0b mentions how Posix allows destroying and unmapping a mutex even if another thread hasn't returned from pthread_mutex_unlock yet. In Rust, this is impossible due to borrow rules. And on top of things like that, Posix supports more types of locks (recursive, process-shared, priority-inheriting, ..) all within the same type, which we do not need to support (within the same type).


Tl;dr: Musl uses trivial rwlocks and mutexes, except it separately counts waiters to avoid spinning when there's other waiters. Rwlocks don't prioritize writers because Posix wants reentrant reader locks. Condition variables are not trivial; they keep a waiter queue as linked list in user space, and requeue threads one by one.

@m-ou-se
Copy link
Member Author

m-ou-se commented Feb 16, 2022

Next: glibc's mutexes. (Glibc's condition variables and rwlocks are a bit more complicated. I'll come back to those later.)

Mutexes

Glibc's mutexes are 40 bytes and support a lot of different properties. The 'normal' kind, (PTHREAD_MUTEX_TIMED_NP) however is very simple, and only uses three 32-bit fields: the lock futex, the owner thread id, and the number of users. The last two fields are useful for debugging and dead-lock/misuse detection, but not used for the basic functionality of the mutex. (The n_users field represents the current number of threads that hold the lock (0 or 1), except it doesn't get decremented when pthread_cond_wait temporarily unlocks the mutex, which means that those waiting threads are also included in this number.)

The normal mutex, unlike the many other variants, is very simple. It has three states. 0 for unlocked, 1 for locked without waiters, and 2 for locked with waiters. Locking tries to compare-and-swap 0 to 1, or otherwise swaps in a 2 and futex-wait's for things to change. Unlocking swaps a 0 in place, and futex-wake's one waiting thread if it was set to 2. Unlike musl's implementation, it does not try to spin at any point, and directly switches to sleeping if the compare-and-swap fails.

It's basically identical to the one I wrote in my experiment: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109

@kprotty
Copy link

kprotty commented Feb 17, 2022

  • Has anyone looked into the condition variable darwin libpthread? Works with a futex-based api (ulock), uses 64bits of space, doesn't do extra wakes or require requeue, and is simpler than glibc's condition variable implementation.

  • Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+

  • For non-futex/windows platforms, a thread parker + parking_lot's WordLock could be used for the Mutex. There's also windows' CONDITION_VARIABLE and SRWLOCK source to look into and some inspection suggests the latter is very similar to WordLock (which allows it to be a usize in space) but with more complications.

@thomcc
Copy link
Member

thomcc commented Feb 17, 2022

Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+

Lack of timed wait is a problem, and libstd supports back to 10.7+ (and I'm not sure what's required to reconsider this).

@kprotty
Copy link

kprotty commented Feb 17, 2022

Ah, fair point for the 10.7+ requirement. Does the new Mutex impl need timed wait? SRWLock and the current API don't support it.

@thomcc
Copy link
Member

thomcc commented Feb 17, 2022

It's supported via condvar: https://doc.rust-lang.org/std/sync/struct.Condvar.html

@kprotty
Copy link

kprotty commented Feb 17, 2022

Condvar's wait_timeout functions look like they always return the MutexGuard (excluding when poisoned). This implies that the mutex itself must be locked on return, hence no need for Mutex to internally have a timed_lock(). darwin's and musl's condvar implementations help confirm.

@m-ou-se
Copy link
Member Author

m-ou-se commented Feb 23, 2022

Continuing with glibc's condition variables:

Condition variables

Glibc's condition variables are 48 bytes, and all of them are used in normal operation. Unlike musl's implementation, this one does not keep a queue in user space, as the same implementation is used for process-shared condition variables, which cannot refer to any memory addresses that might be different in different processes.

It keeps a 64-bit sequence number that's incremented by every new waiter, and a second 64-bit number that indicates which of those have been signalled. Within the structure, there are futexes and some other fields for exactly two 'groups' of waiters: G1 and G2. Waiters add themselves to G2, and signals wake up threads in G1. When G1 is empty, signalling first swaps the two groups. This is a workaround for bug 13165, which is about time-traveling signals that end up waking future waiters rather than current waiters.

Two bits of the condition variable represent its internal lock (a trivial three state mutex), which protects the state of the condition variable.

There is some additional complexity about destroying a condition variable while there are still waiters. However, this is irrelevant for our Rust implementation, since we don't allow dropping things that are still in use.

This implementation raises the following questions for a Rust implementation:

  • Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in condvar.wait(), although that seems extremely unlikely.
  • Does a trivial (single futex) condition variable implementation suffer from the same 'time traveling signal' bug? If so, do we consider that a bug as well in Rust?

@m-ou-se
Copy link
Member Author

m-ou-se commented Feb 24, 2022

More on glibc's condition variables:

Their old design before the bugfix for the time-traveling signals is documented here in pseudocode: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/DESIGN-condvar.txt;h=4845251c7586c9667cd33c38af3701910d65eaa8;hb=c0ff3befa9861171498dd29666d32899e9d8145b

It keeps a (64-bit) counter for the number of signals that have been sent, and another for the number of threads that have been woken up. The difference between those two is effectively the number of signals ('tokens') that can still be consumed. If a waiting thread wakes up from the futex-wait operation and there are no tokens left (i.e. the counters are equal), it goes back to sleep. That's how a future thread can consume a token and prevent another waiting thread from waiting up from pthread_cond_wait.

@tschuett
Copy link

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 1, 2022

Finishing up glibc:

Reader-writer locks

Glibc's rwlock is documented here. It is 32 bytes, of which 22 are used during normal operation. It contains two futexes, a readers counter, a writers counter, the thread ID of the current writer, the selected mode, and a flag indicating whether it is process-shared or not. It supports three different modes: a) prefer readers (default), b) prefer writers with recursive reader locks, c) prefer writers without recursive reader locks. Only in the last mode do waiting writers block additional reader locks.

Part of its complexity is due to supporting these different modes within the same structure. Another part of the complexity is due to posix allowing destruction in cases where the Rust borrow checker would not allow it. Both of these are not relevant to our implementation in Rust.

The reader counter also contains three state bits. Using it directly as a futex is avoided, in part because that number can rapidly change as readers come and go. Therefore, one of the state bits (whether the lock is in reader or writer mode) is duplicated in a separate futex.

It also includes a mechanism through which writers can hand over a writer lock to each other without unlocking, which is only used if the lock is set to a mode that prefers writers.

It has nine states, each of which is either part of the 'read phase' (1 through 4a) or the 'write phase' (5 through 8). Even the idle state is split in both a 'read phase' (1) and a 'write phase' (5) idle state. It's entirely symmetrical, except for one more state (4a) in the read phase which is used in the non-recursive writer-preferred mode to block new waiters even when the lock is already read-locked. The transition from one of the idle states to the locked state in the other phase (2 or 7) goes through a intermediary state (3 or 6), during which a thread trying to lock the lock in the 'current phase' (read/write) can prevent the phase change and steal the lock. (This intermediary state is not used for the 'try lock' functions.) The locked states (2 and 7) are split into two states where one indicates there are waiters of the opposite phase (4/4a and 8).

Below is a state transition diagram of this lock. The dashed arrows are only used if writers are preferred. The thin arrows are only used by the 'try lock' functions.

glibc-rwlock-states drawio

The thread ID of the current writer that's stored inside the lock is only used for debugging and deadlock detection when the holder of a write lock tries to acquire a read lock.

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 1, 2022

Next up: Wine

We don't have the source code to Microsoft's SRW locks and condition variables, but we do have the source code to Wine, which implements these as well. However, while they have to be the same size as Microsoft's implementation, they are implemented very differently and you'll see very different bit patterns in them if you inspect them while they are in use.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. They are the size of a pointer, but Wine's implementation only uses 32 bits, regardless of pointer size. It is split into two 16-bit counters: The number of active reader locks ('owners'), and the number of waiting writers. The reader/owner counter is set to u16::MAX when it is write-locked.

The lock and unlock functions are very simple. Write-locking increments the waiter counter (using a 16-bit operation), and then uses a 32-bit CAS operation on both counters at once to acquire the lock if the owner counter is zero by setting it to u16::MAX and decrementing the waiter counter again. If the owner counter wasn't zero, it'll do a futex1 wait to wait until it is.

Read-locking uses a CAS operation to increment the number of owners only if there's no waiting writers and the lock isn't write-locked. So, waiting writers block new readers. If the conditions aren't met, it futex-wait's until they are.

Interestingly, two different addresses are used within the SRW lock as two different futexes. Readers wait on the address of the SRW lock itself and use the full 32-bit value, while writers wait only on the second half which contains the 16-bit owner counter. Unlocking the last reader lock will wake a single thread from the writer queue. Unlocking a writer lock will wake a single writer if there was any waiting (indicated by the waiting counter), or wake all readers if there are no waiting writers.

Mixing 16-bit and 32-bit atomic operations on overlapping memory is not common, and the Intel manual (vol. 3A) seems to suggest this is not a great idea:

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths.

Also, surprisingly, no overflow checks are done. If either of the counter overflows, unexpected things happen.

Condition variables

Wine's condition variables are trivial. Just like SRW locks, they are the size of a pointer, but Wine's implementation only uses 32-bits. It is simply a 32-bit counter that gets incremented on every notification. It does not keep track of which lock it is used together with, nor does it make use of requeueing. It does not keep track of any information to prevent a futex-wake operation when there's no thread waiting, etc. It is identical to Condvar1 in my experiment.

Footnotes

  1. They are not called 'futexes' on Windows, but the operations are identical: WaitOnAddress, WakeByAddressSingle, and WakeByAddressAll.

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 10, 2022

Next up: Windows

Now this one is a bit tricky because the source code isn't available, but I made some guesses and have good reason to believe they are accurate.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. SRW locks are one pointer in size, of which the last four bits encode four flags. The other bits either represent a (60-bit or 28-bit) counter, or they represent a (64-bit or 32-bit) pointer with the last four bits missing. That pointer points to something 16-byte aligned, to make sure the last bits are always 0.

When there's no contention, the number of active read locks is stored in the counter. For a write lock, the counter is 0. The least significant bit is used to indicate whether the lock is locked (regardless of read or write mode). An unlocked lock is entirely zero.

When there are threads waiting, that is indicated by one of the four flag bits, and the counter is no longer used and instead used as a pointer to a doubly linked list of waiting threads. The nodes of that list are stored on the stack of the waiting threads. Each node also contains a pointer to the last node in the queue, so you can quickly get there from the first pointer. (The lock only stores a pointer to the first node.)

The last node contains the number of active read locks, since the pointer took the place of that counter within the lock itself.

Each node contains the thread id of the waiting thread, and an additional flag that indicates whether the thread is waiting to acquire a read or write lock.

The thread id is used with the undocumented NtWaitForAlertByThreadId and NtAlertThreadByThreadId APIs. These seem to be effectively identical to Rust's thread::park() and Thread::unpark() functions. So, SRW locks are not futex-based, but thread parker based.

I'm assuming some of the flag bits in the lock are used as a mutex to protect modifications to the queue.

A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation, but it doesn't exactly prefer writers in general: it ~roughly attempts to handle the requests in order. SRW locks are not fair.

Like most lock implementations, it first spins a while before going to sleep, to avoid expensive syscalls when the lock is only locked for a very short time. Spinning seems configurable through some global. It also seems to support an alternative to spinning based on the monitorx and mwaitx instructions.

Condition variables

Condition variables on Windows are also a single pointer. They work very similar to SRW locks, storing a pointer with flag bits in the least significant bits, with the pointer pointing to a linked list over waiting threads. The nodes even have the same structure as the nodes used by SRW locks.

WakeAllConditionVariable iterates over the entire list and wakes all threads. It does not requeue waiters to the relevant lock.


Here's a diagram showing an SRW lock in different states:

srw-locks drawio

A condition variable looks nearly identical, except for the locked states and the four flag bits.


tl;dr: SRW locks and condition variables on Windows are thread parker based and keep their waiting queue as a linked list in user space through nodes on the stack of the waiting threads. Thread parking is natively supported on Windows through an undocumented API. It prefers writers to avoid writer starvation. They spin before sleeping, and can use the special monitorx and mwaitx instructions as an optimization.

jamesbornholt added a commit to jamesbornholt/shuttle that referenced this issue May 19, 2023
These constructors have been const since Rust 1.63
(rust-lang/rust#93740). It's pretty easy for
us to make them const too, which allows code that relies on them being
const to correctly compile with Shuttle.

The one exception is that HashMap::new isn't const, and our Condvar
implementation uses a HashMap to track waiters. I took the easy way out
and just used a vector as an association list instead -- we shouldn't
expect large numbers of waiters on the same condvar, so this shouldn't
be too inefficient.
jamesbornholt added a commit to jamesbornholt/shuttle that referenced this issue May 19, 2023
These constructors have been const since Rust 1.63
(rust-lang/rust#93740). It's pretty easy for
us to make them const too, which allows code that relies on them being
const to correctly compile with Shuttle.

The one exception is that HashMap::new isn't const, and our Condvar
implementation uses a HashMap to track waiters. I took the easy way out
and just used a vector as an association list instead -- we shouldn't
expect large numbers of waiters on the same condvar, so this shouldn't
be too inefficient.
jorajeev pushed a commit to awslabs/shuttle that referenced this issue May 22, 2023
These constructors have been const since Rust 1.63
(rust-lang/rust#93740). It's pretty easy for
us to make them const too, which allows code that relies on them being
const to correctly compile with Shuttle.

The one exception is that HashMap::new isn't const, and our Condvar
implementation uses a HashMap to track waiters. I took the easy way out
and just used a vector as an association list instead -- we shouldn't
expect large numbers of waiters on the same condvar, so this shouldn't
be too inefficient.
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 12, 2024
Replace pthread `RwLock` with custom implementation

This is one of the last items in rust-lang#93740. I'm doing `RwLock` first because it is more self-contained and has less tradeoffs to make. The motivation is explained in the documentation, but in short: the pthread rwlock is slow and buggy and `std` can do much better. I considered implementing a parking lot, as was discussed in the tracking issue, but settled for the queue-based version because writing self-balancing binary trees is not fun in Rust...

This is a rather complex change, so I have added quite a bit of documentation to help explain it. Please point out any part that could be explained better.

~~The read performance is really good, I'm getting 4x the throughput of the pthread version and about the same performance as usync/parking_lot on an Apple M1 Max in the usync benchmark suite, but the write performance still falls way behind what usync and parking_lot achieve. I tried using a separate queue lock like what usync uses, but that didn't help. I'll try to investigate further in the future, but I wanted to get some eyes on this first.~~ [Resolved](rust-lang#110211 (comment))

r? `@m-ou-se`
CC `@kprotty`
@joboet
Copy link
Member

joboet commented Feb 12, 2024

futex support for the Apple platforms is in beta 🥳!

So, with a bit of trickery, we could soon switch over Condvar and RwLock to the Linux versions. Since the minimum supported version for Rust is macOS 10.12, which has __ulock_wait and __ulock_wait, we can use that private API as a fallback on older platforms and remain resistant to its removal on new ones.

For Mutex, os_unfair_lock is the best option as it supports priority inheritance.

jorajeev pushed a commit to awslabs/shuttle that referenced this issue Feb 29, 2024
These constructors have been const since Rust 1.63
(rust-lang/rust#93740). It's pretty easy for
us to make them const too, which allows code that relies on them being
const to correctly compile with Shuttle.

The one exception is that HashMap::new isn't const, and our Condvar
implementation uses a HashMap to track waiters. I took the easy way out
and just used a vector as an association list instead -- we shouldn't
expect large numbers of waiters on the same condvar, so this shouldn't
be too inefficient.
@schreter
Copy link

@m-ou-se I'm working on macOS and I was quite surprised that my malfunction tests panicked on OOM in very surprising places, like locking a mutex. After some investigation I found this issue and your comment:

All of pthread_mutex_t, pthread_cond_t and pthread_rwlock_t in this implementation contain fields which need a higher alignment than the alignment of the struct itself. This issue is worked around by adding padding after the fields, and dynamically calculating the offset of the field. This means that this now seems to be the first implementation I've seen where the types are truly not movable. Moving one of these to a differently aligned address would shift around those 'overaligned' fields.

I looked into the implementation of the pthread_mutex_lock for Darwin (libpthread of macOS 14.3 to be exact, here, as well as some older versions) and indeed there is this brain-dead re-alignment to 8 bytes into a potentially reserved space. The pthread_mutex_t is only 4 bytes aligned.

But this doesn't mean the mutex is unmovable. I failed to find any reason why it should not be movable, provided we make sure it's 8B aligned in Rust, since there is no self-referentiality. The address of the Mutex is only relevant to the OS while locked, but Rust wouldn't allow moving the Mutex anyway while it's locked.

I.e., adding #[repr(align(8))] at an appropriate place will solve this issue and we can use in-place Mutex on macOS as well. My C++ experiments with moving pthread_mutex_t on 8B boundary also didn't show any adverse effects.

I didn't check the other sync primitives in detail, but it seems to be the same in green.

So what's preventing us from simply declaring pthread_mutex_t as 8B aligned and using it in-place? That would solve those lazy memory allocations which are quite annoying :-).

Thanks in advance.

@RalfJung
Copy link
Member

RalfJung commented Mar 12, 2024

The address of the Mutex is only relevant to the OS while locked, but Rust wouldn't allow moving the Mutex anyway while it's locked.

Rust does allow this:

let m: Mutex = ...;
mem::forget(m.lock().unwrap()); // never unlocks
let move_locked_mutex = Box::new(m);

@joboet
Copy link
Member

joboet commented Mar 12, 2024

I have an almost-finished patch sitting around that switches macOS to os_unfair_lock and the futex-based implementations, which should solve this while preserving the priority-inheritance properties. The only thing missing are miri shims.

Edit: uploaded it as draft, see #122408

@schreter
Copy link

Rust does allow this:

let m: Mutex = ...;
mem::forget(m.lock().unwrap()); // never unlocks
let move_locked_mutex = Box::new(m);

That's true. In case a second lock() call is done on such a moved mutex, all bets are off anyway. It may take the ownership (e.g., if the lock is recursive), it may block forever, whatever.

But this has nothing to do with the problem at hand on macOS - the same potential issue is present on all platforms.

I have an almost-finished patch sitting around that switches macOS to os_unfair_lock and the futex-based implementations

Great, thanks! I hope it makes it in the next Rust release :-).

If I understand it correctly, the implementation relies on presence of the respective symbols and if they are not found, then a fallback is used? Anyway, I'm now in progress of installing 14.4 :-).

@bjorn3
Copy link
Member

bjorn3 commented Mar 12, 2024

That's true. In case a second lock() call is done on such a moved mutex, all bets are off anyway.

We need it to reliably deadlock (or abort the program) for soundness. Mutex::lock will never recursively acquire the lock. Otherwise you would get aliasing mutable references. It must also not invoke UB like dereferencing a dangling pointer. All these things could happen on macOS if we didn't heap allocate the actual lock, but are guaranteed to not happen on other platforms where we use our own mutex implementation.

@RalfJung
Copy link
Member

RalfJung commented Mar 13, 2024

It may take the ownership (e.g., if the lock is recursive)

That would be unsound, and a critical bug. Lucky enough we are ensuring that this is not the case.
My code sample (extended with another call to lock(), either in the same thread or in another) is entirely safe Rust so it must not violate safety. It can deadlock or panic, but nothing else. It follows that when using POSIX locks, we must heap-allocate them.

This is leaving aside the question whether macOS guarantees that the lock can be moved at all, even when it is not currently held. AFAIK POSIX does not provide such a guarantee. You made a claim based on the current macOS POSIX lock implementation, but it's not really relevant what the implementation does, it's only relevant what Apple guarantees will be true for the current and all future implementations.

@schreter
Copy link

It may take the ownership (e.g., if the lock is recursive)

That would be unsound, and a critical bug. Lucky enough we are ensuring that this is not the case.

Thanks. Even after more than 2 years doing systems programming in Rust I'm still discovering where I made thinking mistakes at the beginning (too much C++ influence over the past 30 years :-).

This is leaving aside the question whether macOS guarantees that the lock can be moved at all, even when it is not currently held.
AFAIK POSIX does not provide such a guarantee.

No, I'm afraid, there is no such guarantee. As mentioned, I investigated the coding behind pthread_mutex_lock in Apple's sources, which didn't change (much) in the past 20 years or so. Aside from the strange handling of alignment, the mutex is movable, when unlocked. Also the case with forgetting the guard can be handled by a flag in Rust code to panic appropriately. But...

it's only relevant what Apple guarantees will be true for the current and all future implementations.

...this is absolutely correct and we can't rely on it.

It follows that when using POSIX locks, we must heap-allocate them.

For std::sync::Mutex, I agree, yes. However, one could offer some form of UnsafeMutex or so which delegates the guarantee not to move it to the user. That's probably not the job of std, though :-).

Anyway, let's close the discussion here - I definitely understand the need to make the library sound. OTOH, random panics/aborts at unexpected locations (mutex lock) due to OOM situations are not optimal (especially since undocumented). Therefore, I'm looking forward to the change by @joboet switching macOS Mutex implementation to the new futex-like syscall.

@daira
Copy link
Contributor

daira commented Sep 27, 2024

Of course we need soundness guarantees. But in general, there's a way to safely use platform implementations that don't guarantee a property you need, but that actually have it in practice:

  • If you can detect that the platform is using a known implementation that has the property (e.g. by checking the combination of the kernel version and the userspace locking library version), then use that implementation.
  • If you can't, use a less efficient fallback.

In parallel, you also ask the platform vendor to make the guarantee — either for all platform versions, or in a way that is detectable for future platform versions. If they make it for all supported platform versions, then you can remove the fallback.

This assumes that there is some way to implement the fallback. It is also subject to temporary performance regressions for new platform versions until they've been marked as okay. But that's what asking the platform vendor to guarantee the property is for. It can be a reasonable strategy if you think that the vendor is likely to guarantee the property but just aren't sure how long that will take.

@RalfJung
Copy link
Member

RalfJung commented Sep 27, 2024

This is a closed issue, there's no point in attempting a discussion here. If you have a concrete issue you think should be fixed, or a concrete change you want to propose, please file a new issue. "Make macOS Mutex/RwLock/Condvar not need heap allocations" does seem like a goal worth tracking.

@RalfJung
Copy link
Member

Here we go: #131005

@rust-lang rust-lang locked as resolved and limited conversation to collaborators Sep 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-atomic Area: Atomics, barriers, and sync primitives A-concurrency Area: Concurrency C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests