Add a Mutex, AtomicU64, and deal with lots of related fallout #11610
Conversation
@brson yep
I didn't see where
I see that you went with this approach to maximize native performance (which might be possible to improve in #11520, I haven't really tried), but is an implementation that is completely unfair when native and green threads are mixed really acceptable to ship as a core synchronization primitive? As written, a native task can be blocked forever on lock() if there are enough contending green tasks, which makes the Mutex unreliable in such a configuration. As far as I can tell, a set of N + 1 green tasks, where N is the number of concurrent task schedulers (i.e. typically the number of CPU cores), that take the mutex in a loop without sleeping/scheduling while it is unlocked (perhaps because they do all their sleeping while holding it) seems enough to ensure that native tasks never get the mutex.
If such a configuration is not really usable and native performance must be optimal, then maybe the best option is to ban mixing native and green tasks, either for a given mutex (create one implementation or the other depending on the type of the task that calls Mutex::new()) or for the whole process, which allows two separate implementations of all primitives. That way, one can use the OS primitives directly in the native case and, for green tasks, the algorithms the Linux kernel uses for its own sleepable primitives, providing optimal performance and eliminating the need to engineer primitives from scratch for the unique Rust use case of mixed native/green tasks.
[There's also the issue that using yields is bad (e.g. if one sends SIGSTOP or breaks a thread in a debugger at the wrong time, the others will spin forever), and neither glibc nor Linux use them in library code, but at least that doesn't affect the correctness of programs.]
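For concreteness, the workload shape being described is roughly the following. This is only a sketch using today's std::thread and std::sync::Mutex (which have no green tasks at all), so it illustrates the access pattern, not the starvation behavior of the mutex in this PR:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical illustration: N + 1 "greedy" workers that do all of their
// sleeping while holding the lock, plus one extra thread that also wants it.
// Under the scheme described above, if the greedy workers were green tasks and
// the extra thread were native, the native thread could wait forever.
fn main() {
    let n = 4; // stand-in for the number of task schedulers / CPU cores
    let m = Arc::new(Mutex::new(()));

    for _ in 0..=n {
        let m = Arc::clone(&m);
        thread::spawn(move || loop {
            let _guard = m.lock().unwrap();
            // All "sleeping" happens while the lock is held...
            thread::sleep(Duration::from_millis(1));
            // ...and the lock is re-taken immediately after it is dropped.
        });
    }

    let m = Arc::clone(&m);
    let waiter = thread::spawn(move || {
        // The thread whose fairness is in question.
        let _guard = m.lock().unwrap();
        println!("waiter finally got the lock");
    });
    waiter.join().unwrap();
}
```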
Is there a modification to this strategy that can prevent native thread starvation in the presence of green thread contention?
It might be possible to always unlock and relock the lock on unlock (at some performance cost, and perhaps losing the guarantee) and then add a check for the queue being empty to native tasks to prevent the opposite.
I think there is at least one race condition in the code: as the comment in the code says, one can ignore the case where the MPSC queue is inconsistent IF the pusher is guaranteed to see that held has been set to false. However, there is no fence that guarantees that the self.held.store(false, atomics::Relaxed) won't be moved after the q.pop(). It looks like there needs to be a SeqCst fence there and a SeqCst fence after the push in lock(), or alternatively either the MPSC queue or the held variable needs to be manipulated with SeqCst (acquire and release don't seem to be enough, since they aren't StoreLoad fences).
Also, there should be at least a compiler optimization barrier (an empty volatile asm!() with a memory clobber should work; not sure if LLVM has something better) after the write to held in lock(), since otherwise the compiler could in principle move that store arbitrarily far down after inlining lock(), resulting in green tasks spinning for a long time; likewise, there should be at least a compiler barrier before the held store in unlock() for the same reason. #[inline(never)] on both lock() and unlock() would also work, but with the negative side effect of always preventing inlining. A CPU fence could in principle also be needed to make the store visible promptly, but I think all real CPUs make stores visible as soon as possible, so it should not be needed.
I'm not sure whether that makes it free from race conditions or whether there are others.
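A minimal sketch of the unlock-side ordering being requested, written against today's std::sync::atomic API (the free function and the pop_waiter closure are hypothetical stand-ins for the PR's held flag and MPSC queue):

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering};

// Hypothetical unlock path showing the ordering concern above: the store
// clearing `held` must not be reordered after the queue pop, or a pusher that
// saw the queue in an inconsistent state may never observe `held == false`
// and will spin forever.
fn unlock(held: &AtomicBool, pop_waiter: impl Fn() -> Option<usize>) {
    // Clear the fast-path flag first...
    held.store(false, Ordering::Relaxed);
    // ...then force a full (StoreLoad) barrier so the store above is globally
    // visible before the loads inside the queue pop. Acquire/Release alone do
    // not provide a StoreLoad fence; a SeqCst fence (or a SeqCst store) does.
    fence(Ordering::SeqCst);
    if let Some(waiter) = pop_waiter() {
        // wake `waiter` here (elided)
        let _ = waiter;
    }
}
```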
I changed the code in #10520 (see https://github.com/bill-myers/rust/tree/mutex-lockless) to start cmpxchg loops with a relaxed load, which seems to be a bit faster on Haswell than starting with an optimistic cmpxchg. After that change to #10520, and after adding SeqCst to the held store in unlock in this pull (which means it will use xchg instead of mov), they both take around 2.5 seconds of real time to lock and unlock a mutex a million times from each of 8 native tasks on an HT 4-core mobile 2.2GHz Haswell (with Linux on bare hardware), while using unstable::Mutex directly takes 1.7 seconds. The original racy code in this pull, after removing the assert!() on held, results in 2 seconds, and completely removing both the held load and store results in 1.85 seconds (but both are incorrect), while #10520 without the optimization took around 2.85 seconds.
So it seems that this closes the gap on my machine, but Alex Crichton's results seemed to show a larger gap in the first place, so I'm not sure what it would look like there with those changes (was the benchmark on #10520 perhaps done with 64-bit atomics on x86-32?)
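For reference, the benchmark described above has roughly this shape (a sketch with today's std API, using std::sync::Mutex as a stand-in for the mutex actually being measured):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

// 8 threads, one million lock/unlock pairs each, wall-clock time at the end.
fn main() {
    let m = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    *m.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("8 threads x 1M lock/unlock: {:?}", start.elapsed());
}
```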
I've updated with what @brson and I talked about today.
Is this still faster than https://github.com/bill-myers/rust/tree/mutex-lockless on 64-bit Linux on x86 when using atomics::SeqCst? Comparing performance with Relaxed atomics makes little sense, since that means benchmarking an incorrect implementation, and the atomics are in fact the main source of the slowdown. Anyway, I think both https://github.com/bill-myers/rust/tree/mutex-lockless and this should be acceptable from a "doesn't break programs" perspective. With the key insight of alternating between green and native tasks, I think there is actually a third option, which would involve taking this pull request and removing the need to yield, in the following way:
Compared to this pull, it would remove the yields and reduce the size of the mutex structure (yes, the mutex would be touched after unlocking, but that's not a problem in Rust, as argued in #11520). I think this might perhaps be the best option.
BTW, regarding the TLS hit, having an "is_nonblocking_task" thread-local variable (which only needs to be written when a green task scheduler starts or stops) rather than having to fetch the task would speed it up, and perhaps make it possible to remove the is_blocking_locker variable (and eventually, one could ask the glibc maintainers to reserve a bit in the glibc thread structure for this purpose).
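A sketch of the suggested thread-local flag, using today's thread_local! macro rather than the runtime TLS of the time (all names here are hypothetical):

```rust
use std::cell::Cell;

// A green scheduler would set the flag when it starts running tasks on this
// OS thread and clear it when it stops; lock() then reads one thread-local
// bool instead of fetching the whole task pointer out of TLS.
thread_local! {
    static IS_NONBLOCKING_TASK: Cell<bool> = Cell::new(false);
}

fn scheduler_started() {
    IS_NONBLOCKING_TASK.with(|f| f.set(true));
}

fn scheduler_stopped() {
    IS_NONBLOCKING_TASK.with(|f| f.set(false));
}

fn can_block() -> bool {
    // Native threads may block on the OS mutex; green tasks must not.
    !IS_NONBLOCKING_TASK.with(|f| f.get())
}
```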
I do believe that the performance comparison with Relaxed is a fair comparison. The only reason these use SeqCst is that I don't personally feel I have a good enough understanding of atomics to relax the constraints. This causes a store to become an xchg instead of a plain mov.
As I've discussed before, relaxing the constraint of re-using the mutex in unlock is something I believe to be unacceptable. There is code which does this today in Rust (seen here), and you cannot have "just use an Arc" as a solution.
The store to held in unlock must be an xchg, since otherwise the loads in q.pop() can be moved before it. The fetch_and_sub and fetch_and_add must be lock xadd, since a plain xadd without the lock prefix is not even atomic, which Relaxed still requires. Loads are implemented as mov regardless of the atomic memory ordering. The stores to held in lock() can indeed be relaxed. So I think code generation with all SeqCst is the optimal one on x86 (although SeqCst might not be the optimal choice on other architectures), except for storing true in held, which can and should be relaxed.
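To make the claimed x86 lowering concrete, here is a small sketch using today's std::sync::atomic names; the exact instructions are the compiler's choice, the comments just reflect the typical x86-64 lowering described above:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

fn orderings(held: &AtomicBool, count: &AtomicUsize) -> bool {
    held.store(true, Ordering::Relaxed);   // plain `mov` -- fine for the store in lock()
    held.store(false, Ordering::SeqCst);   // `xchg` (or `mov` + `mfence`) -- needed in unlock()
    count.fetch_add(1, Ordering::Relaxed); // `lock xadd` -- the lock prefix is required
                                           // for atomicity even at Relaxed
    held.load(Ordering::SeqCst)            // plain `mov`; loads look the same at every ordering
}
```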
The x86 architecture does not do this; the compiler may do this.
This is only in the contended case, so speed is much less of a worry once you've reached this situation.
No, x86 CPUs with out-of-order execution (i.e. all the desktop/mobile ones) can move loads before stores (to a different memory location, of course) unless the store is an xchg or there is a locked op, xchg, or fence in between. So assuming q.pop() checks that the queue is empty with only reads, then storing false to held must be done with an xchg (or mov + mfence or similar). See https://en.wikipedia.org/wiki/Memory_ordering, "Stores reordered after loads", for instance.
@alexcrichton: There are still multiple levels of atomic ordering guarantees on x86 itself. You'll get different instructions from relaxed and sequential consistency. The compiler is still free to reorder the atomic instructions within a thread if it can figure out what's going on. IIRC, we mark all atomic operations as volatile unnecessarily so we're probably not getting code generation as good as C++ atomics. They should be split out into
I'm satisfied with this approach and want to merge it. Further improvements can be made in future pull requests. |
This also generalizes all atomic intrinsics over T so we'll be able to add u8 atomics if we really feel the need to (do we really want to?)
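For what it's worth, today's std does expose both the wider and the narrower widths mentioned here; a minimal usage sketch with the modern API (which postdates this PR):

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);
static FLAGS: AtomicU8 = AtomicU8::new(0);

fn record_event() -> u64 {
    FLAGS.fetch_or(0b0000_0001, Ordering::Relaxed); // mark "an event happened"
    COUNTER.fetch_add(1, Ordering::Relaxed) + 1     // return the new count
}
```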
With the recently added double-word CAS functionality on 32-bit ARM (exposed as a 64-bit atomic instruction in LLVM IR), LLVM lowers the code to function calls that emulate atomic instructions unless some extra target features are enabled. With the v7 feature enabled, proper 64-bit CAS instructions are used on 32-bit ARM. I've been told that v7 is what we should have been targeting for ARM anyway. This is overridable by providing some other non-empty feature string.
This allows for easier static initialization of a pthread mutex, although the Windows mutexes still sadly suffer. Note that this commit removes the clone() method from the mutex because it no longer makes sense for pthread mutexes. This also removes the Once type for now, but it'll be added back shortly.
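A minimal sketch of the kind of statically initializable pthread wrapper this enables, written against the present-day libc crate (an assumption; the in-tree type of the time differs in detail) and Unix-only:

```rust
use libc::{pthread_mutex_lock, pthread_mutex_t, pthread_mutex_unlock, PTHREAD_MUTEX_INITIALIZER};
use std::cell::UnsafeCell;

pub struct StaticNativeMutex {
    inner: UnsafeCell<pthread_mutex_t>,
}

// Safe to share because all access goes through the pthread calls below.
unsafe impl Sync for StaticNativeMutex {}

// No runtime initialization needed -- this is the whole point.
pub static LOCK: StaticNativeMutex = StaticNativeMutex {
    inner: UnsafeCell::new(PTHREAD_MUTEX_INITIALIZER),
};

impl StaticNativeMutex {
    pub unsafe fn lock(&self) {
        pthread_mutex_lock(self.inner.get());
    }
    pub unsafe fn unlock(&self) {
        pthread_mutex_unlock(self.inner.get());
    }
}
```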
There are no current consumers of this mutex, but more will be added soon.
This originally lived in std::unstable::mutex, but now it has a new home (and a more proper one).
Now that extra::sync primitives are built on a proper mutex instead of a pthreads one, there's no longer any use for this function.
The comments are in the commit itself, but the general idea is to have green threads relinquish control to native blockers and to have native blockers relinquish control to green threads. This should help keep things a little more fair instead of just always favoring green threads.
This is becoming our general policy until we have a better understanding of these orderings.
It turns out that Windows critical sections will succeed in EnterCriticalSection or TryEnterCriticalSection if the calling thread already owns the mutex, but this is obviously bad for us because there could be two contending green threads on the same OS thread. With some extra checks of `self.held` in try_lock(), I believe that we can circumvent this limitation with minimal impact to the mutex.
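A rough sketch of the kind of `held` check described, with hypothetical names (the real change is in the PR's mutex; this only shows the shape of the guard):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// On Windows, TryEnterCriticalSection reports success if the current OS thread
// already owns the critical section, so a second green task scheduled on that
// same OS thread could "acquire" a mutex that is logically held. Tracking
// logical ownership in `held` catches that case.
struct GreenAwareMutex {
    held: AtomicBool,
    // os_lock: the underlying critical section / pthread mutex (elided)
}

impl GreenAwareMutex {
    fn try_lock(&self) -> bool {
        // Claim the logical lock first; if another green task (possibly on
        // this same OS thread) already holds it, fail instead of re-entering
        // the OS critical section.
        if self
            .held
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return false;
        }
        // Then try the OS lock; on failure, roll `held` back.
        if !self.os_try_lock() {
            self.held.store(false, Ordering::SeqCst);
            return false;
        }
        true
    }

    fn os_try_lock(&self) -> bool {
        // Placeholder for TryEnterCriticalSection / pthread_mutex_trylock.
        true
    }
}
```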
I think the only simple way of solving this is to switch to locks that don't have special behavior when recursively locked. On Windows Vista and later, I think you can use SRW locks for that. On Windows XP, I think it should work to implement mutexes as an auto-reset event that starts signalled, using WaitForSingleObject to lock it and SetEvent to unlock it. Otherwise, you need to, at least:
BTW, I just noticed that locking a mutex on a native thread and unlocking it on another is undefined behavior according to POSIX, which means that Rust should refuse to compile for any unknown POSIX platform, and it should be checked that this does indeed work on all supported POSIX platforms (I think Linux should be fine as long as one uses PTHREAD_MUTEX_NORMAL, but haven't checked in detail).
@bill-myers: It's not a good idea to depend on undefined behavior based on glibc implementation details. They're willing to break any such incorrect code at any time, and lock elision may do exactly that as it matures and becomes the default.
@thestinger If your argument is accepted, then this Mutex implementation must be completely redesigned (the other approaches I proposed share this issue, although they may or may not more easily accommodate the changes). Lock elision is actually a really important point; we can of course request to disable it, but it would be nice to take advantage of it both for native and for green tasks. Maybe we really need to go back to the drawing board, make all tasks always drop the OS mutex on unlock, and redesign everything as needed to accommodate that?
Here's an example of their willingness to break anything relying on undefined behavior: https://bugzilla.redhat.com/show_bug.cgi?id=638477 |
glibc doesn't seem to have any documentation of pthread functions other than the POSIX specification.
So, the problem with always unlocking is that it would starve green tasks, at least in the case where the native OS mutex implements FIFO wakeups, since green tasks cannot block and thus cannot reserve a spot in the OS mutex's FIFO queue. The only solution is to make new native tasks queue in the MPSC queue as well once any green tasks are queued. This is what #11520 did, but I think the code in this pull request can also be changed to do that.
In something as critical as a mutex, I don't want to rely on behavior that POSIX leaves undefined. I didn't realize how much of a problem it is to unlock a mutex on a thread other than the one that originally locked it. At this point I think holding an OS mutex across the lock/unlock calls is unacceptable for green threads: green threads can migrate between OS threads, so there is no guarantee that a green thread will unlock on the same OS thread it locked on. For now, I'm going to close this because there is still work to hash out.
I have an idea of how to get around this. |
This is the follow up PR to #11520, #11462, and #11509. This takes the original approach for #11462 for the mutex (due to my comment).
I had originally added an AtomicU64 for usage in #11520's implementation, and I figured I may as well leave it implemented because it may turn out to be useful someday.