Alternative lockless Mutex implementation (2nd try) #11520
Conversation
Couldn't we provide something like …
Maybe as a macro; anyway, it doesn't help with the user incorrectly calling it recursively unless you ban all recursive locking (which is limiting, since it prevents composition). That said, I'm not sure whether we should detect recursion, or even have the option of allowing recursion, and it's mostly independent of the mutex implementation algorithm.
This looks promising, thanks for implementing this! I'll take a look at this tomorrow.
In general, I'd prefer a cleaner history, so could you replace the original mutex commit with your mutex commit? Dealing with the fallout will probably be similar.
@@ -0,0 +1,111 @@
/* Copyright (c) 2010-2011 Dmitry Vyukov. All rights reserved.
This license is the wrong license if you've re-implemented it.
I'm certainly no lock-free expert; this queue looks correct to me, but I don't consider myself qualified to officially vouch for it. Do you know of any external implementations of this flavor of queue that I could look at as well? Any literature/papers would also be a good reference point.
Well, it has been implemented by drastically modifying the previous version, so whether it's a derived work or not could be debatable.
The conditions say that the copyright notice must not be removed if it is a derived work, so the simplest course of action is to leave it as is, and add an extra copyright line if desired.
If this is implemented from memory, I'm much more uncomfortable with it than with using the previous version. Why did you re-implement the queue? Was it to remove the `Thread::yield_now()` in the inconsistent case?
// have or are about to take the OS mutex using a blocking lock call.
//
// It is now as fair as the OS mutex allows, even when mixing green and native tasks,
// since native threads will queue like green tasks, if any green task is queued.
I'd like to see a lot more explanation here. There's still no mention of the general protocol, or of the general idea behind blocking native tasks on the queue. This also doesn't justify why it's ok to block native tasks on the queue.
Otherwise either native tasks will always have priority over green tasks, or green tasks will always have priority over native tasks, depending on whether unlock prefers to unlock or to dequeue when `lockers > 1 && queue_size != 0`.
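A minimal sketch of that unlock decision, with all names hypothetical (the PR's actual state machine packs both counters into a single atomic word and updates them together):

```rust
// Hypothetical names for illustration; not the PR's code.
enum UnlockAction {
    ReleaseOsMutex, // let a native locker blocked on the OS mutex win
    WakeQueuedTask, // hand the lock to the head of the FIFO queue
}

/// `lockers` counts tasks holding or contending for the lock;
/// `queue_size` counts tasks parked in the internal queue.
fn on_unlock(lockers: usize, queue_size: usize) -> Option<UnlockAction> {
    if lockers <= 1 {
        None // no contention, nothing to hand off
    } else if queue_size != 0 {
        // Someone queued first, so wake in FIFO order; combined with
        // making native lockers queue whenever the queue is non-empty,
        // neither class can systematically jump ahead of the other.
        Some(UnlockAction::WakeQueuedTask)
    } else {
        Some(UnlockAction::ReleaseOsMutex)
    }
}
```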
First of all, `&mut` guarantees that it is not aliased AT ALL; that's the whole point of it! If aliased `&mut` references shared between tasks are allowed in general, then memory safety and freedom from data races are both broken. It's also incorrect that each thread will have only one `&mut` to the Mutex. Say you have a MutexArc implemented using Mutex and make a clone of it. Now you lock the first one, which returns the guard object, which must ultimately contain a LockGuard and thus an `&mut` to the Mutex. Then you recursively attempt to lock the clone: at that point the recursive lock operation is using an `&mut` that is the same as the one in the LockGuard of the first lock. This breaks Rust semantics.
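To make the scenario concrete, here is a minimal sketch in today's Rust, with `std::sync::{Arc, Mutex}` standing in for the era's MutexArc (the types are stand-ins, not the PR's code):

```rust
use std::sync::{Arc, Mutex};

// Both handles point at one Mutex, so if `lock` took `&mut self`, the
// guard below plus a second lock attempt through the clone would mean
// two live `&mut` references to the same Mutex at the same time --
// exactly what `&mut` is supposed to rule out.
fn main() {
    let first = Arc::new(Mutex::new(0u32));
    let second = Arc::clone(&first); // a second path to the same Mutex

    let guard = first.lock().unwrap();
    // A recursive `second.lock()` here would deadlock at runtime, but
    // it is only expressible at all because `lock` takes `&self`.
    drop(guard);
    drop(second);
}
```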
The Mutex is safe to use (except for destroy, which is unsafe and maybe shouldn't be there in the first place). Say you have tasks that should each make three beeps on the computer speaker at a distinctive frequency, by calling a hypothetical `speaker::beep(frequency: uint)` three times. You don't want the beeps to interleave, so you put a Mutex in an Arc and use that to make sure they don't. This is a perfectly valid and memory-safe use of Mutex that doesn't need unsafe code.
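A minimal sketch of that usage in today's Rust, with std's `Arc`/`Mutex` standing in for this PR's types and a stub for the hypothetical `beep`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stub for the hypothetical speaker::beep(frequency) named above.
fn beep(frequency: u32) {
    println!("beep at {} Hz", frequency);
}

fn main() {
    let speaker_lock = Arc::new(Mutex::new(()));
    let handles: Vec<_> = (0..3u32)
        .map(|i| {
            let lock = Arc::clone(&speaker_lock);
            thread::spawn(move || {
                // Hold the lock across all three beeps so another
                // task's beeps cannot interleave with ours.
                let _guard = lock.lock().unwrap();
                for _ in 0..3 {
                    beep(440 * (i + 1));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```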
No, because if you are locking the Mutex as part of a MutexArc, for instance, then the MutexArc is clonable, so you can do the recursive locking by locking the other clone. And if you have a "naked" Mutex to which no other paths can be created, then you can't have multiple tasks accessing it simultaneously, which makes it pointless to use one.
Cell and Atomic (let's assume the latter exists) are both mechanisms for mutating aliased memory, and the only difference is that Cell either must not be allowed, or will result in undefined values, if the aliases are shared between tasks. Cell is thus strictly more restrictive in its pre-conditions than Atomic, so it makes no sense for Cell to require a self pointer that provides fewer guarantees than Atomic's. And of course, it breaks the non-aliasing property of `&mut` as already argued. That said, the fix to the mutability of self is mostly separate from the rest, but it conflicts with the rest of the code in a non-trivial way (because lock as implemented no longer borrow-checks with `&mut self`), so I think introducing Mutex with the correct `&self` pointers is simpler than doing it in two commits.
Correct. That is not the only point of it. For atomics we are using it to indicate "this is being mutated".
Correct. That is why it's unsafe to create multiple …
The point that you're missing is that it's unsafe to clone it.
So does sharing the memory of the mutex itself, but that's what must be done.
Correct. It must be unsafely shared. This is not a first-class primitive for everyone everywhere to use. This is a primitive to build other concurrency primitives. For example, a …
Again, cloning a MutexArc is unsafe in terms of the usage of the mutex. If you want to remove …
I think having multiple `&mut` aliases is not merely unsafe, but needs to be defined as resulting in undefined behavior if they are actually used; thus unsafe code must not do so, just like unsafe code must not transmute random integers or zero to `&T` or `~T` and access those pointers, and so on. The alternative, where multiple `&mut` aliases are not undefined behavior, means that Rust programs will be slower, because some optimizations cannot be applied (specifically, telling LLVM via TBAA or another mechanism that there cannot be any aliases, so that memory accesses can be moved with no restrictions and loads can be freely merged or duplicated). Anyway, it's pretty easy to drop the self-mutability changes if desired.
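For a concrete instance of the optimization argument, consider this small illustrative sketch: under the no-alias guarantee for `&mut`, the compiler may fold the second load of `*x` into the first, which is sound only if `x` and `y` can never overlap.

```rust
// If x and y were allowed to alias, reusing the first load of *x
// would give the wrong answer after the `*y = 2` store.
fn sum_twice(x: &mut i32, y: &mut i32) -> i32 {
    *y = 1;
    let a = *x;
    *y = 2;
    let b = *x; // may be optimized to reuse `a`, since x cannot alias y
    a + b
}
```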
Unfortunately, I'm not sure if I'll have time to do further work on this myself. I believe the current code in the pull request should be an acceptable design and mergeable, except for potential bugs, cosmetic issues, and the fact that things like the `&mut` change and the Mutex/Cond split could perhaps be split into separate commits (at the cost of more "churn" in the series as things are written and then changed by subsequent commits).

We might want to switch to 64-bit atomics on 32-bit and perhaps 128-bit atomics on 64-bit, but that can be done later, since the task limit won't be hit anyway without non-default stack sizes so small they make it impossible to call C functions.

Another thing to do is to replace critical sections and events with SRW locks and condition variables on Windows Vista+: both of those structures are pointer-sized and statically initializable, just like the atomic uints in the current code, so it's trivial to switch at runtime between the current code on Windows XP and reinterpreting those fields as SRW locks and condition variables on Windows Vista and later.

Finally, an rwlock is probably needed as well, and it should be possible to implement one by extending the algorithm in the code to use a (queue_size, readers, writers) tuple and putting the read mode in the queued nodes (64/128-bit atomics will let the triple fit). Native rwlocks don't always support a truly fair FIFO mode (at least the one in Linux glibc NPTL lets either all readers or all writers get ahead), so it might be a good idea to let the user specify what they prefer, or whether they want the rwlock to be FIFO, and avoid blocking native tasks on the native rwlock at all if the user selects FIFO.
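As one illustration of that packing idea, here is a hedged sketch of a (queue_size, readers, writers) triple in a single 64-bit atomic; the field names, widths, and fast-path policy are assumptions for illustration, not what the rwlock would necessarily do:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative bit layout: three 21-bit fields in one 64-bit word.
const WRITERS_SHIFT: u32 = 0;
const READERS_SHIFT: u32 = 21;
const QUEUE_SHIFT: u32 = 42;
const FIELD_MASK: u64 = (1 << 21) - 1;

struct RwState(AtomicU64);

impl RwState {
    fn unpack(v: u64) -> (u64, u64, u64) {
        (
            (v >> QUEUE_SHIFT) & FIELD_MASK,   // queue_size
            (v >> READERS_SHIFT) & FIELD_MASK, // readers
            (v >> WRITERS_SHIFT) & FIELD_MASK, // writers
        )
    }

    // Fast path for a read lock: succeed only if there are no writers
    // and nothing queued (preserving FIFO); otherwise the caller queues.
    fn try_read(&self) -> bool {
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let (queue, _readers, writers) = Self::unpack(cur);
            if writers != 0 || queue != 0 {
                return false;
            }
            let next = cur + (1 << READERS_SHIFT); // readers += 1
            match self.0.compare_exchange_weak(
                cur, next, Ordering::Acquire, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual, // retry with observed value
            }
        }
    }
}
```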
Would you like me to take it from here?
Yes, if you'd like, thanks. I opened issue #11583 for mutability in atomics (an issue rather than a pull request, since the code is not complete: it doesn't fix the actual intrinsics).
Thanks again for all your work!
Closing for now; I'll reopen once I've gotten through all the rebasings.
I got around to benchmarking this code, and here are the results I got. The caveat here is that the green numbers aren't too significant: we still need to seriously tune the green schedulers for workloads like this, because the numbers are all over the place. The numbers provided here are all run with … In this table, all figures are runtimes, so lower is better. All tests were run in an Arch Linux VM with 4 cores.
My conclusion from this is that doing pretty much any atomic operation other than … With this in mind, I'm going to abandon this implementation and go back to the original version. Once the green scheduler is tuned, we can revisit to see what the impact is for green threads in finer detail.
@alexcrichton Thanks for the measurements. That's a little disappointing, since I liked the elegance of this solution.
Added an optimization to the cmpxchg loops and benchmarks, see https://github.com/bill-myers/rust/tree/mutex-lockless for the updated code. |
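The exact change is in the linked branch, but a common shape for this kind of cmpxchg-loop optimization is to reuse the value a failed compare-exchange already returns instead of issuing a fresh load; a generic sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// On CAS failure, the Err value is the atomic's current contents, so
// the loop can retry without touching memory again.
fn increment(counter: &AtomicUsize) -> usize {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            cur,
            cur + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(prev) => return prev,
            Err(actual) => cur = actual,
        }
    }
}
```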
This is yet another attempt to create a satisfying implementation of Mutex (first in #11462, second in #11509), created since I believe I have managed to devise a working lockless algorithm.
I think the algorithm is correct, but please check; if it is, then this has none of the drawbacks of the previous pulls.
This code is based on Alex Crichton's code, but replaces the "held" `AtomicBool` variable with a more sophisticated `MutexState` struct consisting of an atomic (lockers, queue_size) pair.
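A minimal sketch of the packed-pair idea, assuming a half-word bit split and these names for illustration (the actual layout may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Low half of one atomic word counts lockers, high half counts queued
// tasks, so both can be read and updated with a single CAS.
const HALF: u32 = usize::BITS / 2;
const LOW_MASK: usize = (1 << HALF) - 1;

struct MutexState(AtomicUsize);

impl MutexState {
    const fn new() -> Self {
        MutexState(AtomicUsize::new(0))
    }

    // Uncontended fast path: lockers goes 0 -> 1 with nothing queued.
    // On failure the caller takes the slow path (queue or OS mutex).
    fn try_lock_fast(&self) -> bool {
        self.0
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn lockers(v: usize) -> usize {
        v & LOW_MASK
    }

    fn queue_size(v: usize) -> usize {
        v >> HALF
    }
}
```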
Also, we replace the intrusive MPSC queue implementation with one using list reversal, to remove the need to yield there and the need to store a stub node in the Mutex.
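The reversal trick itself is standard and easy to sketch: producers push onto an atomic singly-linked LIFO, and the single consumer detaches the entire list with one swap, then reverses it to recover FIFO order, so no stub node is needed and the consumer never observes the transient inconsistency that would force it to yield. A simplified, self-contained sketch (the PR's real queue is intrusive and coordinated with `MutexState`; this version allocates nodes for clarity):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

struct Queue<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> Queue<T> {
    fn new() -> Self {
        Queue { head: AtomicPtr::new(ptr::null_mut()) }
    }

    // Many producers may call this concurrently (Treiber-style push).
    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut cur = self.head.load(Ordering::Relaxed);
        loop {
            unsafe { (*node).next = cur };
            match self.head.compare_exchange_weak(
                cur, node, Ordering::Release, Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }

    // Single consumer: detach everything, then reverse into FIFO order.
    fn pop_all_fifo(&self) -> Vec<T> {
        let mut cur = self.head.swap(ptr::null_mut(), Ordering::Acquire);
        let mut items = Vec::new();
        while !cur.is_null() {
            let node = unsafe { Box::from_raw(cur) };
            cur = node.next;
            items.push(node.value);
        }
        items.reverse(); // LIFO -> FIFO
        items
    }
}
```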
Features:
Non-features:
Non-issues:
The only known issue is that as implemented now this only works with up to 2^16 tasks on 32-bit and up to 2^32 tasks on 64-bit, since it needs to store two variables with a task count in a single atomic variable.
This shouldn't be a problem for two reasons:
Currently Rust uses 2MB stacks by default, so programs that don't specify the stack size manually already cannot trip this limitation.
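For concreteness: at 2 MB per stack, reaching the 2^16-task limit on 32-bit would require 2^16 × 2 MB = 128 GB of stack address space, far beyond the 4 GB a 32-bit process can address; likewise, 2^32 tasks on 64-bit would require 8 PB, far beyond the 256 TB of a typical 48-bit virtual address space.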