
Rewrite sync::one::Once.doit #13351

Closed. Wants to merge 2 commits.

Conversation

@lilyball (Contributor) commented Apr 6, 2014

The old `Once.doit()` was overly strict and used read-modify-writes when
it didn't have to. The new algorithm needs just an acquire load in
the already-initialized case, and the first-call and contested cases use
less strict memory orderings than the old algorithm's SeqCst.

On my machine (3.4GHz Intel Core i7) I was seeing the old algorithm take
13ns (in the already-initialized case). The new one takes only 3ns, and
that drops to 0.5ns with the `#[inline]` annotation. I am unsure how to
properly measure the contested case, because of the very nature of this
object.

This algorithm definitely needs to be carefully reviewed by someone with
more experience using atomics.

        return
    // The general algorithm here is we keep a state value that indicates whether
    // we've finished the work. This lets us avoid the expensive atomic
    // read-modify-writes and mutex if there's no need. If we hvaen't finished
Contributor:

typo in "haven't"

lilyball (Author):

Looks like there was also a typo in already-existing comments just above ("should't" instead of "shouldn't").

@alexcrichton (Member)

Looking over this, I'm not entirely convinced how favorable this is over #13349. This patch seems to have more optimized orderings for the slow path (when initialization is performed), but the same performance on the fast path (bailing out early).

This has a lot of atomic orderings which may be correct (I'm still getting a grasp on them), but I'd rather err on the side of SeqCst wherever possible. I agree that the current "fast path" being an xadd and an xchg (atomic add and atomic store) is a little excessive.

I think that #13349 is along the right path in terms of fast path optimizations, and I'm not sure how much the slow path optimizations will matter in the long run. I do agree that the fast/slow function split will provide a bit more perf, but see my above comments for why I think it may not be necessary quite yet.

Does that sound ok to you?

@lilyball (Contributor, Author) commented Apr 6, 2014

@alexcrichton I think that, even with changing all the atomics to SeqCst, this PR is still preferable to #13349 because it's a little easier to actually understand the algorithm. I had to read the existing implementation very carefully to understand exactly what it was doing (what with swapping one count over to the other). Whereas this PR makes it more straightforward, just keeping a tri-state flag that is separate from the count.

As for the atomic orderings, I'm reasonably confident that what I have here is correct. Sequential consistency is not necessary for this algorithm, especially because the mutex involved already provides synchronization guarantees. The main guarantee we need to enforce is that the closure happens-before anything after the call to doit(), hence the acquires (and that the closure happens-before the updating of the state to the final value, hence the release).
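
To make those pairings concrete, here is a minimal sketch of the tri-state shape described above, written in today's `std::sync::atomic` notation rather than the pre-1.0 `atomics` module. The state names and the plain `std::sync::Mutex` slow path are illustrative assumptions, not the PR's exact code; in particular, the PR also tracks a lock count so its runtime mutex can be deterministically destroyed, which is omitted here.

    use std::sync::Mutex;
    use std::sync::atomic::{AtomicUsize, Ordering};

    const INCOMPLETE: usize = 0; // work not started
    const RUNNING: usize = 1;    // some thread is inside the closure
    const COMPLETE: usize = 2;   // work finished; fast path from now on

    struct Once {
        state: AtomicUsize,
        lock: Mutex<()>,
    }

    impl Once {
        fn new() -> Once {
            Once { state: AtomicUsize::new(INCOMPLETE), lock: Mutex::new(()) }
        }

        #[inline]
        fn doit(&self, f: impl FnOnce()) {
            // Fast path: a single acquire load. It pairs with the release
            // store below, so everything `f` wrote happens-before our return.
            if self.state.load(Ordering::Acquire) == COMPLETE {
                return;
            }
            self.doit_slow(f);
        }

        #[inline(never)]
        fn doit_slow(&self, f: impl FnOnce()) {
            let _guard = self.lock.lock().unwrap();
            // Re-check under the mutex: only the first arrival runs `f`. A
            // relaxed load suffices because the mutex already synchronizes us
            // with whichever thread ran the closure.
            if self.state.load(Ordering::Relaxed) != COMPLETE {
                self.state.store(RUNNING, Ordering::Relaxed); // mirrors the PR's tri-state
                f();
                // Release: publishes `f`'s effects to future acquire loads.
                self.state.store(COMPLETE, Ordering::Release);
            }
        }
    }

The fast path rests entirely on the acquire/release pair on `state`; the slow path can use relaxed operations because the mutex supplies the ordering there, which is the guarantee referred to above.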

@bill-myers (Contributor)

EDIT: oops, on second read, the atomics no longer seem correct.

In particular, I think `if self.lock_cnt.fetch_add(-1, atomics::Relaxed) == 1 {` needs to be `Release`, because otherwise it can be moved before `drop(guard)`, and then another task can free the mutex before we unlock it.

Other than this, they seem correct.

I think there are two further optimizations possible.

First, it should be possible to save a word of memory by combining state and lock_cnt.

Second, making the fast-path check `>= 0` instead of `!= 2` might save an instruction on RISC architectures that have a zero register but no compare-with-immediate instruction.

This leads to a design where the high bit is set when we should take the fast path, the second-highest bit represents whether the work has been started, the third-highest whether the lock has been freed, and the remaining bits hold the lock count.

Hence, this code (UNTESTED!):

    // remove lock_cnt from the struct

    #[inline]
    pub fn doit(&self, f: ||) {
        // Implementation-wise, this would seem like a fairly trivial primitive.
        // The stickler part is where our mutexes currently require an
        // allocation, and usage of a `Once` shouldn't leak this allocation.
        //
        // This means that there must be a deterministic destroyer of the mutex
        // contained within (because it's not needed after the initialization
        // has run).
        //
        // The general algorithm here is we keep a state value that indicates whether
        // we've finished the work. This lets us avoid the expensive atomic
        // read-modify-writes and mutex if there's no need. If we haven't finished
        // the work, then we use a second value to keep track of how many outstanding
        // threads are trying to use the mutex. When the last thread releases the
        // mutex it drops the lock count to a "very negative" value to indicate to
        // other threads that the mutex is gone.

        let state = self.state.load(atomics::Acquire);
        if state >= 0 {
            self.doit_slow(f)
        }
    }

    #[inline(never)]
    fn doit_slow(&self, f: ||) {
        // If the count is negative, then someone else finished the job,
        // otherwise we run the job and record how many people will try to grab
        // this lock
        let HIGH_BIT = (uint::MAX >> 1) + 1;
        let DONE_BIT = HIGH_BIT as int;
        let STARTED_BIT = (HIGH_BIT >> 1) as int;
        let LOCK_DONE_BIT = (HIGH_BIT >> 2) as int;
        let LOCK_MASK = (LOCK_DONE_BIT - 1) as int;

        if (self.state.fetch_add(1, atomics::Acquire) & LOCK_DONE_BIT) != 0 {
            // Make sure we never overflow.
            self.state.store(DONE_BIT | STARTED_BIT | LOCK_DONE_BIT, atomics::Relaxed);
            return
        }

        let guard = self.mutex.lock();
        if (self.state.fetch_or(STARTED_BIT, atomics::Relaxed) & STARTED_BIT) == 0 {
            // we're the first one here
            f();
            self.state.fetch_or(DONE_BIT, atomics::Release);
        }
        drop(guard);

        // Last one out cleans up after everyone else, no leaks!
        // here STARTED_BIT and DONE_BIT are always set
        if self.state.fetch_add(-1, atomics::Release) == (DONE_BIT | STARTED_BIT | 1) {
            // we just decremented it to 0, now make sure we can drop it to int::MIN.
            // If this fails, someone else just waltzed in and took the mutex
            if self.state.compare_and_swap(DONE_BIT | STARTED_BIT, DONE_BIT | STARTED_BIT | LOCK_DONE_BIT, atomics::AcqRel) == (DONE_BIT | STARTED_BIT) {
                // good, we really were the last ones out
                unsafe { self.mutex.destroy() }
            }
        }
    }

    }
    drop(guard);

    // Last one out cleans up after everyone else, no leaks!
    - if self.lock_cnt.fetch_add(-1, atomics::SeqCst) == 1 {
    + if self.lock_cnt.fetch_add(-1, atomics::Relaxed) == 1 {
          unsafe { self.mutex.destroy() }
bill-myers (Contributor):

I think this needs to be `Release`, not `Relaxed`, because otherwise it can be moved before `drop(guard)`, and then another task can destroy the mutex before we unlock it.

lilyball (Author):

`drop(guard)` is dropping a mutex. I assumed that was at least an `AcqRel` operation. Is that untrue?

lilyball (Author):

POSIX says that pthread_mutex_unlock() shall synchronize memory with respect to other threads. I'm assuming this basically means it's a SeqCst operation.

lilyball (Author):

After considering this some more, I think I will go ahead and make this Release. I believe that it's not strictly necessary, but it's a very small price to pay for peace of mind.
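
To make the hazard in this exchange concrete, here is a self-contained sketch of the last-one-out pattern in today's `std::sync` notation. Everything here (names, counts, the `println!`) is illustrative; std's `Mutex` needs no manual destroy, unlike the 2014 runtime mutex the PR manages.

    use std::sync::atomic::{AtomicIsize, Ordering};
    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        let mutex = Arc::new(Mutex::new(0u32));
        let users = Arc::new(AtomicIsize::new(4)); // outstanding users of the mutex

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let mutex = Arc::clone(&mutex);
                let users = Arc::clone(&users);
                thread::spawn(move || {
                    {
                        let mut data = mutex.lock().unwrap();
                        *data += 1;
                    } // guard dropped here: the unlock is (at least) a release

                    // Release keeps our own unlock ordered before this
                    // decrement, which is the reordering worried about above.
                    // (Before actually destroying a mutex, the PR additionally
                    // uses an AcqRel compare-and-swap, so the last thread also
                    // sees everyone else's unlocks.)
                    if users.fetch_add(-1, Ordering::Release) == 1 {
                        // We were the last user out.
                        println!("last user done");
                    }
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
    }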

@lilyball (Contributor, Author) commented Apr 6, 2014

@bill-myers Interesting. Thanks for looking at this. I haven't carefully read your implementation yet, but I'm not convinced combining `state` and `lock_cnt` is a good idea. It saves a word of memory, but it means new callers of `doit()` will take longer to hit the fast path than they should. Normally a caller of `doit()` should hit the fast path as soon as the work is done, but with the two variables combined, it doesn't hit the fast path until the mutex has been dropped. This could theoretically lead to a case where successive threads always enter `doit()` before the previous thread has exited the mutex, leaving the slow path in use forever (or at least long after it should be).

Or, as a more trivial example, a single thread calling `o.doit(f); o.doit(f);` should reasonably expect the second call to `doit()` to hit the fast path, because it knows the work has completed. That's true with my implementation; with yours, a second thread hitting the slow path can cause this first thread to hit the slow path on its second call.

This may be fixable by moving the definition of `DONE_BIT` up and testing for it in `doit()` instead of testing `>= 0`, though I haven't yet satisfied myself of the correctness of that. The downside is that, since you're already talking about trying to save an instruction on RISC, a bit test in the fast path is presumably going to be slower than my current `!= 2`. And `Once` values should not be common enough that saving a single word makes any difference (especially because they can only be created as statics).

@lilyball (Contributor, Author) commented Apr 6, 2014

@bill-myers If you truly wanted to save memory, we could drop `Once` down to a single int with no mutex, and just spin when waiting. This is actually how libdispatch's `dispatch_once()` function works. But I'd rather not change the semantics of `doit()` in such a fashion.
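
For illustration, a minimal sketch of that single-word, spin-waiting variant (again in today's `std::sync::atomic` notation; the state names are made up, and `dispatch_once()`'s real implementation differs in its details):

    use std::hint;
    use std::sync::atomic::{AtomicUsize, Ordering};

    const INCOMPLETE: usize = 0;
    const RUNNING: usize = 1;
    const COMPLETE: usize = 2;

    // The whole Once is a single word; contended callers spin instead of
    // blocking on a mutex, which is the semantic change being declined here.
    fn spin_once(state: &AtomicUsize, f: impl FnOnce()) {
        // Fast path: the same acquire load as the mutex-based version.
        if state.load(Ordering::Acquire) == COMPLETE {
            return;
        }
        match state.compare_exchange(INCOMPLETE, RUNNING, Ordering::Acquire, Ordering::Acquire) {
            Ok(_) => {
                f();
                // Release store pairs with the acquire loads everywhere else.
                state.store(COMPLETE, Ordering::Release);
            }
            Err(_) => {
                // Lost the race: burn CPU until the winner publishes COMPLETE.
                while state.load(Ordering::Acquire) != COMPLETE {
                    hint::spin_loop();
                }
            }
        }
    }

This is the memory-saving extreme: contended callers burn CPU instead of blocking, which is exactly why the PR keeps the mutex.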

@lilyball (Contributor, Author) commented Apr 6, 2014

@alexcrichton Looking at `sync::mutex`, it appears that the `Mutex` type has been fixed to handle green/native correctly, despite the comment in `sync::one` claiming it doesn't. Is that correct? (If so, I should update the comment.)

@lilyball (Contributor, Author) commented Apr 6, 2014

I've updated the Relaxed to Release as @bill-myers suggested. I don't believe this is actually necessary, because pthread_mutex_unlock() guarantees that it synchronizes memory (though it doesn't define the precise synchronization, I have to assume that means it's basically SeqCst), and similarly mutex.green_unlock() uses a SeqCst operation. However, for peace of mind, and for protection in case green_unlock() ever decides that it only needs to be a Release, I decided to go ahead and make the change.

lilyball added a commit: sync::one was claiming a) that it was unsafe, and b) that it would block the calling OS thread because Mutex didn't know about green threads. Neither of these claims is true anymore.
@bill-myers (Contributor)

  1. No, my implementation is equivalent to yours, except for the different data layout, so it will hit the fast path in the same cases (unless I screwed it up).
     Note that `state >= 0` is equivalent to `(state & DONE_BIT) == 0` (since `DONE_BIT` is the sign bit), and thus equivalent to `state != 2` in your implementation.
     I guess it should be documented. Also, I didn't change the comments, but indeed several are now wrong.
  2. No, a portable application that spins is generally broken. The idea is to save memory without changing the semantics in any way.
  3. Yes, Mutex handles green/native correctly (or at least it tries to).
  4. In general, mutex lock is Acquire and mutex unlock is Release (since otherwise the mutex would be equivalent to a no-op), and I made that comment assuming the Rust mutex does not provide additional guarantees.

Regarding memory consumption, now that I think of it, we could do even better by putting the Mutex into a global hash table keyed by the address of the Once (where the hash table is either lockless or protected by a global mutex), so that you can have a single-bit Once.

This has significant tradeoffs though (e.g. it allocates memory), and I think it would be best done in a separate pull request if at all and perhaps exposed as a separate struct.

@lilyball (Contributor, Author) commented Apr 6, 2014

@bill-myers

  1. Ah hah, that escaped my notice. Although I still haven't spent the time to convince myself that your implementation is indeed correct (as it's more complicated to reason about).
  2. The semantics of a spinlock (re-using the state as the spinlock instead of keeping a separate value) vs a mutex are the same; it only affects the scheduling of threads. If the once behavior is used for initializing some memory, a spinlock is likely to behave better on a multicore system. If it's used for, say, I/O, then of course a mutex is better. And of course on a single-core system a mutex is better. In any case, I wasn't actually advocating we switch to a spinlock; I was just suggesting it as a way to save more memory, since you seem to think that's important.
  3. I updated the doc comments in another commit.
  4. That's certainly what it needs to provide. My belief is it typically provides at least AcqRel, but as I said before, I'm OK with using Release instead of Relaxed in case the mutex does in fact only provide Release (and I already made the change).

You can't have a single-bit once. There are two bits of data here: whether we've started, and whether we've finished. So you'd need a two-bit once. And I suspect this approach would actually be worse from a memory perspective, because hash tables have their own overhead and programs are typically not going to have very many Once values. I expect the overhead from the hash table to be larger than the memory saved by taking the mutex out of the Once.

In any case, assuming your implementation is correct (and again, I haven't taken the time to satisfy myself that it is), I'm personally in favor of not going that far. Saving a single word of memory per Once at the cost of an implementation that is significantly harder to read does not seem like a good tradeoff to me. Even the most complex program is likely to have no more than a few dozen Onces.

@alexcrichton (Member)

Let's please not travel too far down the premature optimization path. @kballard, you said that a primary reason for this patch was simplifying the code, and conserving one word of space I believe does not simplify this at all. While a fun exercise, I would like to keep this code as readable as possible.

Additionally, I still do not understand why it is necessary to avoid SeqCst on the slow path (invoking the initialization, locking the mutex, etc.). I see this as an optimization which is not necessary (seeing how it's the slow path) and difficult to reason about. If you choose to continue not using SeqCst orderings, I expect a significant amount of commenting explaining why the atomic orderings are indeed correct (there are many in play, not just one or two).

If this patch is targeted at simplification of the logic, then it should stick to that. If it is targeted at speeding up the existing logic, then I don't think that this patch is quite necessary (it just needs a fast path on what exists now).

> because it's a little easier to actually understand the algorithm

I don't really consider this grounds for rewriting everything, per se. Concurrent code is always tricky to understand no matter how you write it. For example, I don't necessarily agree that this is simpler than what it was before, it will take me time to verify this. I am fine with re-working things, but please understand that simply rewriting is not necessarily a simplification.

@lilyball (Contributor, Author) commented Apr 7, 2014

> Additionally, I still do not understand why it is necessary to avoid SeqCst on the slow path

I don't understand why we should use SeqCst when we don't need that strong a fence. The atomic memory orderings in this code are actually pretty simple to reason about. The only confusion was whether unlocking a mutex is a Release or an AcqRel (or stronger) operation, and I've already modified the code to make the conservative assumption.

I get that you're saying that the slow path doesn't really need optimization, because it's the slow path, but just because it's the slow path doesn't mean we need to make it slower than necessary.

> Concurrent code is always tricky to understand no matter how you write it.

Sure, I agree with that. But again, that doesn't mean we need to make it any harder to understand than necessary.

@lilyball (Contributor, Author) commented Apr 7, 2014

> Additionally, I still do not understand why it is necessary to avoid SeqCst on the slow path

To clarify: if my proposed rewrite did nothing other than change the memory orderings on the slow path, then I'd understand your objection. But my proposed rewrite changes the algorithm to make it easier to reason about. And as long as it's being rewritten, using the correct memory orderings is a sensible thing to do.

@flaper87 (Contributor) commented May 3, 2014

It looks like this change needs further discussion too, but it's been stalled for a month. I'll close it; it can be reopened if necessary.

@flaper87 closed this May 3, 2014
@huonw (Member) commented May 3, 2014

There is discussion happening on #13349 (comment), but this can just be reopened if/when a decision is made.
