Rewrite sync::one::Once.doit #13351
Conversation
    return

    // The general algorithm here is we keep a state value that indicates whether
    // we've finished the work. This lets us avoid the expensive atomic
    // read-modify-writes and mutex if there's no need. If we hvaen't finished
typo in "haven't"
Looks like there was also a typo in already-existing comments just above ("should't" instead of "shouldn't").
Looking over this, I'm not entirely convinced how favorable this is over #13349. This patch seems to have more optimized orderings for the slow path (when initialization is performed), but the same performance on the fast path (bailing out early). This has a lot of atomic orderings which may be correct (I'm still getting a grasp on them), but I'd rather err on the side of SeqCst wherever possible.

I agree that the current "fast path" being an xadd and an xchg (an atomic add and an atomic swap) is a little excessive. I think that #13349 is along the right path in terms of fast path optimizations, and I'm not sure how much the slow path optimizations will matter in the long run. I do agree that the fast/slow function split will provide a bit more perf, but see my above comments for why I think it may not be necessary quite yet.

Does that sound ok to you?
@alexcrichton I think that, even with changing all the atomics to SeqCst, this rewrite is still worthwhile.

As for the atomic orderings, I'm reasonably confident that what I have here is correct. Sequential consistency is not necessary for this algorithm, especially because the mutex involved already provides synchronization guarantees. The main guarantee we need to enforce is that the closure happens-before anything after the call to doit.
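To make the discussion concrete, here is a minimal sketch of the shape being debated: an acquire-load fast path plus a separate slow function. It is written against today's std::sync::atomic API rather than 2014's atomics module, the doit/doit_slow split and the constant names are illustrative assumptions, and the blocking machinery (the mutex and lock_cnt) is deliberately elided:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const INCOMPLETE: usize = 0; // work never started
const RUNNING: usize = 1;    // some thread is running the closure
const COMPLETE: usize = 2;   // work finished; fast path may return

struct Once {
    state: AtomicUsize,
}

impl Once {
    #[inline]
    fn doit(&self, f: impl FnOnce()) {
        // Fast path: a single acquire load. It pairs with the release
        // store below, so everything the closure wrote happens-before
        // anything after this early return.
        if self.state.load(Ordering::Acquire) == COMPLETE {
            return;
        }
        self.doit_slow(f);
    }

    #[inline(never)]
    fn doit_slow(&self, f: impl FnOnce()) {
        // Sketch only: in real code the losers must block until the
        // winner finishes; that is what the mutex and lock_cnt are for.
        if self
            .state
            .compare_exchange(INCOMPLETE, RUNNING, Ordering::Acquire, Ordering::Acquire)
            .is_ok()
        {
            f();
            // Publish the closure's effects to future acquire loads.
            self.state.store(COMPLETE, Ordering::Release);
        }
    }
}

fn main() {
    static O: Once = Once { state: AtomicUsize::new(INCOMPLETE) };
    O.doit(|| println!("runs once"));
    O.doit(|| println!("never runs"));
}
```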
EDIT: oops, on second read, the atomics no longer seem correct. In particular, I think the Relaxed decrement of lock_cnt needs to be Release (see the inline comment below). Other than this, they seem correct.

I think there are two further optimizations possible. First, it should be possible to save a word of memory by combining state and lock_cnt. Second, making the fastpath check a check for >= 0 instead of != 2 might save an instruction on RISC architectures with a zero register and no compare-with-immediate instruction. This leads to a design where the high bit is set when we should run the fastpath, the second-highest bit represents whether the work is being done, the third-highest whether the lock has been freed, and the remaining bits hold the lock count. Hence, this code (UNTESTED!):
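(The code itself was not preserved here; as a stand-in, this is a hypothetical sketch of just the bit layout described above, with invented constant names and a sign-test helper.)

```rust
use std::sync::atomic::{AtomicI32, Ordering};

// Hypothetical packing of state and lock_cnt into one word, following
// the layout described above.
const DONE: i32 = 1 << 31;       // high bit: set once the work is finished
const RUNNING: i32 = 1 << 30;    // second-highest: closure is being run
const LOCK_FREED: i32 = 1 << 29; // third-highest: mutex has been destroyed
const CNT_MASK: i32 = (1 << 29) - 1; // remaining bits: lock count

fn must_take_slow_path(state: &AtomicI32) -> bool {
    // The fastpath check becomes a sign test (>= 0) instead of a
    // comparison with 2: one branch-on-sign instruction on RISC targets
    // with a zero register and no compare-with-immediate.
    state.load(Ordering::Acquire) >= 0
}

fn main() {
    let state = AtomicI32::new(0);
    assert!(must_take_slow_path(&state));
    state.store(DONE, Ordering::Release);
    assert!(!must_take_slow_path(&state));
    let _ = (RUNNING, LOCK_FREED, CNT_MASK); // unused in this tiny demo
}
```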
    }
    drop(guard);

    // Last one out cleans up after everyone else, no leaks!
-   if self.lock_cnt.fetch_add(-1, atomics::SeqCst) == 1 {
+   if self.lock_cnt.fetch_add(-1, atomics::Relaxed) == 1 {
        unsafe { self.mutex.destroy() }
I think this needs to be Release, not Relaxed, because otherwise it can be moved before drop(guard), and then another task can destroy the mutex before we unlock it.
drop(guard) is dropping a mutex. I assumed that was at least an AcqRel operation. Is that untrue?
POSIX says that pthread_mutex_unlock() shall synchronize memory with respect to other threads. I'm assuming this basically means it's a SeqCst operation.
After considering this some more, I think I will go ahead and make this Release. I believe that it's not necessary, but it's a very small price to pay for peace of mind.
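To make the hazard concrete, here is a hedged sketch of the cleanup pattern under discussion, using std's Mutex and hypothetical names rather than the patch's raw mutex:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

// With a Relaxed decrement, nothing stops the fetch_add from being
// reordered before the unlock (the drop of `guard`); another thread
// could then see the count hit zero and destroy a mutex we still hold.
// Release forbids the earlier unlock from moving after the decrement.
fn leave(mutex: &Mutex<()>, lock_cnt: &AtomicIsize) {
    let guard = mutex.lock().unwrap();
    // ... critical section ...
    drop(guard); // unlock
    if lock_cnt.fetch_add(-1, Ordering::Release) == 1 {
        // Last one out: the original code destroys the raw mutex here.
    }
}

fn main() {
    let m = Mutex::new(());
    let cnt = AtomicIsize::new(1);
    leave(&m, &cnt);
    assert_eq!(cnt.load(Ordering::Relaxed), 0);
}
```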
@bill-myers Interesting. Thanks for looking at this. I haven't carefully read your implementation yet, but I'm not convinced combining state and lock_cnt is correct. Or as a more trivial example, a single thread calling doit … This may be fixable by moving the definition of …
@bill-myers If you truly wanted to save memory, we could drop …
@alexcrichton Looking at …
I've updated the …
sync::one was claiming a) that it was unsafe, and b) that it would block the calling OS thread because Mutex didn't know about green threads. Neither of these claims is true anymore.
Regarding memory consumption, now that I think of it, we could do even better by putting the Mutex into a global hash table keyed by the address of the Once (where the hash table is either lockless or protected by a global mutex), so that you can have a single-bit Once. This has significant tradeoffs though (e.g. it allocates memory), and I think it would be best done in a separate pull request if at all, and perhaps exposed as a separate struct.
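As a rough, purely illustrative sketch of that idea (nothing in this patch; every name here is invented), the global table might look like this:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical global registry mapping the address of a Once to its
// lock, so the Once itself only needs a flag. Note the tradeoff called
// out above: this allocates, and every lookup takes a global lock.
static REGISTRY: Mutex<Option<HashMap<usize, Arc<Mutex<()>>>>> = Mutex::new(None);

fn lock_for(once_addr: usize) -> Arc<Mutex<()>> {
    let mut table = REGISTRY.lock().unwrap();
    table
        .get_or_insert_with(HashMap::new)
        .entry(once_addr)
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

fn main() {
    let a = lock_for(0x1000); // pretend 0x1000 is &some_once as usize
    let b = lock_for(0x1000);
    assert!(Arc::ptr_eq(&a, &b)); // same Once address, same mutex
}
```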
You can't have a single-bit once. There are two bits of data here: whether we've started, and whether we've finished. So you'd need a two-bit once. And I suspect this approach would actually be worse from a memory perspective, because hash tables have their own overhead and programs are typically not going to have very many Once values. I expect the overhead from the hash table to be larger than the memory saved by taking the mutex out of the Once.

In any case, assuming your implementation is correct (and again, I haven't taken the time to satisfy myself that it is), I'm personally in favor of not going that far. Saving a single word of memory per Once at the cost of an implementation that is significantly harder to read does not seem to me like a good tradeoff. Even the most complex program is likely to have no more than a few dozen Onces.
Let's please not travel too far down the premature optimization path. @kballard, you said that a primary reason for this patch was simplifying the code, and conserving one word of space, I believe, does not simplify this at all. While a fun exercise, I would like to keep this code as readable as possible.

Additionally, I still do not understand why it is necessary to avoid SeqCst. If this patch is targeted at simplification of the logic, then it should stick to that. If it is targeted at speeding up the existing logic, then I don't think that this patch is quite necessary (it just needs a fast path on what exists now).
I don't really consider this grounds for rewriting everything, per se. Concurrent code is always tricky to understand no matter how you write it. For example, I don't necessarily agree that this is simpler than what it was before; it will take me time to verify this. I am fine with re-working things, but please understand that simply rewriting is not necessarily a simplification.
I don't understand why we should use SeqCst when weaker orderings suffice. I get that you're saying that the slow path doesn't really need optimization, because it's the slow path, but just because it's the slow path doesn't mean we need to make it slower than necessary.
Sure, I agree with that. But again, that doesn't mean we need to make it any harder to understand than necessary.
To clarify: if my proposed rewrite did nothing other than change the memory orderings on the slow path, then I'd understand your objection. But my proposed rewrite changes the algorithm to make it easier to reason about. And as long as it's being rewritten, using the correct memory orderings is a sensible thing to do.
It looks like this change needs further discussion too, but it's been stalled for a month. I'll close it; it can be reopened if necessary.
There is discussion happening on #13349 (comment), but this can just be reopened if/when a decision is made.
The old Once.doit() was overly strict and used read-modify-writes when
it didn't have to. The new algorithm will be just a read-acquire load in
the already-initialized case, and the first-call and contested cases use
less strict memory orderings than the old algorithm's SeqCst.
On my machine (3.4GHz Intel Core i7) I was seeing the old algorithm take
13ns (in the already-initialized case). The new one takes only 3ns, and
that drops to 0.5ns with the #[inline] annotation. I am unsure how to
properly measure the contested case, because of the very nature of this
object.
This algorithm definitely needs to be carefully reviewed by someone with
more experience using atomics.
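For the already-initialized numbers quoted above, a measurement along these lines is one way to approximate the fast-path cost; this uses today's std::sync::Once rather than the patched sync::one, and the iteration count is an arbitrary choice:

```rust
use std::sync::Once;
use std::time::Instant;

fn main() {
    static ONCE: Once = Once::new();
    ONCE.call_once(|| { /* one-time initialization */ });

    // Time only the already-initialized fast path. black_box is a
    // best-effort hint against the loop being optimized away entirely.
    let iters: u32 = 100_000_000;
    let start = Instant::now();
    for _ in 0..iters {
        ONCE.call_once(|| unreachable!());
        std::hint::black_box(());
    }
    let elapsed = start.elapsed();
    println!("{:.2} ns/call", elapsed.as_nanos() as f64 / f64::from(iters));
}
```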