Use a load rather than a fence when dropping the contents of an Arc. #41714
Conversation
This is what Gecko does [1]. If I understand correctly, an Acquire load on an address just needs to synchronize with other Release operations on that specific address, whereas an Acquire fence needs to synchronize globally with all other Release operations in the program. [1] http://searchfox.org/mozilla-central/rev/ae8c2e2354db652950fe0ec16983360c21857f2a/xpcom/base/nsISupportsImpl.h#337
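For concreteness, here is a hedged sketch of the two strong-count drop paths under discussion; the function and field names are simplified stand-ins, not the real liballoc internals:

```rust
use std::sync::atomic::{self, AtomicUsize, Ordering};

/// Current shape: Release decrement, then an Acquire *fence* on the last drop.
fn drop_strong_with_fence(strong: &AtomicUsize) -> bool {
    if strong.fetch_sub(1, Ordering::Release) != 1 {
        return false; // not the last strong reference
    }
    // Must synchronize with every earlier Release decrement, on any address.
    atomic::fence(Ordering::Acquire);
    true // caller may now drop the contents and free the allocation
}

/// Gecko-style shape proposed here: an Acquire *load* of the same counter.
fn drop_strong_with_load(strong: &AtomicUsize) -> bool {
    if strong.fetch_sub(1, Ordering::Release) != 1 {
        return false;
    }
    // Only needs to synchronize with Release operations on this one cell
    // (via the release sequence headed by the other threads' decrements).
    let _ = strong.load(Ordering::Acquire);
    true
}
```

In both variants the Release decrement publishes each thread's uses of the data; the difference is only in how the final dropper acquires them.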
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
It's entirely possible that I have this wrong somehow, but if so I'd love to understand why. r? @aturon
Also added a long comment explaining why we need the Acquire/Release handshake.
Note that there's another fence for the weak count which may matter here as well? Could you benchmark to see what the difference is?
Yeah, that should probably be a load as well, I think.
I think a good benchmark will be hard to construct here, because it depends entirely on the cache line breakdown and what the other CPUs happen to be doing at the same time. This patch just gives the CPU more degrees of freedom in contended situations. If there's no contention, I'd expect there to be no difference.
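For reference, the weak-count path mentioned above has the same shape. A hedged sketch, with assumed names rather than the actual liballoc code:

```rust
use std::sync::atomic::{self, AtomicUsize, Ordering};

/// Sketch of the analogous weak-count path (names assumed): this count
/// guards the allocation itself rather than the contents.
fn drop_weak(weak: &AtomicUsize) -> bool {
    if weak.fetch_sub(1, Ordering::Release) != 1 {
        return false; // other weak (or strong) handles still exist
    }
    // The same Acquire fence appears here today; under this PR's reasoning
    // it could likewise become `let _ = weak.load(Ordering::Acquire);`.
    atomic::fence(Ordering::Acquire);
    true // caller may deallocate the ArcInner
}
```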
@bholley I'm curious, though, if this change is motivated by some direct performance problems or benchmarking you've done? I ask just because this kind of change will entail some pretty intensive reviewing, and the code right now at least has solid lineage back through clang, IIRC.
Nope - I was just forking Arc for other reasons and came across this. If the answer is that we think the risk/reward here isn't worth it for now, that's fine. We should at least check in the comment though.
I don't think you meant to tag me? (Which means someone may be missing?)
Yes, I meant @gankro, sorry.
```diff
@@ -767,8 +767,39 @@ unsafe impl<#[may_dangle] T: ?Sized> Drop for Arc<T> {
         // > through this reference must obviously happened before), and an
         // > "acquire" operation before deleting the object.
         //
+        // Note that Rust is not C++, and so it's valid to ask whether we could
+        // do without the Acquire/Release handshake here. The contents of an Arc
+        // are immutable, and thus we could hypothetically assume that there are
```
The contents aren't immutable, due to the potential for interior mutability (through `UnsafeCell` or the atomics you mention below). When thinking about this fence in the past, I'd always considered it as ensuring that any outstanding interior non-`Relaxed` writes are visible prior to the destructor running.

I do agree that allocators in general are virtually guaranteed to have enough fencing to at least avoid relaxed writes landing after the memory is transferred to another thread. But I worry about people relying on ordering with respect to the destructor code running.
> When thinking about this fence in the past, I'd always considered it as ensuring that any outstanding interior non-`Relaxed` writes are visible prior to the destructor running.
Oh I see. I'd been thinking of such writes as being either atomic (in which case they're governed by their own memory ordering) or protected by something like a Mutex, which has its own synchronization. But I realize now that the destructor doesn't actually acquire the mutex before dropping the contents, and so we can't rely on its internal synchronization to flush the writes.
Let me know what you want to do about the fence. If it's not worth the risk of fixing (at least right now), I can clean up the comment to take the above discussion into account and land it separately. In general though, I do think it's worth fixing these kinds of potential performance issues when we find them. An overly conservative fence in Arc is exactly the sort of thing that can cause bad performance under high system load, but would never be discovered by profiling because (a) it's hard to reproduce those conditions reliably and (b) there are very few people, even in the Rust community, who are comfortable reasoning about memory orderings.
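To make the hazard above concrete, here is a hypothetical example (the `Logger` type and the scenario are invented for illustration, not taken from the PR): the destructor reads interior-mutable state without taking the lock, so the only ordering between the writer thread's mutation and the destructor's read is Arc's own Release-decrement/Acquire pairing.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical type whose destructor inspects interior-mutable state.
struct Logger(Mutex<Vec<u8>>);

impl Drop for Logger {
    fn drop(&mut self) {
        // Drop gets `&mut self`, so this reads the buffer WITHOUT taking the
        // lock; only Arc's Acquire edge orders it after other threads' writes.
        let buf = self.0.get_mut().unwrap();
        println!("flushing {} bytes", buf.len());
    }
}

fn main() {
    let a = Arc::new(Logger(Mutex::new(Vec::new())));
    let b = Arc::clone(&a);
    let t = thread::spawn(move || {
        b.0.lock().unwrap().push(1); // interior write behind the Mutex
        // `b` is dropped here: a Release decrement of the strong count.
    });
    // Races with the spawned thread; whichever decrement reaches zero runs
    // Logger::drop, which must observe the push above.
    drop(a);
    t.join().unwrap();
}
```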
cc @Amanieu
I agree that it's worth exploring these kinds of fixes, but they are relatively high risk and cost, so I'd like to have some evidence that they make a difference (beyond purely hypothetical performance reasoning). That's a standard we've been trying to apply across the standard library.

In short, my feeling is that, absent concrete evidence of a performance improvement here, I'd prefer not to pursue it at this time. (Thanks, though, for the PR!)
I think the comments about the allocator are misleading. The purpose of the fence is to ensure that any operations performed by other threads on the shared data happen before the data is dropped.

I don't think there is much performance difference between an acquire fence and an acquire load; basically, with a fence you avoid an extra load. Also, as far as I know, it is perfectly valid to have a fence synchronize with an atomic operation.
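For reference, a minimal sketch of the fence/atomic pairing described above, following the standard atomic-fence synchronization pattern (the statics and values here are illustrative, not from the PR):

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release); // release store
    });
    let consumer = thread::spawn(|| {
        // Spin until the relaxed load observes the release store above.
        while !READY.load(Ordering::Relaxed) {}
        // The Acquire *fence* synchronizes with the Release *store*, so the
        // relaxed write to DATA is now guaranteed to be visible.
        fence(Ordering::Acquire);
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```

The fence pairs with the release store even though the flag itself was only read with a relaxed load.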
I've approved the pure-docs PR, and am going to close this one. Thanks again @bholley!
Document the reasoning for the Acquire/Release handshake when dropping Arcs. Split out from rust-lang#41714. r? @aturon
@Amanieu interesting! I'd always thought that acquire/release fences imposed a greater restriction on the compiler and CPU than acquire/release loads/stores, since the latter are only enforced when the two threads are operating on the same memory address. So, for example, a CPU would need to synchronously flush the cache lines if core X released mutex A and then core Y acquired mutex A, but would not need to do so if core Y acquired mutex B instead. Do I have that wrong, or does the C++11 memory model allow for a wider range of optimization than current compilers and CPUs actually use?
The C11 model is engineered to support things like the DEC Alpha, so yeah, modern hardware doesn't take full advantage of it. Making sloppy code work, and work fast, is kinda the job of hardware devs.
Cough… compiler devs come before that.
Comment withdrawn, I misunderstood the spec.