-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Soundness fix on 32-bit to cache.rs #811
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gnzlbg (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
/// Is the cache uninitialized? | ||
#[inline] | ||
pub(crate) fn is_uninitialized(&self) -> bool { | ||
self.1.load(Ordering::Relaxed) == u32::max_value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So IIUC the problem was here right? This checks whether the second atomic has been initialized, and if that is the case, it assumes both atomics to be initialized. However, the function that initializes the two atomics below performs two relaxes writes, and while this test returns true if the second write has happened, there is no guarantee that the first write has happened, and therefore the UB. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, wouldn't it suffice to have the first write happen before the second write ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This could also be fixed by using Acquire
here and Release
when doing self.1.store
further down. However doing it this way also allows you to get away with just a single atomic load on the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wish to do it with Release
semantics, then the fast path would contain a single load with Release
semantics and a single load with Relaxed
semantics. This PR only uses a single Relaxed
load on the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have this code:
// thread 1
self.0.store(rel);
self.1.store(rel);
// thread 2
self.1.load(rel); // assumes self.0.store(rel) has happened
Is there a way to have self.0.store
happen-before self.1.store
such that a relaxed load in thread 2 of self.1
can only observe the initialized state if both stores have happened ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot get my example to work either. I think it was broken.
You are right that the value is never reset in the current code. I only reset it to run the experiment multiple times.
The point I am trying to make is that there is currently nothing introduces a synchronization point across threads. So even though it works on x86 which has a very strong memory model, I think the code can give the wrong result when only relying on the LLVM memory model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @RalfJung
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point I am trying to make is that there is currently nothing introduces a synchronization point across threads
Yes, I am able to follow this much. I also understand that Release+Acquire would suffice here.
What I'm still not sure about is why do we need the Acquire at all? IIUC having a relaxed store followed by a release store, and then doing a relaxed load on the second store, should be enough to guarantee that both stores happen before the relaxed load that observes the initialized cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as the C11 memory model (and thus presumably the primary target of the llvm memory model) goes, a release without an acquire is meaningless. As a practical example, the problematic code is actually something like
if y.load(relaxed) == 0xFFFFFFFF) {
x.store(relaxed) = foo;
y.store(release) = bar; //..never 0xFFFFFFFF..
}
let r = x.load(relaxed);
and the x.load(relaxed)
could be moved before the if statement, so long as it's corrected for the write inside of the branch, ie it could be:
let mut r = x.load(relaxed);
if y.load(relaxed) == 0xFFFFFFFF) {
r = foo;
x.store(relaxed) = r;
y.store(release) = bar; //..never 0xFFFFFFFF..
}
making it be y.load(acquire)
prevents this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally always use release/acquire unless it is either blatantly obvious that relaxed is correct (that's clearly not the case here) or we have benchmarks showing that release/acquire is too expensive.
I don't have time at the moment for an in-depth review of weak-memory concurrency code (and a superficial review is no good here). But by default I'd expect the work required for that review to not be worth the effort. Most of the time, release/acquire and relaxed will even generate the same x86 assembly (LLVM is not terribly good at exploiting the optimization potential granted by relaxed accesses).
Thanks, that clarifies it for me! I think I would prefer to pursue the
acquire/release for x86 32-bit because it is simple and without a benchmark
that shows and improvement it is hard to tell weather complicating the
implementation is worth it, particularly for 32-bit x86 which is a target
with many performance pitfalls. If someone wants to work on a benchmark,
i’d Still recommend to try to land a commented acquire / release approach
first, so that we can fix the soundness bug in the simplest way possible,
and also have a baseline to which compare the optimizations.
For x86-64 I’d prefer to keep the code as simple as possible.
|
I really don't see any possibility of the current code having faster in the common case for either 32-bit or 64-bit. Here are a summary of the atomic counts in the different versions:
As I see it I try to achieve multiple things with this PR, some of which you might not want and some of which should probably be in a different PR. I think that is the main reason why you think it is too complicated. Please let me know which of the following you think are worth pursuing and whether you want them in a separate PR.
|
So I think it would be better to split the 64-bit and 32-bit version of this PR as follows. For 64-bit, I think we should try to get rid of the double relaxed-load in the simplest way possible, on its own PR. For 32-bit, I think I would prefer to split the work in two PRs. First, fix the soundness bug in the simplest way possible, using Acquire/Release - don't worry about performance here. Once the 32-bit implementation is sound, then figure out how to make it faster on its own different PR. I hope that by proceeding this way the tradeoffs of each problem will be clearer, and we can make progress faster. How does that sound? |
☔ The latest upstream changes (presumably 9fa96fa) made this pull request unmergeable. Please resolve the merge conflicts. |
This was fixed in #837 by changing the If this at some point ends up being enough of a performance problem that you would like to pursue more complicated solution based on |
This PR fixes a soundness issue on 32-bit.
The problem is that on 32-bit relaxed semantics are not good enough as the code is currently written. The problem is that it is possible for a thread to observe the initialized value for CACHE.1 while still observing the uninitialized value for CACHE.0.
The way I fixed it was by putting a bit in both of the
AtomicU32
values that shows whether it has been initialized or not.While changing the code, I also changed it so only a single atomic load is needed on the fast path on both 32 and 64 bit.