Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Soundness fix on 32-bit to cache.rs #811

Closed
wants to merge 2 commits into from
Closed

Soundness fix on 32-bit to cache.rs #811

wants to merge 2 commits into from

Conversation

TethysSvensson
Copy link

This PR fixes a soundness issue on 32-bit.

The problem is that on 32-bit relaxed semantics are not good enough as the code is currently written. The problem is that it is possible for a thread to observe the initialized value for CACHE.1 while still observing the uninitialized value for CACHE.0.

The way I fixed it was by putting a bit in both of the AtomicU32 values that shows whether it has been initialized or not.

While changing the code, I also changed it so only a single atomic load is needed on the fast path on both 32 and 64 bit.

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gnzlbg (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

/// Is the cache uninitialized?
#[inline]
pub(crate) fn is_uninitialized(&self) -> bool {
self.1.load(Ordering::Relaxed) == u32::max_value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC the problem was here right? This checks whether the second atomic has been initialized, and if that is the case, it assumes both atomics to be initialized. However, the function that initializes the two atomics below performs two relaxes writes, and while this test returns true if the second write has happened, there is no guarantee that the first write has happened, and therefore the UB. Is that correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, wouldn't it suffice to have the first write happen before the second write ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This could also be fixed by using Acquire here and Release when doing self.1.store further down. However doing it this way also allows you to get away with just a single atomic load on the fast path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wish to do it with Release semantics, then the fast path would contain a single load with Release semantics and a single load with Relaxed semantics. This PR only uses a single Relaxed load on the fast path.

Copy link
Contributor

@gnzlbg gnzlbg Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have this code:

// thread 1
self.0.store(rel);
self.1.store(rel); 

// thread 2
self.1.load(rel); // assumes self.0.store(rel) has happened

Is there a way to have self.0.store happen-before self.1.store such that a relaxed load in thread 2 of self.1 can only observe the initialized state if both stores have happened ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot get my example to work either. I think it was broken.

You are right that the value is never reset in the current code. I only reset it to run the experiment multiple times.

The point I am trying to make is that there is currently nothing introduces a synchronization point across threads. So even though it works on x86 which has a very strong memory model, I think the code can give the wrong result when only relying on the LLVM memory model.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point I am trying to make is that there is currently nothing introduces a synchronization point across threads

Yes, I am able to follow this much. I also understand that Release+Acquire would suffice here.

What I'm still not sure about is why do we need the Acquire at all? IIUC having a relaxed store followed by a release store, and then doing a relaxed load on the second store, should be enough to guarantee that both stores happen before the relaxed load that observes the initialized cache.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as the C11 memory model (and thus presumably the primary target of the llvm memory model) goes, a release without an acquire is meaningless. As a practical example, the problematic code is actually something like

if y.load(relaxed) == 0xFFFFFFFF) {
      x.store(relaxed) = foo; 
      y.store(release) = bar; //..never 0xFFFFFFFF..
}
let r = x.load(relaxed);

and the x.load(relaxed) could be moved before the if statement, so long as it's corrected for the write inside of the branch, ie it could be:

let mut r = x.load(relaxed);
if y.load(relaxed) == 0xFFFFFFFF) {
      r = foo;
      x.store(relaxed) = r; 
      y.store(release) = bar; //..never 0xFFFFFFFF..
}

making it be y.load(acquire) prevents this

Copy link
Member

@RalfJung RalfJung Oct 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally always use release/acquire unless it is either blatantly obvious that relaxed is correct (that's clearly not the case here) or we have benchmarks showing that release/acquire is too expensive.

I don't have time at the moment for an in-depth review of weak-memory concurrency code (and a superficial review is no good here). But by default I'd expect the work required for that review to not be worth the effort. Most of the time, release/acquire and relaxed will even generate the same x86 assembly (LLVM is not terribly good at exploiting the optimization potential granted by relaxed accesses).

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 26, 2019 via email

@TethysSvensson
Copy link
Author

I really don't see any possibility of the current code having faster in the common case for either 32-bit or 64-bit. Here are a summary of the atomic counts in the different versions:

Fast path Initialization path
Current 64-bit 2 relaxed loads 2 relaxed loads, 1 relaxed store
My 64-bit 1 relaxed load 1 relaxed load, 1 relaxed store
Current 32-bit 2 relaxed loads 2 relaxed loads, 2 relaxed stores
32-bit with acquire/release 1 release load, 1 relaxed load 1 release load, 1 relaxed store, 1 acquire store, 1 relaxed load
My 32-bit 1 relaxed load 1 relaxed load, 2 relaxed stores

As I see it I try to achieve multiple things with this PR, some of which you might not want and some of which should probably be in a different PR. I think that is the main reason why you think it is too complicated.

Please let me know which of the following you think are worth pursuing and whether you want them in a separate PR.

  • General cleanup, such as removing the second argument to cache::test and increasing clarity by defining more constants.
  • Introduction of some arguable unnecessarily defensive code (e.g. this and this)
  • Getting rid of the double loads on 64-bit.
  • Solving the soundness problem on 32-bit by getting rid of double loads. This sacrifices a one of the 63 feature-bits.
  • Solving the soundness problem on 32-bit by introducing acquire/release semantics. This sacrifices some performance.

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 27, 2019

So I think it would be better to split the 64-bit and 32-bit version of this PR as follows.

For 64-bit, I think we should try to get rid of the double relaxed-load in the simplest way possible, on its own PR.

For 32-bit, I think I would prefer to split the work in two PRs. First, fix the soundness bug in the simplest way possible, using Acquire/Release - don't worry about performance here.

Once the 32-bit implementation is sound, then figure out how to make it faster on its own different PR.

I hope that by proceeding this way the tradeoffs of each problem will be clearer, and we can make progress faster. How does that sound?

@bors
Copy link
Contributor

bors commented Jan 28, 2020

☔ The latest upstream changes (presumably 9fa96fa) made this pull request unmergeable. Please resolve the merge conflicts.

@TethysSvensson
Copy link
Author

TethysSvensson commented Jan 31, 2020

This was fixed in #837 by changing the Relaxed semantics to SeqCst semantics.

If this at some point ends up being enough of a performance problem that you would like to pursue more complicated solution based on Relaxed semantics, let me know and I can try to resurrect this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants