-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document movnt needs sfence #1457
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1683,6 +1683,17 @@ pub unsafe fn _mm256_lddqu_si256(mem_addr: *const __m256i) -> __m256i { | |
/// aligned memory location. To minimize caching, the data is flagged as | ||
/// non-temporal (unlikely to be used again soon) | ||
/// | ||
/// # Safety | ||
/// | ||
/// After using this intrinsic, but before any atomic operations occur, a call | ||
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe | ||
/// usage of this intrinsic must always end in `_mm_sfence()`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are currently moving into a different direction in rust-lang/rust#114582: if we follow @talchas' approach, the rule will be: after using this intrinsic, but (happens-)before any other read or write of this location in any thread, a fence must occur (the fence must be in the same thread that called the intrinsic). Basically, the write actually happens in another thread via a non-atomic store, so accessing this location may cause a data race. Doing a fence waits for that other thread to complete its store, avoiding the data race. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm. I tried to keep the rationale for the requirement slightly vague as we discuss it, and approach it as more of a "do this or your program explodes into flames" type of warning, which is why it goes on to explain further reads or writes to the location are discouraged. So with that in mind, what is this missing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer if we'd call that UB instead of merely "discouraged". Ideally even further nontemporal writes (before the next sfence) would be UB... I'd find it strange to have a situation where a nontemporal store would be allowed but a regular store would not. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I opened #1534 with my alternative wording. |
||
/// | ||
/// Reading and writing to the memory stored-to by any other means, after any | ||
/// nontemporal store has been used to write to that memory, but before the | ||
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline | ||
/// stalls and yet-unspecified program behavior. | ||
Comment on lines
+1692
to
+1695
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm. Well, both x86's movntdq and Arm's stnp retain local-thread serial ordering if you use them twice on the same location, i.e. that using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do agree that we should canonically allow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The thing is that the proposed doc just doesn't align with the proposed spec in rust-lang/rust#114582. If you go with @talchas' original proposed spec, the doc should be something like "After the Basically, the nontemporal writes performed by streaming operations should be considered not ordered even with other operations in the thread they appear in, except with other nontemporal writes, until the next sfence which establishes synchronization with all nontemporal writes of the current thread." That would allow your example. Crucially it disallows There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh sorry, I misread your example and thought it said If we make There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a myth that "you cannot read uninitialized memory" and I think it is harmful. I've encountered tons of confusion caused by people axiomatically thinking "I cannot read uninit memory". Instead we should teach people what actually happens in our spec: when reading memory at a certain type, that memory must be sufficiently "valid" for this type. If you read at But anyway, it seems unlikely we will get to an agreement on terminology or teaching philosophy here and that's all rather off-topic anyway. The on-topic question is whether we should allow people to perform regular writes in between the nontemporal write and the fence. I don't see a good motivation for doing that, and it opens some tricky questions that we'd need to carefully figure out before allowing it. For instance, if I do a nontemporal write and then a regular write, do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ehnn I mean yes, just... ...Anyways I think, dreadfully, to actually answer your question about the write series (i.e. movnti [somewhere], something
mov [somewhere], something
mov [flag], 1 ), that on x86 you kiiiinda still need the fence, from the ISA's point of view? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm. The local results should be consistent absolutely, but I am concerned about code that may look more like moving overlapping sizes, so vmovntdq [somewhere], ymm09
mov [somewhere], r09
mov [flag], 1 The first 8 bytes are guaranteed once the flag is set due to the two regular movs participating in TSO, but the vmovntdq means 24 bytes are in an ambiguous state. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
If the answer to that is yes then I don't think we can allow regular writes. A regular write in Rust is guaranteed to be properly released by a release operation, after all. We can have intrinsics that do "regular write to something that might be in the nontemporal state", but those need the same inline-assembly-and-Rust-replacement-code-spec treatment as streaming writes. I'd really prefer if we didn't have to do this... the usecases we are aware of don't need this, do they? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, alright then, I didn't realize that was the specific way the sequencing invariants were formed. It's possible the reality may be amenable to regular writes (assuming a sort of "wholesale adoption of the x86 mechanics into Rust" approach for this case) but I'd have to examine the rules... very closely. I don't think the majority of use cases need this in practice, correct, so now that we have hashed out that issue as existing, then I am happy to let this one go. I will update the documentation changes here accordingly. |
||
/// | ||
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_si256) | ||
#[inline] | ||
#[target_feature(enable = "avx")] | ||
|
@@ -1696,6 +1707,17 @@ pub unsafe fn _mm256_stream_si256(mem_addr: *mut __m256i, a: __m256i) { | |
/// to a 32-byte aligned memory location. To minimize caching, the data is | ||
/// flagged as non-temporal (unlikely to be used again soon). | ||
/// | ||
/// # Safety | ||
/// | ||
/// After using this intrinsic, but before any atomic operations occur, a call | ||
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe | ||
/// usage of this intrinsic must always end in `_mm_sfence()`. | ||
/// | ||
/// Reading and writing to the memory stored-to by any other means, after any | ||
/// nontemporal store has been used to write to that memory, but before the | ||
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline | ||
/// stalls and yet-unspecified program behavior. | ||
/// | ||
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_pd) | ||
#[inline] | ||
#[target_feature(enable = "avx")] | ||
|
@@ -1711,6 +1733,17 @@ pub unsafe fn _mm256_stream_pd(mem_addr: *mut f64, a: __m256d) { | |
/// caching, the data is flagged as non-temporal (unlikely to be used again | ||
/// soon). | ||
/// | ||
/// # Safety | ||
/// | ||
/// After using this intrinsic, but before any atomic operations occur, a call | ||
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe | ||
/// usage of this intrinsic must always end in `_mm_sfence()`. | ||
/// | ||
/// Reading and writing to the memory stored-to by any other means, after any | ||
/// nontemporal store has been used to write to that memory, but before the | ||
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline | ||
/// stalls and yet-unspecified program behavior. | ||
/// | ||
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_ps) | ||
#[inline] | ||
#[target_feature(enable = "avx")] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit too strict. There are niche scenarios where one would write out data without ever reading it again or at least without reading it again on another thread.
In those cases some later release write would be incidental and only meant to order other, regular writes.
Maybe the entire requirement could be conditional on "if the written memory is intended to be made accessible on another thread through a release operation", with the recommendation "if in doubt, add a fence".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, I don't think it's coherent to leave an unsatisfied obligation hanging unless it's still in
unsafe
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends on the perspective. Yes, it's an obligation to... restore consistency with the current rust memory model which assumes that all writes must be ordered with a release operation.
But under some unspecified extended model it may be valid to leave some memory locations unordered.
AIUI the purpose of ordering is to avoid data races which are UB. If the another thread never accesses the memory then this is unspecified behavior but not necessarily UB.
E.g. if you're writing bytes to a framebuffer that's concurrently being scanned out by the GPU then the fence doesn't add anything. You're not synchronizing with anything. You're just racing against time. Either the write makes out to the pixels or it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the conservative definition is fine for now. But it could be phrased in a way that makes it clear that it may be replaced with a more refined definition at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to tell people to always discharge the obligation (even in the case you mention, the store-store fence would serialize any remaining deferred write-buffers, which is desired if you might run out of time otherwise!) and relax things when we mechanize a better spec.