Does the concept of a compiler fence make any sense? #347
I think atomic fences have additional guarantees that cannot be specified in terms of the abstract machine. Compiler fences provide a subset of those additional guarantees. These guarantees would probably have to be specified in the same way that FFI or inline assembly is specified - as a relation between the abstract machine state and the underlying lower-level state of the machine.

If you look at the uses for a compiler fence: https://stackoverflow.com/a/18454971

Both uses boil down to a single thread of execution being interrupted (either by an actual interrupt, or by a context switch) to run some other code, where the programmer wants to synchronize access to memory between the interrupted code and the interrupting code. This cannot be specified at an abstract-machine level, because the abstract machine doesn't have a concept of a thread being interrupted in this way.

Going back to "atomic fences have additional guarantees that cannot be specified in terms of the abstract machine": I think this is evidenced by the fact that we would expect atomic accesses from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program. Therefore there must be "more to" atomic accesses than just their semantics within the abstract machine.
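(For concreteness, here is a minimal sketch of that interrupted-thread pattern, loosely modeled on the example in the `std::sync::atomic::compiler_fence` documentation; how `signal_handler` gets registered is omitted and it is assumed to run on the same thread it interrupts.)

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicUsize, Ordering};

static IMPORTANT_VARIABLE: AtomicUsize = AtomicUsize::new(0);
static IS_READY: AtomicBool = AtomicBool::new(false);

fn main() {
    IMPORTANT_VARIABLE.store(42, Ordering::Relaxed);
    // Keep the compiler from sinking the store above below the flag store,
    // so a handler interrupting this thread can rely on the order.
    compiler_fence(Ordering::Release);
    IS_READY.store(true, Ordering::Relaxed);
}

// Assumed to be registered as a signal handler for the same thread.
fn signal_handler() {
    if IS_READY.load(Ordering::Relaxed) {
        assert_eq!(IMPORTANT_VARIABLE.load(Ordering::Relaxed), 42);
    }
}
```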
So you are going in the direction of what was discussed around here.
That would block a lot of optimizations that you would want when writing concurrent software with atomics. In particular, we want there to be a meaningful difference between atomic fences with different orderings, which I do not think is the case under your proposal. So I don't think atomic fences should have any effect like that. So maybe the statement that an atomic fence is strictly stronger than a compiler fence is wrong?
I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model. C++ atomics are not some abstract way of talking about what the hardware does; they are completely their own thing that is then implemented in terms of what the hardware does. There are sometimes even multiple different mutually incompatible schemes to implement the same model on a given hardware!

@DemiMarie (in the old thread):

> I guess part of the question here is whether interrupts exist as a language feature (that we'd have to add to the AM) and compiler fences interact with that, or whether we consider them to be outside the AM and compiler-fence to be a way to somehow coordinate with out-of-AM activities.

Atomic operations are not an out-of-AM mechanism. This is one of the many key differences between atomic accesses and volatile accesses. (E.g., the compiler is allowed to reorder and remove atomic accesses when that preserves the semantics of the abstract memory model.) So it would be very strange to me if atomic fences were out-of-AM mechanisms, and indeed that would put a lot of extra cost on concurrent algorithms that are perfectly happy staying entirely inside the AM.
Cross-FFI atomics need to work. The question is how to specify atomics in such a way that they do.
That seems like a bit of a leap? I assume you are referring to:

> In particular we want there to be a meaningful difference between atomic fences with different orderings, which I do not think is the case under your proposal.

I would expect atomics to be like part of the ABI - i.e., the compiler is not expected to make any kind of atomic accesses from FFI code work correctly, but as long as the FFI code follows the same rules (for that ordering) as the Rust compiler does, then they would be expected to synchronize. I don't see why this would make orderings redundant.

It's my understanding that atomics are the expected way to do IPC via shared memory. I don't see how to reconcile your statement with that. (This is certainly a use-case supported by C++'s atomics.)

For shared memory with other code that follows the same memory model. Basically, with other instances of (roughly) the same AM.

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?
No, that's not what I am referring to. I'm saying

> So whatever the other code does must be cast in terms of that memory model.

We are using the C++ memory model, so that works fine. But syncing e.g. with assembly code only makes sense if whatever that assembly code does can be expressed in terms of this memory model. You can't expect an atomic fence to have any effect other than what it says in the memory model.
To be clear, I think saying that compiler_fence is an out-of-AM syncing mechanism is not a bad idea. However, that would make compiler_fence incomparable in strength with atomic fences. That is actually consistent with what @comex wrote: volatile accesses and atomic accesses are also incomparable in strength. However, I just realized compiler_fence has an `Ordering` argument, and memory orderings are a concept of the memory model, so it can't be entirely out-of-AM.
Oh I see - because the memory ordering is part of the C++/Rust memory model. I guess another option would be to add the concept of "interruption" to the abstract machine, and then define compiler fences in terms of that. I think it's possible to define everything about "interruption" except for what triggers it - that's the only part that's "outside" what we can define.
Why is that wrong? Single-threaded / signal / compiler fences are relevant for signal handlers, interrupts, and things like Linux's SYS_membarrier.
What's wrong is the "that means a compiler fence has no effect..." part, because they're supposed to have some sort of effect. We all seem to agree they should do something.

They do have an effect, but not in a program without any atomic operations.

Ah, then I was incorrect and we seem to disagree. Many people (myself included) have been under the impression that a compiler_fence has an effect even in a program with no atomic operations (particularly, even for targets where there are no atomics). If that's not the case we really need to fix the docs.
The C++ standard defines `atomic_signal_fence`¹ as:

> Equivalent to `atomic_thread_fence(order)`, except that the resulting ordering constraints are established only between a thread and a signal handler executed in the same thread.

(¹ Aka our `compiler_fence`.)

And adds:

> Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with `atomic_thread_fence()`, but the hardware fence instructions that `atomic_thread_fence()` would have inserted are not emitted.

Other types of interrupts or things like SYS_membarrier aren't part of the C++ standard, but are also use cases for a single-threaded/compiler/signal fence. A more practical, more widely applicable way of defining compiler_fence might therefore be in terms of any code that runs interleaved on the same thread of execution, rather than just signal handlers.
Well, I've never programmed C++, and so honestly I only vaguely follow what that's saying. I'm unclear on what the implications are for targets without atomics, where we might assume that there's only one core and possibly an interrupt handler.
When you say "without atomics", do you mean without atomic read-modify-write/CAS operations, or even without atomic load and store? If a platform has no atomic load or store for any size of atomic, then there's no way to communicate between a signal handler/interrupt and the main code at all through shared memory. Then there's indeed no use for a fence.
I mean for example ARMv4T, which can uninterruptedly read-write ("swap") with the `swp` instruction. And it does have interrupts. But you're telling me there's no official/legal way for the interrupt to communicate anything to the main program? That seems... unfortunate.
You don't need compare-and-swap or atomic read-modify-write. Just a simple atomic load and store is enough.
Well, there's normal load and store instructions, but not atomic anything instructions.

Aren't these loads and stores atomic, if they are "uninterruptible" and there is no multi-core version of this hardware?

That's the same. All load and store instructions on nearly all architectures are also atomic.

Indeed.

Oh, well then sure, there's atomic loading and storing.
BTW, I interpret that wording as just modifying how the fence-atomic, atomic-fence, and fence-fence synchronization rules function. Reading it this way makes quite a bit of sense to me, at least.
Query: how does one enforce ordering between volatile and non-volatile ops? Previously, I've also seen people assume that compiler_fence does this. But, again, apparently it only applies to atomic accesses.

You can't*. And yes, Rust needs volatile atomic operations.
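To make that concrete, here is a sketch of the driver-style pattern in question; the `DOORBELL` address and `submit` function are made up for illustration. The buffer writes are neither volatile nor atomic, so a guarantee that only covers atomic accesses arguably says nothing about them:

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical memory-mapped doorbell register of some device.
const DOORBELL: *mut u32 = 0x4000_0000 as *mut u32;

/// Fill a device-visible buffer, then ring the doorbell.
unsafe fn submit(buffer: *mut u8, data: &[u8]) {
    // Plain (non-volatile, non-atomic) writes into the shared buffer.
    ptr::copy_nonoverlapping(data.as_ptr(), buffer, data.len());
    // Intended to keep the buffer writes ordered before the doorbell write;
    // whether compiler_fence's documented guarantee covers this is the question.
    compiler_fence(Ordering::Release);
    // Volatile write telling the device to go read the buffer.
    ptr::write_volatile(DOORBELL, 1);
}
```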
this is what I've been using:

```rust
#[inline]
#[stable(feature = "compiler_fences", since = "1.21.0")]
#[rustc_diagnostic_item = "compiler_fence"]
pub fn compiler_fence(order: Ordering) {
    // SAFETY: using an atomic fence is safe.
    unsafe {
        match order {
            Acquire => intrinsics::atomic_singlethreadfence_acq(),
            Release => intrinsics::atomic_singlethreadfence_rel(),
            AcqRel => intrinsics::atomic_singlethreadfence_acqrel(),
            SeqCst => intrinsics::atomic_singlethreadfence(),
            Relaxed => panic!("there is no such thing as a relaxed compiler fence"),
        }
    }
}
```

which makes me think that the docs really need to be improved to mention that it only applies to atomic accesses, if that's the only guarantee it makes - and it doesn't help that they link to the Linux memory barrier docs, which discuss much more than just atomic orderings. IMO there should be some other way to prevent non-volatile & volatile reordering, as it's necessary to be able to guarantee that for things like driver code. The Linux kernel has functions specifically for accessing MMIO with barriers to prevent reordering in code, but do we make any such guarantee about compiler_fence?
I see what you're saying. Importantly, any clarification of compiler_fence along those lines couldn't talk only about atomic accesses. So it would instead need to be more like a statement about all memory accesses, volatile included - which makes it out of scope for this issue. OTOH, I think the way compiler_fence is documented today invites exactly this confusion.
The documented definition of compiler_fence only talks about atomic accesses. Any other definition is either not implemented or not specified (in the current implementation, compiler_fence is as strong as fence as far as the compiler's reorderings are concerned).
This could be simplified by saying that:

- every atomic fence also acts as a compiler fence, and
- every non-Relaxed atomic operation also acts as the corresponding fence.

Then every time we define a new thing that acts like an atomic fence, we would get the documentation for the interaction with compiler fences "for free." Note that this could be done regardless of whether compiler fence is effectively renamed to something that indicates it is signal-handler specific.
fence including compiler_fence is fine, from a semantics perspective (I don't know that it simplifies the semantics, though, to define it that way). All non-Relaxed atomics being fences does not work, since a fence has effects that apply to multiple atomic accesses, whereas an atomic access alone only forms synchronizes-with between actual store/load pairs.
https://eel.is/c++draft/atomics.fences

> A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

(and similar provisions for synchronizing with atomic operations on both sides) Notably, there's no restriction on which atomic operations the fences apply to.
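Spelled out in Rust, a minimal sketch of that rule (the comments map each access to the X/Y/M naming of the quote):

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering::{Acquire, Relaxed, Release}};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0); // made visible by the fence pair
static FLAG: AtomicU32 = AtomicU32::new(0); // the atomic object M

fn main() {
    thread::scope(|s| {
        s.spawn(|| {
            DATA.store(42, Relaxed);
            fence(Release);         // fence A, sequenced before X
            FLAG.store(1, Relaxed); // X: modifies M
        });
        s.spawn(|| {
            if FLAG.load(Relaxed) == 1 { // Y: reads the value written by X
                fence(Acquire);          // fence B: now A synchronizes with B
                assert_eq!(DATA.load(Relaxed), 42);
            }
        });
    });
}
```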
Yes... though we have to be a bit careful in the wording here. Mutex acquire can be implemented entirely with acquire loads, not using any acquire fence, so I don't think a fence-based explanation can work. Instead the clarification needs to be in terms of which happens-before / synchronizes-with edges are guaranteed to exist. But as you said that's getting off-topic for this issue.
Agreed. Explanations in terms of reordering may be good for getting some first intuition, but they are unsuited as normative definitions. And the only normative definition for compiler_fence that I know of is C++'s atomic_signal_fence.
Regarding clarifying how a Mutex works, I filed rust-lang/rust#126239.
Seen on LWN: a description of compiler fences as guaranteeing ordering only as observed from the same physical CPU.

That's not how compiler fences are specified, but I wonder -- does anything go wrong with this approach?
That sounds right - the "same physical CPU" doesn't have meaning inside the AM AIUI, but I think we all expected that there would be a non-AM component to the definition of a compiler fence. I guess a follow-up question is "what counts as the same physical CPU?". Is it a logical CPU core, a physical CPU core, a physical CPU, or just left up to the platform to decide (in which case it would be good to have answers for x86[-64] and ARM64 at least)?
It's something that behaves entirely in-order. So when you run in userspace, a single thread (with its signal handler) is "the same physical CPU"; I would say that a thread you spawn separately is not. On hardware, it's whatever the unit of "sequential computation" in this hardware is. If "logical CPU core" refers to things like hyperthreading, then I think that's what it is.
That doesn't seem like it would enable any optimizations, since the compiler is not necessarily aware of what signal handlers exist, and so would still have to restrict re-orderings on both threads (even assuming they execute on different cores) under the presumption that both threads have a signal handler. Given that, it seems as though a user could effectively rely on compiler fences working cross-thread (on the same logical core) even if we don't guarantee that explicitly.
Yeah, I guess the pathological case for hyper-threading is that despite two logical cores sharing the same L1 cache, the cores execute instructions out-of-order, breaking any guarantees of the fence.
I have a use case for compiler fences that (as far as I can tell) was not yet discussed in this issue, and where I'd want compiler fences to mean something even across CPU cores: stack sampling for profiling, especially in bytecode interpreters. The gold standard way of building sampling profilers is to periodically interrupt a currently running thread, and take a snapshot of its PC, call stack, etc. before you let it continue. Doing it this way requires some fiddly and platform-specific code to suspend threads but avoids large, systematic inaccuracies introduced by only sampling at certain "safe points" in the program. For this to work well, you need two kinds of "synchronization" in a muddy, conventional sense:

1. The snapshot must reflect the interrupted thread's memory as of the point where it was stopped, i.e., suspending the thread must itself synchronize with the code taking the sample.
2. The interpreter's call-stack data structures must look consistent at every point where the thread can be interrupted.

The first point is generally ensured by the platform-specific facilities used to suspend a thread. Only the second point is related to compiler fences. For example, setting up a stack frame during a call is not "atomic" in any real sense, but it's better if the stack pointer is (from the profiler's POV) only adjusted after the return address is written to the expected position relative to the new stack pointer. An AOT or JIT compiler that generates machine code directly, or an interpreter written in assembly, can achieve this by just being careful about the sequence of CPU instructions it uses to manipulate the call stack. Because the thread will be completely interrupted when the sample is taken, this can use conventional non-atomic loads/stores even on machines with very weak memory models (as long as they have precise interrupts). After all, that's how any sampling profiling of C, C++, Rust, etc. code works. But a bytecode interpreter written in Rust, which wants to permit cross-thread sampling of the bytecode call stack, needs to be more careful. There are basically three options:

1. Use properly synchronizing atomics (e.g., release stores) for every update of the shared state, paying for hardware fences on weakly ordered machines.
2. Generate machine code or write the hot paths in assembly, controlling the exact instruction sequence by hand.
3. Use relaxed atomic accesses and order them with compiler fences.
The third option might look like this (on the interpreter side; assume the code that takes the snapshot has a matching acquire fence):

```rust
use std::sync::atomic::{compiler_fence, AtomicU32, Ordering::{Relaxed, Release}};
use std::sync::Arc;

struct Interpreter {
    // all Arcs shared with the profiler thread
    pc: Arc<AtomicU32>,
    sp: Arc<AtomicU32>,
    stack: Arc<[AtomicU32]>,
}

impl Interpreter {
    fn exec_call(&mut self, new_pc: u32) {
        let sp = self.sp.load(Relaxed);
        let old_pc = self.pc.load(Relaxed);
        self.stack[sp as usize].store(old_pc, Relaxed);
        self.pc.store(new_pc, Relaxed);
        // SP increment needs to happen after storing the new return address.
        compiler_fence(Release);
        self.sp.store(sp + 1, Relaxed);
    }

    fn exec_return(&mut self) {
        let sp = self.sp.load(Relaxed);
        let new_pc = self.stack[(sp - 1) as usize].load(Relaxed);
        self.pc.store(new_pc, Relaxed);
        self.sp.store(sp - 1, Relaxed);
        // SP decrement needs to happen before any writes to the stack
        // from a subsequent call.
        compiler_fence(Release);
    }
}
```

Note that this is all safe code, and taking the snapshot also uses relaxed loads, so the problem is only possible race conditions. The argument why this sort of thing should work necessarily depends on the platform-specific bits not shown above. On Linux, suspending is generally done by delivering a signal to the thread you want to interrupt. The signal handler takes the snapshot and uses proper synchronization primitives (e.g., POSIX semaphores, because those are async-signal-safe) to talk to the profiler thread. If a compiler fence is like a regular fence restricted to a signal handler and the thread it interrupts, you can probably argue for the correctness almost entirely in Rust memory model terms. Or disprove it, if I made a mistake in the example code. But on Windows and macOS, the preferred/only(?) way is to ask the OS to suspend the target thread entirely and take the sample from another thread, which will generally run on another CPU core. Even outside of AM concerns, thinking only about the CPU architecture, it's not obvious that this can be correct. However, in practice it should be fine.
Because it's all safe code in the end (well, except the "suspending a running thread" bit, but that's a different matter), I am not exactly losing sleep over it. Still, this is a very gray area and I'd love to have more official ammunition for this way of using compiler fences. Alternatively, a convincing argument why compiler fences are the wrong tool for this and I need to do X instead would also be welcome.
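For what it's worth, the matching snapshot side (left implicit in the comment above) might look like the following sketch; it assumes the suspension machinery has already made the target thread's stores visible, which is exactly the cross-core question discussed next:

```rust
use std::sync::atomic::{compiler_fence, AtomicU32, Ordering::{Acquire, Relaxed}};

/// Hypothetical: called only once the target thread is known to be suspended
/// (signal delivered, or the OS-level suspension has completed).
fn take_snapshot(pc: &AtomicU32, sp: &AtomicU32, stack: &[AtomicU32]) -> (u32, Vec<u32>) {
    let depth = sp.load(Relaxed) as usize;
    // Matches the interpreter's compiler_fence(Release): everything written
    // before the SP was published must be visible to the reads below.
    compiler_fence(Acquire);
    let frames = stack[..depth].iter().map(|slot| slot.load(Relaxed)).collect();
    (pc.load(Relaxed), frames)
}
```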
Do these OSes provide any way to synchronize with the suspended thread? An OS already has to ensure that migrating a thread to another core does sufficient synchronization, so that logically, the thread acts as if it kept running on the same core. I assume they have to do the same here. If so, then arguably one can say that a thread X that does things after thread A got suspended (in a "happens-before" sense of "after") is "running on the same core as A", even if it's not the same hardware core.
I've not seen any OS documentation that comes out and says anything like that. This is a very sparsely documented area; e.g., it took me a whole lot of searching to learn that Windows requires you to call GetThreadContext after SuspendThread before the target thread is guaranteed to actually be stopped. I agree that migrating a thread between cores faces a similar problem and operating systems solve it somehow. In practical terms, it also seems very likely that they make "suspend another thread and read its thread state" apply enough synchronization (whatever that means on the CPU in question) to make basic tools like profilers work. However, it's not exactly the same as migrating a single thread and I'm not totally sure if that complicates anything under sufficiently weak memory models. For example, if thread migration requires some strong barrier instructions both on the old and the new core, then the OS would also have to do that on the other core when reporting that the thread was suspended, even if a weaker barrier was sufficient for basic message passing. Or maybe there's a fast path that checks an "is this thread running?" flag without necessarily creating enough synchronization. In this case, the semantics of compiler fence might affect whether it needs to emit machine code or not on the relevant platforms.
A compiler fence never has to emit machine code. That's pretty much its defining property. ;) Fundamentally, the OS needs to provide some operation that is guaranteed to sync with a previously suspended thread (if the "suspend a thread" function does not do that itself). And it needs to be strong enough to do everything a migration of that thread to the current core would do. I understand OSes don't document that, but this is not something Rust is much involved with. Once that is given, one can make an argument that the "inspecting" thread has basically "taken over the logical core" of the previously running thread and therefore a compiler fence is sufficient.

Well, in practice it sometimes does emit machine code (rust-lang/rust#62256) and on a sufficiently cursed target machine some extra code may be required just to implement whatever Rust ends up choosing as the meaning of "compiler fences" (even if such a target would likely also have crappy codegen for regular loads and stores). It shouldn't, but strong promises about generated machine code are always very dubious. In any case, I see what you're saying. If compiler fences were defined with respect to logical CPU cores, one could make that argument. I wouldn't love relying on it, but it would be better than if compiler fences only meant anything within a single AM thread (because I don't see how one could argue about that).
That seems to bottom out here. I don't know what is going on there. But anyway that's kind of off-topic here.
The challenge for the docs is how to define "logical CPU core". But other than that, I think it is the right definition. So if someone wants to update the docs with an attempt at such a definition, please go ahead. :)
For what it's worth, another use case for cross-thread compiler fences is when they're paired with another thread doing membarrier syscalls (mentioned briefly earlier in the thread) or other ways of achieving the same effect. In those cases, you can't very well argue that the side calling membarrier has taken over the CPU core from the thread using compiler fences, because the latter is not suspended and may in fact be running at the same time. Clearly the OS + hardware establishes some sort of synchronization, but how does that connect to the compiler fence? Perhaps you could pretend that the OS scheduler interrupts issue both a compiler fence and a regular fence every time they fire on a logical core? That seems to imply much more synchronization happening than the AM would like to admit. Maybe it won't prevent any optimizations if userspace code can't rely on those fences, because it can't know when they happen (for lack of any shared state with the interrupt).
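As a sketch of that asymmetric pairing on Linux (using the libc crate; the constant follows membarrier(2), and the process is assumed to have registered with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED at startup):

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

// From membarrier(2): MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3).
const MEMBARRIER_CMD_PRIVATE_EXPEDITED: libc::c_long = 1 << 3;

/// Hot path: publish data with only a compiler fence, no hardware fence.
fn fast_path_publish(flag: &AtomicUsize) {
    // ... writes to shared data here ...
    compiler_fence(Ordering::SeqCst);
    flag.store(1, Ordering::Relaxed);
}

/// Cold path: force every running thread of this process through a barrier,
/// so the fast path's compiler fence behaves like a real fence against it.
fn heavy_barrier() {
    unsafe {
        libc::syscall(libc::SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
    }
}
```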
Is that the one that fires a cross-core interrupt to all other cores, i.e. some sort of asymmetric fence? I think that is a whole new primitive that needs to be added to the memory model as such.
It is an asymmetric fence in a way, but IPI is only one possible way of implementing it. Linux membarrier can either do IPI or it can piggy-back on the RCU grace period (whatever that means, don't ask me). barrierd is a userspace + eBPF solution mostly based on waiting until any interrupt has fired on every core at least once (relying on those being serializing for the CPU).
Interrupts are a different concurrency context from the thread they're running on. Synchronizing between an interrupt handler running on a thread and another thread won't do anything to synchronize between the two threads.
It's not obvious to me why there should be any difference between asking the OS to send my thread a signal every 10ms and doing something funky in the signal handler, or (in bare-metal code) setting up a timer interrupt every 10ms and doing the same funky stuff in the interrupt handler. It's not literally exactly the same, but both seem equally foreign to the AM and both have pretty much the same implications for the hardware memory model.
There are accelerators on which one isn’t even guaranteed to read what one just wrote to memory unless one uses a barrier instruction.
For such an accelerator to run Rust, the compiler would need to insert a barrier between every read and every write. The C++ memory model enforces happens-before within a single thread; if the CPU core doesn't guarantee this, the compiler is required to insert barriers as necessary to enforce it.
Given that this would be ludicrously expensive, I suspect that the compiler would instead only allow local memory (which is usually coherent with program execution) to be accessed using Rust pointer syntax, and require access to incoherent memory to use function calls that are opaque to the compiler.

This is off-topic, though.
Based on the stack sampling example, I think it could be potentially reasonable to document compiler fences as also synchronizing with whatever suspends the thread and inspects its state, but unfortunately it's on the thread suspension to ensure this is actually true on weak memory targets (e.g. by ensuring all weak operations commit before considering a thread to be fully suspended), as otherwise the weak operations could become visible out of order across the intended ordering of the compiler fence.

TL;DR: that's a lot of words to say again that synchronizing within a logical CPU core must be transitive across synchronizing with thread suspension on any useful system.
Atomic fences are specified in terms of things that happen when certain atomic reads occur (https://eel.is/c++draft/atomics.fences):

> A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
In that particular situation, if the load reads-from the store, then the fences kick in and have an effect. That is the only effect they have, I think.
So, if your program contains no atomic accesses, but some atomic fences, those fences do nothing.
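To illustrate with a sketch (mine, not from the issue text): in the following program the fences have nothing to attach to, so the non-atomic accesses remain a data race, i.e. UB, fences notwithstanding:

```rust
use std::sync::atomic::{fence, Ordering};

static mut DATA: u32 = 0;
static mut READY: bool = false;

// Imagine writer() and reader() running on two different threads.
fn writer() {
    unsafe {
        DATA = 42;
        fence(Ordering::Release);
        READY = true; // non-atomic store: no atomic access for the fence to act on
    }
}

fn reader() {
    unsafe {
        if READY { // non-atomic load: a data race despite both fences
            fence(Ordering::Acquire);
            let data = DATA;
            assert_eq!(data, 42);
        }
    }
}
```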
We also think that an atomic fence has at least all the effects of a compiler fence, i.e., a compiler fence is strictly weaker than an atomic fence. But that means a compiler fence has no effect on programs without atomic accesses -- which is just wrong; that's not how they are supposed to behave.
So what is the operational spec of a compiler fence? I have no idea.
See the discussion here for some more details. Let's go on discussing here.