
## Pre-Pre-RFC: core::arch::{load, store} and stricter volatile semantics #321

Open
DemiMarie opened this issue Mar 16, 2022 · 115 comments

DemiMarie commented Mar 16, 2022

### core::arch::{load, store}

This proposes new load and store functions (in core::arch), for raw hardware loads and stores, and a concept of an always-valid type that can safely be cast to a byte array. It also defines volatile accesses in terms of these functions.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that load and store do not violate Rust’s mutability rules.

In particular, it is valid to use these functions to manipulate memory that is being concurrently accessed or modified by any means whatsoever. Therefore, they can be used to access memory that is shared with untrusted code. For example, a kernel could use them to access userspace memory, and a user-mode server could use them to access memory shared with a less-privileged user-mode process. It is also safe to use these functions to manipulate memory that is being concurrently accessed via DMA, or that corresponds to a memory-mapped hardware register.

The core guarantee that makes load and store useful is this: a call to load or store is guaranteed to result in exactly one non-tearing, non-interlocked load from or store to the exact address passed to the function, no matter what that address happens to be. To ensure this, load and store are considered partially opaque to the optimizer. The optimizer must treat them as calls to functions that may or may not dereference their arguments. It is even possible that the operation triggers a hardware fault that some other code catches and recovers from. Hence, the compiler can never prove that a given call to core::arch::load or core::arch::store will have undefined behavior. At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.

The actual functions are as follows:

```rust
unsafe fn load<T>(ptr: *const T) -> T;
unsafe fn store<T>(ptr: *mut T, arg: T);
```

Each call performs a single memory access (of size size_of::<T>()) on ptr. The compiler must compile each of these function calls into exactly one machine instruction. If this is not possible, it is a compile-time error. The set of types T for which a compiler can successfully generate code for these calls depends on the target architecture. Using a T that cannot safely be transmuted to or from a byte array is not forbidden, but is often erroneous, and therefore triggers a lint (see below). Provided that ptr is properly aligned, these functions are guaranteed not to cause tearing. If ptr is not properly aligned, the results are architecture-dependent.

The optimizer is not permitted to assume that ptr is dereferenceable or that it is properly aligned. This allows these functions to be used for in-process debuggers, crash dumpers, and other applications that may need to access memory at addresses obtained from some external source, such as a debug console or /proc/self/maps. If load is used to violate the aliasing rules (by accessing memory the compiler thinks cannot be accessed), the value returned may be non-deterministic and may contain sensitive data. If store is used to overwrite memory the compiler can assume will not be modified, subsequent execution (after the call to store returns) has undefined behavior.
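
To make the intended usage concrete, here is a minimal sketch of a memory-mapped I/O access written against the proposed API. Nothing here exists yet: core::arch::{load, store} is the proposal itself, and the UART addresses and status bit are invented for illustration.

```rust
// Hypothetical sketch: `core::arch::{load, store}` is the proposed API, and the
// UART register addresses and bit layout are made up.
use core::arch::{load, store};

unsafe fn uart_write_byte(b: u8) {
    let status = 0x1000_0000usize as *const u32; // hypothetical status register
    let data = 0x1000_0004usize as *mut u32;     // hypothetical data register
    // Each call lowers to exactly one machine access; the compiler may not
    // assume the register behaves like ordinary memory or cache its value.
    while load(status) & 0x1 == 0 {}             // spin until the TX-ready bit is set
    store(data, b as u32);
}
```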

### The semantics of volatile

A call to ptr::read_volatile desugars to one or more calls to load, and a call to ptr::write_volatile desugars to one or more calls to store. The compiler is required to minimize tearing to the extent possible, provided that doing so does not require the use of interlocked or otherwise synchronized instructions. const fn core::arch::volatile_non_tearing::<T>() -> bool returns true if T is such that tearing cannot occur for naturally-aligned accesses. It may still occur for non-aligned accesses (see below).

### Unaligned volatile access

The compiler is not allowed to assume that the arguments of core::{ptr::{read_volatile, write_volatile}, arch::{load, store}} are aligned. However, it is also not required to generate code to handle unaligned access, if doing so would cause a performance penalty for the aligned case. In particular, whether the no-tearing guarantee applies to unaligned access is architecture dependent. On some architectures, it is even possible for unaligned access to cause a hardware trap.
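
As a sketch (again assuming the proposed core::arch::load), a caller that needs the no-tearing guarantee can check alignment itself before issuing the access:

```rust
// Sketch: the no-tearing guarantee only applies to naturally aligned accesses,
// so check alignment first if tearing would be unacceptable.
unsafe fn load_u32_no_tear(addr: usize) -> Option<u32> {
    if addr % core::mem::align_of::<u32>() != 0 {
        return None; // unaligned: tearing or even a hardware trap is architecture-dependent
    }
    Some(core::arch::load(addr as *const u32))
}
```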

### New lints

Use of core::ptr::{read_volatile, write_volatile} with a type that cannot be safely transmuted to and from a byte slice will trigger a dubious_type_in_volatile lint. Use of core::arch::{load, store} with such types will trigger a dubious_type_in_load_or_store lint. Both are Warn by default. Thanks to @comex for the suggestion!

### Lowering

LLVM volatile semantics are still unclear, and may turn out to be weaker than necessary. It is also possible that LLVM volatile requires dereferenceable or otherwise interacts poorly with some of the permitted corner-cases. Therefore, I recommend lowering core::{arch::{load, store}, ptr::{read_volatile, write_volatile}} to LLVM inline assembly instead, which is at least guaranteed to work. This may change in the future.
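
For illustration only, a hand-written x86_64 equivalent of such a lowering might look like the following; this is a sketch of the idea for a 4-byte access, not the actual compiler codegen.

```rust
// Sketch of the "lower to inline assembly" idea on x86_64.
use core::arch::asm;

#[cfg(target_arch = "x86_64")]
unsafe fn load_u32_via_asm(ptr: *const u32) -> u32 {
    let out: u32;
    asm!(
        "mov {out:e}, dword ptr [{ptr}]",   // exactly one load instruction
        ptr = in(reg) ptr,
        out = out(reg) out,
        options(nostack, preserves_flags),
    );
    out
}

#[cfg(target_arch = "x86_64")]
unsafe fn store_u32_via_asm(ptr: *mut u32, val: u32) {
    asm!(
        "mov dword ptr [{ptr}], {val:e}",   // exactly one store instruction
        ptr = in(reg) ptr,
        val = in(reg) val,
        options(nostack, preserves_flags),
    );
}
```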


comex commented Mar 16, 2022

I like this, except:

If such a transmute would be unsafe (in the sense of Project Safe Transmute), it is a compile-time error (regardless of the target platform)

This seems like it should be a trait bound.

write_volatile<T> is a compile-time error if T has any padding bits or an unspecified layout, as it has no useful semantics in that case. Similarly, read_volatile<T> is a compile-time error if T has any invalid representations.

This part is a breaking change and doesn't seem well motivated to me.

For writes: Writing padding bits is potentially a security concern due to the potential to leak memory contents, but it doesn't seem inherently unsound; any undefined bits should just be implicitly frozen to an arbitrary value. As for unspecified layout, if by that you mean things like repr(Rust), this layout can still be probed at runtime, or perhaps you don't care about the layout because you only need to read the value back later as the same type in the same program.

For reads: Just because volatile is well-suited for dealing with untrusted or potentially-corrupted memory doesn't mean that's the only possible use case. You may happen to know for whatever reason that the load will return a valid value. Perhaps you're reading it from an MMIO register; perhaps you're abusing volatile to implement atomics (bad idea but, in the right circumstances, not unsound); perhaps the load doesn't have to be volatile but is anyway due to some generics madness.

All of these cases seem dubious enough to be worth a lint, but I'm skeptical that they should be hard errors even with the new functions, let alone the existing already-stable functions.

@DemiMarie (Author)

All of these cases seem dubious enough to be worth a lint, but I'm skeptical that they should be hard errors even with the new functions, let alone the existing already-stable functions.

Agreed, lint added.

Perhaps you're reading it from an MMIO register

I generally assume that MMIO devices are not automatically trustworthy, but your point stands.


bjorn3 commented Mar 16, 2022

Therefore, the Rust compiler is never allowed to make assumptions about the memory accessed by these functions, or the results of such accesses.

That can't be done without forcing deoptimization of any program that may call this. To prevent deoptimization it would be better to say that it can access any memory which an opaque function that gets passed the pointer as argument may access. That would for example not include stack variables which don't have their address taken.
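
A small sketch of that distinction (using the proposed core::arch::store; the helper function is hypothetical):

```rust
// Under the "opaque function" model, `x` never has its address taken, so the
// opaque store below cannot reach it and the compiler may keep `x` in a register.
unsafe fn example(p: *mut u32) {
    let mut x = 1u32;          // address never escapes
    core::arch::store(p, 42);  // treated like a call to an unknown fn receiving `p`
    x += 1;                    // may still be constant-folded to 2
    use_value(x);
}

fn use_value(_v: u32) { /* hypothetical consumer */ }
```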

@DemiMarie (Author)

Therefore, the Rust compiler is never allowed to make assumptions about the memory accessed by these functions, or the results of such accesses.

That can't be done without forcing deoptimization of any program that may call this. To prevent deoptimization it would be better to say that it can access any memory which an opaque function that gets passed the pointer as argument may access. That would for example not include stack variables which don't have their address taken.

Is this also the semantics of using inline assembly? The goal is that volatile operations will always operate on whatever happens to be at that memory address; the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly. The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers. These programs need to be able to access whatever happens to be at an arbitrary caller-provided memory location and know the compiler will not try to outsmart them. This is also why using these functions to dereference null or dangling pointers is explicitly permitted. Just because you can use these functions to read from a piece of memory does not mean that Rust makes any guarantees whatsoever about what you will find there, or that your attempt won’t cause a hardware fault of some sort. Similarly, just because you can use these functions to write to a piece of memory does not mean that Rust makes any guarantees as to what impact that will have on other code. If you misuse them and something breaks, you get to keep both pieces.

@DemiMarie (Author)

@bjorn3: do you have suggestions for better phrasing here? The intent is that you can use volatile memory access to peek and poke at whatever memory you want, but the consequences of doing so are entirely your responsibility. For instance, I might need to test that my program’s crash handler triggers successfully when I dereference a null pointer, or that a hardened memory allocator detects modification of freed memory and properly aborts.


bjorn3 commented Mar 16, 2022

Is this also the semantics of using inline assembly?

I believe so.

the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly.

It has to for any optimization to be possible.

The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

Didn't see that sentence. I agree that covers my situation.

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers.

Those things are UB either way. Individual compilers just make a best effort to have them work the way a user expects when optimizations are disabled. When optimizations are enabled it is even more of a best-effort affair. For example, it may not be possible to change function parameters if the compiler found that a function argument is constant and optimized accordingly.

@DemiMarie (Author)

Is this also the semantics of using inline assembly?

I believe so.

Good to know!

the compiler can’t just say “I know this volatile load or store will have undefined behavior if X” and optimize accordingly.

It has to for any optimization to be possible.

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

The situation you are referring to is supposed to be covered by, “At the same time, a call to load or store does not disable any optimizations that a call to an unknown function with the same argument would not also disable. In short: garbage in, garbage out.”

Didn't see that sentence. I agree that covers my situation.

Thank you.

The reason for these seemingly contradictory requirements is that I want volatile memory access to be usable for in-process debuggers and crash dumpers.

Those things are UB either way. Individual compilers just make a best effort to have them work the way a user expects when optimizations are disabled. When optimizations are enabled it is even more of a best-effort affair. For example, it may not be possible to change function parameters if the compiler found that a function argument is constant and optimized accordingly.

That behavior is perfectly acceptable (though ideally it would be reflected in the debug info). I wonder if our definitions of UB are slightly different. To me, an execution with UB has no meaning at all, and the compiler is allowed to prune any basic block that invokes UB. core::arch::{load, store} never invoke this form of UB themselves, but are unsafe because they can cause a crash or even invoke UB in unrelated code.


Diggsey commented Mar 16, 2022

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

If the code is unreachable, then of course the compiler is permitted to optimize it away. The compiler is allowed to assume that no UB occurs within the program when determining what is reachable (and for all other optimizations). This is true for all code: there's no special case for atomics.

@DemiMarie (Author)

To elaborate, what I am not okay with is the compiler optimizing out the entire basic block as unreachable code. Compilers have a nasty habit of doing that, so I wanted to be absolutely clear that is not permitted.

If the code is unreachable, then of course the compiler is permitted to optimize it away. The compiler is allowed to assume that no UB occurs within the program when determining what is reachable (and for all other optimizations). This is true for all code: there's no special case for atomics.

According to the model mentioned above (a volatile memory access is an opaque function call or inline assembler), the compiler does not actually know that std::ptr::read_volatile(x) actually dereferences x, merely that it may dereference x. So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.


Diggsey commented Mar 16, 2022

So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.

Right, but whether or not x is dereferenceable is only one potential source of UB. The compiler could still decide that the call is unreachable for other reasons (say x is the result of an expression that invokes UB).

I'd also question whether it makes sense to be quite so lenient with x - it definitely makes sense that x might not be dereferenceable, but what if it's undef or poison? I don't really have the expertise to answer that though...

@DemiMarie (Author)

So it cannot prove that std::ptr::read_volatile(x) is UB no matter what x is.

Right, but whether or not x is dereferenceable is only one potential source of UB. The compiler could still decide that the call is unreachable for other reasons (say x is the result of an expression that invokes UB).

That’s fine.

I'd also question whether it makes sense to be quite so lenient with x - it definitely makes sense that x might not be dereferenceable, but what if it's undef or poison? I don't really have the expertise to answer that though...

I decided to err on the side of the simplest possible semantics.


RalfJung commented Mar 18, 2022

Thanks for writing this up!

I was confused what the difference to volatile would be. Is it fair to say that this is morally what volatile "ought to be", and the only reason you are avoiding the term volatile is that volatile has bad interactions with atomic accesses in C/C++/LLVM?

an always-valid type that can safely be cast to a byte array

This sounds contradictory: a byte array ([u8; N]) cannot hold all data (e.g., uninit memory), so an always-valid type cannot be safely cast to a byte array.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that loads via store do not violate Rust’s mutability rules.

"loads via store" seems odd; I assume "loads and stores" is what you meant?

I think this is too weak. In particular, I think the compiler should be able to assume that the given address is the only memory (that the compiler knows about) that this operation can access, and that it will not have other side-effects (like synchronization) either. So for example if arch::load is called on some *const i32 that does not alias some other pointer p, the compiler is allowed to reorder regular stores to p with those non-aliasing loads.
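
A sketch of the reordering being described (assuming the proposed core::arch::load): if the compiler can prove the two pointers do not alias, it could move the ordinary store past the hardware load.

```rust
// If `a` and `b` are known not to alias, the compiler should be free to reorder
// the ordinary store to `*a` with the arch::load from `b`.
unsafe fn reorder_candidate(a: *mut i32, b: *const i32) -> i32 {
    *a = 7;                 // ordinary store
    core::arch::load(b)     // assumed to touch only the memory at `b`
}
```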

This also makes these operations quite different from inline assembly, which is allowed to do much more. With inline assembly, the intention is that even swapping out the assembly block by some other code at runtime (that still adheres to the same clobber annotations) is allowed. I don't think we want that for arch::{load, store}.

non-tearing non-interlocked load from or store to

What is an "interlocked" load/store?

The actual functions are as follows:

I am very strongly opposed to using usize to indicate memory accesses. In particular, the aliasing rules still matter (as you say), and hence provenance still matters. That means these functions should work on pointers, not integers.

T must be a type such that T can safely be transmuted to (resp. from) [u8; size_of::<T>()]. Otherwise, the behavior of this function is the same as that of inline assembly that contains a single non-atomic load (resp. store) instruction of the correct size.

"Otherwise" here sounds like "in case T cannot be safely transmuted as described", but I doubt that is what you mean.

"non-atomic" is a bad choice here, since assembly languages don't even distinguish atomic and non-atomic accesses -- that is a surface language thing. And the entire point of your proposal is that these accesses are atomic, in the sense of not introducing data races.

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB. If there are non-atomic reads concurrent to an overlapping arch::store, that is UB. And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"? That would be in contradiction with some of the reorderings I mentioned above.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

@Lokathor (Contributor)

Ralf I'm 99% certain that "always valid" is meant to be read as "all initialized bit patterns are valid", not that it's also allowing uninit memory.

@RalfJung (Member)

Ralf I'm 99% certain that "always valid" is meant to be read as "all initialized bit patterns are valid", not that it's also allowing uninit memory.

That still leaves provenance as a concern -- transmuting pointers with provenance to integers is subtle at best, making (arrays of) integer types like u8 IMO unsuited as "generic data containers".

@Lokathor (Contributor)

I'm pretty sure that everything you just said also applies if you "raw" read/write a pointer to or from memory, right? Because provenance isn't a physical concept that's actually stored in a device. So if you raw-read a pointer from memory you already don't know the provenance, regardless of whether an array of u8 is involved. It would have to be treated as an int read with an automatic int2ptr cast, or something like that.


RalfJung commented Mar 18, 2022

provenance isn't a physical concept that's actually stored in a device

Sure (except on CHERI ;).
But that doesn't mean that the Abstract Machine model of an arch::load will always return provenance-free data. That would be a very strong assumption that I would not be willing to make. It would mean you couldn't use arch::load to load a pointer that has provenance attached to it, making a bunch of code UB that probably shouldn't be UB.

```rust
let mut loc = &mut 0i32;
let ptr = arch::load(&loc as *const _ as *const *mut i32);
let _val = *ptr; // UB, since this ptr does not have the right provenance to access `loc`
```

@Lokathor (Contributor)

Hm. What are the provenance rules if asm! is involved? It seems that the intent of these load/store functions is to have things work "basically like the inline assembly would", and just strip away all the other things asm also does, to have a simpler API for this particular use case.


RalfJung commented Mar 18, 2022

Making asm! formally precise is a lot harder than making volatile accesses precise. ;)
But I don't think we would want the "effect on the Abstract Machine state" that asm! can have to be limited to things not involving provenance. The equivalent of the above code written using asm! instead of arch::load should likely also be defined behavior, after all.

@DemiMarie (Author)

@andyhhp what are your thoughts? What semantics do you want for volatile loads and stores?

@DemiMarie (Author)

Thanks for writing this up!

You’re welcome!

I was confused what the difference to volatile would be. Is it fair to say that this is morally what volatile "ought to be", and the only reason you are avoiding the term volatile is that volatile has bad interactions with atomic accesses in C/C++/LLVM?

Yes and no. Volatile in C/C++/LLVM does have unclear semantics with respect to concurrency, but it also has unclear semantics in other ways as well, which means that those writing low-level code might find themselves resorting to asm!. I would rather avoid that.

an always-valid type that can safely be cast to a byte array

This sounds contradictory: a byte array ([u8; N]) cannot hold all data (e.g., uninit memory), so an always-valid type cannot be safely cast to a byte array.

In my mind, core::arch::load doesn’t know (or care) if the memory it accesses is initialized or not. It just grabs whatever happens to be at the given hardware address. So using it to access uninitialized memory isn’t UB, but rather produces an initialized but unpredictable value.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that loads via store do not violate Rust’s mutability rules.

"loads via store" seems odd; I assume "loads and stores" is what you meant?

Yes

I think this is too weak. In particular, I think the compiler should be able to assume that the given address is the only memory (that the compiler knows about) that this operation can access, and that it will not have other side-effects (like synchronization) either. So for example if arch::load is called on some *const i32 that does not alias some other pointer p, the compiler is allowed to reorder regular stores to p with those non-aliasing loads.

I’m not sure about that. @andyhhp @marmarek @HW42 thoughts?

This also makes these operations quite different from inline assembly, which is allowed to do much more. With inline assembly, the intention is that even swapping out the assembly block by some other code at runtime (that still adheres to the same clobber annotations) is allowed. I don't think we want that for arch::{load, store}.

That is a fair point. The reason I mentioned asm! is that asm! is the only thing I know of in current Rust that does what is needed here. In other words, I know asm! is sufficient, and I am not aware of anything else that is.

non-tearing non-interlocked load from or store to

What is an "interlocked" load/store?

core::arch::{load, store} should lower to a single mov instruction or similar, not cmpxchg or the like, and shouldn’t insert any memory barrier instructions.

The actual functions are as follows:

I am very strongly opposed to using usize to indicate memory accesses. In particular, the aliasing rules still matter (as you say), and hence provenance still matters. That means these functions should work on pointers, not integers.

That’s valid, though most use-cases I can think of will at the very least want a newtype. What matters here is that I need to be able to write:

```rust
// Peek at an address typed at a debug prompt (string type and error type adjusted
// so the sketch parses; `core::arch::load` is the proposed function).
fn peek_u32(s: &str) -> Result<u32, std::num::ParseIntError> {
    let addr: usize = s.parse()?;
    Ok(unsafe { core::arch::load(addr as *const u32) })
}
```

and not have the compiler second-guess me. I know that converting some user-provided string to a usize, casting it to a pointer, and then dereferencing the pointer is almost always wrong (and doesn’t work on CHERI), but kernel debuggers actually need to do it, and the compiler shouldn’t be breaking them.

T must be a type such that T can safely be transmuted to (resp. from) [u8; size_of::<T>()]. Otherwise, the behavior of this function is the same as that of inline assembly that contains a single non-atomic load (resp. store) instruction of the correct size.

"Otherwise" here sounds like "in case T cannot be safely transmuted as described", but I doubt that is what you mean.

Yeah, I wrote this when I was quite tired, so there are definitely some thinkos.

"non-atomic" is a bad choice here, since assembly languages don't even distinguish atomic and non-atomic accesses -- that is a surface language thing. And the entire point of your proposal is that these accesses are atomic, in the sense of not introducing data races.

Fair point. What I meant is that the compiler does not need to insert any memory barriers.

But this reminds me, the interactions with the concurrency model need to be specified better I think. If there are overlapping concurrent non-atomic writes to the same location, that is UB. If there are non-atomic reads concurrent to an overlapping arch::store, that is UB. And if there are concurrent overlapping atomic accesses, then what are the possible observed values on read accesses? If my hardware has strong consistency guarantees, like x86, can I use this to do the equivalent of an "acquire read"?

You can certainly use it as a “release read”; I am not sure about the others. My intuition is that a volatile load should imply a compile-time memory barrier, but I will leave that to the people who actually use them in anger.

That would be in contradiction with some of the reorderings I mentioned above.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

Nothing, other than that I don’t known what the semantics of LLVM volatile atomic accesses are 🙂.

More generally, the main uses I can think of for volatile operations are in machine-dependent code, which is why my proposal defers to hardware semantics in so many places. If I am manipulating page tables in a kernel or hypervisor, or accessing MMIO registers in a driver, I need to know that core::arch::{load, store} will lower to exactly one machine instruction, and that that instruction is guaranteed not to tear. I also need to know that it will access whatever machine address I specify, even if that address is something that (from the compiler’s perspective) I have no business looking at. I might be running on a platform where address 0 really does have memory that I need to access, or I might be deliberately dereferencing NULL to make sure that my page fault handler works properly. There are almost certainly lots of other uses I haven’t considered here, but which should still work.

Making asm! formally precise is a lot harder than making volatile accesses precise. ;)

Welcome to the messy world of interacting with hardware 😄.

But I don't think we would want the "effect on the Abstract Machine state" that asm! can have to be limited to things not involving provenance. The equivalent of the above code written using asm! instead of arch::load should likely also be defined behavior, after all.

Agreed.

To avoid problems with LLVM’s unclear volatile semantics, the LLVM backend should in fact lower this function to LLVM inline assembler.

What is wrong with using LLVM volatile atomic accesses?

The problem with LLVM volatile atomic accesses is that their semantics are not known to be what I would want for the kind of low-level applications I am envisioning. In particular, my understanding is that their semantics are architecture-independent, whereas I believe lots of use-cases explicitly want semantics that are architecture-dependent. Some synchronization primitives, for instance, explicitly rely on the fact that modern hardware (DEC Alpha is long dead) does not reorder dependent loads even if there is no hardware memory barrier.


RalfJung commented Mar 19, 2022

Yes and no. Volatile in C/C++/LLVM does have unclear semantics with respect to concurrency, but it also has unclear semantics in other ways as well, which means that those writing low-level code might find themselves resorting to asm!. I would rather avoid that.

Which other ways are you referring to here? That seems worth discussing in depth in the RFC.

In my mind, core::arch::load doesn’t know (or care) if the memory it accesses is initialized or not. It just grabs whatever happens to be at the given hardware address. So using it to access uninitialized memory isn’t UB, but rather produces an initialized but unpredictable value.

We have to describe its effect in the Abstract Machine though.
So it seems you are proposing that arch::load implicitly does something like freeze. This seems worth discussing explicitly. (And it doesn't entirely solve the problem as other "non-standard" Bytes remain, due to provenance -- see my last comments above.)
Rust doesn't currently have freeze, so exposing this seems like a fundamental extension to the expressivity of the language. OTOH if you interact with untrusted parties and they can make that memory uninit, you have a problem -- for these cases you have to be able to rely on "no uninit".

That’s valid, though most use-cases I can think of will at very least want a newtype. What matters here is that I need to be able to write:

I think "where you get the pointer from" and having volatile (for lack of a better term) semantics on the access are entirely orthogonal concerns. So I don't think this should affect the type signature of these operations.

Integers have no provenance, so whether you parsed that integer from a string or got it via a ptr-to-int cast makes no difference, even with regular Rust memory accesses.

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Also after what you said earlier about "non-interlocked" semantics, I would strongly expect that these operations do not have any synchronizing side-effect (e.g., the compiler can treat them as nosync).

(Interestingly, I just noticed LLVM LangRef says volatile accesses are not nosync. I am very confused by this.)

The problem with LLVM volatile atomic accesses is that their semantics are not known to be what I would want for the kind of low-level applications I am envisioning. In particular, my understanding is that their semantics are architecture-independent, whereas I believe lots of use-cases explicitly want semantics that are architecture-dependent. Some synchronization primitives, for instance, explicitly rely on the fact that modern hardware (DEC Alpha is long dead) does not reorder dependent loads even if there is no hardware memory barrier.

My thinking was that we want to give an architecture-independent description that is enough to tell us what the compiler can and cannot do. The exact effect of course is architecture-dependent, but unlike asm! blocks I think here we can give a reasonable "envelope" of what the effect of an arch::{load,store} on the Abstract Machine state can be. (For example: can it ever return uninit bytes? can it return bytes with provenance? Which parts of the AM memory can be mutated?)

I fully agree re: accessing NULL and shenanigans like that. I am pretty sure that volatile atomic accesses are guaranteed not to tear (that is AFAIK generally true for atomic accesses). I think people want LLVM volatile (atomic) accesses to be what you describe here, so IMO it would make sense to use them and then work with LLVM in case things don't work out.

Either way, the Rust semantics should be described independently of LLVM, so we can change the implementation strategy as needed. But given that these things are so hard to describe, drawing comparisons with things that are at least somewhat established makes sense IMO.

@RalfJung (Member)

Your example about synchronization patterns worries me quite a bit; I am not sure we want to encourage using this kind of operation as a replacement for our regular atomics. Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

For example, it is generally a permissible transformation for a compiler to turn

```rust
let i = make_i();
work_on_i(i);
```

into

```rust
let i = make_i();
if i == 0 {
  work_on_i(0);
} else {
  work_on_i(i);
}
```

However, if i is initialized with an arch::load, and work_on_i does another arch::load on something like other_ptr.add(i), then this transformation will turn a data-dependent load into a merely control-dependent load (and some modern hardware does very shady things with control dependencies). This optimization is a universal principle, i.e., it works even when the code in make_i and work_on_i is not known to the compiler -- thus, no amount of wording in arch::load can make this optimization invalid.
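
A sketch of that pattern at the source level (proposed API, hypothetical pointers); the rewrite above turns the second load's data dependency on i into a mere control dependency:

```rust
// The address of the second load depends on the value produced by the first.
unsafe fn dependent_loads(idx_ptr: *const usize, other_ptr: *const u32) -> u32 {
    let i = core::arch::load(idx_ptr);   // first load
    core::arch::load(other_ptr.add(i))   // data-dependent second load
}
```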

In that sense, I don't think your idea of an architecture-dependent semantics works (or maybe I misunderstood what you meant by it). The only way to exploit this kind of "dependent load" pattern is to have both loads in the same asm! block, so that the architecture's notion of a "dependency" can be used. But there cannot be any "architecture dependency edges" between asm! blocks (or between arch::{load,store}); any kind of synchronization between those blocks must be fully captured by the Abstract Machine semantics.

@DemiMarie (Author)

I fully agree re: accessing NULL and shenanigans like that.

Thanks. Good to know that we agree on that.

I am pretty sure that volatile atomic accesses are guaranteed not to tear (that is AFAIK generally true for atomic accesses).

It’s actually more subtle: volatile accesses should be guaranteed not to tear if the argument is properly aligned, but if the argument is not aligned, the behavior is hardware-dependent. On x86, accesses that cross a cache line may tear, while on some platforms unaligned access actually faults. Volatile accesses are a raw interface and should not try to hide this behavior.

I think people want LLVM volatile (atomic) accesses to be what you describe here, so IMO it would make sense to use them and then work with LLVM in case things don't work out.

That is a good idea, but I would like to make sure that the bugs are worked out before this gets pushed.

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.
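
For reference, the reader side in question amounts to roughly the following (a sketch of today's "wing it" approach, not a proposal): a Relaxed pointer load followed by a dereference whose address depends on the loaded value.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// rcu_dereference-style read: no barrier on the reader side. Correctness relies
// on the compiler and hardware preserving the address dependency, which the
// language does not actually guarantee.
unsafe fn rcu_read<T: Copy>(head: &AtomicPtr<T>) -> T {
    let p = head.load(Ordering::Relaxed);
    unsafe { *p } // dependent load through `p`
}
```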

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Release loads are needed for seqlocks, and the alternatives available in Rust have such poor performance that in practice people wind up using an approach that is technically UB. Rust doesn’t have the atomic_thread_fence primitive needed to implement a high performing approach safely, and even that isn’t really what is wanted.


digama0 commented Mar 20, 2022

You can certainly use it as a “release read”

"release read"? In C11, release is for writes and acquire for reads, so a "release read" doesn't make a ton of sense to me.

Release loads are needed for seqlocks, and the alternatives available in Rust have such poor performance that in practice people wind up using an approach that is technically UB. Rust doesn’t have the atomic_thread_fence primitive needed to implement a high performing approach safely, and even that isn’t really what is wanted.

It's a bit off topic, but I got curious about what a "release load" is and whether we need one, since, like Ralf, I can't make sense of it as a concept. Here's a simplified version of the code from the blog post focusing on order of operations:

```rust
fn read() {
  let seq1 = seq.load(Acquire); // R1
  let result = data.read_volatile(); // R2
  let seq2 = seq.load(Release); // R3
  // do stuff
}

static seq: AtomicUsize = _; // W0
static data: UnsafeCell<T> = _; // W0
fn write() {
  seq.compare_and_swap(_, _, Acquire); // W1
  data.write_volatile(_); // W2
  seq.store(_, Release); // W3
}
```

It is claimed that we want operation R3 to be a release load, because we want to ensure that R2 is not moved after R3. This makes some sense from the perspective of "roach motel" reorderings, but from the "message passing" interpretation a release load doesn't make sense since load operations can't broadcast information to synchronize on.

To break it down, the litmus test situation showing why it would be bad to reorder R2 and R3 is where R2 reads the result of W2, and R3 reads the result prior to W1, leading to a disallowed cycle since R3 <rf- W0 -mo> W2 -rf> R2 -sb> R3 is a violation of the coherence axiom of C11 atomics (where rf is reads-from (a write relates to a read that reads from it), mo is modification-order (subsequent writes to an atomic location), and sb is sequenced-before (program order within a thread)). However this requires the mo ordering on data, which requires that the accesses to data are atomic and should be release/acquire as well.

What is needed to make this work is to add a read-read fence between R2 and R3. This effectively upgrades the prior read_volatile operation to an acquire load. The nearest equivalent in rust is atomic::fence(Acquire), but this only works on atomic operations, so perhaps what we need is a stronger kind of fence that works on non-atomic operations as well. I don't think that a "release load" is the right way to solve the problem because the loads aren't publishing information to other threads (and in this example we explicitly don't want to incur the cost of getting exclusive ownership of the cache line in the reader thread).
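
A minimal sketch of that fence-based reader, using today's std APIs; the data read is still non-atomic here, which is exactly the part C11 does not bless:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<T>,
}

impl<T: Copy> SeqLock<T> {
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);              // R1
            let value = unsafe { self.data.get().read_volatile() }; // R2 (still non-atomic)
            fence(Ordering::Acquire);                               // read-read fence
            let s2 = self.seq.load(Ordering::Relaxed);              // R3
            if s1 == s2 && s1 % 2 == 0 {
                return value; // no writer was active across the reads
            }
        }
    }
}
```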

@DemiMarie (Author)

a stronger kind of fence that works on non-atomic operations

This should (hopefully!) be enough for seqlocks, but I am not sure if it is enough for RCU.

@DemiMarie (Author)

I just made an edit that should hopefully clean up some stuff. I should probably make an actual RFC out of this.


comex commented Mar 22, 2022

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

You may be fully aware of the following, but I'll say it anyway, since I've seen suggestions floating around before that RCU and seqlocks are uniquely problematic in Rust.

Today, rustc compiles atomics, volatile accesses (the existing ones), and inline assembly to the same LLVM IR as Clang does their C counterparts, at which point LLVM's optimizer applies the same optimizations to it. rustc does additionally have some MIR optimizations which are specific to Rust, but they're pretty minimal and unlikely to cause a problem here. So when it comes to relying on data dependencies being preserved, the Linux kernel's "wing it" approach should work equally well in Rust as in C.

Which is to say: it does work most of the time in practice, and most examples where it breaks are somewhat pathological/unrealistic code. But on the other hand, it's hard to rule out that there might be some real code where it breaks, and LLVM and GCC's optimizers will only get smarter in the future.

We shouldn't encourage typical Rust users to rely on it, but we also should make sure the Linux project doesn't see it as a reason to avoid adopting Rust.

This RFC's choice to use inline assembly rather than LLVM volatile accesses doesn't help: as Ralf showed, the main problematic optimization for data dependencies can happen regardless of what construct is used for the store to memory. I support this RFC's choice to use inline assembly, because LLVM's volatile semantics are unclear and might hypothetically be weaker than desired. But in practice I would be shocked if inline assembly blocked any optimizations that LLVM volatile doesn't, except by accident (e.g. due to a bug in LLVM volatile, or due to LLVM being overly conservative with inline assembly in a way that isn't related to the guarantees needed here). Though note: LLVM volatile doesn't have an implicit compiler barrier, so if we decide to add one to this API, the proper comparison would be to LLVM volatile combined with an explicit compiler barrier.

@DemiMarie (Author)

Relying on "dependent loads" establishing some sort of synchronization seems like a problem to me, and not least because the entire notion of a "dependency" is on very shaky grounds in a surface language; there's a reason consume isn't working out.

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

You may be fully aware of the following, but I'll say it anyway, since I've seen suggestions floating around before that RCU and seqlocks are uniquely problematic in Rust.

Today, rustc compiles atomics, volatile accesses (the existing ones), and inline assembly to the same LLVM IR as Clang does their C counterparts, at which point LLVM's optimizer applies the same optimizations to it. rustc does additionally have some MIR optimizations which are specific to Rust, but they're pretty minimal and unlikely to cause a problem here. So when it comes to relying on data dependencies being preserved, the Linux kernel's "wing it" approach should work equally well in Rust as in C.

Which is to say: it does work most of the time in practice, and most examples where it breaks are somewhat pathological/unrealistic code. But on the other hand, it's hard to rule out that there might be some real code where it breaks, and LLVM and GCC's optimizers will only get smarter in the future.

We shouldn't encourage typical Rust users to rely on it, but we also should make sure the Linux project doesn't see it as a reason to avoid adopting Rust.

I agree, and I also think we should work on a solution for both Rust and C that doesn’t have undefined behavior.

This RFC's choice to use inline assembly rather than LLVM volatile accesses doesn't help: as Ralf showed, the main problematic optimization for data dependencies can happen regardless of what construct is used for the store to memory. I support this RFC's choice to use inline assembly, because LLVM's volatile semantics are unclear and might hypothetically be weaker than desired.

👍

But in practice I would be shocked if inline assembly blocked any optimizations that LLVM volatile doesn't, except by accident (e.g. due to a bug in LLVM volatile, or due to LLVM being overly conservative with inline assembly in a way that isn't related to the guarantees needed here). Though note: LLVM volatile doesn't have an implicit compiler barrier, so if we decide to add one to this API, the proper comparison would be to LLVM volatile combined with an explicit compiler barrier.

@andyhhp @marmarek: would a compiler barrier be desired here?


RalfJung commented Mar 23, 2022

volatile accesses should be guaranteed not to tear if the argument is properly aligned

That is certainly not the case in general as volatile accesses can be arbitrarily large. I doubt my read_volatile on [u64; 64] will be compiled in a non-tearing way. ;)

The problem is that Linux RCU has a hard dependency on consume to get unlimited read-side scaling. Right now they use relaxed and rely on GCC and Clang not breaking it. Rust ought to be able to express rcu_dereference without a huge performance penalty.

Is doing this with inline assembly an option (in the way I described above, where the syntactic dependency is all within the same asm block)? If not, then we'd still be no worse off than C if we tell people to use Relaxed...

Release loads are needed for seqlocks

I literally can't make sense of it -- I think of release/acquire in terms of models like this, and "release load" just doesn't make sense as a concept. 'Release' means that the write event generated by this operation captures the 'view' of the current thread such that when that write event is later observed by an 'acquire' read event, the attached 'view' is integrated into the 'view' of the thread that did the 'acquire'. A load does not generate a write event, so saying it is 'release' is a category error.

The blog post thinks of atomic operations as being defined in terms of reorderings, but that is not how they actually work -- defining an operation in terms of its allowable optimizations is a sure path to madness.

The main issue with seqlocks (as far as I am aware) is the data read, which people want to be non-atomic but that is not allowed under C11 because it's a data race. Using volatile reads is a gross and non-compliant hack to work around that. The proper solution IMO is to take a hint from the LLVM memory model (which is different from the C++ memory model) and say that a racy read returns undef, but is not UB. Then you could read self.data into a MaybeUninit, do the second self.seq load to verify that there was no race, and call assume_init. If all seq.load are acquire (and the RMWs that increment seq are RelAcq), is there still anything that goes wrong?
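
A sketch of that suggestion, under the assumption (not currently true in Rust) that a racy read yields undef rather than UB:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<T>,
}

impl<T: Copy> SeqLock<T> {
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            // Hypothetically a racy-but-not-UB read into MaybeUninit; under the
            // current Rust rules this plain read would still be a data race.
            let tmp: MaybeUninit<T> =
                unsafe { (self.data.get() as *const MaybeUninit<T>).read() };
            let s2 = self.seq.load(Ordering::Acquire);
            if s1 == s2 && s1 % 2 == 0 {
                // The sequence number did not change, so no write raced with the
                // read above and the bytes are the initialized values.
                return unsafe { tmp.assume_init() };
            }
        }
    }
}
```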

(After re-reading this old classic): okay I think I see the problem, but that paper also discusses solutions -- and a "release load" is not one of them. ;) It also mentions a fence-based solution that should work in Rust as well?

But anyway -- since "release read" doesn't make sense in the memory model, I don't think arch::load can possibly have an effect like that. There is no cheap way to close the gaps in our memory model; there is no shortcut around actually supporting whatever RCU and seqlocks need in our atomics. That is exactly what arch::{load,store} should explicitly not be allowed to do, since it simply doesn't work -- the compiler will still do language-based reasoning 'between' arch operations.

So I think we should keep "gaps in the concurrency memory model" entirely out of this thread, since those issues fundamentally cannot be solved by just adding a few new opaque functions in arch. Opaque functions still fundamentally have to have some behavior that is expressible in the Abstract Machine; the issues with seqlock and RCU are about behaviors the Abstract Machine cannot express; ergo the only way to fix that is by actually making the Abstract Machine more expressive -- e.g. by adding read-dont-modify-write operations.

IOW, please let's not mix up the following two mostly unrelated concerns:

  • expressing interactions with the "machine-level outside world" (MMIO registers, memory that is shared with untrusted other processes)
  • expressing some interesting concurrency patterns that cannot currently be expressed, for data structures that operate entirely inside the Rust Abstract Machine (but where that machine currently lacks some primitives)


digama0 commented Aug 14, 2022

My understanding is that it is defined to do whatever the hardware does in that case. So if the write is ignored, then it would not be UB, it would be defined to do nothing, and it would also not be subsequent UB for use of & because the value is still intact.


Regarding the effect on the AM, I still think this is underspecified. The call to store takes a pointer, but it sounds like the provenance of that pointer is irrelevant to the access itself: it is always defined behavior even if the pointer doesn't have any business giving access to that memory, and the whole thing is a bit closer to an FFI call where we request the hardware to go do something with this address and then we may have UB after the fact if something odd happened.

So far so good, but now the analogy with FFI breaks down because we normally talk about FFI in terms of things any legal Rust code could do, and this is beyond that - it's a bit closer to /proc/self/mem shenanigans in that the OS / hardware is making "unapproved" changes to the hardware state, bringing it out of alignment with the AM state. So we continue execution until we do something to "observe" the misalignment, like reading a shared reference that wasn't supposed to change and seeing that it has a different value... but this doesn't work either because that's not how reads work: they are UB only because the relevant borrow stack item has been popped (which is a necessary precondition for changing the value in normal SB).

So it seems we need to manufacture some kind of "reason" for this value to change in terms the rust AM can understand: for example, picking a random provenance valid for writes on those bytes and making the change there, popping everything above it. That still doesn't work for read-only memory though, which doesn't have any provenance valid for writes.

Sorry, that ended up as a stream of consciousness post. I think this might be workable with enough AM magic, but I think there will be cases where there is no choice but to make it immediate UB if the change on the state cannot be explained by anything rust code could do.

@DemiMarie (Author)

@Lokathor:

You can't assume that read-only memory will segfault when written to. The write can simply be silently ignored. So the program is not "defined to terminate with a segfault or similar".

I was thinking “pages marked read-only in the page tables” (which will indeed SIGSEGV), but you are correct regarding actual ROM.

@digama0:

My understanding is that it is defined to do whatever the hardware does in that case. So if the write is ignored, then it would not be UB, it would be defined to do nothing, and it would also not be subsequent UB for use of & because the value is still intact.

That is correct.

Regarding the effect on the AM, I still think this is underspecified. The call to store takes a pointer, but it sounds like the provenance of that pointer is irrelevant to the access itself: it is always defined behavior even if the pointer doesn't have any business giving access to that memory, and the whole thing is a bit closer to an FFI call where we request the hardware to go do something with this address and then we may have UB after the fact if something odd happened.

Pretty much! I wanted something that people writing drivers and other very low-level code could count on to work, without having to constantly worry if what they were doing was UB.

So far so good, but now the analogy with FFI breaks down because we normally talk about FFI in terms of things any legal rust code could do, and this is beyond that - it's a bit closer to /proc/self/mem shenanigans in that the OS / hardware is making "unapproved" changes to the hardware state, bringing it out of alignment with the AM state.

Not really. An FFI call could do the same thing via e.g. inline assembler.


comex commented Aug 14, 2022

Sorry, that ended up as a stream of consciousness post. I think this might be workable with enough AM magic, but I think there will be cases where there is no choice but to make it immediate UB if the change on the state cannot be explained by anything rust code could do.

From an Abstract Machine perspective, it's true that there are many cases where any UB caused by a store must be immediate rather than delayed until a later load. After all, a load that, at the assembly level, occurs later than the wild write may only be that way due to reordering: at the Abstract Machine level it may have occurred earlier. That's why popping a protector has to be immediate UB: the compiler, when compiling the function that made the call that pushed the protector, may have reordered loads across that call. And this reasoning applies regardless of whether the store is volatile or has any other special marking, because when the compiler reorders accesses across calls, it can't know whether the function being called contains any 'special' operations.

On the other hand, not all memory accesses that exist at a low level exist in the Abstract Machine. When the kernel pages some memory out to disk, it reads the memory as part of writing it to a file somewhere; later on, it reads the file and writes the memory (or rather, it writes some arbitrary newly-allocated piece of memory that's mapped at the same virtual address). But the Abstract Machine doesn't care. The kernel provides an abstraction of memory to userland, satisfying basic properties like "loads return the values written by prior stores". Meanwhile, the Rust compiler takes source code targeting the Abstract Machine and turns it into machine code targeting that lower-level abstraction. Anything the kernel does to implement that abstraction doesn't need to be mapped back up to the Abstract Machine.

That logic applies to the kernel. It also applies to something like CRIU that exists in userland – even running in the same process as the Rust code – but performs a job similar to what a kernel traditionally does.

The question is: can that logic ever apply to volatile accesses performed by the program itself? Can we justify an 'illegal' volatile store to some memory by saying: we'll put the original value back before the program could perform any loads of that memory – and I don't mean Abstract Machine loads, but lower-level loads – therefore the store is unobservable, therefore from an Abstract Machine perspective we can pretend the stores don't exist?

To me, this sounds reasonable in theory, but difficult to guarantee in the face of all changes that might hypothetically be made to the implementation.

In fact, there are situations where it doesn't work today. Both the kernel and CRIU can suspend the entire program while mucking with its memory, but that's not an option if the program itself mucks with memory. @DemiMarie mentioned "suspend[ing] all other threads in the program" (besides the one mucking with memory), but that can be problematic. If you take a lock, you might deadlock on a lock that's held by one of those suspended threads. So while the other threads are suspended, you must not call any code that might take a lock, including the memory allocator. What if you never touch the standard library? Well, the compiler sometimes generates calls to runtime functions to implement certain operations, and the dynamic linker can interpose its own runtime code into calls within the program (this would normally happen when calling a function implemented in a shared library, as part of lazy symbol binding). That runtime code can take locks. Lazy symbol binding normally takes a lock, and while other runtime calls usually don't, there are probably obscure situations where they can do so (say, when using sanitizers or segmented stacks).

What if you manage to get past any potential deadlock issues (or if there are no other threads in the first place)? Well, you're still assuming that the implementation won't insert some access to the target memory into the middle of your sequence of volatile operations. Why would that happen? It might be an access from earlier or later that was reordered, but such reordering can be prevented with barriers. However, I can think of more exotic cases. Suppose some stack variable is supposed to be live but unaccessed in a particular range of code and also has a known constant value. The compiler could hypothetically decide to reuse the memory for something else, then put the constant value back later. (It's not clear whether the compiler can legally reuse memory for another stack variable whose address is taken, but it can certainly do it if the address isn't taken, or for data that isn't part of the Abstract Machine.)

@digama0
Copy link

digama0 commented Aug 14, 2022

On the other hand, not all memory accesses that exist at a low level exist in the Abstract Machine. When the kernel pages some memory out to disk, it reads the memory as part of writing it to a file somewhere; later on, it reads the file and writes the memory (or rather, it writes some arbitrary newly-allocated piece of memory that's mapped at the same virtual address). But the Abstract Machine doesn't care. The kernel provides an abstraction of memory to userland, satisfying basic properties like "loads return the values written by prior stores". Meanwhile, the Rust compiler takes source code targeting the Abstract Machine and turns it into machine code targeting that lower-level abstraction. Anything the kernel does to implement that abstraction doesn't need to be mapped back up to the Abstract Machine.

I think that makes sense: we can do a bunch of things at the low level and then "resume" the correspondence to the abstract machine, and it doesn't have to know that anything untoward happened. Except... this is Rust code doing the low-level work itself! The compiler is handling everything in sight around the store call; it's not like an inline assembly block, where we are in control for an entire indivisible unit and the compiler picks up only after the block is done. So I think it is necessary for store to be able to immediately resume the correspondence to the abstract machine after the write, and if this is impossible then we get immediate UB. This may very well make certain kinds of low level programs impossible to write, and in those cases you need to use asm! instead.

On the whole, while I can see some use cases for it I worry that this tool is just too dangerous to use in most cases. It is extremely easy to cause UB with it, and Miri can't help because it can't model the hardware to say whether the thing you did makes sense.

@DemiMarie
Copy link
Author

So I think it is necessary for store to be able to immediately resume the correspondence to the abstract machine after the write, and if this is impossible then we get immediate UB. This may very well make certain kinds of low level programs impossible to write, and in those cases you need to use asm! instead.

Indeed so. At a minimum one would likely need asm! to switch to a separate stack, then call an extern "C" function or similar.
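
A minimal x86_64 sketch of that shape (call_on_stack and on_other_stack are invented names for illustration; unwinding, the red zone, TLS, and signal safety are all ignored, and the provided stack must be 16-byte aligned):

use core::arch::asm;

// Must not touch any lock that could be held by a suspended thread.
extern "C" fn on_other_stack() {}

#[cfg(target_arch = "x86_64")]
unsafe fn call_on_stack(stack_top: *mut u8, f: extern "C" fn()) {
    asm!(
        "mov r12, rsp",    // save the current stack pointer in a callee-saved register
        "mov rsp, {new}",  // switch to the caller-provided stack
        "call {f}",        // run the function on that stack
        "mov rsp, r12",    // switch back
        new = in(reg) stack_top,
        f = in(reg) f,
        out("r12") _,
        clobber_abi("C"),
    );
}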

On the whole, while I can see some use cases for it I worry that this tool is just too dangerous to use in most cases. It is extremely easy to cause UB with it, and Miri can't help because it can't model the hardware to say whether the thing you did makes sense.

Miri should pretend that these are normal stores and loads.

@RalfJung
Copy link
Member

RalfJung commented Aug 20, 2022

using core::arch::store to write to memory that is read-only at the hardware level (and thus causing a trap) is explicitly permitted.

This means the store is potentially diverging (as in, it might not return), which puts tight limits on reordering code around it. volatile must not be diverging like that. But of course we could be deliberately stricter than volatile.

If the write changes memory at that location, then it has to also invalidate any & and &mut references to those bytes,

Definitely. Like all inline assembly, this can only do things to the AM state that could also have been done by regular Rust code.

I feel like you essentially just want to piggy-back on the specification of inline assembly (that we have to figure out anyway), and then basically just say "this generates inline assembly for the appropriate load/store instruction on this platform". All inline assembly blocks should ideally come with a description of their AM-visible effects, and a safety comment explaining why those are well-defined to occur here -- arch::store/load is no different.

Pretty much! I wanted something that people writing drivers and other very low-level code could count on to work, without having to constantly worry if what they were doing was UB.

I just want unicorns and rainbows, is that too much to ask? ;)

What you are asking is impossible (or maybe I am misunderstanding). If we want to have optimizations, all code must play by the aliasing rules, including volatile, FFI, and inline assembly.

@Lokathor
Copy link
Contributor

You can already write drivers using volatile and inline asm.

I'm essentially with Ralf, and will even go slightly farther: if more magical ways to do things get added to the language, that's more ways people have to consider when trying to figure out whether they did something right.

LLVM already basically shrugs and calls volatile "some sort of hardware-specific thing, you break it you bought it", which is about all that these load/store ops seem to be anyway, except that these have specific size requirements. We can just issue a warning on tearing volatile accesses if that's such a concern.

@RalfJung
Copy link
Member

Well, I like the idea of simply not using LLVM volatile any more and making direct use of inline assembly -- that would make us independent of the LLVM volatile semantics when it comes to questions like how volatile interacts with atomics, or with uninit memory.

@DemiMarie
Copy link
Author

Well, I like the idea of simply not using LLVM volatile any more and making direct use of inline assembly -- that would make us independent of the LLVM volatile semantics when it comes to questions like how volatile interacts with atomics, or with uninit memory.

Yeah, that is my thought too. Volatile has very poorly specified semantics, and as far as I can tell this only makes writing drivers harder than it needs to be. Inline assembly needs to be supported anyway, and implementing volatile in terms of it means one less thing the compiler needs to know about. There might be some magic needed to work around const generics limitations, but even that should eventually become unnecessary.
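
As a rough illustration of the shape this could take (not how rustc actually lowers anything today), here is a 32-bit load and store on x86_64 written directly as inline assembly, so their semantics come from asm! rather than from LLVM volatile:

use core::arch::asm;

#[cfg(target_arch = "x86_64")]
unsafe fn asm_load_u32(ptr: *const u32) -> u32 {
    let val: u32;
    // Exactly one 32-bit mov from [ptr]; the compiler treats the block as opaque.
    asm!("mov {val:e}, dword ptr [{ptr}]",
         ptr = in(reg) ptr,
         val = out(reg) val,
         options(nostack, preserves_flags));
    val
}

#[cfg(target_arch = "x86_64")]
unsafe fn asm_store_u32(ptr: *mut u32, val: u32) {
    // Exactly one 32-bit mov to [ptr].
    asm!("mov dword ptr [{ptr}], {val:e}",
         ptr = in(reg) ptr,
         val = in(reg) val,
         options(nostack, preserves_flags));
}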

@hsivonen
Copy link
Member

However, it is also not required to generate code to handle unaligned access, if doing so would cause a performance penalty for the aligned case.

I think this phrasing is unfortunate as it seems to rule out some legitimate use cases.

Consider the case where the kernel, hypervisor, wasm host, browser parent process or similar that you are writing accesses potentially-hostile client memory with 128-bit SIMD instructions, loading/storing vectors that logically have smaller-than-128-bit elements (e.g. 16 8-bit elements), and you may want to use an address that is aligned only to those smaller elements. On x86_64, there should be some facility that generates movdqu for that. Yet, to avoid pessimizing the 128-bit-aligned case (at least on some CPU models), there should also be a facility that generates movdqa.

It seems to me that it would be better to have four functions than two: load, store, load_unaligned, and store_unaligned where the first two would generate instructions that are allowed to trap on bad alignment and the latter two would generate instructions that are valid for unaligned access.

The unaligned versions shouldn't purport to provide any guarantees against tearing. Such guarantees are unnecessary when the use case is that a well-behaved client stays away from the shared memory during the privileged access and a badly-behaved client is welcome to UB itself but not to UB the more privileged kernel/hypervisor/wasm host/browser parent.
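
For reference, today's stable intrinsics already expose exactly this aligned/unaligned split, though without any guarantee against the access being merged or elided, which is what the proposed functions would add; the sketch below only illustrates the movdqa/movdqu distinction:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_load_si128, _mm_loadu_si128};

// movdqa: requires 16-byte alignment and may fault otherwise.
#[cfg(target_arch = "x86_64")]
unsafe fn load_aligned(p: *const __m128i) -> __m128i {
    _mm_load_si128(p)
}

// movdqu: accepts any alignment, e.g. addresses aligned only to the element size.
#[cfg(target_arch = "x86_64")]
unsafe fn load_unaligned(p: *const u8) -> __m128i {
    _mm_loadu_si128(p as *const __m128i)
}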

@chorman0773
Copy link
Contributor

chorman0773 commented Oct 11, 2022 via email

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2022

Looks like GH only now submitted some emails sent months ago; I had the same happen to me...

That is UB, already at the store. Aliasing rules still apply. And since it is UB, the store doesn't actually have to get emitted.

A more interesting option arises around reads: we could declare that aliasing-violating reads return uninit rather than being UB. The formal version of Stacked Borrows has to do that anyway for all reads to keep working correctly in a concurrent setting. Though with basically all our loads being noundef, that might not be so useful... (speculative loads introduced by SB do have to cope with undef)

@DemiMarie
Copy link
Author

DemiMarie commented Oct 12, 2022

aliasing-violating reads return uninit rather than being UB

I would prefer freeze(poison). One of the explicit goals of core::arch::load is for a program to be able to peek anywhere in its address space without UB. Of course, if you violate aliasing rules, the value you will get is unspecified, but it can’t be undef.

That is UB, already at the store. Aliasing rules still apply. And since it is UB, the store doesn't actually have to get emitted.

In this case, yes, since x is ordinary stack memory. However, it is possible to have situations where this is not the case, such as the following:

const SOME_MMIO_ADDRESS_THAT_IGNORES_WRITES: usize = 0xdeaddead;

fn main() {
    // Assumes the address is mapped and readable at the hardware level.
    let x: &u8 = unsafe { &*(SOME_MMIO_ADDRESS_THAT_IGNORES_WRITES as *const u8) };
    unsafe { core::arch::store(SOME_MMIO_ADDRESS_THAT_IGNORES_WRITES as *mut u8, 4u8) };
    println!("{}", x);
}

Because the MMIO address ignores anything written to it, the call to core::arch::store is actually a no-op. In particular, it can be freely reordered with any reads to that address. Therefore, this program does not have UB.

Fundamentally, core::arch::{load, store, load_unaligned, store_unaligned} are most similar to inline assembler in that both are defined in terms of how they affect the concrete machine, not the abstract machine. In the case of inline assembler, Rust provides certain guarantees about concrete machine state at the entry to the assembler block, and certain guarantees about how changes to the concrete machine state by the assembler will be reflected in the abstract machine. The same applies to core::arch::{load, store, load_unaligned, store_unaligned}: they operate on the concrete machine, and Rust specifies how the concrete and abstract machines relate to each other.

Another case where one might need to use core::arch::store on memory that aliases an & reference is when implementing paging of memory. If memory is paged out, reads will fault, and the program must somehow be able to re-create the memory region that the read expected. In a non-concurrent program, one way to do this is in fact to map the page with arbitrary contents and then fill it in.
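
A rough sketch of that last step, assuming the proposed core::arch::store and the libc crate, and glossing over how the fault is caught and over signal-safety concerns:

// Re-create a paged-out 4 KiB page at `addr` (page-aligned) from a saved copy.
unsafe fn restore_page(addr: *mut u8, saved: &[u8; 4096]) {
    let mapped = libc::mmap(
        addr as *mut libc::c_void,
        4096,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_FIXED,
        -1,
        0,
    );
    assert!(mapped != libc::MAP_FAILED);
    // Fill the freshly mapped (zeroed) page with the expected contents, one
    // machine store per byte, even though the compiler may believe this memory
    // is covered by outstanding & references.
    for (i, &byte) in saved.iter().enumerate() {
        core::arch::store(addr.add(i), byte);
    }
}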

@RalfJung
Copy link
Member

I would prefer freeze(poison). One of the explicit goals of core::arch::load is for a program to be able to peek anywhere in its address space without UB. Of course, if you violate aliasing rules, the value you will get is unspecified, but it can’t be undef.

If we decide that freeze is something Rust should have (which IMO requires a separate RFC), then I agree volatile loads should freeze.

Because the MMIO address ignores anything written to it, the call to core::arch::store is actually a no-op. In particular, it can be freely reordered with any reads to that address. Therefore, this program does not have UB.

This relates to other questions we discussed some time ago, about whether the compiler can reorder arch::store to disjoint regions of memory. Basically it boils down to: should we specify that these operations are, fundamentally, memory operations that just have "extra" effects outside of the directly-Rust-visible state, or should they be basically opaque FFI calls?

Whenever they affect regular Rust-visible memory, however, they must fully follow the memory model rules.

@DemiMarie
Copy link
Author

This relates to other questions we discussed some time ago, about whether the compiler can reorder arch::store to disjoint regions of memory.

From a driver perspective, I suspect the answer is “no”. I am not aware of any reason why two different MMIO operations should commute.

@Lokathor
Copy link
Contributor

I don't think store ops should be able to be reordered. There's no way to tell when, for example, one region is a chunk of video memory and another region controls the video operation. If those got reordered, the video operation would begin before all of the data had been written out.

And similar for sound and DMA and so on.

@RalfJung
Copy link
Member

RalfJung commented Oct 12, 2022 via email

@Lokathor
Copy link
Contributor

Honestly, volatile and non-volatile accesses probably should not be reordered with each other, but LLVM does it, so we might be stuck with that. I guess unless we use inline asm?

The discussion you're thinking of is probably the compiler fence topic: #347

@fbq
Copy link

fbq commented Nov 23, 2023

@DemiMarie I wonder whether there is a particular blocker on this issue/RFC? Do we have a plan to move forward?

@DemiMarie
Copy link
Author

@fbq I am not aware of any. I’m fine with making a full RFC out of this.

@fbq
Copy link

fbq commented Nov 25, 2023

Great to hear! I think core::arch::{load, store} would be a good tool for implementing a Rust version of READ_ONCE and WRITE_ONCE in the Linux kernel, and maybe for MMIO, since they have clearer ABI-level semantics than read_volatile and write_volatile.
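
For example, a thin wrapper along these lines (read_once/write_once are hypothetical names, and this assumes the proposed functions exist) would give exactly-one-access, non-tearing semantics:

// Each call compiles to exactly one machine load/store of size_of::<T>(),
// which is the property READ_ONCE/WRITE_ONCE rely on.
#[inline(always)]
pub unsafe fn read_once<T: Copy>(ptr: *const T) -> T {
    core::arch::load(ptr)
}

#[inline(always)]
pub unsafe fn write_once<T: Copy>(ptr: *mut T, val: T) {
    core::arch::store(ptr, val)
}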

@DemiMarie
Copy link
Author

@fbq One last question: Should there be separate functions for MMIO and non-MMIO? The reason is that MMIO is often trapped and emulated (e.g. in virtualized environments) and so must use very specific instructions the emulator can handle. These instructions may be suboptimal for non-MMIO uses.

@DemiMarie
Copy link
Author

Specifically, for MMIO I want the RFC to guarantee not only the effect of these functions, but also the exact machine instructions these functions use to perform the load or store operation. For non-MMIO, this is overkill.

@RalfJung
Copy link
Member

RalfJung commented Feb 5, 2024

I think these APIs should be designed for MMIO. Any other use is incidental, and likely a misuse.

The Linux kernel memory model is an extremely special case; all other code should use proper atomics, so we shouldn't have the LKMM affect our volatile APIs.

@fbq
Copy link

fbq commented Feb 5, 2024

I agree with @RalfJung: for the Rust language, these APIs should be designed for MMIO. For a particular Rust project (e.g. the Linux kernel), developers might treat them as an asm library, which avoids re-inventing the wheel where possible (and it's handy to have some primitives that can access shared memory without causing UB, when the race is intended). Of course, this is probably a temporary situation for the Linux kernel; I think we would like to move to a proper memory model/atomics in the future.

@DemiMarie
Copy link
Author

I think a non-MMIO use of these functions is likely one of the following:

  1. A bug in the program.
  2. A suboptimal choice of API when a better API is available.
  3. A workaround for a limitation of Rust that should be addressed.
