
Add LLVM atomic memcpy intrinsics, expose in core/std #58599

Open
joshlf opened this issue Feb 20, 2019 · 36 comments
Labels
A-atomic Area: Atomics, barriers, and sync primitives A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-feature-request Category: A feature request, i.e: not implemented / a PR. T-lang Relevant to the language team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@joshlf
Contributor

joshlf commented Feb 20, 2019

Expose LLVM's "element wise atomic memory intrinsics". In particular:

  • Expose llvm.memcpy.element.unordered.atomic as atomic_element_unordered_copy_memory_nonoverlapping
  • Expose llvm.memmove.element.unordered.atomic as atomic_element_unordered_copy_memory
  • Expose llvm.memset.element.unordered.atomic as atomic_element_unordered_set_memory

Expose these through functions in the std::ptr module. Each function is implemented by the equivalent intrinsic, or by a loop of relaxed atomic operations if the intrinsic is not available on the target platform (TODO: given this platform-specific behavior, can this also be exposed in core::ptr?). A rough sketch of the fallback path follows the list below.

  • copy_nonoverlapping_atomic_unordered, backed by atomic_element_unordered_copy_memory_nonoverlapping
  • copy_atomic_unordered, backed by atomic_element_unordered_copy_memory
  • write_atomic_unordered, backed by atomic_element_unordered_set_memory
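
A minimal sketch of the relaxed-atomic fallback mentioned above, specialized to u8 (the function name and shape are illustrative only, not a settled API):

```rust
use core::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical fallback for `copy_nonoverlapping_atomic_unordered`, specialized to `u8`:
/// each byte is copied with its own relaxed atomic load and store -- the "loop of relaxed
/// atomic operations" described above for targets without the LLVM intrinsic.
///
/// Safety: `src` and `dst` must each be valid for `count` bytes and must not overlap.
pub unsafe fn copy_nonoverlapping_atomic_unordered_u8(src: *const u8, dst: *mut u8, count: usize) {
    for i in 0..count {
        unsafe {
            let b = (*(src.add(i) as *const AtomicU8)).load(Ordering::Relaxed);
            (*(dst.add(i) as *const AtomicU8)).store(b, Ordering::Relaxed);
        }
    }
}
```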

Previously discussed on the internals forum here.

Folks with the authority to approve this: Let me know if this is OK to move forward; I'd like to post it to This Week in Rust's CFP.

@estebank estebank added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 20, 2019
@Centril Centril added T-lang Relevant to the language team, which will review and decide on the PR/issue. C-feature-request Category: A feature request, i.e: not implemented / a PR. labels Feb 20, 2019
@joshlf
Contributor Author

joshlf commented Feb 22, 2019

@alexcrichton Do you have thoughts on this? Want to make sure I get the go-ahead before I send somebody on TWiR's CFP down a dead end.

@alexcrichton
Member

I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!

@joshlf
Contributor Author

joshlf commented Feb 22, 2019

OK cool, thanks!

@tmccombs
Contributor

tmccombs commented Mar 2, 2019

I'd like to take a stab at implementing this.

@joshlf
Contributor Author

joshlf commented Mar 2, 2019

Awesome, thanks @tmccombs ! This will unblock a lot of important work :)

@joshlf
Contributor Author

joshlf commented Mar 2, 2019

@alexcrichton On the subject of exposing in core::ptr: Is the ban on platform-specific logic in core a general one, or just for API-visible elements? If it's the latter, then we should be able to implement the facade in core::ptr as well.

@alexcrichton
Member

Currently it's largely just a "we strongly discourage platform-specific APIs" policy in libcore; there are already platform-specific pieces in the implementation, especially some math-related pieces on MSVC.

@tarcieri
Contributor

I saw there was some discussion around use cases on #59155, so I thought I'd bring up mine (and perhaps you can all cross-check me that this would be the proper solution).

I maintain a crate for zeroizing memory (for e.g. wiping cryptographic keys/plaintexts) which presently relies on undefined behavior - namely mixed volatile / non-volatile memory accesses (in conjunction with compiler and hardware fences):

https://github.com/iqlusioninc/crates/tree/develop/zeroize

I'm presently (ab)using core::ptr::write_volatile (and relying on undefined LLVM behavior / implementation details) in conjunction with core::atomic::fence.

What I'd really like is an atomic memset-style operation, i.e. llvm.memset.element.unordered.atomic, which can write repeated copies of a value atomically. The closest thing that presently exists is volatile_set_memory, however that comes with undefined behavior when volatile and non-volatile operations are mixed.
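
For context, a simplified sketch of the volatile-write-plus-fence pattern described above (not the actual zeroize code); an atomic memset would replace the per-byte volatile writes:

```rust
use core::ptr;
use core::sync::atomic::{fence, Ordering};

/// Simplified sketch of the approach described above; the real crate is more involved.
fn zeroize_bytes(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // Volatile write so the compiler does not elide the "dead" stores.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Fence to keep later accesses from being reordered before the wipe.
    fence(Ordering::SeqCst);
}
```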

@RalfJung
Member

undefined behavior when volatile and non-volatile operations are mixed.

I don't know why you think that mixing volatile and non-volatile accesses is UB. I see no issue with that per se.

That said, atomic memset might still be useful.

@tarcieri
Contributor

It is explicitly documented as UB:

https://doc.rust-lang.org/core/ptr/fn.write_volatile.html

Memory accessed with read_volatile or write_volatile should not be accessed with non-volatile operations.

Here is a thread on the subject of using volatile writes for zeroing sensitive memory:

https://internals.rust-lang.org/t/volatile-and-sensitive-memory/3188

@RalfJung
Member

RalfJung commented May 20, 2019

That sentence does not say "UB", nor does it say "must not". The section on when this function is UB does not mention mixed accesses.

AFAIK, it mostly refers to "normal" usage of volatile, namely interaction with IO registers: you don't want to mix volatile and non-volatile accesses on those. But I do not know of any form of UB there. @rkruppe, do you?

EDIT: Opened #60972

@RalfJung
Member

RalfJung commented May 20, 2019

Here is a thread on the subject of using volatile writes for zeroing sensitive memory:

Note that that thread does not reflect our current understanding of volatile accesses and the issues around that. See rust-lang/unsafe-code-guidelines#33 for the latest discussion. I'd love to amend that thread but it got auto-closed :(

@tarcieri
Contributor

tarcieri commented May 20, 2019

@RalfJung interesting (and also lengthy)! I'll have to read through it later.

Discrepancies between volatile and atomic aside, there's still no stable API for volatile memset either (only the unstable core::intrinsics::volatile_set_memory). It'd be nice if there were something along the lines of core::ptr::set_volatile which provides a stable wrapper around this intrinsic.
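
For illustration, the kind of wrapper being asked for might look like this (the name `set_volatile` and its signature are hypothetical, and the underlying intrinsic is nightly-only):

```rust
#![feature(core_intrinsics)]
use core::intrinsics::volatile_set_memory;

/// Hypothetical wrapper of the kind suggested above; not an accepted API.
///
/// Safety: `dst` must be valid for writes of `count * size_of::<T>()` bytes.
pub unsafe fn set_volatile<T>(dst: *mut T, val: u8, count: usize) {
    unsafe { volatile_set_memory(dst, val, count) }
}
```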

Centril added a commit to Centril/rust that referenced this issue May 20, 2019
remove confusing remarks about mixed volatile and non-volatile accesses

These comments were originally added by @ecstatic-morse in rust-lang@911d35f and then later edited by me. The intention, I think, was to make sure people do both their reads and writes with these methods if the affected memory really is used for communication with external devices.

However, [people read this as saying that mixed volatile/non-volatile accesses are UB](rust-lang#58599 (comment)), which -- to my knowledge -- they are not. So better remove this.

Cc @rkruppe @rust-lang/wg-unsafe-code-guidelines
@htfy96

htfy96 commented Jun 2, 2019

This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantics.

More details can be read at this C++ proposal.
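
To make the use case concrete, here is a rough reader-side sketch with a single-word payload standing in for the protected data (illustrative only; a multi-word payload is exactly where a per-element atomic memcpy would be needed):

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU64, Ordering};

struct SeqLock {
    seq: AtomicU32,
    // A single word keeps the sketch simple; real seqlocks guard larger,
    // non-atomic data, which is where an atomic memcpy comes in.
    data: AtomicU64,
}

impl SeqLock {
    fn read(&self) -> u64 {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 & 1 != 0 {
                continue; // odd sequence number: a writer is in progress
            }
            // With a multi-word payload, this is the read that would want to be an
            // atomic memcpy; per-element relaxed loads are the closest thing today.
            let value = self.data.load(Ordering::Relaxed);
            fence(Ordering::Acquire); // keep the data read from moving past the re-check
            let s2 = self.seq.load(Ordering::Relaxed);
            if s1 == s2 {
                return value;
            }
        }
    }
}
```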

@RalfJung
Member

RalfJung commented Jun 3, 2019

This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantics.

It depends on LLVM semantics ensuring that read-write races are not UB, and instead yield undef/poison on the read side. But that is something LLVM guarantees (but Rust does not, neither do C/C++). I don't see any connection to release-acquire synchronization.

More details can be read at this C++ proposal.

That's interesting, thanks! I am surprised though that they don't even mention the LLVM semantics, which are an alternative approach to solve the issue with SeqLocks. Seems like relevant related work.

@hsivonen
Member

hsivonen commented Jun 4, 2019

I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!

Considering how much focus there is on Wasm, it would be great for LLVM "unordered" loads and stores to have a path to stabilization in order to have stable-Rust facilities for implementing host services for multi-threaded Wasm. (Since Wasm code is untrusted, if Wasm thread A calls into a host service and designates a range of Wasm heap for reading/writing, the Rust code implementing the host service can't be optimized with the assumption of freedom from data races in that memory region, because Wasm thread B could access the region without adhering to the Rust/C11/C++11 rules.)

@HadrienG2

HadrienG2 commented Jun 4, 2019

Wait a second.

Actually, unordered loads and stores do not resolve your problem if you have no way to guarantee that the adversary code is interacting with memory using unordered operations. As far as I can tell from the LLVM manual, unordered loads from memory that is written using non-unordered stores are still UB.

It could be argued that no sane compiler optimization or known hardware behavior could ever turn this UB into something dangerous in practice. But well... UB is UB, and we never know what the future holds.

So if you want to strictly avoid UB, the correct thing to do in your case is to tweak the WASM compiler so that all reads and writes to the WASM heap that may be used to interact with the host are atomic. They may be unordered, they may be relaxed; it doesn't really matter from a UB-correctness point of view (though it may matter from a performance point of view). But they must be atomic.

@jstarks

jstarks commented Jun 14, 2019

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor? If the other processor is executing code in a different security boundary, then there may be no way to prevent it from mutating the memory with an arbitrary mix of atomic, non-atomic, and unaligned accesses. But clearly this works at the processor level, because this is the foundation for most OS and virtualization communication models.

Right now, for example, Google's crosvm VMM uses an (arbitrary?) mix of ptr::{write_volatile, copy, write_bytes} while accessing memory that is shared between its process and a VM. It seems to me that at least some of this is UB today, and therefore future LLVM optimizations could break this code, but I wouldn't know how to fix this with what Rust offers today (especially while retaining good performance).

The requested intrinsics would seem to be what crosvm, and software like it, needs.

@RalfJung
Member

RalfJung commented Jun 14, 2019

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?

Yes, use atomic accesses.

A data race of an atomic and a non-atomic access in one program is UB if at least one of them is a write; but if another separately compiled program races with your atomic accesses, the only side that has to expect problems is that side. In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.

Mind you, this is informal. You are outside of the scope of any formal model I am aware of.

@hsivonen
Member

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?

Yes, use atomic accesses.

Over in the SharedArrayBuffer issue that motivates my comments here, I was told by a SpiderMonkey developer that C++11/C11 atomics are not a valid answer.

In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.

Is there advice for programmers of virtual machines on how to use atomics safely when the other processor is running a non-conforming execution, such that LLVM devs agree the advice may be safely relied upon?

@RalfJung
Member

The point seems to be

gcc is quite explicit about this in its documentation, saying that its __atomic intrinsics may not execute fences if gcc thinks they're not required.

Indeed compilers are allowed to optimize atomic accesses to non-atomic accesses if they can prove that that does not introduce a race or otherwise change behavior.

The problem we are having here is that the C/C++ memory model is inherently a "whole-program spec". The question we are asking though only makes any sense when considering "separate compilation". We want some kind of modular notion of correctness when linked with arbitrary code -- and worse, with the additional constraint that we have run-time mechanisms in place that ensure that the arbitrary code does not cause UB.

However, I stand by my claim that using atomic instructions will work for this. GCC can only remove synchronization when it knows all accesses done to a location. In the case of a VM, I don't know how memory gets allocated and handed to the program inside the VM, but surely it happens in a way that the compiler must consider the pointer "escaped"? Even though the spec does not consider separate compilation, that's what compilers do in practice -- they have to compile a compilation unit in a way that works fine with any conforming other compilation unit. Hence they cannot remove fences from atomic accesses to locations that have "escaped".

But I will admit that this is all based on a somewhat idealized understanding of the VM and compilers, the details of which I am not extremely familiar with. I just cannot imagine a way to implement a compiler that supports separate compilation of concurrent code, such that using atomics for this use-case does not have the intended semantics.

@RalfJung
Member

One issue with these intrinsics is that they use LLVM's "unordered" atomic memory ordering, which Rust currently does not have, and I have some reservations about adding it.

Are there "relaxed" versions of these intrinsics that we could use instead?

@joshlf
Contributor Author

joshlf commented Oct 12, 2019

Are there "relaxed" versions of these intrinsics that we could use instead?

Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering) and found it to be ~10x slower than a loop of unsynchronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.
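
For illustration, the comparison was along these lines (a crude sketch assuming a simple wall-clock harness; the ~10x figure comes from the measurement above, not from this snippet, and a real benchmark would want std::hint::black_box and repeated runs):

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::time::Instant;

fn sum_relaxed(buf: &[AtomicU8]) -> u64 {
    buf.iter().map(|b| b.load(Ordering::Relaxed) as u64).sum()
}

fn sum_plain(buf: &[u8]) -> u64 {
    buf.iter().map(|&b| b as u64).sum()
}

fn main() {
    let plain = vec![1u8; 1 << 20];
    let atomic: Vec<AtomicU8> = plain.iter().map(|&b| AtomicU8::new(b)).collect();

    let start = Instant::now();
    let a = sum_relaxed(&atomic);
    let t_relaxed = start.elapsed();

    let start = Instant::now();
    let b = sum_plain(&plain);
    let t_plain = start.elapsed();

    println!("relaxed: {:?}, plain: {:?} (sums: {} / {})", t_relaxed, t_plain, a, b);
}
```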

For my use case in particular, speed is paramount, which is why I'm interested in these intrinsics. That said, I completely understand your concern with introducing "unordered" into our memory model. Perhaps a middle-road path would be to add these as unstable with a big warning in their doc comments saying "we will probably never stabilize these because of memory model issues"? That way we could still use them for the time being while we let the various discussions about the semantics of inter-process shared memory play out, and hopefully switch to the "official" solution once such a solution exists.

@RalfJung
Member

Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering) and found it to be ~10x slower than a loop of unsynchronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.

My current theory for why relaxed is so much slower is that LLVM is really bad at optimizing it.

add these as unstable

I am not opposed to any unstable experimentation. I am okay with giving people enough rope to hang themselves, at least on nightly. ;)
I am surprised, though, that this would be enough for you to actually use it.

@HadrienG2

HadrienG2 commented Oct 12, 2019

@joshlf The cause of this particular problem is known. LLVM's optimizer is currently unable to fuse together neighboring relaxed atomic loads and stores, so a loop of AtomicU8 operations will translate into actual byte operations in hardware assembly, which is stupidly slow.

According to @comex, that's an optimizer bug, and the relaxed atomic semantics should actually allow such code to be optimized into e.g. an efficient rep movsq or SSE2 memcpy loop on x86_64.

@RalfJung
Member

RalfJung commented Oct 12, 2019

inter-process shared memory play

Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require volatile -- specifically, "atomic volatile", which LLVM has but Rust does not expose.

@joshlf
Contributor Author

joshlf commented Oct 12, 2019

inter-process shared memory play

Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require volatile -- specifically, "atomic volatile", which LLVM has but Rust does not expose.

IIUC from the discussion in rust-lang/unsafe-code-guidelines#152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?

@RalfJung
Member

IIUC from the discussion in rust-lang/unsafe-code-guidelines#152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?

That is true if we adapt LLVM's handling of read-write races. So, we would be deviating from C++ concurrency there as well. This is slightly better studied than LLVM's handling of write-write races, but that doesn't say much (as the latter is entirely unstudied to my knowledge).

@steveklabnik
Member

Triage: I don't believe there have been any changes here.

@m-ou-se
Member

m-ou-se commented Jan 7, 2022

One issue with these intrinsics is that they use LLVM's "unordered" atomic memory ordering, which Rust currently does not have, and I have some reservations about adding it.

WG21's P1478R6 might be interesting. It proposes adding to C++ an atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy which, unlike the LLVM intrinsics, both take a memory ordering as an argument. The only valid options are Relaxed and Acquire/Release. It comes down to an unordered read followed by (or write preceded by) an atomic fence, which is exactly what you want in cases like a seqlock. We could use that design in Rust as well, to avoid the need for a new memory ordering.
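
A speculative Rust-flavored rendering of that design, written here as a per-byte relaxed copy plus a fence (the name, signature, and relaxed-loop body are illustrative, not an accepted API; an intrinsic-backed version would presumably generate better code):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

/// Speculative sketch mirroring P1478's atomic_load_per_byte_memcpy: copy each byte with
/// a relaxed atomic load, then issue the requested fence (Acquire in the seqlock-reader
/// case; Relaxed means no fence, matching the paper).
///
/// Safety: `src` must be valid for `count` bytes of reads and `dst` for `count` bytes of writes.
pub unsafe fn atomic_load_per_byte_memcpy(dst: *mut u8, src: *const u8, count: usize, order: Ordering) {
    for i in 0..count {
        unsafe {
            let b = (*(src.add(i) as *const AtomicU8)).load(Ordering::Relaxed);
            dst.add(i).write(b);
        }
    }
    if order != Ordering::Relaxed {
        fence(order);
    }
}
```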

@RalfJung
Member

RalfJung commented Jan 7, 2022

Yeah that sounds like a great solution. :)

@m-ou-se
Member

m-ou-se commented Jan 7, 2022

Cool. Time to add those intrinsics and add this API as unstable to play around with then. :)

Not sure if I'll have time for that soon. So if anyone else has, please do! ^^

@m-ou-se
Member

m-ou-se commented Jan 8, 2022

@Amanieu mentions here (#59155 (comment)):

LLVM's element-wise atomic operations actually generate very poor code in practice since they are always lowered to a call to the intrinsic. You can get much better codegen by using volatile reads and writes instead, which have similar (but stricter) guarantees.

Does that mean we could provide these atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy but implement them without using those LLVM intrinsics, and instead use volatile operations, since we know LLVM makes the guarantees we need on them? We'd still add a fence and take a memory ordering argument, to make sure it all fits in the existing memory model, but the implementation would make use of some extra guarantees LLVM makes.

@RalfJung
Member

RalfJung commented Jan 8, 2022

instead use volatile operations since we know llvm makes the guarantees we need on them?

How do we know that?
I don't think LLVM makes any guarantees about volatile accesses that race with atomic accesses.

@RalfJung
Member

RalfJung commented Jan 8, 2022

LLVM's element-wise atomic operations actually generate very poor code in practice since they are always lowered to a call to the intrinsic.

That sounds like something that is fixable in LLVM? (And if these operations are really adopted by C/C++ and used for seqlocks, they will want to do that anyway.)

@workingjubilee workingjubilee added the A-atomic Area: Atomics, barriers, and sync primitives label Aug 17, 2022
@joshlf
Contributor Author

joshlf commented Oct 14, 2022

Cool. Time to add those intrinsics and add this API as unstable to play around with then. :)

@m-ou-se, could you clarify which intrinsics and what API surface you'd like to see added here? I'd like to post this issue on This Week in Rust's CFP section to see if we can find someone to implement it.
