[RFC] AtomicPerByte (aka "atomic memcpy") #3301

`text/3301-atomic-memcpy.md` (291 additions, 0 deletions)

- Feature Name: `atomic_memcpy`
- Start Date: 2022-08-14
- RFC PR: [rust-lang/rfcs#3301](https://github.com/rust-lang/rfcs/pull/3301)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary

This is a proposal to add `AtomicPerByte<T>`, to represent _tearable atomics_.
This makes it possible to properly implement a _sequence lock_ in Rust.

# The Problem

It's currently not possible to implement an efficient and perfectly
(theoretically) correct sequence lock in Rust.

Unlike most locking mechanisms, a sequence lock doesn't prevent a race
to access the data it protects.
Instead, it detects a race only after the load operation already happened,
and retries it if the load operation raced with a write operation.

A sequence lock in Rust looks something like this:

```rust
// Incomplete example

pub struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<T>,
}

unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    /// Safety: Only call from one thread.
    pub unsafe fn write(&self, value: T) {
        self.seq.fetch_add(1, Relaxed);
        write_data(&mut self.data, value, Release);
        self.seq.fetch_add(1, Release);
    }

    pub fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Acquire);
            let data = read_data(&self.data, Acquire);
            let s2 = self.seq.load(Relaxed);
```
@RalfJung (Member), Aug 20, 2022:

There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the load here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case is that if there is a concurrent write, and if any
part of data reads from that write, then we have a release-acquire pair, so then we are guaranteed to see at least the first fetch_add from write, and thus we will definitely see a version conflict. OTOH if the s1 reads-from some second fetch_add in write, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)

Member Author:

Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the read_data() part, but the usual (incorrect) read_data implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the read_data() operation.

@ibraheemdev (Member), Aug 23, 2022:

Under a reordering based atomic model (as CPUs use), a release load makes sense and works. Release loads don't really work unless they are also RMWs (fetch_add(0)) under the C11 model.

Member:

Yeah, the famous seqlock paper discusses "read dont-modify write" operations.

```rust
            // (read(), continued from the block above)
            if s1 & 1 == 0 && s1 == s2 {
                return unsafe { assume_valid(data) };
            }
        }
    }
}
```

The `write_data` and `read_data` calls can happen concurrently.
The `write` method increments the counter before and after,
such that the counter is odd during `write_data`.
The `read` method repeats `read_data` until
the counter is even and has the same value both before and after the read.
This way, `assume_valid` is only ever called on data that was
not the result of a race.

A big question is how to implement `write_data`, `read_data`, and `assume_valid`
in Rust in an efficient way while satisfying the memory model.

The somewhat popular `seqlock` crate and similar implementations found in the ecosystem
all use a regular non-atomic write (preceded by an atomic fence) for writing,
and `ptr::read_volatile` (followed by an atomic fence) for reading.
This "works" "fine", but is technically undefined behavior.

The C++ and Rust memory models don't allow data races,
so they don't allow a data race to be detected after the fact;
by then, it's too late.

All of the data would have to be written and read through
atomic operations to prevent a data race.
We don't need the atomicity of the data as a whole though;
it's fine if there's tearing, since we restart on a race anyway.

Additionally, memory fences technically only "interact" with atomic operations, not with volatile operations.

# The C++ Solution

C++'s [P1478] proposes the addition of these two functions to the C++ standard
library to solve this problem:

```cpp
void *atomic_load_per_byte_memcpy(void *dest, const void *source, size_t, memory_order);
void *atomic_store_per_byte_memcpy(void *dest, const void *source, size_t, memory_order);
```

The first one is effectively a series of `AtomicU8::load`s followed by a memory fence,
while the second one is basically a memory fence followed by a series of `AtomicU8::store`s.
Except the implementation can be much more efficient.
The implementation is allowed to load/store the bytes in any order,
and doesn't have to operate on individual bytes.

Member:

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.

Contributor:

For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.

I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Member Author:

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

> I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are basically as:

for (size_t i = 0; i < count; ++i) {
  reinterpret_cast<char*>(dest)[i] =
      atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);

and

atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
  atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
      reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}

Member:

> A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

> In the C++ paper they are basically as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access.


Member:

What are the rules for mixed accesses? Can I use an atomic memcpy to access memory that is, at the same time, also accessed via an AtomicU32? AFAIK C++ rules out such mixed accesses, so I assume the answer is no, but that means we need to carefully document this.

Also see rust-lang/miri#2303.

Member:

I think mixed accesses have to be forbidden still. Even if we want to support them in a future memory model (unclear), I believe the C++ spec for this was pretty careful to forbid them.

Currently allowing them will also break things like tsan, if nothing else... and the fact that they're also more or less forbidden by even Intel's arch manual makes allowing them in the memory model hard to argue for.

Anyway, the current API doesn't allow anything like this, but I suppose I agree this should be documented as intentional.

Member:

imho mixed atomics should be fine...wasm and javascript have them after all, even on intel.

imho C++ doesn't have them because they didn't want to go through the hassle of specifying how they'd work and C++ has TBAA, making mixing between two non-char[1] types illegal even with normal load/stores on a single thread.

Footnotes

1. and unsigned char and std::byte_t iirc

Member:

Mixed atomics also don't work with locks, right? C++ allows locks for its atomic implementation.

Member:

> imho mixed atomics should be fine...

From the x86 architecture manual 3A/I 8.1.2.2:

> Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.

It does work in practice though, for the most part. I'm not sure we want to rely on this given that the architecture manual says not to do it.

Member:

I agree for now we should keep ruling them out, because I assume that's what C++ does -- and that should probably be explicitly documented for atomic memcpy, because that operation is much more likely to be used in unsafe type-punning code than the regular Atomic types.

> imho mixed atomics should be fine...wasm and javascript have them after all, even on intel.

Would be interesting to see what wasm/JS engines do with them on intel, if they just ignore what intel says in their manual or if they took some precautions. But that is probably a separate discussion.

Contributor:

> AFAIK C++ rules out such mixed accesses

The proposed atomic memcpy in C++ takes void*, so typed memory is of no help here. The x86 restriction might have slipped through their meetings, otherwise the proposal would've at least mentioned it. I've started a discussion on their mailing list https://lists.isocpp.org/std-proposals/2023/04/6235.php. If they can convince Intel to drop that line, good for us too.

Contributor:

> It does work in practice though, for the most part. I'm not sure we want to rely on this given that the architecture manual says not to do it.

Apparently this is mostly reliable: rust-lang/unsafe-code-guidelines#345

The memory order can only be relaxed, acquire (for load), and release (for store).
Sequentially consistent ordering for these operations is disallowed,
since it's not obvious what that means for these tearable operations.
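In Rust terms, the semantics described above correspond roughly to the following sketch, mirroring the C++ paper's reference implementation quoted in the comments; the function names and signatures here are invented for illustration, and an actual implementation would be free to use wider (and reordered) accesses:

```rust
// Illustrative sketch of the per-byte semantics only; not a real implementation.
use std::sync::atomic::{fence, AtomicU8, Ordering};

unsafe fn load_per_byte_memcpy(dest: *mut u8, src: *const u8, count: usize, order: Ordering) {
    for i in 0..count {
        // Every byte is loaded with an individual relaxed atomic load.
        let byte = unsafe { &*(src.add(i) as *const AtomicU8) }.load(Ordering::Relaxed);
        unsafe { dest.add(i).write(byte) };
    }
    if order != Ordering::Relaxed {
        fence(order); // Acquire fence after the loads.
    }
}

unsafe fn store_per_byte_memcpy(dest: *mut u8, src: *const u8, count: usize, order: Ordering) {
    if order != Ordering::Relaxed {
        fence(order); // Release fence before the stores.
    }
    for i in 0..count {
        // Every byte is stored with an individual relaxed atomic store.
        let byte = unsafe { src.add(i).read() };
        unsafe { &*(dest.add(i) as *const AtomicU8) }.store(byte, Ordering::Relaxed);
    }
}
```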

# The Rust Solution

While C++'s solution can be easily copy-pasted into Rust with a nearly identical signature,
it wouldn't fit with the rest of our atomic APIs.

All our atomic operations happen through the `Atomic*` types,
and we don't have atomic operations that operate on raw pointers.
(Other than as unstable intrinsics.)

Adding this functionality as a variant on `copy_nonatomic`, similar to the C++ solution,
would not be very ergonomic and can easily result in subtle bugs causing undefined behavior.

Instead, I propose to add an `AtomicPerByte<T>` type
similar to our existing atomic types: a `Sync` storage for a `T`
that can be written to and read from by multiple threads concurrently.

The `SeqLock` implementation above would use this type instead of an `UnsafeCell`.
It'd no longer need an unsafe `Sync` implementation,
since the `AtomicPerByte<T>` type can be shared between threads safely.

This type has a (safe!) `store` method consuming a `T`,
and a (safe!) `load` method producing a `MaybeUninit<T>`.
The `MaybeUninit` type is used to represent the potentially invalid state
the data might be in, since it might be the result of tearing during a race.

Only after confirming that there was no race and the data is valid
can one safely use `MaybeUninit::assume_init` to get the actual `T` out.

```rust
pub struct SeqLock<T> {
    seq: AtomicUsize,
    data: AtomicPerByte<T>,
}

impl<T: Copy> SeqLock<T> {
    /// Safety: Only call from one thread.
    pub unsafe fn write(&self, value: T) {
        self.seq.fetch_add(1, Relaxed);
        self.data.store(value, Release);
        self.seq.fetch_add(1, Release);
    }

    pub fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Acquire);
            let data = self.data.load(Acquire);
            let s2 = self.seq.load(Relaxed);
            if s1 & 1 == 0 && s1 == s2 {
                return unsafe { data.assume_init() };
            }
        }
    }
}
```

# Full API Overview

The `AtomicPerByte<T>` type can be thought of as
the `Sync` (data race free) equivalent of `MaybeUninit<T>`.
It can contain a `T`, but it might be invalid in various ways
due to concurrent store operations.
Its interface resembles a mix of the interfaces of `MaybeUninit` and the atomic types.

```rust
#[repr(transparent)]
struct AtomicPerByte<T> { inner: UnsafeCell<MaybeUninit<T>> }

unsafe impl<T: Send> Sync for AtomicPerByte<T> {}

impl<T> AtomicPerByte<T> {
    pub const fn new(value: T) -> Self;
    pub const fn uninit() -> Self;

    pub fn store(&self, value: T, ordering: Ordering);
    pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>;

    pub fn store_from(&self, src: &MaybeUninit<T>, ordering: Ordering);
    pub fn load_to(&self, dest: &mut MaybeUninit<T>, ordering: Ordering);

    pub fn store_from_slice(this: &[Self], src: &[MaybeUninit<T>], ordering: Ordering);
    pub fn load_to_slice(this: &[Self], dest: &mut [MaybeUninit<T>], ordering: Ordering);

    pub const fn into_inner(self) -> MaybeUninit<T>;

    pub const fn as_ptr(&self) -> *const T;
    pub const fn as_mut_ptr(&self) -> *mut T;

    pub const fn get_mut(&mut self) -> &mut MaybeUninit<T>;
    pub const fn get_mut_slice(this: &mut [Self]) -> &mut [MaybeUninit<T>];

    pub const fn from_mut(value: &mut MaybeUninit<T>) -> &mut Self;
    pub const fn from_mut_slice(slice: &mut [MaybeUninit<T>]) -> &mut [Self];
}

impl<T> Debug for AtomicPerByte<T>;
impl<T> From<MaybeUninit<T>> for AtomicPerByte<T>;
```

Note how the entire interface is safe.
All potential unsafety is captured by the use of `MaybeUninit`.

The load functions panic if the `ordering` is not `Relaxed` or `Acquire`.
The store functions panic if the `ordering` is not `Relaxed` or `Release`.
The slice functions panic if the slices are not of the same length.
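As an illustration, here is how the slice operations might be used. This is hypothetical code against the proposed API, which does not exist yet (the import path is likewise an assumption):

```rust
// Hypothetical usage of the proposed API.
// use std::sync::atomic::AtomicPerByte; // assumed path, as proposed by this RFC
use std::mem::MaybeUninit;
use std::sync::atomic::Ordering::{Acquire, Release};

fn publish(shared: &[AtomicPerByte<u64>], values: &[MaybeUninit<u64>]) {
    // Panics if the slice lengths differ or if the ordering is not Relaxed/Release.
    AtomicPerByte::store_from_slice(shared, values, Release);
}

fn snapshot(shared: &[AtomicPerByte<u64>], out: &mut [MaybeUninit<u64>]) {
    // The loaded bytes may be torn; a caller (e.g. a seqlock) must verify that no
    // concurrent write happened before calling assume_init on the elements.
    AtomicPerByte::load_to_slice(shared, out, Acquire);
}
```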

# Drawbacks

- In order for this to be efficient, we need an additional intrinsic hooking into
special support in LLVM. (Which LLVM needs to have anyway for C++.)

Member:

How do you plan to implement this until LLVM implements this?

I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as yet unmerged C++ proposal is implemented in LLVM in the future, that seems to be a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Member Author:

I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly using an efficient inline assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.


- It's not immediately obvious this type behaves like a `MaybeUninit`,
making it easy to forget to manually drop any values that implement `Drop`.

This could be solved by requiring `T: Copy`, or by using a better name for this type. (See alternatives below.)

Very clear documentation might be enough, though.
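  To illustrate the hazard, here is hypothetical usage of the proposed type (which does not exist yet):

```rust
// Hypothetical example of the drop hazard described above.
use std::sync::atomic::Ordering::Release;

fn leak_example() {
    let cell = AtomicPerByte::new(String::from("first"));
    // Overwrites the old String without dropping it, leaking its allocation,
    // just like overwriting the contents of a MaybeUninit<String> would.
    cell.store(String::from("second"), Release);
    // The stored value is also never dropped when `cell` goes out of scope,
    // unless it is extracted explicitly:
    drop(unsafe { cell.into_inner().assume_init() });
}
```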

- `MaybeUninit<T>` today isn't as ergonomic as it should be.

For a simple `Copy` type like `u8` it might be nicer to be able to use types like `&[u8]`
rather than `&[MaybeUninit<u8>]`, etc.
(But that's a larger problem affecting many other things, like `MaybeUninit`'s interface,
`Read::read_buf`, etc. Maybe this should be solved separately.)

# Alternatives

- Instead of a type, this could all be just two functions on raw pointers,
such as something like `std::ptr::copy_nonoverlapping_load_atomic_per_byte`.

This means having to use `UnsafeCell` and more unsafe code wherever this functionality is used.

It'd be inconsistent with the other atomic operations.
We don't have e.g. `std::ptr::load_atomic` that operates on pointers either.

@Pointerbender, Dec 14, 2022:

It would still be convenient to have both the AtomicPerByte type and the two functions on raw pointers. This would allow for additional use cases, such as:

  • This lets programmers choose if they want to use just a Seqlock or combine a Seqlock with an RwLock to form a hybrid lock (for example).
  • The Seqlock version counter and the protected data are able to live on separate cache lines, which reduces false sharing while a reader is repeatedly trying to read the version counter in a loop (in cases where the version counter is checked for writers before doing a potentially racy read).
  • The protected data memory layout is unconstrained by the lock, e.g. elements in a contiguous array can each be associated with individual locks stored in a separate array, without having to account for this in the data array memory layout, which could otherwise lead to additional unnecessary implicit padding if the elements and the version counter have a different size.

@Pointerbender, Dec 15, 2022:

Thinking more about this, I believe I got two things mixed up! The AtomicPerByte type already allows for all these use cases, although there is another use case for which it is not clear whether it is covered by AtomicPerByte:

Does AtomicPerByte make any guarantees about whether its load_to_slice/store_from_slice methods make element-wise atomic copies of T or does it do an atomic copy for the entire slice at once? The latter may allow for a more efficient atomic memcpy implementation if the slice happens to be word-aligned even though T is not (and the slice length is also a multiple of the word-alignment), but then it would be unsound to call methods like <[AtomicPerByte<T>]>::split_at that can affect the alignment of the slice if T is not word-aligned. If AtomicPerByte would promise to only make element-wise atomic copies in this case, then this yields a (very advanced) use case for the two separate pointer methods, which could only be soundly called if the programmer can uphold that there will be no multiple racing atomic memcpys that:

  1. operate on an overlapping memory region, and
  2. for which one of the copies uses a non-word-aligned offset or a length which is not a multiple of the word size.

otherwise this could lead to mixing non-perfectly-overlapping atomic accesses in the memcpy implementation which might be UB (assuming that atomic memcpy uses the machine word size as the largest granularity for a single instruction, otherwise replace "word size" in this comment with the maximum granularity with which memcpy can theoretically operate - EDIT: it seems like the idea for atomic SIMD operations has been floated before and there are 128-bit atomic instructions like core::arch::x86_64::cmpxchg16b, suggesting that an atomic memcpy could choose to use larger than machine word copy granularity).

@RalfJung (Member), Dec 21, 2022:

I would expect AtomicPerByte to make per-byte atomic copies, as its name indicates and as the RFC specifies. That means for elements that are larger than 1 byte, this is not even atomic for individual elements, let alone the entire slice.

Reply:

With "element-wise atomic copy" I mean whether the internal implementation of AtomicPerByte will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies). The difference may matter for the alignment of the pointer that is passed to the atomic memcpy method, which may cause the internal implementation of the atomic memcpy method to choose a different atomic word size and/or alignment depending on that pointer's alignment, leading to potentially mixing non-perfectly-overlapping atomic accesses in the face of safe methods like <[AtomicPerByte<T>]>::split_at, for which the second half of the split slice may have a different alignment in memory and size than the whole slice.

To complete the example of how this could be unsound when AtomicPerByte does just a single atomic memcpy method call internally for the entire slice, consider calling AtomicPerByte::store_from_slice(&whole_slice, ..) and AtomicPerByte::load_to_slice(whole_slice.split_at(1).1, ..)) simultaneously where whole_slice is a slice of bytes aligned to the word size or largest SIMD size.

If "perform a single atomic memcpy method call for this slice of T" is not a supported use case of AtomicPerByte, then maybe that could be an argument to also include the atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy methods in the public API proposed by the RFC? (currently it is phrased to exclude exposing those methods)

@RalfJung (Member), Dec 21, 2022:

With "element-wise atomic copy" I mean whether the internal implementation of AtomicPerByte will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies).

I don't think any of that is even defined in the spec; these are all unspecified implementation details.

The only alignment that load_to_slice and store_from_slice can assume is the one given by the type: this is aligned according to [Self], which has the same alignment as Self. Also clearly these methods may not perform any (Abstract Machine visible) memory access outside of the slice they are handed.

I feel like I don't understand the question, since this seems obvious.

Or are you repeating #3301 (comment) here? I am assuming this thread is disjoint from the one where we are discussing the x86 "concurrent atomic accesses to the not-perfectly-overlapping locations" problem.

Reply:

I think the question here is different than the linked comment though touching on similar areas. By my understanding it is:

Can AtomicPerByte::store_from_slice use larger atomic stores to write multiple elements at once or not. (The same argument applies to loads but for brevity I'm limiting my explanation to stores)

Consider the following situation, explained rather much but i'd rather overexplain than introduce more confusion by underexplaining:

static ARRAY: [AtomicPerByte<u16>; LEN] = [AtomicPerByte::new(0); LEN];

fn thread_1() {
    let new = [MaybeUninit::new(1u16); LEN];
    AtomicPerByte::store_from_slice(&ARRAY, &new, Ordering::Release);
}

If store_from_slice can make larger stores, then it could be implemented as an optimized memcpy, which may do individual stores larger than the size of the individual elements. For instance:

  • write 0-7 bytes at the start and end of the slice using one-byte atomic stores
  • write the middle (now 8-byte aligned) part of the slice with 64-byte atomic stores

However, this is unsound, because another thread could store to a non-exactly overlapping slice:

fn thread_2() {
    let new = [MaybeUninit::new(2u16); 3];
    let lower = &ARRAY[0..3];
    AtomicPerByte::store_from_slice(&lower, &new, Ordering::Release);
}

In the case that ARRAY happened to be 8-byte aligned (and LEN >= 4), then thread 1 will write against the beginning of the array with an 8-byte store, and thread 2 will write against the beginning of the array with a 1-byte store, leading to undefined behavior on x86_64.

store_from_slice (and load_to_slice), on platforms like x86_64 that disallow non-exactly-overlapping atomic accesses, cannot be implemented with atomic stores that overlap multiple elements, and thus are limited in store sizes by the size of the element.

Note that on x86_64 I believe this problem is only true for Release stores (and Acquire loads) because Relaxed operations compile to non-atomic stores which are allowed to be non-exactly-overlapping (I think?? I'm not confident in this part.)

Phrased differently:

store_from_slice cannot write to (Abstract Machine visible) memory outside the slice (already expected). But furthermore it must take into account that where atomic operations cannot non-exactly overlap, it must make the same-size atomic operations for any slice, and so must do its writes as-if it does individual AtomicPerByte::stores for each element. And furthermore it must actually do this in reality (paying a performance cost versus larger writes when its not actually required) for at least x86_64 with Acquire or Release.

Thus,

If "perform a single atomic memcpy method call for this slice of T" is not a supported use case of AtomicPerByte, then maybe that could be an argument to also include the atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy methods in the public API proposed by the RFC? (currently it is phrased to exclude exposing those methods)

is an argument to add additional functions

unsafe fn atomic_load_per_byte_memcpy<T>(src: &[AtomicPerByte<T>], dest: &mut [T], ordering: Ordering);
unsafe fn atomic_store_per_byte_memcpy<T>(src: &[T], dest: &[AtomicPerByte<T>], ordering: Ordering);

that have the additional requirement that no non-exactly-overlapping slice of the provided slice of AtomicPerBytes is accessed concurrently, thereby enabling the optimization to use larger atomic writes.

Contributor:

> leading to undefined behavior on x86_64

Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics is fine: rust-lang/unsafe-code-guidelines#345 (comment). It may tear across cache lines (which is fine) but will not put the CPU into an undefined state. This situation is unlikely to change given that need to keep C++'s byte-wise memcpy working too.

Reply:

> Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics is fine

While this is true, in that linked issue and elsewhere in this PR this has been discussed, and my reading of these discussions is that there is a disposition to respect the manual and treat it as undefined, even if it works in practice. If that changes, then that does solve this issue, but otherwise this remains a problem (though it could be not solved at all, or a solution could be postponed).

Contributor:

> this has been discussed

That mailing list post was sent on 6 April, all prior discussions were done without this piece of information. The only comment on this RFC on non-exactly-overlapping access since then was yours, so I don't think we can say there's a disposition yet, especially in light of the new information.

In general I don't think "stick to the documentation" is a good approach when dealing with proprietary platforms, otherwise we can't ship anything for macOS at all. If the vendor is receptive, we should nag them to change the documentation like @m-ou-se has done before with MS MicrosoftDocs/sdk-api#447, but even if they don't change it, we shouldn't have our hands tied by it.

In this case, the C++ proposal is committed to allowing multi-byte access optimisations, and they are privy to more information from vendors, so I think sticking to their approach should be safe.

Reply:

Yes, it's a totally reasonable (and probably good) resolution to this and other issues. I'm not trying to argue against (or necessarily for) that, but rather to clearly explain the issue that was raised in this particular sub thread, over which there seemed to be confusion that was not resolved.


- Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about
duplicating non-`Copy` data.

There are valid use cases with non-`Copy` data, though, such as [in crossbeam-deque](https://github.com/crossbeam-rs/crossbeam/blob/2d9e7e0f81d3dd3efb1975b6379ea8b05fcf9bdd/crossbeam-deque/src/deque.rs#L60-L78).
Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies).

- Leave this to the ecosystem, outside of the standard library.

Since this requires special compiler support, a community crate would
have to use (platform specific) inline assembly
or (probably technically unsound) hacks like volatile operations.

- Use a new `MaybeTorn<T>` instead of `MaybeUninit<T>`.

`AtomicPerByte` doesn't _have_ to support uninitialized bytes,
but it does need a wrapper type to represent potentially torn values.

If Rust had a `MaybeTorn<T>`, we could make it possible to load types like `[bool; _]` or even `f32` without any unsafe code,
since, for those types, combining bytes from different values always results in a valid value.

However, the use cases for this are very limited, it would require a new trait to mark the types for which this is valid,
and it makes the API a lot more complicated or verbose to use.

Also, such an API for safely handling torn values can be built on top of the proposed API,
so we can leave that to a (niche) ecosystem crate.
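  For instance, such a crate might look roughly like this; the `TearSafe` trait and everything else below is hypothetical and only sketches the idea on top of the proposed API:

```rust
// Hypothetical sketch of a "torn values are still valid" wrapper built on top of
// the proposed AtomicPerByte API; the TearSafe trait is invented for illustration.
use std::sync::atomic::Ordering;

/// Safety: every possible mixture of bytes taken from valid values of Self must
/// itself be a valid value of Self (true for e.g. u32, f32, or [bool; N]).
pub unsafe trait TearSafe: Copy {}

unsafe impl TearSafe for u32 {}
unsafe impl TearSafe for f32 {}
unsafe impl<const N: usize> TearSafe for [bool; N] {}

/// A safe load: the result may be torn, but any torn value is still valid.
pub fn load_torn<T: TearSafe>(cell: &AtomicPerByte<T>, ordering: Ordering) -> T {
    unsafe { cell.load(ordering).assume_init() }
}
```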

- Don't allow an uninitialized state.

Even if we use `MaybeUninit<T>` to represent a 'potentially torn value',
we could still attempt to design an API where we do not allow an uninitialized state.

It might seem like that results in a much simpler API with `MaybeUninit<T>` replaced by `T` in
methods like `into_inner()` and `get_mut()`, but that is not the case:

As long as `store()` can be called concurrently by multiple threads,
it is not only the `load()` method that can result in a torn value,
since the `AtomicPerByte<T>` object itself might end up storing a torn value.

Therefore, even if we disallow uninitialized values,
every method will still have `MaybeUninit<T>` in its signature,
at which point we lose basically all benefits of removing the uninitialized state.

Removing the uninitialized state does result in a big downside for users who need to add that state back,
as the interface of an `AtomicPerByte<MaybeUninit<T>>` would result in doubly wrapped `MaybeUninit<MaybeUninit<T>>` in many places,
which can be quite unergonomic and confusing.

# Unresolved questions

- Should we require `T: Copy`?

There might be some valid use cases for non-`Copy` data,
but it's easy to accidentally cause undefined behavior by using `load`
to make an extra copy of data that shouldn't be copied.
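  For example (hypothetical code, using the proposed type with a non-`Copy` payload):

```rust
// Hypothetical illustration of the hazard with non-Copy data.
use std::sync::atomic::Ordering::Acquire;

fn double_free_hazard(cell: &AtomicPerByte<Box<u32>>) {
    // Both loads are byte-wise copies of the same Box.
    let a = cell.load(Acquire);
    let b = cell.load(Acquire);
    // If both copies are assumed valid and dropped, the same allocation is
    // freed twice: undefined behavior.
    let _first: Box<u32> = unsafe { a.assume_init() };
    let _second: Box<u32> = unsafe { b.assume_init() };
}
```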

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Comment:

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.

Member:

AtomicPerByteMaybeUninit or AtomicPerByteManuallyDrop to also resolve the other concern around dropping? Those are terrible names though...


[P1478]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1478r7.html