
I need to do an oob vector load. How? #2

Open
brson opened this issue Jul 5, 2018 · 66 comments
Labels
A-memory Topic: Related to memory accesses S-pending-design Status: Resolving this issue requires addressing some open design questions

Comments

@brson

brson commented Jul 5, 2018

As an optimization during a buffer search, I need (or at least very much want) to load that buffer into a SIMD vector, even when the buffer doesn't fit into the vector. E.g. I might have a 31-byte buffer that can be efficiently searched with a 32-byte-wide AVX2 vector.

From a machine perspective, I don't see this as a problem, as long as the load doesn't extend beyond the current page; from LLVM's perspective this seems like UB.

I'd really like to be able to write this code in Rust and not have to use assembly.

Here's an example of this pattern:

    #[inline(always)]
    unsafe fn do_tail_clever(needle: u8, p: *const u8, len: isize,
                             i: isize, q: __m256i) -> Option<usize> {
        let rem = len - i;
        debug_assert!(rem < 32);

        // Check if the 32-byte load is within the current page
        let page_alignment = 4096;
        let page_mask = !(page_alignment - 1);
        let current_p = p.offset(i) as usize;
        let avx_read_end = current_p + 32;
        let next_page = (current_p & page_mask) + page_alignment;

        if likely(avx_read_end <= next_page) {
            let x = _mm256_loadu_si256(p.offset(i) as *const __m256i);
            let r = _mm256_cmpeq_epi8(x, q);
            let z = _mm256_movemask_epi8(r);
            let garbage_mask = {
                let ones = u32::max_value();
                let mask = ones << rem;
                let mask = !mask;
                mask as i32
            };
            let z = z & garbage_mask;
            if z != 0 {
                return off(i, z);
            }

            return None;
        }

        // Slow path
        do_tail_simple(needle, p, len, i, q)
    }

It loads beyond the array, does vector operations on it, then disregards the oob bytes with a mask.
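The tail-masking arithmetic can be checked in scalar form, independent of the intrinsics (a sketch; `garbage_mask` is a name introduced here, not from the snippet above):

```rust
// Scalar version of the mask computation from the snippet above: for a tail
// of `rem` valid bytes (rem < 32), build a mask whose low `rem` bits are set,
// so that compare bits coming from the out-of-bounds lanes get discarded.
fn garbage_mask(rem: u32) -> u32 {
    debug_assert!(rem < 32);
    !(u32::MAX << rem)
}
```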

I'm hopeful that there is some mechanism to tell LLVM to 'forget' what it knows about this pointer, 'fooling' the optimizer into not messing with it.

From the LLVM aliasing rules, there is some language that makes me hopeful:

An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM.

So there is a class of pointers that can operate on arbitrary memory (those that don't come from LLVM). That suggests to me that I could e.g. send my pointer through assembly or some other black-box function to 'clean' it, maybe. On the other hand, calling into any function, or even into inline asm, imposes extra instructions that more-or-less defeat the optimization (inline asm in LLVM seems to always spill registers). Then again, that sentence also says "such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM".

I'm not sure how much 'wiggle-room' there is. Is a malloc'd array "provided by LLVM"? What are the consequences of disobeying this "shall not"?

Even if there's no in-language solution and it is technically UB, I am hopeful that I can do this thing without LLVM messing with my codegen.

cc @nikomatsakis writing this here per your request.

@brson brson changed the title I need to do an oob vector read. How? I need to do an oob vector load. How? Jul 5, 2018
@brson
Author

brson commented Jul 5, 2018

One thing I could do here is track the capacity of the original vector, and only do the oob load if there's enough capacity. That would definitely reduce how often this could hit the fast path, but not sure how much.

Edit: NVM, this routine never sees the Vec capacity - it operates only on slices.

@RalfJung
Member

RalfJung commented Jul 8, 2018

A very related question has recently come up on Stack Overflow. Someone suggested reading a full u32 through a u8 pointer, making sure the read never crosses page boundaries so there can't be a SEGFAULT.

As already discussed there, I think there are two related but distinct problems before we can even start taking Rust's own rules into account: you are potentially performing accesses outside of any allocation (as you already mentioned), and if not, then you may be racing with other accesses to the bytes outside of your buffer.

For the out-of-bounds part, that is pretty much entirely in LLVM's hands. Rustc/MIR is not doing anything interesting there, but LLVM certainly does (for example, when you are accessing some pointer x + 3 and you have another pointer that LLVM knows points into an object of size <= 2, it will assume these accesses do not alias). You'd have to find a way to work around that, preferably something sanctioned by LLVM. That's probably something that would require discussion on the llvm-dev list. (I am sure the need for this comes up in C as well.)

For data races, Rust officially is using the C11 memory model. Read-write races are immediate UB under that model. So, if the extra byte you are accessing is actually allocated and currently accessed by some other thread, you would introduce UB. However, LLVM says that such read-write races yield undef/poison (effectively: uninitialized bytes) instead of raising UB. If Rust decided to switch from C11's model to LLVM's, that would enable your use-case if you carefully decorate everything with MaybeUninit to inform Rust that there may be uninitialized data around here.
The trouble is that C11's model is much better studied and much more clearly defined by now.

Only if we solve those two points, our own (Rust-level) aliasing rules even become relevant. I could imagine us following LLVM's lead and making "bad" loads return undef instead of raising UB.

@brson
Author

brson commented Jul 8, 2018

@RalfJung when you say "you may be racing with other accesses to the bytes outside of your buffer", what is the practical impact of that? In what concurrent/atomic scenario will my loads change the outcome for another thread? E.g. making atomic values visible before they should be?

@RalfJung
Member

RalfJung commented Jul 8, 2018

The practical impact is hard to determine. Compilers are allowed to and will perform optimizations that are only valid if non-atomic accesses never have a data race. Let me try to construct an example for how they might break when combining an otherwise correct unsafely implemented library with your code.

For example, in the following C code

    int x = *x_ptr;
    acquire_lock(l);
    int y = *x_ptr;

gcc may and sometimes will replace the last line by int y = x;, which is correct because it knows there cannot be a concurrent write that could change the value behind x_ptr in the meantime. Now imagine a situation where a 32-byte (aligned) buffer (&mut [u8; 32]) is split into a 31-byte buffer (part1: &mut [u8; 31]) and a location (part2: &mut u8) that is put under the control of some unsafely implemented library lib. That library makes the location accessible from multiple threads and uses a lock stored somewhere else to synchronize (like a Mutex but with the data not stored in-band with the lock).
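The buffer split in this scenario can be sketched as follows (the synchronization library itself is imagined, so only the split is shown):

```rust
// Split a 32-byte buffer into a 31-byte part (to be searched with the
// vectorized tail routine) and a 1-byte part (to be handed to the imagined
// cross-thread library). The tail routine would then read past part1's end,
// into the byte owned by part2.
fn split_buffer(buf: &mut [u8; 32]) -> (&mut [u8], &mut [u8]) {
    buf.split_at_mut(31)
}
```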

Now we have something like

let h = lib::put_under_library_control(part2);
something_that_uses_tail_clever(part1);
let val = h.get();

If everything gets inlined, this matches the C code above: tail_clever will read part2 but throw away the result, then h.get() will acquire a lock and read part2 again. The compiler may optimize this to use the result of the first read, assuming there are no data races -- and we got a miscompilation.

Now, this is clearly a very contrived example. But the point is, we cannot just ignore UB due to data races. The only thing we can do is pick different rules and make sure the compiler follows those rules -- LLVM will not perform the optimization outlined above precisely because under LLVM semantics, this read-write race is not UB.


Coming back to the higher level, I think this is an excellent example for why one may prefer the LLVM memory model over the C11 one. seqlocks are another example that causes trouble with the C11 memory model and AFAIK works fine with the LLVM model (though I have not seen an analysis of the latter).
There may be other arguments for the C11 model, e.g. I do not know the situation around DRF theorems (data-race-freedom theorems) for the LLVM model. The C11 model has some pretty strong DRF theorems, saying e.g. that a program that is race-free under sequentially consistent semantics and only uses non-atomic and sequentially consistent accesses does not gain any additional behaviors when considering the full C11 semantics. These theorems ensure that programs not using the weaker access modes do not have to care. I haven't seen such theorems for the LLVM model, but that's just because I haven't seen that model studied very much at all.

@brson
Author

brson commented Jul 8, 2018 via email

@RalfJung
Member

RalfJung commented Jul 9, 2018

Yeah, unfortunately these inlining hints don't actually change the program semantics -- they affect what the compiler will do, but not what it could do. From a correctness stand-point, I do not know of a way to make inlining hints "mean" anything.

@avadacatavra avadacatavra added the A-memory Topic: Related to memory accesses label Aug 24, 2018
@gnzlbg
Contributor

gnzlbg commented Nov 27, 2018

@Amanieu would like to be able to do oob atomic loads as well: rust-lang/rust#32976 (comment)

That's required for correctness, and is not an optimization AFAICT.

@RalfJung
Member

@Amanieu what is that doing?

@Amanieu
Member

Amanieu commented Nov 27, 2018

It is emulating 8/16/32-bit atomic operations on older ARM architectures (without native atomic support) using a kernel-provided 32-bit cmpxchg function.
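The shape of that emulation can be sketched with `AtomicU32` standing in for the kernel's 32-bit cmpxchg helper (names and details are mine, not the actual runtime code): an 8-bit atomic swap has to load and compare-exchange the whole containing word, i.e. it necessarily accesses bytes outside the nominal u8.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Emulate an 8-bit atomic swap using only a 32-bit compare-exchange.
// The containing u32 is loaded and CAS'd even though only one byte
// logically belongs to this operation.
fn atomic_swap_u8(word: &AtomicU32, byte_index: usize, new: u8) -> u8 {
    assert!(byte_index < 4);
    let shift = byte_index * 8; // little-endian byte position in the word
    let mask = 0xffu32 << shift;
    let mut old = word.load(Ordering::Relaxed);
    loop {
        // Splice the new byte into the word, leaving the other bytes alone.
        let updated = (old & !mask) | ((new as u32) << shift);
        match word.compare_exchange_weak(old, updated, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return ((old & mask) >> shift) as u8,
            Err(actual) => old = actual,
        }
    }
}
```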

@RalfJung
Member

Does LLVM know that these are 32-bit memory accesses? Code in other translation units, compiled from a different language with different UB (and linked on the assembly level, i.e., in a language where this is not UB), does not have to follow the same rules. Syscalls are an extreme case of "different translation unit".

Only LLVM IR itself is subject to LLVM IR's rules. (Of course there must be some amount of interop, and a shared memory model, but that seems plausible in this case.)

@gnzlbg
Contributor

gnzlbg commented Dec 13, 2018

Talking about doing SIMD loads OOBs:

if you carefully decorate everything with MaybeUninit to inform Rust that there may be uninitialized data around here.

@RalfJung that might be doable, but @brson would need to heavily rewrite their code. Here:

    let p: *const __m256i = /* ptr to allocation smaller than 32 bytes */;
    let x: __m256i = _mm256_loadu_si256(p);

The problem is that core::arch::x86_64::_mm256_loadu_si256 returns an __m256i - not a MaybeUninit<__m256i> (which wouldn't help much), nor a Simd<[MaybeUninit<i64>; 4]>.

If packed_simd supported Simd<[MaybeUninit<i64>; 4]>, one could maybe write:

    let p: *const Simd<[MaybeUninit<i64>; 4]> = /* ptr to allocation smaller than 32 bytes */;
    let x: Simd<[MaybeUninit<i64>; 4]> = ptr::read_unaligned(p);
    // ^^ Is ptr::read_unaligned the right tool for reading memory OOB?

where Simd<[MaybeUninit<i64>; 4]> would support the same API as Simd<[i64 ;4]> (comparisons, arithmetic, bit manipulation, reductions, etc.) but propagating undef.


Implementation-wise, I don't really know how that would work. Adding the API to packed_simd is "trivial", but what LLVM IR should it generate? LLVM vectors are of type <N x T>, but I don't know whether we can put <4 x MaybeUninit_i64> there, and even if we could, whether LLVM could do something meaningful with it. Maybe an attribute, <4 x maybe_undef i64>?

@RalfJung
Member

@gnzlbg Notice that this only "solves" the data-race part. One alternative (not correct in theory but experimentally confirmed to work in practice) is to use volatile reads for the non-atomic maybe-racy reads. LLVM didn't sanction this, and maybe we should have a discussion with them about this. Another alternative might be to use LLVM's monotonic accesses; I am not sure if anybody has experimented with them in Rust yet.

None of this helps with the fact that the accesses are OOB. There is no solution to that other than having explicit support for this from LLVM.

@RalfJung RalfJung added the C-open-question Category: An open question that we should revisit label Aug 14, 2019
gnzlbg added a commit that referenced this issue Aug 27, 2019
rearrange a bit and be more explicit about how our rules interact
@RalfJung
Member

RalfJung commented Nov 7, 2020

@Amanieu and @thomcc recently had a related discussion on Zulip. It seems the general preference is to permit this for volatile accesses, assuming we can get LLVM to sanction that.

My concern with this is that volatile will inhibit optimizations, which seems in opposition to the goal stated in the OP -- to use a vectorized loop for performance. So it might be that only giving volatile accesses "OOB powers" is not enough, we might also need some (opt-in) way to do this for regular accesses.

@chorman0773
Contributor

I'd be concerned with allowing any kind of OOB access (or OOB pointer arithmetic; note that wrapping_add would be implemented as integer arithmetic). The second it can cross into an unreachable object, either you definitely do have undefined behaviour, or way too many optimizations go out the window. There could also be concerns about allowing OOB access, period, as padding could theoretically be manipulated to store internal compiler state when there isn't a chance of it getting overwritten.

@RalfJung
Member

RalfJung commented Nov 7, 2020

note that wrapping_add would be implemented as integer arithmetic

FWIW, it currently is not -- it is implemented as getelementptr without "inbounds". Speaking in terms of LLVM semantics, this preserves provenance, which integer arithmetic will not (assuming LLVM wants to support the usual arithmetic identities).

I am not sure what you mean by "would".

The second it can cross into an unreachable object, either you definately do have undefined behaviour, or way too many optimizations go out the window.

This is exactly why we use getelementptr for wrapping_add: it cannot cross allocation boundaries.

There could also be concerns about allowing OOB Access period, as padding could be theoretically manipulated to store internal compiler state when there isn't a chance of it getting overwritten.

AFAIK we are only talking about reads here. I do not know of a reasonable way to permit OOB writes.

OOB reads would return "uninit" for the OOB part, even if that happens to be in-bounds for another object. This should hopefully suffice to preserve optimizations.

@chorman0773
Contributor

chorman0773 commented Nov 7, 2020

I am not sure what you mean by "would".

In this case, I am referring to the lccc model, in which pointer arithmetic comes straight out of the C and C++ standards.

it cannot cross allocation boundaries.

wrapping_add (or actually it might be called wrapping_offset) can. offset cannot cross allocation boundaries; that's UB. Integer arithmetic from a pointer value on lccc does preserve provenance as long as the equivalent operation applied to the pointer value would have defined behaviour (that is, given p: *mut T, (p as usize + 4*size_of::<T>()) as *mut T would be the same value as p.offset(4) if that expression has defined behaviour; otherwise the result is an invalid pointer).

OOB reads would return "uninit" for the OOB part, even if that happens to be in-bounds for another object.

Returning uninit from OOB may be fine. However, as I have mentioned in #76, for scalar values, uninit in lccc is poisoning (if one byte of a scalar object is uninit, the entire value is uninit). This shouldn't cause issues, at least in lccc, provided the type being read doesn't have any validity requirements. Note that for volatile, this is less of an issue, as volatile accesses are always freezing in lccc (which prevents the poisoning of the entire value, since that occurs on reads and writes).

@RalfJung
Member

RalfJung commented Nov 7, 2020

wrapping_add (or actually it might be called wrapping_offset) can.

No it cannot. Quoting from the docs:

In particular, the resulting pointer remains attached to the same allocated object that self points to. It may not be used to access a different allocated object. Note that in Rust, every (stack-allocated) variable is considered a separate allocated object.

@chorman0773
Contributor

What I mean is that it's valid to use wrapping_add to exceed the allocation; you just can't access outside it. add cannot produce a pointer outside the allocation, full stop. wrapping_add can, but the result cannot be dereferenced (which is why I noted that the implementation of p.wrapping_add(4) would return the value of p.add(4) if the latter is defined, otherwise an invalid pointer; invalid pointers are UB to do much of anything with, or claim much of anything about).

@RalfJung
Member

RalfJung commented Nov 7, 2020

What I mean is that it's valid to use wrapping add to exceed the allocation, you just can't access outside. add cannot get a pointer outside the allocation, full stop. wrapping_add can, but it cannot be derefenced

Ah, that is a terminology difference then. I would say that the pointer you get from wrapping_add never enters another allocation -- its provenance still stays attached with the original allocation. So, it can never "point to another allocation", even if its integer address is inside another allocation.

the implementation of p.wrapping_add(4) would return the value of p.add(4) if it the latter is defined, otherwise an invalid pointer.

That would not be a correct implementation. p.wrapping_add(400).wrapping_sub(400) returns the original pointer, even if the intermediate pointer is out-of-bounds.
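The round-trip behavior of wrapping pointer arithmetic is easy to observe (a minimal check, nothing more):

```rust
// The intermediate pointer is far out of bounds of the original allocation,
// but wrapping pointer arithmetic round-trips to the original pointer
// (same address, and it keeps its provenance).
fn roundtrip(p: *const u8) -> *const u8 {
    p.wrapping_add(400).wrapping_sub(400)
}
```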

@chorman0773
Contributor

That would not be a correct implementation. p.wrapping_add(400).wrapping_sub(400) returns the original pointer, even if the intermediate pointer is out-of-bounds.

lccc also has a reverse round-trip rule to complement the round-trip rule (this also might be required by C++, idk), which says that if x is a value of an appropriately sized integer type U, then (x as *mut T).add(n) as U has the value x + n*size_of::<T>() (provided the former has defined behaviour). These two rules combined make that situation well-defined and correct, even though the intermediate pointer is invalid (note: an invalid pointer is not the same as an invalid value, or particularly "the" invalid value), and they reduce that entire operation to just p.

@comex

comex commented Nov 7, 2020

Returning uninit from OOB may be fine. However, as I have mentioned in #76, for scalar values, uninit in lccc is poisoning (if one byte of a scalar object is uninit, the entire value is uninit). This shouldn't cause issues, at least in lccc, provided the read from type doesn't have any validity requirements.

The OP's use case doesn't just need the load to be non-UB, it needs the load to produce a value where the bits corresponding to in-bounds bytes are correct. So it seems like either you must track uninitializedness on a per-bit level (as LLVM does), or this must be a special kind of load which produces something different from normal uninitialized values.

@RalfJung
Member

RalfJung commented Nov 7, 2020

you must track uninitializedness on a per-bit level (as LLVM does)

I don't think it does... at least, with the proposal to track this via poison, an iX is either fully poison or fully initialized.

However, an [iN x M] has per-element poison tracking.

@RalfJung
Member

If this function would access any bytes outside of the provenance of ptr,

I don't think volatile should be allowed to bypass provenance rules such as Stacked Borrows. Allowing that would inhibit all the optimizations the aliasing rules are meant to enable. This should be strictly about "outside the bounds of an allocation", not "outside the bounds of what provenance says can be done".

How is "conditionally-supported" defined? Is this like "implementation-defined", in that implementations need to state the conditions under which it is supported? If so, what would be something an implementation could say to actually enable OOB accesses?

One main point of a spec is to enable programmers to reason that their code is correct, and I do not think your spec lets them do that. The spec needs to answer the question "as a programmer, what do I need to do to ensure that my program will behave correctly after compilation".

@chorman0773
Contributor

(Reposted because reply by mail works flawlessly)

I don't think volatile should be allowed to bypass provenance rules such as Stacked Borrows. Allowing that would inhibit all the optimizations the aliasing rules are meant to enable.

Fair point, and it could be changed to talk about the same thing. However, isn't the upper-bound of pointer provenance the allocation it points into? Additionally, unspecified and may be uninit is extraordinarily permissive (on the level of an indeterminate value in C, defined as an unspecified value or a trap representation). It wouldn't even have to represent any possible state the byte held when the read occurred, even with other (non-volatile) writes reordered, so this would seem to keep the optimizations intact, aside from reordering the read, which can't be done anyways (as it's volatile).

How is "conditionally-supported" defined

The implementation chooses whether it is supported at all, and documents if and when it is not.

One main point of a spec is to enable programmers to reason that their code is correct, and I do not think your spec lets them do that.

In all cases, you'd need to look at the documentation for the particular compiler, and certainly never use any type that has a validity invariant stricter than u8 or MaybeUninit<u8> (depending on whether or not UCG allows uninit integers). Volatile is probably the easiest to reason about: it's well-defined (from a language perspective) provided it's supported and you don't violate the validity invariant. For non-volatile it's harder. Maybe we could have an implementation-defined "buffer zone" that you can read freely, knowing it won't raise a signal (and thus not be UB for non-volatile). Implementation-defined is really the best that can be done, though.

@RalfJung
Member

Additionally, unspecified and may be uninit is extraordinarily permissive (on the level of an indeterminate value in C, defined as an unspecified value or a trap representation). It wouldn't even have to represent any possible state the byte held when the read occurred, even with other (non-volatile) writes reordered, so this would seem to keep the optimizations intact, aside from reordering the read, which can't be done anyways (as it's volatile).

Good point. Since this is only about reads and it doesn't actually "leak" any information, it is hard to imagine this breaking any optimization.

I guess you are coming from the perspective that the bounds of an allocation are themselves just an expression of provenance? In my mental model, allocations fundamentally have a given size, and the gaps between allocations have no value associated with them at all (not even Uninit). So OOB errors are of a different nature than Stacked Borrows provenance errors. (This is also how things are implemented in Miri). It seems you are viewing OOB as "just another kind of provenance error", and I can see how that view is appealing. However, it is not obvious to me that the correctness proofs we have for provenance-based optimizations in Stacked Borrows will easily carry over to a semantics where provenance may be violated on reads but the read then yields Uninit. Intuitively this makes sense; doing the proof is a different game. ;)

In all cases, you'd need to look at the documentation for the particular compiler

So it seems like you just moved the hard work of specifying OOB loads (such that the above code is allowed) to the compiler. That's not solving the problem, though. I don't think we are done here until we have a proposal for a spec that actually permits the kind of code the OP is asking for. So in terms of your proposal, that would mean not only writing the relevant part of the Rust spec, but also writing the relevant part of the rustc docs that complete the spec to an actually concrete semantics, so that code authors can point to those docs and say "my code is correct because of what it says here".


Also, I noticed your proposal permits a signal to be raised. I don't think that's a good idea, since it makes things observable that really shouldn't be observable. As I said before: "I imagined some language where the programmer has to ensure that the OOB load has no further side-effects on the underlying platform. Usually, the compiler has to prove that a platform load correctly implements an Abstract Machine load; once you go OOB, that responsibility would be shifted to the programmer."

In other words, we basically require a proof from the programmer that a load instruction on the underlying hardware with the given size will correctly implement an Abstract Machine load. Or putting it differently, Behavior is Undefined unless a load instruction on the underlying hardware correctly implements an Abstract Machine load. "Correctly implement" unfortunately depends on the concrete simulation relation used by the implementation in question, but I think we can say for sure that it involves "no side-effects" and "always returns successfully", which rules out signals.

For example, on x86-64 we should be able to say that the load needs to be fully within a page such that there provably is a pointer that is dereferencable for size 1 pointing to the same page. (Optimizations might replace memory by registers, so there might not actually be any physical page, but then the OOB part also has no chance of triggering a signal so we should be good.) When doing the correctness argument for the compiler, this should be sufficient to prove that the load will always complete and never raise a signal. And when reasoning about our code as a programmer, this gives us enough information to actually say for sure that our code will be correct.

In fact, if there are no other conditions required to make such a load work, we could even make the page size implementation-defined and fix everything else. Implementations can still pick a page size of 1 to avoid making any promises. Then if there is a constant like core::mem::PAGE_SIZE, programmers can write code that will work with any implementation. ("Page size" might be a bad term for this as it doesn't have to match physical memory pages; suggestions welcome.)
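Under such a constant (hypothetical: no `PAGE_SIZE` exists in core today), the programmer-side check might look like this sketch:

```rust
// Hypothetical spec-level page size constant; 4096 matches x86-64.
// An implementation wanting to make no promises could pick 1.
const PAGE_SIZE: usize = 4096;

/// Would a `len`-byte load starting at address `addr` stay within one page?
/// (This is the condition the snippet in the OP checks with mask arithmetic.)
fn same_page(addr: usize, len: usize) -> bool {
    len > 0 && addr / PAGE_SIZE == (addr + len - 1) / PAGE_SIZE
}
```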

@chorman0773
Contributor

Also, I noticed your proposal permits a signal to be raised

Only for volatile reads, which are already observable. For non-volatile reads it's UB if the equivalent volatile read would raise a signal. This preserves the optimizations for reordering non-volatile accesses. I don't see how adding the option for volatile reads to trap within defined behaviour would inhibit too many optimizations, as volatile is very limited in how it can be optimized.

that responsibility would be shifted to the programmer

This would apply here, in order to validly perform a non-volatile read, you would have the responsibility of ensuring the volatile read wouldn't trap. The minimum buffer width would provide some of that, by giving a sequence of bytes known to be correct.

I guess you are coming from the perspective that the bounds of an allocation are themselves just an expression of provenance

Kind of, in lccc, they are equivalent (or at least related) concepts, the reachability of a pointer. The reachability of a pointer to an object is defined as the largest sequence of bytes that are part of the object-representation of the largest object pointer-interconvertible with it, and the immediately enclosing array thereof (with some exclusions to permit unique and readonly optimizations). This, and the reachability of any pointer that can be validly created from it, would be the provenance of that pointer in rust terms. So under this model, the bounds of the allocation provides an upper-bound for the reachability, and thus the provenance.

but also writing the relevant part of the rustc docs that complete the spec to an actually concrete semantics

For rustc would the following be good:

Volatile and non-volatile out-of-bounds accesses are supported, provided the pointer is into an allocation of at least one byte and is non-null. For volatile accesses, it is guaranteed not to raise a signal if the accessed byte is in the same page as any byte in that allocation; otherwise, it is not specified whether the access raises an asynchronous SIGSEGV. Pages are aligned to their size. The size of pages is platform-dependent, and is a power of 2.

And then provide examples of page sizes, like x86-64 has 4096 bytes in a page.

Or putting it differently, Behavior is Undefined unless a load instruction on the underlying hardware correctly implements an Abstract Machine load

I think fundamentally, this requires a lot more knowledge than this, and certainly a lot more than what mine would. The implementation is bound to emulate the observable behaviour of the abstract machine unless it contains UB. By shifting the burden of ensuring the evaluation does so onto the programmer, I'd argue you've created a circular case. The implementation is required to perform the access correctly if the implementation performs the access correctly. The underlying hardware is, after all, a part of the implementation. An implementation could "support" it, but then choose a mechanism for emulating the load that is never correct for OOB, and under this idea, that would be valid.

@RalfJung
Member

RalfJung commented Dec 2, 2020

I think fundamentally, this requires a lot more knowledge than this, and certainly a lot more than what mine would. The implementation is bound to emulate the observable behaviour of the abstract machine unless it contains UB. By shifting the burden of ensuring the evaluation does so onto the programmer, I'd argue you've created a circular case.

I have done no such thing, I have just provided a way to "plug in" to what the compiler does so that the user can help the compiler complete its argument.

But that was anyway just the explanation for how to arrive at the proposal I made at the end of my post, which I think is fairly close to what you proposed for the rustc docs. However, by moving everything relevant into the rustc docs you made it impossible to do OOB accesses in Rust code that can be compiled with more than 1 compiler, hence my proposal to put something like a page size all the way into the spec.

@chorman0773
Contributor

chorman0773 commented Dec 2, 2020

hence my proposal to put something like a page size all the way into the spec.

Doing that may work, but it may leave certain kinds of implementations off the table. And, even then, this has the same defect really, except giving a way to express this limit. Although, now that I put that in words, it is kind of growing on me. I'm wondering about a hybrid one, that combines the two. So perhaps I revise the specification as follows for core::ptr::read_volatile:

  • If any byte accessed is not within the provenance of the pointer, then the access is conditionally-supported. If the accessed byte is on the same page as any byte which can be accessed by the pointer, the result is unspecified (and may be uninitialized). Otherwise, the resulting byte is uninitialized or an implementation-defined signal is raised. If the resulting value is invalid for the type of the access, the behaviour is undefined. Pages are a sequence of contiguous bytes of an implementation-defined size, which are aligned to their size.

And then core::mem::PAGE_SIZE:

A value of type usize which is the implementation-defined size (and thus alignment) of pages, or the value 0 to indicate that pages are not distinguished by the implementation. It shall be a power of 2, or the value 0.

Does the above sound good?
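To make the proposal concrete, here is a minimal sketch of how user code could consult such a constant before doing a wide load. `PAGE_SIZE` is a local stand-in for the proposed (non-existent) `core::mem::PAGE_SIZE`, and 4096 is an assumed value, not something the proposal fixes:

```rust
// Sketch of a same-page check against a hypothetical page-size constant.
// `PAGE_SIZE` stands in for the proposed `core::mem::PAGE_SIZE`; 4096 is
// an assumption for illustration only.
const PAGE_SIZE: usize = 4096;

/// Returns true if an access of `len` bytes (len >= 1) starting at
/// address `addr` stays within a single page and hence, under the
/// proposed rules, cannot raise a signal.
fn stays_on_one_page(addr: usize, len: usize) -> bool {
    if PAGE_SIZE == 0 {
        // Sentinel: the implementation does not distinguish pages,
        // so no access can cross a page boundary.
        return true;
    }
    // Pages are aligned to their size, so two addresses are on the same
    // page iff they agree in all bits above the low log2(PAGE_SIZE) bits.
    addr & !(PAGE_SIZE - 1) == (addr + len - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    assert!(stays_on_one_page(4096 - 32, 32)); // ends exactly at the boundary
    assert!(!stays_on_one_page(4096 - 31, 32)); // crosses into the next page
    println!("ok");
}
```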

@RalfJung
Copy link
Member

RalfJung commented Dec 2, 2020

And, even then, this has the same defect really, except giving a way to express this limit.

Being able to query the limit from inside the code makes all the difference, IMO.

For your proposed read_volatile spec, two questions:

  • Why support the cross-page case at all? I think this has not been requested in this thread. So for now I'd prefer to limit this to same-page (i.e., no-signal) accesses. I'm in favor of trying to solve one problem at a time. :)
  • "the result is unspecified" -- I think it is important to say that the out-of-bounds bytes are unspecified. And since "they are uninitialized" is observably equivalent to "they may be uninitialized", I think we can just say that they are uninitialized.

For the PAGE_SIZE, why not use "1" as sentinel value for "pages are not distinguished"? Then this would always be a power of two.

@chorman0773
Copy link
Contributor

chorman0773 commented Dec 2, 2020

Why support the cross-page case at all

I don't think it's necessarily bad to say it cannot be supported, and I think it is reasonable to say it can raise a signal even if it is supported. The main issue I've heard in this thread against raising a signal is that it would inhibit some reordering optimizations, but volatile operations already do so, and are already observable behaviour. An implementation could also choose not to support cross-page access under the blanket conditionally-supported rule. It would simply have to document this choice.

I think it is important to say that the out-of-bounds bytes are unspecified

That is true; that wording can be fixed. I think it was in the original version, but got left out in the rewrite. As for why it's unspecified (and may be uninitialized): I think saying the implementation is allowed to produce a particular value is fine, and this matches the C definition of an indeterminate value ("an unspecified value or a trap representation", where uninitialized bytes are a trap representation). An implementation, for example, could freeze all volatile accesses. This wording indicates that is a valid implementation.

For the PAGE_SIZE, why not use "1" as sentinel value for "pages are not distinguished"

By "pages are not distinguished" I mean a fictitious implementation that doesn't have pages, so the volatile read could never trap (i.e., the page size is 2^n where n is 8 * size_of::<*const T>()). A PAGE_SIZE value of 1, in contrast, means that each individual byte is a different logical page, so the volatile access can always trap (it may not necessarily trap, but it can).
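The distinction between the two sentinels can be seen in a small model. The helper below is illustrative only; the values are invented for the example:

```rust
// Demonstrates why 1 is not a usable sentinel for "no pages": with a
// page size of 1, every byte is its own page, so any access wider than
// one byte crosses a page boundary and thus may trap. A separate
// sentinel (0 here) is needed for "the implementation has no pages".
fn crosses_page(addr: usize, len: usize, page_size: usize) -> bool {
    if page_size == 0 {
        return false; // sentinel: no pages, so no access can cross one
    }
    addr & !(page_size - 1) != (addr + len - 1) & !(page_size - 1)
}

fn main() {
    // With a page size of 1, even a 2-byte access always crosses pages...
    assert!(crosses_page(100, 2, 1));
    // ...whereas the 0 sentinel means no access ever does.
    assert!(!crosses_page(100, 2, 0));
    println!("ok");
}
```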

@chorman0773
Copy link
Contributor

I'd note that in the above case, the documentation for rustc would then be the page size (and thus the value of core::mem::PAGE_SIZE), as well as any signals the cross-page access can raise (or if it would never support cross-page access, that choice), which would probably be SIGSEGV (at least on unix-like operating systems).

@chorman0773
Copy link
Contributor

Now, of course, the real question is whether or not these rules can be implemented on an llvm backend.

@RalfJung
Copy link
Member

RalfJung commented Dec 6, 2020

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.^^ That's why I'd prefer to keep cross-page accesses out of the discussion. shrug

@chorman0773
Copy link
Contributor

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.

Possibly. In my opinion, the cross-page access problem isn't necessarily being solved directly; it's being solved as a side effect of solving the main problem, though I can see the opposite argument. In either case, the rule I proposed for read_volatile doesn't necessarily need the cross-page rule (and going across pages could just be made blanket UB). So that can be removed if we are completely adamant against solving the problem now (or if the proposed solution is deficient in some reasonable manner), and then what has been proposed can be used to direct future solutions if and when one is needed or desired. However, if it is a reasonable solution, I don't see why it can't be adopted now.

@comex
Copy link

comex commented Jan 10, 2023

(Two years later…)

This pattern came up as a concern in an LLVM discussion about changing uninitialized reads to return poison instead of undef:

https://discourse.llvm.org/t/rfc-load-instruction-uninitialized-memory-semantics/67481/4

@JakobDegen JakobDegen added S-pending-design Status: Resolving this issue requires addressing some open design questions and removed C-open-question Category: An open question that we should revisit labels May 23, 2023
@JakobDegen
Copy link
Contributor

Briefly discussed in backlog bonanza: this is still open. Rust does not support it today, but it seems plausible to have it in the language at some point.

@RalfJung
Copy link
Member

RalfJung commented Apr 5, 2024

We actually now have an intrinsic that can do something like this: simd_masked_load. However, you need to produce a mask that indicates which parts of the vector are in-bounds and which are not.
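The `simd_masked_load` intrinsic itself is unstable, but its semantics for the tail-load use case can be modeled in safe, stable Rust: lanes whose mask bit is set come from memory, the rest take a fill value. This is an emulation for illustration (it compiles to a copy, not a masked vector load):

```rust
// Safe model of masked-load semantics for a 32-byte vector: the first
// `buf.len()` lanes come from memory, the remaining lanes take `fill`.
// The real `simd_masked_load` intrinsic does this in one vector
// instruction without touching the out-of-bounds lanes.
fn masked_load_32(buf: &[u8], fill: u8) -> [u8; 32] {
    let mut out = [fill; 32];
    let n = buf.len().min(32);
    out[..n].copy_from_slice(&buf[..n]); // in-bounds lanes only
    out
}

fn main() {
    let buf = [7u8; 31]; // 31 bytes: one lane short of a 32-byte vector
    let v = masked_load_32(&buf, 0);
    assert_eq!(&v[..31], &buf[..]);
    assert_eq!(v[31], 0); // the out-of-bounds lane took the fill value
    println!("ok");
}
```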

@RalfJung
Copy link
Member

RalfJung commented Nov 6, 2024

This pattern also came up again here, and here.

@nikic as far as I can tell, this is currently blocked on finding some way for LLVM to generate the desired code. There's a very clear idea of what the assembly code is that we want, but apparently no good way to get LLVM to generate that code without UB. Do you have any thoughts on what a realistic way forward may look like here? Some sort of flag on load operations with the meaning of "make the OOB parts poison; it is UB only if they end up causing a trap"?

@nikic
Copy link

nikic commented Nov 6, 2024

@RalfJung I think a flag on load operations is unlikely -- it would have to be an actual flag, not metadata, and that will take significant effort to preserve through the compiler.

It should be pretty easy to provide an intrinsic for this though. I'd like to double check I understand the requirements here:

  • If the load actually traps, that's undefined behavior, we're not required to preserve the trap, right? So the loads may be DCEd.
  • The load is still provenance aware, right? That is, any bytes that are outside the underlying allocated object are undef. Two consecutive loads without changes in between may return different results for the out-of-bounds bytes. This means this is suitable for implementing a memcpy, but not a memcmp. (I think it would be very hard to support without this limitation -- the AA implications of providing a pure physical load would be terrible.)
  • What kind of loads do we need? Just plain loads? Volatile loads? Atomic loads? If it's just plain (and volatile if necessary) that's okay, but atomic loads would be a substantial complication.
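The memcpy-vs-memcmp distinction can be made concrete with a toy model in which the in-bounds prefix is stable but the out-of-bounds tail is freshly invented on every load (the names and the counter are invented for this sketch; real undef bytes are arbitrary, not sequential):

```rust
use std::cell::Cell;

// Models "OOB bytes are undef and may differ between two consecutive
// loads": the in-bounds prefix comes from the buffer, the tail is
// regenerated on every call. A memcpy built on this is fine (the
// consumer ignores the tail), but a full-width memcmp would compare
// garbage that changes between loads.
fn oob_load_16(buf: &[u8], junk: &Cell<u8>) -> [u8; 16] {
    let mut out = [0u8; 16];
    let n = buf.len().min(16);
    out[..n].copy_from_slice(&buf[..n]);
    for b in &mut out[n..] {
        junk.set(junk.get().wrapping_add(1)); // fresh garbage per load
        *b = junk.get();
    }
    out
}

fn main() {
    let buf = [42u8; 10];
    let junk = Cell::new(0);
    let a = oob_load_16(&buf, &junk);
    let b = oob_load_16(&buf, &junk);
    assert_eq!(&a[..10], &b[..10]); // copying the first 10 bytes is stable
    assert_ne!(&a[10..], &b[10..]); // a full-width compare would misbehave
    println!("ok");
}
```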

@RalfJung
Copy link
Member

RalfJung commented Nov 6, 2024

If the load actually traps, that's undefined behavior, we're not required to preserve the trap, right? So the loads may be DCEd.

Yes.

The load is still provenance aware, right? That is, any bytes that are outside the underlying allocated object are undef. Two consecutive loads without changes in between may return different results for the out-of-bounds bytes. This means this is suitable for implementing a memcpy, but not a memcmp. (I think it would be very hard to support without this limitation -- the AA implications of providing a pure physical load would be terrible.)

Yes. This is actually tricky to define from an aliasing model perspective -- if we have two noalias pointers that are 4 bytes apart from each other, and we do such an OOB load of size 8 through one of them with the intent of only loading 4 bytes and ignoring the remaining bytes, we don't want this to be an aliasing conflict.

So I think we need the intrinsic to take the "size of the logical read" as a parameter, on top of the size of the physical read. In the AM, the load acts like a normal load on the logical size, with the full consequences for provenance and aliasing model. The part between the logical and physical size will always be uninit/undef/poison, the AM entirely ignores it except that it is UB if this part traps. (In Miri we'll have to figure out some fun way to define whether there can be a trap here.)
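A toy reference model of that logical/physical split, with uninit bytes represented as `Option<u8>` (the function name and signature are invented for illustration; the real intrinsic would operate on raw pointers and sizes in bytes):

```rust
// Abstract-machine model of the proposed intrinsic: the load behaves as
// a normal, provenance-checked load of `logical` bytes; the bytes
// between `logical` and `physical` are always uninit (None here), and
// the AM ignores them apart from the requirement that reading them at
// the hardware level must not trap.
fn oob_load_model(alloc: &[u8], logical: usize, physical: usize) -> Vec<Option<u8>> {
    assert!(logical <= physical);
    assert!(logical <= alloc.len(), "logical part must be in bounds (else UB)");
    let mut out = Vec::with_capacity(physical);
    out.extend(alloc[..logical].iter().map(|&b| Some(b))); // normal load
    out.extend(std::iter::repeat(None).take(physical - logical)); // uninit tail
    out
}

fn main() {
    let v = oob_load_model(&[1, 2, 3, 4], 4, 8);
    assert_eq!(&v[..4], &[Some(1), Some(2), Some(3), Some(4)]);
    assert!(v[4..].iter().all(|b| b.is_none())); // the AM ignores these bytes
    println!("ok");
}
```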

What kind of loads do we need? Just plain loads? Volatile loads? Atomic loads? If it's just plain (and volatile if necessary) that's okay, but atomic loads would be a substantial complication.

For now this has only come up with plain loads, but it definitely seems possible that this would come up with atomic loads.

Volatile loads arguably already allow this since they are basically inline assembly.

@bjorn3
Copy link
Member

bjorn3 commented Nov 6, 2024

For atomic loads, can the load be split into two operations with respect to the opsem? One half would be what Ralf calls the "logical read"; this half follows the regular memory ordering rules for the specified ordering of the atomic load. The other half, covering the rest of the load, acts effectively as a regular load and thus returns uninit/undef/poison in case of a race with a write to that part. Afterwards, both halves are combined into a single return value for the atomic load as a whole.

@nikic
Copy link

nikic commented Nov 6, 2024

Draft RFC for LLVM: https://hackmd.io/@nikic/S1O4QWYZkx Let me know if this makes sense. Also, does anyone have a good idea for the intrinsic name?

@Amanieu
Copy link
Member

Amanieu commented Nov 6, 2024

@nikic Could this be generalized to handle OOB bytes at the start of a slice? This would be useful for memcpy implementations like the one in compiler-builtins.

@RalfJung
Copy link
Member

RalfJung commented Nov 7, 2024

Thanks for writing this up!

Making the remaining bytes undef means Rust will have to immediately generate a mask, since our integers cannot be undef... but I guess that's not too bad.

What is the reason why doing this for atomics is tricky? Is it just "there are so many variants of them", or something deeper?

@nikic
Copy link

nikic commented Nov 10, 2024

Making the remaining bytes undef means Rust will have to immediately generate a mask, since our integers cannot be undef... but I guess that's not too bad.

Would it work if I said they're freeze poison instead? (In the sense that the bytes are fixed once loaded, but can differ on each load.)

What is the reason why doing this for atomics is tricky? Is it just "there are so many variants of them", or something deeper?

The former.

Could this be generalized to handle OOB bytes at the start of a slice? This would be useful for memcpy implementations like the one in compiler-builtins.

We could replace %defined_size with %defined_offset plus %defined_size, making this intrinsic even uglier :(

I'm really unhappy about this parameter. It is a spec-only construct that gets completely ignored by the implementation.

@RalfJung
Copy link
Member

Would it work if I said they're freeze poison instead? (In the sense that the bytes are fixed once loaded, but can differ on each load.)

If LLVM can guarantee that, sure. But Rust might still want to apply a mask until we have officially decided that exposing the contents of uninit memory to sound programs is something we want to do.

I'm really unhappy about this parameter. It is a spec-only construct that gets completely ignored by the implementation.

Well, in a sense so is freeze...

Wouldn't alias analysis look at this parameter to determine whether such an access aliases with something else?

@nikic
Copy link

nikic commented Nov 10, 2024

Well, in a sense so is freeze...

Not really. Freeze doesn't generate code, but it does affect analysis and transforms a lot. Here we'd have a parameter that is completely ignored, at all levels.

Wouldn't alias analysis look at this parameter to determine whether such an access aliases with something else?

It could use it if defined_size is a constant, but if it were a constant you wouldn't be using this intrinsic in the first place. For AA purposes, we'd just model this as "accesses at most sizeof(T) bytes" (LocationSize::upperBound).

@RalfJung
Copy link
Member

If LLVM can deduce information about defined_size, AA could use that. Not sure how realistic that is, but a lot can happen after inlining.
