I need to do an oob vector load. How? #2
One thing I could do here is track the capacity of the original vector, and only do the oob load if there's enough capacity. That would definitely reduce how often this could hit the fast path, but not sure how much. Edit: NVM, this routine never sees the Vec capacity - it operates only on slices. |
A very related question has recently come up on Stack Overflow. Someone has been suggesting to read a full As already discussed there, I think there are two related but distinct problems before we can even start taking Rust's own rules into account: You are potentially performing accesses outside of any allocation (as you already mentioned), and if not then you may be racing with other accesses to the bytes outside of your buffer. For the out-of-bounds part, that is pretty much entirely in LLVM's hands. Rustc/MIR is not doing anything interesting there, but LLVM certainly does (for example, when you are accessing some pointer For data races, Rust officially is using the C11 memory model. Read-write races are immediate UB under that model. So, if the extra byte you are accessing is actually allocated and currently accessed by some other thread, you would introduce UB. However, LLVM says that such read-write races yield Only if we solve those two points, our own (Rust-level) aliasing rules even become relevant. I could imagine us following LLVM's lead and making "bad" loads return |
@RalfJung when you say "you may be racing with other accesses to the bytes outside of your buffer", what is the practical impact of that? In what concurrent/atomic scenario will my loads change the outcome for other threads? E.g., making atomic values visible before they should be? |
The practical impact is hard to determine. Compilers are allowed to and will perform optimizations that are only valid if non-atomic accesses never have a data race. Let me try to construct an example for how they might break when combining an otherwise correct unsafely implemented library with your code. For example, in the following C code

int x = *x_ptr;
acquire_lock(l);
int y = *x_ptr;

gcc may and sometimes will replace the last line by Now we have something like

let h = lib::put_under_library_control(part2);
something_that_uses_tail_clever(part1);
let val = h.get();

If everything gets inlined, this matches the C code above: Now, this is clearly a very contrived example. But the point is, we cannot just ignore UB due to data races. The only thing we can do is pick different rules and make sure the compiler follows those rules -- LLVM will not perform the optimization outlined above precisely because under LLVM semantics, this read-write race is not UB.

Coming back to the higher level, I think this is an excellent example for why one may prefer the LLVM memory model over the C11 one. seqlocks are another example that causes trouble with the C11 memory model and AFAIK works fine with the LLVM model (though I have not seen an analysis of the latter). |
Thanks for the detail.
I don't know if it helps anything but this particular code will never be
inlined into any other code ("If everything gets inlined") because the
caller is under my control and is inline (never).
…On Sun, Jul 8, 2018, 12:49 PM Ralf Jung ***@***.***> wrote:
The practical impact is hard to determine. Compilers are allowed to and
will perform optimizations that are only valid if non-atomic accesses never
have a data race. It is possible that none of these optimizations break
your code, but that's the usual problem with undefined behavior -- just
because the code isn't broken with one context now, doesn't mean it won't
be broken with another context in the future.
For example, in the following C code
int x = *x_ptr;
acquire_lock(l);
int y = *x_ptr;
gcc may and sometimes will replace the last line by int y = x;, which is
correct because it knows there cannot be a concurrent write that could
change the value behind x_ptr in the mean time. Now imagine a situation
where a 32-byte (aligned) buffer (&mut [u8; 32]) is split into a 31-byte
buffer (part1: &mut [u8; 31]) and a location (part2: &mut u8) that is put
under the control of some unsafely implemented library lib. That library
makes the location accessible from multiple threads and uses a lock stored
somewhere else to synchronize (like a Mutex but with the data not stored
in-band with the lock).
Now we have something like
let h = lib::put_under_library_control(part2);
something_that_uses_tail_clever(part1);
let val = h.get();
If everything gets inlined, this matches the C code above: tail_clever
will read part2 but throw away the result, then h.get() will acquire a
lock and read part2 again. The compiler may optimize this to use the
result of the first read, assuming there are no data races -- and we got a
miscompilation.
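The buffer split in this scenario can be made concrete with a small sketch. It only does address arithmetic and never performs the wide read; the function name `wide_read_overlaps_neighbor` is hypothetical and introduced here purely for illustration.

```rust
/// Returns true when a `width`-byte read starting at the beginning of
/// `part1` would also cover the byte that `neighbor` refers to.
/// Only addresses are compared; no out-of-bounds access happens here.
fn wide_read_overlaps_neighbor(part1: &[u8], neighbor: &u8, width: usize) -> bool {
    let start = part1.as_ptr() as usize;
    let end = start + width; // exclusive end of the hypothetical read
    let n = neighbor as *const u8 as usize;
    start <= n && n < end
}
```

With a 32-byte buffer split into a 31-byte `part1` and a 1-byte `part2`, a 32-byte read from `part1` covers `part2`, while a 31-byte read does not. That overlap is exactly where a concurrent write to `part2` would race with the wide read.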
Now, this is clearly a very contrived example. But the point is, we cannot
just ignore UB due to data races. The only thing we can do is pick
different rules and make sure the compiler follows those rules -- LLVM will
not perform the optimization outlined above precisely because under LLVM
semantics, this read-write race is not UB.
------------------------------
Coming back to the higher level, I think this is an excellent example for
why one may prefer the LLVM memory model over the C11 one. seqlocks
<https://en.wikipedia.org/wiki/Seqlock> are another example that causes
trouble with the C11 memory model
<http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf> and AFAIK works
fine with the LLVM model (though I have not seen an analysis of the latter).
There may be other arguments for the C11 model, e.g. I do not know the
situation and DRF theorems (data-race-freedom theorems) for the LLVM
model. The C11 model has some pretty strong DRF theorems saying e.g. that a
program that is race-free under sequentially consistent semantics and only
uses non-atomic and sequentially consistent accesses, does not gain any
additional behaviors when considering the full C11 semantics. These
theorems ensure that programs not using the weaker access modes do not have
to care. I haven't seen such theorems for the LLVM model, but that's just
because I haven't seen that model studied very much at all.
|
Yeah, unfortunately these inlining hints don't actually change the program semantics -- they affect what the compiler will do, but not what it could do. From a correctness standpoint, I do not know of a way to make inlining hints "mean" anything. |
@Amanieu would like to be able to do oob atomic loads as well: rust-lang/rust#32976 (comment) That's required for correctness, and is not an optimization AFAICT. |
@Amanieu what is that doing? |
It is emulating 8/16/32 atomic operations on older ARM architectures (without atomic support) using a kernel-provided 32-bit cmpxchg function. |
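A rough sketch of this kind of emulation, written against `std::sync::atomic::AtomicU32` instead of the kernel-provided helper. The function name `fetch_add_u8_in_word` and the lane numbering (lane `i` occupies bits `8*i..8*i+8` of the word) are assumptions made for illustration, not the code from the linked comment.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Emulates an 8-bit atomic fetch-add using only a 32-bit compare-exchange.
/// `lane` selects which of the four bytes inside `word` is updated.
/// Returns the previous value of that byte.
fn fetch_add_u8_in_word(word: &AtomicU32, lane: u32, delta: u8) -> u8 {
    assert!(lane < 4);
    let shift = lane * 8;
    let mask = 0xFFu32 << shift;
    let mut old = word.load(Ordering::Relaxed);
    loop {
        let old_byte = ((old & mask) >> shift) as u8;
        let new_byte = old_byte.wrapping_add(delta);
        // Keep the three neighbouring bytes exactly as they were observed.
        let new = (old & !mask) | ((new_byte as u32) << shift);
        match word.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return old_byte,
            // Another thread changed the word (possibly a neighbouring byte); retry.
            Err(actual) => old = actual,
        }
    }
}
```

The point relevant to this thread is that the 32-bit operation necessarily reads and rewrites the three neighbouring bytes, which may lie outside the 8-bit or 16-bit object being operated on.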
Does LLVM know that these are 32bit memory accesses? Code in other translation units, compiled from a different language with different UB (and linked on the assembly level, i.e., in a language where this is not UB), does not have to follow the same rules. Syscalls are an extreme case of "different translation unit". Only LLVM IR itself is subject to LLVM IR's rules. (Of course there must be some amount of interop, and a shared memory model, but that seems plausible in this case.) |
Talking about doing OOB SIMD loads:
@RalfJung that might be doable, but @brson would need to heavily rewrite their code. Here:

let p: *const __m256i = /* ptr to allocation smaller than 32 bytes */;
let x: __m256i = _mm256_loadu_si256(p);

The problem is that If

let p: *const Simd<[MaybeUninit<i64>; 4]> = /* ptr to allocation smaller than 32 bytes */;
let x: Simd<[MaybeUninit<i64>; 4]> = ptr::read_unaligned(p);
// ^^ Is ptr::read_unaligned the right tool for reading memory OOB ?

where Implementation wise, I don't really know how that would work. Adding the API to |
@gnzlbg Notice that this only "solves" the data-race part. One alternative (not correct in theory but experimentally confirmed to work in practice) is to use volatile reads for the non-atomic maybe-racy reads. LLVM didn't sanction this, and maybe we should have a discussion with them about this. Another alternative might be to use LLVM None of this helps with the fact that the accesses are OOB. There is no solution to that other than having explicit support for this from LLVM. |
rearrange a bit and be more explicit about how our rules interact
@Amanieu and @thomcc recently had a related discussion on Zulip. It seems the general preference is to permit this for volatile accesses, assuming we can get LLVM to sanction that. My concern with this is that volatile will inhibit optimizations, which seems in opposition to the goal stated in the OP -- to use a vectorized loop for performance. So it might be that only giving volatile accesses "OOB powers" is not enough, we might also need some (opt-in) way to do this for regular accesses. |
I'd be concerned with allowing any kind of OOB access (or OOB pointer arithmetic, note that wrapping_add would be implemented as integer arithmetic). The second it can cross into an unreachable object, either you definitely do have undefined behaviour, or way too many optimizations go out the window. There could also be concerns about allowing OOB access, period, as padding could theoretically be manipulated to store internal compiler state when there isn't a chance of it getting overwritten. |
FWIW, it currently is not -- it is implemented as I am not sure what you mean by "would".
This is exactly why we use
AFAIK we are only talking about reads here. I do not know of a reasonable way to permit OOB writes. OOB reads would return "uninit" for the OOB part, even if that happens to be in-bounds for another object. This should hopefully suffice to preserve optimizations. |
In this case, I am referring to the lccc model, in which pointer arithmetic comes straight out of the C and C++ Standards.
Returning uninit from OOB may be fine. However, as I have mentioned in #76, for scalar values, uninit in lccc is poisoning (if one byte of a scalar object is uninit, the entire value is uninit). This shouldn't cause issues, at least in lccc, provided the type being read doesn't have any validity requirements. Note that for volatile, this is less of an issue, as volatile accesses are always freezing in lccc (which prevents the poisoning of the entire value, since that occurs on reads and writes). |
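The "poisoning" rule for scalars can be illustrated with a toy model. Here `Option<u8>` stands in for a byte that may be uninit (`None`); this is only a sketch of the rule as described in the comment above, not lccc's actual definition, and `load_scalar_poisoning` is a hypothetical name.

```rust
/// Toy model of a poisoning scalar load: if any of the four bytes is
/// uninit (`None`), the whole `u32` is treated as uninit.
fn load_scalar_poisoning(bytes: [Option<u8>; 4]) -> Option<u32> {
    let mut raw = [0u8; 4];
    for (dst, src) in raw.iter_mut().zip(bytes.iter()) {
        // `?` propagates the first uninit byte to the whole value.
        *dst = (*src)?;
    }
    Some(u32::from_le_bytes(raw))
}
```

Under this model a load whose trailing byte is out of bounds would yield a fully uninit scalar, which is why a wide scalar read could not be used to recover the in-bounds bytes.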
No it cannot. Quoting from the docs:
|
What I mean is that it's valid to use wrapping add to exceed the allocation, you just can't access outside. |
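A minimal sketch of that distinction using the standard `wrapping_add` and `wrapping_sub` pointer methods: the intermediate pointer leaves the allocation but is never dereferenced, and only the final in-bounds pointer is read. The function name `round_trip_past_end` and the offsets are arbitrary choices for illustration.

```rust
/// Moves a pointer far past the end of `buf` and back again, then reads
/// through it once it is in bounds.
fn round_trip_past_end(buf: &[u8; 4]) -> u8 {
    let base: *const u8 = buf.as_ptr();
    // Out of bounds: computing this pointer is fine, dereferencing it is not.
    let far = base.wrapping_add(1000);
    // Back in bounds: this is `base + 2`.
    let back = far.wrapping_sub(998);
    // SAFETY: `back` points at `buf[2]`, inside the allocation that `base`
    // was derived from.
    unsafe { *back }
}
```

Calling it on `[10, 20, 30, 40]` reads the third element.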
Ah, that is a terminology difference then. I would say that the pointer you get from
That would not be a correct implementation. |
lccc also has a reverse round-trip rule to complement the round-trip rule (this also might be required by C++, idk), which says that if |
The OP's use case doesn't just need the load to be non-UB, it needs the load to produce a value where the bits corresponding to in-bounds bytes are correct. So it seems like either you must track uninitializedness on a per-bit level (as LLVM does), or this must be a special kind of load which produces something different from normal uninitialized values. |
I don't think it does... at least, with the proposal to track this via However, an |
I don't think How is "conditionally-supported" defined? Is this like "implementation-defined", in that implementations need to state the conditions under which it is supported? If so, what would be something an implementation could say to actually enable OOB accesses? One main point of a spec is to enable programmers to reason that their code is correct, and I do not think your spec lets them do that. The spec needs to answer the question "as a programmer, what do I need to do to ensure that my program will behave correctly after compilation". |
(Reposted because reply by mail works flawlessly)
Fair point, and it could be changed to talk about the same thing. However, isn't the upper-bound of pointer provenance the allocation it points into? Additionally, unspecified and may be uninit is extraordinarily permissive (on the level of an indeterminate value in C, defined as an unspecified value or a trap representation). It wouldn't even have to represent any possible state the byte held when the read occurred, even with other (non-volatile) writes reordered, so this would seem to keep the optimizations intact, aside from reordering the read, which can't be done anyways (as it's volatile).
The implementation chooses whether it is supported at all, and documents if and when it is not.
In all cases, you'd need to look at the documentation for the particular compiler, and certainly never use any type that has a validity invariant stricter than |
Good point. Since this is only about reads and it doesn't actually "leak" any information, it is hard to imagine this breaking any optimization. I guess you are coming from the perspective that the bounds of an allocation are themselves just an expression of provenance? In my mental model, allocations fundamentally have a given size, and the gaps between allocations have no value associated with them at all (not even
So it seems like you just moved the hard work of specifying OOB loads such that the above code is allowed to the compiler. That's not solving the problem though. I don't think we are done here until we have a proposal for a spec that actually permits the kind of code the OP is asking for. So in terms of your proposal that would mean not only writing the relevant part of the Rust spec, but also writing the relevant part of the rustc docs that complete the spec to an actually concrete semantics, so that code authors can point to those docs and say "my code is correct because of what it says here".
Also, I noticed your proposal permits a signal to be raised. I don't think that's a good idea, since it makes things observable that really shouldn't be observable. As I said before: "I imagined some language where the programmer has to ensure that the OOB load has no further side-effects on the underlying platform. Usually, the compiler has to prove that a platform load correctly implements an Abstract Machine load; once you go OOB, that responsibility would be shifted to the programmer."
In other words, we basically require a proof from the programmer that a load instruction on the underlying hardware with the given size will correctly implement an Abstract Machine load. Or putting it differently, behavior is undefined unless a load instruction on the underlying hardware correctly implements an Abstract Machine load. "Correctly implement" unfortunately depends on the concrete simulation relation used by the implementation in question, but I think we can say for sure that it involves "no side-effects" and "always returns successfully", which rules out signals. For example, on x86-64 we should be able to say that the load needs to be fully within a page such that there provably is a pointer that is dereferenceable for size 1 pointing to the same page.
(Optimizations might replace memory by registers, so there might not actually be any physical page, but then the OOB part also has no chance of triggering a signal so we should be good.) When doing the correctness argument for the compiler, this should be sufficient to prove that the load will always complete and never raise a signal. And when reasoning about our code as a programmer, this gives us enough information to actually say for sure that our code will be correct. In fact, if there are no other conditions required to make such a load work, we could even make the page size implementation-defined and fix everything else. Implementations can still pick a page size of 1 to avoid making any promises. Then if there is a constant like |
Only for volatile reads, which are already observable. For non-volatile reads it's UB if the equivalent volatile read would raise a signal. This preserves the optimizations for reordering non volatile accesses. I don't see how adding the option for volatile reads to trap within defined behaviour would inhibit too many optimizations, as volatile is very limited in how it can be optimized.
This would apply here, in order to validly perform a non-volatile read, you would have the responsibility of ensuring the volatile read wouldn't trap. The minimum buffer width would provide some of that, by giving a sequence of bytes known to be correct.
Kind of, in lccc, they are equivalent (or at least related) concepts, the reachability of a pointer. The reachability of a pointer to an object is defined as the largest sequence of bytes that are part of the object-representation of the largest object pointer-interconvertible with it, and the immediately enclosing array thereof (with some exclusions to permit unique and readonly optimizations). This, and the reachability of any pointer that can be validly created from it, would be the provenance of that pointer in rust terms. So under this model, the bounds of the allocation provides an upper-bound for the reachability, and thus the provenance.
For rustc would the following be good:
And then provide examples of page sizes, like x86-64 has 4096 bytes in a page.
I think fundamentally, this requires a lot more knowledge than this, and certainly a lot more than what mine would. The implementation is bound to emulate the observable behaviour of the abstract machine unless it contains UB. By shifting the burden of ensuring the evaluation does so onto the programmer, I'd argue you've created a circular case. The implementation is required to perform the access correctly if the implementation performs the access correctly. The underlying hardware is, after all, a part of the implementation. An implementation could "support" it, but then choose a mechanism for emulating the load that is never correct for OOB, and under this idea, that would be valid. |
I have done no such thing, I have just provided a way to "plug in" to what the compiler does so that the user can help the compiler complete its argument. But that was anyway just the explanation for how to arrive at the proposal I made at the end of my post, which I think is fairly close to what you proposed for the rustc docs. However, by moving everything relevant into the rustc docs you made it impossible to do OOB accesses in Rust code that can be compiled with more than 1 compiler, hence my proposal to put something like a page size all the way into the spec. |
Doing that may work, but it may leave certain kinds of implementations off the table. And, even then, this has the same defect really, except giving a way to express this limit. Although, now that I put that in words, it is kind of growing on me. I'm wondering about a hybrid one, that combines the two. So perhaps I revise the specification as follows for
And then
Does the above sound good? |
Being able to query the limit from inside the code makes all the difference, IMO. For your proposed
For the |
I don't think it's necessarily bad to say it cannot be supported, and saying it can raise a signal even if it is supported, I think is reasonable. The main issue I've heard from this thread against raising a signal is that it would inhibit some reordering optimizations, but volatile operations already do so, and are already observable behaviour. An implementation could also choose not to support cross-page access under the blanket conditionally-supported. It would simply have to document this choice.
That is true, that wording can be fixed. I think it was in the original version, but got left out in the rewrite. As for why it's unspecified (and may be uninitialized), I think saying the implementation is allowed to produce a particular value is ok, and this matches the C definition of an indeterminate value ("An unspecified value or a trap representation", and uninitialized bytes are a trap representation). An implementation, for example, could freeze all volatile accesses. This indicates that is a valid implementation.
By "pages are not distinguished" I mean a fictitious implementation that doesn't have pages, so the volatile read could never trap (i.e., the page size is |
I'd note that in the above case, the documentation for rustc would then be the page size (and thus the value of |
Now, of course, the real question is whether or not these rules can be implemented on an llvm backend. |
I just think it is easier to solve these problems in isolation than trying to solve more problems at once.^^ That's why I'd prefer to keep cross-page accesses out of the discussion. shrug |
Possibly. In my opinion, the cross-page access problem isn't necessarily being solved directly, it's just being solved as a side-effect of solving the main problem, though I can see the opposite argument. In either case, the rule I proposed for |
(Two years later…) This pattern came up as a concern in an LLVM discussion about changing uninitialized reads to return https://discourse.llvm.org/t/rfc-load-instruction-uninitialized-memory-semantics/67481/4 |
Briefly discussed in backlog bonanza: This is still open. Rust does not support it today, but it seems plausible to have in the language at some point |
We actually now have an intrinsic that can do something like this: |
This pattern also again came up here, and here. @nikic as far as I can tell, this is currently blocked on finding some way for LLVM to generate the desired code. There's a very clear idea of what the assembly code is that we want, but apparently no good way to get LLVM to generate that code without UB. Do you have any thoughts on what a realistic way forward may look like here? Some sort of flag on |
@RalfJung I think a flag on load operations is unlikely -- it would have to be an actual flag, not metadata, and that will take significant effort to preserve through the compiler. It should be pretty easy to provide an intrinsic for this though. I'd like to double check I understand the requirements here:
|
Yes.
Yes. This is actually tricky to define from an aliasing model perspective -- if we have two So I think we need the intrinsic to take the "size of the logical read" as a parameter, on top of the size of the physical read. In the AM, the load acts like a normal load on the logical size, with the full consequences for provenance and aliasing model. The part between the logical and physical size will always be uninit/undef/poison, the AM entirely ignores it except that it is UB if this part traps. (In Miri we'll have to figure out some fun way to define whether there can be a trap here.)
For now this has only come up with plain loads, but it definitely seems possible that this would come up with atomic loads. Volatile loads arguably already allow this since they are basically inline assembly. |
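The logical-size versus physical-size split can be modelled in safe-looking Rust as a way to think about the proposed semantics. This is only a model under stated assumptions: it never reads out of bounds, the names `PartialLoad` and `model_partial_load` are hypothetical, and it is not the intrinsic being discussed.

```rust
use std::mem::MaybeUninit;

/// Result of a modelled 32-byte "physical" load whose "logical" size is
/// `logical_len`: lanes below `logical_len` hold real data, the rest are
/// left uninit.
pub struct PartialLoad {
    lanes: [MaybeUninit<u8>; 32],
    logical_len: usize,
}

/// Models the load: copies the logical bytes and leaves the remainder uninit.
pub fn model_partial_load(src: &[u8]) -> PartialLoad {
    assert!(src.len() <= 32);
    let mut lanes = [MaybeUninit::<u8>::uninit(); 32];
    for (lane, &b) in lanes.iter_mut().zip(src) {
        *lane = MaybeUninit::new(b);
    }
    PartialLoad { lanes, logical_len: src.len() }
}

impl PartialLoad {
    /// Returns lane `i` only if it belongs to the logical part of the load.
    pub fn get(&self, i: usize) -> Option<u8> {
        if i < self.logical_len {
            // SAFETY: lanes below `logical_len` were written in `model_partial_load`.
            Some(unsafe { self.lanes[i].assume_init() })
        } else {
            None
        }
    }
}
```

In this model the caller can observe only the logical bytes, which mirrors the idea that the Abstract Machine ignores the part between the logical and physical size.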
For atomic loads can the load be split into two operations with respect to the opsem? One half for what Ralf calls the "logical read". This half follows the regular memory ordering rules for the specified ordering of the atomic load. And one half for the rest of the load which acts effectively as a regular load and thus returns uninit/undef/poison in case of a race with a write on this half. And afterwards both halves are combined into a single return value for the atomic load as a whole. |
Draft RFC for LLVM: https://hackmd.io/@nikic/S1O4QWYZkx Let me know if this makes sense. Also, does anyone have a good idea for the intrinsic name? |
@nikic Could this be generalized to handle OOB bytes at the start of a slice? This would be useful for memcpy implementations like the one in compiler-builtins. |
Thanks for writing this up! Making the remaining bytes What is the reason why doing this for atomics is tricky? Is it just "there are so many variants of them", or something deeper? |
Would it work if I said they're
The former.
We could replace I'm really unhappy about this parameter. It is a spec-only construct that gets completely ignored by the implementation. |
If LLVM can guarantee that, sure. But Rust might still want to apply a mask until we have officially decided that exposing the contents of uninit memory to sound programs is something we want to do.
Well, in a sense so is Wouldn't alias analysis look at this parameter to determine whether such an access aliases with something else? |
Not really. Freeze doesn't generate code, but it does affect analysis and transforms a lot. Here we'd have a parameter that is completely ignored, at all levels.
It could use it if defined_size is a constant, but if it were a constant you wouldn't be using this intrinsic in the first place. For AA purposes, we'd just model this as "accesses at most |
If LLVM can deduce information about |
As an optimization during a buffer search, I need (very want) to load that buffer into a SIMD vector, even when the buffer doesn't fit into the vector. E.g. I might have a 31-byte buffer that can be efficiently searched with a 32-byte wide AVX2 vector.
From a machine perspective, I don't see this as a problem, as long as the load doesn't extend beyond the current page; from LLVM's perspective this seems like UB.
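The machine-level condition can be written down as plain address arithmetic. The sketch below assumes a 4096-byte page (the constant `ASSUMED_PAGE_SIZE` and the function name are illustrative choices, and real page sizes vary by platform and configuration).

```rust
/// Assumed page size in bytes; a power of two is required by the masking below.
const ASSUMED_PAGE_SIZE: usize = 4096;

/// Returns true when a `width`-byte load starting at `addr` touches only
/// the page that contains `addr`.
fn load_stays_in_page(addr: usize, width: usize) -> bool {
    assert!(width >= 1 && width <= ASSUMED_PAGE_SIZE);
    // Offset of `addr` within its page.
    let offset = addr & (ASSUMED_PAGE_SIZE - 1);
    offset + width <= ASSUMED_PAGE_SIZE
}
```

Passing this check only speaks to whether the hardware load could fault. It does not by itself make the load defined behaviour from LLVM's or Rust's point of view, which is the open question in this issue.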
I'd really like to be able to write this code in Rust and not have to use assembly.
Here's an example of this pattern:
It loads beyond the array, does vector operations on it, then disregards the oob bytes with a mask.
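For comparison, here is a portable sketch of the same "search 32 lanes, then mask off the extra ones" idea that stays within defined behaviour by copying into a zero-padded buffer first. It is scalar code standing in for the AVX2 compare and movemask steps, and `match_mask_padded` is a hypothetical name; the copy is exactly the cost the out-of-bounds load is meant to avoid.

```rust
/// Returns a bitmask with bit `i` set when `buf[i] == needle`.
/// Lanes at or beyond `buf.len()` are padding and are masked off.
fn match_mask_padded(buf: &[u8], needle: u8) -> u32 {
    assert!(buf.len() <= 32);
    // Zero-padded stand-in for the 32-byte vector register.
    let mut lanes = [0u8; 32];
    lanes[..buf.len()].copy_from_slice(buf);

    // Lane-wise compare, collected into one bit per lane.
    let mut mask = 0u32;
    for (i, &b) in lanes.iter().enumerate() {
        if b == needle {
            mask |= 1u32 << i;
        }
    }

    // Disregard the lanes that did not come from `buf`.
    let valid = if buf.len() == 32 { u32::MAX } else { (1u32 << buf.len()) - 1 };
    mask & valid
}
```

Searching a 31-byte buffer for `0` shows why the final mask matters: the padding lane would match, but it is cleared before the result is returned.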
I'm hopeful that there is some mechanism to tell LLVM to 'forget' what it knows about this pointer, 'fooling' the optimizer into not messing with it.
From the LLVM aliasing rules, there is some language that makes me hopeful:
So there is a class of pointers that can operate on arbitrary memory (those that don't come from LLVM). That suggests to me that I could e.g. send my pointer through assembly or some other black-box function to 'clean it', maybe. On the other hand, calling into any function, or even into inline asm imposes extra instructions that more-or-less defeat the optimization (inline asm in LLVM seems to always spill registers). Though that sentence also says "such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM"
I'm not sure how much 'wiggle-room' there is. Is a malloc'd array "provided by LLVM"? What are the consequences of disobeying this "shall not"?
Even if there's no in-language solution and it is technically UB, I am hopeful that I can do this thing without LLVM messing with my codegen.
cc @nikomatsakis writing this here per your request.