Portable packed SIMD vector types #2366
Conversation
b28a35a to 47ba7e7 (compare)
* `{Add,Sub,Mul,Div,Rem}<RHS=Self,Output=Self>`,
  `{Add,Sub,Mul,Div,Rem}Assign<RHS=Self>`: vertical (lane-wise) arithmetic
  operations.
It's probably worth describing how behavior of these operators is different (or not different) from the same operators for non-vector arithmetic types - e.g. wrapping vs panicking on overflow.
(Same applies to bit-shift traits.)
Thanks, I'm on this. @sunfish also requested documenting the behavior of `Div`/`Rem` when dividing by zero, so I'll add that as well, and also mention that `Div` and `Rem` are provided as a convenience, but that little (or no) hardware actually offers intrinsics for these.
From looking at rustc's implementation this is the behavior that they seem to have (I need to add `stdsimd` tests to exercise these everywhere though), but this is what LLVM says it guarantees:

* on integer vectors, overflow results in the correct mathematical result modulo 2^n (see exceptions below);
* on floating-point vectors, overflow and division by zero have the same behavior as for `f32` or `f64` (±INFINITY for overflow, and some NaN for division by zero, IIRC);
* on integer vectors, we map division and remainder directly to LLVM's `{s,u}{div,rem}`, which state that if any element is divided by zero, or the operation overflows, the result is undefined behavior.
So this is what we are currently doing. Now the question is: can we do better?

* For `Add`, `Sub`, `Mul`, `Shl`, `Shr`: we could use `nuw`/`nsw` to generate a poison value and get the same behavior as for the scalar integers (that is, a panic). We could then expose wrapping arithmetic via `wrapping_{add,sub,mul,shl,shr}`. This might be a bit of a performance foot-gun, so the alternative could be to just expose the wrapping methods instead.
* For `Div` and `Rem`: we could expose these as `unsafe fn {div, rem}(self, o: Self) -> Self` methods with a precondition on division by zero, and provide an implementation in the traits that checks that no element is zero before performing the operation (panicking on precondition violation). This is "expensive": one vector comparison + one horizontal reduction, and whether it is better than not providing these operations at all is debatable. Another alternative might be to not expose these operations at all.
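The "check, then divide" strategy described above can be sketched in plain Rust, with a `[u32; 4]` array standing in for a hypothetical `u32x4` vector type (the function name and shape are illustrative, not the RFC's API):

```rust
// Sketch of the checked-division strategy discussed above. In SIMD terms,
// the `all` check would lower to one vector comparison against zero plus
// one horizontal (movemask-style) reduction.

/// Panics if any lane of `b` is zero; otherwise divides lane-wise.
fn checked_lanewise_div(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    assert!(b.iter().all(|&x| x != 0), "division by zero in some lane");
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i] / b[i];
    }
    out
}

fn main() {
    assert_eq!(checked_lanewise_div([8, 9, 10, 12], [2, 3, 5, 4]), [4, 3, 2, 3]);
}
```

The cost trade-off debated above is exactly the `assert!` line: it is extra work on every division, but it keeps the safe operator free of undefined behavior.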
@petrochenkov I've gone ahead and updated the RFC with what I consider the "sane" arithmetic and shift operator semantics:

* the arithmetic and shift traits behave like the operators for the scalars. That is, for integers they panic on overflow and division by zero, and for floats they produce the same results as the scalars;
* the integer vectors also implement a set of `wrapping_{add,sub,...}` methods that perform modulo 2^n arithmetic, just like the scalar integers do;
* the integer vectors also implement two unsafe `wrapping_{div,rem}_unchecked` methods that perform no checks.
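The scalar semantics the lanes are meant to mirror can be demonstrated on plain scalar integers (this is stable Rust today, shown per lane, not the vector API itself):

```rust
// The proposed lane semantics, illustrated per lane on scalar u8:
// `+` panics on overflow (when overflow checks are on), `wrapping_add`
// performs arithmetic modulo 2^n, and checked variants make overflow
// observable as a value instead of a panic.

fn main() {
    let x: u8 = 250;
    // What a lane-wise `wrapping_add` on a u8xN vector would do per lane:
    assert_eq!(x.wrapping_add(10), 4); // 260 mod 256
    // Checked variants surface overflow instead of panicking:
    assert_eq!(x.checked_add(10), None);
    assert_eq!(x.checked_add(5), Some(255));
}
```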
What's sane here is a hard question.
On one hand, the only reason for SIMD's existence is improving performance, and `uNxM + uNxM` should ideally turn into a single hardware SIMD addition instruction, but checked operations can't be lowered into a single instruction.
On the other hand, consistency with scalar arithmetic, error detection and everything...
I think that any SIMD proposal that does not mesh with the rest of the language is not going to be "good enough". So what I meant with "sane" was the same defaults as the rest of the std library, while at the same time offering easy-to-reach, more efficient operations with slightly different semantics. While the `std` library typically puts these in `core::intrinsics`, the portable SIMD vector types have these as inherent methods, making them more "first class".

IMO the SIMD std facilities have two jobs: protecting users from easy mistakes, and allowing them to write efficient code. These two jobs are in constant tension.

I don't think that unsafe operations should become the default, but that does not mean they should be hard to reach either.
> On the other hand, consistency with scalar arithmetic, error detection and everything...
Being consistent with scalar arithmetic would have overflow checking for `+` (etc.) only in debug mode, so release mode would still lower it to a single instruction.

I think being consistent with scalar math on this point would be OK, but making `+` check for overflow in release mode would probably not be OK in terms of performance expectations.

If overflow is checked in debug mode, `wrapping_foo` variants are needed. E.g. the Rust SIMD code in Firefox for checking if a UTF-16 string contains right-to-left characters currently relies on wrapping underflow for `-` on `u16x8`.
@hsivonen I've added that overflow checking only happens in debug mode.
FWIW "release/debug" is not the right distinction for overflow checks. They're a separate switch (`-C overflow-checks=[on|off]`) that can be flipped independently of debug info, optimization level, etc. -- it's just tied by default to `debug_assertions`, which in turn is by default tied to release/debug.
returns `true` if all lanes compare `true`. It is equivalent to
`a.eq(b).all()`.
* `PartialOrd<Self>`: compares two vectors lexicographically.
* `From/Into` lossless casts between vectors with the same number of lanes.
What about `From` conversions `[{et}{wl}; {nl}] -> {et}{wl}x{nl}` (arrays) and `{et}{wl}^{nl} -> {et}{wl}x{nl}` (tuples)?
I haven't given these much thought until now, so I don't know if adding these is worth it. In any case, this is how they could be implemented:

* For arrays, one could implement these on top of the `{load,store}_unaligned` APIs without any issues.
* For tuples, it depends. IIRC the layout of tuple elements is unspecified, to allow for things like reordering tuple elements to reduce their size. There was, however, a variadics RFC built on tuples that required the layout of tuples to be fully specified, to allow for things like slicing tuples. Iff the layout of tuples with multiple elements of the same type were specified to be the same as that of an array/slice, then we could implement these for tuples with unaligned memory loads/stores as well. Otherwise, we'd probably need to copy the tuple elements into an array first, and then perform the memory loads (and vice versa for stores).

Having said this, we don't need to add these now either. Those interested in trying them out could add them to `stdsimd` behind an unstable feature gate. If they turn out to be useful, a mini-FCP might be enough to stabilize them later on.
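The array conversions discussed above could look roughly like this, with a hypothetical newtype standing in for `f32x4` (the real implementation would go through unaligned vector loads/stores rather than a plain field move):

```rust
// Sketch of `From<[f32; 4]> for f32x4` and back, using an illustrative
// wrapper type. The field access below stands in for the
// `{load,store}_unaligned` operations the comment above refers to.

#[derive(Debug, PartialEq, Clone, Copy)]
struct F32x4([f32; 4]);

impl From<[f32; 4]> for F32x4 {
    fn from(a: [f32; 4]) -> Self {
        F32x4(a) // stands in for an unaligned vector load
    }
}

impl From<F32x4> for [f32; 4] {
    fn from(v: F32x4) -> Self {
        v.0 // stands in for an unaligned vector store
    }
}

fn main() {
    let v = F32x4::from([0.0, 1.0, 2.0, 3.0]);
    let back: [f32; 4] = v.into();
    assert_eq!(back, [0.0, 1.0, 2.0, 3.0]);
}
```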
Fair enough; on the question of "is it worth it" I'd like to note my opinion that it would help with ergonomics, so I'd be 👍 for those impls.
Could you perhaps note the possibility of adding these somewhere in the RFC (like the Unresolved questions) for future reference?
text/0000-ppv.md (outdated)
##### Semantics for floating-point numbers

* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to
Why quiet NaN? Surely any NaN should result in false?
And what does "`self` is equal to `other`" mean anyway? Is it a bitwise comparison? Then that doesn't match scalar float comparisons wrt positive/negative zero.

I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case. I assume that's what you're going for anyway (if not, that's a big problem IMO).
Signaling NaN may result in the CPU trapping to the OS without giving the program a choice. LLVM recently tightened up their assumptions regarding the default floating-point environment, so only qNaN should be generated anyway.
@eternaleye That's multiple kinds of false:

* signalling NaNs do not trap. Operations on them raise a floating-point exception, which is nothing like language exceptions or traps. The specific exception sNaN operations raise (invalid operation) is also raised by sqrt(-1) and myriad other operations, and is handled by returning a qNaN;
* some LLVM people have been confused about this too, but they're now finally tidying up their stuff to explicitly not consider signalling NaNs UB-producing;
* the LangRef change you link does not say sNaN is not produced; it simply spells out some consequences of assuming the program does not change the default rounding mode, exception handling, etc. -- one consequence of that is that qNaN and sNaN are indistinguishable, so LLVM doesn't need to preserve signalling-ness, but you can still get an sNaN bit pattern in many ways;
* finally, in any case this issue affects vectors and scalars equally, so even if what you wrote were true it still would not explain the divergence here.
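What "each lane is compared as in the scalar case" means can be made concrete on scalar `f32`, whose IEEE 754 comparison semantics do not distinguish quiet from signalling NaN and are not bitwise:

```rust
// Scalar f32 comparison semantics, which the per-lane vector comparisons
// are expected to mirror: any NaN is unordered and compares unequal to
// everything (including itself), and +0.0 == -0.0 despite having
// different bit patterns.

fn main() {
    let nan = f32::NAN;
    assert!(!(nan == nan)); // NaN is unequal to everything,
    assert!(nan != nan);    // including itself
    assert!(!(nan < 1.0) && !(nan > 1.0)); // NaN is unordered
    assert!(0.0f32 == -0.0f32); // not a bitwise comparison
}
```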
@rkruppe Ah, sorry - thanks for clarifying that for me!
Honestly, I just documented here what we are already doing, and QNAN is what LLVM guarantees. Why? I don't know. cc'ing @sunfish and @chandlerc here - they might know why LLVM guarantees this. While what you mention makes sense, I don't have a global view of what all the architectures that LLVM supports actually do.
> I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case.
I need to check whether the way we are comparing floating-point vectors is actually the same way in which we are handling the scalar case. If not, then arguably we should, but maybe we can't for some reason. From an implementation point-of-view, the underlying LLVM intrinsics work on both scalars and vectors, so we ought to be doing the exact same thing for vectors as for scalars, but as mentioned, I haven't checked.
In any case, defining these to have the same semantics as the operations for scalars makes sense.
As stated earlier, quite a few LLVM developers (and consequently the LangRef) were really confused about sNaN for a long time. The reference to QNAN specifically in the LangRef probably comes from that confusion. Certainly there is no remotely justifiable reason to treat sNaN differently here, and IEEE 754 does not do so.
> I don't think this section (and others like it)
Could you ping me in the other sections where wording like this is unnecessary?
text/0000-ppv.md (outdated)
than two times the number of lanes in the input vector. The length of the
resulting vector equals the number of indices provided.

Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N` elements in the vector. In the two-vector version, the indices in range `[N, 2*N)` refer to elements in the second vector.
What happens on out-of-range indices? Compile time error?
Yes, out-of-range indices are a compile-time error. Right now they are a monomorphization-time error, but before stabilization we should turn these into MIR typeck errors.
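The two-vector indexing rule quoted above can be sketched with arrays standing in for vectors: indices in `[0, N)` select from the first vector and `[N, 2N)` from the second. The real `shuffle!` macro rejects bad indices at compile time; this illustrative sketch checks them at run time instead:

```rust
// Run-time model of the two-vector shuffle indexing rule (illustrative,
// not the RFC's API): index i < N reads a[i], index N <= i < 2N reads
// b[i - N], and anything else is an error.

fn shuffle2(a: [i32; 4], b: [i32; 4], idx: [usize; 4]) -> [i32; 4] {
    let mut out = [0; 4];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        assert!(i < 8, "shuffle index out of range");
        *o = if i < 4 { a[i] } else { b[i - 4] };
    }
    out
}

fn main() {
    let a = [10, 11, 12, 13];
    let b = [20, 21, 22, 23];
    // Interleave the low halves of both vectors:
    assert_eq!(shuffle2(a, b, [0, 4, 1, 5]), [10, 20, 11, 21]);
}
```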
text/0000-ppv.md (outdated)
this in one form or another, but if one future backend does not, this RFC can be
implemented on top of the architecture-specific types.

## Zero-overhead requirement for backends
The compiler engineer in me is a bit worried by the tone of this section. Obviously we want to make nice source code execute as efficiently as possible, but this is also true of the rest of the language and, in the end, very much best-effort. It is simply not possible to guarantee optimality in any useful sense (cf. the "full employment theorems" for compiler writers 😄).
Shhh. We all want to be employed full time on Rust. Don't spoil this!
Kidding aside: yes, this is all best effort. The good thing is that if there are compiler backend issues that must be resolved, in most cases we can resolve them in Rust - either in the compiler or the library. Sometimes this might not be enough, but at least for `stdsimd` our way has always been to diagnose issues first, file backend bug reports afterwards, and in the meantime add library or rustc workarounds. As the backends and rustc have been fixed (often by ourselves), we have consistently been able to remove library workarounds.

While I wanted to write that the best we can do here is to strive for optimality even though we can never achieve it, I actually think that's not true. For many trivial SIMD operations we can reliably generate optimal code. For non-trivial operations we just have to keep trying. We will obviously never be perfect in all cases, but being good enough in most cases suffices.
@rkruppe I've reworded this section
text/0000-ppv.md (outdated)
called scalable vectors. These include, amongst others, NEC SX,
ARM SVE, and RISC-V Vectors. These architectures have traditionally relied on
auto-vectorization combined with support for explicit vectorization annotations,
but newer architectures like ARM SVE and RISC-V introduce explicit vectorization
SVE, sure, but RISC-V? Leaving aside that neither the vector extension nor compiler support for it is anywhere near finished, the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.
(I actually expect that intrinsics will wind up existing, but I'm still confused by this section implying they already exist. Although, for that matter, there are some aspects of RISC-V that might make assembly more attractive than intrinsics, but whatever.)
> the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.
This is true, but it is also worth remarking 1) that the ARM SVE architects prefer auto-vectorization as well, 2) that they ended up providing explicit intrinsics that map to the ISA instructions anyway, and 3) that, at least compared to "packed" SIMD vector models, the RISC-V vector ISA is not that far from the ARM SVE one.
In any case, time will tell. This paragraph does not suggest that the current programming model applies 1:1 to "scalable" vectors in any way, but rather that scalable vectors would require a slightly different programming model, and that this model does not really interfere that much with non-scalable vectors. Nothing more, nothing less.
Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.
> Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.
I'll reword this section then. Technically, RISC-V does not even have SIMD support yet - the second revision of the vector extension ISA is only a proposal that is still being iterated on in the workshops.
text/0000-ppv.md (outdated)
The `shuffle!` macro returns a new vector that contains a shuffle of the elements in
one or two input vectors. That is, there are two versions:

* `shuffle!(vec, [indices...])`: one-vector version
What form do the indices take? Can they be arbitrary constant expressions, or do they have to be literals? Obviously we want the former but this should be mentioned, since a natural way to implement a macro would be to match on literals.
> Can they be arbitrary constant expressions, or do they have to be literals?
I'll mention that arbitrary constant expressions are allowed.
Minor request: Please define the …

What are the layout guarantees of these types? Are they …
@rkruppe and @hsivonen I think this commit incorporates all of your feedback, but it would be great if you could review the changes in case I missed or misinterpreted something. @scottmcm I've added more documentation about vector masks to the guide-level explanation in this commit. Let me know what you think.
Those are really good questions. The short answer is that, as the ABI section mentions, their layout is unspecified. The ABI section does, however, explicitly forbid their use on …

Also, neither this RFC nor AFAICT the …

```rust
union A {
    data: [f32; 4],
    vec: f32x4,
}

let x: [f32; 4] = unsafe { A { vec: f32x4::splat(0.) }.data };
```

so for all practical purposes this is also currently unspecified behavior (triple-check: @alexcrichton is this statement correct, or is this behavior specified somewhere?). The only thing this RFC (and IIRC the …

A future RFC specifying the layout and ABI of these types will certainly have to properly answer all these questions. Right now, there are just a couple of things that could be done differently ABI-wise and that might produce a better ABI than what we currently have. So the purpose of leaving the ABI unspecified in these initial RFCs is to not limit those efforts in any way.
text/0000-ppv.md (outdated)
# Unresolved questions
[unresolved]: #unresolved-questions

### Interaction with scalable vectors
Nit but I think "scalable" is very much an Arm-ism here. The more neutral and common term is probably "Cray-style".
Yes, I agree. In the introduction I mention:

> packed: means that these vectors have a compile-time fixed size. It is the opposite of scalable or "Cray vectors", which are SIMD vector types with a dynamic size, that is, whose size is only known at run-time.

and I consistently used the term "scalable" throughout the RFC because I thought "Cray vectors" would be a less approachable term. I don't like the term "scalable" much either, TBH.
text/0000-ppv.md (outdated)
The floating-point semantics follow the semantics of `min` and `max` for the
scalar `f32` and `f64` types. That is:

If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN`
@gnzlbg This section also unnecessarily describes behavior that matches the scalar case.
I'd personally expect such …
I think there are a couple of issues. The first one is whether one can …

The other issue is whether these types are layout-compatible with arrays and tuples. That is, whether we guarantee the following assert to pass everywhere:

```rust
union A { arr: [f32; 4], vec: f32x4, tup: (f32, f32, f32, f32) }

let x: [f32; 4] = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.arr };
assert_eq!(x[2], 2.); // OK?
let y: (f32, f32, f32, f32) = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.tup };
assert_eq!(y.2, 2.); // OK?
```

These two cases currently work on all platforms, but this is incidental. For example, the layout of tuples is unspecified, so we can't guarantee this for tuples AFAICT. The layout of arrays is specified, so we could do this for arrays. For that, we probably need to write down that a vector type has the same layout as an array, or something like this. Also, it is worth pointing out that …
* `indices` must be a `const` array of type `[usize; N]` where `N` is any
  power-of-two in range `(0, 2 * {vec,vec0,vec1}::lanes()]`.
* the values of `indices` must be in range `[0, N)` for the one-vector version,
  and in range `[0, 2N)` for the two-vector version.
Is this precondition statically checked when possible? Is it dynamically enforced when necessary? If not, what are the results of indexing outside the range?
EDIT: oh, answered here
The line below this one states:

> On precondition violation a type error is produced.

The current implementation produces a monomorphization-time error, but that can be fixed.
@gnzlbg ok cool! I wonder, would it be possible to use …
The problem is that if we make them newtypes then …
@gnzlbg ah ok. I do suspect though that trying to work with …
I understand that such a type definition could be contentious, but it is very common in the ecosystem (e.g. …
@gnzlbg and I had some discussion on IRC (pasted below), but unfortunately not a ton of conclusions :(

oh, also gisted a bit with @rkruppe which I didn't get a chance to follow up on over the weekend!
FWIW my recollection of the internals thread(s) I participated in about this is that people were concerned about it being premature to speculate about …

If @gnzlbg has implemented it and it turned out that it not only works but works better than the alternatives, then that answers some of those questions, I think?
@glaebhoerl My main worry is that I don't know the answer to the following questions:

…

so I don't know whether it is possible to stabilize some part of the implementation without stabilizing all of it (and instantaneously stabilizing all future additions). (EDIT: these questions are important because given a …)

If the answer to both questions is that yes, those things are possible, and trait impls behind feature flags do not affect type inference on stable, nor on nightly when that particular feature flag is not active, then I think all my concerns would be resolved (cc @eddyb?).

@alexcrichton mentioned that similar discussions about generics have been had, and that other points about these issues have been raised. It might be worth digging those up: could someone share a link?
@gnzlbg yes …
So this might be the killer. To explain the problem: …

The way that's implemented in the library is just using a trait to constrain …

This is problematic, because there are parts of the library that we don't want to stabilize at first, like 512-bit-wide vectors, maybe vectors of pointers, etc., but we still want to be able to use and improve them on nightly. The problem is that we can't do that, because we can't mark the …

AFAICT const generics won't help. If we had a …

The only idea I have is that we could use conditional compilation to only put in …
Sorry about the slow response time. Looks great! Thank you. (I wish we could get safe wrappers around endianness-exposing bitcasts of vectors of the same bit width but different lane configuration sooner rather than later, though.)
Is it endianness-dependent? (Asking out of curiosity, not out of complaint, because endianness-exposing bitcasts were excluded from the RFC but endianness-dependent …
Out of curiosity, have you checked if the compiler is smart enough to hoist the panic condition checks out of the loop when used with …
No. AFAIK one cannot create endian-dependent behavior with …
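Why bitcasts between different lane configurations are endianness-exposing (the point raised a few comments up) can be shown with scalars: the same bytes reinterpreted at a wider lane width give different values depending on byte order. This is an illustrative demo, not part of the RFC's API:

```rust
// The same four bytes viewed as one u32: a little-endian target sees
// 0x04030201, a big-endian target sees 0x01020304. A u8x4 -> u32x1
// lane-reinterpreting bitcast would behave like the native-endian view,
// which is exactly what makes such bitcasts unsafe to expose portably.

fn main() {
    let bytes = [0x01u8, 0x02, 0x03, 0x04];
    let native = u32::from_ne_bytes(bytes);
    let le = u32::from_le_bytes(bytes);
    let be = u32::from_be_bytes(bytes);
    assert_eq!(le, 0x0403_0201);
    assert_eq!(be, 0x0102_0304);
    // On any given target, the native view matches exactly one of the two:
    assert!(native == le || native == be);
}
```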
Most of the recent work in the …

Having said this, there have been many changes that have improved performance in recent weeks, so when all benchmark-related PRs have been merged, it might be worth testing this again by just enabling …
@gnzlbg ah yeah that's why I was wondering if newtype wrappers would work, but I unfortunately don't know of a great solution :(
I'm closing this for now.

This is kind of blocked on how much time I've got, and I don't think we should advance this RFC until both issues are resolved, so I'm closing it for now to reduce the RFC repo queue. Once that work is done, I'll update the RFC and re-open it, or re-submit it if the changes are significant (I hope not).
This RFC extends Rust with portable packed SIMD vector types.
Rendered.
Acknowledgments
It would not have happened without the invaluable contributions of @alexcrichton and @BurntSushi, the excellent feedback given by the many reviewers of the initial drafts: @sunfish, @nagisa, @eddyb, and @hsivonen, amongst others, all the people using `stdsimd`, filing issues, and submitting PRs, and @huonw's work on the `rust-lang-nursery/simd` crate.