RFC: SIMD spec #9389
Comments
See also #903
I must say I really love the explicit […] syntax. Is there a clear design roadmap/RFC for SIMD in zig beyond the comments in #903? If not, maybe it would be worth it to start a document where the existing implementation and ideas like these can be consolidated and massaged into a single vision?
This proposal is lacking some details; for example, left and right shift can be both arithmetical and logical. The other problem is that it sidesteps all the quirks of floating-point behavior in SIMD, which are relevant to estimate whether extended-semantics integer and float SIMD are unifiable or not, how much of a maintenance burden it would be, future directions of hardware, etc.
The goal of this proposal is to use the same operation semantics as for scalars, so left and right shift semantics are arithmetical for signed types and logical for unsigned types. The same goes for the other operations regarding overflows: use the […] variants. I think we should not do anything related to signaling, and just always use the non-signaling behavior, at least for now. Signaling variants could be implemented later with builtins.
The semantics of floating-point operations in this proposal are exactly the same as for scalars, so full IEEE754 (except maybe for signaling NaNs). The goal of this proposal is not to be able to implement any algorithm efficiently on any platform. If some platforms have specific instructions or features that do not fit into this model, we should not try to force them into it: #7702. If you want to deal with platforms without IEEE754 (which are rarer and rarer), then intrinsics should be the way to go. EDIT: I have updated the RFC to reflect my position on this.
I'm very new to zig, but how do vectors work now for unclear operations?
This is a fantastic interface. The required target should support 128-bit instructions. But many things can be unclear.
What is preventing operations like […]? Continuing with […]: we could have the size of the operation be inferred from the size of the load, but that isn't necessarily what "everyone" would want. I personally would be fine with that.
You'll be using the types […]. You can also use operators to get operations that don't correspond to one instruction on common targets, like srli_epi8 or cmpgt_epu32, and you'll ultimately get a sequence of instructions for these (probably !(a==min(a,b)) for a>b, and a 16-wide shift and mask for the 8-wide shift).
Hello everyone, I would like to discuss the SIMD aspect of Zig.
There are many aspects to consider with SIMD. To simplify the discussion, I will use the following straw-man syntax to designate a vector: `@Vector(N, T, ABI)`. As it will be important, `@Mask(N, T, ABI)` is a mask type for a `@Vector(N, T, ABI)` vector.

Types that can be elements of a SIMD vector
Vectors of arithmetic types should definitely be supported (`iN`, `uN`, `fN`).
I think it would also be a good idea to support vectors of pointers, especially to achieve nice scatter/gather syntax.
We could also support vectors of optionals as a way to represent a masked vector.
It would also be quite interesting to have vectors of vectors as a fancy way to have SIMD-in-SIMD computation, which is especially useful for shuffles.
However, I don't see much use for vectors of (tagged) unions or structs. It would definitely be possible to standardize those if we really wanted, but for now, let's keep them out of the discussion.
Another question is: do we want to consider vectors of non-native arithmetic types, like `i30` or `f128`? Also, what would be the size of a `@Vector(4, i30, ABI)`?
I think Zig already has an answer to that question, but it would be interesting to reassess it.
Size of a vector
What should be the allowed vector sizes (the `N` from `@Vec(N, T, ABI)`)?
Every time I try to write SIMD wrappers in C or C++, I limit myself to powers of 2 because it is much simpler, and there is not really much use for non-powers of 2.
But considering that you already can have non-power-of-2-wide integers (like `i30`), maybe we could allow them, but it might make some operations more tricky.

Order of the elements
This is a confusing part for many people.
We tend to think about integers in big endian (most significant bits on the left), while thinking about memory in little endian (least significant bits on the left).
Intel documentation refers to everything in big endian: integers, vector elements, memory.
I think we should keep big endian for integers and little endian for everything else, as it is the most natural, but we should avoid talking about left and right when referring to elements inside vectors. We should most likely prefer the terms low and high instead. The order is then consistent with arrays, and thus no extra confusion arises.
Masks
This is a big chunk. Recent SIMD architectures have native masks (AVX512, SVE, RISC-V V). However, their interfaces are not compatible with each other: SVE masks have 1 bit per byte, AVX512 (and maybe RISC-V V) masks have 1 bit per element, and emulated masks (for archs without masks, like SSE or Neon) have 1 bit per bit.
So it is impossible to have a single mask type for either all vectors of the same size, or the same length. It depends on the target arch.
That is why I believe we need a different mask type for every single vector type (hence the `T` in `@Mask(N, T, ABI)`). In theory, we might just need to know the number of bits of the elements and not their actual types, but I think Zig has no problem handling a gazillion types.
Also, we really need to deal with the ABI when defining the type, as a `@Mask(4, i32, .AVX512)` would be a `%k` register, while a `@Mask(4, i32, .SSE2)` would be a `%xmm` register, despite both masking a vector of 4 `i32`.
Also, you might want to write code for different archs in the same binary (for dynamic dispatch, for instance).
The layout of a mask is then opaque. The `ABI` tag can change the actual layout used.
Semantically, a `@Mask(N, T, ABI)` is the same as a `[N]bool` or `@Vector(N, bool)`, but optimized for masking a `@Vector(N, T, ABI)`.

Operations
Arithmetic and bit operations
This is the simplest part of SIMD. Arithmetic operations are defined lane-wise (i.e., per element), and are only defined on vectors of the same length. Specifically, operations keep their scalar semantics. The standard type coercion rules for element types apply.
- `-a` (with wrap-around: `-%a`)
- `a + b` (with wrap-around: `a +% b`, saturating: `a +| b`)
- `a - b` (with wrap-around: `a -% b`, saturating: `a -| b`)
- `a * b` (with wrap-around: `a *% b`, saturating: `a *| b`)
- `a / b`
- `a % b`
- `~a`
- `a & b`
- `a | b`
- `a ^ b`
- `a << b` (lane-wise)
- `a >> b` (lane-wise, arithmetic for signed types and logical for unsigned types, like scalar shifts)
- `@byteSwap`, `@bitReverse`, `@clz`, `@ctz`, `@divExact`, `@divFloor`, `@divTrunc`, `@max`, `@min`, `@mod`, `@popCount`, `@rem`, `@shlExact`, `@shrExact`
- `@sqrt`, `@sin`, `@cos`, `@tan`, `@exp`, `@exp2`, `@log`, `@log2`, `@log10`, `@fabs`, `@floor`, `@ceil`, `@trunc`, `@round`
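As an illustration of these semantics, here is a sketch in the straw-man syntax (the `ABI` parameter does not exist in current Zig, and lane-wise saturating operators on vectors are part of this proposal, not the current language):

```zig
// Hypothetical: uses the proposal's @Vector(N, T, ABI) straw-man syntax.
fn axpy(
    a: @Vector(8, f32, .AVX2),
    x: @Vector(8, f32, .AVX2),
    y: @Vector(8, f32, .AVX2),
) @Vector(8, f32, .AVX2) {
    // Lane-wise multiply and add, with the same semantics as scalar f32.
    return a * x + y;
}

fn sumSat(a: @Vector(16, u8, .SSE2), b: @Vector(16, u8, .SSE2)) @Vector(16, u8, .SSE2) {
    // Saturating add: each lane clamps to 255 instead of wrapping.
    return a +| b;
}
```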
It might be nice to have `@addWithOverflow`, `@subWithOverflow`, `@mulWithOverflow`, and `@shlWithOverflow` variants that return a mask instead of a bool. However, those might be tricky to implement efficiently.
As with the scalar operations, all floating-point operations are IEEE754 compliant unless configured otherwise with `@setFloatMode`. In case of `FloatMode.Optimized`, the compiler would be allowed to use approximate instructions to perform SIMD division on architectures that do not support it, like ARMv7. In case of `FloatMode.Strict` (the default), division would be forced to an IEEE754 implementation, even if the arch does not support it. This could be implemented by devectorizing the operation.

Note about signaling and traps
For now, signaling and trapping operations have no special support in Zig. I think we should keep it that way for SIMD operations. This means that no operation is signaling or trapping, including overflows, integer division by zero, or square root of a negative number. They do, however, produce detectable Undefined Behavior like their scalar counterparts.
Conversion
Conversion is a simple case of element-wise operation, but the input and output types can mismatch in every aspect except their length. A specific builtin could be used, or we could just overload the scalar ones to the same effect.
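For instance, overloading the existing scalar conversion builtins could look like this (a sketch under the straw-man syntax; `@intToFloat` accepting a vector result type is an assumption of this proposal, not current Zig):

```zig
// Hypothetical: element-wise conversion by overloading a scalar builtin.
fn toFloats(v: @Vector(4, i32, .SSE2)) @Vector(4, f32, .SSE2) {
    // Lengths must match (4 lanes); element type and register kind may differ.
    return @intToFloat(@Vector(4, f32, .SSE2), v);
}
```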
Reinterpret cast (or bit cast)
It is extremely common in SIMD to reinterpret the bits of a vector from one type to another.
This is an operation we really want to be available.
Reusing `@bitCast` would be fine, with the same constraints.
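A sketch of what this could look like in the straw-man syntax (the result type must have the same total bit size, mirroring the scalar `@bitCast` constraint):

```zig
// Hypothetical: reinterpret 4 x i32 as 16 x u8 (same total size: 128 bits).
fn bytesOf(v: @Vector(4, i32, .SSE2)) @Vector(16, u8, .SSE2) {
    return @bitCast(@Vector(16, u8, .SSE2), v);
}
```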
Vector creation
To create a vector with specific elements, an anonymous list literal could be used:
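A minimal sketch, assuming vector types coerce from anonymous list literals the way arrays do:

```zig
// Hypothetical: build a 4-lane i32 vector from an anonymous list literal.
const v: @Vector(4, i32, .SSE2) = .{ 1, 2, 3, 4 };
```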
However, this way of creating a vector tends to be tedious for large vectors (in AVX512, we can have `@Vector(64, i8, .AVX512)`, which would require 64 elements in the literal).
There are 2 extremely common patterns:
- a vector whose elements all have the same value (`@splat` in the current zig)
- a vector of sequential values (`{0, 1, 2, 3...}`)

I think the first one could be covered by implicit conversion (more on that later), but `@splat` is also fine.
The second one would need a new builtin like `@iota(@Vector(N, T, ABI))` (or a type that implicitly coerces to any vector type, creating the right vector).

Element access
Accessing a specific element could be as simple as `vec[i]`, with `vec` a vector and `i` an integer.
Shuffles are extremely important in SIMD. There are many flavors of shuffles, even for the general purpose ones.
The current `@shuffle` builtin has a weird semantic for selecting from the second vector. It is more usual for the indices that select the second vector to come just after the indices that select the first vector, which makes most usage much simpler.
Also, runtime indices are crucial for some algorithms, as they allow keeping a LUT in a register.
I recommend using the following semantics for a general-purpose shuffle:
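A possible shape for this, sketched in the straw-man syntax (the builtin name and exact signature here are my assumption):

```zig
// Hypothetical general-purpose shuffle:
//   result[i] = a[idx[i]]          if idx[i] < len_a
//   result[i] = b[idx[i] - len_a]  otherwise
// idx may be a runtime vector, which enables in-register LUTs.
fn shuffle(
    a: @Vector(len_a, T, ABI),
    b: @Vector(len_b, T, ABI),
    idx: @Vector(len, IdxT, ABI),
) @Vector(len, T, ABI);
```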
The fact that `len_a`, `len_b` and `len` could be different could really lead to some interesting patterns, like merging 2 vectors together. In the case where `a`, `b` and `idx` have the same power-of-2 length, the semantics of this function would be exactly the same as gcc's `__builtin_shuffle`.
Some special shuffle patterns could have dedicated builtins just to make them easier, like concat_odd, interleave_low, interleave_even...
Also, a one-input shuffle is often useful, and such a shuffle could have a special syntax: `a[idx]`.
.Memory access
This might be a bit controversial, but when writing SIMD code, we often have explicit memory loads and stores. This means that we do not have pointers/slices of vectors; we have pointers/slices of scalars, and we use a load intrinsic to get a vector from such a pointer/slice.
This is simpler to reason about than casting scalar pointers to vector pointers and dereferencing. And it does not require changing the input/output data types.
That is why I think it is good to have a way to do this without relying on pointer casting. `@load(@Vector(N, T, ABI), p)` and `@store(v, p)` could do it.
But sometimes, it might be useful to load elements that are not contiguous but have a constant stride. Another builtin might be useful for that (like `@load_strided`).
And finally, there are cases where we need to load elements whose indices in an array are given in a vector. This load operation is a gather (a scatter for a store). They would have the following semantics:
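A sketch of these builtins under the straw-man syntax (the `@gather`/`@scatter` names and signatures are my assumption; only `@load`/`@store`/`@load_strided` are named in this proposal):

```zig
// Hypothetical memory-access builtins.
// Contiguous load/store of N elements of T starting at p:
const v = @load(@Vector(4, f32, .SSE2), p);
@store(v, p);

// Strided load: element i comes from p[i * stride].
const s = @load_strided(@Vector(4, f32, .SSE2), p, stride);

// Gather/scatter: element i comes from / goes to p[idx[i]].
const g = @gather(@Vector(4, f32, .AVX2), p, idx);
@scatter(g, p, idx);
```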
If the compiler can detect that `idx` is of the form `.{ i + 0*a, i + 1*a, i + 2*a, i + 3*a... }`, then it could use a more specialized instruction, and even more so if `a` is 0 (a plain old load/store).
Vectors of pointers could be another way to have gathers/scatters using the same kind of syntax trick.
Masks
Masks could be manipulated with bitwise operations like standard vectors, but would have no arithmetic or shift operations.
They would, however, have shuffle operations.
Masks would primarily be generated by comparing vectors, but could also be created from a list literal (of bools).
A mask cannot be loaded or stored in a meaningful way because of the arch-dependent layout, so no load/store would be provided.
However, conversion between masks and vectors would be necessary.
Converting a vector to a mask would be semantically equivalent to creating the mask from the sign bits of the vector (for arithmetic types).
And converting from a mask would be equivalent to creating a vector whose elements have all bits set where the mask is set for that element.
A mask could be used to select elements from 2 vectors.
In gcc extensions, such an operation has been given the syntax `m ? a : b`, but as Zig has no ternary operator, I don't recommend having a special syntax for that.
Also, we could have a syntax for masked values: `a[m]` or `m(a)`. Those masked values could be assignable.
It would also be useful to have builtins to convert masks from/to plain integers (a bag of bits), as this can be used for LUTs in some algorithms.
Such functions could be the portable way to load/store masks in memory.
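A sketch of mask generation and selection under the straw-man syntax (current Zig has a `@select` builtin over `@Vector(N, bool)`; extending it to this proposal's opaque mask type is an assumption):

```zig
// Hypothetical: generate a mask by comparison, then blend two vectors.
fn clampNegToZero(v: @Vector(8, f32, .AVX2)) @Vector(8, f32, .AVX2) {
    const zero: @Vector(8, f32, .AVX2) = .{ 0, 0, 0, 0, 0, 0, 0, 0 };
    // Lane-wise compare yields a @Mask(8, f32, .AVX2); its layout is opaque.
    const m = v < zero;
    // Lane-wise select: zero where the mask is set, v elsewhere.
    return @select(f32, m, zero, v);
}
```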
Platform specific operations
Many SIMD ISAs have custom operations that handle very specific use-cases which do not fit well into an abstract interface like this proposal. These operations should not be integrated into this proposal in any special way.
They should instead be accessible in a non-generic way, like through intrinsics: #7702.
Implicit conversions
Something that would be really nice is implicit conversion from scalars to vectors.
It makes it so much easier to write SIMD code:
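For example, with implicit scalar-to-vector conversion, scalar constants could appear directly in vector expressions (a sketch; the coercion rule is exactly what this section proposes):

```zig
// Hypothetical: the scalars 2.0 and 1.5 implicitly splat to all 8 lanes.
fn scaleAndBias(v: @Vector(8, f32, .AVX2)) @Vector(8, f32, .AVX2) {
    return v * 2.0 + 1.5; // instead of splatting each constant explicitly
}
```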
In fact, I think we could provide implicit conversion from a small vector to a larger vector (by repeating the elements) with the following constraints:
Sorry for such a long issue, but I have the feeling that it is required.
Maybe I could later split it into multiple issues to better keep track of the discussion, but I'm not sure.