Add packing and unpacking selection operations for Vectors #15837
Comments
A few questions from my side
While it is conceivable that we could have an extremely generic
I'm not sure what you're asking.
Not really, but I found this interesting:
Unfortunately I do not know enough about all these architectures to know how to square this circle. However, I agree with the conclusion reached by #7702 that intrinsics are sometimes necessary to get the full performance out of your CPU and cannot be expressed generically. Not every operation can be cleanly mapped to every architecture. There's no
This proposal indeed adds yet another operation or two that will run better or worse on different architectures. I'm suggesting adding these operations because they are generally useful, common, and difficult or impossible to implement efficiently without being blessed by the compiler. Also, these operations should be encouraged, in my view: I believe the example code I gave is a compelling strategy and makes for good software. Although I mentioned a direct mapping to
packSelect is a very useful operation, but it seems like it will be difficult to get consensus on an implementation if it's a builtin. Some of the fastest implementations of this operation on targets like AVX2 involve giant lookup tables of tiny lookup tables.
I'd be interested in any links you might be referencing. To me it seems like Zig at the moment should be fine with giant lookup tables in ReleaseFast, since it forces the use of O3 anyway and thus automatic loop unrolling is done everywhere by default, so code size reduction seems like less of a priority in general. I think this technically would come from the data cache, but I think the same principle applies. It would be nice to have more control over these things though, one day. #978 might provide a decent way to switch between different implementations of packSelect.
I heard about simdprune but for some reason didn't link it here. It's an implementation that should be considered.
Do you have any numbers, or can you link any representative numbers to extrapolate from?
Of Daniel Lemire's implementations, the giant lookup table versions are the fastest in the benchmarks (https://github.com/lemire/simdprune#how-fast-is-it). The tables file is here.
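For illustration, here is a rough portable Zig sketch of the table-lookup idea for a single 8-lane chunk. This is not Lemire's code; the table layout and names are just for exposition, and real AVX2 versions store pshufb shuffle masks and apply them per 16-byte chunk instead of gathering lane by lane.
const std = @import("std");

// For every possible 8-bit mask, precompute which source lanes survive.
// `pack_table` is the "giant lookup table".
const pack_table: [256][8]u8 = blk: {
    @setEvalBranchQuota(100_000);
    var table: [256][8]u8 = undefined;
    for (&table, 0..) |*row, mask_bits| {
        var out: usize = 0;
        for (0..8) |lane| {
            if ((mask_bits >> @intCast(lane)) & 1 != 0) {
                row[out] = @intCast(lane);
                out += 1;
            }
        }
        while (out < 8) : (out += 1) row[out] = 0; // trailing lanes are don't-cares
    }
    break :blk table;
};

fn packSelect8Lut(vec: @Vector(8, u8), mask: @Vector(8, bool)) @Vector(8, u8) {
    const elems: [8]u8 = vec;
    const row = pack_table[@as(u8, @bitCast(mask))];
    var result: [8]u8 = undefined;
    for (&result, row) |*dst, src_lane| dst.* = elems[src_lane];
    return result;
}
For a wider vector you would apply something like this per 8-byte chunk and then concatenate the chunks according to their popcounts, which is where most of the real implementation effort (and the table-size tradeoff) lives.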
If this were to be a builtin, I would argue that it should remain separate from the
I'm not 100% sure that this should be a builtin though, considering that it can be implemented in userspace as a library, like Daniel Lemire's implementation. Given that, I guess I'd want to ask: why should this be implemented as a builtin rather than as a library?
This is what I argued for in my above comment.
At the moment Zig lacks support for intrinsics, so there is no way to access LLVM's vector facilities except through builtin functions or through trying to write a function that LLVM can recognize as equivalent to an intrinsic (which does not work for everything you might want). But even once we have intrinsics, I still think these operations belong in the class of fundamental vector operations with
I don't agree with the idea that the language should make something that could be done with the existing vector operations into its own builtin – the standard library is a better place for this, and there's nothing wrong with using packages for whatever isn't in the standard library. However, I missed your earlier mention of
@Validark The packSelect function cannot be implemented using the following method. Can you provide detailed code?
Below is the Zig code for the packSelect function I wrote according to the above method, but the result is incorrect.
const std = @import("std");
fn vectorLength(comptime VectorType: type) comptime_int {
return switch (@typeInfo(VectorType)) {
.Vector => |info| info.len,
.Array => |info| info.len,
else => @compileError("Invalid type " ++ @typeName(VectorType)),
};
}
fn VecChild(comptime T: type) type {
return std.meta.Child(T);
}
pub fn packSelect(vec: anytype, mask: @Vector(vectorLength(@TypeOf(vec)), bool)) @Vector(vectorLength(@TypeOf(vec)), VecChild(@TypeOf(vec))) {
const Child = VecChild(@TypeOf(vec));
const vecLen = comptime vectorLength(@TypeOf(vec));
const int_mask = @as(std.meta.Int(.unsigned, vecLen), @bitCast(mask));
std.debug.print("packSelect int_mask is: 0b{b:0>32}\n", .{int_mask});
const select_mask = std.simd.iota(u8, vecLen) >= @as(@Vector(vecLen, u8), @splat(@as(u8, @popCount(int_mask))));
return @select(Child, select_mask, vec, @as(@Vector(vecLen, Child), @splat(0)));
}
Could you help me point out what's wrong with the above code?
@flyfish30 That quote was referring to this behavior:
The best way to do this depends on your target CPU architecture, and at the moment, I have not written an extensive polyfill for packSelect (although I have done so for unpackSelect). What kind of CPU are you targeting? But if I were you, I would start with https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask. A pext-based solution would be your best bet if you are on an Intel chip, Haswell (2013) or newer, or an AMD Zen 3 machine (see the sketch at the end of this comment). Otherwise, you could try generalizing this solution from that stackoverflow link:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Another thing that is on my to-study list is this mention in the risc-v vector ISA: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#158-vector-iota-instruction. For some reason, it explains:
Which is weird, because it already has a vector compress instruction. However, it might be possible to do a SWAR emulation of the routine given there, but I just don't understand what magic the
Is this enough information for you?
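For reference, here is a rough Zig sketch of the pext idea for one 8-byte chunk. The names are illustrative, pext64 is a software stand-in for the BMI2 instruction you would actually want on Haswell+/Zen 3, and the byte layout assumes a little-endian target.
// Software stand-in for the BMI2 `pext` instruction; on a BMI2-capable
// x86_64 target you would use the hardware instruction instead.
fn pext64(value: u64, mask: u64) u64 {
    var result: u64 = 0;
    var m = mask;
    var out: u7 = 0;
    while (m != 0) : (m &= m - 1) {
        const bit_index: u6 = @intCast(@ctz(m));
        result |= ((value >> bit_index) & 1) << @as(u6, @intCast(out));
        out += 1;
    }
    return result;
}

// Pack the selected bytes of an 8-byte chunk to the front by widening each
// mask bit to a whole byte and extracting those bytes with pext.
fn packSelect8Pext(bytes: @Vector(8, u8), mask: @Vector(8, bool)) @Vector(8, u8) {
    const wide_mask: u64 = @bitCast(@select(
        u8,
        mask,
        @as(@Vector(8, u8), @splat(0xFF)),
        @as(@Vector(8, u8), @splat(0)),
    ));
    return @bitCast(pext64(@bitCast(bytes), wide_mask));
}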
@Validark Thanks for your help!
@Validark I used generic SIMD operations to implement the packSelect function with a table-lookup method.
This is a proposal to add @packSelect and @unpackSelect. These operations are analogous to pdep/pext/@extractBits/@depositBits (see: #14995), except this proposal is for Vectors, not fixed-length bitvectors (i.e. integers).
packSelect
I define @packSelect(mask: @Vector(VEC_SIZE, bool), vector: @Vector(VEC_SIZE, VEC_TYPE)), which packs the vector into the left-hand side according to mask. It's basically like pext but it operates on vector lanes instead of bits. This is equivalent to VPCOMPRESS on new x86_64 machines. However, even without VPCOMPRESS support, this is a very common operation that can be performed in a wide variety of ways. Here are some stackoverflow questions about this:
https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask
https://stackoverflow.com/questions/28735461/shift-elements-to-the-left-of-a-simd-register-based-on-boolean-mask
https://stackoverflow.com/questions/25074197/compact-avx2-register-so-selected-integers-are-contiguous-according-to-mask
https://stackoverflow.com/questions/7886628/optimizing-array-compaction
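To pin down the intended semantics, here is a scalar reference sketch (purely for reference, not an implementation strategy): selected lanes move to the front in order, and the remaining lanes are left unspecified.
fn packSelectReference(
    comptime VEC_SIZE: usize,
    comptime VEC_TYPE: type,
    mask: @Vector(VEC_SIZE, bool),
    vector: @Vector(VEC_SIZE, VEC_TYPE),
) @Vector(VEC_SIZE, VEC_TYPE) {
    const lanes: [VEC_SIZE]VEC_TYPE = vector;
    const keep: [VEC_SIZE]bool = mask;
    var result: [VEC_SIZE]VEC_TYPE = undefined; // unselected tail stays undefined
    var out: usize = 0;
    for (lanes, keep) |lane, selected| {
        if (selected) {
            result[out] = lane;
            out += 1;
        }
    }
    return result;
}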
Motivating example
When parsing a file, we often want to copy some of a file over to a buffer. Let's say we are reading a JSON file into vectors of size 64 and the first 64 characters are "name": "Validark", "is_programmer": true, "favorite_color": "re. Here is this information as Zig code:
Let's say we want to copy all of the characters between quotes into a buffer. For simplicity we are going to assume there are no escaped quotation marks within quoted strings. Here is the simplified inner loop of the first iteration:
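A rough sketch of what that could look like (illustrative only: copyQuotedChunk is a made-up name, and it uses the proposed @packSelect plus a prefixXor helper like the one sketched below, so it will not compile today):
const VEC_SIZE = 64;

const first_chunk: [VEC_SIZE]u8 =
    "\"name\": \"Validark\", \"is_programmer\": true, \"favorite_color\": \"re".*;

fn copyQuotedChunk(chars: @Vector(VEC_SIZE, u8), buffer: []u8, buffer_len: *usize) void {
    // Bitmask of the quote characters in this 64-byte chunk.
    const quotes: u64 = @bitCast(chars == @as(@Vector(VEC_SIZE, u8), @splat('"')));
    // Bytes that sit strictly between quote pairs (see the prefix_xor explanation below).
    const quoted_mask = prefixXor(quotes) & ~quotes;
    // Pack the quoted bytes to the front of the vector and append them to the buffer.
    const packed_vec = @packSelect(@as(@Vector(VEC_SIZE, bool), @bitCast(quoted_mask)), chars);
    buffer[buffer_len.*..][0..VEC_SIZE].* = packed_vec;
    buffer_len.* += @popCount(quoted_mask);
}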
Here are the vectors/bitvectors:
Click here for more information on the prefix_xor operation
prefix_xor
See this article: https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
Here is an implementation of it with #9631:
Here is another implementation that does not rely on carryless multiply. Hopefully one day LLVM will know this is equivalent when carryless multiply is not supported (or very slow) on a particular machine:
(no guarantees when passing in integers that are not powers of 2)
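As an illustration of that approach, a doubling-shift version of prefix_xor (a sketch, not the implementation referenced above) could look like this:
// Each step folds in the running XOR from twice as far away, so after
// log2(bit count) steps every bit holds the XOR of itself and all lower bits.
fn prefixXor(x: anytype) @TypeOf(x) {
    var result = x;
    comptime var shift = 1;
    inline while (shift < @bitSizeOf(@TypeOf(x))) : (shift *= 2) {
        result ^= result << shift;
    }
    return result;
}
Calling prefixXor(quotes) on the quotes bitmask from the example above yields a mask in which each bit is set exactly when an odd number of quote bits sit at or below that position, which is what identifies the quoted regions.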
What about the fill values?
In case it comes in handy for optimization purposes, it might be a good idea to make it undefined what the non-packed values are in a packed vector. Hopefully that also means it is undefined what bytes will end up in chars[@popCount(quoted_mask)..VEC_SIZE]; in the code above. Even writing nothing at all to those bytes should be valid in the case above. x86_64 has a variant which fills the rest of the vector with values from another source and a variant which fills it with zeroes. It could be nice to be able to specify which behavior you want. E.g., one could pass in src, or @splat(VEC_SIZE, @as(u8, 0)), or @splat(VEC_SIZE, @as(u8, undefined)) if you aren't relying on any particular behavior.
However, this effect can already be achieved by creating a mask with std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask))) and then doing a @select to move either src or @splat(VEC_SIZE, @as(u8, 0)) in the right places. Hopefully the optimizer will be smart enough at some point to know that that pattern still only needs 1 VPCOMPRESSB on AVX512_VBMI2 + AVX512VL x86_64 machines. Alternately, @packSelect could be made to take in a scalar value with which to fill in the empty spaces, but I think this might lead to someone filling with 0's, then doing a @select which turns all the 0's into the element from src. That would be very bad for the optimizer because it would then have to prove that 0's could not have been selected by @packSelect, which would be impossible to prove in cases like the example given above. Hence, I think making it fill with undefined's is the best move.
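Concretely, that already-possible pattern would look something like this (a sketch reusing chars and quoted_mask from the example above; src is an illustrative fill vector):
// Keep the packed bytes in the low lanes and take the tail lanes from `src`.
const fill_from_src = @select(
    u8,
    std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask))),
    src,
    @packSelect(@as(@Vector(VEC_SIZE, bool), @bitCast(quoted_mask)), chars),
);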
Other uses
Here are some more problems that can be solved with @packSelect:
http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html
https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/
Here is a fun snippet that would print the indices in a vector where tabs occur:
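Something along these lines (a sketch using the proposed @packSelect; std.simd.iota and std.simd.countTrues are existing standard-library helpers, and printTabIndices is an illustrative name):
const std = @import("std");
const VEC_SIZE = 64;

fn printTabIndices(line: @Vector(VEC_SIZE, u8)) void {
    const is_tab = line == @as(@Vector(VEC_SIZE, u8), @splat('\t'));
    // Pack the lane indices of the tabs to the front of an index vector.
    const indices: [VEC_SIZE]u8 = @packSelect(is_tab, std.simd.iota(u8, VEC_SIZE));
    for (indices[0..std.simd.countTrues(is_tab)]) |i| {
        std.debug.print("{}\n", .{i});
    }
}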
Daniel Lemire has some articles on how to efficiently iterate over set bits too. Hopefully code like the above could one day be optimized as well as the C++ code which uses intrinsics:
https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2022/05/06/fast-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
unpackSelect
The second part of this proposal is for @unpackSelect, which corresponds to VPEXPAND on x86_64, which is basically like PDEP but operates on vector lanes rather than bits. It's the opposite of @packSelect, so in the example above, you could spread out the bytes in the packed_vec back into the same positions as in vec by doing @unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), packed_vec). Again, even without direct VPEXPAND support this operation can be done in a number of ways on different architectures. Note: I am using the same signature as @packSelect above. Here is a stackoverflow with one method given for accomplishing this operation:
https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a
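As with packSelect, the intended semantics can be pinned down with a scalar reference sketch (for reference only): selected positions receive successive lanes of the input, and unselected positions are left unspecified.
fn unpackSelectReference(
    comptime VEC_SIZE: usize,
    comptime VEC_TYPE: type,
    mask: @Vector(VEC_SIZE, bool),
    vector: @Vector(VEC_SIZE, VEC_TYPE),
) @Vector(VEC_SIZE, VEC_TYPE) {
    const lanes: [VEC_SIZE]VEC_TYPE = vector;
    const expand: [VEC_SIZE]bool = mask;
    var result: [VEC_SIZE]VEC_TYPE = undefined; // unselected positions stay undefined
    var in: usize = 0;
    for (&result, expand) |*dst, selected| {
        if (selected) {
            dst.* = lanes[in];
            in += 1;
        }
    }
    return result;
}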
Uses
@unpackSelect is useful for a few situations I can think of: say you have vec1 containing [a, b, c, d, e, f, g, h], and a mask that indicates you want to convert vec1 into [_, b, c, _, e, _, _, h], with each _ replaced by successive values in vec2, which contains [1, 2, 3, 4, 5, 6, 7, 8]. You need to get a vector with [1, _, _, 2, _, 3, 4, _] so a @select can use mask and vec1 to produce [1, b, c, 2, e, 3, 4, h]. The missing vector can be generated with @unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), mask), vec2).
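As a sketch of that merge (illustrative; the element type and names are arbitrary, and mask is already a bool vector here so no @bitCast is needed):
const VEC_SIZE = 8;
// mask marks the lanes of vec1 to replace with successive values from vec2.
const mask = @Vector(VEC_SIZE, bool){ true, false, false, true, false, true, true, false };
const vec1 = @Vector(VEC_SIZE, u8){ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
const vec2 = @Vector(VEC_SIZE, u8){ 1, 2, 3, 4, 5, 6, 7, 8 };
// [1, _, _, 2, _, 3, 4, _]: vec2's first four lanes spread to the marked positions.
const spread = @unpackSelect(mask, vec2);
// [1, 'b', 'c', 2, 'e', 3, 4, 'h']
const merged = @select(u8, mask, spread, vec1);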
What about the fill values?
I think the same logic applies to @unpackSelect as to @packSelect: it should be undefined what the non-relevant values are.
Other uses
More problems solvable with @unpackSelect/VPEXPAND:
http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html
https://zeux.io/2022/09/02/vpexpandb-neon-z3/
Bikeshedding welcome.