Description
Hello @flyfish30, I am writing this at your request in the following comment: ziglang/zig#15837 (comment)
I intend to write my own version of these compaction instructions at some point, just to think it through fully and see if I can do better than what you have done here, but in the meantime I thought I should at least share my preliminary thoughts and opinions on the code presented here. I have not run your code, but I have a few observations based on my work on the Accelerated Zig Parser.
First of all, good job doing this work and figuring this stuff out.
I actually did learn something/get an idea from looking at your code, here:
Lines 293 to 295 in e2447be
In my experiments with this idea, I determined that it's typically better to prefer:
```zig
std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) })
```
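For concreteness, here is a minimal, self-contained sketch of the two variants as standalone functions. The names `foo`, `bar`, and `VEC_SIZE` are just placeholders matching the assembly labels below, and `foo` is only my guess at the shape of your original expression (I have not reproduced your exact lines here); changing `VEC_SIZE` to 16 should reproduce the second comparison further down.

```zig
const std = @import("std");

const VEC_SIZE = 8; // set to 16 for the ymm comparison shown later

// Hypothetical reconstruction of the interlace-first form ("foo"/"your_version"
// in the assembly listings): duplicate each byte, then bump every odd lane by 1.
export fn foo(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(2 * VEC_SIZE, u8) {
    const bump = comptime std.simd.repeat(2 * VEC_SIZE, @Vector(2, u8){ 0, 1 });
    return std.simd.interlace(.{ byte_idx, byte_idx }) + bump;
}

// The form I prefer ("bar"/"my_version"): add 1 first, then interlace.
export fn bar(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(2 * VEC_SIZE, u8) {
    return std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) });
}
```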
Here is the difference between the two in assembly (there should be no difference, if only the compiler could make up its mind...):
```asm
.LCPI0_0:
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
foo:
vpunpcklbw xmm0, xmm0, xmm0
vpaddw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
ret
bar:
vpcmpeqd xmm1, xmm1, xmm1; check if we are equal to ourselves, i.e. make a vector of all ones
vpsubb xmm1, xmm0, xmm1; subtract -1, i.e. add 1
vpunpcklbw xmm0, xmm0, xmm1
ret
```
This trick is even more relevant if you want to scale it up to 16-byte vectors:
```asm
.LCPI0_0:
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
your_version:
vpermq ymm0, ymm0, 216
vpunpcklbw ymm0, ymm0, ymm0
vpaddw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
ret
my_version:
vpcmpeqd xmm1, xmm1, xmm1
vpsubb xmm1, xmm0, xmm1
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
vinserti128 ymm0, ymm0, xmm2, 1
ret
```
Here is a Godbolt playground link where you can play with this.
I submitted an issue to LLVM for this: llvm/llvm-project#89858
To respond to this comment:
Lines 115 to 117 in e2447be
It's actually pretty efficient to unpack nibbles:
```zig
const std = @import("std");
const builtin = @import("builtin");

// Workaround until https://github.com/llvm/llvm-project/issues/79094 is solved.
fn expand8xu8To16xu4AsByteVector(vec: @Vector(8, u8)) @Vector(16, u8) {
return switch (comptime builtin.cpu.arch.endian()) {
.little => switch (builtin.cpu.arch) {
// Doesn't have shifts that operate at byte granularity.
// To do a shift with byte-granularity, the compiler must insert an `&` operation.
// Therefore, it's better to do a single `&` after interlacing, and get a 2-for-1.
// We need to have all these bitCasts because of https://github.com/llvm/llvm-project/issues/89600
.x86_64 => std.simd.interlace([2]@Vector(8, u8){ vec, @bitCast(@as(@Vector(4, u16), @bitCast(vec)) >> @splat(4)) }) & @as(@Vector(16, u8), @splat(0xF)),
else => std.simd.interlace(.{ vec & @as(@Vector(8, u8), @splat(0xF)), vec >> @splat(4) }),
},
.big => std.simd.interlace(.{ vec >> @splat(4), vec & @as(@Vector(8, u8), @splat(0xF)) }),
};
}
```
On x86 it's just (Godbolt link):
```asm
.LCPI0_0:
.zero 16,15
expand8xu8To16xu4AsByteVector:
vpsrlw xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
```
On ARM:
```asm
expand8xu8To16xu4AsByteVector:
movi v1.8b, #15
and v1.8b, v0.8b, v1.8b
ushr v0.8b, v0.8b, #4
zip1 v0.16b, v1.16b, v0.16b
ret
```
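To make the lane ordering concrete, here is a small sanity-check test I might pair with it (assuming it lives in the same file as the function above, so `std` and the function are already in scope; the expected ordering is the little-endian one, low nibble first in each pair):

```zig
test "expand8xu8To16xu4AsByteVector: low nibble first, then high nibble" {
    const input = @Vector(8, u8){ 0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0 };
    const expected = [16]u8{ 0x2, 0x1, 0x4, 0x3, 0x6, 0x5, 0x8, 0x7, 0xA, 0x9, 0xC, 0xB, 0xE, 0xD, 0x0, 0xF };
    const actual: [16]u8 = expand8xu8To16xu4AsByteVector(input);
    try std.testing.expectEqualSlices(u8, &expected, &actual);
}
```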
It is also my preference to generate lookup tables using comptime logic. That way, you can eliminate over 100 lines of code (actually data), and it better encodes the intent of the lookup table IMO. You also don't have to define the individual elements of your lookup tables as `u8`; you can instead make the table an array of arrays like `[128][16]u8`, and let the language manage the index multiplication for you, so you don't have to write it out manually.
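As a concrete illustration of the pattern (the table contents here are made up for the example, they are not the actual table from this PR): a comptime-generated table of "compaction indices", one 16-byte row per 7-bit mask, could look something like this:

```zig
const std = @import("std");

// Made-up example table: row `mask` lists the positions of the set bits of the
// 7-bit `mask`, padded with 0x80 (an index that several shuffle instructions
// conveniently turn into a zero byte).
const compaction_indices: [128][16]u8 = blk: {
    @setEvalBranchQuota(10_000); // the nested comptime loops approach the default quota
    var table: [128][16]u8 = undefined;
    for (&table, 0..) |*row, mask| {
        row.* = [_]u8{0x80} ** 16;
        var n: usize = 0;
        var bits = mask;
        var pos: u8 = 0;
        while (bits != 0) : (bits >>= 1) {
            if (bits & 1 != 0) {
                row[n] = pos;
                n += 1;
            }
            pos += 1;
        }
    }
    break :blk table;
};

test "compaction_indices example rows" {
    // mask 0b101 has bits 0 and 2 set; the rest of the row is the 0x80 sentinel.
    try std.testing.expectEqual(@as(u8, 0), compaction_indices[0b101][0]);
    try std.testing.expectEqual(@as(u8, 2), compaction_indices[0b101][1]);
    try std.testing.expectEqual(@as(u8, 0x80), compaction_indices[0b101][2]);
}
```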
Also, by the way, if you take my code, please include a comment somewhere giving me credit. If you took it from a place that has a license, you may need to include that license in your code too. If it's a trivial piece of code, like a single inline assembly snippet or intrinsic call, there's no need for a license or anything, but the `pext` function in particular, with its multiple fallback techniques (and at least one more to come), is definitely something I would like credit for :).
Also note that the semantics of vector shuffle instructions differ between architectures:
- x86-64: if bit 7 of the index is 1, the result byte is set to 0; otherwise the lower 4 bits are used for the lookup
- ARM: if the index is out of range (0-15), the result byte is set to 0
- PPC64: the lower 4 bits are used for the lookup (no out-of-range handling)
- MIPS: if bit 6 or bit 7 is 1, the result byte is set to 0; otherwise the lower 4 bits are used for the lookup (or rather, the lower 5 bits index into a 32-element table constructed from 2 input vectors, but if both vectors are the same, bits 4 and 5 are effectively ignored)
- RISC-V: if the index is out of range (0-15), the result byte is set to 0
This means that for certain applications, you will require an additional vector-AND operation to get the right semantics.
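As a sketch of what I mean (the names and the scalar body below are mine; the scalar loop is just a portable stand-in for whatever per-architecture shuffle you actually call, not this PR's implementation):

```zig
const std = @import("std");

// Portable scalar stand-in for a 16-byte table lookup (vpshufb / tbl / vperm /
// vshf / vrgather in practice). This stand-in models the ARM/RISC-V rule:
// an out-of-range index produces 0.
fn tableLookup16Bytes(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    const t: [16]u8 = table;
    const idx: [16]u8 = indices;
    var out: [16]u8 = undefined;
    for (&out, idx) |*o, i| o.* = if (i < t.len) t[i] else 0;
    return out;
}

// One extra vector AND on the indices gives every architecture the same plain
// "index mod 16" (PPC64-style) semantics.
fn tableLookup16BytesWrapping(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return tableLookup16Bytes(table, indices & @as(@Vector(16, u8), @splat(0xF)));
}

test "wrapping lookup treats index 0x93 the same as index 0x3" {
    const table = std.simd.iota(u8, 16) * @as(@Vector(16, u8), @splat(10));
    const indices: @Vector(16, u8) = @splat(0x93);
    const out: [16]u8 = tableLookup16BytesWrapping(table, indices);
    try std.testing.expectEqual(@as(u8, 30), out[0]);
}
```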
`tableLookup128Bytes` seems like a weird name to me. Maybe `tableLookup16Bytes` would be better?
I will give you a more in-depth review next month!