Hello @flyfish30, I am writing this at your request in the following comment: ziglang/zig#15837 (comment)

I intend to write my own version of these compaction instructions at some point, just to fully think about it and see if I can do better than what you have done here, but in the meantime I thought I should at least share my preliminary thoughts and opinions on the code presented here. I have not run your code, but I have a few thoughts based on my work on the Accelerated Zig Parser.
First of all, good job doing this work and figuring this stuff out.
I actually learned something / got an idea from looking at your code, here:

zig-basic/src/pack_select.zig, lines 293 to 295 in e2447be

In my experiments with this idea, I determined that it's typically better to prefer:
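The snippets themselves are not captured in this transcript; as a rough, hypothetical reconstruction based on the assembly below, the two formulations being compared both pair each input byte `b` with `b + 1`, one by widening to `u16` lanes and adding `0x0100`, the other by adding 1 in byte lanes and interleaving:

```zig
const std = @import("std");

// Hypothetical reconstruction (not the original source). Both functions put each
// input byte b into a u16 lane together with b + 1. `foo` interleaves the vector
// with itself, reinterprets the result as u16 lanes, and adds 0x0100; `bar` adds 1
// in byte lanes first and then interleaves.
fn foo(vec: @Vector(8, u8)) @Vector(8, u16) {
    const doubled: @Vector(16, u8) = std.simd.interlace(.{ vec, vec });
    return @as(@Vector(8, u16), @bitCast(doubled)) +% @as(@Vector(8, u16), @splat(0x0100));
}

fn bar(vec: @Vector(8, u8)) @Vector(8, u16) {
    const plus_one = vec +% @as(@Vector(8, u8), @splat(1));
    return @bitCast(std.simd.interlace(.{ vec, plus_one }));
}
```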
Here is the difference between the two in assembly (should be no difference, if the compiler could make up its mind...)
```asm
.LCPI0_0:
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
foo:
        vpunpcklbw      xmm0, xmm0, xmm0
        vpaddw          xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret
bar:
        vpcmpeqd        xmm1, xmm1, xmm1        ; check if we are equal to ourselves, i.e. make a vector of all ones
        vpsubb          xmm1, xmm0, xmm1        ; subtract -1
        vpunpcklbw      xmm0, xmm0, xmm1
        ret
```
This trick is even more relevant if you want to scale it up to 16-byte vectors. Here is a Godbolt playground link where you can play with this. I submitted an issue to LLVM for this: llvm/llvm-project#89858
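A hypothetical sketch of the scaled-up version, following the same add-one-and-interleave pattern as `bar` above (again, not the original code):

```zig
// Hypothetical 16-byte version of the same idea: pair each of the 16 input bytes
// with that byte plus one, producing sixteen u16 lanes.
fn bar16(vec: @Vector(16, u8)) @Vector(16, u16) {
    const plus_one = vec +% @as(@Vector(16, u8), @splat(1));
    return @bitCast(std.simd.interlace(.{ vec, plus_one }));
}
```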
To respond to this comment:

zig-basic/src/pack_select.zig, lines 115 to 117 in e2447be:

```zig
// broadcasts them into each 32-bit lane and shifts. Here, 16-bit lanes are too
// narrow to hold all bits, and unpacking nibbles is likely more costly than
// the higher cache footprint from storing bytes.
```
It's actually pretty efficient to unpack nibbles:
```zig
const std = @import("std");
const builtin = @import("builtin");

// Workaround until https://github.com/llvm/llvm-project/issues/79094 is solved.
fn expand8xu8To16xu4AsByteVector(vec: @Vector(8, u8)) @Vector(16, u8) {
    return switch (comptime builtin.cpu.arch.endian()) {
        .little => switch (builtin.cpu.arch) {
            // x86_64 doesn't have shifts that operate at byte granularity.
            // To do a shift with byte granularity, the compiler must insert an `&` operation.
            // Therefore, it's better to do a single `&` after interlacing, and get a 2-for-1.
            // We need all these bitCasts because of https://github.com/llvm/llvm-project/issues/89600.
            .x86_64 => std.simd.interlace([2]@Vector(8, u8){ vec, @bitCast(@as(@Vector(4, u16), @bitCast(vec)) >> @splat(4)) }) & @as(@Vector(16, u8), @splat(0xF)),
            else => std.simd.interlace(.{ vec & @as(@Vector(8, u8), @splat(0xF)), vec >> @splat(4) }),
        },
        .big => std.simd.interlace(.{ vec >> @splat(4), vec & @as(@Vector(8, u8), @splat(0xF)) }),
    };
}
```

On x86 and on ARM, this compiles to just a handful of instructions (see the Godbolt link).
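A quick sanity check (a hypothetical test, not part of the original code) shows what the expansion produces on a little-endian target: each input byte becomes two bytes holding its low nibble and then its high nibble:

```zig
test "expand8xu8To16xu4AsByteVector splits each byte into two nibble bytes" {
    // This expectation assumes a little-endian target (low nibble first, then high nibble).
    if (comptime builtin.cpu.arch.endian() != .little) return error.SkipZigTest;
    const input: @Vector(8, u8) = .{ 0xAB, 0xCD, 0x01, 0x23, 0x45, 0x67, 0x89, 0xEF };
    const actual: [16]u8 = expand8xu8To16xu4AsByteVector(input);
    const expected = [16]u8{
        0xB, 0xA, 0xD, 0xC, 0x1, 0x0, 0x3, 0x2,
        0x5, 0x4, 0x7, 0x6, 0x9, 0x8, 0xF, 0xE,
    };
    try std.testing.expectEqualSlices(u8, &expected, &actual);
}
```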
It is also my preference to generate lookup tables using comptime logic. That way, you can eliminate over 100 lines of code (actually data), and it better encodes the intent of the lookup table, IMO. You also don't have to define the individual elements of your lookup tables as `u8`: you can instead make the table an array of arrays, like `[128][16]u8`, and let the language manage the index multiplication for you, so you don't have to write it out manually.
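For illustration, a minimal sketch of comptime table generation. The layout is an assumption (a hypothetical `[128][16]u8` table mapping a 7-bit selection mask to shuffle indices, with `0x80` as an assumed "zero this lane" filler); the point is only that the data falls out of a loop instead of being written out by hand:

```zig
const lookup_table: [128][16]u8 = blk: {
    @setEvalBranchQuota(10_000);
    var table: [128][16]u8 = undefined;
    for (&table, 0..) |*row, mask| {
        // Assumed filler: under pshufb-style semantics, an index with bit 7 set zeroes the lane.
        row.* = [_]u8{0x80} ** 16;
        var n: usize = 0;
        for (0..7) |bit| {
            // Pack the positions of the set bits of `mask` to the front of the row.
            if (mask & (1 << bit) != 0) {
                row[n] = @intCast(bit);
                n += 1;
            }
        }
    }
    break :blk table;
};
```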
Also, btw, if you take my code, please include a comment somewhere giving me credit. If you took it from a place that has a license, you may need to include that in your code too. If it's a trivial piece of code, like just a piece of inline assembly or an intrinsic call, there's no need for a license or anything, but the `pext` function in particular, with its multiple fallback techniques (and at least one more to come), is definitely something that I would like credit for :).
Also note that the semantics of vector shuffle instructions differ between architectures:
- x86-64: if bit 7 is 1, set to 0; otherwise use the lower 4 bits for the lookup
- ARM: if the index is out of range (0-15), set to 0
- PPC64: use the lower 4 bits for the lookup (no out-of-range handling)
- MIPS: if bit 6 or bit 7 is 1, set to 0; otherwise use the lower 4 bits for the lookup (or rather, use the lower 5 bits for a lookup into a 32-element table constructed from 2 input vectors, but if both vectors are the same, bits 4 and 5 are effectively ignored)
- RISC-V: if the index is out of range (0-15), set to 0
This means that for certain applications, you will require an additional vector-AND operation to get the right semantics.
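For example, here is a small hypothetical sketch (not from the original code) of the kind of index pre-processing this implies. A saturating add makes x86-64's pshufb behave like ARM/RISC-V ("index >= 16 gives 0") by forcing bit 7 on for any out-of-range index, while an explicit AND reproduces the PPC64-style wraparound on targets that would otherwise zero the lane:

```zig
// Hypothetical helpers for normalizing shuffle-index semantics across architectures.

// For x86-64 pshufb: any index >= 16 saturates to >= 0x80, so bit 7 ends up set and the
// lane is zeroed; indices < 16 keep their low 4 bits intact (0x70 has a zero low nibble).
fn zeroOutOfRangeIndicesForX86(idx: @Vector(16, u8)) @Vector(16, u8) {
    return idx +| @as(@Vector(16, u8), @splat(0x70));
}

// For "lower 4 bits only" (wraparound) semantics on targets that would otherwise
// zero out-of-range lanes, mask the indices explicitly with a vector AND.
fn wrapIndicesToLow4Bits(idx: @Vector(16, u8)) @Vector(16, u8) {
    return idx & @as(@Vector(16, u8), @splat(0xF));
}
```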
`tableLookup128Bytes` seems like a weird name to me. Maybe `tableLookup16Bytes` would be better?
I will give you a more in-depth review next month!
Sorry, @Validark, I haven't logged in to GitHub for a while, and I didn't see your reply until yesterday.
Thank you very much for reviewing my code. I read your comments and made some corresponding changes. There are two main changes. The first is to use the implementation you suggested when calculating the index of `table16x8`:

> In my experiments with this idea, I determined that it's typically better to prefer: […]
The second is to follow your suggestion and use comptime logic to generate lookup tables.
I have added a credit comment and the license text for your `pext` function source code.
By the way, I've renamed the function `tableLookup128Bytes` to `tableLookup16Bytes`; the original name was a small mistake.