Description
Hello @flyfish30, I am writing this at your request in the following comment: ziglang/zig#15837 (comment)
I intend to write my own version of these compaction instructions at some point, just to think it through fully and see if I can do better than what you have done here, but in the meantime I thought I should at least share my preliminary thoughts and opinions on the code presented here. I have not run your code, but I have a few observations based on my work on the Accelerated Zig Parser.
First of all, good job doing this work and figuring this stuff out.
I actually did learn something/get an idea from looking at your code, here:
Lines 293 to 295 in e2447be
In my experiments with this idea, I determined that it's typically better to prefer:
```zig
std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) })
```
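For concreteness, here is a minimal, self-contained sketch of the two variants as standalone functions. The names `foo`, `bar`, and `VEC_SIZE` are just placeholders matching the assembly labels below, and `foo` is only my guess at the shape of your original expression (I have not reproduced your exact lines here); changing `VEC_SIZE` to 16 should reproduce the second comparison further down.

```zig
const std = @import("std");

const VEC_SIZE = 8; // set to 16 for the ymm comparison shown later

// Hypothetical reconstruction of the interlace-first form ("foo"/"your_version"
// in the assembly listings): duplicate each byte, then bump every odd lane by 1.
export fn foo(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(2 * VEC_SIZE, u8) {
    const bump = comptime std.simd.repeat(2 * VEC_SIZE, @Vector(2, u8){ 0, 1 });
    return std.simd.interlace(.{ byte_idx, byte_idx }) + bump;
}

// The form I prefer ("bar"/"my_version"): add 1 first, then interlace.
export fn bar(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(2 * VEC_SIZE, u8) {
    return std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) });
}
```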
Here is the difference between the two in assembly (there should be no difference, if only the compiler could make up its mind...):
```asm
.LCPI0_0:
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
foo:
vpunpcklbw xmm0, xmm0, xmm0
vpaddw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
ret
bar:
vpcmpeqd xmm1, xmm1, xmm1; check if we are equal to ourselves, i.e. make a vector of all ones
vpsubb xmm1, xmm0, xmm1; subtract -1, i.e. add 1
vpunpcklbw xmm0, xmm0, xmm1
ret
```
This trick is even more relevant if you want to scale it up to 16-byte vectors:
```asm
.LCPI0_0:
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
.short 256
your_version:
vpermq ymm0, ymm0, 216
vpunpcklbw ymm0, ymm0, ymm0
vpaddw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
ret
my_version:
vpcmpeqd xmm1, xmm1, xmm1
vpsubb xmm1, xmm0, xmm1
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
vinserti128 ymm0, ymm0, xmm2, 1
ret
```
Here is a Godbolt playground link where you can play with this.
I submitted an issue to LLVM for this: llvm/llvm-project#89858
To respond to this comment:
Lines 115 to 117 in e2447be
It's actually pretty efficient to unpack nibbles:
```zig
const std = @import("std");
const builtin = @import("builtin");

// Workaround until https://github.com/llvm/llvm-project/issues/79094 is solved.
fn expand8xu8To16xu4AsByteVector(vec: @Vector(8, u8)) @Vector(16, u8) {
return switch (comptime builtin.cpu.arch.endian()) {
.little => switch (builtin.cpu.arch) {
// Doesn't have shifts that operate at byte granularity.
// To do a shift with byte-granularity, the compiler must insert an `&` operation.
// Therefore, it's better to do a single `&` after interlacing, and get a 2-for-1.
// We need to have all these bitCasts because of https://github.com/llvm/llvm-project/issues/89600
.x86_64 => std.simd.interlace([2]@Vector(8, u8){ vec, @bitCast(@as(@Vector(4, u16), @bitCast(vec)) >> @splat(4)) }) & @as(@Vector(16, u8), @splat(0xF)),
else => std.simd.interlace(.{ vec & @as(@Vector(8, u8), @splat(0xF)), vec >> @splat(4) }),
},
.big => std.simd.interlace(.{ vec >> @splat(4), vec & @as(@Vector(8, u8), @splat(0xF)) }),
};
}
```
On x86 it's just (Godbolt link):
```asm
.LCPI0_0:
.zero 16,15
expand8xu8To16xu4AsByteVector:
vpsrlw xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
```
On ARM:
```asm
expand8xu8To16xu4AsByteVector:
movi v1.8b, #15
and v1.8b, v0.8b, v1.8b
ushr v0.8b, v0.8b, #4
zip1 v0.16b, v1.16b, v0.16b
ret
```
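To make the lane ordering concrete, here is a small sanity-check test I might pair with it (assuming it lives in the same file as the function above, so `std` and the function are already in scope; the expected ordering is the little-endian one, low nibble first in each pair):

```zig
test "expand8xu8To16xu4AsByteVector: low nibble first, then high nibble" {
    const input = @Vector(8, u8){ 0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0 };
    const expected = [16]u8{ 0x2, 0x1, 0x4, 0x3, 0x6, 0x5, 0x8, 0x7, 0xA, 0x9, 0xC, 0xB, 0xE, 0xD, 0x0, 0xF };
    const actual: [16]u8 = expand8xu8To16xu4AsByteVector(input);
    try std.testing.expectEqualSlices(u8, &expected, &actual);
}
```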
It is also my preference to generate lookup tables using comptime logic. That way, you can eliminate over 100 lines of code (actually data), and it better encodes the intent of the lookup table IMO. You also don't have to define the individual elements of your lookup tables as `u8`; you can instead make the table an array of arrays like `[128][16]u8`, and let the language manage the index multiplication for you, so you don't have to write it out manually.
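As a concrete illustration of the pattern (the table contents here are made up for the example, they are not the actual table from this PR): a comptime-generated table of "compaction indices", one 16-byte row per 7-bit mask, could look something like this:

```zig
const std = @import("std");

// Made-up example table: row `mask` lists the positions of the set bits of the
// 7-bit `mask`, padded with 0x80 (an index that several shuffle instructions
// conveniently turn into a zero byte).
const compaction_indices: [128][16]u8 = blk: {
    @setEvalBranchQuota(10_000); // the nested comptime loops approach the default quota
    var table: [128][16]u8 = undefined;
    for (&table, 0..) |*row, mask| {
        row.* = [_]u8{0x80} ** 16;
        var n: usize = 0;
        var bits = mask;
        var pos: u8 = 0;
        while (bits != 0) : (bits >>= 1) {
            if (bits & 1 != 0) {
                row[n] = pos;
                n += 1;
            }
            pos += 1;
        }
    }
    break :blk table;
};

test "compaction_indices example rows" {
    // mask 0b101 has bits 0 and 2 set; the rest of the row is the 0x80 sentinel.
    try std.testing.expectEqual(@as(u8, 0), compaction_indices[0b101][0]);
    try std.testing.expectEqual(@as(u8, 2), compaction_indices[0b101][1]);
    try std.testing.expectEqual(@as(u8, 0x80), compaction_indices[0b101][2]);
}
```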
Also, by the way, if you take my code, please include a comment somewhere giving me credit. If you took it from a place that has a license, you may need to include that license in your code too. If it's a trivial piece of code, like a single inline assembly snippet or intrinsic call, there's no need for a license or anything, but the `pext` function in particular, with its multiple fallback techniques (and at least one more to come), is definitely something I would like credit for :).
Also note that the semantics of vector shuffle instructions differ between architectures:
- x86-64: if bit 7 of the index is 1, the result byte is set to 0; otherwise the lower 4 bits are used for the lookup
- ARM: if the index is out of range (0-15), the result byte is set to 0
- PPC64: the lower 4 bits are used for the lookup (no out-of-range handling)
- MIPS: if bit 6 or bit 7 is 1, the result byte is set to 0; otherwise the lower 4 bits are used for the lookup (or rather, the lower 5 bits index into a 32-element table constructed from 2 input vectors, but if both vectors are the same, bits 4 and 5 are effectively ignored)
- RISC-V: if the index is out of range (0-15), the result byte is set to 0
This means that for certain applications, you will require an additional vector-AND operation to get the right semantics.
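As a sketch of what I mean (the names and the scalar body below are mine; the scalar loop is just a portable stand-in for whatever per-architecture shuffle you actually call, not this PR's implementation):

```zig
const std = @import("std");

// Portable scalar stand-in for a 16-byte table lookup (vpshufb / tbl / vperm /
// vshf / vrgather in practice). This stand-in models the ARM/RISC-V rule:
// an out-of-range index produces 0.
fn tableLookup16Bytes(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    const t: [16]u8 = table;
    const idx: [16]u8 = indices;
    var out: [16]u8 = undefined;
    for (&out, idx) |*o, i| o.* = if (i < t.len) t[i] else 0;
    return out;
}

// One extra vector AND on the indices gives every architecture the same plain
// "index mod 16" (PPC64-style) semantics.
fn tableLookup16BytesWrapping(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return tableLookup16Bytes(table, indices & @as(@Vector(16, u8), @splat(0xF)));
}

test "wrapping lookup treats index 0x93 the same as index 0x3" {
    const table = std.simd.iota(u8, 16) * @as(@Vector(16, u8), @splat(10));
    const indices: @Vector(16, u8) = @splat(0x93);
    const out: [16]u8 = tableLookup16BytesWrapping(table, indices);
    try std.testing.expectEqual(@as(u8, 30), out[0]);
}
```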
`tableLookup128Bytes` seems like a weird name to me. Maybe `tableLookup16Bytes` would be better?
I will give you a more in-depth review next month!