My Review of your codebase #1

Validark opened this issue Apr 24, 2024 · 1 comment

Hello @flyfish30, I am writing this at your request in the following comment: ziglang/zig#15837 (comment)

I intend to write my own version of these compaction instructions at some point, just to think it through fully and see if I can do better than what you have done here, but in the meantime I thought I should at least share my preliminary thoughts and opinions on the code presented here. I have not run your code, but I have a few thoughts based on my work on the Accelerated Zig Parser.

First of all, good job doing this work and figuring this stuff out.

I actually did learn something/get an idea from looking at your code, here:

const byte_idx: @Vector(8, u8) = table16x8[mask_bits * 8 ..][0..8].*;
const pairs: @Vector(8, u16) = @bitCast(std.simd.interlace([_]@Vector(8, u8){ byte_idx, byte_idx }));
return @bitCast(pairs + @as(@Vector(8, u16), @splat(0x100)));

In my experiments with this idea, I determined that it's typically better to prefer:

std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) })

Here is the difference between the two in assembly (there should be no difference, if only the compiler could make up its mind...):

.LCPI0_0:
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
foo:
        vpunpcklbw      xmm0, xmm0, xmm0
        vpaddw  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

bar:
        vpcmpeqd        xmm1, xmm1, xmm1; check if we are equal to ourselves, i.e. make a vector of all ones
        vpsubb  xmm1, xmm0, xmm1; subtract -1
        vpunpcklbw      xmm0, xmm0, xmm1
        ret
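
For reference, here is roughly the Zig behind those two labels (a sketch; my assumption is that foo is the original interlace-then-add-0x100 form and bar is the interlace-with-+1 form):

const std = @import("std");

// foo: interlace the byte indices with themselves, then add 0x100 to each u16 pair.
fn foo(byte_idx: @Vector(8, u8)) @Vector(16, u8) {
    const pairs: @Vector(8, u16) = @bitCast(std.simd.interlace([_]@Vector(8, u8){ byte_idx, byte_idx }));
    return @bitCast(pairs + @as(@Vector(8, u16), @splat(0x100)));
}

// bar: interlace the byte indices with themselves-plus-one directly.
fn bar(byte_idx: @Vector(8, u8)) @Vector(16, u8) {
    return std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(8, u8), @splat(1)) });
}

Note that bar builds its +1 constant with vpcmpeqd/vpsubb instead of loading a 16-byte constant from memory.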

This trick is even more relevant if you want to scale it up to 16-byte vectors:

.LCPI0_0:
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
your_version:
        vpermq  ymm0, ymm0, 216
        vpunpcklbw      ymm0, ymm0, ymm0
        vpaddw  ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret

my_version:
        vpcmpeqd        xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm0, xmm1
        vpunpckhbw      xmm2, xmm0, xmm1
        vpunpcklbw      xmm0, xmm0, xmm1
        vinserti128     ymm0, ymm0, xmm2, 1
        ret

Here is a Godbolt playground link where you can play with this.

I submitted an issue to LLVM for this: llvm/llvm-project#89858

To respond to this comment:

// broadcasts them into each 32-bit lane and shifts. Here, 16-bit lanes are too
// narrow to hold all bits, and unpacking nibbles is likely more costly than
// the higher cache footprint from storing bytes.

It's actually pretty efficient to unpack nibbles:

const std = @import("std");
const builtin = @import("builtin");

// Workaround until https://github.com/llvm/llvm-project/issues/79094 is solved.
fn expand8xu8To16xu4AsByteVector(vec: @Vector(8, u8)) @Vector(16, u8) {
    return switch (comptime builtin.cpu.arch.endian()) {
        .little => switch (builtin.cpu.arch) {
            // Doesn't have shifts that operate at byte granularity.
            // To do a shift with byte-granularity, the compiler must insert an `&` operation.
            // Therefore, it's better to do a single `&` after interlacing, and get a 2-for-1.
            // We need to have all these bitCasts because of https://github.com/llvm/llvm-project/issues/89600
            .x86_64 => std.simd.interlace([2]@Vector(8, u8){ vec, @bitCast(@as(@Vector(4, u16), @bitCast(vec)) >> @splat(4)) }) & @as(@Vector(16, u8), @splat(0xF)),

            else => std.simd.interlace(.{ vec & @as(@Vector(8, u8), @splat(0xF)), vec >> @splat(4) }),
        },
        .big => std.simd.interlace(.{ vec >> @splat(4), vec & @as(@Vector(8, u8), @splat(0xF)) }),
    };
}

On x86 it's just (Godbolt link):

.LCPI0_0:
        .zero   16,15
expand8xu8To16xu4AsByteVector:
        vpsrlw  xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]

On ARM:

expand8xu8To16xu4AsByteVector:
        movi    v1.8b, #15
        and     v1.8b, v0.8b, v1.8b
        ushr    v0.8b, v0.8b, #4
        zip1    v0.16b, v1.16b, v0.16b
        ret
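
As a quick sanity check of the lane ordering (a sketch, assuming a little-endian target and the imports shown above):

test "expand 0xAB into its two nibbles" {
    const input: @Vector(8, u8) = .{ 0xAB, 0, 0, 0, 0, 0, 0, 0 };
    const out = expand8xu8To16xu4AsByteVector(input);
    // Little-endian: the low nibble lands in the even lane, the high nibble in the odd lane.
    try std.testing.expectEqual(@as(u8, 0x0B), out[0]);
    try std.testing.expectEqual(@as(u8, 0x0A), out[1]);
}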

It is also my preference to generate lookup tables using comptime logic. That way, you can eliminate over 100 lines of code (actually data), and it better encodes the intent of the lookup table, IMO. You also don't have to define the individual elements of your lookup tables as a u8: you can make it an array of arrays like [128][16]u8 instead, and the language manages the multiplication for you, so you don't have to write it out manually.
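
As a generic example (a sketch, not necessarily the exact contents of table16x8), a mask-to-set-bit-positions table can be built at comptime like this instead of being written out by hand:

// Hypothetical comptime-generated table: for each 8-bit mask, store the
// positions of its set bits (unused slots stay 0).
const compact_table: [256][8]u8 = blk: {
    @setEvalBranchQuota(10_000);
    var table = [_][8]u8{[_]u8{0} ** 8} ** 256;
    for (&table, 0..) |*entry, mask_usize| {
        const mask: u8 = @intCast(mask_usize);
        var n: usize = 0;
        for (0..8) |bit| {
            if (((mask >> @as(u3, @intCast(bit))) & 1) == 1) {
                entry[n] = @intCast(bit);
                n += 1;
            }
        }
    }
    break :blk table;
};

With an array-of-arrays layout like this, an access such as table16x8[mask_bits * 8 ..][0..8].* becomes compact_table[mask_bits], and the manual * 8 scaling disappears.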

Also, by the way, if you take my code, please include a comment somewhere giving me credit. If you took it from a place that has a license, you may need to include that license in your code too. If it's a trivial piece of code, like just a piece of inline assembly or an intrinsic call, there's no need for a license or anything, but the pext function in particular, with its multiple fallback techniques (and at least one more to come), is definitely something I would like credit for :).

Also note that the semantics of vector shuffle instructions differ between architectures:

  • x86-64: if bit 7 is 1, set the result byte to 0; otherwise use the lower 4 bits for the lookup
  • ARM: if the index is out of range (0-15), set the result byte to 0
  • PPC64: use the lower 4 bits for the lookup (no out-of-range handling)
  • MIPS: if bit 6 or bit 7 is 1, set the result byte to 0; otherwise use the lower 4 bits for the lookup (strictly, the lower 5 bits index a 32-element table built from the two input vectors, but when both vectors are the same, bits 4-5 are effectively ignored)
  • RISC-V: if the index is out of range (0-15), set the result byte to 0

This means that for certain applications, you will require an additional vector-AND operation to get the right semantics.
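
For example, to get "use the low 4 bits" behavior on every target, a sketch of that extra AND might look like this (normalizeShuffleIndices is a hypothetical helper, not something in your repo):

// Hypothetical helper: mask each index to its low 4 bits so every
// architecture's 16-byte shuffle sees an in-range index and behaves the same.
fn normalizeShuffleIndices(idx: @Vector(16, u8)) @Vector(16, u8) {
    return idx & @as(@Vector(16, u8), @splat(0x0F));
}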

tableLookup128Bytes seems like a weird name to me. Maybe tableLookup16Bytes would be better?

I will give you a more in-depth review next month!

@flyfish30 (Owner) commented May 11, 2024

Sorry, @Validark, I haven't logged in to GitHub for a while, and I didn't see your reply until yesterday.
Thank you very much for reviewing my code. I read your comments and made some corresponding changes. There are two main changes. The first is to use your following implementation when calculating the index of table16x8:

In my experiments with this idea, I determined that it's typically better to prefer:

std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) })

The second is to follow your suggestion and use comptime logic to generate the lookup tables.
I have also added reference remarks and license text for your pext function source code.

By the way, I've renamed the function tableLookup128Bytes to tableLookup16Bytes; the original name was a small mistake.
