Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal codegen for mask8x32 all #316

Open
Nugine opened this issue Nov 18, 2022 · 4 comments · Fixed by rust-lang/rust#104693
Open

Suboptimal codegen for mask8x32 all #316

Nugine opened this issue Nov 18, 2022 · 4 comments · Fixed by rust-lang/rust#104693

Comments

@Nugine
Copy link

Nugine commented Nov 18, 2022

https://rust.godbolt.org/z/b4cdaqTcP

pub fn mask8x32_all_v1(m: mask8x32) -> bool {
    m.all()
}

pub unsafe fn mask8x32_all_v2(m: mask8x32) -> bool {
    let [a, b]: [__m128i; 2] = transmute(m);
    _mm_movemask_epi8(_mm_and_si128(a, b)) as i16 == -1
}

-C opt-level=3 --edition 2021 --target x86_64-unknown-linux-gnu -C target-feature=+sse2

example::mask8x32_all_v1:
        movdqa  xmm0, xmmword ptr [rdi + 16]
        pand    xmm0, xmmword ptr [rdi]
        psllw   xmm0, 7
        pmovmskb        eax, xmm0
        cmp     ax, -1
        sete    al
        ret

example::mask8x32_all_v2:
        movdqa  xmm0, xmmword ptr [rdi + 16]
        pand    xmm0, xmmword ptr [rdi]
        pmovmskb        eax, xmm0
        cmp     eax, 65535
        sete    al
        ret

mask8x32_all_v1 generates an extra psllw instruction, which is unnecessary.

Same bug for wasm32 simd128: https://rust.godbolt.org/z/7r1fKhsM9

@TheIronBorn
Copy link

there are a lot of missed optimizations with masks because LLVM defines them as a single bit and apparently doesn't adapt for the way x86 at least handles them

@jhorstmann
Copy link

This seems to be an artifact of the llvm trunc instruction

define i16 @test_trunc_and_bit_cast_i16(<16 x i8>* %pa) {
  %a = load <16 x i8>, <16 x i8>* %pa, align 32
  %b = trunc <16 x i8> %a to <16 x i1>
  %r = bitcast <16 x i1> %b to i16
  ret i16 %r
}

compiles to

test_trunc_and_bit_cast_i16: # @test_trunc_and_bit_cast_i16
  vmovdqa xmm0, xmmword ptr [rdi]
  vpsllw xmm0, xmm0, 7
  vpmovmskb eax, xmm0
  ret

So truncate means to use the low bits of the mask, the shift then moves that bit into the highes bit for use with movmask.

It seems inserting an arithmetic shift to duplicate the high bits into all positions makes llvm optimize this away.

define i16 @test_trunc_and_bit_cast_i16_shr(<16 x i8>* %pa) {
  %a = load <16 x i8>, <16 x i8>* %pa, align 32
  %b = ashr <16 x i8> %a, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
  %c = trunc <16 x i8> %b to <16 x i1>
  %r = bitcast <16 x i1> %c to i16
  ret i16 %r
}
test_trunc_and_bit_cast_i16_shr: # @test_trunc_and_bit_cast_i16_shr
  vmovdqa xmm0, xmmword ptr [rdi]
  vpmovmskb eax, xmm0
  ret

Such a shift is done for simd_bitmask but might be missing for bitwise_red

@programmerjake
Copy link
Member

imho the real reason masks constantly have codegen issues is llvm has no annotation or something so it knows a vector is really a mask vector where element values are either 0 or -1 and never anything else.

@calebzulawski
Copy link
Member

I wonder if that would possibly fix our various reduction issues on arm/aarch64...

jhorstmann added a commit to jhorstmann/rust that referenced this issue Mar 13, 2024
This improves the codegen for vector `select`, `gather`, `scatter` and
boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of mask operations during llvm codegen is to
truncate the mask vector to <N x i1>, telling llvm to use the least
significat bit.

Since sse/avx instructions are defined to use the most significant bit,
llvm has to insert a left shift before the mask can actually be used.

Similarly on aarch64, mask operations like blend work bit by bit,
repeating the least significant bit across the whole lane involves
shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell llvm that we only
consider the most significant bit, removing the need for additional
shift instructions in the assembly.
bors pushed a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2025
This improves the codegen for vector `select`, `gather`, `scatter` and
boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of most mask operations during llvm codegen is to
truncate the mask vector to <N x i1>, telling llvm to use the least
significat bit. The exception is the `simd_bitmask` intrinsics, which
already used the most signifiant bit.

Since sse/avx instructions are defined to use the most significant bit,
truncating means that llvm has to insert a left shift to move the bit
into the most significant position, before the mask can actually be
used.

Similarly on aarch64, mask operations like blend work bit by bit,
repeating the least significant bit across the whole lane involves
shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell llvm that we only
consider the most significant bit, removing the need for additional
shift instructions in the assembly.
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2025
…select, r=workingjubilee

Consistently use the highest bit of vector masks when converting to i1 vectors

This improves the codegen for vector `select`, `gather`, `scatter` and boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of most mask operations during llvm codegen is to truncate the mask vector to <N x i1>, telling llvm to use the least significat bit. The exception is the `simd_bitmask` intrinsics, which already used the most signifiant bit.

Since sse/avx instructions are defined to use the most significant bit, truncating means that llvm has to insert a left shift to move the bit into the most significant position, before the mask can actually be used.

Similarly on aarch64, mask operations like blend work bit by bit, repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell llvm that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants