Suboptimal codegen for mask8x32 all #316
Comments
There are a lot of missed optimizations with masks because LLVM defines them as a single bit and apparently doesn't adapt to the way x86, at least, handles them.
This seems to be an artifact of the LLVM IR semantics. This function

    define i16 @test_trunc_and_bit_cast_i16(<16 x i8>* %pa) {
      %a = load <16 x i8>, <16 x i8>* %pa, align 32
      %b = trunc <16 x i8> %a to <16 x i1>
      %r = bitcast <16 x i1> %b to i16
      ret i16 %r
    }

compiles to

    test_trunc_and_bit_cast_i16:            # @test_trunc_and_bit_cast_i16
            vmovdqa   xmm0, xmmword ptr [rdi]
            vpsllw    xmm0, xmm0, 7
            vpmovmskb eax, xmm0
            ret

So truncating means using the lowest bit of each mask lane, and the shift then moves that bit into the highest bit, which is the one `vpmovmskb` reads.

It seems that inserting an arithmetic shift to duplicate the high bit into all positions makes LLVM optimize this away:

    define i16 @test_trunc_and_bit_cast_i16_shr(<16 x i8>* %pa) {
      %a = load <16 x i8>, <16 x i8>* %pa, align 32
      %b = ashr <16 x i8> %a, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
      %c = trunc <16 x i8> %b to <16 x i1>
      %r = bitcast <16 x i1> %c to i16
      ret i16 %r
    }

    test_trunc_and_bit_cast_i16_shr:        # @test_trunc_and_bit_cast_i16_shr
            vmovdqa   xmm0, xmmword ptr [rdi]
            vpmovmskb eax, xmm0
            ret

Such a shift is done for …
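To make the bit bookkeeping in the two IR snippets concrete, here is a minimal scalar sketch in stable Rust. It is only a model, not what rustc or LLVM actually run: the function names (`bitmask_from_low_bits`, `movmskb`, `shl7`, `ashr7`) are hypothetical, and `shl7` shifts each byte separately, which matches `vpsllw` by 7 only for the top bit of each byte (the one bit `vpmovmskb` looks at).

```rust
/// Models `trunc <16 x i8> to <16 x i1>` followed by `bitcast ... to i16`:
/// bit `i` of the result is the LEAST significant bit of lane `i`.
fn bitmask_from_low_bits(lanes: [i8; 16]) -> u16 {
    let mut out = 0u16;
    for (i, lane) in lanes.iter().enumerate() {
        out |= (((*lane as u8) & 1) as u16) << i;
    }
    out
}

/// Models `vpmovmskb`: bit `i` of the result is the MOST significant
/// (sign) bit of lane `i`.
fn movmskb(lanes: [i8; 16]) -> u16 {
    let mut out = 0u16;
    for (i, lane) in lanes.iter().enumerate() {
        out |= (((*lane as u8) >> 7) as u16) << i;
    }
    out
}

/// Per-byte left shift by 7: moves each lane's low bit into its sign bit.
/// (Assumption: only the resulting sign bit matters, as with `vpmovmskb`.)
fn shl7(lanes: [i8; 16]) -> [i8; 16] {
    lanes.map(|l| ((l as u8) << 7) as i8)
}

/// Models `ashr <16 x i8> %a, 7`: copies each lane's sign bit into all
/// eight bits, so every lane becomes 0 or -1.
fn ashr7(lanes: [i8; 16]) -> [i8; 16] {
    lanes.map(|l| l >> 7)
}
```

In this model, `movmskb(shl7(a))` equals `bitmask_from_low_bits(a)` for any input, which is the shift the first assembly listing contains, and `bitmask_from_low_bits(ashr7(a))` equals `movmskb(a)`, which is why the second listing needs no shift at all.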
imho the real reason masks constantly have codegen issues is that LLVM has no annotation (or something similar) telling it that a vector is really a mask vector, where element values are either 0 or -1 and never anything else.
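A small scalar sketch of that invariant, in stable Rust, applied to the `all` reduction from the issue title. The helper names are hypothetical and the 32 `i8` lanes merely stand in for a `mask8x32`. On well-formed masks the low-bit and sign-bit views agree, so the extra shift is redundant, but without knowing the invariant a compiler has to assume inputs on which they differ.

```rust
/// A well-formed mask has only 0 (false) or -1 (true) in every lane.
fn is_valid_mask(lanes: &[i8; 32]) -> bool {
    lanes.iter().all(|&l| l == 0 || l == -1)
}

/// `all` as seen through the truncate-to-i1 lowering: tests the low bit.
fn all_low_bit(lanes: &[i8; 32]) -> bool {
    lanes.iter().all(|&l| (l & 1) == 1)
}

/// `all` as the SSE/AVX mask instructions see it: tests the sign bit.
fn all_sign_bit(lanes: &[i8; 32]) -> bool {
    lanes.iter().all(|&l| l < 0)
}
```

For example, a vector of all `1` lanes is not a valid mask and is "all true" under the low-bit view but "all false" under the sign-bit view; that is the kind of input the inserted shift guards against.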
I wonder if that would possibly fix our various reduction issues on arm/aarch64...
…select, r=workingjubilee

Consistently use the highest bit of vector masks when converting to i1 vectors

This improves the codegen for vector `select`, `gather`, `scatter` and boolean reduction intrinsics and fixes rust-lang/portable-simd#316.

The current behavior of most mask operations during LLVM codegen is to truncate the mask vector to <N x i1>, telling LLVM to use the least significant bit. The exception is the `simd_bitmask` intrinsics, which already used the most significant bit.

Since SSE/AVX instructions are defined to use the most significant bit, truncating means that LLVM has to insert a left shift to move the bit into the most significant position before the mask can actually be used.

Similarly on aarch64, mask operations like blend work bit by bit, so repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero.

By shifting before truncating to <N x i1>, we tell LLVM that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.
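As a rough illustration of "shift, then truncate", the following stable-Rust sketch models a `select` whose mask lanes are reduced to a single bit by looking only at the most significant bit. It is a scalar model under assumptions: `lane_to_i1_msb` and `select_msb` are hypothetical names, and a logical right shift is used here; which shift the compiler actually emits is not stated in the commit message.

```rust
/// Reduces a 32-bit mask lane to one bit using only its most significant
/// bit: shift right by (bit width - 1), then keep the remaining low bit.
fn lane_to_i1_msb(lane: i32) -> bool {
    (((lane as u32) >> 31) & 1) == 1
}

/// Scalar model of a lane-wise select: picks `if_true[i]` where the mask
/// lane's most significant bit is set, otherwise `if_false[i]`.
fn select_msb<const N: usize>(
    mask: [i32; N],
    if_true: [i32; N],
    if_false: [i32; N],
) -> [i32; N] {
    std::array::from_fn(|i| {
        if lane_to_i1_msb(mask[i]) {
            if_true[i]
        } else {
            if_false[i]
        }
    })
}
```

With a well-formed mask such as `[-1, 0, -1, 0]`, this picks the first and third lanes from `if_true` and the others from `if_false`, the same answer the low-bit interpretation would give, while matching the bit that the SSE/AVX blend and movemask instructions read.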
https://rust.godbolt.org/z/b4cdaqTcP
-C opt-level=3 --edition 2021 --target x86_64-unknown-linux-gnu -C target-feature=+sse2
`mask8x32_all_v1` generates an extra `psllw` instruction, which is unnecessary.

Same bug for wasm32 simd128: https://rust.godbolt.org/z/7r1fKhsM9