-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently use the highest bit of vector masks when converting to i1 vectors #104693
base: master
Are you sure you want to change the base?
Consistently use the highest bit of vector masks when converting to i1 vectors #104693
Conversation
r? @davidtwco (rustbot has picked a reviewer for you, use r? to override) |
Does this affect codegen for non-x86 platforms? |
You may want to study the other simd intrinsic tests and also the assembly or codegen tests. |
Seems to have a very similar effect for target pub fn select(m: mask32x4, a: i32x4, b: i32x4) -> i32x4 {
m.select(a, b)
} Before portable_simd_test::select:
ldr q1, [x0]
ldr q0, [x1]
ldr q2, [x2]
shl v1.4s, v1.4s, #31
cmlt v1.4s, v1.4s, #0
bif v0.16b, v2.16b, v1.16b
str q0, [x8]
ret After portable_simd_test::select:
ldr q0, [x0]
ldr q1, [x1]
ldr q2, [x2]
cmlt v0.4s, v0.4s, #0
bsl v0.16b, v1.16b, v2.16b
str q0, [x8]
ret Interestingly there is a big effect on pub unsafe fn mask_all(m: mask8x16) -> bool {
m.all()
} Before portable_simd_test::mask_all:
.cfi_startproc
sub sp, sp, #16
.cfi_def_cfa_offset 16
ldr q0, [x0]
mov w8, #65535
umov w9, v0.b[1]
umov w11, v0.b[2]
umov w10, v0.b[0]
umov w12, v0.b[3]
umov w13, v0.b[4]
umov w14, v0.b[5]
and w9, w9, #0x1
and w11, w11, #0x1
and w10, w10, #0x1
and w12, w12, #0x1
and w13, w13, #0x1
and w14, w14, #0x1
bfi w10, w9, #1, #1
umov w9, v0.b[6]
bfi w10, w11, #2, #1
umov w11, v0.b[7]
bfi w10, w12, #3, #1
umov w12, v0.b[8]
bfi w10, w13, #4, #1
umov w13, v0.b[9]
and w9, w9, #0x1
bfi w10, w14, #5, #1
umov w14, v0.b[10]
and w11, w11, #0x1
orr w9, w10, w9, lsl #6
umov w10, v0.b[11]
and w12, w12, #0x1
orr w9, w9, w11, lsl #7
umov w11, v0.b[12]
and w13, w13, #0x1
orr w9, w9, w12, lsl #8
umov w12, v0.b[13]
and w14, w14, #0x1
orr w9, w9, w13, lsl #9
umov w13, v0.b[14]
and w10, w10, #0x1
orr w9, w9, w14, lsl #10
and w11, w11, #0x1
orr w9, w9, w10, lsl #11
and w10, w12, #0x1
umov w12, v0.b[15]
orr w9, w9, w11, lsl #12
and w11, w13, #0x1
orr w9, w9, w10, lsl #13
orr w9, w9, w11, lsl #14
orr w9, w9, w12, lsl #15
bics wzr, w8, w9
cset w0, eq
add sp, sp, #16
.cfi_def_cfa_offset 0
ret After portable_simd_test::mask_all:
movi v0.2d, #0xffffffffffffffff
ldr q1, [x0]
cmgt v0.16b, v1.16b, v0.16b
umaxv b0, v0.16b
fmov w8, s0
mvn w8, w8
and w0, w8, #0x1
ret But this does not seem to lead to an improvement for the I haven't checked any other target platforms yet. Thanks for the pointer to the tests, I'll have a look at those. |
Cool! That's weird, but also good to know. |
That pub fn is_hex_mask(chunk: &[u8; 16]) -> bool {
let x = u8x16::from_array(*chunk);
let m1 = x.simd_gt(splat(b'0' - 1));
let m2 = x.simd_lt(splat(b'9' + 1));
let m3 = x.simd_gt(splat(b'a' - 1));
let m4 = x.simd_lt(splat(b'f' + 1));
let m = (m1 & m2) | (m3 & m4);
m.all()
} With the changes from this PR, changing that last line to either one of the following alternatives gets vectorized: (m.to_int().simd_ne(splat(0))).all() !((!m).any()) And maybe the weirdest one fn mask_all(m: mask8x16) -> bool {
m.all()
}
mask_all(m) All these seem to help llvm in knowing that the bit pattern is all ones/zeros. Maybe this is less of a problem on x86 since llvm knows which of the mask instructions only use the high bit. So the pattern of "shift to lowest, trunc to i1, i1 vector op", allows it to directly use the high bit. While on aarch64, the operations require a full mask, but with the shift this mask can be created with a simpler vector comparison. |
r? wg-llvm |
☔ The latest upstream changes (presumably #106573) made this pull request unmergeable. Please resolve the merge conflicts. |
18516d7
to
ca3971d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this remains in a draft state? I see that some tests have been added already and they now test two targets. I believe proving one or two target to produce a better code to be enough of a bar for this kind of change.
|
||
// assembly-output: emit-asm | ||
// compile-flags: --crate-type=lib -O -C llvm-args=-x86-asm-syntax=intel | ||
// only-x86_64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests could be deduplicated slightly if you made these tests #[no_core]
and to use revisions and --target=$TARGET
(grep for revisions
in the test suite for usage examples).
These tests look like they wouldn’t be too hard to adapt to #[no_core]
either, given that Copy
and Clone
derives are superfluous…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking into combining the tests, but so far I could not get the CHECK
annotations to work, or rather now they only work for one revision. Looking at runtest.rs I think the following should work:
// revisions: aarch64 x86_64
// [x86_64] compile-flags: --target=x86_64-unknown-linux-gnu -C llvm-args=-x86-asm-syntax=intel
// [x86_64] needs-llvm-components: x86_64
// [aarch64] compile-flags: --target=aarch64-unknown-linux-gnu
// [aarch64] needs-llvm-components: aarch64
// assembly-output: emit-asm
// compile-flags: --crate-type=lib -O
...
pub unsafe fn mask_reduce_all(m: mask8x16) -> bool {
// CHECK,NONMSVC,x86_64-NOT: psllw
// CHECK,NONMSVC,x86_64: pmovmskb
// CHECK,NONMSVC,x86_64: foo
// CHECK,NONMSVC,aarch64: umaxv
// CHECK,NONMSVC,aarch64: foo
simd_reduce_all(m)
}
The aarch64
checks seem to work, since I can cause them to fail by looking for foo
. The x86_64
revision always passes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess CHECK,NONMSVC,x86_64
comes from the --check-prefix
argument to filecheck
– that just means that it will accept all of these prefixes (separated by comma) and ignore all others. But it won't understand a CHECK,NONMSVC,x86_64
prefix.
So writing CHECK
s for specific revisions is just replacing CHECK
with the name of the revision. So:
// x86_64-NOT: psllw
// x86_64: pmovmskb
// x86_64: foo
// aarch64: umaxv
// aarch64: foo
I’m sorry we don't have good docs for this T_T
But Î'm really less concerned about the duplication and more about the ability to run the x86_64 tests from e.g. a M1 mac and vice-versa, which would require them to be #[no_core]
and drop the only-*
annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll take another look at this, my previous assumption was that the whole prefix was matched as a single string.
ca3971d
to
9dd9391
Compare
That was an interesting rebase, with the tests having moved in the mean time. Sorry for taking so long coming back to this PR. |
Nagisa said they'll try to get to this soon. In the meantime, my experience is that these kinds of PRs are very prone to having something break during the cross-platform builds, so let's @bors try |
⌛ Trying commit 9dd9391 with merge 8e6cd1d9198b904379afff0d143d7d037b710b61... |
☀️ Try build successful - checks-actions |
Oh hm that doesn't run the full CI (anymore?) (did it ever?), I should probably just finally start up an aarch64 machine and test this on that then. |
Since we already have a try build, it wouldn’t hurt a to have a |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the extremely late review.
I see a couple examples where a plain trunc
is replaced by a shift with a follow-up trunc. Can you expand more on why that is correct? At the very least Alive2 claims this is not equivalent. Do we always store the mask bit as the MSB when legalizing the argument into a passable type?
// CHECK: select <4 x i1> | ||
// CHECK: [[A:%[0-9]+]] = lshr <4 x i8> %{{m|1}}, <i8 7, i8 7, i8 7, i8 7> | ||
// CHECK: [[B:%[0-9]+]] = trunc <4 x i8> [[A]] to <4 x i1> | ||
// CHECK: select <4 x i1> [[B]] | ||
simd_select(m, a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe machine-code wise this ends up being a regression on x86_64, isn’t it? Seems like the optimizations proposed here are very dependent on the source test code...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually (or at least in portable_simd) the size of the mask elements would be the same as the corresponding data elements. So if this example regresses, but the one using 32 bit masks for f32 improves I think that would be an ok tradeoff.
I think though that llvm should be able to still optimize this. Running the llvm bitcode through opt
first transforms it into
define void @select(ptr noalias nocapture noundef writeonly sret(<4 x float>) dereferenceable(16) %v0, ptr noalias nocapture noundef readonly dereferenceable(4) %m, ptr noalias nocapture noundef readonly dereferenceable(16) %a, ptr noalias nocapture noundef readonly dereferenceable(16) %b) unnamed_addr #0 !dbg !6 {
%v1 = load <4 x i8>, ptr %m, align 4
%v2 = load <4 x float>, ptr %a, align 16
%v3 = load <4 x float>, ptr %b, align 16
%v4.not1 = icmp slt <4 x i8> %v1, zeroinitializer
%v5 = select <4 x i1> %v4.not1, <4 x float> %v2, <4 x float> %v3
store <4 x float> %v5, ptr %v0, align 16
ret void, !dbg !11
}
Which then compiles to even shorter assembly, removing also the shift by 31 bits:
select_opt: # @select_opt
mov rax, rdi
vpmovsxbd xmm0, dword ptr [rsi]
vmovaps xmm1, xmmword ptr [rcx]
vblendvps xmm0, xmm1, xmmword ptr [rdx], xmm0
vmovaps xmmword ptr [rdi], xmm0
ret
// SIMD vectors are passed directly, resulting in `%x` being a vector, | ||
// while on others they're passed indirectly, resulting in `%x` being | ||
// a pointer to a vector, and `%1` a vector loaded from that pointer. | ||
// This is controlled by the target spec option `simd_types_indirect`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems somewhat fickle, but I guess there isn't much in terms of alternatives here…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment and logic was copied from another codegen test simd-intrinsic-generic-bitmask.rs, Since the code there has not changed I assume there is still no better alternative.
@@ -1518,31 +1515,25 @@ fn generic_simd_intrinsic<'ll, 'tcx>( | |||
assert_eq!(underlying_ty, non_ptr(element_ty0)); | |||
|
|||
// The element type of the third argument must be a signed integer type of any width: | |||
match element_ty2.kind() { | |||
ty::Int(_) => (), | |||
let mask_elem_bitwidth = match element_ty2.kind() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the simd_scatter
intrinsic, but none of the tests tests that were touched in this test involve it? Do we have no simd_scatter
tests? Please add one or adjust an existing one if there is one.
@@ -1377,31 +1380,25 @@ fn generic_simd_intrinsic<'ll, 'tcx>( | |||
|
|||
// The element type of the third argument must be a signed integer type of any width: | |||
let (_, element_ty2) = arg_tys[2].simd_size_and_type(bx.tcx()); | |||
match element_ty2.kind() { | |||
ty::Int(_) => (), | |||
let mask_elem_bitwidth = match element_ty2.kind() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the simd_gather
intrinsic, but none of the tests tests that were touched in this test involve it? Do we have no simd_gather
tests? Please add one or adjust an existing one if there is one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have existing simd_gather
and simd_scatter
tests, but I believe none of the existing tests are affected by this change. Is that a problem, exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that this exposes a hole/deficiency in the test suite coverage. Ideally we take care of it at the same time we find out about this, but I’d be okay with an issue describing tests that should be added or even just not doing anything if y'all are comfortable with not having test coverage here.
That said, seeing test changes would definitely have helped with reviewing this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized I still had some uncommitted changes to the tests locally :( will push those in a minute.
There were codegen tests for gather/scatter, but they only checked that the llvm intrinsic was correctly emitted and did not verify how the mask was converted to i1
. I adjusted these tests.
What might be missing is more assembly tests for these intrinsics that verify no unnecessary shifts or other instructions get emitted.
Switching to waiting on author as it seems there has been a review. @jhorstmann Feel free to request a review with @rustbot author |
☔ The latest upstream changes (presumably #114148) made this pull request unmergeable. Please resolve the merge conflicts. |
This comment has been minimized.
This comment has been minimized.
☔ The latest upstream changes (presumably #105545) made this pull request unmergeable. Please resolve the merge conflicts. |
☔ The latest upstream changes (presumably #117444) made this pull request unmergeable. Please resolve the merge conflicts. |
@jhorstmann any updates on this? thanks |
d17819e
to
83a5dd4
Compare
@Dylan-DPC I updated the PR and adjusted also the new masked load/store intrinsics to use the same logic. I also added another assembly test for masked load that shows there are no unneeded shift instructions in the output. The current output is load_f64x4:
vpsllq ymm0, ymmword ptr [rdi], 63
vpmovq2m k1, ymm0
vmovupd ymm0 {k1} {z}, ymmword ptr [rsi]
vmovapd ymmword ptr [rdx], ymm0
vzeroupper
ret |
This comment has been minimized.
This comment has been minimized.
require!( | ||
matches!(mask_elem.kind(), ty::Int(_)), | ||
let m_elem_bitwidth = require_int_ty!( | ||
mask_elem.kind(), | ||
InvalidMonomorphization::ThirdArgElementType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should refer to the first argument, but there is a ui test for it and fixing it probably belongs in a separate PR
This comment has been minimized.
This comment has been minimized.
The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 xN>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit. The exception is simd_bitmask, which already uses the most-significant bit. This additional instruction would be removed by the fix in rust-lang#104693, which uses the most significant bit for all mask operations.
I opened #121953 with assembly tests showing the current behavior. If those are accepted to not cause problems during further llvm updates I would rebase this PR on top of those tests. |
The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 xN>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit. On aarch64 the left shift is followed by a comparison against zero, which repeats the sign bit across the whole lane. The exception, which does not introduce an unneeded shift, is simd_bitmask, because the code generation already shifts before truncating. This additional instruction would be removed by the fix in rust-lang#104693, which uses the most significant bit for all mask operations.
The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 xN>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit. On aarch64 the left shift is followed by a comparison against zero, which repeats the sign bit across the whole lane. The exception, which does not introduce an unneeded shift, is simd_bitmask, because the code generation already shifts before truncating. By using the "C" calling convention the tests should be stable regarding changes in register allocation, but it is possible that future llvm updates will require updating some of the checks. This additional instruction would be removed by the fix in rust-lang#104693, which uses the most significant bit for all mask operations.
…ed-simd-instructions, r=workingjubilee Add tests for the generated assembly of mask related simd instructions. The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 xN>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit. The exception is simd_bitmask, which already uses the most-significant bit. This additional instruction would be removed by the changes in rust-lang#104693, which makes all mask operations consistently use the most significant bits. By using the "C" calling convention the tests should be stable regarding changes in register allocation, but it is possible that future llvm updates will require updating some of the checks.
Rollup merge of rust-lang#121953 - jhorstmann:assembly-tests-for-masked-simd-instructions, r=workingjubilee Add tests for the generated assembly of mask related simd instructions. The tests show that the code generation currently uses the least significant bits of <iX x N> vector masks when converting to <i1 xN>. This leads to an additional left shift operation in the assembly for x86, since mask operations on x86 operate based on the most significant bit. The exception is simd_bitmask, which already uses the most-significant bit. This additional instruction would be removed by the changes in rust-lang#104693, which makes all mask operations consistently use the most significant bits. By using the "C" calling convention the tests should be stable regarding changes in register allocation, but it is possible that future llvm updates will require updating some of the checks.
☔ The latest upstream changes (presumably #122389) made this pull request unmergeable. Please resolve the merge conflicts. |
This improves the codegen for vector `select`, `gather`, `scatter` and boolean reduction intrinsics and fixes rust-lang/portable-simd#316. The current behavior of mask operations during llvm codegen is to truncate the mask vector to <N x i1>, telling llvm to use the least significat bit. Since sse/avx instructions are defined to use the most significant bit, llvm has to insert a left shift before the mask can actually be used. Similarly on aarch64, mask operations like blend work bit by bit, repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero. By shifting before truncating to <N x i1>, we tell llvm that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.
a0e0528
to
7252870
Compare
Rebased on top of the just merged assembly tests from #121953, this hopefully makes the improvement clearer. I also tried adding an pub unsafe extern "C" fn mask_reduce_all(m: mask8x16) -> bool {
//let either_ones_or_zeroes: mask8x16 = simd_eq(m, simd_shr(m, mask8x16([7; 16])));
let either_ones_or_zeroes: mask8x16 = simd_or(simd_eq(m, mask8x16([0; 16])), simd_eq(m, mask8x16([0xFF_u8 as i8; 16])));
if simd_reduce_all(either_ones_or_zeroes) {
simd_reduce_all(m)
} else {
unreachable()
}
} |
This improves the codegen for vector
select
,gather
,scatter
and boolean reduction intrinsics and fixes rust-lang/portable-simd#316.The current behavior of mask operations during llvm codegen is to truncate the mask vector to , telling llvm to use the least significat bit.
Since sse/avx instructions are defined to use the most significant bit, llvm has to insert a left shift before the mask can actually be used.
Similarly on aarch64, mask operations like blend work bit by bit, repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero.
By shifting before truncating to , we tell llvm that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.