-
Notifications
You must be signed in to change notification settings - Fork 42
8283709: Add x86 back-end implementation for bit BIT_COUNT operation #185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8283709: Add x86 back-end implementation for bit BIT_COUNT operation #185
Conversation
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
Webrevs
|
| ShouldNotReachHere(); | ||
| } | ||
| if (bt == T_INT) { | ||
| if (rbt == T_INT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the SLP and the VectorAPI uses the same PopCountVL with different basic type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to separate this cast for SLP and always uses "T_LONG' as the basic type for PopCountVL ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have type homogeneity for PopCountVL across VectorAPI and SLP once we start auto-vectorizing ConvL2I operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes the code hard to maintain, and I think @merykitty 's unaddressed comment in: openjdk/jdk#6857 (comment) is a preferable approach.
FYI: @fg1417 has a ConvL2I auto-vectorization support under review: openjdk/jdk#7806.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so if the ConvL2I is supported to be auto-vectorized after then, the PopCountL is better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so if the
ConvL2Iis supported to be auto-vectorized after then, thePopCountLis better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!
Agree to your proposal, once we have auto-vectorization for ConvL2I n place we can change the IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so if the
ConvL2Iis supported to be auto-vectorized after then, thePopCountLis better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!Agree to your proposal, once we have auto-vectorization for ConvL2I n place we can change the IR.
Here is an interim patch which type casts the result of PopCountVL into integer vector and thus prevent any special
handling in target specific backend implementations. But it seems more appropriate to generate correct IR snippet (PopCountL + ConvL2I) during scalar operation inline expansion and explicitly handle ConvL2I during auto-vectorization.
diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index 1bf4c7a3282..b3dd90d1867 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -2549,11 +2549,21 @@ void SuperWord::output() {
vn = VectorNode::make(opc, in1, in2, vlen, velt_basic_type(n));
vlen_in_bytes = vn->as_Vector()->length_in_bytes();
}
+ } else if (opc == Op_PopCountL) {
+ // TODO: Succeed PopCountL by ConvL2I during initial graph construction
+ // once auto-vectorizer supports ConvL2I operation.
+ assert(n->req() == 2, "only one input expected");
+ Node* in = vector_opd(p, 1);
+ BasicType bt = in->bottom_type()->is_vect()->element_basic_type();
+ in = VectorNode::make(opc, in, NULL, vlen, bt);
+ _igvn.register_new_node_with_optimizer(in);
+ vn = VectorCastNode::make(Op_VectorCastL2X, in, T_INT, vlen);
+ vlen_in_bytes = vn->as_Vector()->length_in_bytes();
} else if (opc == Op_SqrtF || opc == Op_SqrtD ||
opc == Op_AbsF || opc == Op_AbsD ||
opc == Op_AbsI || opc == Op_AbsL ||
opc == Op_NegF || opc == Op_NegD ||
- opc == Op_PopCountI || opc == Op_PopCountL) {
+ opc == Op_PopCountI) {
assert(n->req() == 2, "only one input expected");
Node* in = vector_opd(p, 1);
vn = VectorNode::make(opc, in, NULL, vlen, velt_basic_type(n));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for it. I will do it once PR openjdk/jdk#7806 is merged. Thanks!
|
The mid-end part looks good to me! Thanks! |
|
@jatin-bhateja This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
|
Patch needs to be updated with predicated patterns. |
…-vectorized generated IR.
…, this is auto-vectorized as load from Boolean vector. Adding missing types in macroassembler to cover these cases.
| case T_BYTE: | ||
| case T_SHORT: | ||
| case T_INT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks misplaced. There is no switch here.
| } | ||
| } | ||
|
|
||
| void C2_MacroAssembler::vbroadcastd(XMMRegister dst, Register rtmp, int imm32, int vec_enc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to have rtmp after imm32. Basically temp registers after required inputs.
|
|
||
| void C2_MacroAssembler::vector_popcount_int(XMMRegister dst, XMMRegister src, XMMRegister xtmp1, | ||
| XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp, | ||
| int vec_enc) { | ||
| if (VM_Version::supports_avx512_vpopcntdq()) { | ||
| vpopcntd(dst, src, vec_enc); | ||
| } else { | ||
| assert((vec_enc == Assembler::AVX_512bit && VM_Version::supports_avx512bw()) || VM_Version::supports_avx2(), ""); | ||
| movl(rtmp, 0x0F0F0F0F); | ||
| movdl(xtmp1, rtmp); | ||
| vpbroadcastd(xtmp1, xtmp1, vec_enc); | ||
| if (Assembler::AVX_512bit == vec_enc) { | ||
| evmovdqul(xtmp2, k0, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), false, vec_enc, rtmp); | ||
| } else { | ||
| vmovdqu(xtmp2, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), rtmp); | ||
| } | ||
| vpand(xtmp3, src, xtmp1, vec_enc); | ||
| vpshufb(xtmp3, xtmp2, xtmp3, vec_enc); | ||
| vpsrlw(dst, src, 4, vec_enc); | ||
| vpand(dst, dst, xtmp1, vec_enc); | ||
| vpshufb(dst, xtmp2, dst, vec_enc); | ||
| vpaddb(xtmp3, dst, xtmp3, vec_enc); | ||
| vpxor(xtmp1, xtmp1, xtmp1, vec_enc); | ||
| vpunpckhdq(dst, xtmp3, xtmp1, vec_enc); | ||
| vpsadbw(dst, dst, xtmp1, vec_enc); | ||
| vpunpckldq(xtmp2, xtmp3, xtmp1, vec_enc); | ||
| vpsadbw(xtmp2, xtmp2, xtmp1, vec_enc); | ||
| vpackuswb(dst, xtmp2, dst, vec_enc); | ||
| } | ||
| } | ||
|
|
||
| void C2_MacroAssembler::vector_popcount_long(BasicType bt, XMMRegister dst, XMMRegister src, XMMRegister xtmp1, | ||
| XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp, int vec_enc) { | ||
| if (VM_Version::supports_avx512_vpopcntdq()) { | ||
| vpopcntq(dst, src, vec_enc); | ||
| } else if (vec_enc == Assembler::AVX_512bit) { | ||
| assert(VM_Version::supports_avx512bw(), ""); | ||
| movl(rtmp, 0x0F0F0F0F); | ||
| movdl(xtmp1, rtmp); | ||
| vpbroadcastd(xtmp1, xtmp1, vec_enc); | ||
| evmovdqul(xtmp2, k0, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), true, vec_enc, rtmp); | ||
| vpandq(xtmp3, src, xtmp1, vec_enc); | ||
| vpshufb(xtmp3, xtmp2, xtmp3, vec_enc); | ||
| vpsrlw(dst, src, 4, vec_enc); | ||
| vpandq(dst, dst, xtmp1, vec_enc); | ||
| vpshufb(dst, xtmp2, dst, vec_enc); | ||
| vpaddb(xtmp3, dst, xtmp3, vec_enc); | ||
| vpxorq(xtmp1, xtmp1, xtmp1, vec_enc); | ||
| vpsadbw(dst, xtmp3, xtmp1, vec_enc); | ||
| XMMRegister xtmp2, Register rtmp, int vec_enc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments need updating. Majority of comments above this method need to move to vector_popcount_byte. And then it is better to give comments at each step below in the method for easy review and maintenance. Please add comments to short, byte, long as well on similar lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code has been modularized byte vector popcount is the leaf level operation and all the other primitive type operation build on top of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without comments around the instructions it is very hard to review what the set of instructions are supposed to do.
| // We do not see any performance benefit of running | ||
| // above instruction sequence on 256 bit vector which | ||
| // can operate over maximum 4 long elements. | ||
| ShouldNotReachHere(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From Vector API point of view it would be good to have the intrinsic for 256-bit/128-bit Long vectors as well.
src/hotspot/cpu/x86/x86.ad
Outdated
| if ((bt == T_LONG) && !VM_Version::supports_avx512_vpopcntdq()) { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bt is T_LONG for this case, we don't need to check type.
| // TODO: Once auto-vectorizer supports ConvL2I operation, PopCountVL | ||
| // should be succeeded by its corresponding vector IR and following | ||
| // special handling should be removed. | ||
| if (opcode == Op_PopCountVL && Matcher::vector_element_basic_type(this) == T_INT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs clarification in comments that the behavior is different based on vector api vs auto vectorizer. Have we tested this to work appropriately in both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do have explicit bit count tests in for auto-vectorizer flow to test this flow.
| int vlen_enc = vector_length_encoding(this, $src); | ||
| BasicType bt = Matcher::vector_element_basic_type(this, $src); | ||
| __ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vlen_enc); | ||
| __ vector_popcount_integral_evex(bt, $dst$$XMMRegister, $src$$XMMRegister, $mask$$KRegister, true, vlen_enc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No auto vectorizer path here where the result in int vector for long vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because this is predicated operation pattern and auto-vectorizer does not support predicate operation inferencing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The post loop work on the mainline would support the masked instructions generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto vectorizer restricts predication to load and store operations.
src/hotspot/cpu/x86/x86.ad
Outdated
| effect(TEMP dst, TEMP xtmp1, TEMP xtmp2, TEMP rtmp); | ||
| format %{ "vector_popcount_int $dst, $src\t! using $xtmp1, $xtmp2 and $rtmp as TEMP" %} | ||
| ins_encode %{ | ||
| assert(UsePopCountInstruction, "not enabled"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert needs to be removed on this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same a above, it should be OK to keep this as is since its already used in match_rule_supported method.
src/hotspot/cpu/x86/x86.ad
Outdated
| effect(TEMP dst, TEMP xtmp1, TEMP xtmp2, TEMP xtmp3, TEMP rtmp); | ||
| format %{ "vector_popcount_long $dst, $src\t! using $xtmp1, $xtmp2, $xtmp3, and $rtmp as TEMP" %} | ||
| ins_encode %{ | ||
| assert(UsePopCountInstruction, "not enabled"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert needs to be removed on this path. Do you know why the testing didn't catch this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same a above, it should be OK to keep this as is since its already used in match_rule_supported method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is !VM_Version::supports_avx512_vpopcntdq() so UsePopCountInstruction should have been false.
src/hotspot/cpu/x86/x86.ad
Outdated
| // should be succeeded by its corresponding vector IR and following | ||
| // special handling should be removed. | ||
| if (bt == T_INT) { | ||
| __ evpmovqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we using evex instruction for avx path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AVX512 targets not supporting AVX512_POPCOUNTD feature can still have 64 byte vector operation, since long pop-count was only enabled for 512 bit vectors uptill now hence a down-casting EVEX instruction worked fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you fixed this is in your review comment resolution by adding a check for avx512vl around this instruction and a separate instruction generation for non avx512 path, thanks.
|
Other than couple of comments above, rest of the patch looks good to me. |
sviswa7
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the changes suggested here. No need for re-review.
|
|
||
| // Use population count instruction if available. | ||
| if (supports_popcnt()) { | ||
| if (supports_popcnt() || supports_avx512_vpopcntdq() || supports_avx512_bitalg()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UsePopCountInstruction is only for scalar popcont. Extending it to Vector popcount is causing lot of confustion. Let us keep it for scalar only as below:
if (supports_popcnt()) {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UsePopcountInstruction implies generating direct popcount instruction for targets which support it. A bit count operation can also be implemented without popcount as depicted by this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree its adding some confusion since user may turn off UsePopcountInstruction and may want to generate optimized instructions for bit count over AVX2 targets using Vector API.
For SLP turning off UsePopcountInstruction will prevent creating scalar IR which will also inhabit inferencing of vector operation by auto vectorizer.
src/hotspot/cpu/x86/x86.ad
Outdated
| predicate(UsePopCountInstruction && | ||
| ((is_subword_type(Matcher::vector_element_basic_type(n->in(1))) && | ||
| VM_Version::supports_avx512_bitalg()) || | ||
| (is_non_subword_integral_type(Matcher::vector_element_basic_type(n->in(1))) && | ||
| VM_Version::supports_avx512_vpopcntdq()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be replaced by:
predicate(is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));
Also no need to check for UsePopCountInstruction any where as it is only meant for scalar popcount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I already cleaned this up, its used at couple of places and will refactor when I create mainline integration patch.
src/hotspot/cpu/x86/x86.ad
Outdated
| predicate(UsePopCountInstruction && | ||
| ((is_subword_type(Matcher::vector_element_basic_type(n->in(1))) && | ||
| VM_Version::supports_avx512_bitalg()) || | ||
| (is_non_subword_integral_type(Matcher::vector_element_basic_type(n->in(1))) && | ||
| VM_Version::supports_avx512_vpopcntdq()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace by:
predicate(is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));
Also no need to check for UsePopCountInstruction any where as it is only meant for scalar popcount.
src/hotspot/cpu/x86/x86.ad
Outdated
| predicate(!UsePopCountInstruction || | ||
| (!VM_Version::supports_avx512_vpopcntdq() && Matcher::vector_element_basic_type(n->in(1)) == T_INT) || | ||
| (!VM_Version::supports_avx512_bitalg() && is_subword_type(Matcher::vector_element_basic_type(n->in(1))))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be replaced by:
predicate((Matcher::vector_element_basic_type(n->in(1)) != T_LONG)
&& !is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));
| StubRoutines::x86::_vector_reverse_byte_perm_mask_short = generate_vector_reverse_byte_perm_mask_short("perm_mask_short"); | ||
|
|
||
| if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) { | ||
| if (VM_Version::supports_avx2() && (!VM_Version::supports_avx512_vpopcntdq() || !UsePopCountInstruction)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the reference to UsePopCountInstruction here.
| StubRoutines::x86::_vector_reverse_byte_perm_mask_short = generate_vector_reverse_byte_perm_mask_short("perm_mask_short"); | ||
|
|
||
| if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) { | ||
| if (VM_Version::supports_avx2() && (!VM_Version::supports_avx512_vpopcntdq() || !UsePopCountInstruction)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the reference to UsePopCountInstruction here.
src/hotspot/cpu/x86/x86.ad
Outdated
|
|
||
| instruct vpopcountL_avx_reg(vec dst, vec src, vec xtmp1, vec xtmp2, vec xtmp3, rRegP rtmp) %{ | ||
| predicate(!VM_Version::supports_avx512_vpopcntdq()); | ||
| predicate(!UsePopCountInstruction || !VM_Version::supports_avx512_vpopcntdq()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the reference to UsePopCountInstruction here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a matcher path which can be chosen even if user turns off UsePopCountInstruction, relating this flag to an operation may not be correct. But I agree that to avoid any confusion we can remove its usage from pattern for the time being.
| void C2_MacroAssembler::vector_popcount_integral_evex(BasicType bt, XMMRegister dst, XMMRegister src, | ||
| KRegister mask, bool merge, int vec_enc) { | ||
| assert(VM_Version::supports_avx512vl() || vec_enc == Assembler::AVX_512bit, ""); | ||
| assert(UsePopCountInstruction, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this assert on UsePopCountInstruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, its counter intuitive to enter into this control flow with UsePopcountInstruction being false.
|
@sviswa7 , your closing comments addressed. |
|
/integrate |
|
Going to push as commit 1bc4187. |
|
@jatin-bhateja Pushed as commit 1bc4187. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Summary of changes:
for sub-word type (BYTE, SHORT) vectors over X86 targets supporting AVA2 and AVX512 features.
Kindly review and share you feedback.
Best Regards,
Jatin
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/185/head:pull/185$ git checkout pull/185Update a local copy of the PR:
$ git checkout pull/185$ git pull https://git.openjdk.java.net/panama-vector pull/185/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 185View PR using the GUI difftool:
$ git pr show -t 185Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/185.diff