Skip to content

Conversation

@jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Mar 25, 2022

Summary of changes:

  • Patch re-uses existing C2 IR nodes and re-organizes LUT based JIT code sequence of VectorOperations.BIT_COUNT operation
    for sub-word type (BYTE, SHORT) vectors over X86 targets supporting AVA2 and AVX512 features.
  • Efficient single instruction POPCOUNT instruction is emitted for applicable targets.

Kindly review and share you feedback.

Best Regards,
Jatin


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8283709: Add x86 back-end implementation for bit BIT_COUNT operation

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/185/head:pull/185
$ git checkout pull/185

Update a local copy of the PR:
$ git checkout pull/185
$ git pull https://git.openjdk.java.net/panama-vector pull/185/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 185

View PR using the GUI difftool:
$ git pr show -t 185

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/185.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Mar 25, 2022

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into vectorIntrinsics will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 25, 2022
@mlbridge
Copy link

mlbridge bot commented Mar 25, 2022

ShouldNotReachHere();
}
if (bt == T_INT) {
if (rbt == T_INT) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the SLP and the VectorAPI uses the same PopCountVL with different basic type?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to separate this cast for SLP and always uses "T_LONG' as the basic type for PopCountVL ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have type homogeneity for PopCountVL across VectorAPI and SLP once we start auto-vectorizing ConvL2I operation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes the code hard to maintain, and I think @merykitty 's unaddressed comment in: openjdk/jdk#6857 (comment) is a preferable approach.

FYI: @fg1417 has a ConvL2I auto-vectorization support under review: openjdk/jdk#7806.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so if the ConvL2I is supported to be auto-vectorized after then, the PopCountL is better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!

Copy link
Member Author

@jatin-bhateja jatin-bhateja Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so if the ConvL2I is supported to be auto-vectorized after then, the PopCountL is better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!

Agree to your proposal, once we have auto-vectorization for ConvL2I n place we can change the IR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so if the ConvL2I is supported to be auto-vectorized after then, the PopCountL is better to be auto-vectorized to "PopCountVL + VectorCastL2X" ? Do you have any plan on it? Thanks so much!

Agree to your proposal, once we have auto-vectorization for ConvL2I n place we can change the IR.

Here is an interim patch which type casts the result of PopCountVL into integer vector and thus prevent any special
handling in target specific backend implementations. But it seems more appropriate to generate correct IR snippet (PopCountL + ConvL2I) during scalar operation inline expansion and explicitly handle ConvL2I during auto-vectorization.

diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index 1bf4c7a3282..b3dd90d1867 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -2549,11 +2549,21 @@ void SuperWord::output() {
           vn = VectorNode::make(opc, in1, in2, vlen, velt_basic_type(n));
           vlen_in_bytes = vn->as_Vector()->length_in_bytes();
         }
+      } else if (opc == Op_PopCountL) {
+        // TODO: Succeed PopCountL by ConvL2I during initial graph construction 
+        // once auto-vectorizer supports ConvL2I operation.
+        assert(n->req() == 2, "only one input expected");
+        Node* in = vector_opd(p, 1);
+  	   BasicType bt = in->bottom_type()->is_vect()->element_basic_type();
+         in = VectorNode::make(opc, in, NULL, vlen, bt);
+         _igvn.register_new_node_with_optimizer(in);
+	   vn = VectorCastNode::make(Op_VectorCastL2X, in, T_INT, vlen);
+         vlen_in_bytes = vn->as_Vector()->length_in_bytes();
       } else if (opc == Op_SqrtF || opc == Op_SqrtD ||
                  opc == Op_AbsF || opc == Op_AbsD ||
                  opc == Op_AbsI || opc == Op_AbsL ||
                  opc == Op_NegF || opc == Op_NegD ||
-                 opc == Op_PopCountI || opc == Op_PopCountL) {
+                opc == Op_PopCountI) {
         assert(n->req() == 2, "only one input expected");
         Node* in = vector_opd(p, 1);
         vn = VectorNode::make(opc, in, NULL, vlen, velt_basic_type(n));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for it. I will do it once PR openjdk/jdk#7806 is merged. Thanks!

@XiaohongGong
Copy link
Collaborator

The mid-end part looks good to me! Thanks!

@openjdk
Copy link

openjdk bot commented Mar 30, 2022

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283709: Add x86 back-end implementation for bit BIT_COUNT operation

Reviewed-by: xgong, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the vectorIntrinsics branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the vectorIntrinsics branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 30, 2022
@jatin-bhateja
Copy link
Member Author

Patch needs to be updated with predicated patterns.

Comment on lines 154 to 156
case T_BYTE:
case T_SHORT:
case T_INT:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks misplaced. There is no switch here.

}
}

void C2_MacroAssembler::vbroadcastd(XMMRegister dst, Register rtmp, int imm32, int vec_enc) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have rtmp after imm32. Basically temp registers after required inputs.

Comment on lines 4476 to 4478

void C2_MacroAssembler::vector_popcount_int(XMMRegister dst, XMMRegister src, XMMRegister xtmp1,
XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp,
int vec_enc) {
if (VM_Version::supports_avx512_vpopcntdq()) {
vpopcntd(dst, src, vec_enc);
} else {
assert((vec_enc == Assembler::AVX_512bit && VM_Version::supports_avx512bw()) || VM_Version::supports_avx2(), "");
movl(rtmp, 0x0F0F0F0F);
movdl(xtmp1, rtmp);
vpbroadcastd(xtmp1, xtmp1, vec_enc);
if (Assembler::AVX_512bit == vec_enc) {
evmovdqul(xtmp2, k0, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), false, vec_enc, rtmp);
} else {
vmovdqu(xtmp2, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), rtmp);
}
vpand(xtmp3, src, xtmp1, vec_enc);
vpshufb(xtmp3, xtmp2, xtmp3, vec_enc);
vpsrlw(dst, src, 4, vec_enc);
vpand(dst, dst, xtmp1, vec_enc);
vpshufb(dst, xtmp2, dst, vec_enc);
vpaddb(xtmp3, dst, xtmp3, vec_enc);
vpxor(xtmp1, xtmp1, xtmp1, vec_enc);
vpunpckhdq(dst, xtmp3, xtmp1, vec_enc);
vpsadbw(dst, dst, xtmp1, vec_enc);
vpunpckldq(xtmp2, xtmp3, xtmp1, vec_enc);
vpsadbw(xtmp2, xtmp2, xtmp1, vec_enc);
vpackuswb(dst, xtmp2, dst, vec_enc);
}
}

void C2_MacroAssembler::vector_popcount_long(BasicType bt, XMMRegister dst, XMMRegister src, XMMRegister xtmp1,
XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp, int vec_enc) {
if (VM_Version::supports_avx512_vpopcntdq()) {
vpopcntq(dst, src, vec_enc);
} else if (vec_enc == Assembler::AVX_512bit) {
assert(VM_Version::supports_avx512bw(), "");
movl(rtmp, 0x0F0F0F0F);
movdl(xtmp1, rtmp);
vpbroadcastd(xtmp1, xtmp1, vec_enc);
evmovdqul(xtmp2, k0, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), true, vec_enc, rtmp);
vpandq(xtmp3, src, xtmp1, vec_enc);
vpshufb(xtmp3, xtmp2, xtmp3, vec_enc);
vpsrlw(dst, src, 4, vec_enc);
vpandq(dst, dst, xtmp1, vec_enc);
vpshufb(dst, xtmp2, dst, vec_enc);
vpaddb(xtmp3, dst, xtmp3, vec_enc);
vpxorq(xtmp1, xtmp1, xtmp1, vec_enc);
vpsadbw(dst, xtmp3, xtmp1, vec_enc);
XMMRegister xtmp2, Register rtmp, int vec_enc) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments need updating. Majority of comments above this method need to move to vector_popcount_byte. And then it is better to give comments at each step below in the method for easy review and maintenance. Please add comments to short, byte, long as well on similar lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New code has been modularized byte vector popcount is the leaf level operation and all the other primitive type operation build on top of it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without comments around the instructions it is very hard to review what the set of instructions are supposed to do.

Comment on lines 4517 to 4520
// We do not see any performance benefit of running
// above instruction sequence on 256 bit vector which
// can operate over maximum 4 long elements.
ShouldNotReachHere();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Vector API point of view it would be good to have the intrinsic for 256-bit/128-bit Long vectors as well.

Comment on lines 2061 to 2063
if ((bt == T_LONG) && !VM_Version::supports_avx512_vpopcntdq()) {
return false;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bt is T_LONG for this case, we don't need to check type.

// TODO: Once auto-vectorizer supports ConvL2I operation, PopCountVL
// should be succeeded by its corresponding vector IR and following
// special handling should be removed.
if (opcode == Op_PopCountVL && Matcher::vector_element_basic_type(this) == T_INT) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs clarification in comments that the behavior is different based on vector api vs auto vectorizer. Have we tested this to work appropriately in both cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do have explicit bit count tests in for auto-vectorizer flow to test this flow.

int vlen_enc = vector_length_encoding(this, $src);
BasicType bt = Matcher::vector_element_basic_type(this, $src);
__ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vlen_enc);
__ vector_popcount_integral_evex(bt, $dst$$XMMRegister, $src$$XMMRegister, $mask$$KRegister, true, vlen_enc);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No auto vectorizer path here where the result in int vector for long vector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because this is predicated operation pattern and auto-vectorizer does not support predicate operation inferencing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The post loop work on the mainline would support the masked instructions generation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto vectorizer restricts predication to load and store operations.

effect(TEMP dst, TEMP xtmp1, TEMP xtmp2, TEMP rtmp);
format %{ "vector_popcount_int $dst, $src\t! using $xtmp1, $xtmp2 and $rtmp as TEMP" %}
ins_encode %{
assert(UsePopCountInstruction, "not enabled");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert needs to be removed on this path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same a above, it should be OK to keep this as is since its already used in match_rule_supported method.

effect(TEMP dst, TEMP xtmp1, TEMP xtmp2, TEMP xtmp3, TEMP rtmp);
format %{ "vector_popcount_long $dst, $src\t! using $xtmp1, $xtmp2, $xtmp3, and $rtmp as TEMP" %}
ins_encode %{
assert(UsePopCountInstruction, "not enabled");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert needs to be removed on this path. Do you know why the testing didn't catch this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same a above, it should be OK to keep this as is since its already used in match_rule_supported method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is !VM_Version::supports_avx512_vpopcntdq() so UsePopCountInstruction should have been false.

// should be succeeded by its corresponding vector IR and following
// special handling should be removed.
if (bt == T_INT) {
__ evpmovqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we using evex instruction for avx path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AVX512 targets not supporting AVX512_POPCOUNTD feature can still have 64 byte vector operation, since long pop-count was only enabled for 512 bit vectors uptill now hence a down-casting EVEX instruction worked fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you fixed this is in your review comment resolution by adding a check for avx512vl around this instruction and a separate instruction generation for non avx512 path, thanks.

@sviswa7
Copy link
Collaborator

sviswa7 commented Apr 8, 2022

Other than couple of comments above, rest of the patch looks good to me.

Copy link
Collaborator

@sviswa7 sviswa7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the changes suggested here. No need for re-review.


// Use population count instruction if available.
if (supports_popcnt()) {
if (supports_popcnt() || supports_avx512_vpopcntdq() || supports_avx512_bitalg()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UsePopCountInstruction is only for scalar popcont. Extending it to Vector popcount is causing lot of confustion. Let us keep it for scalar only as below:
if (supports_popcnt()) {
...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UsePopcountInstruction implies generating direct popcount instruction for targets which support it. A bit count operation can also be implemented without popcount as depicted by this patch.

Copy link
Member Author

@jatin-bhateja jatin-bhateja Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree its adding some confusion since user may turn off UsePopcountInstruction and may want to generate optimized instructions for bit count over AVX2 targets using Vector API.

For SLP turning off UsePopcountInstruction will prevent creating scalar IR which will also inhabit inferencing of vector operation by auto vectorizer.

Comment on lines 8656 to 8660
predicate(UsePopCountInstruction &&
((is_subword_type(Matcher::vector_element_basic_type(n->in(1))) &&
VM_Version::supports_avx512_bitalg()) ||
(is_non_subword_integral_type(Matcher::vector_element_basic_type(n->in(1))) &&
VM_Version::supports_avx512_vpopcntdq())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replaced by:
predicate(is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));
Also no need to check for UsePopCountInstruction any where as it is only meant for scalar popcount.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I already cleaned this up, its used at couple of places and will refactor when I create mainline integration patch.

Comment on lines 8631 to 8635
predicate(UsePopCountInstruction &&
((is_subword_type(Matcher::vector_element_basic_type(n->in(1))) &&
VM_Version::supports_avx512_bitalg()) ||
(is_non_subword_integral_type(Matcher::vector_element_basic_type(n->in(1))) &&
VM_Version::supports_avx512_vpopcntdq())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace by:
predicate(is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));
Also no need to check for UsePopCountInstruction any where as it is only meant for scalar popcount.

Comment on lines 8675 to 8677
predicate(!UsePopCountInstruction ||
(!VM_Version::supports_avx512_vpopcntdq() && Matcher::vector_element_basic_type(n->in(1)) == T_INT) ||
(!VM_Version::supports_avx512_bitalg() && is_subword_type(Matcher::vector_element_basic_type(n->in(1)))));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replaced by:
predicate((Matcher::vector_element_basic_type(n->in(1)) != T_LONG)
&& !is_pop_count_instr_target(Matcher::vector_element_basic_type(n->in(1))));

StubRoutines::x86::_vector_reverse_byte_perm_mask_short = generate_vector_reverse_byte_perm_mask_short("perm_mask_short");

if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) {
if (VM_Version::supports_avx2() && (!VM_Version::supports_avx512_vpopcntdq() || !UsePopCountInstruction)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the reference to UsePopCountInstruction here.

StubRoutines::x86::_vector_reverse_byte_perm_mask_short = generate_vector_reverse_byte_perm_mask_short("perm_mask_short");

if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) {
if (VM_Version::supports_avx2() && (!VM_Version::supports_avx512_vpopcntdq() || !UsePopCountInstruction)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the reference to UsePopCountInstruction here.


instruct vpopcountL_avx_reg(vec dst, vec src, vec xtmp1, vec xtmp2, vec xtmp3, rRegP rtmp) %{
predicate(!VM_Version::supports_avx512_vpopcntdq());
predicate(!UsePopCountInstruction || !VM_Version::supports_avx512_vpopcntdq());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the reference to UsePopCountInstruction here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still a matcher path which can be chosen even if user turns off UsePopCountInstruction, relating this flag to an operation may not be correct. But I agree that to avoid any confusion we can remove its usage from pattern for the time being.

void C2_MacroAssembler::vector_popcount_integral_evex(BasicType bt, XMMRegister dst, XMMRegister src,
KRegister mask, bool merge, int vec_enc) {
assert(VM_Version::supports_avx512vl() || vec_enc == Assembler::AVX_512bit, "");
assert(UsePopCountInstruction, "");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this assert on UsePopCountInstruction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, its counter intuitive to enter into this control flow with UsePopcountInstruction being false.

@jatin-bhateja
Copy link
Member Author

@sviswa7 , your closing comments addressed.

@jatin-bhateja
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Apr 12, 2022

Going to push as commit 1bc4187.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 12, 2022
@openjdk openjdk bot closed this Apr 12, 2022
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Apr 12, 2022
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 12, 2022
@openjdk
Copy link

openjdk bot commented Apr 12, 2022

@jatin-bhateja Pushed as commit 1bc4187.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

4 participants