8286972: Support the new loop induction variable related PopulateIndex IR node on x86 #8778
| bool is_bw = ((elem_bt == T_BYTE) || (elem_bt == T_SHORT)); | ||
| bool is_bw_supported = VM_Version::supports_avx512bw(); | ||
| if (is_bw && !is_bw_supported) { | ||
| assert(vlen_enc != Assembler::AVX_512bit, "required"); | ||
| assert((dst->encoding() < 16) && (src1->encoding() < 16) && (src2->encoding() < 16), | ||
| "XMM register should be 0-15"); | ||
| } |
This whole block could be under #ifdef ASSERT.
| bool is_bw = ((elem_bt == T_BYTE) || (elem_bt == T_SHORT)); | ||
| bool is_bw_supported = VM_Version::supports_avx512bw(); | ||
| if (is_bw && !is_bw_supported) { | ||
| assert(vlen_enc != Assembler::AVX_512bit, "required"); |
What are acceptable values of vlen_enc?
For KNL, PopulateIndex support is limited to 256-bit as we need avx512bw() for the 512-bit support.
For other AVX2 and AVX512 architectures, all vector widths up to and including 512-bit are supported.
Okay.
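The debug-only checks quoted in this thread can be modelled outside HotSpot as a plain predicate. This is only a sketch: `VpaddPrecondition`, `ElemType`, `operandsOk`, and the `hasAvx512bw` flag are hypothetical names standing in for `elem_bt`, `vlen_enc`, the register encodings, and `VM_Version::supports_avx512bw()`; it mirrors the quoted asserts and is not the actual HotSpot code.

```java
public class VpaddPrecondition {
    // Hypothetical stand-in for HotSpot's BasicType values used in the snippet.
    enum ElemType { BYTE, SHORT, INT, LONG }

    /**
     * Returns true when the operands pass the checks from the quoted block:
     * byte/short lanes without AVX512BW must not use 512-bit vectors and
     * must stay within XMM registers 0-15.
     */
    static boolean operandsOk(ElemType elem, int vlenBits, boolean hasAvx512bw,
                              int dstEnc, int src1Enc, int src2Enc) {
        boolean isBw = (elem == ElemType.BYTE) || (elem == ElemType.SHORT);
        if (isBw && !hasAvx512bw) {
            return vlenBits != 512
                && dstEnc < 16 && src1Enc < 16 && src2Enc < 16;
        }
        // No extra restriction is modelled for the other cases.
        return true;
    }

    public static void main(String[] args) {
        // KNL-like configuration: AVX512 without the BW extension.
        System.out.println(operandsOk(ElemType.BYTE, 256, false, 0, 1, 2));
        System.out.println(operandsOk(ElemType.BYTE, 512, false, 0, 1, 2));
    }
}
```

On a KNL-like configuration the first call is accepted and the second is rejected, which matches the 256-bit limit described in the reply above.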
| } else { | ||
| assert(vlen_enc != Assembler::AVX_512bit, "required"); | ||
| assert((dst->encoding() < 16),"XMM register should be 0-15"); |
The } else { case will also be executed on a KNL CPU. Did you test with -XX:+UseKNLSetting?
Yes, this part will be executed on KNL CPU.
I did run the compiler tests with UseKNLSetting and didn't see any issue.
src/hotspot/cpu/x86/x86.ad
Outdated
| case Op_PopulateIndex: | ||
| if (!is_LP64) { | ||
| return false; | ||
| } | ||
| // fallthrough |
Please don't use fallthrough: RoundVF is not related to PopulateIndex. Why not if (!is_LP64 || UseAVX < 2)?
Are there limitations on 32-bit, or do you not want to spend time on a non-major platform (which is also understandable)?
It is not a major platform so I didn't spend time on it.
src/hotspot/cpu/x86/x86.ad
Outdated
| if (size_in_bits > 256 && !VM_Version::supports_avx512bw()) | ||
| return false; |
Use {} according to our style.
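The two review requests on x86.ad (an explicit guard instead of a fallthrough, and braces around single statements) can be illustrated with a small model. It is a sketch under assumptions: `PopulateIndexSupport` and `isSupported` are hypothetical, and the two checks are merged into one method here for brevity even though the review quotes them from separate places in the file.

```java
public class PopulateIndexSupport {
    /**
     * Models the support checks discussed in the review.
     * isLP64, useAVX, and hasAvx512bw stand in for is_LP64, UseAVX, and
     * VM_Version::supports_avx512bw().
     */
    static boolean isSupported(boolean isLP64, int useAVX,
                               int sizeInBits, boolean hasAvx512bw) {
        // Explicit guard, as suggested, rather than falling through
        // into an unrelated case label.
        if (!isLP64 || useAVX < 2) {
            return false;
        }
        // Braces even around a single statement, per the style comment.
        if (sizeInBits > 256 && !hasAvx512bw) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSupported(true, 2, 256, false));
        System.out.println(isSupported(true, 3, 512, false));
    }
}
```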
| } | ||
|
|
||
| void C2_MacroAssembler::vpadd(BasicType elem_bt, XMMRegister dst, XMMRegister src1, XMMRegister src2, int vlen_enc) { | ||
| assert(UseAVX >= 2, "required"); |
Why not include this line in the #ifdef?
It does not matter, since it is an assert, which adds code only in the debug VM. I like it this way.
vnkozlov
left a comment
Looks good. Let me test it.
@vnkozlov I will look into adding the IR framework and regression test.
vnkozlov
left a comment
Tier1-4 testing passed.
@vnkozlov Thanks a lot.
@vnkozlov I have added the IR framework jtreg test. Please review.
vnkozlov
left a comment
Good.
You need a second review.
@sviswa7 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 46 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
@vnkozlov Thanks a lot for the review.
| * @summary Test vectorization of loop induction variable usage in the loop | ||
| * @requires vm.compiler2.enabled | ||
| * @requires vm.cpu.features ~= ".*avx2.*" | ||
| * @requires os.arch=="amd64" | os.arch=="x86_64" |
Can we simplify os.arch=="amd64" | os.arch=="x86_64" to os.simpleArch == "x64"?
This test runs on x86 only. It would be nice if it could run on AArch64 as well. So perhaps something like:
 * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") |
 *           (os.simpleArch == "aarch64" & vm.cpu.features ~= ".*sve.*")
@pfustc Yes, I changed the @requires as you suggested.
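For reference, a test header using the suggested expression might look like the skeleton below. This is a hypothetical sketch, not the test added by this PR: the class name `RequiresSketch`, the `@run main` line, and the helper `isSimpleArchX64` are illustrative, and the helper only encodes the assumption behind the suggestion, namely that `os.simpleArch` reports `x64` for both `amd64` and `x86_64`.

```java
/*
 * @test
 * @summary Hypothetical skeleton showing the combined @requires expression
 * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") |
 *           (os.simpleArch == "aarch64" & vm.cpu.features ~= ".*sve.*")
 * @run main RequiresSketch
 */
public class RequiresSketch {
    // Assumed mapping: both spellings of the 64-bit x86 os.arch value
    // collapse to the single simple architecture name "x64".
    static boolean isSimpleArchX64(String osArch) {
        return osArch.equals("amd64") || osArch.equals("x86_64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println(arch + " treated as x64: " + isSimpleArchX64(arch));
    }
}
```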
| case T_FLOAT: case T_INT: evpbroadcastd(dst, src, vlen_enc); return; | ||
| case T_DOUBLE: case T_LONG: evpbroadcastq(dst, src, vlen_enc); return; |
Can't we use single- and double-precision broadcasts for floating-point types, as you have done in the else part?
It may save the domain switch-over penalty (Section 3.5.2.2 Bypass between Execution Domains, Intel® 64 and IA-32 Architectures Optimization Reference Manual).
The floating point broadcast doesn't take the gpr as second source.
A prior move, as in the else part, could be emitted for consistency. Or do you want to keep floating-point broadcasts only for the else part?
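To make the role of the broadcast concrete, here is a scalar model of what the snippets in this review suggest the instruct computes: the start value is broadcast to every lane, an iota vector (0, 1, 2, ...) is loaded, and the two are added lane-wise. The class and method names are hypothetical, and the fixed step of 1 reflects the `assert($src2$$constant == 1, "required")` line quoted elsewhere in this review; treat it as an illustration rather than a description of the generated code.

```java
import java.util.Arrays;

public class PopulateIndexModel {
    /** Scalar model: lane i of the result holds start + i (step fixed at 1). */
    static int[] populateIndex(int start, int lanes) {
        int[] broadcast = new int[lanes];
        Arrays.fill(broadcast, start);           // every lane = start

        int[] iota = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            iota[i] = i;                         // 0, 1, 2, ...
        }

        int[] dst = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            dst[i] = broadcast[i] + iota[i];     // lane-wise add
        }
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(populateIndex(5, 4)));
    }
}
```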
src/hotspot/cpu/x86/x86.ad
Outdated
| int vlen_in_bytes = Matcher::vector_length_in_bytes(this); | ||
| int vlen_enc = vector_length_encoding(this); | ||
| BasicType elem_bt = Matcher::vector_element_basic_type(this); | ||
| assert($src2$$constant == 1, "required"); |
Ideally, assertions should be the first statements in a block, since they determine the preconditions under which the code should execute.
src/hotspot/cpu/x86/x86.ad
Outdated
| effect(TEMP dst, TEMP vtmp, TEMP scratch); | ||
| format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %} | ||
| ins_encode %{ | ||
| int vlen_in_bytes = Matcher::vector_length_in_bytes(this); |
Matcher::vector_length can be used directly instead of the following computation on line 8274:
vlen_in_bytes/type2aelembytes(elem_bt)
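The computation being replaced is just the lane count: vector size in bytes divided by element size in bytes. A minimal Java illustration of that arithmetic follows; `VectorLengthSketch` and `lanesFromBytes` are hypothetical names, with `Integer.BYTES` and `Short.BYTES` playing the role of `type2aelembytes(elem_bt)`.

```java
public class VectorLengthSketch {
    /** Lane count = vector size in bytes / element size in bytes. */
    static int lanesFromBytes(int vlenInBytes, int elemBytes) {
        return vlenInBytes / elemBytes;
    }

    public static void main(String[] args) {
        // A 256-bit (32-byte) vector of ints.
        System.out.println(lanesFromBytes(32, Integer.BYTES));
        // A 512-bit (64-byte) vector of shorts.
        System.out.println(lanesFromBytes(64, Short.BYTES));
    }
}
```

Asking for the lane count directly avoids carrying both the byte length and the element type around just to redo this division.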
src/hotspot/cpu/x86/x86.ad
Outdated
| effect(TEMP dst, TEMP vtmp, TEMP scratch); | ||
| format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %} | ||
| ins_encode %{ | ||
| int vlen_in_bytes = Matcher::vector_length_in_bytes(this); |
Same as above.
src/hotspot/cpu/x86/x86.ad
Outdated
| int vlen_in_bytes = Matcher::vector_length_in_bytes(this); | ||
| int vlen_enc = vector_length_encoding(this); | ||
| BasicType elem_bt = Matcher::vector_element_basic_type(this); | ||
| assert($src2$$constant == 1, "required"); |
Same as above
|
|
||
| /** | ||
| * @test | ||
| * @summary Test vectorization of loop induction variable usage in the loop |
PR id missing.
| import java.util.Random; | ||
|
|
||
| public class TestPopulateIndex { | ||
| private static final int count = 65536; |
A smaller array size, around 10K, may work; we can also tune CompileThresholdScaling.
| } | ||
|
|
||
| public void checkResultIndexArrayFill() { | ||
| for (int i = 0; i < count; ++i) { |
Use post-increment for consistency.
| } | ||
|
|
||
| public void checkResultExprWithIndex2() { | ||
| for (int i = 0; i < count; ++i) { |
Use post-increment for the induction variable.
@jatin-bhateja Thanks a lot for the review. Your review comments are implemented.
LGTM. Thanks
I have to re-test it since the code changed in the .ad file.
@vnkozlov Thanks a lot, I will wait for your go-ahead.
My testing passed. You are good to push.
/integrate
Going to push as commit 5d8d6da.
Your commit was automatically rebased without conflicts.
| format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %} | ||
| ins_encode %{ | ||
| assert($src2$$constant == 1, "required"); | ||
| int vlen = Matcher::vector_length(this); |
May I ask why Matcher::vector_length() is used here, rather than Matcher::vector_length_in_bytes(), for load_iota_indices()? Thanks.
This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
This IR node was added as part of JDK-8280510.
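As an illustration of the kind of loop this targets, the sketch below shows two loops whose bodies use the induction variable as a value. The method names echo the benchmark names listed in this description, but the bodies are guesses written for illustration; they are not copied from the IndexVector benchmark or from the jtreg test.

```java
public class IndexLoops {
    /** Stores the loop index itself into each element. */
    static void indexArrayFill(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
    }

    /** Uses the loop index inside an arithmetic expression. */
    static void exprWithIndex(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * i;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        int[] b = new int[8];
        indexArrayFill(a);
        exprWithIndex(a, b);
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

In both loops a vectorized version needs a vector holding consecutive index values, which is the role the PopulateIndex node plays.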
The performance numbers are as follows:
Before:
Benchmark                   (count)   Mode  Cnt       Score        Error  Units
IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±   1126.396  ops/s
IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ±  11452.098  ops/s
IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±   1120.957  ops/s
After:
Benchmark                   (count)   Mode  Cnt       Score        Error  Units
IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±   2147.807  ops/s
IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±   6853.393  ops/s
IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ±  46930.779  ops/s
Please review.
Best Regards,
Sandhya
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778
$ git checkout pull/8778
Update a local copy of the PR:
$ git checkout pull/8778
$ git pull https://git.openjdk.java.net/jdk pull/8778/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 8778
View PR using the GUI difftool:
$ git pr show -t 8778
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8778.diff