Skip to content

Conversation

@sviswa7
Copy link

@sviswa7 sviswa7 commented May 18, 2022

This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
This IR node was added as part of JDK-8280510.

The performance numbers are as follows:
Before:
Benchmark (count) Mode Cnt Score Error Units
IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s
IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s
IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s

After:
Benchmark (count) Mode Cnt Score Error Units
IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s
IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s
IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s

Please review.

Best Regards,
Sandhya


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 reviewer)

Issue

  • JDK-8286972: Support the new loop induction variable related PopulateIndex IR node on x86

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778
$ git checkout pull/8778

Update a local copy of the PR:
$ git checkout pull/8778
$ git pull https://git.openjdk.java.net/jdk pull/8778/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8778

View PR using the GUI difftool:
$ git pr show -t 8778

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8778.diff

@bridgekeeper
Copy link

bridgekeeper bot commented May 18, 2022

👋 Welcome back sviswanathan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented May 18, 2022

@sviswa7 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 18, 2022
@sviswa7 sviswa7 marked this pull request as ready for review May 18, 2022 21:47
@openjdk openjdk bot added the rfr Pull request is ready for review label May 18, 2022
@mlbridge
Copy link

mlbridge bot commented May 18, 2022

Webrevs

Comment on lines +2279 to +2285
bool is_bw = ((elem_bt == T_BYTE) || (elem_bt == T_SHORT));
bool is_bw_supported = VM_Version::supports_avx512bw();
if (is_bw && !is_bw_supported) {
assert(vlen_enc != Assembler::AVX_512bit, "required");
assert((dst->encoding() < 16) && (src1->encoding() < 16) && (src2->encoding() < 16),
"XMM register should be 0-15");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block could be under #ifdef ASSERT.

bool is_bw = ((elem_bt == T_BYTE) || (elem_bt == T_SHORT));
bool is_bw_supported = VM_Version::supports_avx512bw();
if (is_bw && !is_bw_supported) {
assert(vlen_enc != Assembler::AVX_512bit, "required");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are acceptable values of vlen_enc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For KNL, PopulateIndex support is limited to 256-bit as we need avx512bw() for the 512-bit support.
For other AVX2 and AVX512 architectures, all vector widths up to and including 512-bit are supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Comment on lines +2312 to +2314
} else {
assert(vlen_enc != Assembler::AVX_512bit, "required");
assert((dst->encoding() < 16),"XMM register should be 0-15");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The } else { case will be also executed on on KNL CPU. Did you tested with -XX:+UseKNLSetting?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this part will be executed on KNL CPU.
I did run the compiler tests with UseKNLSetting and didn't see any issue.

Comment on lines 1471 to 1475
case Op_PopulateIndex:
if (!is_LP64) {
return false;
}
// fallthrough
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, don't do fallthrough - RoundVF is not related to PopulateIndex. Why not if (!is_LP64 || UseAVX < 2)?
Are there limitations in 32 bits or you don't want spend time on not major platform (which is also understandable)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a major platform so I didn't spend time on it.

Comment on lines 1820 to 1821
if (size_in_bits > 256 && !VM_Version::supports_avx512bw())
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use {} according to our style.

}

void C2_MacroAssembler::vpadd(BasicType elem_bt, XMMRegister dst, XMMRegister src1, XMMRegister src2, int vlen_enc) {
assert(UseAVX >= 2, "required");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not include this line in #ifdef ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not matter since it is assert which add code only in debug VM. I like this way.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Let me test it.

}

void C2_MacroAssembler::vpadd(BasicType elem_bt, XMMRegister dst, XMMRegister src1, XMMRegister src2, int vlen_enc) {
assert(UseAVX >= 2, "required");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not matter since it is assert which add code only in debug VM. I like this way.

bool is_bw = ((elem_bt == T_BYTE) || (elem_bt == T_SHORT));
bool is_bw_supported = VM_Version::supports_avx512bw();
if (is_bw && !is_bw_supported) {
assert(vlen_enc != Assembler::AVX_512bit, "required");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

@vnkozlov
Copy link
Contributor

vnkozlov commented May 19, 2022

@sviswa7 Can you add IR framework test to verify generation of PopulateIndex node? And regression test.
I see that 8280510 added only microbenchmark.

@sviswa7
Copy link
Author

sviswa7 commented May 19, 2022

@vnkozlov I will look into adding the IR framework and regression test.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tier1-4 testing passed.

@sviswa7
Copy link
Author

sviswa7 commented May 19, 2022

@vnkozlov thanks a lot.

@sviswa7
Copy link
Author

sviswa7 commented May 19, 2022

@vnkozlov I have added the IR framework jtreg test. Please review.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

You need second review

@openjdk
Copy link

openjdk bot commented May 20, 2022

@sviswa7 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8286972: Support the new loop induction variable related PopulateIndex IR node on x86

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 46 new commits pushed to the master branch:

  • 40e99a1: 8285308: Win: Japanese logical fonts are drawn with wrong size
  • 890771e: 8285517: System.getenv() returns unexpected value if environment variable has non ASCII character
  • de74e0e: 8280035: Use Class.isInstance instead of Class.isAssignableFrom where applicable
  • 9f562ef: 8286872: Refactor add/modify notification icon (TrayIcon)
  • b089229: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest
  • 079312c: 8286182: [BACKOUT] x86: Handle integral division overflow during parsing
  • 7b19226: 8267038: Update IANA Language Subtag Registry to Version 2022-03-02
  • e60d8b5: 8253176: Signature.getParameters should specify that it can throw UnsupportedOperationException
  • aa50625: 8286447: [Linux] AWT should start in Headless mode if headful AWT library not installed
  • 655500a: 8286654: Add an optional description accessor on ToolProvider interface
  • ... and 36 more: https://git.openjdk.java.net/jdk/compare/2ed75be659503da584cfec9ead5e27665ae900ef...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 20, 2022
@sviswa7
Copy link
Author

sviswa7 commented May 20, 2022

@vnkozlov Thanks a lot for the review.

* @summary Test vectorization of loop induction variable usage in the loop
* @requires vm.compiler2.enabled
* @requires vm.cpu.features ~= ".*avx2.*"
* @requires os.arch=="amd64" | os.arch=="x86_64"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify os.arch=="amd64" | os.arch=="x86_64" to os.simpleArch == "x64" ?

This test runs on x86 only. It would be nice if it can run on AArch64 as well. So perhaps something like

 28 * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") |
 29 *           (os.simpleArch == "aarch64" & vm.cpu.features ~= ".*sve.*")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfustc Yes, changed the requires to as suggested by you.

Comment on lines +2310 to +2311
case T_FLOAT: case T_INT: evpbroadcastd(dst, src, vlen_enc); return;
case T_DOUBLE: case T_LONG: evpbroadcastq(dst, src, vlen_enc); return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use single and double precision broadcasts for floating point types, like you have done in else part
It may save domain switch over penalty (Section 3.5.2.2 Bypass between Execution Domains, Intel® 64 and IA-32 Architectures Optimization Reference Manual)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The floating point broadcast doesn't take the gpr as second source.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A prior move as in else part may be emitted for consistency or you want to keep floating point broadcasts only for else part.

int vlen_in_bytes = Matcher::vector_length_in_bytes(this);
int vlen_enc = vector_length_encoding(this);
BasicType elem_bt = Matcher::vector_element_basic_type(this);
assert($src2$$constant == 1, "required");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally assertion should be the first statement in a block, since they determine the pre-conditions under which code should executed.

effect(TEMP dst, TEMP vtmp, TEMP scratch);
format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %}
ins_encode %{
int vlen_in_bytes = Matcher::vector_length_in_bytes(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matcher::vector_length can be directly used instead of following computation in line 8274
vlen_in_bytes/type2aelembytes(elem_bt)

effect(TEMP dst, TEMP vtmp, TEMP scratch);
format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %}
ins_encode %{
int vlen_in_bytes = Matcher::vector_length_in_bytes(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

int vlen_in_bytes = Matcher::vector_length_in_bytes(this);
int vlen_enc = vector_length_encoding(this);
BasicType elem_bt = Matcher::vector_element_basic_type(this);
assert($src2$$constant == 1, "required");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above


/**
* @test
* @summary Test vectorization of loop induction variable usage in the loop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR id missing.

import java.util.Random;

public class TestPopulateIndex {
private static final int count = 65536;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small array size around 10K may work, we can also tune CompileThresholdScaling.

}

public void checkResultIndexArrayFill() {
for (int i = 0; i < count; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-incrementation for consistency.

}

public void checkResultExprWithIndex2() {
for (int i = 0; i < count; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-increment induction.

@sviswa7
Copy link
Author

sviswa7 commented May 20, 2022

@jatin-bhateja Thanks a lot for the review. Your review comments are implemented.

@jatin-bhateja
Copy link
Member

LGTM. Thanks

@vnkozlov
Copy link
Contributor

I have to re-test it since code change in .ad file.

@sviswa7
Copy link
Author

sviswa7 commented May 20, 2022

@vnlozlov Thanks a lot, I will wait for your go ahead.

@vnkozlov
Copy link
Contributor

My testing passed. You are good to push.

@sviswa7
Copy link
Author

sviswa7 commented May 23, 2022

/integrate

@openjdk
Copy link

openjdk bot commented May 23, 2022

Going to push as commit 5d8d6da.
Since your change was applied there have been 73 commits pushed to the master branch:

  • 8122466: 8287113: JFR: Periodic task thread uses period for method sampling events
  • 940e94f: 8285739: disable EscapeBarrier deopt for virtual threads
  • 110d906: 8287103: java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java fails with Xcomp
  • 8040aa0: 8286908: ECDSA signature should not return parameters
  • 689f80c: 8287153: Whitespace typos in HeapRegion class
  • c906591: 8286391: Address possibly lossy conversions in jdk.compiler
  • 88018c4: 8287150: Remove HeapRegion::block_start_const declaration without definition
  • 81f128b: 8287154: java/nio/channels/FileChannel/LargeMapTest.java does not compile
  • 7becf13: 8286825: Linker naming cleanup
  • a0042de: 8276549: Improve documentation about ContainerPtr encoding
  • ... and 63 more: https://git.openjdk.java.net/jdk/compare/2ed75be659503da584cfec9ead5e27665ae900ef...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 23, 2022
@openjdk openjdk bot closed this May 23, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 23, 2022
@openjdk
Copy link

openjdk bot commented May 23, 2022

@sviswa7 Pushed as commit 5d8d6da.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %}
ins_encode %{
assert($src2$$constant == 1, "required");
int vlen = Matcher::vector_length(this);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why use Matcher::vector_length() here, rather than Matcher::vector_length_in_bytes(), for load_iota_indices()? Thanks.

@sviswa7 sviswa7 deleted the popindex branch June 3, 2024 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

5 participants