Skip to content

Conversation

@fg1417
Copy link

@fg1417 fg1417 commented Jul 12, 2024

This patch forces CastX2P to be a two-address instruction, so that C2 could allocate the same register for dst and src. Then we can remove the instruction completely in the assembly.

The motivation comes from some cast operations like castPP. The difference for ADLC between castPP and CastX2P lies in that CastX2P always has different types for dst and src. We can force ADLC to generate an extra two_adr() for CastX2P like it does automatically for castPP, which could tell register allocator that the instruction needs the same register for dst and src.

However, sometimes, RA and GCM in C2 can't work as we expected.

For example, we have Assembly on the existing code:

  ldp    x10, x11, [x17,#136]
  add    x10, x10, x15
  add    x11, x11, x10
  ldr    x12, [x17,#152]
  str    x16, [x10]
  add    x10, x12, x15
  str    x16, [x11]
  str    x16, [x10]

After applying the patch independently, the assembly is:

  ldr    x10, [x16,#136]  <--- 1
  add    x10, x10, x15
  ldr    x11, [x16,#144]  <--- 2
  mov    x13, x10         <--- 3
  str    x17, [x13]
  ldr    x12, [x16,#152]
  add    x10, x11, x10
  str    x17, [x10]
  add    x10, x12, x15
  str    x17, [x10]

C2 generates a totally extra mov, see 3, and we even lost the chance to merge load pair, see 1 and 2. That's terrible.

Although this scenario would disappear after combining with #20157, I'm still not sure if this patch is worthwhile.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8336464: C2: Force CastX2P to be a two-address instruction (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20159/head:pull/20159
$ git checkout pull/20159

Update a local copy of the PR:
$ git checkout pull/20159
$ git pull https://git.openjdk.org/jdk.git pull/20159/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20159

View PR using the GUI difftool:
$ git pr show -t 20159

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20159.diff

Using Webrev

Link to Webrev Comment

This patch forces `CastX2P` to be a two-address instruction,
so that C2 could allocate the same register for dst and
src. Then we can remove the instruction completely in the
assembly.

The motivation comes from some cast operations like `castPP`.
The difference for ADLC between `castPP` and `CastX2P` lies in
that `CastX2P` always has different types for dst and src.
We can force ADLC to generate an extra `two_adr()` for `CastX2P`
like it does automatically for `castPP`, which could tell register
allocator that the instruction needs the same register for dst
and src.

However, sometimes, RA and GCM in C2 can't work as we expected.

For example, we have Assembly on the existing code:
```
  ldp    x10, x11, [x17,openjdk#136]
  add    x10, x10, x15
  add    x11, x11, x10
  ldr    x12, [x17,openjdk#152]
  str    x16, [x10]
  add    x10, x12, x15
  str    x16, [x11]
  str    x16, [x10]
```

After applying the patch, the assembly is:
```
  ldr    x10, [x16,openjdk#136]  <--- 1
  add    x10, x10, x15
  ldr    x11, [x16,openjdk#144]  <--- 2
  mov    x13, x10         <--- 3
  str    x17, [x13]
  ldr    x12, [x16,openjdk#152]
  add    x10, x11, x10
  str    x17, [x10]
  add    x10, x12, x15
  str    x17, [x10]
```

C2 generate a totally extra mov, see 3, and we even lost the chance
to merge load pair, see 1 and 2. That's terrible.

Although this scenario would disappear after combining with
openjdk#20157, I'm
still not sure if this patch is worthwhile.
@bridgekeeper
Copy link

bridgekeeper bot commented Jul 12, 2024

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 12, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk
Copy link

openjdk bot commented Jul 12, 2024

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

uint matching_input = instr->two_address(_globalNames);

#if defined(AARCH64)
// Allocate the same register for src and dst, then we can remove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't so much AArch64 specific as specific to any machine that doesn't have separate address and data registers. x86 prefers some registers to form addresses, but for others perhaps either a target macro or a callback function in $cpu.ad.

Copy link
Contributor

@theRealAph theRealAph Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matcher::use_same_src_and_dest_reg_for_CastX2P ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, @theRealAph . Updated it in the new commit.

@fg1417 fg1417 changed the title Force CastX2P to be a two-address instruction 8336464: Force CastX2P to be a two-address instruction Jul 17, 2024
@fg1417 fg1417 marked this pull request as ready for review July 17, 2024 13:57
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 17, 2024
@mlbridge
Copy link

mlbridge bot commented Jul 17, 2024

Webrevs

@fg1417 fg1417 changed the title 8336464: Force CastX2P to be a two-address instruction 8336464: C2: Force CastX2P to be a two-address instruction Jul 17, 2024
Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this change. There reason we keep src and dst in different register for different types is most likely for cases when src could be used in other operations. Overwriting src register may give you more spills than before.

If there are no other src usages RA should handle this I think in shared code.

@fg1417
Copy link
Author

fg1417 commented Aug 5, 2024

I am not sure about this change. There reason we keep src and dst in different register for different types is most likely for cases when src could be used in other operations. Overwriting src register may give you more spills than before.

If there are no other src usages RA should handle this I think in shared code.

Thanks for your review @vnkozlov .

In my initial idea, if we keep src and dst in the same register, when src is used in other operations, yes, we need to generate extra spill code like:

Spill src to new_src
CastX2P    src src
...  // Other operations use new_src

In the final code, we can remove CastX2P, it becomes:

mov    src new_src
...  // Other operations use new_src

If we keep src and dst in the different registers, we may get:

CastX2P    src dst
...  // Other operations use src

In the final code, it will be:

mov    src  dst
...  // Other operations use src

I thought that keeping src and dst might not generate extra movs because we can remove CastX2P itself finally.

But I tried some written cases showed in the PR description, which violated my thoughts in an unexpected way. Then I'm also not sure about it.

Anyway, it's a try and comments are welcome :)

@vnkozlov
Copy link
Contributor

vnkozlov commented Aug 5, 2024

One idea I had long time ago JDK-6768706 is to add mach instructions with complex matching rules which uses CatX2P/CastP2X in common patterns. Then you can avoid moves between registers.

@dean-long
Copy link
Member

I thought reusing one of the inputs for the destination is the default, and we have to add TEMP to rules to prevent this from happening. So I don't understand why sometimes the register allocator doesn't reuse the register when there are no other uses. There is a concept of "chain rule" in ADLC that I don't quite understand, but I suspect that it is related.

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 2, 2024

@fg1417 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@fg1417
Copy link
Author

fg1417 commented Sep 3, 2024

Thanks for all your reviews and comments! I'll come back when I find a better way. Now, I'd like to convert to draft :)

@fg1417 fg1417 marked this pull request as draft September 3, 2024 08:29
@openjdk openjdk bot removed the rfr Pull request is ready for review label Sep 3, 2024
@bridgekeeper
Copy link

bridgekeeper bot commented Oct 29, 2024

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

3 similar comments
@bridgekeeper
Copy link

bridgekeeper bot commented Oct 31, 2024

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 31, 2024

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 31, 2024

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 26, 2024

@fg1417 This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Dec 26, 2024
@theRealAph
Copy link
Contributor

You may be over-reacting. ldp is split by the decoder into two ops in every AArch64 implementation I've seen. Register-register mov takes zero cycles on all current AArch64 implementations. (Except perhaps the tiniest, which don't run much Java.) So you are seeing a less efficient encoding, not loss in performance. Granted, it's less efficient in icache and decode bandwidth, but that's not usually a bottleneck.

@theRealAph
Copy link
Contributor

On the other hand, the move generated by CastX2P also takes zero cycles to execute, so I guess it's a wash.

@fg1417
Copy link
Author

fg1417 commented Jan 7, 2025

You may be over-reacting. ldp is split by the decoder into two ops in every AArch64 implementation I've seen. Register-register mov takes zero cycles on all current AArch64 implementations. (Except perhaps the tiniest, which don't run much Java.) So you are seeing a less efficient encoding, not loss in performance. Granted, it's less efficient in icache and decode bandwidth, but that's not usually a bottleneck.

Hi @theRealAph, thanks for your comments. That quite makes sense to me and I learnt a lot!

I tried some hand-written assembly benchmarks. As you expected, the performance results of two-ldr version and one-load-pair version, whether using 32-bit registers or 64-bit registers, are very close on the V2 machine. But I checked the V2 software optimization guide, which says the execution latency and throughput of w-form ldp and ldr are the same. That confuses me. Does that mean, during the instruction execution, execution latency or throughput are not the main factors in this case?

On the other hand, the move generated by CastX2P also takes zero cycles to execute, so I guess it's a wash.

Yes, there may be no obvious performance uplift after removing these zero latency movs. One thing is the code size generated by C2 on aarch64 in some workloads is much larger than on x86, which is becoming a problem now. Not sure how much benefit it may bring for this problem, if we could improve the scenarios mentioned above by improving the matcher or register allocator.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

4 participants