8336464: C2: Force CastX2P to be a two-address instruction #20159

fg1417 · 2024-07-12T13:59:23Z

This patch forces CastX2P to be a two-address instruction, so that C2 could allocate the same register for dst and src. Then we can remove the instruction completely in the assembly.

The motivation comes from some cast operations like castPP. The difference for ADLC between castPP and CastX2P lies in that CastX2P always has different types for dst and src. We can force ADLC to generate an extra two_adr() for CastX2P like it does automatically for castPP, which could tell register allocator that the instruction needs the same register for dst and src.

However, sometimes, RA and GCM in C2 can't work as we expected.

For example, we have Assembly on the existing code:

  ldp    x10, x11, [x17,#136]
  add    x10, x10, x15
  add    x11, x11, x10
  ldr    x12, [x17,#152]
  str    x16, [x10]
  add    x10, x12, x15
  str    x16, [x11]
  str    x16, [x10]

After applying the patch independently, the assembly is:

  ldr    x10, [x16,#136]  <--- 1
  add    x10, x10, x15
  ldr    x11, [x16,#144]  <--- 2
  mov    x13, x10         <--- 3
  str    x17, [x13]
  ldr    x12, [x16,#152]
  add    x10, x11, x10
  str    x17, [x10]
  add    x10, x12, x15
  str    x17, [x10]

C2 generates a totally extra mov, see 3, and we even lost the chance to merge load pair, see 1 and 2. That's terrible.

Although this scenario would disappear after combining with #20157, I'm still not sure if this patch is worthwhile.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8336464: C2: Force CastX2P to be a two-address instruction (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20159/head:pull/20159
$ git checkout pull/20159

Update a local copy of the PR:
$ git checkout pull/20159
$ git pull https://git.openjdk.org/jdk.git pull/20159/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20159

View PR using the GUI difftool:
$ git pr show -t 20159

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20159.diff

Using Webrev

Link to Webrev Comment

This patch forces `CastX2P` to be a two-address instruction, so that C2 could allocate the same register for dst and src. Then we can remove the instruction completely in the assembly. The motivation comes from some cast operations like `castPP`. The difference for ADLC between `castPP` and `CastX2P` lies in that `CastX2P` always has different types for dst and src. We can force ADLC to generate an extra `two_adr()` for `CastX2P` like it does automatically for `castPP`, which could tell register allocator that the instruction needs the same register for dst and src. However, sometimes, RA and GCM in C2 can't work as we expected. For example, we have Assembly on the existing code: ``` ldp x10, x11, [x17,openjdk#136] add x10, x10, x15 add x11, x11, x10 ldr x12, [x17,openjdk#152] str x16, [x10] add x10, x12, x15 str x16, [x11] str x16, [x10] ``` After applying the patch, the assembly is: ``` ldr x10, [x16,openjdk#136] <--- 1 add x10, x10, x15 ldr x11, [x16,openjdk#144] <--- 2 mov x13, x10 <--- 3 str x17, [x13] ldr x12, [x16,openjdk#152] add x10, x11, x10 str x17, [x10] add x10, x12, x15 str x17, [x10] ``` C2 generate a totally extra mov, see 3, and we even lost the chance to merge load pair, see 1 and 2. That's terrible. Although this scenario would disappear after combining with openjdk#20157, I'm still not sure if this patch is worthwhile.

bridgekeeper · 2024-07-12T13:59:50Z

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-07-12T14:00:17Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2024-07-12T14:00:39Z

@fg1417 The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

theRealAph · 2024-07-15T11:14:01Z

src/hotspot/share/adlc/output_h.cpp

    uint  matching_input = instr->two_address(_globalNames);

+#if defined(AARCH64)
+    // Allocate the same register for src and dst, then we can remove


This isn't so much AArch64 specific as specific to any machine that doesn't have separate address and data registers. x86 prefers some registers to form addresses, but for others perhaps either a target macro or a callback function in $cpu.ad.

Matcher::use_same_src_and_dest_reg_for_CastX2P ?

Thanks for your review, @theRealAph . Updated it in the new commit.

mlbridge · 2024-07-17T14:00:03Z

Webrevs

00: Full (daf42b40)

vnkozlov

I am not sure about this change. There reason we keep src and dst in different register for different types is most likely for cases when src could be used in other operations. Overwriting src register may give you more spills than before.

If there are no other src usages RA should handle this I think in shared code.

fg1417 · 2024-08-05T10:32:39Z

I am not sure about this change. There reason we keep src and dst in different register for different types is most likely for cases when src could be used in other operations. Overwriting src register may give you more spills than before.

If there are no other src usages RA should handle this I think in shared code.

Thanks for your review @vnkozlov .

In my initial idea, if we keep src and dst in the same register, when src is used in other operations, yes, we need to generate extra spill code like:

Spill src to new_src
CastX2P    src src
...  // Other operations use new_src

In the final code, we can remove CastX2P, it becomes:

mov    src new_src
...  // Other operations use new_src

If we keep src and dst in the different registers, we may get:

CastX2P    src dst
...  // Other operations use src

In the final code, it will be:

mov    src  dst
...  // Other operations use src

I thought that keeping src and dst might not generate extra movs because we can remove CastX2P itself finally.

But I tried some written cases showed in the PR description, which violated my thoughts in an unexpected way. Then I'm also not sure about it.

Anyway, it's a try and comments are welcome :)

vnkozlov · 2024-08-05T18:55:44Z

One idea I had long time ago JDK-6768706 is to add mach instructions with complex matching rules which uses CatX2P/CastP2X in common patterns. Then you can avoid moves between registers.

dean-long · 2024-08-05T19:28:15Z

I thought reusing one of the inputs for the destination is the default, and we have to add TEMP to rules to prevent this from happening. So I don't understand why sometimes the register allocator doesn't reuse the register when there are no other uses. There is a concept of "chain rule" in ADLC that I don't quite understand, but I suspect that it is related.

bridgekeeper · 2024-09-02T22:06:41Z

@fg1417 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

fg1417 · 2024-09-03T08:28:18Z

Thanks for all your reviews and comments! I'll come back when I find a better way. Now, I'd like to convert to draft :)

bridgekeeper · 2024-10-29T08:49:43Z

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2024-10-31T02:06:45Z

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2024-10-31T08:05:54Z

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2024-10-31T14:11:07Z

@fg1417 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2024-12-26T19:37:52Z

@fg1417 This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

theRealAph · 2024-12-27T08:05:48Z

You may be over-reacting. ldp is split by the decoder into two ops in every AArch64 implementation I've seen. Register-register mov takes zero cycles on all current AArch64 implementations. (Except perhaps the tiniest, which don't run much Java.) So you are seeing a less efficient encoding, not loss in performance. Granted, it's less efficient in icache and decode bandwidth, but that's not usually a bottleneck.

theRealAph · 2024-12-27T09:28:50Z

On the other hand, the move generated by CastX2P also takes zero cycles to execute, so I guess it's a wash.

fg1417 · 2025-01-07T16:41:59Z

You may be over-reacting. ldp is split by the decoder into two ops in every AArch64 implementation I've seen. Register-register mov takes zero cycles on all current AArch64 implementations. (Except perhaps the tiniest, which don't run much Java.) So you are seeing a less efficient encoding, not loss in performance. Granted, it's less efficient in icache and decode bandwidth, but that's not usually a bottleneck.

Hi @theRealAph, thanks for your comments. That quite makes sense to me and I learnt a lot!

I tried some hand-written assembly benchmarks. As you expected, the performance results of two-ldr version and one-load-pair version, whether using 32-bit registers or 64-bit registers, are very close on the V2 machine. But I checked the V2 software optimization guide, which says the execution latency and throughput of w-form ldp and ldr are the same. That confuses me. Does that mean, during the instruction execution, execution latency or throughput are not the main factors in this case?

On the other hand, the move generated by CastX2P also takes zero cycles to execute, so I guess it's a wash.

Yes, there may be no obvious performance uplift after removing these zero latency movs. One thing is the code size generated by C2 on aarch64 in some workloads is much larger than on x86, which is becoming a problem now. Not sure how much benefit it may bring for this problem, if we could improve the scenarios mentioned above by improving the matcher or register allocator.

Thanks.

openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jul 12, 2024

This was referenced Jul 12, 2024

8338442: AArch64: Clean up IndOffXX type and let legitimize_address() fix out-of-range operands #16991

Closed

8336245: AArch64: remove extra register copy when converting from long to pointer #20157

Closed

theRealAph reviewed Jul 15, 2024

View reviewed changes

Use an interface function instead of macro

daf42b4

fg1417 changed the title ~~Force CastX2P to be a two-address instruction~~ 8336464: Force CastX2P to be a two-address instruction Jul 17, 2024

fg1417 marked this pull request as ready for review July 17, 2024 13:57

openjdk bot added the rfr Pull request is ready for review label Jul 17, 2024

fg1417 changed the title ~~8336464: Force CastX2P to be a two-address instruction~~ 8336464: C2: Force CastX2P to be a two-address instruction Jul 17, 2024

vnkozlov reviewed Jul 30, 2024

View reviewed changes

fg1417 marked this pull request as draft September 3, 2024 08:29

openjdk bot removed the rfr Pull request is ready for review label Sep 3, 2024

bridgekeeper bot closed this Dec 26, 2024

8336464: C2: Force CastX2P to be a two-address instruction #20159

8336464: C2: Force CastX2P to be a two-address instruction #20159

Uh oh!

Conversation

fg1417 commented Jul 12, 2024 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewing

Uh oh!

bridgekeeper bot commented Jul 12, 2024

Uh oh!

openjdk bot commented Jul 12, 2024

Uh oh!

openjdk bot commented Jul 12, 2024

Uh oh!

theRealAph Jul 15, 2024

Choose a reason for hiding this comment

Uh oh!

theRealAph Jul 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fg1417 Jul 17, 2024

Choose a reason for hiding this comment

Uh oh!

mlbridge bot commented Jul 17, 2024

Webrevs

Uh oh!

vnkozlov left a comment

Choose a reason for hiding this comment

Uh oh!

fg1417 commented Aug 5, 2024

Uh oh!

vnkozlov commented Aug 5, 2024

Uh oh!

dean-long commented Aug 5, 2024

Uh oh!

bridgekeeper bot commented Sep 2, 2024

Uh oh!

fg1417 commented Sep 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bridgekeeper bot commented Oct 29, 2024

Uh oh!

bridgekeeper bot commented Oct 31, 2024

Uh oh!

bridgekeeper bot commented Oct 31, 2024

Uh oh!

bridgekeeper bot commented Oct 31, 2024

Uh oh!

bridgekeeper bot commented Dec 26, 2024

Uh oh!

theRealAph commented Dec 27, 2024

Uh oh!

theRealAph commented Dec 27, 2024

Uh oh!

fg1417 commented Jan 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

4 participants

fg1417 commented Jul 12, 2024 •

edited by openjdk bot

Loading

theRealAph Jul 15, 2024 •

edited

Loading

fg1417 commented Sep 3, 2024 •

edited

Loading