
JIT: Move backward jumps to before their successors after RPO-based layout #102461

Merged 6 commits on May 21, 2024

Conversation

amanasifkhalid (Member) commented May 20, 2024

Part of #93020. In #102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top (comment). To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.

Since the new layout isn't enabled in CI yet, here are the diffs with the baseline using the new RPO layout, on Windows x64: gist. This change is a net size regression, though the PerfScore diffs look promising, and I see many instances of the inverted loop shape fixed. TP impact is <=0.03% over doing just the RPO-based layout.

cc @dotnet/jit-contrib

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) May 20, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

amanasifkhalid (Member, Author) commented May 20, 2024

New codegen for the example linked in this comment:

; Assembly listing for method Program:Sum(int[]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T02] (  4,  7   )     ref  ->  rcx         class-hnd single-def <int[]>
;* V01 loc0         [V01,T05] (  0,  0   )     int  ->  zero-ref
;  V02 loc1         [V02,T04] (  4,  6   )     int  ->  rax
;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3, 10   )     int  ->  r10         "CSE #04: aggressive"
;  V05 cse1         [V05,T03] (  2,  9   )     int  ->  rdx         hoist "CSE #01: aggressive"
;  V06 rat0         [V06,T00] (  5, 17   )    long  ->   r8         "Widened IV V01"
;
; Lcl frame size = 40

G_M53154_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M53154_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       xor      r8d, r8d
       jmp      SHORT G_M53154_IG04
                                                ;; size=10 bbWeight=1 PerfScore 4.50
G_M53154_IG03:  ;; offset=0x000E
       add      eax, r10d
       inc      r8d
                                                ;; size=6 bbWeight=2 PerfScore 1.00
G_M53154_IG04:  ;; offset=0x0014
       cmp      edx, r8d
       jle      SHORT G_M53154_IG06
                                                ;; size=5 bbWeight=8 PerfScore 10.00
G_M53154_IG05:  ;; offset=0x0019
       mov      r10d, dword ptr [rcx+4*r8+0x10]
       test     r10d, r10d
       jne      SHORT G_M53154_IG03
                                                ;; size=10 bbWeight=4 PerfScore 13.00
G_M53154_IG06:  ;; offset=0x0023
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 40, prolog size 4, PerfScore 30.00, instruction count 14, allocated bytes for code 40 (MethodHash=477a305d) for method Program:Sum(int[]):int (FullOpts)
; ============================================================

amanasifkhalid (Member, Author) commented May 20, 2024

I added a condition to fgMoveBlocksToHottestSuccessors to ensure loop heads are moved to the top of the loop, and updated the diffs link above. Adding this condition worsened size/PerfScore regressions, though I imagine we want to take this on principle? Here's the new layout for the System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) benchmark (notice BB11 now precedes BB12):

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      897152 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB09 [0009]  1       BB01                  1.00   896255 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   896255 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB11 [0018]  1       BB12                 15.66 14051500 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB12 [0019]  2       BB10,BB11            16.65 14939683 [0F7..107)-> BB11(0.941),BB21(0.0595)  ( cond )                     i IBC bwd bwd-src
BB21 [0028]  4       BB12,BB13,BB16,BB20   1.00   896255 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd bwd-src
BB23 [0029]  3       BB01,BB08,BB21        1.00   897152 [15E..16E)                           (return)                     i IBC
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB21(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB21(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB20 [0027]  1       BB21                  0           0 [14F..15A)-> BB21(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Edit: Looking at the diffs again, I don't think this exception is a good idea. Reverted.

JulieLeeMSFT added the Priority:2 label (work that is important, but not critical for the release) May 20, 2024
AndyAyersMS (Member) commented

It's probably worth considering only BBJ_ALWAYS and BBJ_COND here, which makes finding the preferred successor simpler.

In the phoenix version the similar bit of code keyed off of RPO back edges. If a back edge source was not a BBJ_COND, and the target of the back edge was not following its most likely pred, we'd move the back edge source block up. Another way of saying this is that we'd really like the last block in the loop body to be an exit block.

amanasifkhalid (Member, Author) commented May 20, 2024

In the phoenix version the similar bit of code keyed off of RPO back edges. If a back edge source was not a BBJ_COND, and the target of the back edge was not following its most likely pred, we'd move the back edge source block up. Another way of saying this is that we'd really like the last block in the loop body to be an exit block.

I see, I think I can simplify this PR quite a bit then. I'm guessing Phoenix wouldn't do this for BBJ_COND blocks with hot back-edges because those could be loop exits? It might be too narrow in scope, but maybe fgMoveBlocksToHottestSuccessors should only consider BBJ_ALWAYS back-edges to fix any mis-rotated loops?

(also, sorry to loop you in while you're OOF)

amanasifkhalid changed the title from "JIT: Move blocks up to their hottest successors after RPO-based layout" to "JIT: Move backward jumps to before their successors after RPO-based layout" May 20, 2024
amanasifkhalid (Member, Author) commented

I've simplified my approach to only consider backward, unconditional jumps, and this seems to solve all the mis-rotated loop examples we looked at in #102343 while keeping churn and TP cost pretty low. This implementation is pretty narrow in the flowgraph shapes it's trying to fix, but since this work was motivated by the misshapen loop regressions, I think we can keep this conservative until we find a need for other passes to tweak the RPO-based order.

@EgorBo do you have any time to look at this today? If not, no worries; I think @jakobbotsch will be online tomorrow. Thanks!

EgorBo (Member) commented May 20, 2024

LGTM as far as I can tell. I presume the newer commits you've pushed should still produce zero diffs, since the RPO layout is not enabled yet, right?

amanasifkhalid (Member, Author) commented

I presume the newer commits you've pushed should still be zero diffs since it's RPO is not enabled, right?

Yes, that's right. I re-ran SPMI locally and updated the gist. TP impact is now <= 0.03%.

jakobbotsch (Member) left a review comment

LGTM as well. I think a simple targeted fix like this is just fine, though of course it would be nice to have something a little more general in the future.

I saw you commented on #9304, but I wasn't sure what the final result ended up being; does this PR help there as well?

amanasifkhalid (Member, Author) commented

Thank you for the reviews! No diffs.

I saw you commented on #9304, but I wasn't sure what the final result ended up being; does this PR help there as well?

It does fix the loop rotation issue there as well, though the final layout isn't any better than what we started with. I'll elaborate more over there.

amanasifkhalid merged commit 68f3edc into dotnet:main on May 21, 2024
104 of 107 checks passed
amanasifkhalid deleted the move-hot-blocks branch on May 21, 2024
amanasifkhalid added a commit that referenced this pull request May 22, 2024
Follow-up to #102461, and part of #9304. Compacting blocks after establishing an RPO-based layout, but before moving backward jumps to fall into their successors, can enable more opportunities for branch removal; see #9304 for one such example.
steveharter pushed a commit to steveharter/runtime that referenced this pull request May 28, 2024
Follow-up to dotnet#102461, and part of dotnet#9304. Compacting blocks after establishing an RPO-based layout, but before moving backward jumps to fall into their successors, can enable more opportunities for branch removal; see dotnet#9304 for one such example.
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
…ayout (dotnet#102461)

Part of dotnet#93020. In dotnet#102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
Follow-up to dotnet#102461, and part of dotnet#9304. Compacting blocks after establishing an RPO-based layout, but before moving backward jumps to fall into their successors, can enable more opportunities for branch removal; see dotnet#9304 for one such example.
github-actions bot locked and limited conversation to collaborators Jun 21, 2024