JIT: Enable RPO-based block layout by default #102343

Merged on May 22, 2024 (1 commit)

amanasifkhalid
Member

Part of #93020. Enables the new greedy RPO-based block layout by default. By fully switching over to the new layout algorithm, we can get rid of a lot of code that probably isn't useful anymore -- aside from the old layout, we should consider removing code that prematurely tries to maintain a certain ordering, like fgFindInsertPoint. I'm not going to do any of this removal just yet, just in case we want to keep the old implementation around for now.

We now have about a week of useful data from the rpolayout experiment in the perf lab. Here's a PDF/CDF of the minimum benchmark execution times from the last 5 days, on Windows x64:

[Image: PDF/CDF of minimum benchmark execution times over the last 5 days, Windows x64]

Many (most?) of those datapoints are within the realm of noise. Here's a brief breakdown of the nontrivial improvements/regressions on x64, using the min/median/max benchmark execution times from the last 5 days:

Windows x64, min execution time
22.85% improved by >=2%; 14.64% regressed
9.54% improved by >=5%; 6.85% regressed
2.97% improved by >=10%; 2.97% regressed

Ubuntu x64, min execution time
27.53% improved by >=2%; 12.91% regressed
11.11% improved by >=5%; 5.69% regressed
3.68% improved by >=10%; 2.19% regressed

Windows x64, median execution time
26.62% improved by >=2%; 13.73% regressed
12.23% improved by >=5%; 6.89% regressed
4.32% improved by >=10%; 2.83% regressed

Ubuntu x64, median execution time
26.20% improved by >=2%; 12.01% regressed
11.17% improved by >=5%; 5.92% regressed
3.96% improved by >=10%; 2.31% regressed

Windows x64, max execution time
30.45% improved by >=2%; 18.67% regressed
17.01% improved by >=5%; 10.51% regressed
7.06% improved by >=10%; 5.11% regressed

Ubuntu x64, max execution time
31.38% improved by >=2%; 22.89% regressed
16.02% improved by >=5%; 11.26% regressed
6.92% improved by >=10%; 5.60% regressed
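
For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of how the buckets could be computed. It assumes each benchmark has one baseline time and one rpolayout time (for example, the min over the last 5 days); the function name and the sample timings are made up for illustration and are not taken from the perf lab tooling.

```python
def breakdown(base_times, diff_times, thresholds=(0.02, 0.05, 0.10)):
    """For each threshold, return (% improved, % regressed) across benchmarks."""
    # Relative change in execution time; negative means the diff run was faster.
    changes = [(d - b) / b for b, d in zip(base_times, diff_times)]
    total = len(changes)
    result = {}
    for t in thresholds:
        improved = sum(1 for c in changes if c <= -t)
        regressed = sum(1 for c in changes if c >= t)
        result[t] = (100.0 * improved / total, 100.0 * regressed / total)
    return result


# Made-up timings (ns) for four hypothetical benchmarks.
base = [100.0, 200.0, 50.0, 80.0]
diff = [88.0, 206.0, 56.0, 80.0]
summary = breakdown(base, diff)
```

Each bucket is cumulative, so a benchmark that regressed by 12% is counted under the 2%, 5%, and 10% thresholds alike.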

As of writing, 145 of 4,879 benchmarks regressed by 10% or more on Windows x64, when looking at their minimum execution times from the last 5 days. Block layout churn can have far-reaching consequences, so narrowing down which methods to look at when triaging regressions can be tricky. I've highlighted a few regressed benchmarks below with simple enough call graphs that the offending method is obvious; I think these examples highlight a few expected trends from the new layout algorithm:

System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..007)-> BB05(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0007]  1       BB01                  0.80  80 [006..007)-> BB05(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0008]  1       BB02                  0.64  64 [006..007)-> BB05(0.48),BB04(0.52)   ( cond )                     i IBC
BB04 [0009]  1       BB03                  0.33  33 [006..007)-> BB06(1)                 (always)                     i IBC
BB06 [0011]  2       BB04,BB05             1    100 [006..00E)                           (return)                     i IBC
BB05 [0010]  3       BB01,BB02,BB03        0.67  67 [006..007)-> BB06(1)                 (always)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..007)-> BB05(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0007]  1       BB01                  0.80  80 [006..007)-> BB05(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0008]  1       BB02                  0.64  64 [006..007)-> BB05(0.48),BB04(0.52)   ( cond )                     i IBC
BB04 [0009]  1       BB03                  0.33  33 [006..007)-> BB06(1)                 (always)                     i IBC
BB05 [0010]  3       BB01,BB02,BB03        0.67  67 [006..007)-> BB06(1)                 (always)                     i IBC
BB06 [0011]  2       BB04,BB05             1    100 [006..00E)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The new layout places BB06 after BB05, breaking up the fallthrough from BB04 to BB06. This has the benefit of removing a backward jump from BB05 to BB06, though in the case of this benchmark, it looks like the return path BB03->BB04->BB06 is taken, so we're penalized by the new jump over BB05. This benchmark is quite small, so the impact of the jump is big, regressing it by about 27%.

We could tweak the RPO-based layout by moving blocks up to just after their hottest predecessor to address this (@AndyAyersMS showed me something similar he did in Phoenix), though in this case, the block weights suggest BB05 is BB06's hottest predecessor, so I don't think there's anything worth changing here, in terms of the block layout algorithm itself.
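
To make the "hottest predecessor" point concrete, here is a rough sketch that scores both layouts by the profile weight of the edges that become fallthroughs. The weights and likelihoods are copied from the dump above; the scoring function itself is just one plausible metric I'm assuming for illustration, not something the JIT computes today.

```python
# Block weights and edge likelihoods copied from the IsIdentityBenchmark dump.
WEIGHT = {"BB01": 1.0, "BB02": 0.80, "BB03": 0.64,
          "BB04": 0.33, "BB05": 0.67, "BB06": 1.0}
SUCCS = {
    "BB01": {"BB05": 0.2, "BB02": 0.8},
    "BB02": {"BB05": 0.2, "BB03": 0.8},
    "BB03": {"BB05": 0.48, "BB04": 0.52},
    "BB04": {"BB06": 1.0},
    "BB05": {"BB06": 1.0},
    "BB06": {},  # return block
}


def fallthrough_weight(order):
    """Sum the weight of every edge whose target is laid out right after its source."""
    total = 0.0
    for block, next_block in zip(order, order[1:]):
        likelihood = SUCCS[block].get(next_block)
        if likelihood is not None:
            total += WEIGHT[block] * likelihood
    return total


base_layout = ["BB01", "BB02", "BB03", "BB04", "BB06", "BB05"]
diff_layout = ["BB01", "BB02", "BB03", "BB04", "BB05", "BB06"]
base_score = fallthrough_weight(base_layout)
diff_score = fallthrough_weight(diff_layout)
```

Under this metric the diff layout scores higher (about 2.44 versus 2.10), because the profile claims BB05->BB06 is hotter than BB04->BB06. The regression comes from the benchmark not matching that profile.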

System.Numerics.Tests.Perf_Matrix3x2.InequalityOperatorBenchmark regressed by about 19% for similar reasons.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..000)-> BB04(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0011]  1       BB01                  0.80  80 [000..000)-> BB04(0.48),BB03(0.52)   ( cond )                     i IBC internal
BB03 [0012]  1       BB02                  0.42  42 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB05 [0014]  2       BB03,BB04             1    100 [010..010)                           (return)                     i IBC
BB04 [0013]  2       BB01,BB02             0.58  58 [000..000)-> BB05(1)                 (always)                     i IBC internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..000)-> BB04(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0011]  1       BB01                  0.80  80 [000..000)-> BB04(0.48),BB03(0.52)   ( cond )                     i IBC internal
BB03 [0012]  1       BB02                  0.42  42 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB04 [0013]  2       BB01,BB02             0.58  58 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB05 [0014]  2       BB03,BB04             1    100 [010..010)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Based on the PGO data available, the new layout seems to be making better decisions. We could iterate on this by synthesizing likelihoods and/or repairing the profile pre-layout, and by adding a post-RPO layout heuristic that moves blocks up to their hottest predecessor.
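
As a sketch of what such a post-layout heuristic might look like, the snippet below picks a block's hottest predecessor by edge weight and moves the block to just after it. The data comes from the InequalityOperatorBenchmark dump; the helper names are hypothetical, and a real implementation would need to handle EH regions, loops, and ties, which this ignores.

```python
WEIGHT = {"BB01": 1.0, "BB02": 0.80, "BB03": 0.42, "BB04": 0.58, "BB05": 1.0}
SUCCS = {
    "BB01": {"BB04": 0.2, "BB02": 0.8},
    "BB02": {"BB04": 0.48, "BB03": 0.52},
    "BB03": {"BB05": 1.0},
    "BB04": {"BB05": 1.0},
    "BB05": {},  # return block
}


def hottest_pred(block):
    """Return the predecessor with the heaviest edge into `block`, or None."""
    best, best_weight = None, 0.0
    for pred, succs in SUCCS.items():
        if block in succs:
            edge_weight = WEIGHT[pred] * succs[block]
            if edge_weight > best_weight:
                best, best_weight = pred, edge_weight
    return best


def move_after_hottest_pred(order, block):
    """Return a new order with `block` placed right after its hottest predecessor."""
    pred = hottest_pred(block)
    if pred is None:
        return list(order)
    new_order = [b for b in order if b != block]
    new_order.insert(new_order.index(pred) + 1, block)
    return new_order


base_layout = ["BB01", "BB02", "BB03", "BB05", "BB04"]
diff_layout = ["BB01", "BB02", "BB03", "BB04", "BB05"]
moved_base = move_after_hottest_pred(base_layout, "BB05")
moved_diff = move_after_hottest_pred(diff_layout, "BB05")
```

Here BB04 is BB05's hottest predecessor, so the heuristic leaves the diff layout alone and would turn the base layout into the diff layout.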

System.Threading.Tests.Perf_Interlocked.CompareExchange_object_Match regressed by over 40%.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0002]  1       BB01                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB03 [0003]  1       BB01                  1    100 [000..009)                           (return)                     i IBC jmp hascall
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB02(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0003]  1       BB01                  1    100 [000..009)                           (return)                     i IBC jmp hascall
BB02 [0002]  1       BB01                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

This is an interesting case, where the exceptional path BB01->BB02 is actually the more likely path, if the edge likelihoods are to be trusted. However, the JIT tends to assume throw blocks are always cold, hence BB02's block weight of 0. The new layout does not make any such assumption about throw blocks: After generating a greedy RPO-based layout of BB01->BB02->BB03, the new layout moves all rarely-run blocks (i.e. anything with a weight of 0) to the end of the method, hence the final BB01->BB03->BB02 layout. This case would be fixed by propagating weight to BB02 from BB01 that is proportional to their edge's likelihood, such that BB02 would no longer be considered rarely-run; thus, running profile repair before block layout would probably fix this. Though considering the perf cost of exception handling, perhaps we don't have much to gain from removing this expectation that throw blocks are cold.
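
The two steps described above can be sketched roughly as follows, using the weights and likelihoods from the base dump. This is a simplified model I'm assuming for illustration (a plain DFS that visits the less likely successor first, then a stable partition on weight), not the JIT's actual implementation, and the "repair" at the end is a hypothetical one-edge propagation.

```python
def greedy_rpo(entry, succs):
    """Reverse postorder that favors placing the likelier successor next."""
    visited, postorder = set(), []

    def dfs(block):
        visited.add(block)
        # Visit the least likely successor first so that, after reversing the
        # postorder, the most likely successor tends to follow the block.
        for succ in sorted(succs[block], key=succs[block].get):
            if succ not in visited:
                dfs(succ)
        postorder.append(block)

    dfs(entry)
    return postorder[::-1]


def move_cold_to_end(order, weight):
    """Stable partition: zero-weight blocks go to the end of the method."""
    hot = [b for b in order if weight[b] > 0]
    cold = [b for b in order if weight[b] == 0]
    return hot + cold


SUCCS = {"BB01": {"BB03": 0.2, "BB02": 0.8}, "BB02": {}, "BB03": {}}
WEIGHT = {"BB01": 1.0, "BB02": 0.0, "BB03": 1.0}

rpo = greedy_rpo("BB01", SUCCS)
final = move_cold_to_end(rpo, WEIGHT)

# Hypothetical repair: give BB02 weight proportional to the BB01->BB02 edge.
repaired = dict(WEIGHT)
repaired["BB02"] = WEIGHT["BB01"] * SUCCS["BB01"]["BB02"]
final_repaired = move_cold_to_end(rpo, repaired)
```

With the original weights the throw block ends up last; with the repaired weight it is no longer treated as rarely-run and stays right after BB01.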

I should note that the old layout didn't do anything clever on purpose here. It left BB02 after BB01 because it still expects the false target of a conditional block to be its next block, and not because it is the more likely successor. This invariant has been removed elsewhere in the JIT, so switching over to the new layout for good would allow us to remove the last bits of cruft around this implicit fallthrough requirement.

System.Threading.Tests.Perf_Interlocked.CompareExchange_object_NoMatch also regressed by over 40% for the same reason. There seem to be a few of these benchmark pairs inflating the improvement/regression counts.

System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) regressed by about 14%, due to layout differences in System.Collections.BitArray:Not:
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      900864 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB11 [0018]  1       BB12                 15.60 14054033 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB12 [0019]  2       BB10,BB11            16.58 14933038 [0F7..107)-> BB11(0.941),BB25(0.0589)  ( cond )                     i IBC bwd bwd-src
BB25 [0039]  3       BB12,BB13,BB20        1.00   899963 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd
BB23 [0029]  4       BB01,BB08,BB25,BB27   1.00   900864 [15E..16E)                           (return)                     i IBC
BB09 [0009]  1       BB01                  1.00   899963 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   899963 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB25(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB27(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB27 [0041]  1       BB16                  0           0 [???..???)-> BB23(0),BB20(1)         ( cond )                     IBC rare internal
BB20 [0027]  2       BB25,BB27             0           0 [14F..15A)-> BB25(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      930304 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB09 [0009]  1       BB01                  1.00   929374 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   929374 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB12 [0019]  2       BB10,BB11            16.50 15349629 [0F7..107)-> BB11(0.94),BB21(0.0605) ( cond )                     i IBC bwd bwd-src
BB11 [0018]  1       BB12                 15.50 14421162 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB21 [0028]  4       BB12,BB13,BB16,BB20   1.00   929374 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd bwd-src
BB23 [0029]  3       BB01,BB08,BB21        1      930304 [15E..16E)                           (return)                     i IBC
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB21(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB21(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB20 [0027]  1       BB21                  0           0 [14F..15A)-> BB21(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In terms of edge likelihoods, the new layout seems to get the critical paths right. Note that the "greedy" part of the RPO only applies to conditional blocks when deciding which successor to place next; other multi-successor block kinds, like switch blocks, don't seem to be common enough to be worth extending the layout's greediness to, though this could be done quite easily as a follow-up (see #101935).

I believe the hot loop BB11<->BB12 is to blame for the regression: BB12 is reachable from BB10 and BB11, and BB11 is reachable only from BB12. When we start the RPO from BB01, we end up visiting BB10, then BB12, and then BB11, which is why the new layout places BB12 before BB11. This introduces more branches: We need a backward jump from BB11 to BB12 within the loop, and once BB12's condition is false, we need to jump over BB11 to get to BB12's false target. If we place BB11 before BB12 instead, then BB11 can fall into BB12, and BB12 can eventually fall into its false target after the loop; we only need the single backward jump from BB12 to BB11.
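
To put rough numbers on the extra branches, here is a small sketch that counts the branch instructions each block needs under a given order and weights them by block weight. The block names and weights are from the diff dump; the cost model (one branch per conditional block, plus a jump when no successor is the next block) is a simplification I'm assuming for illustration, and BB21 is treated as the loop exit with its own successors ignored.

```python
# Successors for the hot loop in BitArray:Not, using the diff dump's names.
SUCCS = {
    "BB10": ["BB12"],
    "BB11": ["BB12"],
    "BB12": ["BB11", "BB21"],
    "BB21": [],  # loop exit; its own successors are ignored in this model
}
WEIGHT = {"BB10": 1.0, "BB11": 15.5, "BB12": 16.5, "BB21": 1.0}


def branches_in_block(order, block):
    """Count the branch instructions `block` needs when laid out in `order`."""
    i = order.index(block)
    next_block = order[i + 1] if i + 1 < len(order) else None
    succs = SUCCS[block]
    if not succs:
        return 0
    if len(succs) == 1:
        # An unconditional successor is free only if it is the next block.
        return 0 if succs[0] == next_block else 1
    # Conditional block: one conditional branch, plus an unconditional jump
    # if neither successor can be reached by falling through.
    return 1 if next_block in succs else 2


def weighted_branches(order):
    return sum(WEIGHT[b] * branches_in_block(order, b) for b in order)


new_layout = ["BB10", "BB12", "BB11", "BB21"]
old_style_layout = ["BB10", "BB11", "BB12", "BB21"]
new_cost = weighted_branches(new_layout)
old_cost = weighted_branches(old_style_layout)
```

In this model the new layout costs 32.0 weighted branches versus 17.5 for the BB11-before-BB12 order: the loop body pays for both BB12's conditional branch and BB11's backward jump on every iteration.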

Perhaps we could re-canonicalize loops post-layout to fix these cases, though I hesitate to purposefully break the RPO. Cases like this one could be tackled by a heuristic that optimizes a layout score, as described in #93020.

System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 512) regressed by over 40% for seemingly the same reason, though the problematic loop shapes are all in the cold section. Take a look at BB51<->BB52, BB56<->BB57, etc.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      281408 [000..00B)-> BB02(0),BB03(1)         ( cond )                     i IBC
BB03 [0064]  1       BB01                  1      281408 [000..016)-> BB04(0),BB05(1)         ( cond )                     i IBC
BB05 [0068]  1       BB03                  1      281408 [00B..017)-> BB07(0.2),BB06(0.8)     ( cond )                     i IBC nullcheck
BB06 [0081]  1       BB05                  0.80   225126 [016..017)-> BB07(1)                 (always)                     i IBC hascall gcsafe
BB07 [0082]  2       BB05,BB06             1      281408 [016..017)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0072]  1       BB07                  1      281408 [016..017)-> BB11(1)                 (always)                     i IBC
BB11 [0002]  2       BB08,BB09             1      281408 [???..???)-> BB13(0),BB14(1)         ( cond )                     i IBC hascall
BB14 [0147]  1       BB11                  1      281408 [02F..039)-> BB16(0),BB20(1)         ( cond )                     i IBC
BB20 [0007]  2       BB13,BB14             1      281408 [???..???)-> BB22(0),BB23(1)         ( cond )                     i IBC hascall
BB23 [0152]  2       BB20,BB22             1      281408 [094..0A1)-> BB45(0),BB25(1)         ( cond )                     i IBC
BB25 [0008]  1       BB23                  1      281408 [0A1..0BA)-> BB26(0),BB27(1)         ( cond )                     i IBC
BB27 [0010]  1       BB25                  1      281408 [0C5..0DB)-> BB41(0),BB30(1)         ( cond )                     i IBC idxlen
BB30 [0098]  1       BB27                  1      281408 [0DA..0DB)-> BB32(0.2),BB31(0.8)     ( cond )                     i IBC idxlen nullcheck
BB31 [0103]  1       BB30                  0.80   225126 [0DA..0DB)-> BB32(1)                 (always)                     i IBC hascall gcsafe
BB32 [0104]  2       BB30,BB31             1      281408 [0DA..0F3)-> BB37(0.00795),BB33(0.992)   ( cond )                     i IBC idxlen nullcheck
BB33 [0139]  2       BB32,BB36           124.77 35110912 [0F3..103)-> BB40(0),BB34(1)         ( cond )                     i IBC idxlen bwd
BB34 [0114]  1       BB33                124.77 35110912 [0F3..104)-> BB36(0.2),BB35(0.8)     ( cond )                     i IBC bwd
BB35 [0125]  1       BB34                 99.81 28088730 [103..104)-> BB36(1)                 (always)                     i IBC hascall gcsafe bwd
BB36 [0126]  2       BB34,BB35           124.77 35110912 [103..119)-> BB33(0.992),BB37(0.00795)   ( cond )                     i IBC bwd
BB37 [0141]  2       BB32,BB36             1.00   281408 [119..11E)-> BB38(0),BB39(1)         ( cond )                     i IBC
BB39 [0017]  2       BB37,BB38             1.00   281408 [144..159)-> BB44(0),BB43(0),BB42(0),BB62(1)[def] (switch)                     i IBC
BB62 [0136]  5       BB18,BB19,BB39,BB44,BB60   1.00   281408 [???..???)                           (return)                     IBC internal
BB63 [0153]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB64 [0154]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0063]  1       BB01                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB04 [0067]  1       BB03                  0           0 [00B..00C)                           (throw )                     i IBC rare hascall gcsafe
BB09 [0073]  1       BB07                  0           0 [016..01F)-> BB11(1),BB10(0)         ( cond )                     i IBC rare nullcheck
BB10 [0001]  1       BB09                  0           0 [01F..02F)                           (throw )                     i IBC rare hascall gcsafe newobj
BB13 [0146]  1       BB11                  0           0 [???..???)-> BB20(0),BB16(1)         ( cond )                     IBC rare internal
BB16 [0003]  2       BB13,BB14             0           0 [039..04E)-> BB18(1),BB17(0)         ( cond )                     i IBC rare
BB17 [0004]  1       BB16                  0           0 [04E..059)                           (throw )                     i IBC rare hascall gcsafe newobj
BB18 [0005]  1       BB16                  0           0 [059..07D)-> BB62(0.48),BB19(0.52)   ( cond )                     i IBC rare hascall gcsafe
BB19 [0006]  1       BB18                  0           0 [07D..094)-> BB62(1)                 (always)                     i IBC rare idxlen
BB22 [0151]  1       BB20                  0           0 [???..???)-> BB23(1)                 (always)                     IBC rare internal
BB26 [0009]  1       BB25                  0           0 [0BA..0C5)                           (throw )                     i IBC rare hascall gcsafe newobj
BB38 [0016]  1       BB37                  0           0 [11E..144)-> BB39(1)                 (always)                     i IBC rare idxlen
BB40 [0113]  1       BB33                  0           0 [0F3..0F4)                           (throw )                     i IBC rare hascall gcsafe bwd
BB41 [0140]  1       BB27                  0           0 [103..104)                           (throw )                     i IBC rare gcsafe bwd
BB42 [0019]  1       BB39                  0           0 [15A..170)-> BB43(1)                 (always)                     i IBC rare idxlen
BB43 [0020]  2       BB39,BB42             0           0 [170..185)-> BB44(1)                 (always)                     i IBC rare idxlen
BB44 [0021]  2       BB39,BB43             0           0 [185..199)-> BB62(1)                 (always)                     i IBC rare idxlen
BB45 [0022]  1       BB23                  0           0 [199..1A8)-> BB61(0),BB46(1)         ( cond )                     i IBC rare hascall
BB46 [0023]  1       BB45                  0           0 [1A8..1B8)-> BB48(1),BB47(0)         ( cond )                     i IBC rare
BB47 [0024]  1       BB46                  0           0 [1B8..1C3)                           (throw )                     i IBC rare hascall gcsafe newobj
BB48 [0025]  1       BB46                  0           0 [1C3..1D3)-> BB60(0.48),BB49(0.52)   ( cond )                     i IBC rare
BB49 [0026]  1       BB48                  0           0 [1D3..338)-> BB54(0.48),BB50(0.52)   ( cond )                     i IBC rare
BB50 [0034]  1       BB49                  0           0 [338..371)-> BB52(1)                 (always)                     i IBC rare idxlen
BB51 [0035]  1       BB52                  0           0 [371..3C5)-> BB52(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB52 [0036]  2       BB50,BB51             0           0 [3C5..3D8)-> BB51(0.9),BB53(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB53 [0037]  1       BB52                  0           0 [3D8..3E1)-> BB60(1)                 (always)                     i IBC rare
BB54 [0038]  1       BB49                  0           0 [3E1..400)-> BB60(0.48),BB55(0.52)   ( cond )                     i IBC rare
BB55 [0040]  1       BB54                  0           0 [400..455)-> BB57(1)                 (always)                     i IBC rare idxlen
BB56 [0044]  1       BB57                  0           0 [455..4E4)-> BB57(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB57 [0045]  2       BB55,BB56             0           0 [4E4..4FD)-> BB56(0.9),BB58(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB58 [0046]  1       BB57                  0           0 [4FD..506)-> BB60(1)                 (always)                     i IBC rare
BB59 [0057]  1       BB60                  0           0 [61B..647)-> BB60(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB60 [0058]  5       BB48,BB53,BB54,BB58,BB59   0           0 [647..651)-> BB59(0.9),BB62(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB61 [0060]  1       BB45                  0           0 [652..662)                           (throw )                     i IBC rare hascall gcsafe newobj
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      277888 [000..00B)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB03 [0064]  1       BB01                  1      277888 [000..016)-> BB05(1),BB04(0)         ( cond )                     i IBC
BB05 [0068]  1       BB03                  1      277888 [00B..017)-> BB07(0.2),BB06(0.8)     ( cond )                     i IBC nullcheck
BB06 [0081]  1       BB05                  0.80   222310 [016..017)-> BB07(1)                 (always)                     i IBC hascall gcsafe
BB07 [0082]  2       BB05,BB06             1      277888 [016..017)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0072]  1       BB07                  1      277888 [016..017)-> BB11(1)                 (always)                     i IBC
BB11 [0002]  2       BB08,BB09             1      277888 [???..???)-> BB14(1),BB13(0)         ( cond )                     i IBC hascall
BB14 [0147]  1       BB11                  0.50   138944 [???..???)-> BB15(1)                 (always)                     IBC internal
BB15 [0143]  2       BB13,BB14             1      277888 [02F..039)-> BB20(1),BB16(0)         ( cond )                     i IBC hascall
BB20 [0007]  1       BB15                  1      277888 [???..???)-> BB23(1),BB22(0)         ( cond )                     i IBC hascall
BB23 [0152]  2       BB20,BB22             1      277888 [094..0A1)-> BB45(0),BB25(1)         ( cond )                     i IBC
BB25 [0008]  1       BB23                  1      277888 [0A1..0BA)-> BB27(1),BB26(0)         ( cond )                     i IBC
BB27 [0010]  1       BB25                  1      277888 [0C5..0DB)-> BB30(1),BB41(0)         ( cond )                     i IBC idxlen
BB30 [0098]  1       BB27                  1      277888 [0DA..0DB)-> BB32(0.2),BB31(0.8)     ( cond )                     i IBC idxlen nullcheck
BB31 [0103]  1       BB30                  0.80   222310 [0DA..0DB)-> BB32(1)                 (always)                     i IBC hascall gcsafe
BB32 [0104]  2       BB30,BB31             1      277888 [0DA..0F3)-> BB37(0.00782),BB33(0.992)   ( cond )                     i IBC idxlen nullcheck
BB33 [0139]  2       BB32,BB36           126.94 35274752 [0F3..103)-> BB34(1),BB40(0)         ( cond )                     i IBC idxlen bwd
BB34 [0114]  1       BB33                126.94 35274752 [0F3..104)-> BB36(0.2),BB35(0.8)     ( cond )                     i IBC bwd
BB35 [0125]  1       BB34                101.55 28219802 [103..104)-> BB36(1)                 (always)                     i IBC hascall gcsafe bwd
BB36 [0126]  2       BB34,BB35           126.94 35274752 [103..119)-> BB33(0.992),BB37(0.00782)   ( cond )                     i IBC bwd
BB37 [0141]  2       BB32,BB36             1.00   277888 [119..11E)-> BB39(1),BB38(0)         ( cond )                     i IBC
BB39 [0017]  2       BB37,BB38             1.00   277888 [144..159)-> BB44(0),BB43(0),BB42(0),BB62(1)[def] (switch)                     i IBC
BB62 [0136]  5       BB18,BB19,BB39,BB44,BB60   1.00   277888 [???..???)                           (return)                     IBC internal
BB09 [0073]  1       BB07                  0           0 [016..01F)-> BB11(1),BB10(0)         ( cond )                     i IBC rare nullcheck
BB13 [0146]  1       BB11                  0           0 [???..???)-> BB15(1)                 (always)                     IBC rare internal
BB22 [0151]  1       BB20                  0           0 [???..???)-> BB23(1)                 (always)                     IBC rare internal
BB40 [0113]  1       BB33                  0           0 [0F3..0F4)                           (throw )                     i IBC rare hascall gcsafe bwd
BB38 [0016]  1       BB37                  0           0 [11E..144)-> BB39(1)                 (always)                     i IBC rare idxlen
BB42 [0019]  1       BB39                  0           0 [15A..170)-> BB43(1)                 (always)                     i IBC rare idxlen
BB43 [0020]  2       BB39,BB42             0           0 [170..185)-> BB44(1)                 (always)                     i IBC rare idxlen
BB44 [0021]  2       BB39,BB43             0           0 [185..199)-> BB62(1)                 (always)                     i IBC rare idxlen
BB41 [0140]  1       BB27                  0           0 [103..104)                           (throw )                     i IBC rare gcsafe bwd
BB26 [0009]  1       BB25                  0           0 [0BA..0C5)                           (throw )                     i IBC rare hascall gcsafe newobj
BB45 [0022]  1       BB23                  0           0 [199..1A8)-> BB61(0),BB46(1)         ( cond )                     i IBC rare hascall
BB46 [0023]  1       BB45                  0           0 [1A8..1B8)-> BB48(1),BB47(0)         ( cond )                     i IBC rare
BB48 [0025]  1       BB46                  0           0 [1C3..1D3)-> BB60(0.48),BB49(0.52)   ( cond )                     i IBC rare
BB49 [0026]  1       BB48                  0           0 [1D3..338)-> BB54(0.48),BB50(0.52)   ( cond )                     i IBC rare
BB50 [0034]  1       BB49                  0           0 [338..371)-> BB52(1)                 (always)                     i IBC rare idxlen
BB52 [0036]  2       BB50,BB51             0           0 [3C5..3D8)-> BB51(0.9),BB53(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB51 [0035]  1       BB52                  0           0 [371..3C5)-> BB52(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB53 [0037]  1       BB52                  0           0 [3D8..3E1)-> BB60(1)                 (always)                     i IBC rare
BB54 [0038]  1       BB49                  0           0 [3E1..400)-> BB60(0.48),BB55(0.52)   ( cond )                     i IBC rare
BB55 [0040]  1       BB54                  0           0 [400..455)-> BB57(1)                 (always)                     i IBC rare idxlen
BB57 [0045]  2       BB55,BB56             0           0 [4E4..4FD)-> BB56(0.9),BB58(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB56 [0044]  1       BB57                  0           0 [455..4E4)-> BB57(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB58 [0046]  1       BB57                  0           0 [4FD..506)-> BB60(1)                 (always)                     i IBC rare
BB60 [0058]  5       BB48,BB53,BB54,BB58,BB59   0           0 [647..651)-> BB59(0.9),BB62(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB59 [0057]  1       BB60                  0           0 [61B..647)-> BB60(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB47 [0024]  1       BB46                  0           0 [1B8..1C3)                           (throw )                     i IBC rare hascall gcsafe newobj
BB61 [0060]  1       BB45                  0           0 [652..662)                           (throw )                     i IBC rare hascall gcsafe newobj
BB16 [0003]  1       BB15                  0           0 [039..04E)-> BB18(1),BB17(0)         ( cond )                     i IBC rare
BB18 [0005]  1       BB16                  0           0 [059..07D)-> BB62(0.48),BB19(0.52)   ( cond )                     i IBC rare hascall gcsafe
BB19 [0006]  1       BB18                  0           0 [07D..094)-> BB62(1)                 (always)                     i IBC rare idxlen
BB17 [0004]  1       BB16                  0           0 [04E..059)                           (throw )                     i IBC rare hascall gcsafe newobj
BB10 [0001]  1       BB09                  0           0 [01F..02F)                           (throw )                     i IBC rare hascall gcsafe newobj
BB04 [0067]  1       BB03                  0           0 [00B..00C)                           (throw )                     i IBC rare hascall gcsafe
BB02 [0063]  1       BB01                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB63 [0153]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB64 [0154]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

There are lots of improvements that I'm not extending the same analysis to, and I don't mean for my tone to be pessimistic. My key takeaway is that much of the behavior we don't want from the new layout algorithm can be addressed by using block weights to selectively repair specific shapes, and we have Phoenix as a guide for a lot of this work. Even in its current form, the new layout algorithm is easier to understand and quite a bit faster: I'm expecting throughput (TP) improvements of over 1% from this PR, and follow-up work (should we decide to remove the old layout entirely) will only improve this further.

cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 16, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch

jakobbotsch commented May 17, 2024

System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) regressed by about 14%, due to layout differences in System.Collections.BitArray:Not.

Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      900864 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB11 [0018]  1       BB12                 15.60 14054033 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB12 [0019]  2       BB10,BB11            16.58 14933038 [0F7..107)-> BB11(0.941),BB25(0.0589)  ( cond )                     i IBC bwd bwd-src
BB25 [0039]  3       BB12,BB13,BB20        1.00   899963 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd
BB23 [0029]  4       BB01,BB08,BB25,BB27   1.00   900864 [15E..16E)                           (return)                     i IBC
BB09 [0009]  1       BB01                  1.00   899963 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   899963 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB25(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB27(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB27 [0041]  1       BB16                  0           0 [???..???)-> BB23(0),BB20(1)         ( cond )                     IBC rare internal
BB20 [0027]  2       BB25,BB27             0           0 [14F..15A)-> BB25(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      930304 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB09 [0009]  1       BB01                  1.00   929374 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   929374 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB12 [0019]  2       BB10,BB11            16.50 15349629 [0F7..107)-> BB11(0.94),BB21(0.0605) ( cond )                     i IBC bwd bwd-src
BB11 [0018]  1       BB12                 15.50 14421162 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB21 [0028]  4       BB12,BB13,BB16,BB20   1.00   929374 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd bwd-src
BB23 [0029]  3       BB01,BB08,BB21        1      930304 [15E..16E)                           (return)                     i IBC
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB21(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB21(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB20 [0027]  1       BB21                  0           0 [14F..15A)-> BB21(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In terms of edge likelihoods, the new layout seems to get the critical paths right. Note, though, that the "greedy" part of the RPO only applies to conditional blocks when deciding which successor to place next. Other multi-successor block kinds, like switch blocks, don't seem common enough to be worth extending the layout's greediness to, though this could easily be done as a follow-up (see #101935). I believe the hot loop BB11<->BB12 is to blame for the regression: BB12 is reachable from BB11 and BB10, and BB11 is reachable only from BB12. When we start the RPO from BB01, we visit BB10, then BB12, and then BB11, which is why the new layout places BB12 before BB11. This introduces more branches: we need a backward jump from BB11 to BB12 within the loop, and once BB12's condition is false, we need to jump over BB11 to reach BB12's false target. If we place BB11 before BB12 instead, BB11 can fall into BB12, and BB12 can eventually fall into its false target after the loop; we only need the single backward jump from BB12 to BB11.
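
The ordering described above can be reproduced on a toy model. The following is a minimal Python sketch, not the JIT's actual C++ implementation. It keeps only the hot blocks of BitArray:Not, with edge likelihoods rounded from the diff dump, and the helper names `greedy_rpo` and `branch_instructions` are made up for illustration. It assumes a greedy RPO can be modeled as a depth-first traversal that visits the likeliest successor last, and it counts branches in a simplified way (a block needs no jump when its only successor is next in the layout). The old order reuses the diff dump's block numbers, so the loop exit is called BB21 in both layouts.

```python
# Hot subgraph of System.Collections.BitArray:Not (cold blocks omitted).
# Each entry maps a block to (successor, edge likelihood) pairs.
SUCCS = {
    "BB01": [("BB09", 1.0)],
    "BB09": [("BB10", 1.0)],
    "BB10": [("BB12", 1.0)],
    "BB12": [("BB11", 0.94), ("BB21", 0.06)],
    "BB11": [("BB12", 1.0)],
    "BB21": [("BB23", 1.0)],
    "BB23": [],
}


def greedy_rpo(succs, entry):
    """Reverse postorder in which the likeliest successor follows its block."""
    visited, postorder = set(), []

    def visit(block):
        visited.add(block)
        # Visit the least likely successor first: the likeliest one then
        # finishes last and ends up right after `block` once reversed.
        for succ, _ in sorted(succs[block], key=lambda edge: edge[1]):
            if succ not in visited:
                visit(succ)
        postorder.append(block)

    visit(entry)
    return postorder[::-1]


def branch_instructions(layout, succs, block):
    """Branch instructions `block` needs, given what follows it in `layout`."""
    targets = [succ for succ, _ in succs[block]]
    pos = layout.index(block)
    next_block = layout[pos + 1] if pos + 1 < len(layout) else None
    if not targets:
        return 0
    if len(targets) == 1:
        return 0 if targets[0] == next_block else 1
    # Conditional block: one conditional branch, plus an unconditional jump
    # when neither successor can be reached by falling through.
    return 1 if next_block in targets else 2


new_layout = greedy_rpo(SUCCS, "BB01")
old_layout = ["BB01", "BB11", "BB12", "BB21", "BB23", "BB09", "BB10"]
loop = ["BB11", "BB12"]

print(new_layout)
print(sum(branch_instructions(new_layout, SUCCS, b) for b in loop))
print(sum(branch_instructions(old_layout, SUCCS, b) for b in loop))
```

In this model the traversal places BB12 ahead of BB11, and the two loop blocks need two branch instructions between them, against one in the old order.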

Hmm, this makes me a bit leery. It is very common to lay out loops in this way to avoid this extra branch; Roslyn does that in IL for us, which the layout algorithm then effectively undoes. I think this affects all loops that we don't enter at the top.
The saving grace is that loop inversion will usually kick in, so most loops end up being entered at the top. Still, I think it would be good to collect some stats on how many loops this affects.

Here is a simple example:

private static int Sum(int[] arr)
{
    int i = 0;
    int sum = 0;
    while (i < arr.Length && arr[i] != 0)
    {
        sum += arr[i];
        i++;
    }

    return sum;
}

Base:

G_M57365_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       xor      edx, edx
       jmp      SHORT G_M57365_IG04
						;; size=9 bbWeight=1 PerfScore 4.50
G_M57365_IG03:  ;; offset=0x000D
       add      eax, dword ptr [rcx+4*rdx+0x10]
       inc      edx
						;; size=6 bbWeight=2 PerfScore 6.50
G_M57365_IG04:  ;; offset=0x0013
       cmp      dword ptr [rcx+0x08], edx
       jle      SHORT G_M57365_IG06
						;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG05:  ;; offset=0x0018
       cmp      dword ptr [rcx+4*rdx+0x10], 0
       jne      SHORT G_M57365_IG03
						;; size=7 bbWeight=4 PerfScore 16.00

Diff:

G_M57365_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       xor      edx, edx
						;; size=7 bbWeight=1 PerfScore 2.50
G_M57365_IG03:  ;; offset=0x000B
       cmp      dword ptr [rcx+0x08], edx
       jle      SHORT G_M57365_IG06
						;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG04:  ;; offset=0x0010
       cmp      dword ptr [rcx+4*rdx+0x10], 0
       je       SHORT G_M57365_IG06
						;; size=7 bbWeight=4 PerfScore 16.00
G_M57365_IG05:  ;; offset=0x0017
       add      eax, dword ptr [rcx+4*rdx+0x10]
       inc      edx
       jmp      SHORT G_M57365_IG03
						;; size=8 bbWeight=2 PerfScore 10.50
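
To make the difference concrete, the sketch below tallies what one steady-state loop iteration executes in each listing above. It is plain Python bookkeeping over mnemonics copied from the two listings. The `per_iteration` helper is hypothetical, and it simply treats any mnemonic starting with `j` as a branch.

```python
# Instruction groups that one steady-state loop iteration passes through,
# with mnemonics copied from the Base and Diff listings.
BASE_LOOP = {
    "IG03": ["add", "inc"],
    "IG04": ["cmp", "jle"],
    "IG05": ["cmp", "jne"],
}
DIFF_LOOP = {
    "IG03": ["cmp", "jle"],
    "IG04": ["cmp", "je"],
    "IG05": ["add", "inc", "jmp"],
}


def per_iteration(loop):
    """Return (instructions, branches) executed by one pass over the loop."""
    mnemonics = [m for group in loop.values() for m in group]
    branches = [m for m in mnemonics if m.startswith("j")]
    return len(mnemonics), len(branches)


print("base:", per_iteration(BASE_LOOP))
print("diff:", per_iteration(DIFF_LOOP))
```

Under this count the diff executes one more instruction per iteration than the base, and that extra instruction is the unconditional `jmp` back to the loop test.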

@amanasifkhalid

Yeah, I'm a bit concerned about this too, considering it affected the pretty idiomatic example you gave. In the benchmarks.run_pgo collection, I found 5,676 loops with backward jumps from their loop heads using the new layout, as opposed to 5,458 loops with the old layout. If we only look at loops that aren't rarely run, these numbers drop to 3,111 and 3,755, respectively.

@jakobbotsch are there other collections you'd like me to specifically look at? And if you think it's worth addressing this in the layout's implementation, what should our merge strategy look like? Would you want to run the experiment for a bit longer with the loop inversion fix before enabling to see what the improvements look like?

@amanasifkhalid

amanasifkhalid commented May 17, 2024

I reran my analysis across all SPMI collections we currently have on Windows x64. For loops that aren't rarely run, the loop head ends with a backward jump in 74,213 loops with the new layout, versus 68,315 loops with the old layout (roughly 8.6% more). There's definitely some double-counting here across the benchmarks.* and libraries_tests.* collections, though the new layout does seem marginally more susceptible to introducing this shape.
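
The thread doesn't show how these counts were gathered. As an illustration only, the Python sketch below applies the same idea to the BitArray:Not dumps quoted earlier: a loop head counts if the block it jumps to sits earlier in the layout. The `count_backward_loop_heads` helper and the hand-transcribed block lists are assumptions for this example, not the actual analysis script.

```python
# Block order transcribed from the BitArray:Not dumps
# (switch-case and unreferenced throw blocks omitted).
BASE_LAYOUT = ["BB01", "BB11", "BB12", "BB25", "BB23", "BB09", "BB10",
               "BB13", "BB15", "BB16", "BB27", "BB20"]
DIFF_LAYOUT = ["BB01", "BB09", "BB10", "BB12", "BB11", "BB21", "BB23",
               "BB13", "BB16", "BB15", "BB20"]

# Blocks flagged `loophead`: block -> (jump target, flagged `rare`?).
BASE_HEADS = {"BB11": ("BB12", False), "BB15": ("BB16", True), "BB20": ("BB25", True)}
DIFF_HEADS = {"BB11": ("BB12", False), "BB15": ("BB16", True), "BB20": ("BB21", True)}


def count_backward_loop_heads(layout, loop_heads, include_rare=True):
    """Count loop heads whose jump target is placed earlier in the layout."""
    position = {block: index for index, block in enumerate(layout)}
    count = 0
    for head, (target, is_rare) in loop_heads.items():
        if is_rare and not include_rare:
            continue
        if position[target] < position[head]:
            count += 1
    return count


for name, layout, heads in [("base", BASE_LAYOUT, BASE_HEADS),
                            ("diff", DIFF_LAYOUT, DIFF_HEADS)]:
    print(name,
          count_backward_loop_heads(layout, heads),
          count_backward_loop_heads(layout, heads, include_rare=False))
```

For this one method the sketch finds one backward-jumping loop head in the base layout (a rarely run one) and three in the diff layout, one of which is not rarely run.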

@amanasifkhalid

Now that I think about it, we ought to try something similar to what Phoenix does after creating the RPO-based layout: move a block's hottest predecessor to just before it, so that loop heads don't end up at the end of loop bodies. Looking around various GitHub issues like #9304, I think this is some low-hanging fruit we can address up-front.
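
As a rough illustration of that repair step, here is a minimal Python sketch. It is neither Phoenix's nor the JIT's implementation: `move_hottest_backward_pred` and the toy data (hot blocks of BitArray:Not, weights rounded from the diff dump) are assumptions. It reads the idea narrowly, considering only predecessors that end in an unconditional jump, and moves the hottest such predecessor to just before its target when it currently sits later in the layout.

```python
# RPO-based order of the hot blocks, as in the diff dump.
LAYOUT = ["BB01", "BB09", "BB10", "BB12", "BB11", "BB21", "BB23"]

# Blocks that end in an unconditional jump, and where they jump to.
JUMP_TARGET = {"BB10": "BB12", "BB11": "BB12"}

# Block weights for those predecessors.
WEIGHT = {"BB10": 1.0, "BB11": 15.5}


def move_hottest_backward_pred(layout, jump_target, weight):
    """Move each block's hottest jumping predecessor to just before it.

    Only predecessors placed after their target (backward jumps) are moved,
    so that the hot predecessor falls through into the target.
    """
    result = list(layout)
    for target in layout:
        preds = [b for b, t in jump_target.items() if t == target]
        if not preds:
            continue
        hottest = max(preds, key=lambda b: weight[b])
        if result.index(hottest) > result.index(target):
            result.remove(hottest)
            result.insert(result.index(target), hottest)
    return result


print(move_hottest_backward_pred(LAYOUT, JUMP_TARGET, WEIGHT))
```

On this data BB11 moves in front of BB12 and falls into it. BB10 then has to jump over BB11 to enter the loop at BB12, which is the loop shape seen in the Base listing of the Sum example. If BB11 were colder than BB10, the layout would be left unchanged.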

@EgorBo

EgorBo commented May 19, 2024

@EgorBot --disasm

using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{
    private int _int = 0;
    private long _long = 0;
    private string _location, _newValue, _comparand;

    [GlobalSetup(Target = nameof(CompareExchange_object_NoMatch))]
    public void Setup_CompareExchange_object_NoMatch()
    {
        _location = "Hello";
        _newValue = "World";
        _comparand = "What?";
    }

    [Benchmark]
    public string CompareExchange_object_NoMatch() 
        => Interlocked.CompareExchange(ref _location, _newValue, _comparand);
}

@EgorBo

EgorBo commented May 19, 2024

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
| Method | Toolchain | Mean | Ratio | Code Size |
|---|---|---|---|---|
| CompareExchange_object_NoMatch | Main | 4.927 ns | 1.00 | 43 B |
| CompareExchange_object_NoMatch | PR | 4.979 ns | 1.01 | 43 B |

BDN_Artifacts.zip

@EgorBo

EgorBo commented May 19, 2024

I was just testing my bot. That said, @amanasifkhalid, you mentioned that Setup_CompareExchange_object_NoMatch regressed by 40%. Was that Windows-specific?

@EgorBo

EgorBo commented May 19, 2024

Although, looking at the BDN_Artifacts, there does seem to be an ASM difference, namely: https://www.diffchecker.com/hu5vJF0i/ (I'm not sure which side of that diff is the baseline and which is the PR)

@amanasifkhalid

@EgorBo yes, that was on Windows x64. That number also came from me looking at min execution times, so it's possible the baseline had an unusually good run?

Nice bot, by the way

@amanasifkhalid

I've opened #102461 to address the loop inversion issue. If we decide we want that change, should we let the layout experiment run for another week or so? I'm fine with punting this change to Preview 6. @dotnet/jit-contrib

@JulieLeeMSFT JulieLeeMSFT added the Priority:2 Work that is important, but not critical for the release label May 20, 2024
amanasifkhalid added a commit that referenced this pull request May 21, 2024
…ayout (#102461)

Part of #93020. In #102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.
@amanasifkhalid

Updated diffs. @EgorBo since we addressed the mis-rotated loop issue with #102461, are you ok with merging this as-is?


@EgorBo EgorBo left a comment

Some of the diffs look very nice! Looking forward to dotnet/performance results 🙂

@amanasifkhalid amanasifkhalid merged commit f02a695 into dotnet:main May 22, 2024
107 checks passed
@amanasifkhalid amanasifkhalid deleted the enable-rpo-layout branch May 22, 2024 17:55
@AndyAyersMS

Nice to see this enabled! Thanks for digging into the original set of diffs and fixing problems.

@amanasifkhalid

@AndyAyersMS @jakobbotsch @EgorBo thank you all for your help with getting this merged!

@github-actions github-actions bot locked and limited conversation to collaborators Jun 22, 2024