JIT: Enable RPO-based block layout by default #102343

amanasifkhalid · 2024-05-16T22:00:32Z

Part of #93020. Enables the new greedy RPO-based block layout by default. By fully switching over to the new layout algorithm, we can get rid of a lot of code that probably isn't useful anymore. Aside from the old layout, we should consider removing code that prematurely tries to maintain a certain ordering, like fgFindInsertPoint. I'm not doing any of this removal yet, in case we want to keep the old implementation around for now.

We now have about a week of useful data from the rpolayout experiment in the perf lab. Here's a PDF/CDF of the minimum benchmark execution times from the last 5 days, on Windows x64:

Many (most?) of those datapoints are within the realm of noise. Here's a brief breakdown of the nontrivial improvements/regressions on x64, using the min/median/max benchmark execution times from the last 5 days:

Windows x64, min execution time
22.85% improved by >=2%; 14.64% regressed
9.54% improved by >=5%; 6.85% regressed
2.97% improved by >=10%; 2.97% regressed

Ubuntu x64, min execution time
27.53% improved by >=2%; 12.91% regressed
11.11% improved by >=5%; 5.69% regressed
3.68% improved by >=10%; 2.19% regressed

Windows x64, median execution time
26.62% improved by >=2%; 13.73% regressed
12.23% improved by >=5%; 6.89% regressed
4.32% improved by >=10%; 2.83% regressed

Ubuntu x64, median execution time
26.20% improved by >=2%; 12.01% regressed
11.17% improved by >=5%; 5.92% regressed
3.96% improved by >=10%; 2.31% regressed

Windows x64, max execution time
30.45% improved by >=2%; 18.67% regressed
17.01% improved by >=5%; 10.51% regressed
7.06% improved by >=10%; 5.11% regressed

Ubuntu x64, max execution time
31.38% improved by >=2%; 22.89% regressed
16.02% improved by >=5%; 11.26% regressed
6.92% improved by >=10%; 5.60% regressed
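
For reference, here is a minimal Python sketch of how a breakdown like the ones above could be computed from per-benchmark timings. The function name, the relative-change definition of "improved by >=N%", and the timings are all assumptions made for illustration; none of this is the perf lab's actual tooling or data.

```python
def breakdown(base_times, diff_times, threshold):
    """Return (share improved, share regressed) by at least `threshold`.

    Assumption: a benchmark counts as improved/regressed when its relative
    change (diff - base) / base is at least `threshold` in magnitude.
    """
    assert base_times and len(base_times) == len(diff_times)
    improved = regressed = 0
    for base, diff in zip(base_times, diff_times):
        change = (diff - base) / base  # negative means the diff is faster
        if change <= -threshold:
            improved += 1
        elif change >= threshold:
            regressed += 1
    total = len(base_times)
    return improved / total, regressed / total


# Made-up timings (not perf lab data), only to exercise the function.
base = [100.0, 100.0, 100.0, 100.0]
diff = [85.0, 99.0, 103.0, 130.0]

for threshold in (0.02, 0.05, 0.10):
    better, worse = breakdown(base, diff, threshold)
    print(f">={threshold:.0%}: {better:.2%} improved; {worse:.2%} regressed")
```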

As of writing, 145 of 4,879 benchmarks regressed by 10% or more on Windows x64, when looking at their minimum execution times from the last 5 days. Block layout churn can have far-reaching consequences, so narrowing down which methods to look at when triaging regressions can be tricky. I've highlighted a few regressed benchmarks below with simple enough call graphs that the offending method is obvious; I think these examples highlight a few expected trends from the new layout algorithm:

System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..007)-> BB05(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0007]  1       BB01                  0.80  80 [006..007)-> BB05(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0008]  1       BB02                  0.64  64 [006..007)-> BB05(0.48),BB04(0.52)   ( cond )                     i IBC
BB04 [0009]  1       BB03                  0.33  33 [006..007)-> BB06(1)                 (always)                     i IBC
BB06 [0011]  2       BB04,BB05             1    100 [006..00E)                           (return)                     i IBC
BB05 [0010]  3       BB01,BB02,BB03        0.67  67 [006..007)-> BB06(1)                 (always)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..007)-> BB05(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0007]  1       BB01                  0.80  80 [006..007)-> BB05(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0008]  1       BB02                  0.64  64 [006..007)-> BB05(0.48),BB04(0.52)   ( cond )                     i IBC
BB04 [0009]  1       BB03                  0.33  33 [006..007)-> BB06(1)                 (always)                     i IBC
BB05 [0010]  3       BB01,BB02,BB03        0.67  67 [006..007)-> BB06(1)                 (always)                     i IBC
BB06 [0011]  2       BB04,BB05             1    100 [006..00E)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The new layout places BB06 after BB05, breaking up the fallthrough from BB04 to BB06. This has the benefit of removing a backward jump from BB05 to BB06, though in the case of this benchmark, it looks like the return path BB03->BB04->BB06 is taken, so we're penalized by the new jump over BB05. This benchmark is quite small, so the impact of the jump is big, regressing it by about 27%.

We could tweak the RPO-based layout by moving blocks up to just after their hottest predecessor to address this (@AndyAyersMS showed me something similar he did in Phoenix), though in this case, the block weights suggest BB05 is BB06's hottest predecessor, so I don't think there's anything worth changing here, in terms of the block layout algorithm itself.
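
To make the IsIdentityBenchmark trade-off concrete, here is a rough Python sketch (not JIT code). `taken_jumps` counts the control transfers on a path that are not fallthroughs in a given layout, and `move_after_hottest_pred` is a hypothetical version of the "move a block up to just after its hottest predecessor" tweak. Both helper names are made up for this example. The edge weights into BB06 are derived from the dump above (predecessor weight times edge likelihood), and the second weight set in the usage is invented to show the other outcome.

```python
def taken_jumps(layout, path):
    """Count transfers on `path` that are not fallthroughs in `layout`."""
    position = {block: i for i, block in enumerate(layout)}
    return sum(1 for src, dst in zip(path, path[1:])
               if position[dst] != position[src] + 1)


def move_after_hottest_pred(layout, block, pred_edge_weights):
    """Return a new layout with `block` placed right after its hottest predecessor.

    pred_edge_weights maps each predecessor to the weight of its edge into `block`.
    """
    hottest = max(pred_edge_weights, key=pred_edge_weights.get)
    rest = [b for b in layout if b != block]
    idx = rest.index(hottest) + 1
    return rest[:idx] + [block] + rest[idx:]


base_layout = ["BB01", "BB02", "BB03", "BB04", "BB06", "BB05"]
diff_layout = ["BB01", "BB02", "BB03", "BB04", "BB05", "BB06"]
hot_path = ["BB01", "BB02", "BB03", "BB04", "BB06"]  # path the benchmark seems to take

print(taken_jumps(base_layout, hot_path))  # 0: BB04 falls into BB06
print(taken_jumps(diff_layout, hot_path))  # 1: BB04 jumps over BB05

# Edge weights from the dump: BB04 (0.33) and BB05 (0.67), each with likelihood 1.
print(move_after_hottest_pred(diff_layout, "BB06", {"BB04": 0.33, "BB05": 0.67}))
# Invented weights where BB04 is hotter, for contrast.
print(move_after_hottest_pred(diff_layout, "BB06", {"BB04": 0.9, "BB05": 0.1}))
```

With the weights from the dump, the heuristic leaves the diff layout unchanged, since BB06 already follows BB05. Only with the invented weights does BB06 move back up behind BB04.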

System.Numerics.Tests.Perf_Matrix3x2.InequalityOperatorBenchmark regressed by about 19% for similar reasons.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..000)-> BB04(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0011]  1       BB01                  0.80  80 [000..000)-> BB04(0.48),BB03(0.52)   ( cond )                     i IBC internal
BB03 [0012]  1       BB02                  0.42  42 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB05 [0014]  2       BB03,BB04             1    100 [010..010)                           (return)                     i IBC
BB04 [0013]  2       BB01,BB02             0.58  58 [000..000)-> BB05(1)                 (always)                     i IBC internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..000)-> BB04(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0011]  1       BB01                  0.80  80 [000..000)-> BB04(0.48),BB03(0.52)   ( cond )                     i IBC internal
BB03 [0012]  1       BB02                  0.42  42 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB04 [0013]  2       BB01,BB02             0.58  58 [000..000)-> BB05(1)                 (always)                     i IBC internal
BB05 [0014]  2       BB03,BB04             1    100 [010..010)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Based on the PGO data available, the new layout seems to be making better decisions. We could iterate on this by synthesizing likelihoods and/or repairing the profile pre-layout, and by adding a post-RPO layout heuristic that moves blocks up to their hottest predecessor.

System.Threading.Tests.Perf_Interlocked.CompareExchange_object_Match regressed by over 40%.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0002]  1       BB01                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB03 [0003]  1       BB01                  1    100 [000..009)                           (return)                     i IBC jmp hascall
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB02(0.2),BB03(0.8)     ( cond )                     i IBC
BB03 [0003]  1       BB01                  1    100 [000..009)                           (return)                     i IBC jmp hascall
BB02 [0002]  1       BB01                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

This is an interesting case, where the exceptional path BB01->BB02 is actually the more likely path, if the edge likelihoods are to be trusted. However, the JIT tends to assume throw blocks are always cold, hence BB02's block weight of 0. The new layout does not make any such assumption about throw blocks: after generating a greedy RPO-based layout of BB01->BB02->BB03, it moves all rarely-run blocks (i.e. anything with a weight of 0) to the end of the method, hence the final BB01->BB03->BB02 layout. This case would be fixed by propagating weight from BB01 to BB02 in proportion to their edge's likelihood, so that BB02 would no longer be considered rarely-run; thus, running profile repair before block layout would probably fix this. Though considering the perf cost of exception handling, perhaps we don't have much to gain from removing this expectation that throw blocks are cold.

I should note that the old layout didn't do anything clever on purpose here. It left BB02 after BB01 because it still expects the false target of a conditional block to be its next block, and not because it is the more likely successor. This invariant has been removed elsewhere in the JIT, so switching over to the new layout for good would allow us to remove the last bits of cruft around this implicit fallthrough requirement.
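
The steps described for CompareExchange_object_Match can be sketched in Python as follows. This is a simplified model, not the JIT's implementation: `greedy_rpo`, `move_cold_to_end`, and `propagate_weight` are hypothetical helpers. The edge likelihoods and block weights are the ones in the base dump, where BB02 is the throw block. A real profile repair would also have to make the other weights consistent, which this sketch skips.

```python
def greedy_rpo(entry, succs):
    """Reverse postorder that places a block's likelier successor first.

    succs maps a block to a list of (successor, likelihood) pairs.
    """
    visited, postorder = set(), []

    def dfs(block):
        visited.add(block)
        # Visit the least likely successor first, so the most likely one
        # finishes last and lands earliest once the postorder is reversed.
        for succ, _ in sorted(succs.get(block, []), key=lambda edge: edge[1]):
            if succ not in visited:
                dfs(succ)
        postorder.append(block)

    dfs(entry)
    return postorder[::-1]


def move_cold_to_end(layout, weights):
    """Keep relative order, but push blocks with weight 0 to the end."""
    hot = [b for b in layout if weights[b] > 0]
    cold = [b for b in layout if weights[b] == 0]
    return hot + cold


def propagate_weight(weights, succs, src, dst):
    """Return weights with dst set to src's weight times the src->dst likelihood."""
    repaired = dict(weights)
    repaired[dst] = weights[src] * dict(succs[src])[dst]
    return repaired


succs = {"BB01": [("BB03", 0.2), ("BB02", 0.8)]}
weights = {"BB01": 1.0, "BB02": 0.0, "BB03": 1.0}

rpo = greedy_rpo("BB01", succs)
print(rpo)                              # ['BB01', 'BB02', 'BB03']
print(move_cold_to_end(rpo, weights))   # ['BB01', 'BB03', 'BB02']

repaired = propagate_weight(weights, succs, "BB01", "BB02")
print(move_cold_to_end(rpo, repaired))  # ['BB01', 'BB02', 'BB03']
```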

System.Threading.Tests.Perf_Interlocked.CompareExchange_object_NoMatch also regressed by over 40% for the same reason. There seem to be a few of these benchmark pairs inflating the improvement/regression counts.

System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) regressed by about 14%, due to layout differences in System.Collections.BitArray:Not:
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      900864 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB11 [0018]  1       BB12                 15.60 14054033 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB12 [0019]  2       BB10,BB11            16.58 14933038 [0F7..107)-> BB11(0.941),BB25(0.0589)  ( cond )                     i IBC bwd bwd-src
BB25 [0039]  3       BB12,BB13,BB20        1.00   899963 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd
BB23 [0029]  4       BB01,BB08,BB25,BB27   1.00   900864 [15E..16E)                           (return)                     i IBC
BB09 [0009]  1       BB01                  1.00   899963 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   899963 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB25(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB27(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB27 [0041]  1       BB16                  0           0 [???..???)-> BB23(0),BB20(1)         ( cond )                     IBC rare internal
BB20 [0027]  2       BB25,BB27             0           0 [14F..15A)-> BB25(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      930304 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB09 [0009]  1       BB01                  1.00   929374 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   929374 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB12 [0019]  2       BB10,BB11            16.50 15349629 [0F7..107)-> BB11(0.94),BB21(0.0605) ( cond )                     i IBC bwd bwd-src
BB11 [0018]  1       BB12                 15.50 14421162 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB21 [0028]  4       BB12,BB13,BB16,BB20   1.00   929374 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd bwd-src
BB23 [0029]  3       BB01,BB08,BB21        1      930304 [15E..16E)                           (return)                     i IBC
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB21(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB21(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB20 [0027]  1       BB21                  0           0 [14F..15A)-> BB21(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In terms of edge likelihoods, the new layout seems to get the critical paths right. Note, though, that the "greedy" part of the RPO only applies to conditional blocks when deciding which successor to place next. Other multi-successor block kinds, like switch blocks, don't seem to be common enough to be worth extending the layout's greediness to, though this could be done as a follow-up quite easily (see #101935). I believe the hot loop BB11<->BB12 is to blame for the regression: BB12 is reachable from BB11 and BB10, and BB11 is reachable only from BB12. When we start the RPO from BB01, we end up visiting BB10, then BB12, and then BB11, which is why the new layout places BB12 before BB11. This introduces more branches: we need a backward jump from BB11 to BB12 within the loop, and once BB12's condition is false, we need to jump over BB11 to get to BB12's false target. If we place BB11 before BB12, then BB11 can fall into BB12, and BB12 can eventually fall into its false target after the loop; we only need the single backward jump from BB12 to BB11.

Perhaps we could re-canonicalize loops post-layout to fix these cases, though I hesitate to purposefully break the RPO. Cases like this one could be tackled by a heuristic that optimizes a layout score, as described in #93020.
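
To illustrate the branch-count argument for the BB11<->BB12 loop, here is a toy Python model (not JIT code). It counts branch instructions executed along a path under an assumed cost model: an unconditional successor costs a jump unless it is the next block in the layout, and a conditional block always costs one conditional branch, plus a jump if neither successor is next and the taken successor is not the branch target. The four-block layouts are a simplification of the dumps above, and the trip count is chosen for illustration only.

```python
def executed_branches(layout, succs, path):
    """Count branch instructions executed along `path` under a toy cost model."""
    next_in_layout = dict(zip(layout, layout[1:]))
    total = 0
    for src, dst in zip(path, path[1:]):
        targets = succs[src]
        falls_through = next_in_layout.get(src)
        if len(targets) == 1:
            # Unconditional successor: a jump is needed unless it is the next block.
            total += 0 if falls_through == dst else 1
        else:
            # Conditional block: the conditional branch executes either way.
            total += 1
            if falls_through not in targets and dst != targets[0]:
                # Neither successor is next: assume "jcc targets[0]; jmp targets[1]".
                total += 1
    return total


def loop_path(iterations):
    # BB10 enters at the loop test BB12; BB11 is the loop body; BB21 is the exit.
    return ["BB10", "BB12"] + ["BB11", "BB12"] * iterations + ["BB21"]


succs = {"BB10": ["BB12"], "BB11": ["BB12"], "BB12": ["BB11", "BB21"]}
rpo_layout = ["BB10", "BB12", "BB11", "BB21"]      # BB12 before BB11, as in the diff
rotated_layout = ["BB10", "BB11", "BB12", "BB21"]  # BB11 placed before BB12

n = 16  # illustrative trip count, not taken from the profile
print(executed_branches(rpo_layout, succs, loop_path(n)))      # 2n + 1 = 33
print(executed_branches(rotated_layout, succs, loop_path(n)))  # n + 2 = 18
```

Under this model, the rotated layout pays one extra jump on loop entry but saves one branch per iteration, which is where the difference comes from in a hot loop.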

System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 512) regressed by over 40% for seemingly the same reason, though the problematic loop shapes are all in the cold section. Take a look at BB51<->BB52, BB56<->BB57, etc.
Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      281408 [000..00B)-> BB02(0),BB03(1)         ( cond )                     i IBC
BB03 [0064]  1       BB01                  1      281408 [000..016)-> BB04(0),BB05(1)         ( cond )                     i IBC
BB05 [0068]  1       BB03                  1      281408 [00B..017)-> BB07(0.2),BB06(0.8)     ( cond )                     i IBC nullcheck
BB06 [0081]  1       BB05                  0.80   225126 [016..017)-> BB07(1)                 (always)                     i IBC hascall gcsafe
BB07 [0082]  2       BB05,BB06             1      281408 [016..017)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0072]  1       BB07                  1      281408 [016..017)-> BB11(1)                 (always)                     i IBC
BB11 [0002]  2       BB08,BB09             1      281408 [???..???)-> BB13(0),BB14(1)         ( cond )                     i IBC hascall
BB14 [0147]  1       BB11                  1      281408 [02F..039)-> BB16(0),BB20(1)         ( cond )                     i IBC
BB20 [0007]  2       BB13,BB14             1      281408 [???..???)-> BB22(0),BB23(1)         ( cond )                     i IBC hascall
BB23 [0152]  2       BB20,BB22             1      281408 [094..0A1)-> BB45(0),BB25(1)         ( cond )                     i IBC
BB25 [0008]  1       BB23                  1      281408 [0A1..0BA)-> BB26(0),BB27(1)         ( cond )                     i IBC
BB27 [0010]  1       BB25                  1      281408 [0C5..0DB)-> BB41(0),BB30(1)         ( cond )                     i IBC idxlen
BB30 [0098]  1       BB27                  1      281408 [0DA..0DB)-> BB32(0.2),BB31(0.8)     ( cond )                     i IBC idxlen nullcheck
BB31 [0103]  1       BB30                  0.80   225126 [0DA..0DB)-> BB32(1)                 (always)                     i IBC hascall gcsafe
BB32 [0104]  2       BB30,BB31             1      281408 [0DA..0F3)-> BB37(0.00795),BB33(0.992)   ( cond )                     i IBC idxlen nullcheck
BB33 [0139]  2       BB32,BB36           124.77 35110912 [0F3..103)-> BB40(0),BB34(1)         ( cond )                     i IBC idxlen bwd
BB34 [0114]  1       BB33                124.77 35110912 [0F3..104)-> BB36(0.2),BB35(0.8)     ( cond )                     i IBC bwd
BB35 [0125]  1       BB34                 99.81 28088730 [103..104)-> BB36(1)                 (always)                     i IBC hascall gcsafe bwd
BB36 [0126]  2       BB34,BB35           124.77 35110912 [103..119)-> BB33(0.992),BB37(0.00795)   ( cond )                     i IBC bwd
BB37 [0141]  2       BB32,BB36             1.00   281408 [119..11E)-> BB38(0),BB39(1)         ( cond )                     i IBC
BB39 [0017]  2       BB37,BB38             1.00   281408 [144..159)-> BB44(0),BB43(0),BB42(0),BB62(1)[def] (switch)                     i IBC
BB62 [0136]  5       BB18,BB19,BB39,BB44,BB60   1.00   281408 [???..???)                           (return)                     IBC internal
BB63 [0153]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB64 [0154]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0063]  1       BB01                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB04 [0067]  1       BB03                  0           0 [00B..00C)                           (throw )                     i IBC rare hascall gcsafe
BB09 [0073]  1       BB07                  0           0 [016..01F)-> BB11(1),BB10(0)         ( cond )                     i IBC rare nullcheck
BB10 [0001]  1       BB09                  0           0 [01F..02F)                           (throw )                     i IBC rare hascall gcsafe newobj
BB13 [0146]  1       BB11                  0           0 [???..???)-> BB20(0),BB16(1)         ( cond )                     IBC rare internal
BB16 [0003]  2       BB13,BB14             0           0 [039..04E)-> BB18(1),BB17(0)         ( cond )                     i IBC rare
BB17 [0004]  1       BB16                  0           0 [04E..059)                           (throw )                     i IBC rare hascall gcsafe newobj
BB18 [0005]  1       BB16                  0           0 [059..07D)-> BB62(0.48),BB19(0.52)   ( cond )                     i IBC rare hascall gcsafe
BB19 [0006]  1       BB18                  0           0 [07D..094)-> BB62(1)                 (always)                     i IBC rare idxlen
BB22 [0151]  1       BB20                  0           0 [???..???)-> BB23(1)                 (always)                     IBC rare internal
BB26 [0009]  1       BB25                  0           0 [0BA..0C5)                           (throw )                     i IBC rare hascall gcsafe newobj
BB38 [0016]  1       BB37                  0           0 [11E..144)-> BB39(1)                 (always)                     i IBC rare idxlen
BB40 [0113]  1       BB33                  0           0 [0F3..0F4)                           (throw )                     i IBC rare hascall gcsafe bwd
BB41 [0140]  1       BB27                  0           0 [103..104)                           (throw )                     i IBC rare gcsafe bwd
BB42 [0019]  1       BB39                  0           0 [15A..170)-> BB43(1)                 (always)                     i IBC rare idxlen
BB43 [0020]  2       BB39,BB42             0           0 [170..185)-> BB44(1)                 (always)                     i IBC rare idxlen
BB44 [0021]  2       BB39,BB43             0           0 [185..199)-> BB62(1)                 (always)                     i IBC rare idxlen
BB45 [0022]  1       BB23                  0           0 [199..1A8)-> BB61(0),BB46(1)         ( cond )                     i IBC rare hascall
BB46 [0023]  1       BB45                  0           0 [1A8..1B8)-> BB48(1),BB47(0)         ( cond )                     i IBC rare
BB47 [0024]  1       BB46                  0           0 [1B8..1C3)                           (throw )                     i IBC rare hascall gcsafe newobj
BB48 [0025]  1       BB46                  0           0 [1C3..1D3)-> BB60(0.48),BB49(0.52)   ( cond )                     i IBC rare
BB49 [0026]  1       BB48                  0           0 [1D3..338)-> BB54(0.48),BB50(0.52)   ( cond )                     i IBC rare
BB50 [0034]  1       BB49                  0           0 [338..371)-> BB52(1)                 (always)                     i IBC rare idxlen
BB51 [0035]  1       BB52                  0           0 [371..3C5)-> BB52(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB52 [0036]  2       BB50,BB51             0           0 [3C5..3D8)-> BB51(0.9),BB53(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB53 [0037]  1       BB52                  0           0 [3D8..3E1)-> BB60(1)                 (always)                     i IBC rare
BB54 [0038]  1       BB49                  0           0 [3E1..400)-> BB60(0.48),BB55(0.52)   ( cond )                     i IBC rare
BB55 [0040]  1       BB54                  0           0 [400..455)-> BB57(1)                 (always)                     i IBC rare idxlen
BB56 [0044]  1       BB57                  0           0 [455..4E4)-> BB57(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB57 [0045]  2       BB55,BB56             0           0 [4E4..4FD)-> BB56(0.9),BB58(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB58 [0046]  1       BB57                  0           0 [4FD..506)-> BB60(1)                 (always)                     i IBC rare
BB59 [0057]  1       BB60                  0           0 [61B..647)-> BB60(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB60 [0058]  5       BB48,BB53,BB54,BB58,BB59   0           0 [647..651)-> BB59(0.9),BB62(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB61 [0060]  1       BB45                  0           0 [652..662)                           (throw )                     i IBC rare hascall gcsafe newobj
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      277888 [000..00B)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB03 [0064]  1       BB01                  1      277888 [000..016)-> BB05(1),BB04(0)         ( cond )                     i IBC
BB05 [0068]  1       BB03                  1      277888 [00B..017)-> BB07(0.2),BB06(0.8)     ( cond )                     i IBC nullcheck
BB06 [0081]  1       BB05                  0.80   222310 [016..017)-> BB07(1)                 (always)                     i IBC hascall gcsafe
BB07 [0082]  2       BB05,BB06             1      277888 [016..017)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0072]  1       BB07                  1      277888 [016..017)-> BB11(1)                 (always)                     i IBC
BB11 [0002]  2       BB08,BB09             1      277888 [???..???)-> BB14(1),BB13(0)         ( cond )                     i IBC hascall
BB14 [0147]  1       BB11                  0.50   138944 [???..???)-> BB15(1)                 (always)                     IBC internal
BB15 [0143]  2       BB13,BB14             1      277888 [02F..039)-> BB20(1),BB16(0)         ( cond )                     i IBC hascall
BB20 [0007]  1       BB15                  1      277888 [???..???)-> BB23(1),BB22(0)         ( cond )                     i IBC hascall
BB23 [0152]  2       BB20,BB22             1      277888 [094..0A1)-> BB45(0),BB25(1)         ( cond )                     i IBC
BB25 [0008]  1       BB23                  1      277888 [0A1..0BA)-> BB27(1),BB26(0)         ( cond )                     i IBC
BB27 [0010]  1       BB25                  1      277888 [0C5..0DB)-> BB30(1),BB41(0)         ( cond )                     i IBC idxlen
BB30 [0098]  1       BB27                  1      277888 [0DA..0DB)-> BB32(0.2),BB31(0.8)     ( cond )                     i IBC idxlen nullcheck
BB31 [0103]  1       BB30                  0.80   222310 [0DA..0DB)-> BB32(1)                 (always)                     i IBC hascall gcsafe
BB32 [0104]  2       BB30,BB31             1      277888 [0DA..0F3)-> BB37(0.00782),BB33(0.992)   ( cond )                     i IBC idxlen nullcheck
BB33 [0139]  2       BB32,BB36           126.94 35274752 [0F3..103)-> BB34(1),BB40(0)         ( cond )                     i IBC idxlen bwd
BB34 [0114]  1       BB33                126.94 35274752 [0F3..104)-> BB36(0.2),BB35(0.8)     ( cond )                     i IBC bwd
BB35 [0125]  1       BB34                101.55 28219802 [103..104)-> BB36(1)                 (always)                     i IBC hascall gcsafe bwd
BB36 [0126]  2       BB34,BB35           126.94 35274752 [103..119)-> BB33(0.992),BB37(0.00782)   ( cond )                     i IBC bwd
BB37 [0141]  2       BB32,BB36             1.00   277888 [119..11E)-> BB39(1),BB38(0)         ( cond )                     i IBC
BB39 [0017]  2       BB37,BB38             1.00   277888 [144..159)-> BB44(0),BB43(0),BB42(0),BB62(1)[def] (switch)                     i IBC
BB62 [0136]  5       BB18,BB19,BB39,BB44,BB60   1.00   277888 [???..???)                           (return)                     IBC internal
BB09 [0073]  1       BB07                  0           0 [016..01F)-> BB11(1),BB10(0)         ( cond )                     i IBC rare nullcheck
BB13 [0146]  1       BB11                  0           0 [???..???)-> BB15(1)                 (always)                     IBC rare internal
BB22 [0151]  1       BB20                  0           0 [???..???)-> BB23(1)                 (always)                     IBC rare internal
BB40 [0113]  1       BB33                  0           0 [0F3..0F4)                           (throw )                     i IBC rare hascall gcsafe bwd
BB38 [0016]  1       BB37                  0           0 [11E..144)-> BB39(1)                 (always)                     i IBC rare idxlen
BB42 [0019]  1       BB39                  0           0 [15A..170)-> BB43(1)                 (always)                     i IBC rare idxlen
BB43 [0020]  2       BB39,BB42             0           0 [170..185)-> BB44(1)                 (always)                     i IBC rare idxlen
BB44 [0021]  2       BB39,BB43             0           0 [185..199)-> BB62(1)                 (always)                     i IBC rare idxlen
BB41 [0140]  1       BB27                  0           0 [103..104)                           (throw )                     i IBC rare gcsafe bwd
BB26 [0009]  1       BB25                  0           0 [0BA..0C5)                           (throw )                     i IBC rare hascall gcsafe newobj
BB45 [0022]  1       BB23                  0           0 [199..1A8)-> BB61(0),BB46(1)         ( cond )                     i IBC rare hascall
BB46 [0023]  1       BB45                  0           0 [1A8..1B8)-> BB48(1),BB47(0)         ( cond )                     i IBC rare
BB48 [0025]  1       BB46                  0           0 [1C3..1D3)-> BB60(0.48),BB49(0.52)   ( cond )                     i IBC rare
BB49 [0026]  1       BB48                  0           0 [1D3..338)-> BB54(0.48),BB50(0.52)   ( cond )                     i IBC rare
BB50 [0034]  1       BB49                  0           0 [338..371)-> BB52(1)                 (always)                     i IBC rare idxlen
BB52 [0036]  2       BB50,BB51             0           0 [3C5..3D8)-> BB51(0.9),BB53(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB51 [0035]  1       BB52                  0           0 [371..3C5)-> BB52(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB53 [0037]  1       BB52                  0           0 [3D8..3E1)-> BB60(1)                 (always)                     i IBC rare
BB54 [0038]  1       BB49                  0           0 [3E1..400)-> BB60(0.48),BB55(0.52)   ( cond )                     i IBC rare
BB55 [0040]  1       BB54                  0           0 [400..455)-> BB57(1)                 (always)                     i IBC rare idxlen
BB57 [0045]  2       BB55,BB56             0           0 [4E4..4FD)-> BB56(0.9),BB58(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB56 [0044]  1       BB57                  0           0 [455..4E4)-> BB57(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB58 [0046]  1       BB57                  0           0 [4FD..506)-> BB60(1)                 (always)                     i IBC rare
BB60 [0058]  5       BB48,BB53,BB54,BB58,BB59   0           0 [647..651)-> BB59(0.9),BB62(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB59 [0057]  1       BB60                  0           0 [61B..647)-> BB60(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB47 [0024]  1       BB46                  0           0 [1B8..1C3)                           (throw )                     i IBC rare hascall gcsafe newobj
BB61 [0060]  1       BB45                  0           0 [652..662)                           (throw )                     i IBC rare hascall gcsafe newobj
BB16 [0003]  1       BB15                  0           0 [039..04E)-> BB18(1),BB17(0)         ( cond )                     i IBC rare
BB18 [0005]  1       BB16                  0           0 [059..07D)-> BB62(0.48),BB19(0.52)   ( cond )                     i IBC rare hascall gcsafe
BB19 [0006]  1       BB18                  0           0 [07D..094)-> BB62(1)                 (always)                     i IBC rare idxlen
BB17 [0004]  1       BB16                  0           0 [04E..059)                           (throw )                     i IBC rare hascall gcsafe newobj
BB10 [0001]  1       BB09                  0           0 [01F..02F)                           (throw )                     i IBC rare hascall gcsafe newobj
BB04 [0067]  1       BB03                  0           0 [00B..00C)                           (throw )                     i IBC rare hascall gcsafe
BB02 [0063]  1       BB01                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB63 [0153]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB64 [0154]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

There are lots of improvements that I'm not extending the same analysis to, and I don't mean for my tone to be pessimistic. My key takeaway is that much of the behavior we don't want from the new layout algorithm can be addressed by using block weights to selectively repair various shapes; we have Phoenix as a guide for a lot of this work. In its current form, though, the new layout algorithm is certainly easier to understand and quite a bit faster: I'm expecting throughput (TP) improvements over 1% for this PR, and follow-up work (should we decide to remove the old layout entirely) will only improve this further.

cc @dotnet/jit-contrib

dotnet-policy-service · 2024-05-16T22:00:57Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jakobbotsch · 2024-05-17T08:54:44Z

System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) regressed by about 14%, due to layout differences in System.Collections.BitArray:Not.

Base layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      900864 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB11 [0018]  1       BB12                 15.60 14054033 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB12 [0019]  2       BB10,BB11            16.58 14933038 [0F7..107)-> BB11(0.941),BB25(0.0589)  ( cond )                     i IBC bwd bwd-src
BB25 [0039]  3       BB12,BB13,BB20        1.00   899963 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd
BB23 [0029]  4       BB01,BB08,BB25,BB27   1.00   900864 [15E..16E)                           (return)                     i IBC
BB09 [0009]  1       BB01                  1.00   899963 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   899963 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB25(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB27(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB27 [0041]  1       BB16                  0           0 [???..???)-> BB23(0),BB20(1)         ( cond )                     IBC rare internal
BB20 [0027]  2       BB25,BB27             0           0 [14F..15A)-> BB25(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Diff layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      930304 [000..039)-> BB23(0.001),BB08(0),BB07(0),BB06(0),BB05(0),BB04(0),BB03(0),BB02(0),BB09(0.999)[def] (switch)                     i IBC
BB09 [0009]  1       BB01                  1.00   929374 [071..0D4)-> BB13(0),BB10(1)         ( cond )                     i IBC nullcheck
BB10 [0033]  1       BB09                  1.00   929374 [0D6..???)-> BB12(1)                 (always)                     IBC internal
BB12 [0019]  2       BB10,BB11            16.50 15349629 [0F7..107)-> BB11(0.94),BB21(0.0605) ( cond )                     i IBC bwd bwd-src
BB11 [0018]  1       BB12                 15.50 14421162 [0D6..0F7)-> BB12(1)                 (always)                     i IBC loophead bwd bwd-target
BB21 [0028]  4       BB12,BB13,BB16,BB20   1.00   929374 [15A..15E)-> BB20(0),BB23(1)         ( cond )                     i IBC bwd bwd-src
BB23 [0029]  3       BB01,BB08,BB21        1      930304 [15E..16E)                           (return)                     i IBC
BB13 [0021]  1       BB09                  0           0 [109..11A)-> BB21(0.48),BB16(0.52)   ( cond )                     i IBC rare
BB16 [0025]  2       BB13,BB15             0           0 [13D..14D)-> BB15(0.9),BB21(0.1)     ( cond )                     i IBC rare bwd bwd-src
BB15 [0024]  1       BB16                  0           0 [11C..13D)-> BB16(1)                 (always)                     i IBC rare loophead bwd bwd-target
BB20 [0027]  1       BB21                  0           0 [14F..15A)-> BB21(1)                 (always)                     i IBC rare loophead idxlen bwd bwd-target
BB02 [0002]  1       BB01                  0           0 [03B..042)-> BB03(1)                 (always)                     i IBC rare idxlen
BB03 [0003]  2       BB01,BB02             0           0 [042..049)-> BB04(1)                 (always)                     i IBC rare idxlen
BB04 [0004]  2       BB01,BB03             0           0 [049..050)-> BB05(1)                 (always)                     i IBC rare idxlen
BB05 [0005]  2       BB01,BB04             0           0 [050..057)-> BB06(1)                 (always)                     i IBC rare idxlen
BB06 [0006]  2       BB01,BB05             0           0 [057..05E)-> BB07(1)                 (always)                     i IBC rare idxlen
BB07 [0007]  2       BB01,BB06             0           0 [05E..065)-> BB08(1)                 (always)                     i IBC rare idxlen
BB08 [0008]  2       BB01,BB07             0           0 [065..071)-> BB23(1)                 (always)                     i IBC rare idxlen
BB24 [0038]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In terms of edge likelihoods, the new layout seems to get the critical paths right. Note, though, that the "greedy" part of the RPO only applies to conditional blocks when deciding which successor to place next. Other multi-successor block kinds, like switch blocks, don't seem common enough to be worth extending the layout's greediness to, though this could be done quite easily as a follow-up (see #101935).

I believe the hot loop BB11<->BB12 is to blame for the regression. BB12 is reachable from BB11 and BB10, and BB11 is reachable only from BB12. When we start the RPO from BB01, we visit BB10, then BB12, and then BB11, which is why the new layout places BB12 before BB11. This introduces more branches: we need a backward jump from BB11 to BB12 within the loop, and once BB12's condition is false, we need to jump over BB11 to reach BB12's false target. If we place BB11 before BB12 instead, BB11 can fall into BB12, and BB12 can eventually fall into its false target after the loop; we only need the single backward jump from BB12 to BB11.
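The visiting order described above can be reproduced with a small model. The following is a minimal sketch, not the JIT's implementation: it assumes the greedy RPO is a depth-first search that visits each block's least likely successor first, so that the most likely successor lands immediately after the block once the post-order is reversed. The `Edge` record, the `RpoLayoutSketch` class, and the trimmed graph (hot blocks of the diff layout only, with BB01's other switch edges and all rare blocks omitted) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical flow edge: a successor block and the likelihood of reaching it.
public sealed record Edge(string Target, double Likelihood);

public static class RpoLayoutSketch
{
    // Returns blocks in reverse post-order. Successors are visited from least
    // to most likely, so the likeliest one ends up right after its predecessor.
    public static List<string> Layout(string entry, IReadOnlyDictionary<string, Edge[]> successors)
    {
        var visited = new HashSet<string>();
        var postOrder = new List<string>();

        void Visit(string block)
        {
            if (!visited.Add(block))
            {
                return; // already visited (e.g. reached again via a loop back edge)
            }

            if (successors.TryGetValue(block, out var edges))
            {
                foreach (var edge in edges.OrderBy(e => e.Likelihood))
                {
                    Visit(edge.Target);
                }
            }

            postOrder.Add(block);
        }

        Visit(entry);
        postOrder.Reverse();
        return postOrder;
    }

    // Hot path of BitArray:Not, with likelihoods copied from the diff layout dump.
    public static List<string> BitArrayNotHotPath()
    {
        var successors = new Dictionary<string, Edge[]>
        {
            ["BB01"] = new[] { new Edge("BB09", 0.999) },
            ["BB09"] = new[] { new Edge("BB10", 1.0) },
            ["BB10"] = new[] { new Edge("BB12", 1.0) },
            ["BB12"] = new[] { new Edge("BB11", 0.94), new Edge("BB21", 0.0605) },
            ["BB11"] = new[] { new Edge("BB12", 1.0) },
            ["BB21"] = new[] { new Edge("BB23", 1.0) },
        };

        return Layout("BB01", successors);
    }
}
```

Under these assumptions, `RpoLayoutSketch.BitArrayNotHotPath()` returns BB01, BB09, BB10, BB12, BB11, BB21, BB23, which is the order of the hot blocks in the diff layout: BB12 is placed before BB11 because BB11 is only discovered through BB12.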

Hmm, this makes me a bit leery. It is very common to lay out loops in this way to avoid this extra branch; Roslyn does that in IL for us, which the layout algorithm then effectively undoes. I think this affects all loops that we don't enter at the top.
The saving grace is that usually loop inversion will kick in and make most loops entered at the top. But I think it would be good to collect some stats over how many loops this affects.

Here is a simple example:

private static int Sum(int[] arr)
{
    int i = 0;
    int sum = 0;
    while (i < arr.Length && arr[i] != 0)
    {
        sum += arr[i];
        i++;
    }

    return sum;
}

Base:

G_M57365_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       xor      edx, edx
       jmp      SHORT G_M57365_IG04
						;; size=9 bbWeight=1 PerfScore 4.50
G_M57365_IG03:  ;; offset=0x000D
       add      eax, dword ptr [rcx+4*rdx+0x10]
       inc      edx
						;; size=6 bbWeight=2 PerfScore 6.50
G_M57365_IG04:  ;; offset=0x0013
       cmp      dword ptr [rcx+0x08], edx
       jle      SHORT G_M57365_IG06
						;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG05:  ;; offset=0x0018
       cmp      dword ptr [rcx+4*rdx+0x10], 0
       jne      SHORT G_M57365_IG03
						;; size=7 bbWeight=4 PerfScore 16.00

Diff:

G_M57365_IG02:  ;; offset=0x0004
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       xor      edx, edx
						;; size=7 bbWeight=1 PerfScore 2.50
G_M57365_IG03:  ;; offset=0x000B
       cmp      dword ptr [rcx+0x08], edx
       jle      SHORT G_M57365_IG06
						;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG04:  ;; offset=0x0010
       cmp      dword ptr [rcx+4*rdx+0x10], 0
       je       SHORT G_M57365_IG06
						;; size=7 bbWeight=4 PerfScore 16.00
G_M57365_IG05:  ;; offset=0x0017
       add      eax, dword ptr [rcx+4*rdx+0x10]
       inc      edx
       jmp      SHORT G_M57365_IG03
						;; size=8 bbWeight=2 PerfScore 10.50
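
To make the difference between the two listings concrete, here is a rough sketch that counts executed branch instructions for each layout of `Sum`. It is a simplified model, not a measurement: the block names (`Init`, `Test1`, `Test2`, `Body`, `Exit`), the `BranchCountSketch` class, and the rule for when a jump is emitted are assumptions chosen to mirror the disassembly above, and the trace assumes the loop leaves through the length check after a fixed number of iterations.

```csharp
using System;
using System.Collections.Generic;

public static class BranchCountSketch
{
    // Hypothetical flow graph for Sum. One successor models unconditional flow;
    // two successors model a conditional block.
    private static readonly Dictionary<string, string[]> Successors = new Dictionary<string, string[]>
    {
        ["Init"] = new[] { "Test1" },           // i = 0; sum = 0
        ["Test1"] = new[] { "Exit", "Test2" },  // i < arr.Length
        ["Test2"] = new[] { "Body", "Exit" },   // arr[i] != 0
        ["Body"] = new[] { "Test1" },           // sum += arr[i]; i++
        ["Exit"] = Array.Empty<string>(),       // return sum
    };

    // Block orders matching the Base and Diff listings.
    public static readonly List<string> BaseLayout = new List<string> { "Init", "Body", "Test1", "Test2", "Exit" };
    public static readonly List<string> DiffLayout = new List<string> { "Init", "Test1", "Test2", "Body", "Exit" };

    // Block trace for a run that executes the loop body `iterations` times
    // and then leaves through the length check.
    public static List<string> TraceFor(int iterations)
    {
        var trace = new List<string> { "Init" };
        for (int i = 0; i < iterations; i++)
        {
            trace.AddRange(new[] { "Test1", "Test2", "Body" });
        }
        trace.Add("Test1");
        trace.Add("Exit");
        return trace;
    }

    // Counts the jmp/jcc instructions executed along the trace for a layout.
    public static int CountExecutedBranches(List<string> layout, List<string> trace)
    {
        int count = 0;
        for (int i = 0; i + 1 < trace.Count; i++)
        {
            string current = trace[i];
            string taken = trace[i + 1];
            int position = layout.IndexOf(current);
            string next = position + 1 < layout.Count ? layout[position + 1] : "";
            string[] targets = Successors[current];

            if (targets.Length == 1)
            {
                // Unconditional flow needs a jmp only if the target is not the next block.
                if (targets[0] != next) count++;
            }
            else if (targets.Length == 2)
            {
                // A conditional block always executes its jcc.
                count++;
                // If neither target is next, assume jcc goes to targets[0] and an
                // extra jmp runs whenever the other target is taken.
                if (Array.IndexOf(targets, next) < 0 && taken != targets[0]) count++;
            }
        }
        return count;
    }
}
```

For 100 iterations, `CountExecutedBranches(BaseLayout, TraceFor(100))` gives 202 in this model (two branches per iteration, plus the entry jmp and the final length check), while the diff layout gives 301 (three per iteration, plus the final length check).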

amanasifkhalid · 2024-05-17T15:59:03Z

> Hmm, this makes me a bit leery. It is very common to lay out loops in this way to avoid this extra branch; Roslyn does that in IL for us, which the layout algorithm then effectively undoes. I think this affects all loops that we don't enter at the top.
> The saving grace is that usually loop inversion will kick in and make most loops entered at the top. But I think it would be good to collect some stats over how many loops this affects.

Yeah, I'm a bit concerned about this too, considering it affected the pretty idiomatic example you gave. In the benchmarks.run_pgo collection, I found 5,676 loops with backward jumps from their loop heads using the new layout, as opposed to 5,458 loops with the old layout. If we only look at loops that aren't rarely run, these numbers drop to 3,111 and 3,755, respectively.

@jakobbotsch are there other collections you'd like me to specifically look at? And if you think it's worth addressing this in the layout's implementation, what should our merge strategy look like? Would you want to run the experiment for a bit longer with the loop inversion fix before enabling to see what the improvements look like?

amanasifkhalid · 2024-05-17T17:22:06Z

> Yeah, I'm a bit concerned about this too, considering it affected the pretty idiomatic example you gave. In the benchmarks.run_pgo collection, I found 5,676 loops with backward jumps from their loop heads using the new layout, as opposed to 5,458 loops with the old layout. If we only look at loops that aren't rarely run, these numbers drop to 3,111 and 3,755, respectively.

I reran my analysis across all SPMI collections we currently have on win x64, and for loops that aren't rarely run, the loop head ends with a backward jump for 74,213 loops with the new layout, versus 68,315 loops with the old layout. There's definitely some double-counting here across the benchmarks.* and libraries_tests.* collections, though the new layout does seem marginally more susceptible to introducing this shape.

amanasifkhalid · 2024-05-17T21:08:57Z

Now that I think about it, we ought to try something similar to what Phoenix does after creating the RPO-based layout: move a block's hottest predecessor to just before it, so that loop heads don't end up at the end of loop bodies. Looking around various GitHub issues like #9304, I think this is low-hanging fruit we can address up front.

EgorBo · 2024-05-19T18:15:54Z

@EgorBot --disasm

using System.Threading; // Interlocked
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{
    private int _int = 0;
    private long _long = 0;
    private string _location, _newValue, _comparand;

    [GlobalSetup(Target = nameof(CompareExchange_object_NoMatch))]
    public void Setup_CompareExchange_object_NoMatch()
    {
        _location = "Hello";
        _newValue = "World";
        _comparand = "What?";
    }

    [Benchmark]
    public string CompareExchange_object_NoMatch() 
        => Interlocked.CompareExchange(ref _location, _newValue, _comparand);
}

EgorBo · 2024-05-19T18:35:41Z

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores

Method	Toolchain	Mean	Ratio	Code Size
CompareExchange_object_NoMatch	Main	4.927 ns	1.00	43 B
CompareExchange_object_NoMatch	PR	4.979 ns	1.01	43 B

BDN_Artifacts.zip

EgorBo · 2024-05-19T18:37:02Z

I was just testing my bot. That said, @amanasifkhalid, you mentioned that Setup_CompareExchange_object_NoMatch regressed by 40%; was that Windows-specific?

EgorBo · 2024-05-19T18:38:24Z

Although, looking at the BDN_Artifacts, there does seem to be an ASM difference, namely: https://www.diffchecker.com/hu5vJF0i/ (no idea which side is the baseline and which is the PR in that diff)

amanasifkhalid · 2024-05-20T13:20:56Z

@EgorBo yes, that was on Windows x64. That number also came from me looking at min execution times, so it's possible the baseline had an unusually good run?

Nice bot, by the way

amanasifkhalid · 2024-05-20T15:53:47Z

I've opened #102461 to address the loop inversion issue. If we decide we want that change, should we let the layout experiment run for another week or so? I'm fine with punting this change to Preview 6. @dotnet/jit-contrib

…ayout (#102461) Part of #93020. In #102343, we noticed the RPO-based layout sometimes makes suboptimal decisions when placing a block's hottest predecessor before it; in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move blocks that end in backward unconditional jumps to right before their targets to create fallthrough, if the predecessor block is sufficiently hot.
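
The repair step described above can be sketched as follows. This is an illustration only: the real fgMoveBackwardJumpsToSuccessors lives in the JIT, and its exact hotness threshold is not stated in this thread. The `BackwardJumpSketch` class, its parameters, and the rule "move only if the jumping block is hotter than the block currently laid out before the target" are assumptions; the weights are copied from the BitArray:Not diff layout.

```csharp
using System.Collections.Generic;

public static class BackwardJumpSketch
{
    // Moves each block that ends in a backward unconditional jump to just
    // before its target, so the jump becomes fallthrough.
    public static List<string> MoveBackwardJumps(
        IReadOnlyList<string> layout,
        IReadOnlyDictionary<string, string> unconditionalTarget,
        IReadOnlyDictionary<string, double> weight)
    {
        var result = new List<string>(layout);

        foreach (string block in layout)
        {
            if (!unconditionalTarget.TryGetValue(block, out var target))
            {
                continue; // block does not end in an unconditional jump
            }

            int from = result.IndexOf(block);
            int to = result.IndexOf(target);
            if (to < 0 || to >= from)
            {
                continue; // unknown target, or the jump is not backward
            }

            // Assumed hotness rule: only give up the fallthrough of the block
            // currently before the target if the jumping block is hotter.
            if (to > 0 && weight[result[to - 1]] >= weight[block])
            {
                continue;
            }

            result.RemoveAt(from);
            result.Insert(to, block);
        }

        return result;
    }

    // Applies the sketch to the hot blocks of the BitArray:Not diff layout.
    public static List<string> BitArrayNotExample()
    {
        var layout = new List<string> { "BB01", "BB09", "BB10", "BB12", "BB11", "BB21", "BB23" };
        var unconditionalTarget = new Dictionary<string, string>
        {
            ["BB10"] = "BB12",
            ["BB11"] = "BB12",
        };
        var weight = new Dictionary<string, double>
        {
            ["BB01"] = 1.0, ["BB09"] = 1.0, ["BB10"] = 1.0, ["BB12"] = 16.5,
            ["BB11"] = 15.5, ["BB21"] = 1.0, ["BB23"] = 1.0,
        };

        return MoveBackwardJumps(layout, unconditionalTarget, weight);
    }
}
```

With these inputs, `BackwardJumpSketch.BitArrayNotExample()` returns BB01, BB09, BB10, BB11, BB12, BB21, BB23. BB11 (weight 15.5) is moved ahead of BB12 because it is hotter than BB10 (weight 1.0), so the loop's only remaining jump is the backward conditional from BB12 to BB11, and the cold entry from BB10 pays for a single jump over BB11.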

amanasifkhalid · 2024-05-22T17:05:56Z

Updated diffs. @EgorBo since we addressed the mis-rotated loop issue with #102461, are you ok with merging this as-is?

EgorBo

Some of the diffs look very nice! Looking forward to dotnet/performance results 🙂

AndyAyersMS · 2024-05-22T18:21:07Z

Nice to see this enabled! Thanks for digging into the original set of diffs and fixing problems.

amanasifkhalid · 2024-05-22T18:27:13Z

@AndyAyersMS @jakobbotsch @EgorBo thank you all for your help with getting this merged!

Enable RPO-based block layout by default

00b4b6d

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 16, 2024

dotnet-policy-service bot assigned amanasifkhalid May 16, 2024

amanasifkhalid mentioned this pull request May 17, 2024

JIT: poor CQ for list enumeration constructs #9304

Open

amanasifkhalid mentioned this pull request May 20, 2024

JIT: Move backward jumps to before their successors after RPO-based layout #102461

Merged

JulieLeeMSFT requested review from EgorBo and jakobbotsch May 20, 2024 16:52

JulieLeeMSFT added the Priority:2 Work that is important, but not critical for the release label May 20, 2024

amanasifkhalid mentioned this pull request May 21, 2024

JIT: Compact blocks in fgMoveBackwardJumpsToSuccessors #102512

Merged

amanasifkhalid closed this May 22, 2024

amanasifkhalid reopened this May 22, 2024

EgorBo approved these changes May 22, 2024

amanasifkhalid merged commit f02a695 into dotnet:main May 22, 2024
107 checks passed

amanasifkhalid deleted the enable-rpo-layout branch May 22, 2024 17:55

amanasifkhalid mentioned this pull request May 22, 2024

Move jitopt and rpo experiments to PerfVipers #102575

Merged

EgorBo mentioned this pull request May 28, 2024

Widespread perf regressions due to RPO layout #102763

Open

steveharter pushed a commit to steveharter/runtime that referenced this pull request May 28, 2024

JIT: Enable RPO-based block layout by default (dotnet#102343)

84397d1

Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024

JIT: Enable RPO-based block layout by default (dotnet#102343)

8472840

github-actions bot locked and limited conversation to collaborators Jun 22, 2024
