JIT: Don't do aggressive block compaction too early #105041

amanasifkhalid · 2024-07-17T16:58:23Z

Several of the regressions in #103972 stem from early aggressive block compaction pessimizing loop inversion. For example, consider the following block layout early in compilation for System.Linq.Tests.Perf_Enumerable.Aggregate_Seed. If we don't aggressively compact in fgUpdateFlowGraphPhase, we get the following layout right before loop inversion:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..003)-> BB03(0.5),BB02(0.5)     ( cond )                     i
BB02 [0001]  1       BB01                  0    [003..00A)                           (throw )                     i rare hascall gcsafe
BB03 [0002]  1       BB01                  1    [00A..00D)-> BB05(0.5),BB04(0.5)     ( cond )                     i
BB04 [0003]  1       BB03                  0    [00D..013)                           (throw )                     i rare hascall gcsafe
BB05 [0004]  1       BB03                  1    [013..01C)-> BB06(1)                 (always)                     i hascall gcsafe
BB06 [0005]  1  0    BB05                  1    [01C..01E)-> BB08(1)                 (always) T0      try {       i keep
BB07 [0006]  1  0    BB08                  1    [01E..02E)-> BB08(1)                 (always) T0                  i hascall gcsafe bwd bwd-target
BB08 [0007]  2  0    BB06,BB07             1    [02E..036)-> BB07(0.5),BB16(0.5)     ( cond ) T0      }           i hascall bwd bwd-src
BB16 [0015]  1       BB08                  1    [038..03B)-> BB17(1)                 (always)                     i keep cfb
BB17 [0016]  1       BB16                  1    [03B..042)-> BB15(1)                 (always)                     i hascall gcsafe
BB15 [0012]  1       BB17                  1    [042..044)                           (return)                     i
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB12 [0009]  1     0                       1    [038..03B)-> BB14(0.5),BB13(0.5)     ( cond )    H0 F fault {     i keep flet
BB13 [0010]  1     0 BB12                  1    [03B..041)-> BB14(1)                 (always)    H0               i hascall gcsafe
BB14 [0011]  2     0 BB12,BB13             1    [041..042)                           (falret)    H0   }           i
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In the try region, there's an opportunity to aggressively compact, and get the following loop shape:

BB06 [0005]  2  0    BB05,BB07             1    [01C..036)-> BB07(0.5),BB16(0.5)     ( cond ) T0      try {       i keep hascall bwd
BB07 [0006]  1  0    BB06                  1    [01E..02E)-> BB06(1)                 (always) T0      }           i hascall gcsafe bwd bwd-target
BB16 [0015]  1       BB06                  1    [038..03B)-> BB17(1)                 (always)                     i keep cfb

Loop inversion won't kick in for this, and since fgMoveHotJumps won't modify the first block of an EH region, we don't get the opportunity to flip this shape, so we end up with subpar loop layout.

If we don't aggressively compact, loop inversion kicks in, and we get the following shape:

BB06 [0005]  1  0    BB05                  1    [01C..01E)-> BB15(1)                 (always) T0      try {       i keep
BB15 [0018]  1  0    BB06                  1    [???..???)-> BB09(0.5),BB07(0.5)     ( cond ) T0                  internal
BB07 [0006]  2  0    BB08,BB15             1    [01E..02E)-> BB08(1)                 (always) T0                  i hascall gcsafe bwd bwd-target
BB08 [0007]  1  0    BB07                  1    [02E..036)-> BB07(0.5),BB09(0.5)     ( cond ) T0      }           i hascall bwd bwd-src
BB09 [0015]  2       BB08,BB15             1    [038..03B)-> BB10(1)                 (always)                     i keep cfb

After flow opts and block layout run, our final shape looks like this:

BB04 [0005]  1  0    BB03                  1    [01C..01E)-> BB06(0.5),BB05(0.5)     ( cond ) T0      try {       i keep
BB05 [0019]  2  0    BB04,BB05             4    [01E..036)-> BB05(0.5),BB06(0.5)     ( cond ) T0      }           i loophead gcsafe bwd
BB06 [0015]  2       BB04,BB05             1    [038..044)                           (return)                     i keep gcsafe cfb

I don't think we are losing anything by reserving aggressive compaction for later flow opt phases, and the upside is more opportunities for loop inversion, which may unlock other loop opts.

dotnet-policy-service · 2024-07-17T16:58:56Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

amanasifkhalid · 2024-07-17T18:32:33Z

Here's another interesting consequence: Aggressive compaction would sometimes move a loop inside a try region up, such that the loop header is now the first block in the region. Loop cloning gives up if it finds the start of a try region in the loop, so with this change, I'm now seeing more loop cloning.

amanasifkhalid · 2024-07-17T19:18:03Z

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs have some pretty big code size increases. The jit-analyze output reveals much of the net size increase stems from more loop cloning. Looking at the example diffs, I see loop inversion is frequently increasing code size slightly, but ultimately improving PerfScore (in general, it looks like many of the small size improvements are PerfScore regressions, and the small size regressions tend to be PerfScore improvements). #103785 had some pretty substantial TP improvements, so the TP cost here isn't as bad as it looks.

AndyAyersMS · 2024-07-18T17:30:45Z

I worry about this much churn this late in the cycle. Seems like it will give the usual mix of improvements and some regressions.

I wonder if we could just make loop inversion (and perhaps loop cloning) smarter? Inversion is very pattern-matchy. Though perhaps that causes the same or worse level of churn.

amanasifkhalid · 2024-07-18T19:01:45Z

I worry about this much churn this late in the cycle. Seems like it will give the usual mix of improvements and some regressions.
I wonder if we could just make loop inversion (and perhaps loop cloning) smarter? Inversion is very pattern-matchy. Though perhaps that causes the same or worse level of churn.

I share your concerns about churn; for similar reasons, I'm nervous about tweaking loop inversion. The diffs for this change are large, but the problem it's trying to address is somewhat narrow: We compact blocks early on such that we get this weird loop shape that loop inversion can't fix, and the loop is now the beginning of a try region, so fgMoveHotJumps won't try to fix it. Since this is a pretty narrow case, I'm ok with us not taking this until later (I think in the next release, we ought to try relaxing some of the EH restrictions during layout, like being able to change the first block of a try region if it doesn't create flow into the middle of it).

I also think the diffs on this PR are misleadingly large. I ran SPMI on this change versus not doing aggressive compaction at all (i.e. without #103785), and this change seems to be reversing a lot of the churn I initially created: The diffs are much smaller. Here's the short summary:

Diffs are based on 2,696,572 contexts (1,091,355 MinOpts, 1,605,217 FullOpts).

MISSED contexts: base: 29,302 (1.07%), diff: 29,214 (1.07%)

Overall (-2,847 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	64,452,395	-470	-3.19%
benchmarks.run.windows.x64.checked.mch	11,307,991	-215	-16.48%
benchmarks.run_pgo.windows.x64.checked.mch	41,871,649	-298	-5.17%
benchmarks.run_tiered.windows.x64.checked.mch	12,538,466	+73	+16.53%
coreclr_tests.run.windows.x64.checked.mch	392,985,123	+712	+1.44%
libraries.crossgen2.windows.x64.checked.mch	36,986,365	-87	-1.19%
libraries.pmi.windows.x64.checked.mch	62,824,530	+83	+22.81%
libraries_tests.run.windows.x64.Release.mch	313,793,334	-3,665	-8.06%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	146,552,499	+1,272	-7.66%
realworld.run.windows.x64.checked.mch	12,214,184	-90	-2.95%
smoke_tests.nativeaot.windows.x64.checked.mch	3,489,201	-162	+85.66%

FullOpts (-2,847 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	36,418,630	-470	-3.19%
benchmarks.run.windows.x64.checked.mch	11,307,379	-215	-16.48%
benchmarks.run_pgo.windows.x64.checked.mch	22,288,213	-298	-5.17%
benchmarks.run_tiered.windows.x64.checked.mch	3,234,203	+73	+16.53%
coreclr_tests.run.windows.x64.checked.mch	120,195,377	+712	+1.44%
libraries.crossgen2.windows.x64.checked.mch	36,984,660	-87	-1.19%
libraries.pmi.windows.x64.checked.mch	62,711,569	+83	+22.81%
libraries_tests.run.windows.x64.Release.mch	125,753,062	-3,665	-8.06%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	135,705,717	+1,272	-7.66%
realworld.run.windows.x64.checked.mch	11,801,081	-90	-2.95%
smoke_tests.nativeaot.windows.x64.checked.mch	3,488,230	-162	+85.66%

amanasifkhalid · 2024-07-19T14:25:58Z

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

AndyAyersMS · 2024-07-19T14:53:51Z

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

I am thinking we should hold off. But let me look at which benchmarks had regressions a bit more closely first.

AndyAyersMS · 2024-07-19T15:28:07Z

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

I am thinking we should hold off. But let me look at which benchmarks had regressions a bit more closely first.

Ok, some of them are pretty large, and the number of improvements from #103785 was fairly small. Can you repro some of the worst-case regressions locally, and see that they're now fixed?

I built a collated table here: #103972 (comment)

AndyAyersMS · 2024-07-19T17:24:13Z

Maybe also look at #50204? It might be an interesting alternative, if that's all that it takes to "fix" loop inversion.

amanasifkhalid · 2024-07-19T19:40:51Z

Ok, some of them are pretty large, and the number of improvements from #103785 was fairly small. Can you repro some of the worst-case regressions locally, and see that they're now fixed?

Thanks for taking a look at the regressions! I'm not able to reproduce any improvement for the largest amd64 regressions locally, unfortunately. Looking at some of the JIT dumps, I see diffs from more conservative block compaction, but this isn't manifesting significant codegen diffs, which is odd...

Maybe also look at #50204? It might be an interesting alternative, if that's all that it takes to "fix" loop inversion.

Thanks for pointing that out. I've opened #105161 -- the asmdiffs looked promising locally. Maybe we can pursue that change for P7, and come back to our block compaction strategy (among other flow opts) in .NET 10.

amanasifkhalid · 2024-08-12T17:58:40Z

Let's keep the new compaction on for now, and look into adjusting affected phases to handle the shapes they're struggling with.

Don't do aggressive compaction too early

edbd5e0

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 17, 2024

dotnet-policy-service bot assigned amanasifkhalid Jul 17, 2024

This was referenced Jul 17, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

amanasifkhalid mentioned this pull request Jul 19, 2024

JIT: Enable inversion of top-tested loops #105161

Open

amanasifkhalid added this to the 10.0.0 milestone Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Don't do aggressive block compaction too early #105041

JIT: Don't do aggressive block compaction too early #105041

amanasifkhalid commented Jul 17, 2024 •

edited

Loading

dotnet-policy-service bot commented Jul 17, 2024

amanasifkhalid commented Jul 17, 2024

amanasifkhalid commented Jul 17, 2024

AndyAyersMS commented Jul 18, 2024

amanasifkhalid commented Jul 18, 2024

amanasifkhalid commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

amanasifkhalid commented Jul 19, 2024

amanasifkhalid commented Aug 12, 2024

JIT: Don't do aggressive block compaction too early #105041

Are you sure you want to change the base?

JIT: Don't do aggressive block compaction too early #105041

Conversation

amanasifkhalid commented Jul 17, 2024 • edited Loading

dotnet-policy-service bot commented Jul 17, 2024

amanasifkhalid commented Jul 17, 2024

amanasifkhalid commented Jul 17, 2024

AndyAyersMS commented Jul 18, 2024

amanasifkhalid commented Jul 18, 2024

amanasifkhalid commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

AndyAyersMS commented Jul 19, 2024

amanasifkhalid commented Jul 19, 2024

amanasifkhalid commented Aug 12, 2024

amanasifkhalid commented Jul 17, 2024 •

edited

Loading