Benchmark Regressions from Profile Maintenance and Block Layout Changes #113108
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Starting from the top: currently, profile synthesis will set the weight of handler region entries to something close to zero, but it won't mark them as rarely run. We probably don't want to change this, because marking them as rarely run would block inlining within handlers, which matters for finally regions. If we keep handler regions from being marked as rarely run after … I wasn't able to repro the regression with EgorBot, but I'll take a look locally.
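The policy described above can be sketched as follows; the field names and the epsilon constant are illustrative stand-ins, not the JIT's actual data structures:

```python
EPSILON = 1e-4  # illustrative near-zero weight; not the JIT's actual constant

def synthesize_handler_entry():
    """Sketch: give a handler region entry a tiny but nonzero weight while
    leaving the rarely-run flag unset, so inlining inside handlers (which
    matters for finally regions) is not blocked."""
    return {"weight": EPSILON, "run_rarely": False}
```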
I'm able to repro the regression on my Windows AMD machine, but the difference is quite inconsistent. From run to run, the difference is either just a nanosecond, or over 2x. For example:
I'm failing to find any codegen diffs along the call stack of this benchmark, and the potential fix would likely have its own slew of regressions, so the return on pursuing these regressions likely isn't as high as for others that are directly related to layout churn. I'll keep this one in mind for later...
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj regressed and then later improved due to flowgraph/profile churn. Its current regressed status is from #112060, and is tracked in #112657. This is also true for System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_FirstElementMatches(input: Array). I'll remove these from the above chart.
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) improved on both platforms once late profile repair was enabled. On Viper Windows, it regressed again around 2/27/2025. I can't find an issue tracking this regression, but nothing in the commit range is flowgraph-related. |
I've annotated the most-regressed benchmarks with culprit PRs based on when the last notable regression occurred.
I looked at a few of the regressions from #103450, and for these cases, 3-opt is keen to make bottom-tested loops top-tested. For example, here's the initial layout for …
Both loops are bottom-tested, and we fall out of the inner loop into the outer loop. 3-opt transforms this into the following:
It's the same case for …
For hot loops, 3-opt finds it massively profitable to create fallthrough along a loop backedge, since the entry edges are considerably lower in weight, but this gives a worse loop shape overall. I think we'd be justified in inhibiting the creation of fallthrough along loop backedges if it would make the loop top-tested. A targeted implementation would be to detect loop backedges in …
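The per-iteration cost difference between the two loop shapes can be sketched with a simple dynamic branch count (illustrative only; it ignores branch prediction and code-size effects):

```python
def bottom_tested_branches(n):
    # Bottom-tested (rotated) loop: the conditional test sits at the
    # bottom, so each of the n iterations executes exactly one branch;
    # the final test falls through out of the loop.
    return n

def top_tested_branches(n):
    # Top-tested loop with an unconditional backedge: the test runs
    # n + 1 times (the last execution exits) and the backedge jump
    # runs n times, so every iteration pays two control transfers.
    return (n + 1) + n
```

This is why creating fallthrough along the backedge, which forces a top-tested shape, costs more per iteration even when the swap looks profitable to the cost model.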
By restricting 3-opt from creating fallthrough along loop backedges, and restricting hot jump compaction from considering loop exit edges (#113101), layout is looking better. For …
I'm sure these two changes will incur pretty big diffs, but I think they're focused enough that (hopefully) the effects will be predictable. Checking if an edge is a loop exit/backedge requires iterating up to every loop edge in the method, so we don't want to do this lookup every time in …
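One way to avoid that repeated scan is to precompute the backedge set once before running 3-opt; a minimal sketch, with hypothetical `(header, edges)` inputs standing in for the JIT's loop structures:

```python
def collect_backedges(loops):
    """loops: iterable of (header, edges) pairs, where edges are
    (src, dst) block-ID tuples belonging to that loop. Returns a set
    supporting O(1) 'is this edge a backedge?' queries."""
    backedges = set()
    for header, edges in loops:
        for src, dst in edges:
            if dst == header:  # an edge returning to the loop header
                backedges.add((src, dst))
    return backedges
```

Built once per method, this makes the check inside the layout inner loop a constant-time set lookup rather than a walk over every loop's edges.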
With #113101, some of the 4-opt regressions have layouts that are further away from a locally-minimal cost. For example, without the change, this is what the layout of … looks like:
With it, instead of …
I probably shouldn't be trying to artificially inhibit 3-opt's cost model for such cases, like I am in #113101. For other 4-opt regressions, we have loops with unconditional backedges, and there doesn't seem to be some obviously optimal rotation of the loop. For example, from …
In this case, both loop heads …
Part of #113108. When converting an unconditional jump to a test block into a duplicate test block, fgOptimizeBranch flips the condition (unnecessarily -- I plan to fix this someday), which means we need to derive the new true edge's likelihood from the cloned false edge, and vice versa. Now that late profile repair is enabled, if we get this wrong, the flipped edge likelihoods will flatten loop weights, and inflate non-loop weights. I suspect fixing this will also revert some of the regressions from late profile repair.
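To see why swapped likelihoods flatten loop weights, consider a sketch of cyclic-probability weight reconstruction (the `entry / (1 - p)` form is the standard one; the numbers are illustrative, not taken from the JIT):

```python
def header_weight(entry_weight, backedge_likelihood):
    # Under cyclic-probability reconstruction, a loop header re-entered
    # via a backedge taken with probability p has weight entry / (1 - p).
    return entry_weight / (1.0 - backedge_likelihood)

# A hot loop whose backedge is taken 90% of the time:
hot = header_weight(1.0, 0.9)      # 10.0: the loop body looks hot
# If the duplicated test's edge likelihoods are accidentally swapped,
# the backedge appears to be taken only 10% of the time:
flipped = header_weight(1.0, 0.1)  # ≈ 1.11: the loop weight flattens
```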
Part of #107749. This is part of my broader goal of phasing out fgReorderBlocks entirely, but there are a few transformations we may want to do right before block layout. In particular, I noticed in #113108 (comment) that some regressed benchmarks have loops with unconditional backedges; in the absence of better loop inversion, we might want to run our unconditional-to-conditional branch duplication optimization right before layout to improve branch behavior in loops. We can put such optimizations in this pre-layout phase.
Note that for this benchmark, and the above ones with unconventional loop rotations, the external TSP optimizer failed to improve upon 3-opt's layout, so this isn't an algorithmic limitation we're hitting, but a limit of the cost model. I haven't yet tried tweaking the cost model by integrating control penalty constants, though this could be one way forward. Young et al's Near-optimal Intraprocedural Branch Alignment derives its constants from the pipeline penalties of a particular ISA -- we don't have an obvious equivalent for our purposes. And if we did, changing the cost model would likely incur significant churn. I'm increasingly motivated by these examples to implement some late form of loop inversion, where for each unconditional backedge, we consider cloning the loop's condition and making the backedge conditional. As a side note, it seems odd that …
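A penalty-constant cost model in the spirit of Young et al. might look like the following sketch; the constants are placeholders, not measured pipeline penalties for any real microarchitecture:

```python
# Illustrative penalties per executed control transfer; a real model
# would derive these from a specific pipeline's mispredict/redirect costs.
PENALTY = {
    "fallthrough": 0.0,           # no transfer executed
    "taken_conditional": 2.0,     # taken conditional branch
    "not_taken_conditional": 1.0, # conditional branch that falls through
    "unconditional": 1.0,         # unconditional jump
}

def layout_cost(edges):
    """edges: iterable of (weight, kind) pairs describing how often each
    control transfer executes, and what kind it is, under a given layout."""
    return sum(weight * PENALTY[kind] for weight, kind in edges)
```

Two layouts of the same flowgraph produce different `(weight, kind)` mixes, so comparing their costs would let the optimizer prefer the cheaper branch mix rather than raw fallthrough weight alone.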
Adding a pre-layout pass of …
Instead of entering the loop from …
Part of #107749. fgReorderBlocks runs fgOptimizeBranch when it decides not to reorder a particular block. Turning off the old layout in favor of RPO layout in .NET 9 had the unintended consequence of also disabling the late fgOptimizeBranch pass the JIT used to run. After finding cases in #113108 that benefit from cloning and hoisting loop tests, I decided to reintroduce this late pass. These regressions also highlight that fgOptimizeBranch can create compaction opportunities inside loop bodies that we don't currently take advantage of; I've added a check to handle this.
#113491 is in, but it's a bit too early to see churn in the benchmarks. I've recomputed the above regression table with a few modifications to help track down the remaining interesting regressions:
This yields a much shorter table, consisting of regressions smaller in magnitude than the original triage suggested.
For …, I looked at the ADX query for this benchmark to see if profile repair regressed it on other platforms, and I didn't see anything interesting. However, the win-amd configuration improved around the time #113491 went in, so I suspect this regression will be closed soon:
As I anticipated, the …
#113491 resolves … In the latter case, it looks like 3-opt is failing to model some architecture-specific behavior, since I don't see similarly stubborn regressions on other platforms. This seems to be a theme for layout regressions: one platform regresses while another improves. Perhaps we could leverage hardware vendors' optimization guides to develop platform-specific cost models, though that's not in the cards for .NET 10. For what it's worth, Tiger machines improved from #113491 as well:
#111873 destabilized … Either way, the layout for this benchmark looks good:
#112151 may address the CSE regressions; I'll continue this work with that PR.
This leaves about 20 benchmarks that meaningfully regressed due to either 3-opt (#103450, #110277) or late profile repair (#111915); many of the regressions stemming from the latter are layout-related. After analyzing the most significant regressions, it seems 3-opt successfully optimizes for its cost model often enough that we don't see cases in the perf lab where it clearly misses a globally-optimal solution. However, these remaining regressions show that 3-opt's cost model does not always capture the nuances of the target architecture. @AndyAyersMS has proposed some ideas for building a more sophisticated model. Aside from trying to differentiate the costs of different types of branches, we could also try to model how each partition swap changes the lengths of the jumps within each partition. I still don't know of a way to compute this cheaply, but a brute-force solution could be useful for determining whether we can gain anything from this.

For .NET 10, I don't plan on significantly changing 3-opt's cost model at this point. Thus, I'm going to start closing tracking issues for layout-related regressions, though I'll keep the profile/CSE ones open for now, since I have some current/planned PRs that might help them. Before doing anything, I'll build a table of improvements corresponding to the PRs analyzed above so we can get a sense of what we're gaining from this tradeoff.
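The brute-force jump-length idea could be prototyped by recomputing byte offsets for each candidate layout; a sketch under assumed inputs (block sizes, edge weights, and the linear distance cost are all illustrative, not real encoding sizes):

```python
def jump_length_cost(layout, block_sizes, edges):
    """layout: block IDs in emission order; block_sizes: bytes per block;
    edges: (src, dst, weight) triples. Charges each edge its weighted
    byte distance, so swaps that shorten hot jumps lower the cost."""
    offsets, offset = {}, 0
    for block in layout:
        offsets[block] = offset
        offset += block_sizes[block]
    return sum(weight * abs(offsets[dst] - offsets[src])
               for src, dst, weight in edges)
```

Recomputed from scratch per partition swap this is expensive, which matches the observation that there's no known cheap incremental formulation yet; it's only useful for measuring the potential upside offline.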
Part of #107749. I've collated all the regression reports assigned to me since #103450 was merged. For now, I'm excluding reports on Tiger machine configs to skip JCC errata-related regressions.