JIT: Visit switch successors in increasing likelihood order for RPO-based layout #101935

amanasifkhalid · 2024-05-06T19:13:57Z

Follow-up to #101473. Also cleans up the BasicBlock successor visitor API surface a bit by separating logic for visiting successors in increasing likelihood order into BasicBlock::VisitAllSuccsInLikelihoodOrder.

dotnet-policy-service · 2024-05-06T19:14:31Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

amanasifkhalid · 2024-05-06T19:24:22Z

Here's the diff summary on win-x64, since the RPO-based layout is disabled by default. Diffs aren't that big, and they aren't overwhelmingly in one direction...

Diffs are based on 2,534,676 contexts (987,849 MinOpts, 1,546,827 FullOpts).

MISSED contexts: 2,922 (0.12%)

Overall (-8,298 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	39,855,926	+1,073	+0.20%
benchmarks.run.windows.x64.checked.mch	8,710,150	-108	-0.05%
benchmarks.run_pgo.windows.x64.checked.mch	32,332,197	-718	+0.39%
benchmarks.run_tiered.windows.x64.checked.mch	12,340,101	-23	+0.03%
coreclr_tests.run.windows.x64.checked.mch	402,921,293	-325	+0.27%
libraries.crossgen2.windows.x64.checked.mch	45,252,213	-1,591	-0.13%
libraries.pmi.windows.x64.checked.mch	63,617,634	-575	-0.11%
libraries_tests.run.windows.x64.Release.mch	285,052,618	-2,971	+0.79%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	137,049,028	-574	-0.14%
realworld.run.windows.x64.checked.mch	13,552,182	-1,625	-0.08%
smoke_tests.nativeaot.windows.x64.checked.mch	5,032,760	-861	-1.60%

FullOpts (-8,298 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	27,307,698	+1,073	+0.20%
benchmarks.run.windows.x64.checked.mch	8,709,728	-108	-0.05%
benchmarks.run_pgo.windows.x64.checked.mch	18,519,682	-718	+0.39%
benchmarks.run_tiered.windows.x64.checked.mch	3,077,579	-23	+0.03%
coreclr_tests.run.windows.x64.checked.mch	121,562,176	-325	+0.27%
libraries.crossgen2.windows.x64.checked.mch	45,250,508	-1,591	-0.13%
libraries.pmi.windows.x64.checked.mch	63,504,133	-575	-0.11%
libraries_tests.run.windows.x64.Release.mch	107,104,520	-2,971	+0.79%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	126,736,473	-574	-0.14%
realworld.run.windows.x64.checked.mch	13,146,461	-1,625	-0.08%
smoke_tests.nativeaot.windows.x64.checked.mch	5,031,713	-861	-1.60%

amanasifkhalid · 2024-05-07T17:43:36Z

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. If you don't think it's worth churning the RPO layout implementation while running an experiment in the perf lab (especially since the diffs don't look like particularly large wins), I'm happy to undo those changes and just keep the block visitor cleanup work.

amanasifkhalid · 2024-05-08T22:21:38Z

@jakobbotsch this cleans up the BasicBlock visitor API surface a bit. I initially tried something like the initializer pattern you mentioned over in #101473, but the current usage of AllSuccessorEnumerator in fgRunDfs complicated this approach -- what do you think of just passing the useProfile template argument in fgRunDfs to AllSuccessorEnumerator? It introduces some code duplication in the latter's constructor, but it's otherwise a pretty localized change.

jakobbotsch · 2024-05-10T12:09:12Z

src/coreclr/jit/block.h

@@ -2497,7 +2501,7 @@ class AllSuccessorEnumerator

 public:
    // Constructs an enumerator of all `block`'s successors.
-    AllSuccessorEnumerator(Compiler* comp, BasicBlock* block, const bool useProfile = false);
+    AllSuccessorEnumerator(Compiler* const comp, BasicBlock* const block);


Nit: marking parameters as const in declarations is not very beneficial (it is a property of the definition, not of the declaration).

We have a number of places where we have these const. IMO we should enable this rule of clang-tidy: https://clang.llvm.org/extra/clang-tidy/checks/readability/avoid-const-params-in-decls.html

> We have a number of places where we have these const. IMO we should enable this rule of clang-tidy:

That seems reasonable. I can open a PR for that.
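For readers following the nit above, here is a minimal sketch of why a top-level `const` on a by-value parameter carries no information in a declaration. `Block` and `CountSuccs` are hypothetical stand-ins for illustration, not JIT code.

```cpp
#include <type_traits>

struct Block
{
}; // hypothetical stand-in, not the JIT's BasicBlock

// Both lines declare the same function: a top-level const on a by-value
// parameter is not part of the function's type, so the second line is a
// redeclaration rather than an overload.
int CountSuccs(Block* block);
int CountSuccs(Block* const block);

static_assert(std::is_same<int(Block*), int(Block* const)>::value,
              "top-level const on a parameter does not change the function type");

// Only the definition is affected: here const stops the body from
// reassigning its own copy of the pointer.
int CountSuccs(Block* const block)
{
    // block = nullptr; // would not compile inside this definition
    return (block == nullptr) ? 0 : 1;
}
```

Callers see an identical signature either way, which is what the clang-tidy check linked above flags.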

jakobbotsch · 2024-05-10T12:13:08Z

src/coreclr/jit/compiler.hpp

            Compiler::SwitchUniqueSuccSet sd = comp->GetDescriptorForSwitch(this);
+            jitstd::sort(sd.nonDuplicates, (sd.nonDuplicates + sd.numDistinctSuccs),
+                         [](FlowEdge* const lhs, FlowEdge* const rhs) {
+                return (lhs->getLikelihood() * lhs->getDupCount()) < (rhs->getLikelihood() * rhs->getDupCount());
+            });


It seems odd for the visitor function to be mutating the descriptor.

It would make more sense to me if this sorting step was done ahead of time as part of the block layout phase.
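One way to avoid mutating the descriptor, sketched under assumptions: copy the edges into scratch storage and sort the copy. `Edge` is a hypothetical stand-in for FlowEdge with only the fields the comparator needs, and `std::stable_sort` stands in for `jitstd::sort`; the comparator mirrors the one in the diff above.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for FlowEdge; field names are illustrative only.
struct Edge
{
    double   likelihood; // edge likelihood, as returned by getLikelihood() in the diff
    unsigned dupCount;   // duplicate count, as returned by getDupCount() in the diff
    int      target;     // hypothetical successor block id
};

// Returns a copy of `edges` ordered by increasing (likelihood * dupCount).
// The input is taken by const reference and is left untouched.
std::vector<Edge> SortedByIncreasingWeight(const std::vector<Edge>& edges)
{
    std::vector<Edge> sorted(edges);
    std::stable_sort(sorted.begin(), sorted.end(), [](const Edge& lhs, const Edge& rhs) {
        return (lhs.likelihood * lhs.dupCount) < (rhs.likelihood * rhs.dupCount);
    });
    return sorted;
}
```

The cost is one extra allocation per switch block visited, which is the trade-off the rest of this thread weighs.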

> It would make more sense to me if this sorting step was done ahead of time as part of the block layout phase.

Are you thinking of something like a pass over the block list that sorts all the switch descriptors, before computing the DFS? If so, considering the diffs don't look all that significant, I'm tempted to just visit switch successors in their unsorted order to avoid cluttering the visitor APIs, or adding another loop to the layout algorithm that won't do anything most of the time.

Something like that, or if that is undesirable, then some map created lazily in the context of the block layout phase (i.e. when we visit the switch block). I guess the latter is not much different to just allocating new memory to sort these successors into.
But at that point it makes more sense to me if we change AllSuccessorEnumerator slightly: in reality it's really just a vector with small vector optimization for N=4. If you phrase fgRunDfs in terms of a callback that returns this small vector of successors, then the memory where we can do the sorting without mutating existing data structures will already be available.
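A rough sketch of that idea, assuming nothing about the real AllSuccessorEnumerator beyond what is said above: a small vector with inline room for four elements, filled by a callback and sorted in its own storage. `SmallVec`, `Succ`, and `SuccsInIncreasingLikelihood` are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical small vector: up to N elements live inline; once the size
// exceeds N, all elements move to the heap so storage stays contiguous.
template <typename T, std::size_t N = 4>
class SmallVec
{
    T              m_inline[N]{};
    std::vector<T> m_heap;
    std::size_t    m_size = 0;

public:
    void push_back(const T& value)
    {
        if (m_size == N)
        {
            m_heap.assign(m_inline, m_inline + N); // spill inline elements once
        }
        if (m_size >= N)
        {
            m_heap.push_back(value);
        }
        else
        {
            m_inline[m_size] = value;
        }
        m_size++;
    }

    std::size_t size() const { return m_size; }
    T* begin() { return (m_size > N) ? m_heap.data() : m_inline; }
    T* end() { return begin() + m_size; }
};

// Hypothetical successor record: target block id plus edge likelihood.
struct Succ
{
    int    target;
    double likelihood;
};

// Collects a block's successors via `getSuccs` and orders them by increasing
// likelihood inside the scratch vector; the source data is only read.
template <typename TGetSuccs>
SmallVec<Succ> SuccsInIncreasingLikelihood(int block, TGetSuccs getSuccs)
{
    SmallVec<Succ> result;
    for (const Succ& succ : getSuccs(block))
    {
        result.push_back(succ);
    }
    std::sort(result.begin(), result.end(),
              [](const Succ& lhs, const Succ& rhs) { return lhs.likelihood < rhs.likelihood; });
    return result;
}
```

A DFS driver would call such a callback per block and iterate the returned vector, so any ordering policy stays out of the visitor itself.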

> If so, considering the diffs don't look all that significant, I'm tempted to just visit switch successors in their unsorted order to avoid cluttering the visitor APIs

I would suggest this, unless you have some information that order of successors matters for switches.

> If you phrase fgRunDfs in terms of a callback that returns this small vector of successors, then the memory where we can do the sorting without mutating existing data structures will already be available.

I tried something like this just now, but this ends up touching quite a bit of code if we want to keep the callback agnostic of the block kind (which seems desirable). I was thinking we'd adjust AllSuccessorEnumerator's underlying array to hold FlowEdge pointers instead of BasicBlock pointers to the block's successors, so that the callback can easily sort the array by edge likelihoods -- otherwise, we'd need to get the edge of each successor block with fgGetPredForBlock, which seems needlessly expensive. This means adjusting BasicBlock::VisitAllSuccs to take a callback that operates on FlowEdge* instead of BasicBlock*, but this doesn't work well in VisitEHSuccs, as EHblkDsc points to EH successors using BasicBlock pointers instead of FlowEdge pointers. We don't create edges to those successors, so we can't switch that data structure over to using edges.

> I would suggest this, unless you have some information that order of successors matters for switches.

I think it makes most sense to leave switch successors alone for now, but still abstract the layout-specific ordering code out of VisitAllSuccs. If we're only doing anything special for BBJ_COND blocks for now, do we want a callback-based approach in fgRunDfs (I presume this callback would just manually compare the likelihoods of conditional blocks' true/false targets) that can be easily expanded later? Or are we ok with having a separate visitor method, VisitAllSuccsInLikelihoodOrder, that handles the BBJ_COND case while setting up the array in AllSuccessorEnumerator?
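To illustrate why the visiting order matters for an RPO-based layout, here is a small self-contained sketch. It is not the JIT's fgRunDfs: `Block`, `Dfs`, and `LayoutOrder` are hypothetical, and the traversal is recursive for brevity. The successor visited last finishes last among its siblings, so in reverse postorder it lands directly after its predecessor, provided it was not already reached through an earlier sibling.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical block: a list of (successor id, edge likelihood) pairs.
struct Block
{
    std::vector<std::pair<int, double>> succs;
};

// Depth-first search that visits successors in increasing likelihood order
// and records blocks in postorder.
static void Dfs(const std::vector<Block>& graph, int block, std::vector<bool>& visited, std::vector<int>& postorder)
{
    visited[static_cast<std::size_t>(block)] = true;

    // Sort a local copy so the graph itself is never mutated.
    std::vector<std::pair<int, double>> order(graph[static_cast<std::size_t>(block)].succs);
    std::stable_sort(order.begin(), order.end(),
                     [](const std::pair<int, double>& lhs, const std::pair<int, double>& rhs) {
                         return lhs.second < rhs.second;
                     });

    for (const std::pair<int, double>& succ : order)
    {
        if (!visited[static_cast<std::size_t>(succ.first)])
        {
            Dfs(graph, succ.first, visited, postorder);
        }
    }
    postorder.push_back(block);
}

// Returns the blocks reachable from `entry` in reverse postorder.
std::vector<int> LayoutOrder(const std::vector<Block>& graph, int entry)
{
    std::vector<bool> visited(graph.size(), false);
    std::vector<int>  postorder;
    Dfs(graph, entry, visited, postorder);
    std::reverse(postorder.begin(), postorder.end());
    return postorder;
}
```

For a conditional block with two targets, the sort reduces to a single likelihood comparison between the true and false edges.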

amanasifkhalid added 3 commits May 3, 2024 16:00

Greedy switch successor iteration

3d6c6e3

Successor visitor API cleanup

d722737

Disable

c7a9821

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) May 6, 2024

dotnet-policy-service bot assigned amanasifkhalid May 6, 2024

build-analysis bot mentioned this pull request May 6, 2024

Failed test: baseservices/exceptions/unhandled/unhandledTester/unhandledTester #100495

Closed

jakobbotsch reviewed May 10, 2024

amanasifkhalid mentioned this pull request May 16, 2024

JIT: Enable RPO-based block layout by default #102343

Merged

amanasifkhalid closed this Aug 12, 2024

github-actions bot locked and limited conversation to collaborators Sep 12, 2024
