Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Visit switch successors in increasing likelihood order for RPO-based layout #101935

Closed

Conversation

amanasifkhalid
Copy link
Member

Follow-up to #101473. Also cleans up the BasicBlock successor visitor API surface a bit by separating logic for visiting successors in increasing likelihood order into BasicBlock::VisitAllSuccsInLikelihoodOrder.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 6, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Member Author

Here's the diff summary on win-x64, since the RPO-based layout is disabled by default. Diffs aren't that big, and they aren't overwhelmingly in one direction...

Diffs are based on 2,534,676 contexts (987,849 MinOpts, 1,546,827 FullOpts).

MISSED contexts: 2,922 (0.12%)

Overall (-8,298 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
aspnet.run.windows.x64.checked.mch 39,855,926 +1,073 +0.20%
benchmarks.run.windows.x64.checked.mch 8,710,150 -108 -0.05%
benchmarks.run_pgo.windows.x64.checked.mch 32,332,197 -718 +0.39%
benchmarks.run_tiered.windows.x64.checked.mch 12,340,101 -23 +0.03%
coreclr_tests.run.windows.x64.checked.mch 402,921,293 -325 +0.27%
libraries.crossgen2.windows.x64.checked.mch 45,252,213 -1,591 -0.13%
libraries.pmi.windows.x64.checked.mch 63,617,634 -575 -0.11%
libraries_tests.run.windows.x64.Release.mch 285,052,618 -2,971 +0.79%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 137,049,028 -574 -0.14%
realworld.run.windows.x64.checked.mch 13,552,182 -1,625 -0.08%
smoke_tests.nativeaot.windows.x64.checked.mch 5,032,760 -861 -1.60%
FullOpts (-8,298 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
aspnet.run.windows.x64.checked.mch 27,307,698 +1,073 +0.20%
benchmarks.run.windows.x64.checked.mch 8,709,728 -108 -0.05%
benchmarks.run_pgo.windows.x64.checked.mch 18,519,682 -718 +0.39%
benchmarks.run_tiered.windows.x64.checked.mch 3,077,579 -23 +0.03%
coreclr_tests.run.windows.x64.checked.mch 121,562,176 -325 +0.27%
libraries.crossgen2.windows.x64.checked.mch 45,250,508 -1,591 -0.13%
libraries.pmi.windows.x64.checked.mch 63,504,133 -575 -0.11%
libraries_tests.run.windows.x64.Release.mch 107,104,520 -2,971 +0.79%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 126,736,473 -574 -0.14%
realworld.run.windows.x64.checked.mch 13,146,461 -1,625 -0.08%
smoke_tests.nativeaot.windows.x64.checked.mch 5,031,713 -861 -1.60%

@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. If you don't think it's worth churning the RPO layout implementation while running an experiment in the perf lab (especially since the diffs don't look like particularly large wins), I'm happy to undo those changes and just keep the block visitor cleanup work.

@amanasifkhalid
Copy link
Member Author

@jakobbotsch this cleans up the BasicBlock visitor API surface a bit. I initially tried something like the initializer pattern you mentioned over in #101473, but the current usage of AllSuccessorEnumerator in fgRunDfs complicated this approach -- what do you think of just passing the useProfile template argument in fgRunDfs to AllSuccessorEnumerator? It introduces some code duplication to the latter's constructor, but it's otherwise a pretty localized change.

@@ -2497,7 +2501,7 @@ class AllSuccessorEnumerator

public:
// Constructs an enumerator of all `block`'s successors.
AllSuccessorEnumerator(Compiler* comp, BasicBlock* block, const bool useProfile = false);
AllSuccessorEnumerator(Compiler* const comp, BasicBlock* const block);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: marking parameters as const in declarations is not very beneficial (it is a property of the definition, not of the declaration)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a number of places where we have these const. IMO we should enable this rule of clang-tidy: https://clang.llvm.org/extra/clang-tidy/checks/readability/avoid-const-params-in-decls.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a number of places where we have these const. IMO we should enable this rule of clang-tidy:

That seems reasonable. I can open a PR for that.

Comment on lines 735 to +739
Compiler::SwitchUniqueSuccSet sd = comp->GetDescriptorForSwitch(this);
jitstd::sort(sd.nonDuplicates, (sd.nonDuplicates + sd.numDistinctSuccs),
[](FlowEdge* const lhs, FlowEdge* const rhs) {
return (lhs->getLikelihood() * lhs->getDupCount()) < (rhs->getLikelihood() * rhs->getDupCount());
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems odd for the visitor function to be mutating the descriptor.

It would make more sense to me if this sorting step was done ahead of time as part of the block layout phase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make more sense to me if this sorting step was done ahead of time as part of the block layout phase.

Are you thinking of something like a pass over the block list that sorts all the switch descriptors, before computing the DFS? If so, considering the diffs don't look all that significant, I'm tempted to just visit switch successors in their unsorted order to avoid cluttering the visitor APIs, or adding another loop to the layout algorithm that won't do anything most of the time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that, or if that is undesirable, then some map created lazily in the context of the block layout phase (i.e. when we visit the switch block). I guess the latter is not much different to just allocating new memory to sort these successors into.
But at that point it makes more sense to me if we change AllSuccessorEnumerator slightly: in reality it's really just a vector with small vector optimization for N=4. If you phrase fgRunDfs in terms of a callback that returns this small vector of successors, then the memory where we can do the sorting without mutating existing data structures will already be available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, considering the diffs don't look all that significant, I'm tempted to just visit switch successors in their unsorted order to avoid cluttering the visitor APIs

I would suggest this, unless you have some information that order of successors matters for switches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you phrase fgRunDfs in terms of a callback that returns this small vector of successors, then the memory where we can do the sorting without mutating existing data structures will already be available.

I tried something like this just now, but this ends up touching quite a bit of code if we want to keep the callback naive to the block kind (which seems desirable). I was thinking we'd adjust AllSuccessorEnumerator's underlying array to hold FlowEdge pointers instead of BasicBlock pointers to the block's successors, so that the callback can easily sort the array by edge likelihoods -- otherwise, we'd need to get the edge of each successor block with fgGetPredForBlock, which seems needlessly expensive. This means adjusting BasicBlock::VisitAllSuccs to take a callback that operates on FlowEdge* instead of BasicBlock*, but this doesn't work well in VisitEHSuccs, as EHblkDsc points to EH successors using BasicBlock pointers instead of FlowEdge pointers. We don't create edges to those successors, so we can't switch that data structure over to using edges.

I would suggest this, unless you have some information that order of successors matters for switches.

I think it makes most sense to leave switch successors alone for now, but still abstract the layout-specific ordering code out of VisitAllSuccs. If we're only doing anything special for BBJ_COND blocks for now, do we want a callback-based approach in fgRunDfs (I presume this callback would just manually compare the likelihoods of conditional blocks' true/false targets) that can be easily expanded later? Or are we ok with having a separate visitor method, VisitAllSuccsInLikelihoodOrder, that handles the BBJ_COND case while setting up the array in AllSuccessorEnumerator?

@github-actions github-actions bot locked and limited conversation to collaborators Sep 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants