-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: cross-block local assertion prop in morph #94363
JIT: cross-block local assertion prop in morph #94363
Conversation
During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen: * block has preds that have not yet been morphed * block has no preds * block is specially flagged as one that might gain new preds during morph * block is an EH handler entry Contributes to dotnet#93246.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsDuring global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen:
Contributes to #93246.
|
@jakobbotsch @dotnet/jit-contrib here's the "rough cut" This will have about a 1% TP impact (win x64) and big code size improvements and regressions. Seems like it will take some time to chip away at those. Running this now to flush out any remaining bugs. |
@amanasifkhalid FYI I'm seeing the same sort of cloning issue you hit in this PR... there are 4600 size regressions so it would be nice to filter out the ones where there was extra cloning.
|
Runtime failure is in the diagnostics test that looks at debug info from an optimized method. |
Latest commit upscales the bit vectors proportional to This will incur a bit more TP cost; locally looks like around 1.3% or so. But as a benefit we see considerably larger overall code size savings; x64 windows should be around -1MB. Thoughts about how to get this to the point where it can be mergedThroughput: I think I can claw some of the TP back by (1) optimizing search loops in assertion prop. We should generally never search the entire table (save when adding an assertion) and instead walk the live assertion set, and for local prop we should intersect that set with the dep vector. The bit vector operations aren't free so it might make sense to do this only when there are sufficient numbers of assertions, though that makes the code uglier. Perhaps it can all be hidden behind a suitably smart enumerator. (2) stop morphing blocks that are unreachable or become unreachable because of local assertion prop. Aside from removing all the morph related overhead, I suspect sometimes this dead block morphing creates issues for live blocks, eg either useless assertions that cause the table to fill faster, or else global actions (say DNER) that are hard to undo. Currently the RPO strategy won't allow for removal of dead EH and may or may not make it easy to remove unreachable cyclic flow. Both of these can be spun off as preliminary checkins, though the full benefits may not be seen without this change, as main has small bit vectors and isn't able to propagate constants nearly as aggressively. At the end of the day though I expect there will still be a TP impact. Code Size / Code Quality With the massive numbers of diffs any sort of manual assessment is going to be a random sampling at best. I will try and look at some of the bigger regressions, possibly suppressing cloning to help remove that as a factor. The more aggressive copy prop that this enables puts pressure on LSRA, as copies to temps provide natural split points. I am not sure we can find effective heuristics to counterbalance this. I think it makes sense to check all this in initially as off-by-default, and enable runs in the perf lab, and use that data to both identify CQ regressions and to decide if the CQ improvements justify the additional TP costs. |
I'm going to disable this by default for now and set up some lab runs for correctness. I will alsoset up runs of this in a fork in the perf lab, so I can get a sense of the perf impact of all this CQ churn. In the meantime I'll continue to look at fixing the bad diffs (and bad perf, once I have data) and whittling away at TP, so once this is merged I'll open a draft PR to enable and iterate there. |
@@ -220,6 +221,7 @@ | |||
<TestEnvironment Include="jitobjectstackallocation" JitObjectStackAllocation="1" TieredCompilation="0" /> | |||
<TestEnvironment Include="jitphysicalpromotion_only" JitStressModeNames="STRESS_NO_OLD_PROMOTION" TieredCompilation="0" /> | |||
<TestEnvironment Include="jitphysicalpromotion_full" JitStressModeNames="STRESS_PHYSICAL_PROMOTION_COST STRESS_NO_OLD_PROMOTION" TieredCompilation="0" /> | |||
<TestEnvironment Include="jitcrossblocklocalassertionprop" JitDoCrossBlockLocalAssertionProp="1" TieredCompilation="0" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TieredCompilation=0
is the default in this file, so you don't need to specify it (looks like some other cases also over-specify it)
src/coreclr/jit/jitconfigvalues.h
Outdated
@@ -653,6 +653,9 @@ CONFIG_INTEGER(JitEnableHeadTailMerge, W("JitEnableHeadTailMerge"), 1) | |||
// Enable physical promotion | |||
CONFIG_INTEGER(JitEnablePhysicalPromotion, W("JitEnablePhysicalPromotion"), 1) | |||
|
|||
// Enable cross-block local assertion prop | |||
CONFIG_INTEGER(JitDoCrossBlockLocalAssertionProp, W("JitDoCrossBlockLocalAssertionProp"), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do
seems fine, but should we use Enable
instead, to match other similar cases (like JitEnablePhysicalPromotion
, above)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure... it was up with the other Do
cases before, but now that I moved it to be available in release it no longer matches its neighbors.
src/coreclr/jit/morph.cpp
Outdated
@@ -13822,7 +13900,8 @@ PhaseStatus Compiler::fgMorphBlocks() | |||
|
|||
// Local assertion prop is enabled if we are optimized | |||
// | |||
optLocalAssertionProp = opts.OptimizationEnabled(); | |||
optLocalAssertionProp = opts.OptimizationEnabled(); | |||
optCrossBlockLocalAssertionProp = optLocalAssertionProp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optCrossBlockLocalAssertionProp = optLocalAssertionProp; | |
optCrossBlockLocalAssertionProp = optLocalAssertionProp && JitConfig.JitDoCrossBlockLocalAssertionProp(); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you consult the config in optAssertionInit
. Is there a reason to wait until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth as to whether all this setup should be assertion prop's business; I could push more of the logic there I suppose.
|
||
if (!canUsePredAssertions) | ||
{ | ||
apLocal = BitVecOps::MakeEmpty(apTraits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also need optAssertionReset(0);
?
Can the (!optCrossBlockLocalAssertionProp)
and the "failure" cases be combined (tail merged)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also need
optAssertionReset(0);
?
No. There are 3 modes of operation here:
- classic: block starts with empty table via
Reset(0)
and empty BV - cross-block: block starts with current table and intersection of pred BVs (which may end up empty)
- cross-block special case: block starts with current table and empty BV
Modes 2 and 3 can mix with each other, but not with mode 1.
We use mode 1 if the feature is disabled, or if the method has a lot of locals.
When the feature is enabled, and the method does not have a lot of locals, we use mode 2 (if canUsePredAssertions
) or mode 3 (otherwise) depending on the various block properties.
Can the
(!optCrossBlockLocalAssertionProp)
and the "failure" cases be combined (tail merged)?
Let me look at tweaking this.
src/coreclr/jit/morph.cpp
Outdated
canUsePredAssertions = false; | ||
} | ||
|
||
// If there were no preds, there are no asserions in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If there were no preds, there are no asserions in. | |
// If there were no preds, there are no assertions in. |
src/coreclr/jit/morph.cpp
Outdated
// | ||
JITDUMP(FMT_BB " pred " FMT_BB " not processed; clearing assertions in\n", block->bbNum, | ||
pred->bbNum); | ||
canUsePredAssertions = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add break;
?
Or better, make the failure case first, with break
, and eliminate the continue
if (lvaCount > maxTrackedLocals) | ||
{ | ||
JITDUMP("Disabling cross-block assertion prop: too many locals\n"); | ||
optCrossBlockLocalAssertionProp = false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it might be very "spurious diff" inducing (add a local and suddenly a bunch of assertions stop being propagated). You might have mentioned it somewhere else, but do you have any plans to make this less "hard threshold-y"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also looking at things like optAssertionDep
below it kind of feels like we should be calling lvaSortByRefCount
here and utilize tracked indices, although I don't think ref counts are up to date here (in particular ones introduced by physical promotion -- might not be so hard to fix that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's one of those cases where it would be great to have separated "single block" locals and "global" locals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have any plans to make this less "hard threshold-y"?
I don't have any great ideas here, no. Some of this is the inherent cost of a blind forward scheme, we push facts along hoping they will matter. SSA would be helpful, we could just attach assertions to SSA lifetimes.
utilize tracked indices
Some kind of prioritization scheme might work out nicely; presumably frequently referenced locals are more interesting candidates for assertions.
I guess it's one of those cases where it would be great to have separated "single block" locals and "global" locals.
I had the same thought -- I tried an experiment to see how many assertions were killed in the same block they were born in, and it was quite small; most of them get killed at merges.
It might be possible to leverage RCS_EARLY to deduce if newly generated assertions are likely to be referenced later on; if not they could be purged before they "escape" the block. But we're also enabling cross-block copy prop so even getting that right might be tricky, as the ref counts can and will change during morph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There's an unexpected diff on 32 bit platforms, I want to chase this down... not obvious to me what this is from the code changes. I can repro locally. |
Ah, this is fixing an issue I introduced in #94322. In We now use |
My last round of revisions to dotnet#94363 introduced a bug. Fix.
My last round of revisions to #94363 introduced a bug. Fix.
When we expand QMARKS, ensure that any block with a no-return call gets changed to BBJ_THROW. This fixes a case I am seeing with cross-block local assertion prop, as the upper QMARK gets optimized away and so we don't check if the expansing has any noreturn calls. It also happens in places with just within-block local assertion prop. Contributes to dotnet#94363.
When we expand QMARKS, ensure that any block with a no-return call gets changed to BBJ_THROW. This fixes a case I am seeing with cross-block local assertion prop, as the upper QMARK gets optimized away and so we don't check if the expansing has any noreturn calls. It also happens in places with just within-block local assertion prop. Contributes to #94363.
During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen:
Contributes to #93246.
For now this is off by default.