Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: cross-block local assertion prop in morph #94363

Merged
merged 5 commits into from
Nov 10, 2023

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Nov 3, 2023

During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen:

  • block has preds that have not yet been morphed
  • block has no preds
  • block is specially flagged as one that might gain new preds during morph
  • block is an EH handler entry

Contributes to #93246.

For now this is off by default.

During global morph, allow assertions to propagate to a block from the block's
predecessors. Handle special cases where we can't allow this to happen:
* block has preds that have not yet been morphed
* block has no preds
* block is specially flagged as one that might gain new preds during morph
* block is an EH handler entry

Contributes to dotnet#93246.
@ghost ghost assigned AndyAyersMS Nov 3, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 3, 2023
@ghost
Copy link

ghost commented Nov 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen:

  • block has preds that have not yet been morphed
  • block has no preds
  • block is specially flagged as one that might gain new preds during morph
  • block is an EH handler entry

Contributes to #93246.

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS AndyAyersMS marked this pull request as draft November 3, 2023 23:25
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Nov 3, 2023

@jakobbotsch @dotnet/jit-contrib here's the "rough cut"

This will have about a 1% TP impact (win x64) and big code size improvements and regressions. Seems like it will take some time to chip away at those.

Diffs

Running this now to flush out any remaining bugs.

@AndyAyersMS
Copy link
Member Author

@amanasifkhalid FYI I'm seeing the same sort of cloning issue you hit in this PR... there are 4600 size regressions so it would be nice to filter out the ones where there was extra cloning.

[18:08:46] Top method regressions (percentages):
[18:08:46]          216 (67.50% of base) : 18807.dasm - Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpHeaders:ValidateHeaderValueCharacters(System.String,Microsoft.Extensions.Primitives.StringValues,System.Func`2[System.String,System.Text.Encoding]) (Tier1)

@AndyAyersMS
Copy link
Member Author

Runtime failure is in the diagnostics test that looks at debug info from an optimized method.

@AndyAyersMS
Copy link
Member Author

Latest commit upscales the bit vectors proportional to lvaCount until we hit the tracked local limit (so using similar sized bit vectors to what we use for liveness, since empirically I see about 0.6 local assertions / local, and want to over-size a bit to reduce likelihood of running out of table space).

This will incur a bit more TP cost; locally looks like around 1.3% or so. But as a benefit we see considerably larger overall code size savings; x64 windows should be around -1MB.

Thoughts about how to get this to the point where it can be merged

Throughput:

I think I can claw some of the TP back by

(1) optimizing search loops in assertion prop. We should generally never search the entire table (save when adding an assertion) and instead walk the live assertion set, and for local prop we should intersect that set with the dep vector.

The bit vector operations aren't free so it might make sense to do this only when there are sufficient numbers of assertions, though that makes the code uglier. Perhaps it can all be hidden behind a suitably smart enumerator.

(2) stop morphing blocks that are unreachable or become unreachable because of local assertion prop.

Aside from removing all the morph related overhead, I suspect sometimes this dead block morphing creates issues for live blocks, eg either useless assertions that cause the table to fill faster, or else global actions (say DNER) that are hard to undo.

Currently the RPO strategy won't allow for removal of dead EH and may or may not make it easy to remove unreachable cyclic flow.

Both of these can be spun off as preliminary checkins, though the full benefits may not be seen without this change, as main has small bit vectors and isn't able to propagate constants nearly as aggressively.

At the end of the day though I expect there will still be a TP impact.

Code Size / Code Quality

With the massive numbers of diffs any sort of manual assessment is going to be a random sampling at best. I will try and look at some of the bigger regressions, possibly suppressing cloning to help remove that as a factor.

The more aggressive copy prop that this enables puts pressure on LSRA, as copies to temps provide natural split points. I am not sure we can find effective heuristics to counterbalance this.

I think it makes sense to check all this in initially as off-by-default, and enable runs in the perf lab, and use that data to both identify CQ regressions and to decide if the CQ improvements justify the additional TP costs.

@AndyAyersMS
Copy link
Member Author

Diffs @ bc59375.

Aggregate size savings look good, but still quite a lot of churn, and a fair number of large regressions.

@AndyAyersMS
Copy link
Member Author

I'm going to disable this by default for now and set up some lab runs for correctness. I will alsoset up runs of this in a fork in the perf lab, so I can get a sense of the perf impact of all this CQ churn.

In the meantime I'll continue to look at fixing the bad diffs (and bad perf, once I have data) and whittling away at TP, so once this is merged I'll open a draft PR to enable and iterate there.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review November 8, 2023 23:30
@@ -220,6 +221,7 @@
<TestEnvironment Include="jitobjectstackallocation" JitObjectStackAllocation="1" TieredCompilation="0" />
<TestEnvironment Include="jitphysicalpromotion_only" JitStressModeNames="STRESS_NO_OLD_PROMOTION" TieredCompilation="0" />
<TestEnvironment Include="jitphysicalpromotion_full" JitStressModeNames="STRESS_PHYSICAL_PROMOTION_COST STRESS_NO_OLD_PROMOTION" TieredCompilation="0" />
<TestEnvironment Include="jitcrossblocklocalassertionprop" JitDoCrossBlockLocalAssertionProp="1" TieredCompilation="0" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TieredCompilation=0 is the default in this file, so you don't need to specify it (looks like some other cases also over-specify it)

@@ -653,6 +653,9 @@ CONFIG_INTEGER(JitEnableHeadTailMerge, W("JitEnableHeadTailMerge"), 1)
// Enable physical promotion
CONFIG_INTEGER(JitEnablePhysicalPromotion, W("JitEnablePhysicalPromotion"), 1)

// Enable cross-block local assertion prop
CONFIG_INTEGER(JitDoCrossBlockLocalAssertionProp, W("JitDoCrossBlockLocalAssertionProp"), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do seems fine, but should we use Enable instead, to match other similar cases (like JitEnablePhysicalPromotion, above)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure... it was up with the other Do cases before, but now that I moved it to be available in release it no longer matches its neighbors.

@@ -13822,7 +13900,8 @@ PhaseStatus Compiler::fgMorphBlocks()

// Local assertion prop is enabled if we are optimized
//
optLocalAssertionProp = opts.OptimizationEnabled();
optLocalAssertionProp = opts.OptimizationEnabled();
optCrossBlockLocalAssertionProp = optLocalAssertionProp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
optCrossBlockLocalAssertionProp = optLocalAssertionProp;
optCrossBlockLocalAssertionProp = optLocalAssertionProp && JitConfig.JitDoCrossBlockLocalAssertionProp();

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you consult the config in optAssertionInit. Is there a reason to wait until then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth as to whether all this setup should be assertion prop's business; I could push more of the logic there I suppose.


if (!canUsePredAssertions)
{
apLocal = BitVecOps::MakeEmpty(apTraits);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also need optAssertionReset(0);?

Can the (!optCrossBlockLocalAssertionProp) and the "failure" cases be combined (tail merged)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also need optAssertionReset(0);?

No. There are 3 modes of operation here:

  1. classic: block starts with empty table via Reset(0) and empty BV
  2. cross-block: block starts with current table and intersection of pred BVs (which may end up empty)
  3. cross-block special case: block starts with current table and empty BV

Modes 2 and 3 can mix with each other, but not with mode 1.

We use mode 1 if the feature is disabled, or if the method has a lot of locals.

When the feature is enabled, and the method does not have a lot of locals, we use mode 2 (if canUsePredAssertions) or mode 3 (otherwise) depending on the various block properties.

Can the (!optCrossBlockLocalAssertionProp) and the "failure" cases be combined (tail merged)?

Let me look at tweaking this.

canUsePredAssertions = false;
}

// If there were no preds, there are no asserions in.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If there were no preds, there are no asserions in.
// If there were no preds, there are no assertions in.

//
JITDUMP(FMT_BB " pred " FMT_BB " not processed; clearing assertions in\n", block->bbNum,
pred->bbNum);
canUsePredAssertions = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add break;?

Or better, make the failure case first, with break, and eliminate the continue

Comment on lines +568 to +572
if (lvaCount > maxTrackedLocals)
{
JITDUMP("Disabling cross-block assertion prop: too many locals\n");
optCrossBlockLocalAssertionProp = false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it might be very "spurious diff" inducing (add a local and suddenly a bunch of assertions stop being propagated). You might have mentioned it somewhere else, but do you have any plans to make this less "hard threshold-y"?

Copy link
Member

@jakobbotsch jakobbotsch Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looking at things like optAssertionDep below it kind of feels like we should be calling lvaSortByRefCount here and utilize tracked indices, although I don't think ref counts are up to date here (in particular ones introduced by physical promotion -- might not be so hard to fix that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's one of those cases where it would be great to have separated "single block" locals and "global" locals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have any plans to make this less "hard threshold-y"?

I don't have any great ideas here, no. Some of this is the inherent cost of a blind forward scheme, we push facts along hoping they will matter. SSA would be helpful, we could just attach assertions to SSA lifetimes.

utilize tracked indices

Some kind of prioritization scheme might work out nicely; presumably frequently referenced locals are more interesting candidates for assertions.

I guess it's one of those cases where it would be great to have separated "single block" locals and "global" locals.

I had the same thought -- I tried an experiment to see how many assertions were killed in the same block they were born in, and it was quite small; most of them get killed at merges.

It might be possible to leverage RCS_EARLY to deduce if newly generated assertions are likely to be referenced later on; if not they could be purged before they "escape" the block. But we're also enabling cross-block copy prop so even getting that right might be tricky, as the ref counts can and will change during morph.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@AndyAyersMS
Copy link
Member Author

There's an unexpected diff on 32 bit platforms, I want to chase this down... not obvious to me what this is from the code changes. I can repro locally.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Nov 10, 2023

There's an unexpected diff on 32 bit platforms, I want to chase this down... not obvious to me what this is from the code changes. I can repro locally.

Ah, this is fixing an issue I introduced in #94322. In optAssertionIsNonNullInternal we were inadvertently removing an assertion from the dep vector if it was no longer live (via IntersectionD). If it later came back to life we might not be able to find it since it would not be re-added to the dep vector (adding only happens when an assertion is first created). So we might fail to kill it if then again went dead.

We now use Intersection to avoid modifying the dep vector.

@AndyAyersMS AndyAyersMS merged commit 5127e07 into dotnet:main Nov 10, 2023
195 of 197 checks passed
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Nov 10, 2023
My last round of revisions to dotnet#94363 introduced a bug. Fix.
AndyAyersMS added a commit that referenced this pull request Nov 10, 2023
My last round of revisions to #94363 introduced a bug. Fix.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Nov 14, 2023
When we expand QMARKS, ensure that any block with a no-return call gets
changed to BBJ_THROW.

This fixes a case I am seeing with cross-block local assertion prop,
as the upper QMARK gets optimized away and so we don't check if the
expansing has any noreturn calls.

It also happens in places with just within-block local assertion prop.

Contributes to dotnet#94363.
AndyAyersMS added a commit that referenced this pull request Nov 14, 2023
When we expand QMARKS, ensure that any block with a no-return call gets
changed to BBJ_THROW.

This fixes a case I am seeing with cross-block local assertion prop,
as the upper QMARK gets optimized away and so we don't check if the
expansing has any noreturn calls.

It also happens in places with just within-block local assertion prop.

Contributes to #94363.
@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants