Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Improve local assertion prop throughput #94597

Merged
merged 12 commits into from
Nov 13, 2023

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Nov 10, 2023

Leverage the "dep vectors" to avoid the search the assertion table during local assertion prop. Helps the current (small table) behavior some, helps the future cross-block (larger table) behavior more.

Similar tricks may be possible for global AP, though the set of assertions there is more varied.

Contributes to #93246.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 10, 2023
@ghost ghost assigned AndyAyersMS Nov 10, 2023
@ghost
Copy link

ghost commented Nov 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

Locally this looks really good for TP once cross-block is enabled. I picked up the fix from #94608 so let's see what the lab thinks.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Nov 10, 2023

TP Impact on cross-block assertion prop

Baseline cross-block TP

image

With this PR at 034dbda 548be19

image

After 6a07b54
image

After 631a42f
image

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Nov 10, 2023

With 6a07b54 I am now seeing 0.13% (with cross-block local AP enabled) on benchmarks.run; baseline is more like 1.57%.... so this may get us to a happy TP place.

Would be nice to enacapsulate this pattern a bit better, perhaps. Also maybe in global AP we can think about ways to do similar indexing, as there are still various whole-table searches going on there.

@AndyAyersMS
Copy link
Member Author

One odd thing is that after the force push / rebase I lost some of the improvements and gained some sizeable regressions. EG win x64 went from -1.1M to -800K.

I've verified this change is no diff vs baseline (either with or without cross-block enabled) so not sure what happened just yet. Will need to diff against some older build I guess.

@AndyAyersMS AndyAyersMS changed the title Search less assertion prop JIT: Improve local assertion prop throughput Nov 11, 2023
@AndyAyersMS AndyAyersMS marked this pull request as ready for review November 11, 2023 01:59
@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib

@AndyAyersMS
Copy link
Member Author

Diffs

No codegen diffs; some TP a bit faster, some slower -- the more dramatic TP impact here is with cross-block enabled (see above). It may be that our bit vector iterator has some unnecessary overhead, as it seems like walking through an N entry table (with at most 64 entries) should generally be more costly that iterating through a bit vector with at most N (and likely just a few) bits set.

AssertionIndex const index = GetAssertionIndex(bvIndex);
AssertionDsc* const curAssertion = optGetAssertion(index);

if (curAssertion->Equals(newAssertion, !optLocalAssertionProp))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (curAssertion->Equals(newAssertion, !optLocalAssertionProp))
if (curAssertion->Equals(newAssertion, /* vnBased */ false))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also in the other cases)

Comment on lines 1658 to 1673
if (newAssertion->op2.kind == O2K_LCLVAR_COPY)
{
lclNum = newAssertion->op2.lcl.lclNum;
BitVecOps::Iter iter(apTraits, GetAssertionDep(lclNum));
unsigned bvIndex = 0;
while (iter.NextElem(&bvIndex))
{
AssertionIndex const index = GetAssertionIndex(bvIndex);
AssertionDsc* const curAssertion = optGetAssertion(index);

if (curAssertion->Equals(newAssertion, !optLocalAssertionProp))
{
return index;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? Shouldn't the previous case have found it if there is an equal assertion? Or will we only keep an assertion like v1 = v2 in one of the bit vectors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's kept for both (see just below) -- so yeah that second loop is not needed.

@AndyAyersMS
Copy link
Member Author

One odd thing is that after the force push / rebase I lost some of the improvements and gained some sizeable regressions. EG win x64 went from -1.1M to -800K.

I've verified this change is no diff vs baseline (either with or without cross-block enabled) so not sure what happened just yet. Will need to diff against some older build I guess.

Aha, some OSR cases got pessimized with the DFS reachability check, as the original method entry and such are considered unreachable. Will push a fix.

@AndyAyersMS
Copy link
Member Author

There are around 100 or so methods that see diffs now with cross-block enabled (because of the now fixed reachability check); seems like a lot of them are compiled regex methods:

[14:24:05] Top method improvements (percentages):
[14:24:05]          -65 (-4.90% of base) : 304049.dasm - System.Text.RegularExpressions.CompiledRegexRunner:Regex32447_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[ushort]):ubyte (FullOpts)
[14:24:05]          -73 (-3.87% of base) : 298116.dasm - System.Text.RegularExpressions.CompiledRegexRunner:Regex5110_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[ushort]):ubyte (FullOpts)
[14:24:05]          -34 (-3.84% of base) : 298413.dasm - System.Text.RegularExpressions.CompiledRegexRunner:Regex5473_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[ushort]):ubyte (FullOpts)
[14:24:05]          -10 (-1.35% of base) : 304046.dasm - System.Text.RegularExpressions.CompiledRegexRunner:Regex32443_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[ushort]):ubyte (FullOpts)
[14:24:05]           -9 (-1.18% of base) : 130597.dasm - System.IO.Compression.ZipArchive:WriteFile():this (FullOpts)
...etc...

@AndyAyersMS
Copy link
Member Author

Local TP diff with cross-block enabled (based on 7c8fff9). Seems almost too good to be true.

image

I'll prep another PR for enabling once this bit is merged, and we'll see what the lab says.

@AndyAyersMS
Copy link
Member Author

@jakobbotsch take another look when you can...

{
if (!BlockSetOps::IsMember(this, visited, fgEntryBB->bbNum))
{
fgDfsReversePostorderHelper(fgEntryBB, visited, preorderIndex, postorderIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't really reachable, right? (Except for maybe tailcall-to-loop opt). But we want them in the order to be able to propagate facts from them.

I wonder if SSA/VN could benefit similarly to seeing these.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a sign that CQ for OSR methods would benefit from inserting the jump to the start block very late, e.g. right before lowering (probably with some additional DCE for the unreachable blocks). Of course at some cost of TP, so maybe it would need to be restricted to loop-based patchpoints while the current behavior is ideal for partial compilation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to#33658 to try this someday. One consideration for deferral is that we might need to alter or block some optimizations that might remove computations from the OSR side (hoisting, CSE).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note currently fgEntryBB gets nulled out after morph.

@AndyAyersMS AndyAyersMS merged commit 8a84b33 into dotnet:main Nov 13, 2023
136 of 139 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants