-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Improve local assertion prop throughput #94597
JIT: Improve local assertion prop throughput #94597
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue Detailsnull
|
Locally this looks really good for TP once cross-block is enabled. I picked up the fix from #94608 so let's see what the lab thinks. |
This reverts commit 062552152a1d209f6d9dbd89da8dd7f14b1a56fd.
034dbda
to
e98dca2
Compare
With 6a07b54 I am now seeing 0.13% (with cross-block local AP enabled) on benchmarks.run; baseline is more like 1.57%.... so this may get us to a happy TP place. Would be nice to enacapsulate this pattern a bit better, perhaps. Also maybe in global AP we can think about ways to do similar indexing, as there are still various whole-table searches going on there. |
One odd thing is that after the force push / rebase I lost some of the improvements and gained some sizeable regressions. EG win x64 went from -1.1M to -800K. I've verified this change is no diff vs baseline (either with or without cross-block enabled) so not sure what happened just yet. Will need to diff against some older build I guess. |
@jakobbotsch PTAL |
No codegen diffs; some TP a bit faster, some slower -- the more dramatic TP impact here is with cross-block enabled (see above). It may be that our bit vector iterator has some unnecessary overhead, as it seems like walking through an N entry table (with at most 64 entries) should generally be more costly that iterating through a bit vector with at most N (and likely just a few) bits set. |
src/coreclr/jit/assertionprop.cpp
Outdated
AssertionIndex const index = GetAssertionIndex(bvIndex); | ||
AssertionDsc* const curAssertion = optGetAssertion(index); | ||
|
||
if (curAssertion->Equals(newAssertion, !optLocalAssertionProp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (curAssertion->Equals(newAssertion, !optLocalAssertionProp)) | |
if (curAssertion->Equals(newAssertion, /* vnBased */ false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also in the other cases)
src/coreclr/jit/assertionprop.cpp
Outdated
if (newAssertion->op2.kind == O2K_LCLVAR_COPY) | ||
{ | ||
lclNum = newAssertion->op2.lcl.lclNum; | ||
BitVecOps::Iter iter(apTraits, GetAssertionDep(lclNum)); | ||
unsigned bvIndex = 0; | ||
while (iter.NextElem(&bvIndex)) | ||
{ | ||
AssertionIndex const index = GetAssertionIndex(bvIndex); | ||
AssertionDsc* const curAssertion = optGetAssertion(index); | ||
|
||
if (curAssertion->Equals(newAssertion, !optLocalAssertionProp)) | ||
{ | ||
return index; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? Shouldn't the previous case have found it if there is an equal assertion? Or will we only keep an assertion like v1 = v2
in one of the bit vectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's kept for both (see just below) -- so yeah that second loop is not needed.
Aha, some OSR cases got pessimized with the DFS reachability check, as the original method entry and such are considered unreachable. Will push a fix. |
There are around 100 or so methods that see diffs now with cross-block enabled (because of the now fixed reachability check); seems like a lot of them are compiled regex methods:
|
Local TP diff with cross-block enabled (based on 7c8fff9). Seems almost too good to be true. I'll prep another PR for enabling once this bit is merged, and we'll see what the lab says. |
@jakobbotsch take another look when you can... |
{ | ||
if (!BlockSetOps::IsMember(this, visited, fgEntryBB->bbNum)) | ||
{ | ||
fgDfsReversePostorderHelper(fgEntryBB, visited, preorderIndex, postorderIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't really reachable, right? (Except for maybe tailcall-to-loop opt). But we want them in the order to be able to propagate facts from them.
I wonder if SSA/VN could benefit similarly to seeing these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a sign that CQ for OSR methods would benefit from inserting the jump to the start block very late, e.g. right before lowering (probably with some additional DCE for the unreachable blocks). Of course at some cost of TP, so maybe it would need to be restricted to loop-based patchpoints while the current behavior is ideal for partial compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to#33658 to try this someday. One consideration for deferral is that we might need to alter or block some optimizations that might remove computations from the OSR side (hoisting, CSE).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note currently fgEntryBB
gets nulled out after morph.
Leverage the "dep vectors" to avoid the search the assertion table during local assertion prop. Helps the current (small table) behavior some, helps the future cross-block (larger table) behavior more.
Similar tricks may be possible for global AP, though the set of assertions there is more varied.
Contributes to #93246.