
Add optimization "X & 1 == 1" to "X & 1" (#61412) #62818

Merged: 13 commits merged into dotnet:main on Mar 21, 2022

Conversation

@SkiFoD (Contributor) commented Dec 14, 2021

Looks like I got it working :) At least the result for a simple function
public static bool Test(int x) => (x & 1) == 1;
looks like this:
[image]
Would be glad to hear if I did something wrong and how I can make it better.
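For context, the identity being exploited: when the relop's value is consumed (for example, returned), (x & 1) == 1 produces exactly the same 0/1 value as x & 1 itself. A minimal standalone C++ check of that equivalence (illustration only, not code from the PR):

#include <cassert>

// Illustration only: for every int x, the 0/1 value of (x & 1) == 1 is
// identical to the value of x & 1, which is why the compare can simply be
// dropped when the relop's result is used as a value.
int main()
{
    for (int x = -100000; x <= 100000; x++)
    {
        int withCompare    = ((x & 1) == 1) ? 1 : 0;
        int withoutCompare = x & 1;

        assert(withCompare == withoutCompare);
    }
    return 0;
}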

@ghost added the community-contribution label (Indicates that the PR has been added by a community member) on Dec 14, 2021
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and removed the community-contribution label on Dec 14, 2021
@ghost commented Dec 14, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


Comment on lines 2786 to 2787
// Optimizes (X & 1) == 1 to (X & 1)
// GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
Contributor:

I would look into moving this optimization to morph. We tend not to put general peepholes like that into lowering.

Member:

My impression is that optimizations which depend on GTF_RELOP_JMP_USED are better placed in lower, after the various copy-props; at least in my recent similar opt, when I moved it to lower I got better diffs.

@SingleAccretion (Contributor), Dec 15, 2021:

Would be interesting to check what the diffs look like for this case. I cannot think off the top of my head why we would catch more cases in lower (well, modulo the "missed remorph" cases).

Relops under jumps are marked NO_CSE, so, in theory, we should have about the same number of !GTF_RELOP_JMP_USED ones before and after optimizations (modulo dead code and such).

It is a bit better to catch these things in morph in general because we help CSE make better decisions that way, but if it turns out the diffs say we should do it in lower, so be it.

{
GenTree* op1 = cmp->gtGetOp1();

if (op1->gtGetOp1()->gtOper == GT_LCL_VAR && op1->gtGetOp2()->IsIntegralConst(1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
Member:

You don't need to check op1's operator here, it can be anything


if (op1->gtGetOp1()->gtOper == GT_LCL_VAR && op1->gtGetOp2()->IsIntegralConst(1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
{
GenTree* next = cmp->gtNext;
Member:

I think you need to use LIR::Use::ReplaceWith here if you want to replace the root node.

@SkiFoD (Author) commented Dec 15, 2021

@EgorBo @SingleAccretion I'm concerned about the failed test cases. Most of them are "Assertion failed 'OperIsSimple()'". It might mean that the optimization is being applied not only to return statements. Do you have any idea what went wrong? :)

@EgorBo (Member) commented Dec 15, 2021

@EgorBo @SingleAccretion I'm concerned about the failed test cases. Most of them are "Assertion failed 'OperIsSimple()'". It might mean that the optimization is being applied not only to return statements. Do you have any idea what went wrong? :)

I am sure you just aren't replacing the node correctly. You can indeed start by doing it in morph.cpp first, as it should be much easier to test, and then we'll check the diffs.

@SingleAccretion (Contributor) commented Dec 15, 2021

The code assumes that the user of the relop will be the next node, which is not a correct assumption; there may be intervening nodes in between (an arbitrary number of them, in fact).

You'll need to use LIR::Use::TryGetUse (look for other usages in lowering) and replace the node.

(I agree checking morph first would be better)
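A rough sketch of the suggested pattern (shape only, inferred from the APIs named above; the exact signatures vary across JIT versions, and cmp/comp here stand for the relop node and the Compiler instance):

// Sketch, not the PR's code: find the real user of the relop in the block's
// LIR range and rewire it to the AND, instead of assuming cmp->gtNext is the user.
LIR::Use cmpUse;
if (BlockRange().TryGetUse(cmp, &cmpUse))
{
    GenTree* andNode = cmp->gtGetOp1();

    cmpUse.ReplaceWith(comp, andNode);    // the user now consumes the AND directly
    BlockRange().Remove(cmp->gtGetOp2()); // the no-longer-needed constant 1
    BlockRange().Remove(cmp);             // and the relop itself
}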

Comment on lines 11504 to 11517
// Optimizes (X & 1) == 1 to (X & 1)
// GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
if (tree->gtGetOp2()->IsIntegralConst(1) && tree->gtGetOp1()->OperIs(GT_AND) && !(tree->gtFlags & GTF_RELOP_JMP_USED))
{
    GenTree* op1 = tree->gtGetOp1();

    if (op1->gtGetOp2()->IsIntegralConst(1))
    {
        DEBUG_DESTROY_NODE(tree->gtGetOp2());
        DEBUG_DESTROY_NODE(tree);

        return fgMorphTree(op1);
    }
}
Contributor:

This should be done in post-order, when all the constants have been discovered.

Comment on lines 12085 to 12099
// Optimizes (X & 1) == 1 to (X & 1)
// GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
if (tree->gtGetOp2()->IsIntegralConst(1) && tree->gtGetOp1()->OperIs(GT_AND) &&
    !(tree->gtFlags & GTF_RELOP_JMP_USED))
{
    GenTree* op1 = tree->gtGetOp1();

    if (op1->gtGetOp2()->IsIntegralConst(1))
    {
        DEBUG_DESTROY_NODE(tree->gtGetOp2());
        DEBUG_DESTROY_NODE(tree);

        return op1;
    }
}
Contributor:

(In post-order) ...and inside fgOptimizeEqualityComparisonWithConst.

(Sorry for not being complete in my first comment :( )

Contributor (Author):

I'm happy to hear any corrections :) feel free

Contributor (Author):

(In post-order) Do you mean the order in which the nodes are being destroyed?

Contributor (Author):

fgOptimizeEqualityComparisonWithConst is supposed to return a comparison node, while here we have to get rid of the equality node. Are you sure the optimization should be placed there?

runtime/src/coreclr/jit/morph.cpp

Lines 12085 to 12086 in 99d82c2

tree = fgOptimizeEqualityComparisonWithConst(tree->AsOp());
assert(tree->OperIsCompare());

Contributor:

Are you sure the optimization should be placed there?

Yeah, this is an artificial restriction and can be removed. You may also need to remove a few downstream asserts too.

Contributor (Author):

Yeah, you are right. It works. But I'm still confused about the ordering: what exactly must be in post-order?

Contributor:

This is tree traversal terminology:

fgMorphSmpOp:
   Pre-Order (before the operands were morphed)
   Morphing the operands
   Post-Order (after the operands were morphed)

As a general rule, optimizations should be done in post-order, because then we will know facts about the operands we wouldn't have known in pre-order (such as that they are constant, or have side effects).
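A toy standalone example (no JIT types) of the same idea: a parent node can only fold itself once its operands have already been folded, which is exactly the post-order position:

#include <cstdio>

// Toy constant folder: fold the operands first, then try to fold the parent.
struct Node
{
    char  op;    // '+' for addition, 'c' for constant
    int   value; // valid when op == 'c'
    Node* left;
    Node* right;
};

static Node* Fold(Node* n)
{
    if (n->op == 'c')
    {
        return n;
    }

    n->left  = Fold(n->left);  // "morph" the operands first
    n->right = Fold(n->right);

    // Post-order: only now do we know whether both operands became constants.
    if ((n->left->op == 'c') && (n->right->op == 'c'))
    {
        n->op    = 'c';
        n->value = n->left->value + n->right->value;
    }
    return n;
}

int main()
{
    Node c1{'c', 1, nullptr, nullptr};
    Node c2{'c', 2, nullptr, nullptr};
    Node sum{'+', 0, &c1, &c2};
    Node c3{'c', 3, nullptr, nullptr};
    Node root{'+', 0, &sum, &c3}; // (1 + 2) + 3

    Fold(&root);
    printf("%d\n", root.value);   // prints 6: the whole tree folded in one post-order pass
    return 0;
}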

Contributor (Author):

This is tree traversal terminology:

fgMorphSmpOp:
   Pre-Order (before the operands were morphed)
   Morphing the operands
   Post-Order (after the operands were morphed)

As a general rule, optimizations should be done in post-order, because then we will know facts about the operands we wouldn't have known in pre-order (such as that they are constant, or have side effects).

Oh, I know about the terminology. Thank you for the explanation :)

// / \
// x CNS 1
//
// GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
Contributor:

Suggested change
// GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
// The compiler requires jumps to have relop operands, so we do not fold that case.

Notably relops can appear anywhere r-values can, not just under GT_RETURNs.

@@ -13550,6 +13545,26 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
DEBUG_DESTROY_NODE(rshiftOp->gtGetOp2());
DEBUG_DESTROY_NODE(rshiftOp);
}

// Optimizes (X & 1) == 1 to (X & 1)
Contributor:

Hmm, I think it'll be easier to reason about this if we move the (new) optimization just under the op2->IsIntegralConst(0) || op2->IsIntegralConst(1) check, before the other ones. On success this returns early; on failure it does nothing, so subsequent transformations don't need to take it into account.

Also, presumably we would need to check op2Value == 1.

DEBUG_DESTROY_NODE(cmp->gtGetOp2());
DEBUG_DESTROY_NODE(cmp);

return op1;
Contributor:

Please update the function header: Currently only returns relop trees. -> Currently only returns GTK_SMPOP trees.

@SkiFoD (Author) commented Dec 17, 2021

@EgorBo @SingleAccretion It seems like I placed the code in the right place, but the tests are failing :( What can I do about it?

@SingleAccretion (Contributor):

It seems that's just the build failing:

/__w/1/s/src/coreclr/jit/morph.cpp:13457:36: error: backslash and newline separated by space [-Werror,-Wbackslash-newline-escape]
        //                     /   \                  
                                   ^

I think the fix is to remove the trailing whitespace in the picture.

// x CNS 1
//
// The compiler requires jumps to have relop operands, so we do not fold that case.
if (op1->OperIs(GT_AND) && op2Value == 1 && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
Contributor:

Suggested change
if (op1->OperIs(GT_AND) && op2Value == 1 && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
if (op1->OperIs(GT_AND) && (op2Value == 1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))

See the code guide.

@@ -13446,6 +13441,26 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
{
ssize_t op2Value = static_cast<ssize_t>(op2->IntegralValue());

// Optimizes (X & 1) == 1 to (X & 1)
Contributor:

This also needs to check that we will be returning a properly typed tree here. X & 1 can be of TYP_LONG, while relops are always TYP_INT.

There are two ways to fix this: a) just give up if the AND is not of TYP_INT (genActualType(andNode) == relop->TypeGet()), or b) try our luck with optNarrowTree (see an example of usage in this function). Note that optNarrowTree should only be called under an fgGlobalMorph guard.

The optNarrowTree solution should get better diffs; the "always check types" one is simpler. It's up to you which one to implement.

Fixing this should fix the x86 failures we're seeing in CI.

(Also, for debugging, SPMI is a very useful tool.)
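A standalone way to see the width mismatch being described (illustration only; the JIT type names in the comments are just the analogy being drawn):

#include <cstdint>
#include <cstdio>

// The comparison result of (x & 1) == 1 is an int-sized 0/1 value (relops are
// always TYP_INT in the JIT), while x & 1 keeps the width of x (TYP_LONG for a
// 64-bit x). Substituting the AND for the relop is therefore only type-preserving
// when x is already int-sized.
int main()
{
    int64_t x = 0x100000001LL;

    int     relopResult = (x & 1) == 1; // always a 32-bit 0/1
    int64_t andResult   = x & 1;        // still 64 bits wide

    printf("relop: %d (%zu bytes), and: %lld (%zu bytes)\n",
           relopResult, sizeof(relopResult), (long long)andResult, sizeof(andResult));
    return 0;
}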

@SkiFoD (Author) commented Dec 23, 2021

@SingleAccretion Well, it seems my usage of optNarrowTree caused more tests to fail. Could you tell me if I implemented your suggestion right?

@SingleAccretion (Contributor):

Well, it seems my usage of optNarrowTree caused more tests to fail. Could you tell me if I implemented your suggestion right?

The pattern with optNarrowTree is that you first ask if it will be able to narrow, then ask it to actually narrow:

if (optNarrowTree(..., false))
{
    optNarrowTree(..., true);
}

That said, from reading the optNarrowTree code it was not obvious to me why it would fail (it special-cases, for unclear reasons, GT_ANDs and manually inserts narrowing casts for them).

It seems the SPMI diffs failed; you should be able to run the replay locally and debug the failures.

@SkiFoD (Author) commented Dec 23, 2021

@EgorBo @SingleAccretion
I tried implementing both ways and ran superpmi replays afterward; both times I got
MISSING: Method context 91188 failed to replay: SuperPMI assertion failed (missing map IsFieldStatic): key 00007FFD63664B30 SuperPMI encountered missing data for 1 out of 260885 contexts.
I suppose there must be some JIT options allowing me to see what's wrong and debug it, but I haven't managed to find them.

Here are the logs from my last run:
superpmi.77.log
superpmi.78.log

@SingleAccretion (Contributor):

@SkiFoD I think you need to replay the Windows x86 collection (-arch x86), since that's where failures in CI showed up (and where we'd expect them) (note you will need to ./build clr.jit -a x86 -c Checked first).

@SkiFoD (Author) commented Dec 24, 2021

@SingleAccretion @EgorBo I did the replay with the -arch x86 option and got the same error: MISSING: Method context 90668 failed to replay: SuperPMI assertion failed (missing map IsFieldStatic): key 00000000095B5894 SuperPMI encountered missing data for 1 out of 260068 contexts.
There must be some way to investigate what is wrong with the assertion; how do you usually debug such cases?
superpmi.100.log

@SingleAccretion (Contributor):

@SkiFoD looks like the replay didn't pick up the libraries.pmi collection for some reason. For reference, here's the assert in CI:

Running asm diffs of C:\h\w\B9C10A09\p\artifacts\spmi\mch\1d61ee87-b3be-48ae-a12e-2fb9b5b1cee7.windows.x86\libraries.pmi.windows.x86.checked.mch
...
ISSUE: <ASSERT> D:\a\_work\1\s\src\coreclr\jit\optimizer.cpp (5375) - Assertion failed 'doit == false' in 'System.Diagnostics.EventLogInternal:CompletionCallback(System.Object):this' during 'Morph - Global' (IL size 358)

Try -filter libraries.pmi --force_download perhaps?

@SkiFoD (Author) commented Dec 27, 2021

@SingleAccretion It seems I managed to fix the test failures. I ended up using option a), because narrowing showed me this error during replays:
Assertion failed 'src2->OperIs(GT_LONG)' in 'System.Diagnostics.EventLogInternal:CompletionCallback(System.Object):this' during 'Lowering nodeinfo' (IL size 358)
The error occurs here:

assert(src1->OperIs(GT_LONG));
assert(src2->OperIs(GT_LONG));

I assume it happened because one of the operands had been narrowed to TYP_INT, which made the assertion fail.

@SingleAccretion (Contributor) left a comment:

I see there are regressions in the diffs. What is their cause and can they be mitigated?

@@ -13446,6 +13441,29 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
{
ssize_t op2Value = static_cast<ssize_t>(op2->IntegralValue());

if (fgGlobalMorph)
Contributor:

fgGlobalMorph can be dropped now that we don't use optNarrowTree.

(This transform preserves VNs, and the whole method is under the "not CSE" guard)

Comment on lines 13457 to 13463
if (op1->gtGetOp2()->IsIntegralConst(1) && (genActualType(op1) == cmp->TypeGet()))
{
    DEBUG_DESTROY_NODE(op2);
    DEBUG_DESTROY_NODE(cmp);

    return op1;
Contributor:

It is not a correctness requirement, but I would recommend transferring the relop's VN to the AND tree. Relop VNs are more "valuable":

op1->SetVNsFromNode(cmp);
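Put together, the review feedback up to this point amounts to roughly the following shape for the morph-phase version (a sketch assembled from the snippets quoted in this thread, not the exact merged code; the PR ultimately moved the transform to lowering):

// Inside fgOptimizeEqualityComparisonWithConst, once op2Value has been computed:
// fold (X & 1) == 1 to X & 1, but not when the relop feeds a jump, and only
// when the AND already has the relop's (int) type.
if (op1->OperIs(GT_AND) && (op2Value == 1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
{
    if (op1->gtGetOp2()->IsIntegralConst(1) && (genActualType(op1) == cmp->TypeGet()))
    {
        op1->SetVNsFromNode(cmp); // relop VNs are more "valuable", so transfer them

        DEBUG_DESTROY_NODE(op2);
        DEBUG_DESTROY_NODE(cmp);

        return op1;
    }
}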

@SkiFoD (Author) commented Dec 28, 2021

I see there are regressions in the diffs. What is their cause and can they be mitigated?

I got rid of the regressions, but I had to bring back the (op1->gtGetOp1()->gtOper == GT_LCL_VAR) check.
The bad news is that the error MISSING: Method context 90668 failed to replay: SuperPMI assertion failed (missing map IsFieldStatic): key 00000000095B5894 SuperPMI encountered missing data for 1 out of 260068 contexts keeps showing up, and I can't figure out what causes it.

@SingleAccretion (Contributor) left a comment:

keeps showing up and I can't figure out what causes it.

I wouldn't worry about it, it is "expected".

More on this

The way SPMI works, it collects all the communication between the Jit and the EE for a given session, and then replay uses that data. But of course, it is not necessary that the Jit being used for replay would be asking the same exact questions that the Jit used for collections would have. Asking less is ok, asking more - that's where the MISSING (data) problem comes from.

The collections are updated on a weekly basis (I believe), so even in normal operation MISSING errors can happen, say because a change merged in the middle of the week tweaked the Jit-EE traffic in some way where more information is now needed (and there is none in the existing collections).

It is the case that for changes not expected to alter the Jit-EE traffic such as this one, MISSING errors are (almost) entirely ignorable. However, if a change (even a pure Jit one) needs "too much" data from the collections that is not there, that's when SPMI becomes less useful for diffs (and replay), and regular PMI/CG-en-based-diffs are employed.

// The compiler requires jumps to have relop operands, so we do not fold that case.
if (op1->OperIs(GT_AND) && (op2Value == 1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
{
if ((op1->gtGetOp1()->gtOper == GT_LCL_VAR) && op1->gtGetOp2()->IsIntegralConst(1) &&
Contributor:

Why does restricting the AND's operand to a local solve the regressions - what was their source?

Contributor (Author):

@SingleAccretion Well, I got quite a few regressions, and it seemed suspicious to me that such a tiny optimization could cause this. I used jit-analyzer to figure out the actual diff between the dasms, but ended up just comparing them side by side and found that the optimization introduced extra asm instructions (I'm not good at low-level engineering, I'm learning), so I assumed that the optimization had been applied in cases where it should not have been.
The restriction for x to be a local variable (this is how I understand GT_LCL_VAR) did the trick. I'm curious why @EgorBo said that x can be anything, and I would like to dig deeper to figure out the source of the regressions when x is anything. How do you debug such cases?

Contributor:

Well, I got quite a few regressions, and it seemed suspicious to me that such a tiny optimization could cause this.

I agree, that is why it would be good to know; the restriction on the local may just be hiding it. Some regressions cannot be avoided (it is usual that, for example, we get some due to different register allocation), but we should have a good understanding of the causes to make decisions.

How do you debug such cases?

I personally do it by: a) fully diffing with SPMI b) analyzing the cases individually by first diffing (as in git diff, or some online tool) the assembly, then the JitDumps (these ones are usually only git diff-able...), and debugging as necessary. I believe these days you can do this with the SPMI script directly by using the -c argument to pass the method context number that had the diffs.

Contributor (Author):

Thank you for the debugging advice. I've spent a few days comparing dasm files. It is very hard for me to tell what exactly is wrong, but I found some suspicious facts:

  1. Regressions mostly occur when 'x' is a pointer (or a ref variable).
     Here is a diff example: https://www.diffchecker.com/fQLJQ4YW
  2. Regressions mostly occur in MoveNext() methods.
     Example: [image]
  3. Some regressed dasms show a terrible PerfScore. Here is an example: https://www.diffchecker.com/BQUzlt9K

Contributor:

Thank you for looking into this. I took a look myself, I agree it is a bit of a tricky case.

We lose tail duplication because it only looks at relops for our purposes. Then because of that we lose relop forward substitution, but perhaps more importantly, the flow that RA can, apparently, better work with.

If we look at PerfScore instead of code size, things look better (just a few small regressions); this holds both for libraries.pmi and libraries_tests.pmi. The PerfScore outliers you note are indeed curious. It appears that for the "diff" method the altered flow causes the compiler to use synthetic scales (instead of profile data) for more blocks, and for more general loops to exist.

(Side note: JitDisablePgo=1 doesn't appear to work well, so it is a bit hard to say why things are the way they are)

Overall, it appears doing this in lowering will be better after all.

@JulieLeeMSFT (Member):

@EgorBo please review and approve this PR.
Thanks @SkiFoD and @SingleAccretion for the contribution and extensive reviews.

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Jan 25, 2022
@JulieLeeMSFT (Member):

Ping @EgorBo for community PR review.

@EgorBo (Member) commented Mar 14, 2022

@SkiFoD could you please rebase this one as well (and tag me)?

@SkiFoD (Author) commented Mar 15, 2022

@EgorBo I updated the branch and ran SPMI once again. I got confused by the results, so I'd like to know your opinion on them.

@JulieLeeMSFT (Member):

Failure is due to #63854.
Not related to this change.

@JulieLeeMSFT JulieLeeMSFT merged commit ae941b5 into dotnet:main Mar 21, 2022
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
* Add optimization "X & 1 == 1" to "X & 1" (dotnet#61412)

* Moved the optimization to the morph phase (dotnet#61412)

* Done in post-order (dotnet#61412)

* Moved the optimization into fgOptimizeEqualityComparisonWithConst (dotnet#61412)

* Some corrections due the comments (dotnet#61412)

* Fix of the picture (dotnet#61412)

* Add optNarrowTree use (dotnet#61412)

* Change narrowing to the type check (dotnet#61412)

* Fix regressions (dotnet#61412)

* Moved the optimization to the lowering phase (dotnet#61412)

* Reverted Morph changes (dotnet#61412)

* Moved the optimization into OptimizeConstCompare method (dotnet#61412)

* Add GT_EQ check(dotnet#61412)
@ghost ghost locked as resolved and limited conversation to collaborators Apr 21, 2022