Add lowering support for conditional nodes #71705

Merged
merged 26 commits into from
Aug 9, 2022

a74nh
Contributor

@a74nh a74nh commented Jul 6, 2022

This builds on the code added in #71616

The code is pulled directly from #67894

As before, with this patch nothing uses the conditional nodes, so the impact on code gen should be zero.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 6, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 6, 2022
@ghost

ghost commented Jul 6, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This builds on the code added in #71616

The code is pulled directly from #67286

As before, with this patch nothing uses the conditional nodes, so the impact on code gen should be zero.

Author: a74nh
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@a74nh
Contributor Author

a74nh commented Jul 6, 2022

Looks like those test failures might be related to my changes in LowerNodeCC and LowerHWIntrinsicCC. Will investigate.

@kunalspathak
Member

Seems there are code paths touching x64 as seen in spmi-diff

image

@a74nh
Contributor Author

a74nh commented Jul 11, 2022

Seems there are code paths touching x64 as seen in spmi-diff

Highly suspect this is my codegenxarch changes. Will look at this.

Comment on lines 4481 to 4482
// An And that is not contained should not have any contained children.
assert(!op1->isContained() && !op2->isContained());
Member

This code should not make any assumptions about this.

Contributor Author

The problem is handling that case.

Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?

Then should I apply compare chain logic to it? If op1 is a compare chain, then op2 should probably generate a conditional compare, and then the AND needs to generate into a register. Alternatively, just take the easy route: generate op1 into a register, op2 into a register, and do a normal AND.

The problem is, those cases aren't going to happen (due to the lower phase not creating non-contained ANDs that have contained children).

So maybe the answer is:
in the code generation for AND, if it has contained children, then generate those children into registers and then plant the AND as normal.

Member

@jakobbotsch jakobbotsch Jul 19, 2022

Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?

No; this should be left up to codegen of AND as it exists today.

For any uncontained child it does not make sense to ask questions about its children in turn; those will have been handled as part of its code generation. They should be opaque to the grandparent node. Above you are also assuming that all operands of GT_AND nodes are themselves GenTreeOp; this is not an ok assumption to make either.

These are probably indications that the implementation is not completely right yet. It is ok to make this work only for contained nodes and to rely on containment checks done in lowering previously, but the way this is currently implemented it is not the case. For example, for LIR like:

a = LCL_VAR V00
b = CNS_INT 4
c = AND a, b
d = LCL_VAR_ADDR V00
e = CALL Foo, d
f = CNS_INT 1
g = SELECT c, e, f

we will not be able to contain AND in SELECT, yet IIUC genCodeForSelect will come here and assume that LCL_VAR V00 and CNS_INT 4 can be cast to GenTreeOp (not true) and that they cannot be contained (not true for the constant).

Contributor Author

No; this should be left up to codegen of AND as it exists today.

There will still need to be some changes in the codegen for AND....

in CodeGen::genCodeForBinary() for Arm64 it only handles contained for MUL:
if (op2->OperIs(GT_MUL) && op2->isContained())

They should be opaque to the grandparent node

Agreed (and the subsequent comments too)

Member

There will still need to be some changes in the codegen for AND....

I agree this will be necessary if we are going to add support for AND to consume compares via ccmp instead of by materializing the truth value into a register. However, if we are leaving the support for SELECT only then I don't see why it should be necessary, but maybe I am missing something.

I guess the way I would shape this is the following:

  • Make genCodeForConditionalCompare always produce a value into the flags, never into a register. Assert that it is only called on contained AND and compare nodes of the right shapes. In the base cases, we still need to call genConsumeReg on operands and get values from registers.
  • Change genCodeForSelect to satisfy the above: if the cond op is contained, then call genCodeForConditionalCompare, and otherwise call genConsumeReg and generate code to compare the register with 0
  • (Optional) Make a similar change in codegen for GT_AND to support contained comparisons there via ccmp as well. This will need to materialize a truth value.

The way the current code is shaped might also be fine, but you are probably missing some handling for the base cases (where the operands are no longer contained).
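A rough sketch of the shape proposed in the list above, for illustration only. The `Node` struct and the functions `MakeCompare`, `MakeAnd`, `FailingNzcv`, `EmitConditionalCompare`, and `EmitSelect` are hypothetical, heavily simplified stand-ins, not the real GenTree or CodeGen API, and instructions are emitted as plain strings:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a JIT node (not the real GenTree).
enum class Oper { Compare, And, Other };

struct Node
{
    Oper        oper      = Oper::Other;
    bool        contained = false; // true: the parent emits this node's code
    std::string reg;               // uncontained: register holding the value
    std::string lhs, rhs, cond;    // Compare: operand registers and condition
    const Node* op1 = nullptr;     // And: operands
    const Node* op2 = nullptr;
};

Node MakeCompare(const std::string& lhs, const std::string& rhs, const std::string& cond)
{
    Node n;
    n.oper      = Oper::Compare;
    n.contained = true;
    n.lhs       = lhs;
    n.rhs       = rhs;
    n.cond      = cond;
    return n;
}

Node MakeAnd(const Node* op1, const Node* op2)
{
    Node n;
    n.oper      = Oper::And;
    n.contained = true;
    n.op1       = op1;
    n.op2       = op2;
    return n;
}

// NZCV value under which `cond` is false; ccmp installs it when an earlier
// link of the chain has already failed.
int FailingNzcv(const std::string& cond)
{
    if (cond == "ne" || cond == "gt") return 4; // Z set
    if (cond == "ge") return 8;                 // N set, V clear
    return 0;                                   // eq, lt, le are false with all flags clear
}

// Always leaves its result in the flags, never in a register.
// Returns the condition to test afterwards; prevCond is empty for the first compare.
std::string EmitConditionalCompare(const Node& n, const std::string& prevCond,
                                   std::vector<std::string>& out)
{
    assert(n.contained && (n.oper == Oper::Compare || n.oper == Oper::And));
    if (n.oper == Oper::And)
    {
        std::string c = EmitConditionalCompare(*n.op1, prevCond, out);
        return EmitConditionalCompare(*n.op2, c, out);
    }
    if (prevCond.empty())
        out.push_back("cmp " + n.lhs + ", " + n.rhs);
    else
        out.push_back("ccmp " + n.lhs + ", " + n.rhs + ", #" +
                      std::to_string(FailingNzcv(n.cond)) + ", " + prevCond);
    return n.cond;
}

// Contained condition: consume it via the flags. Otherwise compare its register with 0.
void EmitSelect(const Node& cond, const std::string& dst, const std::string& ifTrue,
                const std::string& ifFalse, std::vector<std::string>& out)
{
    std::string c;
    if (cond.contained)
    {
        c = EmitConditionalCompare(cond, "", out);
    }
    else
    {
        out.push_back("cmp " + cond.reg + ", #0");
        c = "ne";
    }
    out.push_back("csel " + dst + ", " + ifTrue + ", " + ifFalse + ", " + c);
}
```

With a contained AND of two compares as the condition, this emits a cmp, a ccmp, and a csel; with an uncontained condition already held in a register, it emits a compare against zero followed by a csel on ne.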

Contributor Author

That logic doesn't quite work. For a 'SELECT' node, if the conditional is contained then it could be a 'CMP' node or an 'AND' node. So we still need something to handle both.

I'll refine the existing code using some of the above, and add some code in the AND and then see where that gets us....

Member

@jakobbotsch jakobbotsch Jul 19, 2022

My suggestion was to handle the contained AND in genCodeForConditionalCompare, in the same way you are doing now.
It is fine to assert that you only see contained ANDs/CMP here.

Contributor Author

This part is updated.

Member

It looks great now, but I'll have to spend a bit of time on this to make sure that this handles interference checking with IsSafeToContainMem correctly.

// Return Value:
// True if the chain is valid
//
bool Lowering::ContainCheckCompareChain(GenTree* tree, GenTree* parent, GenTree** earliest_valid)
Member

I'm wondering if we can start using this in other places in this PR to test it without having to introduce the complicated if-conversion logic.
For example, should it not be beneficial to do this for any possible comparison/and node? E.g. code like
bool b = (x < 5) & (y < 3); might be able to use this even without if-conversion.
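
As a rough illustration of how an expression like the one above could map onto a compare chain, the sketch below models the AArch64 NZCV flags in plain C++. `Flags`, `Cmp`, `Ccmp`, `Holds`, and `BothLess` are hypothetical names, only four conditions are modelled, and the instruction sequence in the comment is an assumed target shape rather than actual JIT output:

```cpp
#include <cstdint>

// NZCV condition flags as set by AArch64 compare instructions.
struct Flags
{
    bool n, z, c, v;
};

enum class Cond { EQ, NE, LT, GE };

// Model of `cmp a, b`: the flags come from the 32-bit subtraction a - b.
Flags Cmp(int32_t a, int32_t b)
{
    uint32_t ua = static_cast<uint32_t>(a);
    uint32_t ub = static_cast<uint32_t>(b);
    uint32_t r  = ua - ub;
    Flags    f;
    f.n = (r >> 31) != 0;                      // result is negative
    f.z = (r == 0);                            // result is zero
    f.c = (ua >= ub);                          // no borrow
    f.v = (((ua ^ ub) & (ua ^ r)) >> 31) != 0; // signed overflow
    return f;
}

// Evaluate a condition code against the current flags.
bool Holds(Cond c, Flags f)
{
    switch (c)
    {
        case Cond::EQ: return f.z;
        case Cond::NE: return !f.z;
        case Cond::LT: return f.n != f.v;
        case Cond::GE: return f.n == f.v;
    }
    return false;
}

// Model of `ccmp a, b, #nzcv, cond`: if cond holds on the current flags,
// compare a with b; otherwise load the flags from the nzcv immediate.
Flags Ccmp(Flags cur, int32_t a, int32_t b, unsigned nzcv, Cond c)
{
    if (Holds(c, cur))
        return Cmp(a, b);
    return Flags{(nzcv & 8) != 0, (nzcv & 4) != 0, (nzcv & 2) != 0, (nzcv & 1) != 0};
}

// (x < 5) & (y < 3), evaluated the way this sequence would evaluate it:
//   cmp  x, #5
//   ccmp y, #3, #0, lt
//   cset w0, lt
bool BothLess(int32_t x, int32_t y)
{
    Flags f = Cmp(x, 5);
    f       = Ccmp(f, y, 3, 0, Cond::LT);
    return Holds(Cond::LT, f);
}
```

The nzcv immediate of 0 clears N and V, so when the first compare fails the final lt test also fails, and no truth value has to be materialized between the two compares.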

Member

In fact, thinking about it some more, this part of the PR seems orthogonal to GT_SELECT, there are several nodes that can be taught to consume flags without materializing a truth value in a register:

  • GT_AND via ccmp
  • GT_SELECT via csel
  • GT_JTRUE via cb

Hopefully all of these will end up on the same plan (doesn't have to be in this PR, but would be nice if we could do something for GT_AND to at least test it out).

Contributor Author

Thinking this through.... If I add compare chain logic to the lowering of GT_AND, then most of ContainCheckCompareChain() will vanish because the AND nodes will be lowered before the SELECT node. That's probably a good thing.

Member

Sounds good to me. Some heuristics may be needed to ensure we only do this when it is profitable; I'm not sure about the latency/throughput of ccmp vs. bitwise and. But this way we can get some early testing on this logic here which will be nice.

Also, we should be sure to only do this transformation when optimizations are enabled (comp->opts.OptimizationEnabled()).

Member

But this way we can get some early testing on this logic here which will be nice.

+1 on that. @a74nh, just pinging to see if you agree and plan to do this while lowering AND itself.

Member

It sounds like there is a decision to be made here -- what is the most efficient way to generate this pattern of nodes. It does not matter how we represent the sequence of nodes, we still have to make that decision.

Just to clarify, because I'm not sure my point was very clear. Right now the TEST_EQ transformation is assuming ahead of time that it is always the most optimal way to do it. If we instead had conditional compare nodes, we would be doing the opposite -- assuming ahead of time that the conditional compares are always the most optimal way. In reality we need to consider these transforms that conflict on a case by case basis and make a decision for each on what the best way is. So I think it is a good thing that we hit a situation like this.

Contributor Author

I also think it's best to disable the TEST_EQ optimisation if it sees the contained chain. Disabling that causes the JTRUE to JCC optimisation to kick in instead. Disabling that too means we end up with:

cmp     ...
ccmp    ...., le
cset x0, gt
cbz x0, label

Which is much better than the code without my patch.

I don't see how GT_SELECT helps in the general case. I can only see that it would help if the successor blocks to the block containing GT_JTRUE are simple enough that the entire thing can be replaced with GT_SELECT. Can you elaborate what you mean?

In most cases in the code above, the last two instructions will become a csel.
But, yes, there will be scenarios where we can't use GT_SELECT. In most of those cases though, we'll also fail to generate a compare chain. If it turns out there are more instances than I expect, then we can look at using the JCC in a follow on patch to replace the last two instructions in the code above with a jgt.

Member

Which is much better than the code without my patch.

Indeed, looks great. I look forward to the PR that puts JTRUE on the same plan as this.
One thing I would like to see is a micro benchmark to make sure our expectation that this is better is correct.
The easy way to do that is via BenchmarkDotNet, where you can specify the old and new corerun with --corerun <old corerun> <new corerun>.

In most cases in the code above, the last two instructions will become a csel.

I would not expect that to be the case. Only for very limited forms of IR will we be able to generate GT_SELECT in this case, since it requires both successor blocks to be single assignments to the same variable without any other side effects. This is probably a common pattern, but not the majority-case pattern?

Contributor Author

New patch with everything above included.

  • I've left in the compare chain size calculation in the lower phase, but it's not really being used at the moment, as it's allowing all chain sizes.
  • Getting the register consuming working took a while. I had to move the consume calls out of GenCodeCompare(), otherwise the node gets consumed in the BinaryOp generation, then during the compare chain generation, it gets consumed again during the compare generation.

Running asmdiffs on the library tests, I only get 10 functions firing. Not seeing any chains longer than the above examples being generated (possible that it's due to my code). Want to spend a little more time playing with the results.

Member

@jakobbotsch jakobbotsch Aug 3, 2022

FWIW, I'm fine with this getting low hits for now. I expect in .NET 8 we can enable it for JTRUE and expand boolean optimizations (I believe we do not transform (a relop b) && (c relop d) => (a relop b) & (c relop d) today), which should make the optimization much more impactful.

Also, we can presumably do something similar for bitwise or.

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Aug 1, 2022
@a74nh
Contributor Author

a74nh commented Aug 5, 2022

Failures in Antigen and jitstress and outerloop. Are these something I should be concerned about and any quick pointers on what I should be running to reproduce them?

@kunalspathak
Member

Antigen failures are known issues. I would wait for the jitstress and outerloop legs to complete. I do see some new failures (only on Arm) for outerloop, and the way to reproduce them is to download the correlation payload using runfo.

Some known failures are:

@jakobbotsch
Member

I do see some new failures (only on Arm) for outerloop

From what I can see there are no new outerloop failures if you compare to the last outerloop run on main:
https://dev.azure.com/dnceng/public/_build/results?buildId=1925086&view=results

On the other hand, the arm64 System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests failure looks like it is related.

@a74nh
Contributor Author

a74nh commented Aug 8, 2022

I do see some new failures (only on Arm) for outerloop

From what I can see there are no new outerloop failures if you compare to the last outerloop run on main: https://dev.azure.com/dnceng/public/_build/results?buildId=1925086&view=results

On the other hand, the arm64 System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests failure looks like it is related.

That's odd, I don't get any failures when I run it myself:

❯ ~/dotnet/runtime_csel/.dotnet/dotnet build -t:Test -c Release -p:XunitMethodName=System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests
MSBuild version 17.3.0-preview-22306-01+1c045cf58 for .NET
  Determining projects to restore...
  All projects are up-to-date for restore.
  Microsoft.Interop.SourceGeneration -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/Microsoft.Interop.SourceGeneration/Release/netstandard2.0/Microsoft.Interop.SourceGeneration.dll
  LibraryImportGenerator -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/LibraryImportGenerator/Release/netstandard2.0/Microsoft.Interop.LibraryImportGenerator.dll
  TestUtilities -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/TestUtilities/Release/net6.0/TestUtilities.dll
  System.Threading.Tests -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0/System.Threading.Tests.dll
  ----- start Mon Aug 8 10:05:18 UTC 2022 =============== To repro directly: =====================================================
  pushd /home/alahay01/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0
  /home/alahay01/dotnet/runtime_csel/artifacts/bin/testhost/net7.0-Linux-Release-arm64/dotnet exec --runtimeconfig System.Threading.Tests.runtimeconfig.json --depsfile System.Threading.Tests.deps.json xunit.console.dll System.Threading.Tests.dll -xml testResults.xml -nologo -method System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests -notrait category=OuterLoop -notrait category=failing
  popd
  ===========================================================================================================
  ~/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0 ~/dotnet/runtime_csel/src/libraries/System.Threading/tests
    Discovering: System.Threading.Tests (method display = ClassAndMethod, method display options = None)
    Discovered:  System.Threading.Tests (found 1 of 274 test case)
    Starting:    System.Threading.Tests (parallel test collections = on, max threads = 64)
    Finished:    System.Threading.Tests
  === TEST EXECUTION SUMMARY ===
     System.Threading.Tests  Total: 1, Errors: 0, Failed: 0, Skipped: 0, Time: 0.571s
  ~/dotnet/runtime_csel/src/libraries/System.Threading/tests
  ----- end Mon Aug 8 10:05:23 UTC 2022 ----- exit code 0 ----------------------------------------------------------
  exit code 0 means Exited Successfully

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:11.84


❯ env | grep COMPlus
COMPlus_TieredCompilation=1
COMPlus_JitStress=1

I also did a run with JitDisasm=*, which showed exactly 2 uses of ccmp. It's hard to tell much from this, though, because it was mixed with the output of other functions; a way of dumping each function to a different file would be really useful.

@a74nh
Contributor Author

a74nh commented Aug 8, 2022

Fixed up all of Kunal's comments.

@jakobbotsch
Member

jakobbotsch commented Aug 8, 2022

That's odd, I don't get any failures when I run it myself:

This is the codegen diff for System.Threading.SpinLock.Exit(bool). It does not look right, seems like this change ends up treating EQ(AND(x, 0x80000000), 0) as EQ(x, 0x80000000).

@jakobbotsch
Member

FWIW, whether or not you see the failure probably depends on whether we end up tiering System.Threading.SpinLock:Exit(bool).
You might be able to reproduce it more consistently with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0.

@a74nh
Contributor Author

a74nh commented Aug 8, 2022

FWIW, whether or not you see the failure probably depends on whether we end up tiering System.Threading.SpinLock:Exit(bool). You might be able to reproduce it more consistently with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0.

Getting the diff now, but not the failure. Investigating the IR.

@a74nh
Contributor Author

a74nh commented Aug 8, 2022

That's odd, I don't get any failures when I run it myself:

This is the codegen diff for System.Threading.SpinLock.Exit(bool). It does not look right, seems like this change ends up treating EQ(AND(x, 0x80000000), 0) as EQ(x, 0x80000000).

That looks fine to me.

Original code is:

        public void Exit(bool useMemoryBarrier)
        {
            // This is the fast path for the thread tracking is disabled and not to use memory barrier, otherwise go to the slow path
            // The reason not to add else statement if the usememorybarrier is that it will add more branching in the code and will prevent
            // method inlining, so this is optimized for useMemoryBarrier=false and Exit() overload optimized for useMemoryBarrier=true.
            int tmpOwner = _owner;
            if ((tmpOwner & LOCK_ID_DISABLE_MASK) != 0 & !useMemoryBarrier)
            {
                _owner = tmpOwner & (~LOCK_ANONYMOUS_OWNED);
            }
            else
            {
                ExitSlowPath(useMemoryBarrier);
            }
        }

(Note - this chaining only happens because the C# code is using & instead of &&).

We have the following IR: (Ignoring the children of NE 36 and EQ 41 for space)

                 [000033] -----------                         *  JTRUE     void  
                 [000034] J------N---                         \--*  EQ        int   
                 [000035] -----------                            +--*  AND       int   
                 [000036] -----------                            |  +--*  NE        int   ...
                 [000041] -----------                            |  \--*  EQ        int     ...
                 [000045] -----------                            \--*  CNS_INT   int    0

In current head, the AND EQ 0 is turned into a TEST_NE during lowering, giving:

N023 ( 15, 12) [000041] -----------                   t41 = *  EQ        int    REG x2 $202
N029 ( 10, 10) [000036] -----------                   t36 = *  TEST_NE   int    REG x3 $204
                                                            /--*  t41    int    
                                                            +--*  t36    int    
N031 ( 28, 26) [000034] J------N---                         *  TEST_NE   void   REG NA
N033 ( 30, 28) [000033] -----------                         *  JTRUE     void   REG NA $VN.Void

(and the NE was turned into a TEST_NE, but that's not relevant)

That's generated as:

EQ 41:
        7100003F          cmp     w1, #0        
        9A9F17E2          cset    x2, eq   
TEST_NE 36:
        7201027F          tst     w19, #0x80000000      
        9A9F07E3          cset    x3, ne  
TEST_NE 34:
        6A03005F          tst     w2, w3        
        54000161          bne     G_M57783_IG06

With my patch, that optimisation is skipped (due to hitting the isContained checks in lowering).

Instead, a different optimisation kicks in, switching the JTRUE EQ AND 0 to a JCMP AND 0:
(This optimisation isn't part of my patch)

  N008 ( 15, 12) [000041] -c---------                   t41 = *  EQ        int    $202
  N013 ( 10, 10) [000036] -c---------                   t36 = *  TEST_NE   int    $204
                                                              /--*  t41    int    
                                                              +--*  t36    int    
  N014 ( 26, 23) [000035] -----------                   t35 = *  AND       int    $205
  N015 (  1,  2) [000045] -c---------                   t45 =    CNS_INT   int    0 $40
                                                              /--*  t35    int    
                                                              +--*  t45    int    
  N016 ( 28, 26) [000034] CNE-------N---                         *  JCMP      void  

That generates:

Compare chain: EQ 41:
  IN0004:                           cmp     w1, #0
Compare chain: TEST_NE 36:
  IN0005:                           ccmp    w19, w2, z, eq
Compare chain Finished: move the result from flags into a register
  IN0006:                           cset    x2, ne
JCMP 34:
  IN0007:                           cbnz    w2, G_M57783_IG06

(Ideally, the chain wouldn't need to generate into a register)

@jakobbotsch
Member

The semantics of TEST_NE(x, y) is (x & y) != 0. I don't think you can turn this into a conditional compare.
Small self-contained example:

public static void Main(string[] args)
{
    Console.WriteLine(Foo(3, false));
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int i, bool b)
{
    if ((i & 1) != 0 & !b)
        return 1;
    return 0;
}

Expected result: 1
Actual result on this PR with TC disabled: 0
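
To make the distinction concrete, here is a minimal sketch (not taken from the PR) contrasting the two predicates. `TestNe` and `CmpNe` are hypothetical helper names. The sketch does not try to reproduce the exact miscompile; it only shows that an AND-based test and a subtraction-based compare disagree on simple inputs, which is why a TEST_ node cannot simply be emitted as a conditional compare:

```cpp
#include <cstdint>

// TEST_NE(x, y): true when x and y share at least one set bit.
// This is AND-based, like the flags produced by `tst`.
constexpr bool TestNe(uint32_t x, uint32_t y)
{
    return (x & y) != 0;
}

// NE(x, y): true when x and y differ.
// This is subtraction-based, like the flags produced by `cmp` or `ccmp`.
constexpr bool CmpNe(uint32_t x, uint32_t y)
{
    return x != y;
}

// The two predicates disagree in both directions.
static_assert(TestNe(1, 1) && !CmpNe(1, 1), "equal values with a common bit");
static_assert(!TestNe(2, 1) && CmpNe(2, 1), "different values with no common bit");

// A sign-bit mask check, in the spirit of the SpinLock example above.
static_assert(TestNe(0x80000000u, 0x80000000u) && !CmpNe(0x80000000u, 0x80000000u),
              "mask test is not an equality test");
```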

Change-Id: I8a1761e1e89f589e1daf0318e120aae5dd3d7241
CustomizedGitHooks: yes
@a74nh
Contributor Author

a74nh commented Aug 8, 2022

The semantics of TEST_NE(x, y) is (x & y) != 0. I don't think you can turn this into a conditional compare. Small self-contained example:

Right! I was paying attention to the wrong part of the code.

New version pushed. I've added OperIsCmpCompare() to ensure TEST_ nodes are not put into the chains. (I guess there's an argument for not creating the TEST_ nodes if a chain could be created, but I wouldn't want to do that here).

@kunalspathak
Member

Do we need to make similar change in codegen too? OperIsCompare() -> OperIsCmpCompare()?

@a74nh
Contributor Author

a74nh commented Aug 9, 2022

Do we need to make similar change in codegen too? OperIsCompare() -> OperIsCmpCompare()?

Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

Added a patch to do this. Plus, I added some tests for these types of sequences.

@kunalspathak
Member

Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

Sorry, I should have mentioned this yesterday, but I still see some places in gentree and lsrabuild where we still use OperIsCompare(). Do they also need fixup?

image

@a74nh
Contributor Author

a74nh commented Aug 9, 2022

Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

Sorry, I should have mentioned this yesterday, but I still see some places in gentree and lsrabuild where we still use OperIsCompare(). Do they also need fixup?

The lsrabuild one needed fixing up - done this now.

The others don't need changing - they are simply me reverting from OperIsCompare() || OperIsConditionalCompare() back to OperIsCompare().

@kunalspathak
Member

We don't have to do it in this PR, but I'm wondering: is the CCMP <immediate> variant not supported today?

image

Below I see that we do mov w1, #55 before using it in ccmp.
image

@a74nh
Contributor Author

a74nh commented Aug 9, 2022

We don't have to do it in this PR, but I'm wondering: is the CCMP <immediate> variant not supported today?

image

Below I see that we do mov w1, #55 before using it in ccmp. image

We are using the immediate version of ccmp, but it only has 5 bits of space for the value.

That's different to the immediate version of cmp, which has 12 bits plus an optional shift.

(This is why after containing a compare we have to redo the containing of its children)

You'll see quite a few places where an immediate will fit into cmp but not into ccmp.
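
A minimal sketch of the two immediate ranges described above. `FitsCcmpImm` and `FitsCmpImm` are hypothetical helper names, not JIT functions, and the sketch assumes non-negative values only (it ignores the negated forms that cmn and ccmn could cover):

```cpp
#include <cstdint>

// ccmp (immediate): 5-bit unsigned immediate, so 0..31.
constexpr bool FitsCcmpImm(uint64_t v)
{
    return v <= 31;
}

// cmp (immediate): 12-bit unsigned immediate, optionally shifted left by 12.
constexpr bool FitsCmpImm(uint64_t v)
{
    return v <= 0xFFF || ((v & 0xFFF) == 0 && (v >> 12) <= 0xFFF);
}

// 55 fits in cmp but not in ccmp, so a chain has to load it into a register first.
static_assert(FitsCmpImm(55) && !FitsCcmpImm(55), "55 needs a register for ccmp");
```

This is consistent with the mov w1, #55 seen before the ccmp: the constant can stay contained under a plain cmp, but has to be uncontained once the compare becomes part of a chain.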

@kunalspathak
Member

Thank you @a74nh for your contribution and thank you @jakobbotsch for the thorough review.

Member

@kunalspathak kunalspathak left a comment

LGTM
