Add lowering support for conditional nodes #71705
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Looks like those test failures might be related to my changes in LowerNodeCC and LowerHWIntrinsicCC. Will investigate.
Seems there are code paths touching x64 as seen in spmi-diff |
Highly suspect this is my codegenxarch changes. Will look at this. |
Force-pushed from 2fbd8e5 to 2b34f75
src/coreclr/jit/codegenarm64.cpp
// An And that is not contained should not have any contained children.
assert(!op1->isContained() && !op2->isContained());
This code should not make any assumptions about this.
The problem is handling that case.
Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?
Then should I apply compare chain logic to it? If op1 is a compare chain, then op2 should probably generate a conditional compare, and then the AND needs to generate into a register. Alternatively, just do it the easy way: generate op1 into a register, op2 into a register, and do a normal AND.
The problem is, those cases aren't going to happen (because the lower phase doesn't create non-contained ANDs that have contained children).
So maybe the answer is: in the code generation for AND, if it has contained children, then generate those children into registers and then emit the AND as normal.
Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?
No; this should be left up to codegen of AND as it exists today.
For any uncontained child it does not make sense to ask questions about its children in turn; those will have been handled as part of its code generation. They should be opaque to the grandparent node. Above you are also assuming that all operands of GT_AND nodes are themselves GenTreeOp; this is not an ok assumption to make either.
These are probably indications that the implementation is not completely right yet. It is ok to make this work only for contained nodes and to rely on containment checks done in lowering previously, but the way this is currently implemented it is not the case. For example, for LIR like:
a = LCL_VAR V00
b = CNS_INT 4
c = AND a, b
d = LCL_VAR_ADDR V00
e = CALL Foo, d
f = CNS_INT 1
g = SELECT c, e, f
we will not be able to contain AND in SELECT, yet IIUC genCodeForSelect will come here and assume that LCL_VAR V00 and CNS_INT 4 can be cast to GenTreeOp (not true) and that they cannot be contained (not true for the constant).
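To make the interference concrete, here is a hypothetical C analogue of that LIR (the names Foo and g and the constants are illustrative only, not from the PR): the call sits between the AND and the SELECT and may write through the variable's address, so re-deriving the AND's flags at the SELECT would observe a stale value.

```c
/* Hypothetical stand-in for "CALL Foo, d" in the LIR above: it writes
   through the pointer, clobbering V00 after the AND was evaluated. */
static int Foo(int *p)
{
    *p = 0;
    return 7;
}

static int g(int x)
{
    int c = x & 4;    /* c = AND a, b      (evaluated first)         */
    int e = Foo(&x);  /* e = CALL Foo, d   (may modify x through &x) */
    return c ? e : 1; /* g = SELECT c, e, f                          */
}
```

With x = 4 the AND is non-zero, so g must return Foo's result even though x is 0 by the time the select happens; if codegen re-evaluated x & 4 at the select point it would wrongly pick 1.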
No; this should be left up to codegen of AND as it exists today.
There will still need to be some changes in the codegen for AND.... In CodeGen::genCodeForBinary() for Arm64 it only handles containment for MUL:
if (op2->OperIs(GT_MUL) && op2->isContained())
They should be opaque to the grandparent node
Agreed (and the subsequent comments too)
There will still need to be some changes in the codegen for AND....
I agree this will be necessary if we are going to add support for AND to consume compares via ccmp instead of by materializing the truth value into a register. However, if we are leaving the support for SELECT only, then I don't see why it should be necessary, but maybe I am missing something.
I guess the way I would shape this is the following:
- Make genCodeForConditionalCompare always produce a value into the flags, never into a register. Assert that it is only called on contained AND and compare nodes of the right shapes. In the base cases, we still need to call genConsumeReg on operands and get values from registers.
- Change genCodeForSelect to satisfy the above: if the cond op is contained, then call genCodeForConditionalCompare, and otherwise call genConsumeReg and generate code to compare the register with 0.
- (Optional) Make a similar change in codegen for GT_AND to support contained comparisons there via ccmp as well. This will need to materialize a truth value.
The way the current code is shaped might also be fine, but you are probably missing some handling for the base cases (where the operands are no longer contained).
That logic doesn't quite work. For a 'SELECT' node, if the conditional is contained then it could be a 'CMP' node or an 'AND' node. So we still need something to handle both.
I'll refine the existing code using some of the above, and add some code in the AND and then see where that gets us....
My suggestion was to handle the contained AND in genCodeForConditionalCompare, in the same way you are doing now.
It is fine to assert that you only see contained ANDs/CMPs here.
This part is updated.
It looks great now, but I'll have to spend a bit of time on this to make sure that this handles interference checking with IsSafeToContainMem correctly.
src/coreclr/jit/lower.cpp
// Return Value:
// True if the chain is valid
//
bool Lowering::ContainCheckCompareChain(GenTree* tree, GenTree* parent, GenTree** earliest_valid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can start using this in other places in this PR to test it without having to introduce the complicated if-conversion logic.
For example, should it not be beneficial to do this for any possible comparison/and node? E.g. code like bool b = (x < 5) & (y < 3); might be able to use this even without if-conversion.
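For instance (a hypothetical C analogue; the actual instruction selection is of course up to the JIT), the non-short-circuiting & means both compares execute unconditionally, which is exactly the shape a cmp + ccmp + cset sequence can implement without branches:

```c
/* Non-short-circuiting '&': both compares run regardless of the first
   one's result, so the pair can be folded into a compare followed by a
   conditional compare instead of materializing two booleans and AND-ing
   them. */
static int both_less(int x, int y)
{
    return (x < 5) & (y < 3);
}
```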
In fact, thinking about it some more, this part of the PR seems orthogonal to GT_SELECT; there are several nodes that can be taught to consume flags without materializing a truth value in a register:
- GT_AND via ccmp
- GT_SELECT via csel
- GT_JTRUE via cb
Hopefully all of these will end up on the same plan (doesn't have to be in this PR, but would be nice if we could do something for GT_AND to at least test it out).
Thinking this through.... If I add compare chain logic to the lowering of GT_AND, then most of ContainCheckCompareChain() will vanish because the AND nodes will be lowered before the SELECT node. That's probably a good thing.
Sounds good to me. Potentially some heuristics may be needed to ensure we only do this when it is profitable; not sure about the latency/throughput of ccmp vs bitwise and. But this way we can get some early testing on this logic here, which will be nice.
Also, we should be sure to only do this transformation when optimizations are enabled (comp->opts.OptimizationEnabled()).
But this way we can get some early testing on this logic here which will be nice.
+1 on that. @a74nh, just pinging to see if you agree and plan to do this while lowering AND itself.
It sounds like there is a decision to be made here -- what is the most efficient way to generate this pattern of nodes. It does not matter how we represent the sequence of nodes, we still have to make that decision.
Just to clarify, because I'm not sure my point was very clear. Right now the TEST_EQ
transformation is assuming ahead of time that it is always the most optimal way to do it. If we instead had conditional compare nodes, we would be doing the opposite -- assuming ahead of time that the conditional compares are always the most optimal way. In reality we need to consider these transforms that conflict on a case by case basis and make a decision for each on what the best way is. So I think it is a good thing that we hit a situation like this.
I also think it's best to disable the TEST_EQ optimisation if it sees the contained chain. Disabling that causes the JTRUE-to-JCC optimisation to kick in instead. Disabling that too means we end up with:
cmp ...
ccmp ...., le
cset x0, gt
cbz x0, label
Which is much better than the code without my patch.
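For readers unfamiliar with how ccmp chains conditions in the sequence above, here is a small C model of its semantics (a simplification covering signed conditions only; the names cmp_flags, cond_gt, and ccmp are illustrative, taken from the architecture's definition rather than from the JIT):

```c
#include <stdbool.h>

/* A tiny model of the AArch64 NZCV flags. */
typedef struct { bool n, z, c, v; } Flags;

/* Flags produced by "cmp a, b" (an alias of subs computing a - b). */
static Flags cmp_flags(long a, long b)
{
    long r = a - b;
    Flags f;
    f.n = r < 0;
    f.z = r == 0;
    f.c = (unsigned long)a >= (unsigned long)b;         /* no borrow       */
    f.v = ((a < 0) != (b < 0)) && ((r < 0) != (a < 0)); /* signed overflow */
    return f;
}

/* Signed condition tests over the flags. */
static bool cond_gt(Flags f) { return !f.z && (f.n == f.v); }
static bool cond_le(Flags f) { return f.z || (f.n != f.v); }

/* ccmp a, b, #nzcv, cond: if 'cond' holds for the current flags, the
   flags become those of "cmp a, b"; otherwise they are set directly to
   the immediate nzcv value, which is chosen so the chained condition
   fails (or succeeds) when the earlier compare already decided it. */
static Flags ccmp(Flags current, bool (*cond)(Flags), long a, long b, Flags nzcv)
{
    return cond(current) ? cmp_flags(a, b) : nzcv;
}
```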
I don't see how GT_SELECT helps in the general case. I can only see that it would help if the successor blocks to the block containing GT_JTRUE are simple enough that the entire thing can be replaced with GT_SELECT. Can you elaborate what you mean?
In most cases in the code above, the last two instructions will become a csel.
But, yes, there will be scenarios where we can't use GT_SELECT. In most of those cases, though, we'll also fail to generate a compare chain. If it turns out there are more instances than I expect, then we can look at using the JCC in a follow-on patch to replace the last two instructions in the code above with a jgt.
Which is much better than the code without my patch.
Indeed, looks great. I look forward to the PR that puts JTRUE on the same plan as this.
One thing I would like to see is a micro benchmark to make sure our expectation that this is better is correct.
The easy way to do that is via BenchmarkDotNet, where you can specify the old and new corerun with --corerun <old corerun> <new corerun>.
In most cases in the code above, the last two instructions will become a csel.
I would not expect that to be the case. Only for very limited forms of IR will we be able to generate GT_SELECT in this case, since it requires both successor blocks to be single assignments to the same variable without any other side effects. This is probably a common pattern, but not the majority-case pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New patch with everything above included.
- I've left in the compare chain size calculation in the lower phase - but it's not really being used at the moment, as it's allowing all chain sizes.
- Getting the register consuming working took a while. I had to move the consume calls out of GenCodeCompare(), otherwise the node gets consumed in the BinaryOp generation, then during the compare chain generation, it gets consumed again during the compare generation.
Running asmdiffs on the library tests, I only get 10 functions firing. Not seeing any chains longer than the above examples being generated (possible that it's due to my code). Want to spend a little more time playing with the results.
FWIW, I'm fine with this getting low hits for now. I expect in .NET 8 we can enable it for JTRUE and expand boolean optimizations (I believe we do not transform (a relop b) && (c relop d) => (a relop b) & (c relop d) today), which should make the optimization much more impactful.
Also, we can presumably do something similar for bitwise or.
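That && to & transform is legal when the right-hand compare is side-effect-free and cannot fault, since then short-circuiting is unobservable; a hypothetical illustration (names are mine, not from the JIT):

```c
/* With side-effect-free, non-faulting operands, both forms compute the
   same 0/1 value, but the second has no control flow and so can feed a
   cmp + ccmp compare chain. */
static int with_branches(int a, int b, int c, int d)
{
    return (a < b) && (c < d);
}

static int branchless(int a, int b, int c, int d)
{
    return (a < b) & (c < d);
}
```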
Force-pushed from e2060d3 to e57ac03
Failures in Antigen and jitstress and outerloop. Are these something I should be concerned about, and any quick pointers on what I should be running to reproduce them?
Antigen failures are known issues. I would wait to complete jitstress and outerloop legs. I do see some new failures (only on Arm) for outerloop, and the way to reproduce is to download the correlation payload using Some known failures are:
From what I can see there are no new outerloop failures if you compare to the last outerloop run on main. On the other hand, the arm64 System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests failure looks like it is related.
That's odd, I don't get any failures when I run it myself:
I also did a run with JitDisasm=*, which showed exactly 2 uses of ccmp (although it's hard to tell much from this, because it was mixed with the output of other functions; a way of dumping each function to a different file would be really useful).
Fixed up all of Kunal's comments.
This is the codegen diff for
FWIW, whether or not you see the failure probably depends on whether we end up tiering
Getting the diff now, but not the failure. Investigating the IR.
That looks fine to me. Original code is:
(Note - this chaining only happens because the C# code is using & instead of &&). We have the following IR: (Ignoring the children of
In current head, The
(and the That's generated as:
With my patch, that optimisation is skipped (due to hitting the isContained checks in lowering). Instead, a different optimisation kicks in, switching the
That generates:
(Ideally, the chain wouldn't need to generate into a register.)
The semantics of

public static void Main(string[] args)
{
    Console.WriteLine(Foo(3, false));
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int i, bool b)
{
    if ((i & 1) != 0 & !b)
        return 1;
    return 0;
}

Expected result: 1
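For reference, a C analogue of the repro (my translation, not from the PR): the precedence is the same as in C#, with != binding tighter than &, so the condition parses as ((i & 1) != 0) & (!b) and both operands are evaluated unconditionally, producing the compare-chain shape the patch targets.

```c
/* Mirrors Foo above, with bool replaced by int (0 = false). */
static int foo(int i, int b)
{
    if ((i & 1) != 0 & !b)
        return 1;
    return 0;
}
```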
Right! I was paying attention to the wrong part of the code. New version pushed. I've added OperIsCmpCompare() to ensure TEST_ nodes are not put into the chains. (I guess there's an argument for not creating the TEST_ nodes if a chain could be created, but I wouldn't want to do that here.)
Do we need to make a similar change in codegen too?
Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there. Added a patch to do this. Plus, I added some tests for these types of sequences.
The lsrabuild one needed fixing up - done this now. The others don't need changing - they are simply me reverting from
We are using the immediate version of That's different to the immediate version of (This is why after containing a compare we have to redo the containing of its children) You'll see quite a few places where an immediate will fit into cmp but not into ccmp.
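The size difference comes from the instruction encodings (stated here as background from the AArch64 ISA, not from this PR): ccmp's immediate is a 5-bit unsigned field (0..31), while cmp, an alias of subs, takes a 12-bit unsigned immediate, optionally shifted left by 12. A sketch of the kind of range check this implies (the helper names are hypothetical):

```c
#include <stdbool.h>

/* Hypothetical range checks of the kind lowering must perform when
   deciding whether a constant operand can stay contained. */
static bool fits_ccmp_imm(long v)
{
    return v >= 0 && v <= 31; /* ccmp: 5-bit unsigned immediate */
}

static bool fits_cmp_imm(long v)
{
    if (v < 0)
        return false;
    if (v <= 0xFFF)
        return true; /* cmp/subs: 12-bit unsigned immediate */
    /* or imm12 shifted left by 12: low 12 bits must be zero */
    return (v & 0xFFF) == 0 && v <= (0xFFFL << 12);
}
```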
Thank you @a74nh for your contribution and thank you @jakobbotsch for the thorough review.
LGTM
This builds on the code added in #71616
The code is pulled directly from #67894
As before, with this patch nothing uses the conditional nodes, so the impact on code gen should be zero.