Conversation
Still a lot to do here, but first things first: the issue of JIT throughput always comes up, and last time I experimented with this I didn't do any measurements. As is, this costs ~0.19% instructions retired. I saved more than that with a simple change such as #14965 and there are a bunch more like that. So I'm inclined to say that doing if-conversion in the JIT is feasible.
Overall this looks quite reasonable. Since an if conversion can enable an upstream if conversion (say nested ?:), is there any way you can see to run this "bottom-up" instead of top-down? If we see a lot of stack spill/reload in simple hammocks I wonder if I should look at the importer and see if we can just hoist all that into the predecessor. Though I would swear we only spill when we see joins. But it's been a while since I looked at that code.
See comments throughout
{
    m_falseBlock = m_block->bbNext;
    m_trueBlock = m_block->bbJumpDest;
May want to check that m_falseBlock != m_trueBlock just to be safe.
Also make sure that neither of these is block itself.
I sometimes try and forget that the jit's flowgraph has implicit fall through semantics. But doing an early check that block and m_trueBlock are different may save a bit of time.
The interesting thing is that if trueBlock == block you can still have a correct half hammock if falseBlock jumps back to block. But then this would be an infinite loop which we don't care about (and likely don't have test coverage for). And I should take a look at other scenarios involving loops (e.g. joinBlock == block).
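To make the suggested checks concrete, here is a rough sketch reusing the member names from the diff above; this is not the code actually committed in the PR:

m_falseBlock = m_block->bbNext;
m_trueBlock  = m_block->bbJumpDest;

// Bail out early on degenerate shapes: both edges reaching the same block,
// or a hammock edge that points straight back at the condition block.
if ((m_trueBlock == m_falseBlock) || (m_trueBlock == m_block) || (m_falseBlock == m_block))
{
    return false;
}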
src/jit/lower.cpp

BasicBlock* joinBlock = SingleSuccessorBlock(m_falseBlock);

if (joinBlock == m_block->bbJumpDest)
Probably clearer as joinBlock == m_trueBlock
src/jit/lower.cpp

    return false;
}

if ((op == nullptr) || !op->OperIs(GT_STORE_LCL_VAR, GT_RETURN))
Maybe just assert op is not null as you've screened out the empty block cases for both hammock kinds already? Otherwise this sequence reads a bit strangely.
Yep, this is definitely odd.
// copies that are generated when the importer spills a non-empty stack at the start
// of a block. Try to identify and clone such copies.

GenTree* copyDst = PreviousInstruction(opSrc);
How common is this?
    Full
};

HammockKind MakeHammock()
Any thought here to screening out cases where the true or false block has non-hammock predecessors? I suppose the amount of code duplication is pretty small, except perhaps in the case where there are lots of stack spill reloads.
AFAIR at the time I simply assumed that the amount of duplication would be small and decided not to check for predecessors to save time. I should try both approaches and see what the diffs say.
Small clarification: for the half case we don't really care if the join block has other predecessors.
Hmm, rejecting such cases does increase the code size improvement from ~9k to ~14k. While the amount of "real" duplicated code is very small, when you have return blocks you end up duplicating epilogs and these can be quite large.
Not sure what's the best option here. Ideally we'd avoid duplication if the epilog is known to be large, but we won't know that until register allocation.
src/jit/codegenxarch.cpp

constexpr instruction EJtoCMOV[]{INS_nop,    INS_nop,    INS_cmovo,  INS_cmovno, INS_cmovb,  INS_cmovae,
                                 INS_cmove,  INS_cmovne, INS_cmovbe, INS_cmova,  INS_cmovs,  INS_cmovns,
                                 INS_cmovpe, INS_cmovpo, INS_cmovl,  INS_cmovge, INS_cmovle, INS_cmovg};
noway_assert((unsigned)desc.ins[0] < _countof(EJtoCMOV));
Would be nice to add comments with the respective EJ values here.
I'm actually considering getting rid of the array, the more usual way this is done is something like (kind - EJ_jo) + INS_cmovo.
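For illustration only, a minimal sketch of that arithmetic mapping, assuming the EJ_* jump kinds and the INS_cmov* instructions are declared in the same relative order (the helper name is made up):

// Hypothetical helper, not code from this PR: relies on EJ_jo..EJ_jg and
// INS_cmovo..INS_cmovg being declared in matching order.
static instruction JumpKindToCmov(emitJumpKind kind)
{
    assert((kind >= EJ_jo) && (kind <= EJ_jg));
    return static_cast<instruction>((kind - EJ_jo) + INS_cmovo);
}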
Fortunately I managed to find the example that prompted me to add all that code:

int sum = 0;
for (int i = 0; i < bits.Length; i++)
    sum += bits[i] ? 1 : 0;
return sum;

I don't know how common this case is (I really should add some kind of instrumentation to my code to get some numbers...) but it looks pretty harmless, the kind of code people would be surprised to see the JIT not handle as expected.
In lowering we end up with
It's probably particular to compound assignment operators. No idea if there are other common ways to produce this kind of IL.
That's probably the biggest limitation the current implementation has. Another one is that it can't handle indirs. And part of the reason it can't handle indirs is that the true/false blocks haven't been lowered yet, so we don't have address modes yet and that makes it more problematic to figure out if an address expression is trivial enough to be accepted. So probably the right way to fix this is to do if-conversion separately, after lowering, and traverse the block list backwards (a post-order traversal would probably be better but also more expensive). It's definitely something that I want to try but first I need to clear out some more minor issues (e.g. not being able to handle

A nested case is something like

if (x == y)
    return 0;
if (x < y)
    return 1;
return -1;

which currently generates:

G_M55886_IG02:
       3BCA         cmp ecx, edx
       7503         jne SHORT G_M55886_IG04
       33C0         xor eax, eax
G_M55886_IG03:
       C3           ret
G_M55886_IG04:
       3BCA         cmp ecx, edx
       B801000000   mov eax, 1
       BAFFFFFFFF   mov edx, -1
       0F4DC2       cmovge eax, edx
G_M55886_IG05:
       C3           ret

Though I see that VC++ too doesn't handle this so perhaps it's not that bad :)
Current jit-diffs:

Total bytes of diff: -9795 (-0.04% of base)
diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
Base had 0 unique methods, 0 unique bytes
Diff had 0 unique methods, 0 unique bytes
Top file regressions by size (bytes):
123 : System.Reflection.Metadata.dasm (0.16% of base)
90 : Microsoft.CSharp.dasm (0.03% of base)
85 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
52 : NuGet.Versioning.dasm (0.19% of base)
24 : System.Net.Requests.dasm (0.02% of base)
Top file improvements by size (bytes):
-2621 : System.Private.CoreLib.dasm (-0.08% of base)
-1516 : System.Private.Xml.dasm (-0.05% of base)
-1443 : System.Linq.Parallel.dasm (-0.23% of base)
-1060 : System.Data.Common.dasm (-0.10% of base)
-698 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.03% of base)
96 total files with size differences (64 improved, 32 regressed), 34 unchanged.
Top method regressions by size (bytes):
244 : System.Reflection.Metadata.dasm - MetadataReader:InitializeTableReaders(struct,ubyte,ref,ref):this
162 : Microsoft.CodeAnalysis.dasm - MetadataSizes:.ctor(struct,struct,int,int,int,int,bool,bool,bool):this
72 : System.Private.DataContractSerialization.dasm - ArrayHelper`2:ReadArray(ref,ref,ref,int):ref:this (12 methods)
70 : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFacts:GetKeywordKind(ref):ushort
66 : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:PerformCompileTimeBinaryOperation(ushort,byte,ref,ref,ref):ref
Top method improvements by size (bytes):
-1012 : System.Private.Xml.dasm - XmlReflectionImporter:ImportAccessorMapping(ref,ref,ref,ref,ref,bool,bool,ref):this
-648 : System.Private.CoreLib.dasm - Comparer`1:System.Collections.IComparer.Compare(ref,ref):int:this (39 methods)
-408 : System.Runtime.Numerics.dasm - Number:NumberToStringFormat(byref,byref,struct,ref)
-300 : System.Text.RegularExpressions.dasm - RegexInterpreter:Go():this
-257 : System.Private.CoreLib.dasm - Number:NumberToStringFormat(byref,byref,struct,ref)
3511 total methods with size differences (2136 improved, 1375 regressed), 137577 unchanged.

Here it's interesting to note that before handling cases involving constants the diff was a regression of around 1k. Judging by the jump from +1k to -9k it looks like there are quite a few such constant cases.
Some other random notes from peeking at an existing if conversion implementation:
setcc/sbb were used for constant cases where the absolute diff between the values was 1. Looks like you currently only use setcc. There was a double/float variant but it was disabled by default.
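Not code from this PR, just spelling out the identity that the constant-difference-of-1 trick relies on (the sbb form applies when the condition comes out of an unsigned compare as a borrow):

#include <cassert>

// When the two arms of a ternary differ by exactly 1, the select reduces to
// adding the condition bit: cond ? (c + 1) : c  ==  c + (cond ? 1 : 0).
// On x86 this is typically cmp + setcc + movzx + add, with no branch and no cmov.
static int SelectAdjacent(bool cond, int c)
{
    return c + (cond ? 1 : 0);
}

int main()
{
    assert(SelectAdjacent(true, 41) == 42);
    assert(SelectAdjacent(false, 41) == 41);
    return 0;
}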
Hmm, I used
Yep, introducing stores is pretty much out of the question. Especially if they're GC stores that need write barriers. And anyway
It's something that I want to look into but it's low priority. SBB (and ADC) can only be used with some unsigned compares and using unsigned types is less common in .NET. Though now that people seem to be in a race to the bottom to optimize anything and everything, maybe these will become more common... A more interesting use of ADC/SBB would be something like
Float condition or float move? Float conditions are a bit more cumbersome but they're doable. Float moves, hmm, they require quite different IR (and likely need the new intrinsic support). It's also not clear to me how useful these are; the framework doesn't contain a lot of floating point code and I haven't written any significant floating point code in a while, so I can't come up with any good use cases for it right now.
The float was a mov, the pattern is (from an AMD optimization guide I think):

    comisd a, b
    jcc L1
    c = movsd v1
    jmp L2
L1:
    c = movsd v2
L2:

==>

    t1 = movsd v1
    t2 = movsd a
    t3 = movsd b
    t2 = cmppd t2, t3, cc
    t1 = andpd t1, t2
    t2 = andnpd t2, v2
    t1 = orpd t1, t2
    c = movsd t1

Would be nice to verify we are seeing cmov recognition kick in for min/max reductions in loops and the compares we do in various framework sorts (at least those with simple/default comparers), and if not, what else is getting in the way.

I think I can defer some of the importer spills at the starts of the then/else blocks of balanced hammocks -- and this does interesting things to the codegen in some methods and probably warrants deeper investigation -- but the net impact is that it just pushes those spills to the ends of those blocks. So this is not going to be helpful for if conversion.
I've experimented a bit to see the impact of those spills by limiting the number of copies if-conversion accepts:

Yes, we need to allow up to 13 copies to avoid diffs. Such a large number of copies appears when you have calls with many parameters:

[MethodImpl(MethodImplOptions.NoInlining)]
static int foo(int i1, int i2, int i3, int i4, int i5, int i6, int i7, int i8) => i1;

[MethodImpl(MethodImplOptions.NoInlining)]
static int Test(int[] a) => foo(a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7] == 42 ? 41 : 43);

generates this IR: https://gist.github.com/mikedn/ff9b88ab8781ea10a117c3c6bab3af9a
Yes, I've seen that pattern in various places. It seems a bit large and perhaps it should be done only when we have information about branch probability. On the other hand, with SSE 4.1 the ANDs+OR can be replaced with a single blend instruction. Considering that SSE 4.1 is ~10 years old we could simply do this only if it is available. There's also the special case of

Floating point: I'd say that we should leave floating point out of this PR, otherwise things can become hairy. Support can be added separately if this goes ahead.
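For reference, a sketch (not from this PR) of how the ANDs+OR in the pattern above collapse into a single blend when SSE 4.1 is available, written against the standard <smmintrin.h> intrinsics:

#include <smmintrin.h> // SSE 4.1
#include <cstdio>

// Computes (a <= b) ? v1 : v2 element-wise, without branches.
static __m128d SelectLE(__m128d a, __m128d b, __m128d v1, __m128d v2)
{
    __m128d mask = _mm_cmple_pd(a, b);  // all-ones where a <= b, zero otherwise
    return _mm_blendv_pd(v2, v1, mask); // picks v1 where the mask is set, v2 elsewhere
}

int main()
{
    __m128d a  = _mm_set_pd(1.0, 3.0);
    __m128d b  = _mm_set_pd(2.0, 2.0);
    __m128d v1 = _mm_set1_pd(10.0);
    __m128d v2 = _mm_set1_pd(20.0);

    double r[2];
    _mm_storeu_pd(r, SelectLE(a, b, v1, v2));
    printf("%f %f\n", r[0], r[1]); // 20.0 (3.0 > 2.0), then 10.0 (1.0 <= 2.0)
    return 0;
}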
Integer:

int max = int.MinValue;
for (int i = 0; i < a.Length; i++)
    max = Math.Max(max, a[i]);
return max;

generates

G_M55887_IG03:
       4C63CA       movsxd r9, edx
       468B4C8910   mov r9d, dword ptr [rcx+4*r9+16]
       413BC1       cmp eax, r9d
       440F4DC8     cmovge r9d, eax
       418BC1       mov eax, r9d ; useless? may need to reverse condition?
       FFC2         inc edx
       443BC2       cmp r8d, edx
       7FE7         jg SHORT G_M55887_IG03

TODO: benchmarks. This kind of loop carried dependency is where CMOV can perform worse when the branch is predictable.
Roslyn seems to flip ternary expressions for some reason; if you are comparing output from

static int Max_Ternary(int val1, int val2)
{
    return (val1 >= val2) ? val1 : val2;
}

static int Max_If(int val1, int val2)
{
    if (val1 >= val2)
        return val1;
    else
        return val2;
}

It also will reorder if statements, though I haven't worked out a consistent trigger.
I'm not sure how that is relevant.
As expected, it turns out that using

Not to say that

@AndyAyersMS Thoughts?
The jit is not really equipped to do anything very sophisticated, but maybe there are some ways to avoid cases where cmov will likely be unprofitable. Paraphrasing Agner Fog's advice,
In our case we have no way to estimate the prediction rate, aside from the rare block flag. So I suppose we have to assume every branch amenable to if conversion is potentially unpredictable. We might be able to guesstimate when the code is part of a loop-carried dependence chain or when one of the inputs is a potentially long-latency operation. Maybe the optimizer can leave some breadcrumbs...
It appears that's what C/C++ compilers are doing in certain circumstances (no auto-vectorization): example on gcc.godbolt.org. In VC++ I also tried to use PGO and apparently it doesn't make any difference, it still happily generates CMOV. Oh well, I guess there's still room for improvement in all compilers. Or maybe they know that they can auto-vectorize such code and don't bother otherwise. And of course, there are cases where branch based
Yes, I was going to look into that. Kind of doubt that it's feasible, the optimizer doesn't do a lot with loops anyway. Another option is to only handle

Yet another option is to do a fake

G_M55887_IG03:
       4C63CA       movsxd r9, edx
       468B4C8910   mov r9d, dword ptr [rcx+4*r9+16]
       413BC1       cmp eax, r9d
       7D02         jge SHORT G_M55887_IG04
       EB03         jmp SHORT G_M55887_IG05 ; meh...
G_M55887_IG04:
       448BC8       mov r9d, eax
G_M55887_IG05:
       418BC1       mov eax, r9d
       FFC2         inc edx
       443BC2       cmp r8d, edx
       7FE4         jg SHORT G_M55887_IG03

With if-conversion and a fake CMOV we get:

G_M55887_IG03:
       4C63CA       movsxd r9, edx
       468B4C8910   mov r9d, dword ptr [rcx+4*r9+16]
       413BC1       cmp eax, r9d
       7C03         jl SHORT G_M55887_IG04
       448BC8       mov r9d, eax
G_M55887_IG04:
       418BC1       mov eax, r9d
       FFC2         inc edx
       443BC2       cmp r8d, edx
       7FE6         jg SHORT G_M55887_IG03

This is 10-20% faster in some cases.
Without IBC data BBF_RUN_RARELY is likely next to useless since it will only get set for blocks that aren't likely to be relevant to if conversion (e.g. exception handlers). But it turns out that Roslyn assemblies do have IBC data so I've been able to check that BBF_RUN_RARELY actually works as expected (and I only saw diffs in those Roslyn assemblies).
@AndyAyersMS Since benchMonteCarlo came up I thought I'd give floating point conditions a try. A slightly modified version of the code

under_curve += x * x + y * y <= 1.0 ? 1 : 0;

produces

       C4E1792EC6   vucomisd xmm0, xmm6
       0F93C0       setae al
       0FB6C0       movzx rax, al
       03D8         add ebx, eax

while the original code produces

       C4E1792EC6   vucomisd xmm0, xmm6
       7202         jb SHORT G_M57734_IG04
       FFC3         inc ebx
G_M57734_IG04:

The new version is ~10% faster. Not a lot, random number generation is probably expensive enough that the advantage of if-conversion in this case is small.
Also cleanup similar display for other nodes (JCC, JCMP, SETCC)
No update for 1+ year. I recommend closing the PR until it is ready to be worked on again... thoughts @mikedn?
I have further local changes that I'd like to test, maybe this month...
Looks like cmov is faster than a branch in all circumstances on Skylake and above? https://github.com/xiadz/cmov
As far as I can tell what was measured is a single case, and one that doesn't include CMOV in a loop carried dependency. In general there's simply no way for CMOV to be faster in cases that involve predictable branches. The CPU has to execute the CMOV and all its dependencies. With a predictable branch the CPU has almost nothing to do, especially if the branch is predicted not taken. Ultimately this is the kind of stuff that tiered compilation should handle but I'm not sure when the JIT will get there. In the meantime it's not clear if it's worth the risk to slow down some cases to speed up others. Though it's true that when you do get a speedup from CMOV it can be significant, usually larger than the potential slowdown. Anyway, I've been working on this on and off, I need to push some changes...
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
I'll get back to this once I clear up some of the IR improvements I hope to do (
To set some expectations - this is primarily opened for testing and discussion. It may turn out that it does not work, or that it's too expensive, or whatever.