This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] If conversion #16156

Closed
wants to merge 10 commits

Conversation

mikedn

@mikedn mikedn commented Feb 1, 2018

To set some expectations - this is primarily opened for testing and discussion. It may turn out that it does not work, or that it's too expensive, or whatever.

@mikedn mikedn changed the title [WIP] cmov [WIP] If conversion Feb 1, 2018
@mikedn
Author

mikedn commented Feb 1, 2018

Still a lot to do here, but first things first: the issue of JIT throughput always comes up, and the last time I experimented with this I hadn't done any measurements. As is, this costs ~0.19% in instructions retired. I saved more than that with a simple change such as #14965, and there are a bunch more like that. So I'm inclined to say that doing if-conversion in the JIT is feasible.

@AndyAyersMS
Member

Overall this looks quite reasonable.

Since an if conversion can enable an upstream if conversion (say nested ?:), is there any way you can see to run this "bottom-up" instead of top-down?

If we see a lot of stack spill/reload in simple hammocks I wonder if I should look at the importer and see if we can just hoist all that into the predecessor. Though I would swear we only spill when we see joins. But it's been a while since I looked at that code.

Member

@AndyAyersMS AndyAyersMS left a comment

See comments throughout

{
m_falseBlock = m_block->bbNext;
m_trueBlock = m_block->bbJumpDest;

Member

May want to check that m_falseBlock != m_trueBlock just to be safe.

Member

Also make sure that neither of these is block itself.

Member

I sometimes try and forget that the jit's flowgraph has implicit fall through semantics.

But doing an early check that block and m_trueBlock are different may save a bit of time.

Author

The interesting thing is that if trueBlock == block you can still have a correct half hammock if falseBlock jumps back to block. But then this would be an infinite loop which we don't care about (and likely don't have test coverage for).

And I should take a look at other scenarios involving loops (e.g. joinBlock == block).


BasicBlock* joinBlock = SingleSuccessorBlock(m_falseBlock);

if (joinBlock == m_block->bbJumpDest)
Member

Probably clearer as joinBlock == m_trueBlock
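For readers following along, here is a minimal sketch of the shape checks discussed in this thread, assuming BasicBlock's bbNext/bbJumpDest fields and a SingleSuccessorBlock helper like the one in this prototype; the function name and exact structure are illustrative, not the PR's actual code.

bool LooksLikeHalfHammock(BasicBlock* block)
{
    BasicBlock* falseBlock = block->bbNext;     // fall-through successor
    BasicBlock* trueBlock  = block->bbJumpDest; // taken successor

    // Reject degenerate shapes: identical successors, or a successor that
    // loops straight back to the condition block itself.
    if ((falseBlock == trueBlock) || (falseBlock == block) || (trueBlock == block))
    {
        return false;
    }

    // Half hammock: the false block must flow directly into the true block,
    // which then acts as the join.
    return SingleSuccessorBlock(falseBlock) == trueBlock;
}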

return false;
}

if ((op == nullptr) || !op->OperIs(GT_STORE_LCL_VAR, GT_RETURN))
Member

Maybe just assert op is not null as you've screened out the empty block cases for both hammock kinds already? Otherwise this sequence reads a bit strangely.

Author

Yep, this is definitely odd.

// copies that are generated when the importer spills a non-empty stack at the start
// of a block. Try to identify and clone such copies.

GenTree* copyDst = PreviousInstruction(opSrc);
Member

How common is this?

Full
};

HammockKind MakeHammock()
Member

Any thought here to screening out cases where the true or false block has non-hammock predecessors? I suppose the amount of code duplication is pretty small, except perhaps in the case where there are lots of stack spill reloads.

Author

AFAIR at the time I simply assumed that the amount of duplication would be small and decided not to check for predecessors to save time. I should try both approaches and see what the diffs say.

Member

Small clarification: for the half case we don't really care if the join block has other predecessors.

Author

Hmm, rejecting such cases does increase the code size improvement from ~9k to ~14k. While the amount of "real" duplicated code is very small, when you have return blocks you'll end up duplicating epilogs, and these can be quite large.

Not sure what the best option is here. Ideally we'd avoid duplication if the epilog is known to be large, but we won't know that until after register allocation.

constexpr instruction EJtoCMOV[]{INS_nop, INS_nop, INS_cmovo, INS_cmovno, INS_cmovb, INS_cmovae,
INS_cmove, INS_cmovne, INS_cmovbe, INS_cmova, INS_cmovs, INS_cmovns,
INS_cmovpe, INS_cmovpo, INS_cmovl, INS_cmovge, INS_cmovle, INS_cmovg};
noway_assert((unsigned)desc.ins[0] < _countof(EJtoCMOV));
Member

Would be nice to add comments with the respective EJ values here.

Author

I'm actually considering getting rid of the array; the more usual way this is done is something like (kind - EJ_jo) + INS_cmovo.
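A minimal sketch of that arithmetic mapping, assuming the INS_cmov* values introduced by this change are declared in the same condition-code order as the EJ_* jump kinds; the helper name is hypothetical.

instruction JumpKindToCmov(emitJumpKind kind)
{
    // EJ_jo..EJ_jg and INS_cmovo..INS_cmovg are assumed to be parallel ranges.
    assert((kind >= EJ_jo) && (kind <= EJ_jg));
    return (instruction)((kind - EJ_jo) + INS_cmovo);
}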

@mikedn
Author

mikedn commented Feb 2, 2018

If we see a lot of stack spill/reload in simple hammocks I wonder if I should look at the importer and see if we can just hoist all that into the predecessor. Though I would swear we only spill when we see joins. But it's been a while since I looked at that code.

Fortunately I managed to find the example that prompted me to add all that code:

int sum = 0;
for (int i = 0; i < bits.Length; i++)
    sum += bits[i] ? 1 : 0;
return sum;

I don't know how common this case is (I really should add some kind of instrumentation to my code to get some numbers...) but it looks pretty harmless - the kind of code people would be surprised to see the JIT not handle as expected.
The IL that produces this problem is

IL_0008  07                ldloc.1     
IL_0009  91                ldelem.u1   
IL_000a  2d 03             brtrue.s     3 (IL_000f)
IL_000c  16                ldc.i4.0    
IL_000d  2b 01             br.s         1 (IL_0010)
IL_000f  17                ldc.i4.1    
IL_0010  58                add         
IL_0011  0a                stloc.0     

In lowering we end up with

N012 ( 11, 12) [000024] ---XG-------              *  JTRUE     void  

------------ BB03 [00C..00F) -> BB05 (always), preds={BB02} succs={BB05}
N001 (  1,  1) [000030] ------------        t30 =    LCL_VAR   int    V03 tmp0         u:3 (last use) $1c0
                                                 /--*  t30    int    
N003 (  1,  3) [000058] DA----------              *  STORE_LCL_VAR int    V04 tmp1         d:5
N001 (  1,  1) [000056] ------------        t56 =    CNS_INT   int    0 $40
                                                 /--*  t56    int    
N003 (  1,  3) [000062] DA----------              *  STORE_LCL_VAR int    V05 tmp2         d:5

------------ BB04 [00F..010), preds={BB02} succs={BB05}
N001 (  1,  1) [000031] ------------        t31 =    LCL_VAR   int    V03 tmp0         u:3 (last use) $1c0
                                                 /--*  t31    int    
N003 (  1,  3) [000035] DA----------              *  STORE_LCL_VAR int    V04 tmp1         d:4
N001 (  1,  1) [000033] ------------        t33 =    CNS_INT   int    1 $47
                                                 /--*  t33    int    
N003 (  1,  3) [000039] DA----------              *  STORE_LCL_VAR int    V05 tmp2         d:4

It's probably particular to compound assignment operators. No idea if there are other common ways to produce this kind of IL.

@mikedn
Author

mikedn commented Feb 2, 2018

Since an if conversion can enable an upstream if conversion (say nested ?:) is there any way you can see to run this "bottom-up" instead of top down?

That's probably the biggest limitation the current implementation has. Another one is that it can't handle indirs. And part of the reason it can't handle indirs is that the true/false blocks haven't been lowered yet, so we don't have address modes, and that makes it more problematic to figure out whether an address expression is trivial enough to be accepted.

So, probably the right way to fix this is to do if-conversion separately, after lowering, and traverse the block list backwards (probably a post-order traversal would be better but also more expensive). It's definitely something that I want to try, but first I need to clear out some more minor issues (e.g. not being able to handle GT_TEST_EQ, GT_TEST_NE) that are a consequence of this prototype going stale.
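A rough sketch of that backward walk, assuming a separate post-lowering phase with access to the Compiler object as comp; TryIfConvert is a hypothetical wrapper around the hammock checks sketched earlier.

// Walk the block list from last to first so that converting an inner hammock
// can expose an enclosing hammock to a later iteration of the same pass.
for (BasicBlock* block = comp->fgLastBB; block != nullptr; block = block->bbPrev)
{
    if (block->bbJumpKind == BBJ_COND)
    {
        TryIfConvert(block);
    }
}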

A nested case is something like int.Compare:

if (x == y)
    return 0;
if (x < y)
    return 1;
return -1;

currently generates:

G_M55886_IG02:
       3BCA                 cmp      ecx, edx
       7503                 jne      SHORT G_M55886_IG04
       33C0                 xor      eax, eax
G_M55886_IG03:
       C3                   ret
G_M55886_IG04:
       3BCA                 cmp      ecx, edx
       B801000000           mov      eax, 1
       BAFFFFFFFF           mov      edx, -1
       0F4DC2               cmovge   eax, edx
G_M55886_IG05:
       C3                   ret

Though I see that VC++ doesn't handle this either, so perhaps it's not that bad :)

@mikedn
Author

mikedn commented Feb 2, 2018

Current jit-diffs:

Total bytes of diff: -9795 (-0.04% of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         123 : System.Reflection.Metadata.dasm (0.16% of base)
          90 : Microsoft.CSharp.dasm (0.03% of base)
          85 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
          52 : NuGet.Versioning.dasm (0.19% of base)
          24 : System.Net.Requests.dasm (0.02% of base)
Top file improvements by size (bytes):
       -2621 : System.Private.CoreLib.dasm (-0.08% of base)
       -1516 : System.Private.Xml.dasm (-0.05% of base)
       -1443 : System.Linq.Parallel.dasm (-0.23% of base)
       -1060 : System.Data.Common.dasm (-0.10% of base)
        -698 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.03% of base)
96 total files with size differences (64 improved, 32 regressed), 34 unchanged.
Top method regressions by size (bytes):
         244 : System.Reflection.Metadata.dasm - MetadataReader:InitializeTableReaders(struct,ubyte,ref,ref):this
         162 : Microsoft.CodeAnalysis.dasm - MetadataSizes:.ctor(struct,struct,int,int,int,int,bool,bool,bool):this
          72 : System.Private.DataContractSerialization.dasm - ArrayHelper`2:ReadArray(ref,ref,ref,int):ref:this (12 methods)
          70 : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFacts:GetKeywordKind(ref):ushort
          66 : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:PerformCompileTimeBinaryOperation(ushort,byte,ref,ref,ref):ref
Top method improvements by size (bytes):
       -1012 : System.Private.Xml.dasm - XmlReflectionImporter:ImportAccessorMapping(ref,ref,ref,ref,ref,bool,bool,ref):this
        -648 : System.Private.CoreLib.dasm - Comparer`1:System.Collections.IComparer.Compare(ref,ref):int:this (39 methods)
        -408 : System.Runtime.Numerics.dasm - Number:NumberToStringFormat(byref,byref,struct,ref)
        -300 : System.Text.RegularExpressions.dasm - RegexInterpreter:Go():this
        -257 : System.Private.CoreLib.dasm - Number:NumberToStringFormat(byref,byref,struct,ref)
3511 total methods with size differences (2136 improved, 1375 regressed), 137577 unchanged.

Here it's interesting to note that before handling cases involving constants the diff was a regression of around 1k. Judging by the jump from +1k to -9k, it looks like there are quite a few such constant cases.

@AndyAyersMS
Member

Some other random notes from peeking at an existing if conversion implementation:

  • Profitability looked at two things:
    • whether branch was heavily biased (I suppose BBF_RUN_RARELY is the best we have w/o IBC data). So maybe also bail on cases where one or both blocks are rarely run.
    • whether destination was a memory store in a half hammock in a loop with a lot of store pressure. Since you require the dest to be a local var we don't see such cases. We probably want to punt on this variant anyways as it runs afoul of the "invented write" concerns people have for concurrency safe codegen.

setcc/sbb were used for constant cases where absolute diff between values was 1. Looks like you only currently use setcc.

There was a double/float variant but it was disabled by default.

@mikedn
Author

mikedn commented Feb 3, 2018

whether branch was heavily biased (I suppose BBF_RUN_RARELY is the best we have w/o IBC data). So maybe also bail on cases where one or both blocks are rarely run.

Hmm, I used fgInDifferentRegions but that only covers BBF_COLD. Come to think of it, I don't know what the exact difference between RUN_RARELY and COLD is; need to check. It's also not clear if we really should be blocking if-conversion if both blocks are cold/rare - maybe we should do it if it helps with JIT throughput.

We probably want to punt on this variant anyways as it runs afoul of the "invented write" concerns people have for concurrency safe codegen.

Yep, introducing stores is pretty much out of the question, especially if they're GC stores that need write barriers. And anyway, CMOV doesn't support a mem, reg form, so we'd need to produce CMOV reg, reg and then a separate MOV to do the store.

setcc/sbb were used for constant cases where absolute diff between values was 1. Looks like you only currently use setcc.

It's something that I want to look into but it's low priority. SBB (and ADC) can only be used with some unsigned compares, and using unsigned types is less common in .NET. Though now that people seem to be in a race to the bottom to optimize anything and everything, maybe these will become more common...

A more interesting use of ADC/SBB would be something like x += u < 0 ? 1 : 0, which could be turned into cmp u, 0; adc x, 0. Unfortunately this seems problematic to get without SSA: ?: gets turned into SELCC and this stores into a lclvar, and then what are the uses of the lclvar?

There was a double/float variant but it was disabled by default

Float condition or float move? Float conditions are a bit more cumbersome but they're doable. Float moves, hmm, they require quite different IR (and likely need the new intrinsic support). It's also not clear to me how useful these are - the framework doesn't contain a lot of floating point code, and I haven't written any significant floating point code in a while, so I can't come up with any good use cases for it right now.

@AndyAyersMS
Member

BBF_COLD means we're physically splitting the method code into two disjoint pieces. This currently only happens when prejitting.

BBF_RUN_RARELY means we don't think the block will ever be executed in performance-sensitive scenarios. When splitting is enabled this will at some point cause the block to move to the cold code segment, otherwise we just move the block towards the end of the (single) code segment.

The float was a mov; the pattern is (from an AMD optimization guide, I think):

    comisd a, b            t1 = movsd v1        
    jcc L1                 t2 = movsd a
    c = movsd v1    ==>    t3 = movsd b
    jmp L2                 t2 = cmppd t2, t3, cc
   L1:                     t1 = andpd t1, t2
    c = movsd v2           t2 = andnpd t2, v2
   L2:                     t1 = orpd t1, t2
                            c = movsd t1

Would be nice to verify we are seeing cmov recognition kick in for min/max reductions in loops and the compares we do in various framework sorts (at least those with simple/default comparers), and if not, what else is getting in the way.

I think I can defer some of the importer spills at the starts of the then/else blocks of balanced hammocks -- and this does interesting things to the codegen in some methods and probably warrants deeper investigation -- but the net impact is that it just pushes those spills to the ends of those blocks. So this is not going to be helpful for if conversion.

@mikedn
Author

mikedn commented Feb 4, 2018

I've experimented a bit to see the impact of those spills by limiting the number of copies if-conversion accepts:

Number of copies | jit-diff improvement
---------------- | --------------------
0                | ~4k (instead of 9795 bytes)
1                | ~7k
2                | ~9k
13               | 9795 bytes (original)

Yes, we need to allow up to 13 copies to avoid diffs. Such a large number of copies appears when you have calls with many parameters:

[MethodImpl(MethodImplOptions.NoInlining)]
static int foo(int i1, int i2, int i3, int i4, int i5, int i6, int i7, int i8) => i1;
[MethodImpl(MethodImplOptions.NoInlining)]
static int Test(int[] a) => foo(a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7] == 42 ? 41 : 43);

generates this IR: https://gist.github.com/mikedn/ff9b88ab8781ea10a117c3c6bab3af9a

@mikedn
Author

mikedn commented Feb 5, 2018

The float was a mov, the pattern is (from an amd optimization guide I think):

Yes, I've seen that pattern in various places. It seems a bit large and perhaps it should be done only when we have information about branch probability. On the other hand, with SSE 4.1 the ANDs+OR can be replaced with a single blend instruction. Considering that SSE 4.1 is ~10 years old, we could simply do this only when it is available.
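For illustration only (not code from this PR), here is the SSE 4.1 shape expressed with intrinsics: the two ANDs and the OR from the pattern above collapse into a single variable blend driven by the compare mask.

#include <immintrin.h>

// Branchless select: returns v1 when a <= b, otherwise v2 (SSE 4.1 blendv).
double SelectLE(double a, double b, double v1, double v2)
{
    __m128d mask = _mm_cmple_sd(_mm_set_sd(a), _mm_set_sd(b));          // all-ones or all-zeros
    __m128d res  = _mm_blendv_pd(_mm_set_sd(v2), _mm_set_sd(v1), mask); // mask ? v1 : v2
    return _mm_cvtsd_f64(res);
}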

There's also the special case of f1 < f2 ? f1 : f2, we could probably generate MINSS & co. with little effort.

Floating point Math.Min/Max are problematic due to their handling of NaNs. We probably need to "fix" IsNan first (#14846) to get reasonable code for these.

I'd say that we should leave floating point out of this PR, otherwise things can become hairy. Support can be added separately if this goes ahead.

@mikedn
Author

mikedn commented Feb 5, 2018

Integer Math.Min/Max do work as expected:

int max = int.MinValue;
for (int i = 0; i < a.Length; i++)
    max = Math.Max(max, a[i]);
return max;

generates

G_M55887_IG03:
       4C63CA               movsxd   r9, edx
       468B4C8910           mov      r9d, dword ptr [rcx+4*r9+16]
       413BC1               cmp      eax, r9d
       440F4DC8             cmovge   r9d, eax
       418BC1               mov      eax, r9d ; useless? may need to reverse condition?
       FFC2                 inc      edx
       443BC2               cmp      r8d, edx
       7FE7                 jg       SHORT G_M55887_IG03

TODO: benchmarks. This kind of loop carried dependency is where CMOV can perform worse when the branch is predictable.

@benaadams
Member

Roslyn seems to flip ternary expressions for some reason; if you are comparing output from ? : and if else

static int Max_Ternary(int val1, int val2)
{
    return (val1 >= val2) ? val1 : val2;
}

static int Max_If(int val1, int val2)
{
    if (val1 >= val2)
        return val1;
    else
        return val2;
}
.method private hidebysig static 
    int32 Max_Ternary (
        int32 val1,
        int32 val2
    ) cil managed 
{
    // Code size 8 (0x8)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldarg.1
    IL_0002: bge.s IL_0006
    IL_0004: ldarg.1
    IL_0005: ret
    IL_0006: ldarg.0
    IL_0007: ret
} // end of method C::Max_Ternary

.method private hidebysig static 
    int32 Max_If (
        int32 val1,
        int32 val2
    ) cil managed 
{
    // Code size 8 (0x8)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldarg.1
    IL_0002: blt.s IL_0006
    IL_0004: ldarg.0
    IL_0005: ret
    IL_0006: ldarg.1
    IL_0007: ret
} // end of method C::Max_If

It will also reorder if statements, though I haven't worked out a consistent trigger.

@mikedn
Author

mikedn commented Feb 5, 2018

Roslyn seems to flip ternary expressions for some reason; if you are comparing output from ? : and if else

I'm not sure how that is relevant.

@mikedn
Author

mikedn commented Feb 9, 2018

TODO: benchmarks. This kind of loop carried dependency is where CMOV can perform worse when the branch is predictable.

As expected, it turns out that using CMOV to compute the max element of an array can be a pretty bad idea. For many inputs (constant array, sorted array and even random array) the CMOV version is 1.5x slower than the branch version.

Not to say that CMOV-based Max is all bad. It is possible to construct an input (e.g. a[i] = i + (rnd.Next(50) - 25)) that slows down the branch version by 3.5x, and in that case the CMOV version beats the branch version by 2.5x.

@AndyAyersMS Thoughts?

@AndyAyersMS
Member

The jit is not really equipped to do anything very sophisticated, but maybe there are some ways to avoid cases where cmov will likely be unprofitable.

Paraphrasing Agner Fog's advice,

a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation of the unchosen conditional move source when the other operand is chosen. Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves...

In our case we have no way to estimate the prediction rate, aside from the rare block flag. So I suppose we have to assume every branch amenable to if conversion is potentially unpredictable.

We might be able to guesstimate when the code is part of a loop-carried dependence chain or when one of the inputs is a potentially long-latency operation. Maybe the optimizer can leave some breadcrumbs....

@mikedn
Author

mikedn commented Feb 10, 2018

In our case we have no way to estimate the prediction rate, aside from the rare block flag. So I suppose we have to assume every branch amenable to if conversion is potentially unpredictable.

It appears that's what C/C++ compilers are doing in certain circumstances (no auto-vectorization): example on gcc.godbolt.org. With VC++ I also tried using PGO and apparently it doesn't make any difference; it still happily generates CMOV. Oh well, I guess there's still room for improvement in all compilers. Or maybe they know that they can auto-vectorize such code and don't bother otherwise.

And of course, there are cases where the branch-based Max has terrible performance - c[i] = max(a[i], b[i]) slows down by 7x for random input. In this case the CMOV version is 5x faster.

We might be able to guesstimate when the code is part of a loop-carried dependence chain or when one of the inputs is a potentially long-latency operation. Maybe the optimizer can leave some breadcrumbs....

Yes, I was going to look into that. I kind of doubt that it's feasible - the optimizer doesn't do a lot with loops anyway.

Another option is to only handle select(const1, const2), as this has fewer potential performance issues and seems to be the most requested feature in this area.

Yet another option is to do a fake CMOV. That may sound strange but it actually improves some cases without hurting others. The current JIT doesn't do a very good job with Max anyway:

G_M55887_IG03:
       4C63CA               movsxd   r9, edx
       468B4C8910           mov      r9d, dword ptr [rcx+4*r9+16]
       413BC1               cmp      eax, r9d
       7D02                 jge      SHORT G_M55887_IG04
       EB03                 jmp      SHORT G_M55887_IG05 ; meh...
G_M55887_IG04:
       448BC8               mov      r9d, eax
G_M55887_IG05:
       418BC1               mov      eax, r9d
       FFC2                 inc      edx
       443BC2               cmp      r8d, edx
       7FE4                 jg       SHORT G_M55887_IG03

With if-conversion and a fake CMOV we get:

G_M55887_IG03:
       4C63CA               movsxd   r9, edx
       468B4C8910           mov      r9d, dword ptr [rcx+4*r9+16]
       413BC1               cmp      eax, r9d
       7C03                 jl       SHORT G_M55887_IG04
       448BC8               mov      r9d, eax
G_M55887_IG04:
       418BC1               mov      eax, r9d
       FFC2                 inc      edx
       443BC2               cmp      r8d, edx
       7FE6                 jg       SHORT G_M55887_IG03

This is 10-20% faster in some cases.

@mikedn mikedn force-pushed the ifcvt branch 2 times, most recently from dfdb5a9 to fd95492 on February 19, 2018 17:34
@mikedn
Author

mikedn commented Feb 19, 2018

I suppose BBF_RUN_RARELY is the best we have w/o IBC data

Without IBC data BBF_RUN_RARELY is likely next to useless since it will only get set for blocks that aren't likely to be relevant to if conversion (e.g. exception handlers).

But it turns out that Roslyn assemblies do have IBC data so I've been able to check that BBF_RUN_RARELY actually works as expected (and I only saw diffs in those Roslyn assemblies).

@mikedn
Author

mikedn commented Mar 20, 2018

@AndyAyersMS Since benchMonteCarlo came up I thought I'd give floating point conditions a try. A slightly modified version of the code

under_curve += x * x + y * y <= 1.0 ? 1 : 0;

produces

       C4E1792EC6           vucomisd xmm0, xmm6
       0F93C0               setae    al
       0FB6C0               movzx    rax, al
       03D8                 add      ebx, eax

while the original code produces

       C4E1792EC6           vucomisd xmm0, xmm6
       7202                 jb       SHORT G_M57734_IG04
       FFC3                 inc      ebx
G_M57734_IG04:

The new version is ~10% faster. Not a lot; random number generation is probably expensive enough that the advantage of if-conversion in this case is small.

@CarolEidt CarolEidt added this to the 2.2.0 milestone Apr 12, 2018
@karelz
Member

karelz commented May 4, 2019

No update for 1+ year. I recommend closing the PR until it is ready to be worked on again ... thoughts @mikedn?

@mikedn
Author

mikedn commented May 4, 2019

I have further local changes that I'd like to test, maybe this month...

@BruceForstall BruceForstall modified the milestones: 3.0, Future Jul 1, 2019
@benaadams
Member

Looks like cmov is faster than a branch in all circumstances on SkyLake and above? https://github.com/xiadz/cmov


@mikedn
Author

mikedn commented Aug 5, 2019

Looks like cmov is faster than a branch in all circumstances on SkyLake and above?

As far as I can tell, what was measured is a single case, and one that doesn't include CMOV in a loop-carried dependency.

In general there's simply no way for CMOV to be faster in cases that involve predictable branches. The CPU has to execute the CMOV and all its dependencies. With a predictable branch the CPU has almost nothing to do, especially if the branch is predicted not taken.

Ultimately this is the kind of stuff that tiered compilation should handle, but I'm not sure when the JIT will get there. In the meantime it's not clear if it's worth the risk to slow down some cases to speed up others. Though it's true that when you do get a speedup from CMOV it can be significant, usually larger than the potential slowdown.

Anyway, I've been working on this on and off, I need to push some changes...

@sandreenko sandreenko added the post-consolidation PRs which will be hand ported to dotnet/runtime label Nov 6, 2019
@maryamariyan
Member

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create a patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@mikedn
Author

mikedn commented Nov 13, 2019

I'll get back to this once I clear up some of the IR improvements I hope to do (GT_LIST, GT_ASG, GT_ADDR and struct stuff)

@mikedn mikedn closed this Nov 13, 2019
Labels: area-CodeGen, post-consolidation (PRs which will be hand ported to dotnet/runtime)