
not(isNull x) leads to very odd and partially unreachable IL code that performace 5x slower than redefining not yourself #9433

Closed
abelbraaksma opened this issue Jun 12, 2020 · 13 comments · Fixed by #9715
Labels: Area-Library (Issues for FSharp.Core not covered elsewhere)
Milestone: Backlog


abelbraaksma commented Jun 12, 2020

I noticed this while doing timings for #9390, where sometimes using not gave an unexpected performance degradation. This boiled down to not sometimes leading to very unexpected IL.

Repro steps

Take the following code snippet:

let useNotIsNull (str:string) =
    if not(isNull str) then str.Length
    else 0

Because not is coded to emit a single ceq IL instruction, and isNull, though written with match, also compiles down to essentially a single ceq instruction, you'd expect to end up with two instructions, or, after optimization, a single one. Instead, it blows up:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_0006

    IL_0003: ldc.i4.0
    IL_0004: br.s IL_0007

    IL_0006: ldc.i4.1

    IL_0007: brtrue.s IL_0010

    IL_0009: ldarg.0
    IL_000a: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_000f: ret

    IL_0010: ldc.i4.0
    IL_0011: ret

Which decompiles to C# as:

public static int useNotIsNull(string str)
{
    if (str != null || 1 == 0)
    {
        return str.Length;
    }
    return 0;
}

If you were to recreate the not function as follows:

let not x = match x with true -> false | _ -> true

The same code above would now be encoded in IL as:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_000a

    IL_0003: ldarg.0
    IL_0004: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_0009: ret

    IL_000a: ldc.i4.0
    IL_000b: ret

And here is the real killer: if we encode not as itself, the problem also disappears, regardless of whether it is marked inline (as the original is) or not:

let justLikeNot x = not x

let useJustLikeNot (str:string) =
    if justLikeNot(isNull str) then str.Length
    else 0

Resulting IL:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_000a

    IL_0003: ldarg.0
    IL_0004: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_0009: ret

    IL_000a: ldc.i4.0
    IL_000b: ret

Strangely, the not function itself compiles to exactly the same IL as the justLikeNot function above:

    IL_0000: ldarg.0
    IL_0001: ldc.i4.0
    IL_0002: ceq
    IL_0004: ret

Though in one case (with isNull) it leads to strange opcodes, in most other cases it leads to the expected folding of the ceq into a brfalse or brtrue, respectively.

More examples of coding this and their surprising translations can be found in this SharpLab.io snippet.

Expected behavior

The combination should compile down to two instructions or, after optimization, a single branch, as described above, with no unreachable IL.
Actual behavior

See above for the actual behavior. In terms of performance, the different not versions in the code all perform as expected, since they are ultimately folded into optimized x64 assembly, except for the not(isNull x) version. The notIsNull benchmark below uses not(isNull x); the others all use a different way of coding not than the default:

(screenshot: benchmark timings for the different not variants)

(These timings were made by ensuring the function returns a value and is not optimized away (hence the str.Length call), and by repeating the call 10_000 times in a tight for-loop to smooth out timing noise in micro-benchmarks with BDN.)

This is ultimately caused by the final assembly, which looks as follows (note the push/pop and the extra call):

; FSharp.Perf.BenchLength.notIsNull()
       push      rdi
       push      rsi
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       mov       rsi,rax
       xor       edi,edi
M00_L00:                      ; start of for-loop body
       mov       rcx,rsi
       call      FSharp.Perf.StringLength.notIsNull(System.String)
       inc       edi
       cmp       edi,2711       ; loop 10_000 times
       jl        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 44
; FSharp.Perf.StringLength.notIsNull(System.String)
       test      rcx,rcx
       je        short M02_L00
       mov       eax,[rcx+8]
       ret
M02_L00:
       xor       eax,eax
       ret
; Total bytes of code 12

Compare that to using one of the not redefinitions, which, with the same code, gives:

; FSharp.Perf.BenchLength.newNot()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:                      ; start of for-loop body
       test      rax,rax
       je        short M00_L01
       mov       ecx,[rax+8]
M00_L01:
       inc       edx
       cmp       edx,2711       ; loop 10_000 times
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 37

That is: no push/pop of rdi and rsi, and hence no new stack frame.

Known workarounds

Redefine not yourself and the problem seems to disappear.
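As a sketch, the workaround looks like this, shadowing FSharp.Core's not with the match-based definition shown earlier (the shadowed definition is then subject to normal optimization):

```fsharp
// Shadow FSharp.Core's not with a match-based version; per the IL above,
// the compiler then folds the comparison into a single branch instruction.
let inline not (value: bool) = match value with true -> false | false -> true

let useNotIsNull (str: string) =
    if not (isNull str) then str.Length
    else 0
```

Because the shadowing definition is inline, call sites compile to the same tight IL as the justLikeNot example above.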

Related information

I've only tested this on the latest VS + FSC (with optimizations on, of course), but the SharpLab decompilation showed the same results.

I discussed this with @baronfel yesterday and neither of us could come up with a reasonable explanation, especially since re-defining not as itself leads to optimized code, so I'm not sure why the combination not(isNull x) leads to such IL. The SharpLab.io link shows that using something other than isNull inside the parentheses does not lead to the same weird IL opcodes.

@abelbraaksma abelbraaksma changed the title not(isNull x) leads to very odd and partially unreachable IL code not(isNull x) leads to very odd and partially unreachable IL code that performace 5x slower than redefining not yourself Jun 12, 2020
@cartermp
Contributor

Yeah, this is pretty rough. Another good reason to rethink some of this custom IL emission stuff in FSharp.Core.

// Learn more about F# at http://docs.microsoft.com/dotnet/fsharp

open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module Op =
    let inline not' x = match x with true -> false | false -> true

[<MemoryDiagnoser>]
type NotBench() =
    let s = "hello"

    [<Benchmark(Baseline=true)>]
    member _.Not() = not (isNull s)

    [<Benchmark>]
    member _.NotWithMatch() = Op.not' (isNull s)

[<EntryPoint>]
let main argv =
    let summary = BenchmarkRunner.Run<NotBench>()
    printfn "%A" summary
    0 // return an integer exit code
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT DEBUG
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT

| Method       | Mean      | Error     | StdDev    | Median    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|-----------|-----------|-----------|-----------|-------|-------|-------|-------|-----------|
| Not          | 0.2296 ns | 0.0113 ns | 0.0100 ns | 0.2279 ns | 1.000 | -     | -     | -     | -         |
| NotWithMatch | 0.0006 ns | 0.0021 ns | 0.0018 ns | 0.0000 ns | 0.003 | -     | -     | -     | -         |

@cartermp cartermp added the Area-Library Issues for FSharp.Core not covered elsewhere label Jun 12, 2020
@cartermp cartermp added this to the Backlog milestone Jun 12, 2020

abelbraaksma commented Jun 12, 2020

Thanks for the quick response & fix!

0.0006 ns? No machine is that fast ;). I think your benchmark ended up being optimized away (which by itself is good: it shows that the match version of not is more amenable to JIT optimizations). You probably got this notification:

// * Warnings *
ZeroMeasurement
NotBench.NotWithMatch: Default -> The method duration is indistinguishable from the empty method duration

I find that for very small measurements, to prevent this from happening, a little loop helps (while being careful that the loop itself doesn't get erased). I got that tip from BDN w.r.t. timing small CPU-bound functions.

A plain loop doesn't even help here, though; both are now erased:

[<Benchmark(Baseline=true)>]
member _.Not() = 
    for i=0 to 10000 do not (isNull s) |> ignore

[<Benchmark>]
member _.NotWithMatch() = 
    for i=0 to 10000 do Op.not' (isNull s) |> ignore

But this works; the optimizer doesn't erase it anymore:

[<Benchmark(Baseline=true)>]
member _.Not() = 
    let mutable b = false
    for i=0 to 10000 do b <- not (isNull s)
    b

[<Benchmark>]
member _.NotWithMatch() = 
    let mutable b = false
    for i=0 to 10000 do b <- Op.not' (isNull s)
    b

This way you won't get absolute measurements of the kind "not takes X ns", but the relative measurements still show the difference between the two:

(screenshot: relative benchmark results with the loop in place)

If we add the following function, we can do a zero-measurement:

[<Benchmark>]
member _.OnlyNull() = 
    let mutable b = false
    for i=0 to 10000 do b <- isNull s
    b

Which shows that adding the optimized version of not to isNull is now indistinguishable: zero overhead. Which is correct, because a condition that was brtrue simply becomes brfalse or vice versa, precisely what not should do after optimizations have run.

The overhead of the original not function is 2.86 / 10_000 = 0.28 ns per call (on my machine).

(screenshot: benchmark results including the OnlyNull zero-measurement)

Another good reason to rethink some of this custom IL emission stuff in FSharp.Core.

What is weird to me is that the custom IL leads to the incorrect result only when compared to the match-based function. Which suggests that the optimizer treats the custom IL differently, even when it is identical to the IL of a normal function.

@cartermp
Contributor

Heh, good catch! Yeah, it gets optimized away. I wonder if this means that some user code will have things optimized away as well?


cartermp commented Jun 13, 2020

Edit: corrected some info

Okay, it appears that you don't need the big loop around the check (since that could just serve to amortize the cost until the variants look pretty much the same), depending on how it runs on your machine. With some dummy code that prevents the optimization:

module Op =
    let inline not' x = if x then false else true
    let inline not'' x = match x with true -> false | false -> true


[<MemoryDiagnoser>]
type NotBench() =
    let s = "hello"

    [<Benchmark(Baseline=true)>]
    member _.Not() = 
        let mutable b = false
        for i=0 to 1 do b <- not (isNull s)
        b

    [<Benchmark>]
    member _.NotWithIf() = 
        let mutable b = false
        for i=0 to 1 do b <- Op.not' (isNull s)
        b

    [<Benchmark>]
    member _.NotWithMatch() = 
        let mutable b = false
        for i=0 to 1 do b <- Op.not'' (isNull s)
        b

I get these timings against .NET 5:

| Method       | Mean     | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|----------|-----------|-----------|-------|-------|-------|-------|-----------|
| Not          | 1.859 ns | 0.0224 ns | 0.0187 ns | 1.00  | -     | -     | -     | -         |
| NotWithIf    | 1.277 ns | 0.0178 ns | 0.0139 ns | 0.69  | -     | -     | -     | -         |
| NotWithMatch | 1.277 ns | 0.0188 ns | 0.0157 ns | 0.69  | -     | -     | -     | -         |

And these for net48:

| Method       | Mean     | Error     | StdDev    | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|----------|-----------|-----------|-------|---------|-------|-------|-------|-----------|
| Not          | 1.909 ns | 0.0448 ns | 0.0419 ns | 1.00  | 0.00    | -     | -     | -     | -         |
| NotWithIf    | 1.295 ns | 0.0357 ns | 0.0334 ns | 0.68  | 0.03    | -     | -     | -     | -         |
| NotWithMatch | 1.292 ns | 0.0126 ns | 0.0118 ns | 0.68  | 0.02    | -     | -     | -     | -         |

Some small variance from run to run, but the modified calls always seem better.

More generally though, I think moving away from embedded inline IL when we can is always positive.


abelbraaksma commented Jun 13, 2020

That's very interesting, that you get sensible results even without the for .. to approach. If I run your code, I get nonsensical results (I mean: they look sensible, but they really aren't):

(screenshot: benchmark results from running the code above, without a loop)

When I use a loop of 10 iterations, they are still all within the error margin. When I increase to 100 it starts to make sense. But to make sure I am seeing the right numbers, I generally increase to 1000 or more (for very small functions, or when I suspect the timings).

This is with 100 iterations (adjusted with OperationsPerInvoke):
(screenshot: benchmark results with a 100-iteration loop)

This is with 1000 iterations (adjusted with OperationsPerInvoke):
(screenshot: benchmark results with a 1000-iteration loop)

This may have something to do with how accurate the underlying timer is, the way the processor predicts execution, or how cache lines are read; I don't know. Either way, when timings are not what I expect, I double-check, or increase the loop count.
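For reference, a sketch of what such a normalized loop benchmark might look like; the iteration count of 1000 matches the screenshot above, but the member name and structure here are illustrative, not the exact code used:

```fsharp
open BenchmarkDotNet.Attributes

type NotBench() =
    let s = "hello"

    // BDN divides the measured time by OperationsPerInvoke, so the
    // reported Mean is per call rather than per 1000-iteration loop.
    // The mutable accumulator keeps the optimizer from erasing the body.
    [<Benchmark(Baseline = true, OperationsPerInvoke = 1000)>]
    member _.Not() =
        let mutable b = false
        for i = 1 to 1000 do b <- not (isNull s)
        b
```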

Obviously, the more noise we introduce, the harder it is to pin down the net cost of a certain operation, but diffing the results gives us a reasonably good handle (in your case, you don't need any of this; I wish I had your machine ;)). For the not function it is rather obvious though, because the disassembly clearly shows the extra stack frame and the overhead of that extra member call.

Just to say: both our measurements show the new not to be 25%-32% faster, but the new situation is equal to having no not at all. The old situation meant a 19-byte-larger binary and ~14 extra asm instructions, compared to exactly 0 extra instructions in the new case, so the overhead is effectively eliminated entirely.

Which is pretty awesome!


TIHan commented Jun 15, 2020

This is really good analysis. Thank you for supplying all the numbers!

Based on the info provided, we should consider just redefining not as the match expression, or even an if: let inline not (value: bool) = if value then false else true.

I haven't looked into the behavior of the inline IL to determine whether it's a simple bug. To me, we don't need the inline IL now, since a basic match or if version produces better IL and fixes the isNull perf.
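The two candidate redefinitions side by side, as a sketch (primed names used here only to avoid shadowing FSharp.Core's not in a test file):

```fsharp
// Both are plain F# with no embedded IL; per the measurements in this
// thread, both compile to a branch that RyuJIT folds cleanly.
let inline notViaIf (value: bool) = if value then false else true
let inline notViaMatch (value: bool) = match value with true -> false | false -> true
```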


forki commented Jun 15, 2020 via email


forki commented Jun 15, 2020 via email

@abelbraaksma
Contributor Author

@forki, the IL doesn't matter in that case; the jitted assembly folds it in the presence of a constant. I found that out while doing measurements on multiple uses of not.


forki commented Jun 16, 2020

It does matter in the context of further simplifications.


abelbraaksma commented Jun 16, 2020

@forki If you have:

let mutable b = true
b <- not b
b <- not b
b <- not b
b

It will appear in the IL as written, but in the disassembly it's const-folded. That's because RyuJIT has the proper info that it's safe to do so.

In some cases, the appearance of a ceq with a boolean in the IL, followed by another ceq and boolean, will also be folded, leading to less IL.

This issue points to those cases where that didn't happen, namely where one of the chained comparisons compares to zero (null in user code).

I'll have to check your specific case, but certainly right now it's not optimized if you end the sequence with isNull x.


abelbraaksma commented Jul 4, 2020

> But can someone please check that
> not <| not <| not <| not <| true
> will be rewritten to true in the IL

@forki, if I fix that code to be not <| (not <| (not <| (not <| true))), it is just true: https://sharplab.io/#v2:DYLgZgzgNAJiDUAfA9gBwKYDsAEBlAnhAC7oC2AsAFBH4bYDCAFAJTYC8V2X2pZARugBO2APoA6ALIt22TMiLYAPImyM5C5avVKVa+TuxFBAV3TNzQA=

And this:

let mutable b = true
b <- not b
b <- not b
b <- not b
b

Gets more complicated IL: https://sharplab.io/#v2:DYLgZgzgNAJiDUAfA9gBwKYDsAEBlAnhAC7oC2AsAFBH4bYDCAFAJTYC8V2X2pZARugBO2APoA6ALIt22Tt3nB0RHgFciAQz6LsfGUUEr0c+V10AeALTZMyZX2MnzVm3Yfyn12zrfc+QA===

But the resulting JIT disassembly is just:

    L0000: xor eax, eax
    L0002: ret

Which is interesting, because if I try the same from C# it isn't folded. Hmm 🤔


abelbraaksma commented Jul 22, 2020

The main issue here has been resolved in PR #9715, but the internal not function remains in place and should probably be cleaned up as well; see the follow-up issue #9753.
