
not(isNull x) leads to very odd and partially unreachable IL code that performace 5x slower than redefining not yourself #9433

Closed
abelbraaksma opened this issue Jun 12, 2020 · 13 comments · Fixed by #9715
Labels: Area-Library (Issues for FSharp.Core not covered elsewhere)
Milestone: Backlog


abelbraaksma commented Jun 12, 2020

I noticed this while doing timings for #9390, where sometimes using not gave an unexpected performance degradation. This boiled down to not sometimes leading to very unexpected IL.

Repro steps

Take the following code snippet:

let useNotIsNull (str:string) =
    if not(isNull str) then str.Length
    else 0

Because not is coded to emit a single ceq IL instruction, and isNull, though written with match, also compiles down to essentially a single ceq instruction, you'd expect to end up with two instructions, or, after optimization, a single one. Instead, it blows up:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_0006

    IL_0003: ldc.i4.0
    IL_0004: br.s IL_0007

    IL_0006: ldc.i4.1

    IL_0007: brtrue.s IL_0010

    IL_0009: ldarg.0
    IL_000a: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_000f: ret

    IL_0010: ldc.i4.0
    IL_0011: ret

Which decompiles to C# as:

public static int useNotIsNull(string str)
{
    if (str != null || 1 == 0)
    {
        return str.Length;
    }
    return 0;
}

If you were to recreate the not function as follows:

let not x = match x with true -> false | _ -> true

The same code above would now be encoded in IL as:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_000a

    IL_0003: ldarg.0
    IL_0004: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_0009: ret

    IL_000a: ldc.i4.0
    IL_000b: ret

And here is the real killer: if we encode not as itself, the problem also disappears, regardless of whether it is marked inline (as the original is) or not:

let justLikeNot x = not x

let useJustLikeNot (str:string) =
    if justLikeNot(isNull str) then str.Length
    else 0

Resulting IL:

    IL_0000: ldarg.0
    IL_0001: brfalse.s IL_000a

    IL_0003: ldarg.0
    IL_0004: callvirt instance int32 [System.Private.CoreLib]System.String::get_Length()
    IL_0009: ret

    IL_000a: ldc.i4.0
    IL_000b: ret

Strangely, the not function itself compiles to exactly the same IL as the justLikeNot function above:

    IL_0000: ldarg.0
    IL_0001: ldc.i4.0
    IL_0002: ceq
    IL_0004: ret

Though in one case (with isNull) it leads to strange opcodes, in most other cases it leads to the expected folding of the ceq into a brfalse or brtrue, respectively.

More examples of coding this and their surprising translations can be found in this SharpLab.io snippet.

Expected behavior

The combination should compile down to two instructions or, after optimization, a single branch, as described above, with no unreachable IL.
Actual behavior

See above for the actual behavior. In terms of performance, the different not versions in the code all perform as expected, since they are ultimately folded into optimized x64 assembly, except for the not(isNull x) version. The notIsNull benchmark below uses not(isNull x); the others all use a different way of coding not than the default:

(screenshot: benchmark timings for the different not variants)

(These timings were made by ensuring the function returns a value and is not optimized away (hence the str.Length call), and by repeating the call 10_000 times in a tight for-loop to smooth out timing noise in micro-benchmarks with BDN.)

This is ultimately caused by the final assembly, which looks as follows (note the push/pop and the extra call):

; FSharp.Perf.BenchLength.notIsNull()
       push      rdi
       push      rsi
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       mov       rsi,rax
       xor       edi,edi
M00_L00:                      ; start of for-loop body
       mov       rcx,rsi
       call      FSharp.Perf.StringLength.notIsNull(System.String)
       inc       edi
       cmp       edi,2711       ; loop 10_000 times
       jl        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 44
; FSharp.Perf.StringLength.notIsNull(System.String)
       test      rcx,rcx
       je        short M02_L00
       mov       eax,[rcx+8]
       ret
M02_L00:
       xor       eax,eax
       ret
; Total bytes of code 12

Compare that to using one of the not redefinitions, which, with the same code, gives:

; FSharp.Perf.BenchLength.newNot()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:                      ; start of for-loop body
       test      rax,rax
       je        short M00_L01
       mov       ecx,[rax+8]
M00_L01:
       inc       edx
       cmp       edx,2711       ; loop 10_000 times
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 37

That is: no push/pop of rdi and rsi, and hence no new stack frame.

Known workarounds

Redefine not yourself and the problem seems to disappear.
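As a sketch, the workaround looks like this, shadowing FSharp.Core's not with the match-based definition shown earlier (the shadowed definition is then subject to normal optimization):

```fsharp
// Shadow FSharp.Core's not with a match-based version; per the IL above,
// the compiler then folds the comparison into a single branch instruction.
let inline not (value: bool) = match value with true -> false | false -> true

let useNotIsNull (str: string) =
    if not (isNull str) then str.Length
    else 0
```

Because the shadowing definition is inline, call sites compile to the same tight IL as the justLikeNot example above.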

Related information

I've only tested this on the latest VS + FSC (with optimizations on, of course), but the SharpLab decompilation showed the same results.

I discussed this with @baronfel yesterday and neither of us could come up with a reasonable explanation, especially since re-defining not as itself leads to optimized code, so I'm not sure why the combination not(isNull x) leads to such IL. The SharpLab.io link shows that using something other than isNull inside the parentheses does not lead to the same weird IL opcodes.

@abelbraaksma abelbraaksma changed the title not(isNull x) leads to very odd and partially unreachable IL code not(isNull x) leads to very odd and partially unreachable IL code that performace 5x slower than redefining not yourself Jun 12, 2020
@cartermp
Contributor

Yeah, this is pretty rough. Another good reason to rethink some of this custom IL emission stuff in FSharp.Core.

// Learn more about F# at http://docs.microsoft.com/dotnet/fsharp

open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module Op =
    let inline not' x = match x with true -> false | false -> true

[<MemoryDiagnoser>]
type NotBench() =
    let s = "hello"

    [<Benchmark(Baseline=true)>]
    member _.Not() = not (isNull s)

    [<Benchmark>]
    member _.NotWithMatch() = Op.not' (isNull s)

[<EntryPoint>]
let main argv =
    let summary = BenchmarkRunner.Run<NotBench>()
    printfn "%A" summary
    0 // return an integer exit code
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT DEBUG
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT

| Method       | Mean      | Error     | StdDev    | Median    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|-----------|-----------|-----------|-----------|-------|-------|-------|-------|-----------|
| Not          | 0.2296 ns | 0.0113 ns | 0.0100 ns | 0.2279 ns | 1.000 | -     | -     | -     | -         |
| NotWithMatch | 0.0006 ns | 0.0021 ns | 0.0018 ns | 0.0000 ns | 0.003 | -     | -     | -     | -         |

@cartermp cartermp added the Area-Library Issues for FSharp.Core not covered elsewhere label Jun 12, 2020
@cartermp cartermp added this to the Backlog milestone Jun 12, 2020

abelbraaksma commented Jun 12, 2020

Thanks for the quick response & fix!

0.0006 ns? No machine is that fast ;). I think your benchmark ended up being optimized away (which by itself is good: it shows that the match version of not is more amenable to JIT optimizations). You probably got this notification:

// * Warnings *
ZeroMeasurement
NotBench.NotWithMatch: Default -> The method duration is indistinguishable from the empty method duration

I find that for very small measurements, to prevent this from happening, a little loop helps (while being careful that the loop itself doesn't get erased). I got that tip from BDN w.r.t. timing small CPU-bound functions.

A plain loop doesn't even help here, though; both are now erased:

[<Benchmark(Baseline=true)>]
member _.Not() = 
    for i=0 to 10000 do not (isNull s) |> ignore

[<Benchmark>]
member _.NotWithMatch() = 
    for i=0 to 10000 do Op.not' (isNull s) |> ignore

But this works; the optimizer doesn't erase it anymore:

[<Benchmark(Baseline=true)>]
member _.Not() = 
    let mutable b = false
    for i=0 to 10000 do b <- not (isNull s)
    b

[<Benchmark>]
member _.NotWithMatch() = 
    let mutable b = false
    for i=0 to 10000 do b <- Op.not' (isNull s)
    b

This way you won't get absolute measurements of the kind "not takes X ns", but the relative measurements still show the difference between the two:

(screenshot: relative benchmark results with the loop in place)

If we add the following function, we can do a zero-measurement:

[<Benchmark>]
member _.OnlyNull() = 
    let mutable b = false
    for i=0 to 10000 do b <- isNull s
    b

Which shows that adding the optimized version of not to isNull is now indistinguishable: zero overhead. Which is correct, because a condition that was brtrue simply becomes brfalse or vice versa, precisely what not should do after optimizations have run.

The overhead of the original not function is 2.86 / 10_000 = 0.28 ns per call (on my machine).

(screenshot: benchmark results including the OnlyNull zero-measurement)

Another good reason to rethink some of this custom IL emission stuff in FSharp.Core.

What is weird to me is that the custom IL leads to the incorrect result only when compared to the match-based function. Which suggests that the optimizer treats the custom IL differently, even when it is identical to the IL of a normal function.

@cartermp
Contributor

Heh, good catch! Yeah, it gets optimized away. I wonder if this means that some user code will have things optimized away as well?


cartermp commented Jun 13, 2020

Edit: corrected some info

Okay, it appears that you don't need the big loop around the check (since that could just serve to amortize the cost until the variants look pretty much the same), depending on how it runs on your machine. With some dummy code that prevents the optimization:

module Op =
    let inline not' x = if x then false else true
    let inline not'' x = match x with true -> false | false -> true


[<MemoryDiagnoser>]
type NotBench() =
    let s = "hello"

    [<Benchmark(Baseline=true)>]
    member _.Not() = 
        let mutable b = false
        for i=0 to 1 do b <- not (isNull s)
        b

    [<Benchmark>]
    member _.NotWithIf() = 
        let mutable b = false
        for i=0 to 1 do b <- Op.not' (isNull s)
        b

    [<Benchmark>]
    member _.NotWithMatch() = 
        let mutable b = false
        for i=0 to 1 do b <- Op.not'' (isNull s)
        b

I get these timings against .NET 5:

| Method       | Mean     | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|----------|-----------|-----------|-------|-------|-------|-------|-----------|
| Not          | 1.859 ns | 0.0224 ns | 0.0187 ns | 1.00  | -     | -     | -     | -         |
| NotWithIf    | 1.277 ns | 0.0178 ns | 0.0139 ns | 0.69  | -     | -     | -     | -         |
| NotWithMatch | 1.277 ns | 0.0188 ns | 0.0157 ns | 0.69  | -     | -     | -     | -         |

And these for net48:

| Method       | Mean     | Error     | StdDev    | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------|----------|-----------|-----------|-------|---------|-------|-------|-------|-----------|
| Not          | 1.909 ns | 0.0448 ns | 0.0419 ns | 1.00  | 0.00    | -     | -     | -     | -         |
| NotWithIf    | 1.295 ns | 0.0357 ns | 0.0334 ns | 0.68  | 0.03    | -     | -     | -     | -         |
| NotWithMatch | 1.292 ns | 0.0126 ns | 0.0118 ns | 0.68  | 0.02    | -     | -     | -     | -         |

Some small variance from run to run, but the modified calls always seem better.

More generally though, I think moving away from embedded inline IL when we can is always positive.


abelbraaksma commented Jun 13, 2020

That's very interesting, that you get sensible results even without the for .. to approach. If I run your code, I get nonsensical results (I mean: they look sensible, but they really aren't):

(screenshot: benchmark results from running the code above, without a loop)

When I use a loop of 10 iterations, they are still all within the error margin. When I increase to 100 it starts to make sense. But to make sure I am seeing the right numbers, I generally increase to 1000 or more (for very small functions, or when I suspect the timings).

This is with 100 iterations (adjusted with OperationsPerInvoke):
(screenshot: benchmark results with a 100-iteration loop)

This is with 1000 iterations (adjusted with OperationsPerInvoke):
(screenshot: benchmark results with a 1000-iteration loop)

This may have something to do with how accurate the underlying timer is, the way the processor predicts execution, or how cache lines are read; I don't know. Either way, when timings are not what I expect, I double-check, or increase the loop count.
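For reference, a sketch of what such a normalized loop benchmark might look like; the iteration count of 1000 matches the screenshot above, but the member name and structure here are illustrative, not the exact code used:

```fsharp
open BenchmarkDotNet.Attributes

type NotBench() =
    let s = "hello"

    // BDN divides the measured time by OperationsPerInvoke, so the
    // reported Mean is per call rather than per 1000-iteration loop.
    // The mutable accumulator keeps the optimizer from erasing the body.
    [<Benchmark(Baseline = true, OperationsPerInvoke = 1000)>]
    member _.Not() =
        let mutable b = false
        for i = 1 to 1000 do b <- not (isNull s)
        b
```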

Obviously, the more noise we introduce, the harder it is to pin down the net cost of a certain operation, but diffing the results gives us a reasonably good handle (in your case, you don't need any of this; I wish I had your machine ;)). For the not function it is rather obvious though, because the disassembly clearly shows the extra stack frame and the overhead of that extra member call.

Just to say: both our measurements show the new not to be 25%-32% faster, but the new situation is equal to having no not at all. The old situation meant a 19-byte-larger binary and ~14 extra asm instructions, compared to exactly 0 extra instructions in the new case, so the overhead is effectively eliminated entirely.

Which is pretty awesome!


TIHan commented Jun 15, 2020

This is really good analysis. Thank you for supplying all the numbers!

Based on the info provided, we should consider just redefining not as the match expression, or even an if: let inline not (value: bool) = if value then false else true.

I haven't looked into the behavior of the inline IL to determine whether it's a simple bug. To me, we don't need the inline IL now, since a basic match or if version produces better IL and fixes the isNull perf.
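The two candidate redefinitions side by side, as a sketch (primed names used here only to avoid shadowing FSharp.Core's not in a test file):

```fsharp
// Both are plain F# with no embedded IL; per the measurements in this
// thread, both compile to a branch that RyuJIT folds cleanly.
let inline notViaIf (value: bool) = if value then false else true
let inline notViaMatch (value: bool) = match value with true -> false | false -> true
```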


forki commented Jun 15, 2020 via email


forki commented Jun 15, 2020 via email

@abelbraaksma
Contributor Author

@forki, the IL doesn't matter in that case; the jitted assembly folds it in the presence of a constant. I found that out while doing measurements on multiple uses of not.


forki commented Jun 16, 2020

It does matter in the context of further simplifications.


abelbraaksma commented Jun 16, 2020

@forki If you have:

let mutable b = true
b <- not b
b <- not b
b <- not b
b

It will appear in the IL as written, but in the disassembly it's const-folded. That's because RyuJIT has the proper info that it's safe to do so.

In some cases, the appearance of a ceq with a boolean in the IL, followed by another ceq and boolean, will also be folded, leading to less IL.

This issue points to those cases where that didn't happen, namely where one of the chained comparisons compares to zero (null in user code).

I'll have to check your specific case, but certainly right now it's not optimized if you end the sequence with isNull x.


abelbraaksma commented Jul 4, 2020

> But can someone please check that
> not <| not <| not <| not <| true
> will be rewritten to true in the IL

@forki, if I fix that code to be not <| (not <| (not <| (not <| true))), it is just true: https://sharplab.io/#v2:DYLgZgzgNAJiDUAfA9gBwKYDsAEBlAnhAC7oC2AsAFBH4bYDCAFAJTYC8V2X2pZARugBO2APoA6ALIt22TMiLYAPImyM5C5avVKVa+TuxFBAV3TNzQA=

And this:

let mutable b = true
b <- not b
b <- not b
b <- not b
b

Gets more complicated IL: https://sharplab.io/#v2:DYLgZgzgNAJiDUAfA9gBwKYDsAEBlAnhAC7oC2AsAFBH4bYDCAFAJTYC8V2X2pZARugBO2APoA6ALIt22Tt3nB0RHgFciAQz6LsfGUUEr0c+V10AeALTZMyZX2MnzVm3Yfyn12zrfc+QA===

But the resulting JIT disassembly is just:

    L0000: xor eax, eax
    L0002: ret

Which is interesting, because if I try the same from C# it isn't folded. Hmm 🤔


abelbraaksma commented Jul 22, 2020

The main issue here has been resolved in PR #9715, but the internal not function remains in place and should probably be cleaned up as well; see the follow-up issue #9753.
