Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

mrange · 2020-02-01T09:44:31Z

Hi.

I saw a performance regression going from .NET 4.8 to dotnet core 3.1. It's small so in practice this might not hurt most users but I thought it's better to create an issue than keeping mum.

I noticed it when discussing my other issue: #2191 so the code will be similar although I don't think this is tail call related but I don't know for sure of course.

When setting up a simple push stream pipeline

// Minimalistic PushStream
//  A PushStream accepts a receiver function that will be called
//  with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit

module PushStream =
  let inline zero      ()       = LanguagePrimitives.GenericZero
  let inline push      r v      = r v

  // Creates a PushStream with all integers from b to e (inclusive)
  let inline fromRange b e    r = for i = b to e do push r i
  // Maps all values in ps using mapping function f
  let inline map       f   ps r = ps (fun v -> push r (f v))
  // Filters all values in ps using filter function f
  let inline filter    f   ps r = ps (fun v -> if f v then push r v)
  // Sums all values in ps
  let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

[<DisassemblyDiagnoser>]
type Benchmarks () =
  [<Params (10000, 100)>] 
  member val public Count = 100 with get, set

  [<Benchmark>]
  member x.SimplePushStreamTest () =
    PushStream.fromRange  0 x.Count
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum

Benchmark dotnet reports:

$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
...
|               Method |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest |      .NET 4.8 |         net48 |   100 |    400.6 ns |   3.92 ns |   3.67 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 |   100 |    439.3 ns |   4.35 ns |   4.07 ns |  1.10 |    0.02 |     273 B |
|                      |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest |      .NET 4.8 |         net48 | 10000 | 33,542.5 ns | 143.25 ns | 133.99 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 | 10000 | 39,449.8 ns | 259.08 ns | 242.35 ns |  1.18 |    0.01 |     273 B |

.NET 4.8 performs between 10% to 20% faster than dotnet core 3.1.

I dug a bit into the jitted assembler and found the following differences

--- dotnetcore.asm
+++ net48.asm
@@ -1,4 +1,4 @@
-; dotnet core 3.1
+; .net v48
 
 ; PushStream.fromRange  0 x.Count
 LOOP:
@@ -12,7 +12,6 @@
 jne     LOOP
 
 ; PushStream.map     int64
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 movsxd  rdx,edx
 mov     rax,qword ptr [rcx]
@@ -21,8 +20,7 @@
 jmp     rax
 
 ; PushStream.filter  (fun v -> (v &&& 1L) = 0L)
-nop     dword ptr [rax+rax]
-mov     eax,edx
+mov     rax,rdx
 test    al,1
 jne     BAILOUT
 mov     rcx,qword ptr [rcx+8]
@@ -35,7 +33,6 @@
 ret
 
 ; PushStream.map     ((+) 1L)
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 inc     rdx
 mov     rax,qword ptr [rcx]
@@ -44,11 +41,9 @@
 jmp     rax
 
 ; PushStream.sum
-nop     dword ptr [rax+rax]
 mov     rax,qword ptr [rcx+8]
 mov     rcx,rax
 add     rdx,qword ptr [rax+8]
 mov     qword ptr [rcx+8],rdx
 xor     eax,eax
 ret
-

It seems that in dotnet core there's an extra nop at the start of each method. I suspected tiered compilation but after much messing about trying to disable tiered compilation it's either unrelated or I wasn't able to disable tiered compilation.

It surprises me that the nop adds this much overhead but I can't spot anything else of significance.

The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression

And here:

module PerformanceRegression =
  open System
  open System.Linq
  open System.Diagnostics

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open BenchmarkDotNet.Attributes
    open BenchmarkDotNet.Configs
    open BenchmarkDotNet.Jobs
    open BenchmarkDotNet.Horology
    open BenchmarkDotNet.Running
    open BenchmarkDotNet.Diagnostics.Windows.Configs

    [<DisassemblyDiagnoser>]
    type Benchmarks () =
      [<Params (10000, 100)>] 
      member val public Count = 100 with get, set

      [<Benchmark>]
      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

    let run argv = 
      let job = Job.Default
                    .WithWarmupCount(30)
                    .WithIterationTime(TimeInterval.FromMilliseconds(250.0)) // the default is 0.5s per iteration, which is slighlty too much for us
                    .WithMinIterationCount(15)
                    .WithMaxIterationCount(20)
                    .AsDefault()
      let config = DefaultConfig.Instance.AddJob(job)
      let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
      let summary = b.Run(argv, config)
      printfn "%A" summary

// Run with: dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
[<EntryPoint>]
let main argv =
  PerformanceRegression.Tests.run argv
  0

category:cq
theme:optimization
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

jkotas · 2020-02-01T19:00:27Z

The extra no-ops should be gone in current master after dotnet/coreclr#26740 .

jkotas · 2020-02-01T19:00:32Z

cc @adamsitnik

AndyAyersMS · 2020-05-06T01:47:56Z

We should verify perf is recovered. Let's take a look as part of .NET 5.

BruceForstall · 2020-06-22T23:36:24Z

@erozenfeld Can you verify if any work remains here?

AndyAyersMS · 2020-07-31T00:54:24Z

I still see a regression with a fairly recent P8 build:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-DTXTLF : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-XPSVFA : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-KQQRIJ : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |       Error |      StdDev | Ratio | RatioSD |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|------------:|------------:|------:|--------:|
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 |   100 |    372.9 ns |     4.34 ns |     4.06 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 |   100 |    406.1 ns |     9.33 ns |     9.58 ns |  1.09 |    0.03 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 |   100 |    439.1 ns |    23.10 ns |    26.60 ns |  1.18 |    0.08 |
|                      |            |               |               |       |             |             |             |       |         |
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 | 10000 | 29,770.6 ns |   531.09 ns |   470.80 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,609.8 ns |   402.64 ns |   356.93 ns |  1.03 |    0.02 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 | 10000 | 37,312.9 ns | 1,136.50 ns | 1,167.10 ns |  1.25 |    0.04 |

AndyAyersMS · 2020-08-07T16:08:09Z

@briansull if you haven't started looking at this, I'll take a look.

briansull · 2020-08-07T17:18:07Z

Sure, go ahead and take a look

AndyAyersMS · 2020-08-08T01:13:34Z

Suspect this is some residual overhead from tiered compilation. Disabling that gives

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-FZFFZC : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-FIZZFG : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-HDIAOD : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 |   100 |    375.1 ns |   4.80 ns |   4.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 |   100 |    388.9 ns |  15.25 ns |  15.66 ns |  1.03 |    0.04 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 |   100 |    389.1 ns |   7.26 ns |   6.79 ns |  1.04 |    0.02 |     273 B |
|                      |            |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 | 10000 | 31,360.3 ns | 545.98 ns | 510.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,618.3 ns | 217.74 ns | 193.02 ns |  0.98 |    0.02 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 | 10000 | 31,787.3 ns | 412.13 ns | 385.51 ns |  1.01 |    0.02 |     273 B |

Differential profiles of 4.8 vs 5.0 show identical codegen in all the hot methods, eg:

;; 4.8
;; Program+PerformanceRegression+Tests+SimplePushStreamTest@42-6.Invoke(Int64)

00007ff8`f2816040 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff8`f2816044 48ffc2          inc     rdx
00007ff8`f2816047 488b01          mov     rax,qword ptr [rcx]
00007ff8`f281604a 488b4040        mov     rax,qword ptr [rax+40h]
00007ff8`f281604e 488b4020        mov     rax,qword ptr [rax+20h]
00007ff8`f2816052 48ffe0          jmp     rax

;; 5.0
;; Assembly listing for method SimplePushStreamTest@42-6:Invoke(long):Unit:this

G_M35068_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M35068_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      rdx
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+64]
       mov      rax, qword ptr [rax+32]
						;; bbWeight=1    PerfScore 8.25
G_M35068_IG03:
       rex.jmp  rax

cc @kouvel

Going to relabel this as VM.

kouvel · 2020-08-11T03:27:44Z

This seems to have something to do with JIT timing. I have seen some cases before where different loop alignment led to noticeable regressions when tiering is enabled. It could also be better at random, but naturally the regressions are more noticed. Adjusting timing of rejits by changing the call count threshold seems to change the perf significantly. Code locality may also be relevant, as I noticed that using R2R'ed runtime binaries versus using IL-only runtime binaries also affects perf significantly regardless of tiering. Tiering timings have changed between 3.1 and 5.0, typically it would be for the better, though there may be benchmarks where slight differences in timing may realize as larger differences in performance. I suspect this is not a regression due to a bug but rather a regression due to chance. It would need more investigation to determine the root causes, and if my theory is right, to determine how to reduce the chance factor and make it more deterministic.

kouvel · 2020-08-11T11:46:43Z

I see large differences in time spent in a function when the function's code crosses a cache line boundary. For small functions it may help to align them such that they would fit within a cache line.

AndyAyersMS · 2020-08-11T16:00:29Z

#2249 added 32 byte alignment for tier1 methods with loops; might be interesting to try this for all tier1 methods.

kouvel · 2020-08-11T21:27:52Z

I see, I was also thinking something like this:

If 0 < size <= 16 then align at any 16-byte offset in a cache line
If 16 < size <= 32 then align at offsets 0, 16, or 32, but not at 48
If 32 < size <= 48 then align at offsets 0 or 16
If 48 < size <= 64 then align at offset 0 only

Might allow for a bit more density but #2249 also mentioned that crossing a 32-byte boundary has perf issues for loops, so 32-byte alignment may work better in some cases.

mrange changed the title ~~Performance regression between dotnet core 3.1 from .NET 4.8~~ Minor performance regression going to dotnet core 3.1 from .NET 4.8 Feb 1, 2020

jkotas added the tenet-performance Performance related issue label Feb 1, 2020

jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 1, 2020

AndyAyersMS added this to the 5.0 milestone May 6, 2020

BruceForstall assigned AndyAyersMS and erozenfeld and unassigned AndyAyersMS Jun 22, 2020

JulieLeeMSFT unassigned erozenfeld Jul 29, 2020

JulieLeeMSFT assigned briansull Aug 5, 2020

AndyAyersMS self-assigned this Aug 7, 2020

AndyAyersMS unassigned AndyAyersMS and briansull Aug 8, 2020

AndyAyersMS added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 8, 2020

kouvel added area-TieredCompilation-coreclr and removed area-VM-coreclr labels Aug 11, 2020

kouvel modified the milestones: 5.0.0, 6.0.0, Future Aug 11, 2020

kouvel modified the milestones: Future, 6.0.0 Aug 11, 2020

mangod9 modified the milestones: 6.0.0, Future Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

mrange commented Feb 1, 2020 •

edited by BruceForstall

Loading

jkotas commented Feb 1, 2020

jkotas commented Feb 1, 2020

AndyAyersMS commented May 6, 2020

BruceForstall commented Jun 22, 2020

AndyAyersMS commented Jul 31, 2020

AndyAyersMS commented Aug 7, 2020

briansull commented Aug 7, 2020

AndyAyersMS commented Aug 8, 2020

kouvel commented Aug 11, 2020

kouvel commented Aug 11, 2020

AndyAyersMS commented Aug 11, 2020

kouvel commented Aug 11, 2020

Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

Comments

mrange commented Feb 1, 2020 • edited by BruceForstall Loading

jkotas commented Feb 1, 2020

jkotas commented Feb 1, 2020

AndyAyersMS commented May 6, 2020

BruceForstall commented Jun 22, 2020

AndyAyersMS commented Jul 31, 2020

AndyAyersMS commented Aug 7, 2020

briansull commented Aug 7, 2020

AndyAyersMS commented Aug 8, 2020

kouvel commented Aug 11, 2020

kouvel commented Aug 11, 2020

AndyAyersMS commented Aug 11, 2020

kouvel commented Aug 11, 2020

mrange commented Feb 1, 2020 •

edited by BruceForstall

Loading