.NET Core 3.1 integer performance 12% slower than .NET 4.8 #34414
I'd recommend rewriting the benchmark into smaller sub-benchmarks with BenchmarkDotNet and letting it take care of iterations and input; your code also measures Random.NextDouble speed, which I guess is not desired.
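A minimal sketch of that restructuring, assuming BenchmarkDotNet and illustrative names (FixedMulBenchmarks and MultiplyAll are not from the attached solution), with the random inputs generated once in [GlobalSetup] so Random.NextDouble stays out of the measurement:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FixedMulBenchmarks
{
    private double[] _testValues;

    [GlobalSetup]
    public void Setup()
    {
        // Generate inputs once, outside the measured code path.
        var rand = new Random(3);
        _testValues = new double[10_000];
        for (int i = 0; i < _testValues.Length; i++)
            _testValues[i] = rand.NextDouble();
    }

    [Benchmark]
    public long MultiplyAll()
    {
        // Stand-in body for the fixed-point multiply under test.
        long acc = 0;
        foreach (double v in _testValues)
            acc += (long)(v * 65536.0);
        return acc;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FixedMulBenchmarks>();
}

BenchmarkDotNet then chooses iteration counts, warms the code up, and reports per-invocation statistics instead of a single wall-clock total.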
It looks like you got trapped by Tiered Compilation; see Run-time configuration options for compilation for how to disable it. For benchmarking it is advisable to use https://benchmarkdotnet.org/.
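For reference, tiered compilation can be disabled while measuring either per project or via an environment variable (these are standard runtime configuration options, not specific to this issue).

In the .csproj:

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

Or as an environment variable before running (the pre-.NET 5 prefix): COMPlus_TieredCompilation=0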
Thanks for reporting it @kaalus.
The code does tier up as the key methods are called in a loop, and the driver methods with loops will bypass Tier0. It takes about 200ms on my machine for Tier1 code to get installed. However, I also see apparent regressions with tiering disabled. I converted this over to BDN (keeping the code structure as is) and it also reports a regression from desktop. The 3.1 runs are quite a bit more varied and somewhat bimodal, and their "fast" time is closer to 4.8's.
So this merits a closer look. Note that runtimes seem pretty volatile -- if I reduce the value of
Probably not going to get to this for 5.0, so marking as future.
Running a further modified BDN version (no looping in the test method, no random values) still shows 5.0 regressions: (edit -- disregard this, see below)
In the above, all the time is spent constructing the
and still what looks like a clear regression.
Results still fluctuate quite a bit from run to run and sometimes only show a very small regression. ETL analysis of just benchmark intervals shows:
so this suggests several methods may have worse CQ; however, the codegen looks similar.
@kaalus do you have a more fully realized use case for this code that shows regressions?
@jakobbotsch is going to take a fresh look at this.
The code generated for .NET 6.0 seems to be almost identical for all but the main driving loop: https://www.diffchecker.com/lztPvQsp (.NET 4.8 on the left, .NET 6 on the right)
This seems to be related to method ordering and the addresses that the benchmark code ultimately ends up at. The order that .NET 4.8 ends up with is particularly good for the µop cache, it seems. I can regress the .NET 4.8 case by 15% simply by adding more code that is never called during the benchmark, causing the benchmark code to end up being allocated at different addresses. In the good case the BDN code looks like this and we get the following results:
In those cases, the method orderings and addresses look like this:
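(The address listings themselves are not reproduced in this copy of the thread. As a rough way to obtain comparable numbers, a method's entry point can be queried via reflection, as in the sketch below; BenchmarkClass comes from the benchmark source, while the AddressDump helper and the method name "MultiplyTest" are hypothetical, and the returned pointer can be a stub rather than the final tier-1 code.)

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class AddressDump
{
    public static void Dump(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(
            methodName,
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.Static);
        // Force compilation so the handle points at generated code rather than a prestub.
        RuntimeHelpers.PrepareMethod(method.MethodHandle);
        Console.WriteLine($"{methodName}: 0x{(long)method.MethodHandle.GetFunctionPointer():X}");
    }
}

// Usage (hypothetical method name):
// AddressDump.Dump(typeof(BenchmarkClass), "MultiplyTest");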
These addresses seem consistent between runs. Now let's add some random code to the constructor of the benchmark class. Here the function Churn is added:

diff --git a/good.cs b/bad.cs
index c5c281a..20b96be 100644
--- a/good.cs
+++ b/bad.cs
@@ -44,16 +44,30 @@ public static class Program
     public BenchmarkClass()
     {
         double[] testValues = new double[iters];
         Random rand = new Random(3);
         for (int i = 0; i < testValues.Length; i++)
             testValues[i] = rand.NextDouble();
         _testValues = testValues;
+        Churn(18, new GenT<int> { Value = 3 });
+    }
+
+    struct GenT<T>
+    {
+        public T Value;
+    }
+
+    private int Churn<T>(int amount, GenT<T> value) where T : struct
+    {
+        if (amount == 0)
+            return 0;
+
+        return Churn<GenT<T>>(amount - 1, new GenT<GenT<T>> { Value = value }) + 1;
     }

Now the results look like this:
The benchmark regressed by 15%. The ordering and the deltas are the same, but the addresses are different:
We can add even more code to get fast benchmarks again, for example at n=30:
Now the question is if we can get the fast version in .NET 6. Without any
The first difference is that .NET 6 places the functions in a different order. The second is that the delta is larger for
From a code locality perspective this seems to be as good as we can do. Yet, even with varying the
I think there are a couple of actionable items here:
I don't think there's much we can do here for .NET 6 since CQ seems to be on par or better, so I will move this to .NET 7 for now. However, as a workaround you can mark the explicit double conversion with
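The attribute name is cut off here, but given the "inlining workaround" referenced in the closing comment, it is presumably MethodImplOptions.AggressiveInlining on the conversion operator. A sketch under that assumption, with a hypothetical Fixed64 type standing in for the fixed-point struct:

using System.Runtime.CompilerServices;

public readonly struct Fixed64
{
    public readonly long Raw;              // assumed Q32.32 raw representation
    public Fixed64(long raw) => Raw = raw;

    // Workaround: ask the JIT to inline the explicit double conversion at call sites.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static explicit operator Fixed64(double value)
        => new Fixed64((long)(value * 4294967296.0));   // 2^32 scaling for Q32.32
}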
I will close this given the inlining workaround above, and given that this was mainly micro-architectural. We also have the possibility of doing method ordering in R2R now which addresses some of the concerns around that. |
I was benchmarking the following code (64 bit fixed point type multiplication) under .NET 4.8 and .NET Core 3.1. Due to the jitter improvements made in .NET Core I expected the performance to be better. Unfortunately this code consistently runs about 12% slower than in .NET 4.8:
.NET 4.8: 2.66 sec
.NET Core 3.1: 2.99 sec
(i7 7700k, both projects in Release and "Prefer 32 bit" option disabled)
Please find attached VS solution with both projects: FixedBenchmark.zip
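(The attached solution is not reproduced inline in this thread. Purely for orientation, a 64-bit Q32.32 fixed-point multiply of the kind described above typically looks something like the sketch below; the Fixed64 name, layout, and omission of rounding/overflow handling are assumptions, not the reporter's actual code.)

public struct Fixed64
{
    public long Raw;   // assumed Q32.32 representation: value = Raw / 2^32

    public static Fixed64 operator *(Fixed64 a, Fixed64 b)
    {
        // Split each operand into 32-bit halves so the 128-bit product can be
        // assembled with 64-bit arithmetic, then shift the Q64.64 product back
        // down to Q32.32. Rounding and overflow checks are omitted.
        long aHi = a.Raw >> 32;
        ulong aLo = (uint)a.Raw;
        long bHi = b.Raw >> 32;
        ulong bLo = (uint)b.Raw;

        long result = (aHi * bHi) << 32;
        result += aHi * (long)bLo;
        result += (long)aLo * bHi;
        result += (long)((aLo * bLo) >> 32);
        return new Fixed64 { Raw = result };
    }
}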
category:cq
theme:needs-triage
skill-level:expert
cost:large