Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CQ: Big Value Type Suboptimal Inlining Code Generation #39732

Closed
nietras opened this issue Jul 21, 2020 · 5 comments
Closed

CQ: Big Value Type Suboptimal Inlining Code Generation #39732

nietras opened this issue Jul 21, 2020 · 5 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@nietras
Copy link
Contributor

nietras commented Jul 21, 2020

Improve support for "functor" pattern in .NET, reduce code duplication and help resolve an issue in #39543 which tries to consolidate sorting code on a generic TComparer code path. It would be great if the JIT could be improved in the face of big value types.

Benchmark Code

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;

public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [DisassemblyDiagnoser]
    public class CompareBigStruct
    {
        Random _random;
        readonly Comparison<BigStruct> _comparison;
        readonly Comparison<BigStruct> _comparisonDefaultComparer;
        readonly Comparer<BigStruct> _defaultComparer;
        readonly StructComparer<BigStruct> _structComparer;
        readonly StructComparisonComparer<BigStruct> _structComparisonComparer;

        public CompareBigStruct()
        {
            _defaultComparer = Comparer<BigStruct>.Default;
            _structComparer = new StructComparer<BigStruct>();
            _comparison = (x, y) => x.CompareTo(y);
            _comparisonDefaultComparer = _defaultComparer.Compare;
            _structComparisonComparer = new StructComparisonComparer<BigStruct>(_comparison);
        }

        public BigStruct X { get; set; }

        public BigStruct Y { get; set; }

        [GlobalSetup]
        public void Setup()
        {
            _random = new Random(21317834);
            X = new BigStruct(_random.Next());
            Y = new BigStruct(_random.Next());
        }

        [Benchmark(Baseline = true)]
        public int Comparison() => _comparison(X, Y);

        [Benchmark()]
        public int ComparisonDefaultComparer() => _comparisonDefaultComparer(X, Y);

        [Benchmark]
        public int ComparerDefault() => _defaultComparer.Compare(X, Y);

        [Benchmark]
        public int ComparerStruct() => _structComparer.Compare(X, Y);

        [Benchmark]
        public int ComparerStructComparison() => _structComparisonComparer.Compare(X, Y);
    }

    internal readonly struct StructComparer<T> 
        : IComparer<T>
        where T : IComparable<T>
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public int Compare(T x, T y) => x.CompareTo(y);
    }

    internal readonly struct StructComparisonComparer<T> : IComparer<T>
    {
        private readonly Comparison<T> _comparison;

        public StructComparisonComparer(Comparison<T> comparison) =>
            _comparison = comparison;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public int Compare( T x, T y) => _comparison(x, y);
    }

    public readonly struct BigStruct : IComparable<BigStruct>
    {
        private readonly long _long;
        private readonly int _int0;
        private readonly int _int1;
        private readonly short _short0;
        private readonly short _short1;
        private readonly short _short2;
        private readonly short _short3;
        private readonly double _double;

        public BigStruct(int value)
        {
            _long = value;
            _int0 = value;
            _int1 = value;
            _short0 = (short)value;
            _short1 = (short)value;
            _short2 = (short)value;
            _short3 = (short)value;
            _double = value;
        }

        public int CompareTo(BigStruct other) => _int1.CompareTo(other._int1);
    }
}

Benchmark Results

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.388 (2004/?/20H1)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20367.2
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

Method Mean Error StdDev Ratio Code Size
Comparison 3.500 ns 0.0065 ns 0.0051 ns 1.00 144 B
ComparisonDefaultComparer 3.366 ns 0.0071 ns 0.0067 ns 0.96 144 B
ComparerDefault 3.661 ns 0.0059 ns 0.0049 ns 1.05 160 B
ComparerStruct 2.232 ns 0.0066 ns 0.0058 ns 0.64 142 B
ComparerStructComparison 5.269 ns 0.0071 ns 0.0060 ns 1.51 219 B

Benchmark Disassembly

The problem is in the code generated for CompareBigStruct.ComparerStructComparison which can be easily seen.

.NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

; Program+CompareBigStruct.Comparison()
       sub       rsp,0A8
       vzeroupper
       mov       rax,[rcx+10]
       vmovdqu   xmm0,xmmword ptr [rcx+38]
       vmovdqu   xmmword ptr [rsp+88],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+48]
       vmovdqu   xmmword ptr [rsp+98],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+58]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+68]
       vmovdqu   xmmword ptr [rsp+78],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+88]
       vmovdqu   xmmword ptr [rsp+48],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+98]
       vmovdqu   xmmword ptr [rsp+58],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+28],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+78]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       mov       rcx,[rax+8]
       lea       rdx,[rsp+48]
       lea       r8,[rsp+28]
       call      qword ptr [rax+18]
       nop
       add       rsp,0A8
       ret
; Total bytes of code 144

.NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

; Program+CompareBigStruct.ComparisonDefaultComparer()
       sub       rsp,0A8
       vzeroupper
       mov       rax,[rcx+18]
       vmovdqu   xmm0,xmmword ptr [rcx+38]
       vmovdqu   xmmword ptr [rsp+88],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+48]
       vmovdqu   xmmword ptr [rsp+98],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+58]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+68]
       vmovdqu   xmmword ptr [rsp+78],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+88]
       vmovdqu   xmmword ptr [rsp+48],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+98]
       vmovdqu   xmmword ptr [rsp+58],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+28],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+78]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       mov       rcx,[rax+8]
       lea       rdx,[rsp+48]
       lea       r8,[rsp+28]
       call      qword ptr [rax+18]
       nop
       add       rsp,0A8
       ret
; Total bytes of code 144

.NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

; Program+CompareBigStruct.ComparerDefault()
       sub       rsp,0A8
       vzeroupper
       mov       rdx,[rcx+20]
       vmovdqu   xmm0,xmmword ptr [rcx+38]
       vmovdqu   xmmword ptr [rsp+88],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+48]
       vmovdqu   xmmword ptr [rsp+98],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+58]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+68]
       vmovdqu   xmmword ptr [rsp+78],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+88]
       vmovdqu   xmmword ptr [rsp+48],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+98]
       vmovdqu   xmmword ptr [rsp+58],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+28],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+78]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       mov       [rsp+20],rdx
       mov       rcx,rdx
       lea       rdx,[rsp+48]
       lea       r8,[rsp+28]
       mov       rax,[rsp+20]
       mov       rax,[rax]
       mov       rax,[rax+40]
       call      qword ptr [rax+20]
       nop
       add       rsp,0A8
       ret
; Total bytes of code 160

.NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

; Program+CompareBigStruct.ComparerStruct()
       sub       rsp,88
       vzeroupper
       vmovdqu   xmm0,xmmword ptr [rcx+38]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+48]
       vmovdqu   xmmword ptr [rsp+78],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+58]
       vmovdqu   xmmword ptr [rsp+48],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+68]
       vmovdqu   xmmword ptr [rsp+58],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+28],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+78]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+48]
       vmovdqu   xmmword ptr [rsp+8],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+58]
       vmovdqu   xmmword ptr [rsp+18],xmm0
       mov       eax,[rsp+14]
       cmp       [rsp+34],eax
       jge       short M00_L00
       mov       eax,0FFFFFFFF
       jmp       short M00_L02
M00_L00:
       cmp       [rsp+34],eax
       jle       short M00_L01
       mov       eax,1
       jmp       short M00_L02
M00_L01:
       xor       eax,eax
M00_L02:
       add       rsp,88
       ret
; Total bytes of code 142

.NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT

; Program+CompareBigStruct.ComparerStructComparison()
       sub       rsp,0E8
       vzeroupper
       lea       rdx,[rcx+30]
       vmovdqu   xmm0,xmmword ptr [rcx+38]
       vmovdqu   xmmword ptr [rsp+0C8],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+48]
       vmovdqu   xmmword ptr [rsp+0D8],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+58]
       vmovdqu   xmmword ptr [rsp+0A8],xmm0
       vmovdqu   xmm0,xmmword ptr [rcx+68]
       vmovdqu   xmmword ptr [rsp+0B8],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+0C8]
       vmovdqu   xmmword ptr [rsp+88],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+0D8]
       vmovdqu   xmmword ptr [rsp+98],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+0A8]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+0B8]
       vmovdqu   xmmword ptr [rsp+78],xmm0
       mov       rax,[rdx]
       vmovdqu   xmm0,xmmword ptr [rsp+88]
       vmovdqu   xmmword ptr [rsp+48],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+98]
       vmovdqu   xmmword ptr [rsp+58],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+28],xmm0
       vmovdqu   xmm0,xmmword ptr [rsp+78]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       mov       rcx,[rax+8]
       lea       rdx,[rsp+48]
       lea       r8,[rsp+28]
       call      qword ptr [rax+18]
       nop
       add       rsp,0E8
       ret
; Total bytes of code 219

cc: @jkotas @dotnet/jit-contrib

category:cq
theme:structs
skill-level:expert
cost:large
impact:medium

@nietras nietras added the tenet-performance Performance related issue label Jul 21, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jul 21, 2020
@nietras
Copy link
Contributor Author

nietras commented Jul 21, 2020

cc: @EgorBo help! 😉

@AndyAyersMS
Copy link
Member

I'm going to mark this as future.

Passing large structs by value through layers of calls puts a premium on struct optimizations like copy prop and reverse copy prop; the jit is not proficient at these opts.

We should be able to extend the "tail call copy avoidance" change to optimize away the last of these; all we need is a call in tail position, not an actual tail call (see #33004).

SysV codegen is somewhat amusing (this is with preview 5). cc @sandreenko

; Program+StructComparisonComparer`1[[Program+BigStruct, r39732]].Compare(BigStruct, BigStruct)
       push      rbp
       vzeroupper
       mov       rbp,rsp
       vmovdqu   xmm0,xmmword ptr [rbp+10]
       vmovdqu   xmmword ptr [rbp+10],xmm0
       vmovdqu   xmm0,xmmword ptr [rbp+20]
       vmovdqu   xmmword ptr [rbp+20],xmm0
       vmovdqu   xmm0,xmmword ptr [rbp+30]
       vmovdqu   xmmword ptr [rbp+30],xmm0
       vmovdqu   xmm0,xmmword ptr [rbp+40]
       vmovdqu   xmmword ptr [rbp+40],xmm0
       mov       rax,[rdi]
       mov       rdi,[rax+8]
       mov       rax,[rax+18]
       pop       rbp
       jmp       rax
; Total bytes of code 62
``

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jul 22, 2020
@AndyAyersMS AndyAyersMS added this to the Future milestone Jul 23, 2020
@AndyAyersMS
Copy link
Member

@sandreenko can you look into this as part of your possible .Net 6 work?

@sandreenko sandreenko self-assigned this Jan 15, 2021
@sandreenko sandreenko modified the milestones: Future, 6.0.0 Jan 15, 2021
@sandreenko
Copy link
Contributor

@AndyAyersMS sure!

@jakobbotsch
Copy link
Member

The codegen today (with TieredPGO disabled, to make it faithful to the original benchmark) is:

; Assembly listing for method Program+CompareBigStruct:ComparerStructComparison():int:this
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  5,  5   )     ref  ->  rcx         this class-hnd single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[S] "impAppendStmt"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[S] "spilled call-like call argument"
;  V05 tmp4         [V05    ] (  2,  4   )  struct (32) [rsp+48H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (32) [rsp+28H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;  V07 tmp6         [V07,T01] (  3,  6   )     ref  ->  rax         single-def "argument with side effect"
;
; Lcl frame size = 104
G_M64140_IG01:  ;; offset=0000H
       sub      rsp, 104
       vzeroupper
                                                ;; size=7 bbWeight=1 PerfScore 1.25
G_M64140_IG02:  ;; offset=0007H
       mov      rax, gword ptr [rcx+30H]
       vmovdqu  ymm0, ymmword ptr [rcx+38H]
       vmovdqu  ymmword ptr [rsp+48H], ymm0
       vmovdqu  ymm0, ymmword ptr [rcx+58H]
       vmovdqu  ymmword ptr [rsp+28H], ymm0
       mov      rcx, gword ptr [rax+08H]
       lea      rdx, [rsp+48H]
       lea      r8, [rsp+28H]
       call     [rax+18H]System.Comparison`1[Program+BigStruct]:Invoke(Program+BigStruct,Program+BigStruct):int:this
       nop
                                                ;; size=44 bbWeight=1 PerfScore 20.25
G_M64140_IG03:  ;; offset=0033H
       vzeroupper
       add      rsp, 104
       ret
                                                ;; size=8 bbWeight=1 PerfScore 2.25
; Total bytes of code 59, prolog size 7, PerfScore 29.65, instruction count 15, allocated bytes for code 59 (MethodHash=bf2d0573) for method Program+CompareBigStruct:ComparerStructComparison():int:this
; ============================================================

So looks like this has been fixed at some point.

Bonus: results on my machine on main with and without physical promotion:

Method Job EnvironmentVariables Mean Error StdDev Ratio RatioSD
Comparison Job-UXLFRT Empty 1.4742 ns 0.0070 ns 0.0065 ns 1.000 0.00
ComparisonDefaultComparer Job-UXLFRT Empty 1.2121 ns 0.0420 ns 0.0393 ns 0.822 0.03
ComparerDefault Job-UXLFRT Empty 1.2750 ns 0.0095 ns 0.0079 ns 0.865 0.01
ComparerStruct Job-UXLFRT Empty 1.1216 ns 0.0108 ns 0.0096 ns 0.761 0.01
ComparerStructComparison Job-UXLFRT Empty 1.5564 ns 0.0110 ns 0.0097 ns 1.056 0.01
Comparison Job-VPJPAQ DOTNET_JitEnablePhysicalPromotion=1 0.6453 ns 0.0142 ns 0.0132 ns 0.438 0.01
ComparisonDefaultComparer Job-VPJPAQ DOTNET_JitEnablePhysicalPromotion=1 0.6634 ns 0.0134 ns 0.0125 ns 0.450 0.01
ComparerDefault Job-VPJPAQ DOTNET_JitEnablePhysicalPromotion=1 0.5734 ns 0.0158 ns 0.0148 ns 0.389 0.01
ComparerStruct Job-VPJPAQ DOTNET_JitEnablePhysicalPromotion=1 0.0145 ns 0.0029 ns 0.0022 ns 0.010 0.00
ComparerStructComparison Job-VPJPAQ DOTNET_JitEnablePhysicalPromotion=1 0.7441 ns 0.0066 ns 0.0051 ns 0.505 0.00

@ghost ghost locked as resolved and limited conversation to collaborators Jul 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants