Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote (scalar replace) structs with more than 4 fields #6534

Closed
AndreyAkinshin opened this issue Aug 22, 2016 · 9 comments · Fixed by #88090
Closed

Promote (scalar replace) structs with more than 4 fields #6534

AndreyAkinshin opened this issue Aug 22, 2016 · 9 comments · Fixed by #88090
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Milestone

Comments

@AndreyAkinshin
Copy link
Member

AndreyAkinshin commented Aug 22, 2016

Let's look at the following code (based on this StackOverflow question):

public struct Cards3
{
    public byte C0, C1, C2;
}

public struct Cards8
{
    public byte C0, C1, C2, C3, C4, C5, C6, C7;
}

class Program
{
    static void Main()
    {           
        Run3();
        Run8();
    }

    private static Cards3[] cards3 = new Cards3[1];
    private static Cards8[] cards8 = new Cards8[1];

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Run3()
    {
        var c = cards3[0];
        return c.C0 - c.C1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Run8()
    {
        var c = cards8[0];
        return c.C0 - c.C1;
    }
}

Now let's look at the asm code (Windows 10, .NET Framework 4.6.1 (4.0.30319.42000), clrjit-v4.6.1080.0):

; Run3                                                    
            var c = cards3[0];                            
00007FFEDF0A4752  in          al,dx                       
00007FFEDF0A4753  sub         byte ptr [rax-48h],cl       
00007FFEDF0A4756  add         byte ptr [rax],0B0h         
00007FFEDF0A4759  out         72h,al                      
00007FFEDF0A475B  add         al,byte ptr [rax]           
00007FFEDF0A475D  add         byte ptr [rax-75h],cl       
00007FFEDF0A4760  add         byte ptr [rbx+76000878h],al 
00007FFEDF0A4766  adc         al,48h                      
00007FFEDF0A4768  add         eax,10h                     
00007FFEDF0A476B  movzx       edx,byte ptr [rax]          ; !!!
00007FFEDF0A476E  movzx       eax,byte ptr [rax+1]        ; !!!
            return c.C0 - c.C1;                           
00007FFEDF0A4772  sub         edx,eax                     ; !!!
00007FFEDF0A4774  mov         eax,edx                     ; !!!
00007FFEDF0A4776  add         rsp,28h                     
00007FFEDF0A477A  ret                                     
00007FFEDF0A477B  call        00007FFF3EB57BE0            
00007FFEDF0A4780  int         3                           
; Run8                                                    
            var c = cards8[0];                            
00007FFEDF0B49A2  in          al,dx                       
00007FFEDF0B49A3  sub         byte ptr [rbx],dh           
00007FFEDF0B49A5  ror         byte ptr [rax-77h],44h      
00007FFEDF0B49A9  and         al,20h                      
00007FFEDF0B49AB  mov         rax,202902D0088h            
00007FFEDF0B49B5  mov         rax,qword ptr [rax]         
00007FFEDF0B49B8  cmp         dword ptr [rax+8],0         
00007FFEDF0B49BC  jbe         00007FFEDF0B49D8            
00007FFEDF0B49BE  mov         rax,qword ptr [rax+10h]     ; !!!
00007FFEDF0B49C2  mov         qword ptr [rsp+20h],rax     ; !!!
            return c.C0 - c.C1;                           
00007FFEDF0B49C7  movzx       eax,byte ptr [rsp+20h]      ; !!!
00007FFEDF0B49CC  movzx       edx,byte ptr [rsp+21h]      ; !!!
00007FFEDF0B49D1  sub         eax,edx                     
00007FFEDF0B49D3  add         rsp,28h                     
00007FFEDF0B49D7  ret                                     
00007FFEDF0B49D8  call        00007FFF3EB57BE0            
00007FFEDF0B49DD  int         3                           

As you can see, in the Run3 case, RyuJIT keeps the target bytes (C0, C1) in the edx, eax registers; in the Run8 case, RyuJIT keeps them on stack (qword ptr [rsp+20h]). Why? This may slightly degrade the performance of an application (see these benchmarks).

category:cq
theme:structs
skill-level:expert
cost:large
impact:large

@AndreyAkinshin AndreyAkinshin changed the title Strange RyuJIT behaviour for aligned structures Strange RyuJIT behaviour (structure memory access) Aug 22, 2016
@CarolEidt
Copy link
Contributor

This is due to a current limitation in struct promotion (aka scalar replacement) that restricts it to structs with no more than 4 fields.

@CarolEidt
Copy link
Contributor

This is related to #6494, but as this is specific and has a specific test case, I'm going to rename this "Promote (scalar replace) structs with more than 4 fields".

@CarolEidt CarolEidt changed the title Strange RyuJIT behaviour (structure memory access) Promote (scalar replace) structs with more than 4 fields Aug 22, 2016
@AndreyAkinshin
Copy link
Member Author

@CarolEidt, thanks for the explanation!

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@russellhadley
Copy link
Contributor

@AndreyAkinshin What's the impact of this? Is this benchmark represent an issue in a particular workload or product? Or is this more of a general investigation? I think this issue is important but I'm trying to get a handle on how quickly we'd need to look at this.

@AndreyAkinshin
Copy link
Member Author

It's a general investigation, no real issues. =) I just thought that it would be nice to have such kind of optimizations.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@Chicken-Bones
Copy link
Contributor

Chicken-Bones commented Jun 27, 2022

We've carefully crafted a struct Enumerator to fit within the current limitations, but the lack of ReadOnlySpan and the need to pack fields makes the code hard to maintain and interop with. It appears that struct-in-struct is supported, but because Span has 2 fields where an array is just a reference, it pushes past the 4 field limit.

https://github.com/tModLoader/tModLoader/blob/1.4/patches/tModLoader/Terraria/ModLoader/Core/HookList.cs#L14

Fragment:

public ref struct InstanceEnumerator
{
	// These have to be arrays rather than ReadOnlySpan as the JIT won't unpack/promote 'struct in struct' (or span in struct either)
	// Revisit with .NET 6 https://github.com/dotnet/runtime/issues/37924
	private readonly Instanced<T>[] instances;
	private readonly int[] hookIndices;

	// ideally this would be Instanced<T> and drop the need for the ii variable in the MoveNext function
	// but again, struct in struct promotion (and also increasing the 'field count'
	private T current;
	// i and j are combined into a 64bit variable beacuse JIT currently won't unpack/promote structs with > 4 fields into registers
	// See https://github.com/dotnet/runtime/issues/6534
	private ulong ij;
	//private int i;
	//private int j;
	
	...

@jakobbotsch
Copy link
Member

Physical promotion (#83388) can potentially clean this up. With #85105 and DOTNET_JitEnablePhysicalPromotion=1, DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION_COST, we get:

; Assembly listing for method Program:Run8():int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    do-not-enreg[SF]
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V02 tmp1         [V02,T02] (  2,  2   )   ubyte  ->  rdx         single-def "V00.[000..001)"
;  V03 tmp2         [V03,T03] (  2,  2   )   ubyte  ->  rax         single-def "V00.[001..002)"
;  V04 tmp3         [V04,T00] (  3,  6   )   byref  ->  rax         single-def "Spilling address for field-by-field copy"
;  V05 tmp4         [V05,T01] (  3,  6   )     ref  ->  rax         single-def "arr expr"
;
; Lcl frame size = 40

G_M14509_IG01:
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M14509_IG02:
       mov      rax, 0xD1FFAB1E      ; data for Program:cards8
       mov      rax, gword ptr [rax]
       cmp      dword ptr [rax+08H], 0
       jbe      SHORT G_M14509_IG04
       add      rax, 16
       movzx    rdx, byte  ptr [rax]
       movzx    rax, byte  ptr [rax+01H]
       sub      edx, eax
       mov      eax, edx
                                                ;; size=34 bbWeight=1 PerfScore 11.00
G_M14509_IG03:
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25
G_M14509_IG04:
       call     CORINFO_HELP_RNGCHKFAIL
       int3
                                                ;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 49, prolog size 4, PerfScore 17.40, instruction count 14, allocated bytes for code 49 (MethodHash=6250c752) for method Program:Run8():int
; ============================================================

Currently this promotion will not happen without disabling costing checks because the profitability heuristic does not take into account that promotion can potentially make some struct loads much cheaper (in this case the array access). Hence the heuristic believes that with only one access to each field, it is best left alone.

Another thing is that we can avoid spilling simple address computations like these when they are just ADD(LCL_VAR, CNS) to avoid this add rax, 16.

@jakobbotsch
Copy link
Member

jakobbotsch commented May 25, 2023

After #86660 we get the above codegen by default when physical promotion is enabled.

Sadly the add rax, 16 remains, even after #85889. The problem is that the operand is an INDEX_ADDR node that has not been expanded at this point to something recognizable by decomposition.

It also remains in Run3 even though it has been expanded at the point of block morphing. There the problem is that the address is a quite complicated piece of IR that the address peeling does not recognize:

               [000010] DACXGO-----                           STORE_LCL_VAR struct<Cards3, 3>(P) V00 loc0         
                                                                ubyte  V00.Cards3:C0 (offs=0x00) -> V02 tmp1         
                                                                ubyte  V00.Cards3:C1 (offs=0x01) -> V03 tmp2         
                                                                ubyte  V00.Cards3:C2 (offs=0x02) -> V04 tmp3          (last use)
               [000009] nACXG+-----                         └──▌  BLK       struct<Cards3, 3>
               [000034] -ACXG+-----                            └──▌  COMMA     byref 
               [000021] DACXG+-----                               ├──▌  STORE_LCL_VAR ref    V05 tmp4         
               [000006] --CXG+-----                                 └──▌  COMMA     ref   
               [000005] H-CXG+-----                                    ├──▌  CALL help long   CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
               [000003] -----+----- arg0 in rcx                          ├──▌  CNS_INT   long   0x7ff7ef1aff68
               [000004] -----+----- arg1 in rdx                          └──▌  CNS_INT   int    3
               [000001] I---G+-----                                    └──▌  IND       ref   
               [000000] H----+-----                                       └──▌  CNS_INT(h) long   0x1bc92001d08 static Fseq[cards3]
               [000033] ---X-+-----                               └──▌  COMMA     byref 
               [000026] ---X-+-----                                  ├──▌  BOUNDS_CHECK_Rng void  
               [000007] -----+-----                                    ├──▌  CNS_INT   int    0
               [000025] ---X-+-----                                    └──▌  ARR_LENGTH int   
               [000022] -----+-----                                       └──▌  LCL_VAR   ref    V05 tmp4         
               [000032] -----+-----                                  └──▌  ARR_ADDR  byref Cards3[]
               [000031] -----+-----                                     └──▌  ADD       byref 
               [000023] -----+-----                                        ├──▌  LCL_VAR   ref    V05 tmp4         
               [000030] -----+-----                                        └──▌  CNS_INT   long   16

I've opened #86755 for this.

@jakobbotsch jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants