Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Improve Dictionary<K,T> CQ #14030

Closed
wants to merge 1 commit into from
Closed

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Sep 17, 2017

Incorporate @AndyAyersMS Potential CQ improvements in Dictionary<K,T>

Since calls to the comparer can have arbitrary side effects, the jit will reload data members after the call. This leads to redundant loads and range checks.

Try and eliminate these by caching data members in locals.

For default comparer, use regular inline method calls rather than a virtual function call on interface type

Split default and custom comparer paths

For BCL ValueType primitives byte, sbyte, ushort, short, uint, int, ulong, long, IntPtr, UIntPtr, Guid cast via object to the type and use .Equals directly them directly; then rely on the Jit to elide the boxing - meaning the mostly completely inline (e.g. Guid doesn't; though isn't a virtual call)

Find the entry and pass that back using ref returns; rather than index to array that then needs to be looked up again.

More extensive use of ref locals.

Will follow up with metrics

}
else
{
if (typeof(TKey) == typeof(byte))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that we want to use this pattern and modify every collection in CoreFX to use it. I think it is going in a wrong direction and only fixes the problem for primitive types.

Instead, the JIT should recognize EqualityComparer<TKey>.Default as intrinsic - https://github.com/dotnet/coreclr/issues/6688 is about this. So this would be just: EqualityComparer.Default.Equals(key0, key1).

Copy link
Member Author

@benaadams benaadams Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #if EQUALITYCOMPARER_INTRINSIC, that work which uses EqualityComparer<TKey>.Default

Copy link
Member Author

@benaadams benaadams Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only fixes the problem for primitive types

Would be good if C# could express it for all types and all functions, not just relying on the Jit to pick up on EqualityComparer<TKey>.Default.Equals only dotnet/csharplang#905

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas can you elaborate on what you have in mind for #6688? The source link in that issue seems not to work anymore.

Are you suggesting that IEqualityComparer<TKey>'s methods be marked as intrinsics, and then the jit can inject tests at call sites via this interface when TKey is known to see if the value is the EqualityComparer<TKey>.Default?

Copy link
Member Author

@benaadams benaadams Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This this is the file the orignal link pointed to EqualityComparerHelpers looks like some constrained static methods it would swap with?

Copy link
Member

@jkotas jkotas Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the jit can inject tests at call sites via this interface when TKey is known to see if the value is the EqualityComparer.Default

I meant that the JIT should recognize EqualityComparer<TKey>.Default.Equals(a,b) pattern and replace it with the expanded code. For example, the following https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/ValueTuple.cs#L740 should be just comparing integers directly for tuple of integers. It does the virtual calls today.

If it is too hard to recognize EqualityComparer<TKey>.Default.Equals(a,b) pattern in the JIT, I would be ok to add a static EqualityComparer<T>.Equals(a,b) method and recognize that as intrinsic. It would need to be approved as public API to be widely usable and all relevant places changed to call it, so it is less preferred.

@benaadams
Copy link
Member Author

benaadams commented Sep 18, 2017

Dictionary`2:TryGetValue(char,byref):bool:this

Before

; Assembly listing for method Dictionary`2:TryGetValue(char,byref):bool:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  3.50)     ref  ->  rsi         this class-hnd
;  V01 arg1         [V01,T02] (  3,  3   )    char  ->  rdx        
;  V02 arg2         [V02,T01] (  4,  3   )   byref  ->  rdi        
;  V03 loc0         [V03,T03] (  4,  3   )     int  ->  rax        
;  V04 tmp0         [V04,T04] (  3,  3   )     ref  ->  rdx        
;  V05 OutArgs      [V05    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40

G_M28467_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
       498BF8               mov      rdi, r8

G_M28467_IG02:
       0FB7D2               movzx    rdx, dx
       488BCE               mov      rcx, rsi
       E800000000           call     Dictionary`2:FindEntry(char):int:this
       85C0                 test     eax, eax
       7C29                 jl       SHORT G_M28467_IG04
       488B5610             mov      rdx, gword ptr [rsi+16]
       3B4208               cmp      eax, dword ptr [rdx+8]
       732C                 jae      SHORT G_M28467_IG06
       4863C8               movsxd   rcx, eax
       488D0C49             lea      rcx, [rcx+2*rcx]
       488B54CA10           mov      rdx, gword ptr [rdx+8*rcx+16]
       488BCF               mov      rcx, rdi
       E800000000           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       B801000000           mov      eax, 1

G_M28467_IG03:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M28467_IG04:
       33C0                 xor      rax, rax
       488907               mov      qword ptr [rdi], rax

G_M28467_IG05:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M28467_IG06:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 86, prolog size 6 for method Dictionary`2:TryGetValue(char,byref):bool:this

After

; Assembly listing for method Dictionary`2:TryGetValue(char,byref):bool:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )     ref  ->  rcx         this class-hnd
;  V01 arg1         [V01,T02] (  3,  3   )    char  ->  rdx        
;  V02 arg2         [V02,T01] (  3,  3   )   byref  ->  rsi        
;  V03 loc0         [V03,T06] (  2,  1.50)   byref  ->  rax        
;  V04 loc1         [V04    ] (  3,  3   )    bool  ->  [rsp+0x28]   do-not-enreg[X] addr-exposed ld-addr-op
;* V05 loc2         [V05    ] (  0,  0   )     ref  ->  zero-ref    ld-addr-op class-hnd
;  V06 tmp0         [V06,T03] (  3,  2   )   byref  ->  rsi        
;  V07 tmp1         [V07,T04] (  3,  2   )   byref  ->  rcx        
;  V08 tmp2         [V08,T05] (  3,  2   )     ref  ->  rdx        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 48

G_M28468_IG01:
       56                   push     rsi
       4883EC30             sub      rsp, 48
       498BF0               mov      rsi, r8

G_M28468_IG02:
       0FB7D2               movzx    rdx, dx
       4C8D442428           lea      r8, bword ptr [rsp+28H]
       E800000000           call     Dictionary`2:FindEntry(char,byref):byref:this
       807C242800           cmp      byte  ptr [rsp+28H], 0
       7507                 jne      SHORT G_M28468_IG03
       488BCE               mov      rcx, rsi
       33D2                 xor      rdx, rdx
       EB06                 jmp      SHORT G_M28468_IG04

G_M28468_IG03:
       488BCE               mov      rcx, rsi
       488B10               mov      rdx, gword ptr [rax]

G_M28468_IG04:
       E800000000           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       8B442428             mov      eax, dword ptr [rsp+28H]
       0FB6C0               movzx    rax, al

G_M28468_IG05:
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret      

; Total bytes of code 59, prolog size 5 for method Dictionary`2:TryGetValue(char,byref):bool:this

@benaadams benaadams changed the title Improve Dictionary<K,T> CQ for primitive BCL ValueTypes Improve Dictionary<K,T> CQ Sep 18, 2017
@benaadams
Copy link
Member Author

benaadams commented Sep 18, 2017

Before

Dictionary`2:FindEntry(char):int:this

; Assembly listing for method Dictionary`2:FindEntry(char):int:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T03] ( 29, 53   )     ref  ->  rsi         this class-hnd
;  V01 arg1         [V01,T13] (  7,  7.50)    char  ->  rdx        
;  V02 loc0         [V02,T09] ( 14, 24.50)     int  ->  rbx        
;  V03 loc1         [V03,T01] ( 25, 72   )     int  ->  rbp        
;* V04 tmp0         [V04    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;  V05 tmp1         [V05,T12] ( 13, 13   )     ref  ->  rcx        
;  V06 tmp2         [V06,T11] ( 18, 18   )     int  ->  rdx        
;  V07 tmp3         [V07,T00] ( 15,120   )     ref  ->  rcx        
;  V08 tmp4         [V08,T10] (  4, 16   )     ref  ->  [rsp+0x28]  
;  V09 tmp5         [V09,T06] ( 10, 40   )     ref  ->  rdx        
;  V10 tmp6         [V10,T07] ( 10, 40   )     int  ->   r8        
;  V11 tmp7         [V11,T02] (  9, 72   )     ref  ->  rax        
;  V12 OutArgs      [V12    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V13 cse0         [V13,T04] ( 15, 50   )   byref  ->  rcx        
;  V14 cse1         [V14,T05] ( 11, 44   )    long  ->  r14        
;  V15 cse2         [V15,T15] ( 17,  8.50)     int  ->   r8        
;  V16 cse3         [V16,T08] (  9, 30   )     ref  ->  rcx        
;  V17 cse4         [V17,T16] ( 11,  5.50)     ref  ->  rax        
;  V18 cse5         [V18,T14] (  9,  9   )     int  ->  rdi        
;
; Lcl frame size = 48

G_M38464_IG01:
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC30             sub      rsp, 48
       488BF1               mov      rsi, rcx

G_M38464_IG02:
       48837E0800           cmp      gword ptr [rsi+8], 0
       0F848C000000         je       G_M38464_IG05
       488B4E18             mov      rcx, gword ptr [rsi+24]
       0FB7FA               movzx    rdi, dx
       8BD7                 mov      edx, edi
       4C8D1D00000000       lea      r11, [(reloc)]
       3909                 cmp      dword ptr [rcx], ecx
       41FF13               call     qword ptr [r11]IEqualityComparer`1:GetHashCode(char):int:this
       8BD8                 mov      ebx, eax
       81E3FFFFFF7F         and      ebx, 0x7FFFFFFF
       488B4608             mov      rax, gword ptr [rsi+8]
       488BC8               mov      rcx, rax
       448B4008             mov      r8d, dword ptr [rax+8]
       8BC3                 mov      eax, ebx
       99                   cdq      
       41F7F8               idiv     edx:eax, r8d
       413BD0               cmp      edx, r8d
       7376                 jae      SHORT G_M38464_IG09
       4863D2               movsxd   rdx, edx
       8B6C9110             mov      ebp, dword ptr [rcx+4*rdx+16]
       85ED                 test     ebp, ebp
       7C4E                 jl       SHORT G_M38464_IG05

G_M38464_IG03:
       488B4E10             mov      rcx, gword ptr [rsi+16]
       3B6908               cmp      ebp, dword ptr [rcx+8]
       7362                 jae      SHORT G_M38464_IG09
       4863D5               movsxd   rdx, ebp
       4C8D3452             lea      r14, [rdx+2*rdx]
       4A8D4CF110           lea      rcx, bword ptr [rcx+8*r14+16]
       395908               cmp      dword ptr [rcx+8], ebx
       7522                 jne      SHORT G_M38464_IG04
       488B5618             mov      rdx, gword ptr [rsi+24]
       440FB74110           movzx    r8, word  ptr [rcx+16]
       488BCA               mov      rcx, rdx
       418BD0               mov      edx, r8d
       448BC7               mov      r8d, edi
       4C8D1D00000000       lea      r11, [(reloc)]
       3909                 cmp      dword ptr [rcx], ecx
       41FF13               call     qword ptr [r11]IEqualityComparer`1:Equals(char,char):bool:this
       85C0                 test     eax, eax
       7522                 jne      SHORT G_M38464_IG07

G_M38464_IG04:
       488B4610             mov      rax, gword ptr [rsi+16]
       3B6808               cmp      ebp, dword ptr [rax+8]
       7326                 jae      SHORT G_M38464_IG09
       428B6CF01C           mov      ebp, dword ptr [rax+8*r14+28]
       85ED                 test     ebp, ebp
       7DB2                 jge      SHORT G_M38464_IG03

G_M38464_IG05:
       B8FFFFFFFF           mov      eax, -1

G_M38464_IG06:
       4883C430             add      rsp, 48
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       C3                   ret      

G_M38464_IG07:
       8BC5                 mov      eax, ebp

G_M38464_IG08:
       4883C430             add      rsp, 48
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       C3                   ret      

G_M38464_IG09:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 199, prolog size 13 for method Dictionary`2:FindEntry(char):int:this

After DefaultComparer

Dictionary`2:FindEntry(char,byref):byref:this

Marking Dictionary`2:FindEntry(char,byref,ref):byref:this as NOINLINE because of too many il bytes
Marking Dictionary`2:FindEntry(char,byref,ref):byref:this as NOINLINE because of too many il bytes
**************** Inline Tree
Inlines into 060039B9 Dictionary`2:FindEntry(char,byref):byref:this
  [0 IL=0028 TR=000079 060039BA] [FAILED: too many il bytes] Dictionary`2:FindEntry(char,byref,ref):byref:this
  [0 IL=0038 TR=000034 060039BA] [FAILED: too many il bytes] Dictionary`2:FindEntry(char,byref,ref):byref:this
Budget: initialTime=192, finalTime=192, initialBudget=1920, currentBudget=1920
Budget: initialSize=1134, finalSize=1134
; Assembly listing for method Dictionary`2:FindEntry(char,byref):byref:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  9,  6   )     ref  ->  rsi         this class-hnd
;  V01 arg1         [V01,T02] (  4,  3   )    char  ->  rbx        
;  V02 arg2         [V02,T01] (  4,  3   )   byref  ->  rdi        
;  V03 loc0         [V03,T07] (  3,  2.50)     ref  ->  rbp         class-hnd
;* V04 tmp0         [V04    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;  V05 tmp1         [V05,T03] (  4,  3.50)    long  ->  rcx        
;  V06 tmp2         [V06,T04] (  4,  3.50)    long  ->  rdx        
;  V07 tmp3         [V07,T05] (  4,  3.50)    long  ->  rcx        
;  V08 tmp4         [V08,T06] (  4,  3.50)    long  ->  rdx        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (40) [rsp+0x00]  
;
; Lcl frame size = 56

G_M38464_IG01:
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC38             sub      rsp, 56
       48894C2430           mov      qword ptr [rsp+30H], rcx
       488BF1               mov      rsi, rcx
       8BDA                 mov      ebx, edx
       498BF8               mov      rdi, r8

G_M38464_IG02:
       488B6E18             mov      rbp, gword ptr [rsi+24]
       4885ED               test     rbp, rbp
       7543                 jne      SHORT G_M38464_IG05
       488B0E               mov      rcx, qword ptr [rsi]
       488B5130             mov      rdx, qword ptr [rcx+48]
       488B12               mov      rdx, qword ptr [rdx]
       488B5228             mov      rdx, qword ptr [rdx+40]
       4885D2               test     rdx, rdx
       750F                 jne      SHORT G_M38464_IG03
       488D1500000000       lea      rdx, [(reloc)]
       E800000000           call     CORINFO_HELP_RUNTIMEHANDLE_CLASS
       488BD0               mov      rdx, rax

G_M38464_IG03:
       4533C0               xor      r8, r8
       4C89442420           mov      gword ptr [rsp+20H], r8
       440FB7C3             movzx    r8, bx
       488BCE               mov      rcx, rsi
       4C8BCF               mov      r9, rdi
       E800000000           call     Dictionary`2:FindEntry(char,byref,ref):byref:this
       90                   nop      

G_M38464_IG04:
       4883C438             add      rsp, 56
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M38464_IG05:
       488B0E               mov      rcx, qword ptr [rsi]
       488B5130             mov      rdx, qword ptr [rcx+48]
       488B12               mov      rdx, qword ptr [rdx]
       488B5220             mov      rdx, qword ptr [rdx+32]
       4885D2               test     rdx, rdx
       750F                 jne      SHORT G_M38464_IG06
       488D1500000000       lea      rdx, [(reloc)]
       E800000000           call     CORINFO_HELP_RUNTIMEHANDLE_CLASS
       488BD0               mov      rdx, rax

G_M38464_IG06:
       48896C2420           mov      gword ptr [rsp+20H], rbp
       440FB7C3             movzx    r8, bx
       488BCE               mov      rcx, rsi
       4C8BCF               mov      r9, rdi
       E800000000           call     Dictionary`2:FindEntry(char,byref,ref):byref:this
       90                   nop      

G_M38464_IG07:
       4883C438             add      rsp, 56
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

; Total bytes of code 161, prolog size 13 for method Dictionary`2:FindEntry(char,byref):byref:this

Dictionary`2:FindEntry(char,byref,ref):byref:this

Marking Dictionary`2:FindEntry(char,byref,ref):byref:this as NOINLINE because of too many il bytes
Devirtualized virtual call to EqualityComparer`1:Equals; now direct call to GenericEqualityComparer`1:Equals [exact]
Successfully inlined Char:GetHashCode():int:this (9 IL bytes) (depth 1) [below ALWAYS_INLINE size]
Successfully inlined HashHelpers:FindBucket(int,int):int (4 IL bytes) (depth 1) [below ALWAYS_INLINE size]
Successfully inlined EqualityComparer`1:get_Default():ref (6 IL bytes) (depth 1) [below ALWAYS_INLINE size]
Successfully inlined GenericEqualityComparer`1:Equals(char,char):bool:this (45 IL bytes) (depth 1) [aggressive inline attribute]
Successfully inlined Char:Equals(char):bool:this (6 IL bytes) (depth 2) [below ALWAYS_INLINE size]
**************** Inline Tree
Inlines into 060039BA Dictionary`2:FindEntry(char,byref,ref):byref:this
  [1 IL=0053 TR=000044 060010FC] [below ALWAYS_INLINE size] Char:GetHashCode():int:this
  [2 IL=0120 TR=000063 06003858] [below ALWAYS_INLINE size] HashHelpers:FindBucket(int,int):int
  [3 IL=0175 TR=000110 060039D7] [below ALWAYS_INLINE size] EqualityComparer`1:get_Default():ref
  [4 IL=0188 TR=000116 060039E0] [aggressive inline attribute] GenericEqualityComparer`1:Equals(char,char):bool:this
    [5 IL=0025 TR=000200 060010FE] [below ALWAYS_INLINE size] Char:Equals(char):bool:this
  [0 IL=0260 TR=000023 060039BB] [FAILED: noinline per IL/cached result] Dictionary`2:get_NotFound():byref:this
Budget: initialTime=858, finalTime=928, initialBudget=8580, currentBudget=8656
Budget: increased by 76 because of force inlines
Budget: initialSize=6196, finalSize=6196
; Assembly listing for method Dictionary`2:FindEntry(char,byref,ref):byref:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T09] (  6,  4.50)     ref  ->  rcx         this class-hnd
;* V01 TypeCtx      [V01    ] (  0,  0   )    long  ->  zero-ref   
;  V02 arg1         [V02,T08] (  5,  5   )    char  ->  [rsp+0x50]   do-not-enreg[F] ld-addr-op
;  V03 arg2         [V03,T12] (  4,  3.50)   byref  ->   r9        
;* V04 arg3         [V04    ] (  0,  0   )     ref  ->  zero-ref    class-hnd
;  V05 loc0         [V05,T10] ( 10,  6   )     ref  ->   r8         class-hnd
;  V06 loc1         [V06,T06] (  7,  7   )     int  ->  rsi        
;  V07 loc2         [V07,T02] (  6, 10   )     ref  ->  rdi         class-hnd
;  V08 loc3         [V08,T00] ( 11, 26.50)     int  ->  rbx        
;  V09 loc4         [V09,T01] (  6, 18.50)   byref  ->  rdx        
;  V10 loc5         [V10,T07] (  7,  7   )     int  ->  r11        
;  V11 tmp1         [V11,T13] (  2,  4   )    bool  ->   r8        
;  V12 tmp2         [V12,T04] (  2,  8   )    char  ->   r8         ld-addr-op
;* V13 tmp3         [V13    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;  V14 tmp4         [V14,T05] (  2,  8   )    char  ->  r10         ld-addr-op
;* V15 tmp5         [V15    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;* V16 tmp6         [V16    ] (  0,  0   )    char  ->  zero-ref    ld-addr-op
;  V17 OutArgs      [V17    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V18 cse0         [V18,T11] ( 11,  5.50)     int  ->  rdx        
;  V19 cse1         [V19,T03] (  5,  9.50)     int  ->  rax        
;  V20 cse2         [V20,T14] (  6,  3   )     int  ->  r10        
;
; Lcl frame size = 32

G_M37161_IG01:
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32
       4489442450           mov      dword ptr [rsp+50H], r8d

G_M37161_IG02:
       41C60101             mov      byte  ptr [r9], 1
       4C8B4108             mov      r8, gword ptr [rcx+8]
       4D85C0               test     r8, r8
       7477                 je       SHORT G_M37161_IG05
       0FB7442450           movzx    rax, word  ptr [rsp+50H]
       0FB7542450           movzx    rdx, word  ptr [rsp+50H]
       C1E210               shl      edx, 16
       0BC2                 or       eax, edx
       8BF0                 mov      esi, eax
       81E6FFFFFF7F         and      esi, 0x7FFFFFFF
       488B7910             mov      rdi, gword ptr [rcx+16]
       458B5008             mov      r10d, dword ptr [r8+8]
       458BDA               mov      r11d, r10d
       8BC6                 mov      eax, esi
       33D2                 xor      rdx, rdx
       41F7F3               div      edx:eax, r11d
       413BD2               cmp      edx, r10d
       7366                 jae      SHORT G_M37161_IG09
       4863C2               movsxd   rax, edx
       418B5C8010           mov      ebx, dword ptr [r8+4*rax+16]
       85DB                 test     ebx, ebx
       7C3D                 jl       SHORT G_M37161_IG05
       8B4708               mov      eax, dword ptr [rdi+8]

G_M37161_IG03:
       3BD8                 cmp      ebx, eax
       7353                 jae      SHORT G_M37161_IG09
       4863D3               movsxd   rdx, ebx
       488D1452             lea      rdx, [rdx+2*rdx]
       488D54D710           lea      rdx, bword ptr [rdi+8*rdx+16]
       397208               cmp      dword ptr [rdx+8], esi
       751E                 jne      SHORT G_M37161_IG04
       440FB74210           movzx    r8, word  ptr [rdx+16]
       448B542450           mov      r10d, dword ptr [rsp+50H]
       450FB7D2             movzx    r10, r10w
       453BC2               cmp      r8d, r10d
       410F94C0             sete     r8b
       450FB6C0             movzx    r8, r8b
       4585C0               test     r8d, r8d
       7519                 jne      SHORT G_M37161_IG07

G_M37161_IG04:
       8B5A0C               mov      ebx, dword ptr [rdx+12]
       85DB                 test     ebx, ebx
       7DC6                 jge      SHORT G_M37161_IG03

G_M37161_IG05:
       41C60100             mov      byte  ptr [r9], 0
       E800000000           call     Dictionary`2:get_NotFound():byref:this
       90                   nop      

G_M37161_IG06:
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M37161_IG07:
       488BC2               mov      rax, rdx

G_M37161_IG08:
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M37161_IG09:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 179, prolog size 12 for method Dictionary`2:FindEntry(char,byref,ref):byref:this

@benaadams
Copy link
Member Author

benaadams commented Sep 18, 2017

The two functions inline for structs; but for reference keys fail with

[FAILED: complex handle access] Dictionary`2:KeyHashCode(ref):int:this
[FAILED: complex handle access] Dictionary`2:KeyEquals(ref,ref):bool:this

Equals is a bit weird; but KeyHashCode isn't very complex? Maybe its the default(TKey) == null test?

Adjusting code

@AndyAyersMS
Copy link
Member

In the cases where key or value are reference types, we'll generate a shared methods. In shared methods, inlining generic methods from some other class is not supported.

Would have liked to see the jit fare better in this stretch (char compare)... actually not clear what is going on here, seems like the jit is comparing words and bytes.

       450FB75110           movzx    r10, word  ptr [r9+16]
       448B5C2468           mov      r11d, dword ptr [rsp+68H]
       450FB7DB             movzx    r11, r11w
       453BD3               cmp      r10d, r11d
       410F94C2             sete     r10b
       450FB6D2             movzx    r10, r10b
       4585D2               test     r10d, r10d

@benaadams
Copy link
Member Author

actually not clear what is going on here, seems like the jit is comparing words and bytes.

Let me re-run the tests locally; from the failures seem to have broken something from a last minute late night change... 😒

@mikedn
Copy link

mikedn commented Sep 18, 2017

Would have liked to see the jit fare better in this stretch (char compare)... actually not clear what is going on here, seems like the jit is comparing words and bytes.

I see that it compares 2 words widened to dword and then compares to result of the previous comparation with 0. Looks like the old #914

IEqualityComparer<TKey> customComparer = _customComparer;
if (customComparer == null)
{
if (default(TKey) == null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary. result = key?.GetHashCode() ?? 0; should work well for all cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also key is checked everywhere before being added; and throws ArgumentNullExceptions so can just be key.GetHashCode()?

@benaadams
Copy link
Member Author

Testing error is definitely caused by the code change; though is something I haven't seen before

Assert failure(PID 6924 [0x00001b0c], Thread: 5256 [0x1488]): !"Recursion in CLRException::GetThrowable"

CORECLR! CLRException::GetThrowable + 0x4A2 (0x00007fff`6a6a1ea2)
CORECLR! CLRException::GetThrowableFromException + 0x252 (0x00007fff`6a6a2592)
CORECLR! UnwindAndContinueRethrowHelperInsideCatch + 0xBB (0x00007fff`6a65890b)
CORECLR! `JIT_RngChkFail'::`1'::catch$5 + 0xE1 (0x00007fff`6b379ab5)
CORECLR! CallSettingFrame + 0x20 (0x00007fff`6b18fe70)
CORECLR! _CxxCallCatchBlock + 0x15A (0x00007fff`6b182aca)
NTDLL! RtlCaptureContext + 0x3E3 (0x00007fff`abc31983)
CORECLR! JIT_RngChkFail + 0x275 (0x00007fff`6a863c75)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x00007fff`69ad64e8)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x00007fff`69ad60c5)
    File: c:\github\coreclr\src\vm\clrex.cpp Line: 149
    Image: C:\GitHub\coreclr\bin\tests\Windows_NT.x64.Debug\Tests\Core_Root\CoreRun.exe

@benaadams
Copy link
Member Author

benaadams commented Sep 18, 2017

Guessing the Dictionary is used in generating the Exception; so throws another exception... etc

@benaadams
Copy link
Member Author

Missing & 0x7FFFFFFF XD

@benaadams
Copy link
Member Author

Is this a recognized bounds check eliminating pattern?

var buckets = _buckets;
var val = buckets[(uint)hashCode % buckets.Length]

@AndyAyersMS
Copy link
Member

Not sure, but would guess it's not. Will check in later today if you don't.

It should be, as well as the related & variant.

@mikedn
Copy link

mikedn commented Sep 18, 2017

Is this a recognized bounds check eliminating pattern?

Nope. On my things to look at list :). Watch out with the cast to uint, both operands should be cast to uint, otherwise you may get a long modulo. Ideally this should not be needed, the JIT could figure out that the value is positive from & 0x7FFFFFFF.

@benaadams
Copy link
Member Author

benaadams commented Sep 19, 2017

Updated asm with current; doesn't look like % removes the bounds check

@benaadams
Copy link
Member Author

2 items not currently covered.

Nullables - I think they will go class path with default(TKey) == null so I'll need to put them on a custom comparer from start (basically what it does now)

Remove functions... I've split all the traversal functions into 2 to hoist branching on comparer. However Remove started as 2 traversal functions (for the out variant #10203); but that have become 4, which is getting even bloatier than this (or that) - so I've merged them to one; then split to two. So will have to pay attention to that in the metrics - maybe with a big Matrix as @mikedn mentions in the original PR

@benaadams
Copy link
Member Author

benaadams commented Sep 19, 2017

Dictionary``2:TryInsertDefaultComparer(ref,ref,ubyte):bool:thiss has three exit unwinds which makes me 😢

G_M24340_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC38             sub      rsp, 56
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx
       498BD8               mov      rbx, r8
       418BE9               mov      ebp, r9d

G_M24340_IG02:
       4C8B7608             mov      r14, gword ptr [rsi+8]
       4D85F6               test     r14, r14
       750D                 jne      SHORT G_M24340_IG03
       488BCE               mov      rcx, rsi
       33D2                 xor      edx, edx
       E800000000           call     Dictionary`2:Initialize(int):ref:this
       4C8BF0               mov      r14, rax

G_M24340_IG03:
       488BCF               mov      rcx, rdi
       488B07               mov      rax, qword ptr [rdi]
       488B4040             mov      rax, qword ptr [rax+64]
       FF5010               call     qword ptr [rax+16]Object:GetHashCode():int:this
       448BF8               mov      r15d, eax
       4181E7FFFFFF7F       and      r15d, 0x7FFFFFFF
       4C8B6610             mov      r12, gword ptr [rsi+16]
       4533ED               xor      r13d, r13d
       418B4E08             mov      ecx, dword ptr [r14+8]
       418BC7               mov      eax, r15d
       33D2                 xor      rdx, rdx
       F7F1                 div      edx:eax, ecx
       3BD1                 cmp      edx, ecx
       0F8309000000         jae      G_M24340_IG15
       4863CA               movsxd   rcx, edx
       418B448E10           mov      eax, dword ptr [r14+4*rcx+16]
       85C0                 test     eax, eax
       7C4B                 jl       SHORT G_M24340_IG06

G_M24340_IG04:
       413B442408           cmp      eax, dword ptr [r12+8]
       0F8309000000         jae      G_M24340_IG15
       4863C8               movsxd   rcx, eax
       488D0C49             lea      rcx, [rcx+2*rcx]
       498D44CC10           lea      rax, bword ptr [r12+8*rcx+16]
       4889442428           mov      bword ptr [rsp+28H], rax
       4C8BC0               mov      r8, rax
       45397810             cmp      dword ptr [r8+16], r15d
       751A                 jne      SHORT G_M24340_IG05
       4C89442430           mov      bword ptr [rsp+30H], r8
       498B08               mov      rcx, gword ptr [r8]
       488BD7               mov      rdx, rdi
       4C8B09               mov      r9, qword ptr [rcx]
       4D8B4940             mov      r9, qword ptr [r9+64]
       41FF5108             call     qword ptr [r9+8]Object:Equals(ref):bool:this
       85C0                 test     eax, eax
       7538                 jne      SHORT G_M24340_IG07

G_M24340_IG05:
       488B442428           mov      rax, bword ptr [rsp+28H]
       8B4014               mov      eax, dword ptr [rax+20]
       85C0                 test     eax, eax
       7DB5                 jge      SHORT G_M24340_IG04

G_M24340_IG06:
       8B5644               mov      edx, dword ptr [rsi+68]
       85D2                 test     edx, edx
       7E67                 jle      SHORT G_M24340_IG11
       448B6E40             mov      r13d, dword ptr [rsi+64]
       453B6C2408           cmp      r13d, dword ptr [r12+8]
       0F8309000000         jae      G_M24340_IG15
       4963CD               movsxd   rcx, r13d
       488D0C49             lea      rcx, [rcx+2*rcx]
       418B4CCC24           mov      ecx, dword ptr [r12+8*rcx+36]
       894E40               mov      dword ptr [rsi+64], ecx
       FFCA                 dec      edx
       895644               mov      dword ptr [rsi+68], edx
       EB6E                 jmp      SHORT G_M24340_IG13

G_M24340_IG07:
       4C8B442430           mov      r8, bword ptr [rsp+30H]
       440FB6E5             movzx    r12, bpl
       4183FC01             cmp      r12d, 1
       7515                 jne      SHORT G_M24340_IG08
       498D4808             lea      rcx, bword ptr [r8+8]
       488BD3               mov      rdx, rbx
       E800000000           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       FF463C               inc      dword ptr [rsi+60]
       41BD01000000         mov      r13d, 1

G_M24340_IG08:
       4183FC02             cmp      r12d, 2
       0F8400000000         je       G_M24340_IG14

G_M24340_IG09:
       418BC5               mov      eax, r13d

G_M24340_IG10:
       4883C438             add      rsp, 56
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      

G_M24340_IG11:
       8B6E38               mov      ebp, dword ptr [rsi+56]
       41396C2408           cmp      dword ptr [r12+8], ebp
       751C                 jne      SHORT G_M24340_IG12
       8BCD                 mov      ecx, ebp
       E800000000           call     HashHelpers:ExpandPrime(int):int
       8BD0                 mov      edx, eax
       488BCE               mov      rcx, rsi
       4533C0               xor      r8d, r8d
       E800000000           call     Dictionary`2:ResizeDefaultComparer(int,bool):this
       4C8B7608             mov      r14, gword ptr [rsi+8]
       4C8B6610             mov      r12, gword ptr [rsi+16]

G_M24340_IG12:
       448BED               mov      r13d, ebp
       FF4638               inc      dword ptr [rsi+56]

G_M24340_IG13:
       418B4E08             mov      ecx, dword ptr [r14+8]
       418BC7               mov      eax, r15d
       33D2                 xor      rdx, rdx
       F7F1                 div      edx:eax, ecx
       3BD1                 cmp      edx, ecx
       0F8309000000         jae      G_M24340_IG15
       4863D2               movsxd   rdx, edx
       498D6C9610           lea      rbp, bword ptr [r14+4*rdx+16]
       453B6C2408           cmp      r13d, dword ptr [r12+8]
       0F8309000000         jae      G_M24340_IG15
       4963D5               movsxd   rdx, r13d
       488D1452             lea      rdx, [rdx+2*rdx]
       4D8D74D410           lea      r14, bword ptr [r12+8*rdx+16]
       45897E10             mov      dword ptr [r14+16], r15d
       8B5500               mov      edx, dword ptr [rbp]
       41895614             mov      dword ptr [r14+20], edx
       498BCE               mov      rcx, r14
       488BD7               mov      rdx, rdi
       E800000000           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       498D4E08             lea      rcx, bword ptr [r14+8]
       488BD3               mov      rdx, rbx
       E800000000           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       44896D00             mov      dword ptr [rbp], r13d
       FF463C               inc      dword ptr [rsi+60]
       41BD01000000         mov      r13d, 1
       E95AFFFFFF           jmp      G_M24340_IG09

************** Beginning of cold code **************

G_M24340_IG14:
       488BCF               mov      rcx, rdi
       E800000000           call     ThrowHelper:ThrowAddingDuplicateWithKeyArgumentException(ref)
       CC                   int3     

G_M24340_IG15:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 457, prolog size 28 for method Dictionary`2:TryInsertDefaultComparer(ref,ref,ubyte):bool:this

@benaadams
Copy link
Member Author

benaadams commented Sep 19, 2017

Non-ref version
Dictionary``2:TryInsertDefaultComparer(long,bool,ubyte):bool:this

G_M11705_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC48             sub      rsp, 72
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx
       418BD8               mov      ebx, r8d
       418BE9               mov      ebp, r9d

G_M11705_IG02:
       4C8B7608             mov      r14, gword ptr [rsi+8]
       4D85F6               test     r14, r14
       750D                 jne      SHORT G_M11705_IG03
       488BCE               mov      rcx, rsi
       33D2                 xor      edx, edx
       E800000000           call     Dictionary`2:Initialize(int):ref:this
       4C8BF0               mov      r14, rax

G_M11705_IG03:
       488BC7               mov      rax, rdi
       48C1F820             sar      rax, 32
       33C7                 xor      eax, edi
       448BF8               mov      r15d, eax
       4181E7FFFFFF7F       and      r15d, 0x7FFFFFFF
       4C8B6610             mov      r12, gword ptr [rsi+16]
       4533ED               xor      r13d, r13d
       418B4E08             mov      ecx, dword ptr [r14+8]
       418BC7               mov      eax, r15d
       33D2                 xor      rdx, rdx
       F7F1                 div      edx:eax, ecx
       3BD1                 cmp      edx, ecx
       0F8319000000         jae      G_M11705_IG15
       4863CA               movsxd   rcx, edx
       418B448E10           mov      eax, dword ptr [r14+4*rcx+16]
       85C0                 test     eax, eax
       7C3F                 jl       SHORT G_M11705_IG06
       418B4C2408           mov      ecx, dword ptr [r12+8]

G_M11705_IG04:
       3BC1                 cmp      eax, ecx
       0F8319000000         jae      G_M11705_IG15
       4863C0               movsxd   rax, eax
       488D0440             lea      rax, [rax+2*rax]
       498D44C410           lea      rax, bword ptr [r12+8*rax+16]
       488BD0               mov      rdx, rax
       44393A               cmp      dword ptr [rdx], r15d
       7517                 jne      SHORT G_M11705_IG05
       4C8B4208             mov      r8, qword ptr [rdx+8]
       4C8BCF               mov      r9, rdi
       4D3BC1               cmp      r8, r9
       410F94C0             sete     r8b
       450FB6C0             movzx    r8, r8b
       4585C0               test     r8d, r8d
       7536                 jne      SHORT G_M11705_IG07

G_M11705_IG05:
       8B4004               mov      eax, dword ptr [rax+4]
       85C0                 test     eax, eax
       7DC6                 jge      SHORT G_M11705_IG04

G_M11705_IG06:
       8B4E44               mov      ecx, dword ptr [rsi+68]
       85C9                 test     ecx, ecx
       7E5A                 jle      SHORT G_M11705_IG11
       448B6E40             mov      r13d, dword ptr [rsi+64]
       418B6C2408           mov      ebp, dword ptr [r12+8]
       443BED               cmp      r13d, ebp
       0F8319000000         jae      G_M11705_IG15
       4963C5               movsxd   rax, r13d
       488D0440             lea      rax, [rax+2*rax]
       418B44C414           mov      eax, dword ptr [r12+8*rax+20]
       894640               mov      dword ptr [rsi+64], eax
       FFC9                 dec      ecx
       894E44               mov      dword ptr [rsi+68], ecx
       EB60                 jmp      SHORT G_M11705_IG13

G_M11705_IG07:
       400FB6CD             movzx    rcx, bpl
       83F901               cmp      ecx, 1
       750C                 jne      SHORT G_M11705_IG08
       885A10               mov      byte  ptr [rdx+16], bl
       FF463C               inc      dword ptr [rsi+60]
       41BD01000000         mov      r13d, 1

G_M11705_IG08:
       83F902               cmp      ecx, 2
       0F8400000000         je       G_M11705_IG14

G_M11705_IG09:
       418BC5               mov      eax, r13d

G_M11705_IG10:
       4883C448             add      rsp, 72
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      

G_M11705_IG11:
       448B6E38             mov      r13d, dword ptr [rsi+56]
       418B6C2408           mov      ebp, dword ptr [r12+8]
       413BED               cmp      ebp, r13d
       751D                 jne      SHORT G_M11705_IG12
       418BCD               mov      ecx, r13d
       E800000000           call     HashHelpers:ExpandPrime(int):int
       8BD0                 mov      edx, eax
       488BCE               mov      rcx, rsi
       4533C0               xor      r8d, r8d
       E800000000           call     Dictionary`2:ResizeDefaultComparer(int,bool):this
       4C8B7608             mov      r14, gword ptr [rsi+8]
       4C8B6610             mov      r12, gword ptr [rsi+16]

G_M11705_IG12:
       FF4638               inc      dword ptr [rsi+56]

G_M11705_IG13:
       418B4E08             mov      ecx, dword ptr [r14+8]
       418BC7               mov      eax, r15d
       33D2                 xor      rdx, rdx
       F7F1                 div      edx:eax, ecx
       3BD1                 cmp      edx, ecx
       0F8319000000         jae      G_M11705_IG15
       4863C2               movsxd   rax, edx
       498D448610           lea      rax, bword ptr [r14+4*rax+16]
       453B6C2408           cmp      r13d, dword ptr [r12+8]
       0F8319000000         jae      G_M11705_IG15
       4963CD               movsxd   rcx, r13d
       488D0C49             lea      rcx, [rcx+2*rcx]
       498D4CCC10           lea      rcx, bword ptr [r12+8*rcx+16]
       448939               mov      dword ptr [rcx], r15d
       8B10                 mov      edx, dword ptr [rax]
       895104               mov      dword ptr [rcx+4], edx
       48897908             mov      qword ptr [rcx+8], rdi
       885910               mov      byte  ptr [rcx+16], bl
       448928               mov      dword ptr [rax], r13d
       FF463C               inc      dword ptr [rsi+60]
       41BD01000000         mov      r13d, 1
       E96CFFFFFF           jmp      G_M11705_IG09

************** Beginning of cold code **************

G_M11705_IG14:
       488D0D00000000       lea      rcx, [(reloc)]
       E800000000           call     CORINFO_HELP_NEWSFAST
       488BC8               mov      rcx, rax
       48897908             mov      qword ptr [rcx+8], rdi
       E800000000           call     ThrowHelper:ThrowAddingDuplicateWithKeyArgumentException(ref)
       CC                   int3     

G_M11705_IG15:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 426, prolog size 28 for method Dictionary`2:TryInsertDefaultComparer(long,bool,ubyte):bool:this

@benaadams
Copy link
Member Author

In Dictionary``2:FindEntryDefaultComparer(char,byref):byref:this

This is a bit weird

       0FB7442468           movzx    rax, word  ptr [rsp+68H]
       0FB7542468           movzx    rdx, word  ptr [rsp+68H]

@mikedn
Copy link

mikedn commented Sep 19, 2017

Updated asm with current; doesn't look like % removes the bounds check

Well, it does... in my branch - mikedn@a587bd3. Saves ~1K in corelib, all from existing hashtable code. Could save a bit more (in Linq's Lookup for example) if parse VN PHIs. They're a bit convoluted but it seems possible.

@benaadams
Copy link
Member Author

benaadams commented Sep 19, 2017

Added a goto 5e5affb 😢 which removes the multiple returns for TryInsertX and shaves off 27 bytes per method+type; 15 for shared method in the inner search loop. Updated asm.

@stephentoub
Copy link
Member

Added a goto

cc: @JosephTremoulet

@JosephTremoulet
Copy link

Dictionary``2:TryInsertDefaultComparer(ref,ref,ubyte):bool:thiss has three exit unwinds

Added a goto 😢 which removes the multiple returns for TryInsertX and shaves off 27 bytes per method+type; 15 for shared method in the inner search loop. Updated asm.

Yeah, we still have the heuristic that leaves the epilogs separate if you don't have more than four of them in the method. I'd expect us to move them out of the loop, though, if the jit recognizes it as a loop; @benaadams is that not what you were seeing? (I don't have a copy of the pre-edit asm)

@benaadams
Copy link
Member Author

if the jit recognizes it as a loop;

Its uses a weird loop increment for a for; is more of a while
for (int i = buckets[(uint)hashCode % (uint)buckets.Length]; i >= 0; i = entries[i].next)

Before goto asm is here https://gist.github.com/benaadams/f32d9468281e37c0b8f32dcdceb718d2

@JosephTremoulet
Copy link

Before goto asm is here https://gist.github.com/benaadams/f32d9468281e37c0b8f32dcdceb718d2

Ok, so that did place the epilogs (which are under IG08, IG11, and IG15) outside of the loop (which consists of IG04 and IG05), but yes since there were only three it didn't do any merging.

@benaadams
Copy link
Member Author

but yes since there were only three it didn't do any merging.

What you are saying is I need to put another return in? 😉

@damageboy
Copy link

damageboy commented Sep 25, 2017

Thanks! For reference:
http://www.hackersdelight.org/hdcodetxt/magic.c.txt

This is what can generate those numbers, but take it no account that it assumes 32 bit multiplication into separate 32 bit words, and shifting of the high-word, so when doing it with 64 bit multiplication you need to shift by an additional 32 bits to the right...

I wonder if it would make sense to generate an huge switch case with the literal code or store a table like the cpp code I referenced does...

@mikedn
Copy link

mikedn commented Sep 25, 2017

Am I missing something?

Couple of posts above

It may be worth trying using magic division but that probably requires storing the magic number in the dictionary object so it makes it larger.

@benaadams
Copy link
Member Author

benaadams commented Sep 25, 2017

It may be worth trying using magic division but that probably requires storing the magic number in the dictionary object so it makes it larger.

Would need 2 numbers or 1 plus an array lookup with extra bounds check; also no way to elide the modulus based bounds check in future.

@damageboy
Copy link

@mikedn thanks, missed it...

Wouldn't adding a single index entry to the Dictionary be worth it? especially if the index is basically just used in a big-ole switch inside HashHelpers a-la:

switch (primeIndex) {
  // 9
  case 0: return (hashCode * 0x38e38e39) >> (32 + 1);
  // 23
  case 1: return (hashCode * 0xb21642c9) >>  (32 + 4);
  ...
}

@mikedn
Copy link

mikedn commented Sep 25, 2017

Wouldn't adding a single index entry to the Dictionary be worth it?

Hard to say. Dictionaries are very common, it is possible that increasing the object size by 8 bytes does matter in some scenarios. And at the same time the faster modulus operation may not matter in some scenarios. It's the usual tradeoff - memory vs. CPU...

especially if the index is basically just used in a big-ole switch

In the best case that switch will turn into a binary search tree - a bunch of compares and branches. That will add a few cycles to the magic modulus operation. Magic modulus is faster than the normal div instruction but it's not that fast and if you add more cycles, well... you may end up were you started. And this doesn't include the cost of branch mispredictions which may be high in this case...

@damageboy
Copy link

damageboy commented Sep 25, 2017

In the best case that switch will turn into a binary search tree - a bunch of compares and branches. That will add a few cycles to the magic modulus operation. Magic modulus is faster than the normal div instruction but it's not that fast and if you add more cycles, well... you may end up were you started. And this doesn't include the cost of branch mispredictions which may be high in this case...

I thought that's why we are talking about adding another byte/int to the dictionary, to avoid the binary search tree and turn that switch into a simple lookup table based on the prime index, otherwise, why would you want to add the additional byte/int to begin with..., I assumed it would serve as a "prime index"

@mikedn
Copy link

mikedn commented Sep 25, 2017

turn that switch into a simple lookup table based on the prime index

What is "a simple lookup table"? I'm pretty sure I know what you mean but I don't see how it could be useful in this case from a performance point of view.

And for clarity - the binary search tree I'm talking about looks something like

if (x < 123) {
    if (x == 7) {
    }
    else if (x == 42) {
    }
}
else if (x == 123) {
}
else {
    if (x == 547) {
    }
}

The C# compiler generates this kind of code when a normal switch instruction cannot be used because the jump table would be too big/sparse.

@damageboy
Copy link

damageboy commented Sep 25, 2017

I was assuming that the extra int we were discussing would just be and index into a prime table, where the prime numbers currently reside in HashHelpers...

If it's a plain index running sequentially, then a switch would/should be rendered into a simple jump table... unless I'm totally off base.

The other option would be to NOT store and extra int, and use the array's Length which is one of the prime numbers in that list as the key for the switch, in which case this obviously turns into a binary tree as you described previously...

@mikedn
Copy link

mikedn commented Sep 25, 2017

I was assuming that the extra int we were discussing would just be and index into a prime table, where the prime numbers currently reside in HashHelpers...

Why store an index instead of storing the magic number directly? Granted, the magic number is an int and then we need another int for the shift so we really need 2 ints. But on 64 bit targets it doesn't matter adding a single int increases the object size by 8 bytes due to alignment requirements. So storing an index is useful only on 32 bit targets and it also cost you an indirection to read the actual values from the table.

@damageboy
Copy link

damageboy commented Sep 25, 2017

Why store an index instead of storing the magic number directly? Granted, the magic number is an int and then we need another int for the shift so we really need 2 ints. But on 64 bit targets it doesn't matter adding a single int increases the object size by 8 bytes due to alignment requirements. So storing an index is useful only on 32 bit targets and it also cost you an indirection to read the actual values from the table.

Good question, I guess my hunch/assumption is that storing two members and using them to do x * _member1 >> _member2 would generate less efficient code than a jump table with direct constants baked into the instruction stream... (assuming it's a linear switch)
Then again, I might be completely out of my depth and completely wrong :)

@mikedn
Copy link

mikedn commented Sep 25, 2017

to do x * _member1 >> _member2 would generate less efficient code than a jump table with direct constants baked into the instruction stream... (assuming it's a linear switch)

Yeah, you might save couple of cycles in some cases if you do that. And you could take advantage of the fact that RyuJIT already does magic division and write something like:

switch (buckets.Length) {
case 3:
    return i % 3;
case 5:
    return i % 5;
...
default:
    return i % buckets.Length;
}

But we still have the problem that such a switch results in a bunch of conditional branches and possible suffer from bad branch prediction.

@redknightlois
Copy link

redknightlois commented Oct 13, 2017

@damageboy getting the 2 values is likely to be far more efficient (I tried for something different where I needed to get rid of an idiv instruction that was killing me, and in that case was better). There are 3 reasons:

  • Both values are likely to be stored in the very same cache line, and given how important they are; that cache line will have high probability to be in the cache for tight loops. (better amortized time there)
  • The frontend will require to fetch & decode less instructions (than in the switch/if case) and also avoiding retiring instructions without using them. Think that for the general case you need a table of at least 6 to 8 primes for the fast path, that is log2(n) conditional branches.
  • Those 2 values will only change when growing the dictionary, that means that there is very low probability of the cache line be evicted because of writes.

@benaadams benaadams changed the title Improve Dictionary<K,T> CQ [WIP] Improve Dictionary<K,T> CQ Oct 14, 2017
@benaadams
Copy link
Member Author

Rebased, updated some asm #14030 (comment)

The EqualityComparer is devirtualizing

Devirtualized virtual call to EqualityComparer`1:Equals; now direct call to GenericEqualityComparer`1:Equals [exact]

However the find is a two call step via CORINFO_HELP_RUNTIMEHANDLE_CLASS

The code can be improved further with conditional-ref and ref-local-reassignment in C#7.2 with the extra call indirection removed, as well as improving CQ in Insert.

So will put on hold for now; then reevaluate after those features are available

@benaadams
Copy link
Member Author

Closing in preference of #15411

@benaadams benaadams closed this Dec 7, 2017
@benaadams
Copy link
Member Author

Had a go at using magic division for rehash and expand in #15453 since they are bulk operations

@benaadams
Copy link
Member Author

benaadams commented Jan 9, 2018

Returning to the idea of magic division instead of mod... It that over thinking it?

Could just use Lzcnt.LeadingZeroCount; haven't verified the code is correct, but something like:

static int LzcntMod(int value, int prime)
{
    var mask = (1 << 32 - (int)Lzcnt.LeadingZeroCount((uint)prime)) - 1;
    var index = value & mask;
    if (index >= prime)
    {
        index -= prime;
    }
    return index;
}

Mod %

G_M55235_IG03:
       418BC0               mov      eax, r8d
       99                   cdq      
       F7F9                 idiv     edx:eax, ecx

LzcntMod

G_M38796_IG03:
       F30FBDCA             lzcnt    ecx, edx
       F7D9                 neg      ecx
       83C120               add      ecx, 32
       41B901000000         mov      r9d, 1
       41D3E1               shl      r9d, cl
       418D49FF             lea      ecx, [r9-1]
       23C8                 and      ecx, eax
       3BCA                 cmp      ecx, edx
       7C02                 jl       SHORT G_M38796_IG04
       2BCA                 sub      ecx, edx

Results

   Method |     Mean |     Error |    StdDev |
--------- |---------:|----------:|----------:|
      Mod | 7.688 ns | 0.0943 ns | 0.0882 ns |
 LzcntMod | 2.641 ns | 0.0057 ns | 0.0053 ns |

Would that work? (Could even be a Jit time optimization for variable divisor?)

/cc @fiigii @tannergooding

*crossgen issues: intrinsic test needs to be inline, or won't inline - method call will eliminate most of the gains

@benaadams
Copy link
Member Author

nvm - I went a bit crazy :)

@nietras
Copy link

nietras commented Jan 20, 2018

@benaadams the benchmark you mention #14030 (comment) do you have the code for that somewhere?

@benaadams benaadams deleted the dictionary branch March 27, 2018 05:12
@AndyAyersMS
Copy link
Member

Wonder if https://arxiv.org/abs/1902.01961 might give us some new ideas for trying to get rid of that pesky divide.

@benaadams
Copy link
Member Author

User from reddit points out they gave the 64bit version for what the paper is doing as an answer on SE.Math in 2015

https://math.stackexchange.com/a/1251328/172258

@benaadams
Copy link
Member Author

Isn't constant, should would need to carry the shape or pointer to lookup in the dict and pull it in with the prime.

IDIV is ~20-27+ cycle latency so definitely has legs

@AndyAyersMS
Copy link
Member

Yes, it's not clear if there's a win to be had here -- the need to have a lookup table or similar means for getting at the magic values is a drawback. But now the mod might need just one multiply, not two.

@benaadams
Copy link
Member Author

Still looks to use two multiplys; but drops everything else:

// GCC 6.2             | // Clang 4.0                 | // our fast version + GCC 6.2
mov eax , edi          | mov eax , edi                | movabs rax , 194176253407468965
mov edx , 1491936009   | imul rax , rax , 1491936009  | mov edi , edi
mul edx                | shr rax , 32                 | imul rdi , rax
mov eax , edi          | mov ecx , edi                | mov eax , 95
sub eax , edx          | sub ecx , eax                | mul rdi
shr eax                | shr ecx                      | mov rax , rdx
add eax , edx          | add ecx , eax                | //
shr eax , 6            | shr ecx , 6                  | //
imul eax , eax , 95    | imul eax , ecx , 95          | //
sub edi , eax          | sub edi , eax                | //
mov eax                | mov eax , edi                | //

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants