Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Dictionary - magic remainder #15453

Closed
wants to merge 4 commits into from

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Dec 9, 2017

Split Resize -> Expand and Rehash

Rehashing is only used in string key variant so it doesn't need to be compiled in every Expand
Rehashing doesn't expand the Dictionary so it can reuse the existing arrays
Use MagicNumberRemainder rather than idiv in Expand and Rehash

@benaadams
Copy link
Member Author

benaadams commented Dec 9, 2017

Before Resize

; Lcl frame size = 88
...
G_M29783_IG10:
       413BD8               cmp      ebx, r8d
       0F8393000000         jae      G_M29783_IG18
       4863D3               movsxd   rdx, ebx
       488D1452             lea      rdx, [rdx+2*rdx]
       498D4CD510           lea      rcx, bword ptr [r13+8*rdx+16]
       448B4910             mov      r9d, dword ptr [rcx+16]
       4585C9               test     r9d, r9d
       7C1C                 jl       SHORT G_M29783_IG11
       418BC1               mov      eax, r9d
       99                   cdq      
       F7FF                 idiv     edx:eax, edi
       413BD6               cmp      edx, r14d
       7373                 jae      SHORT G_M29783_IG18
       4863C2               movsxd   rax, edx
       8B448510             mov      eax, dword ptr [rbp+4*rax+16]
       894114               mov      dword ptr [rcx+20], eax
       4863C2               movsxd   rax, edx
       895C8510             mov      dword ptr [rbp+4*rax+16], ebx

G_M29783_IG11:
       FFC3                 inc      ebx
       8B44244C             mov      eax, dword ptr [rsp+4CH]
       3BD8                 cmp      ebx, eax
       8944244C             mov      dword ptr [rsp+4CH], eax
       7CB8                 jl       SHORT G_M29783_IG10
...
; Total bytes of code 497, prolog size 29 for method Dictionary`2:Resize(int,bool):this

After Expand

; Lcl frame size = 72
...
G_M13446_IG06:
       3BC2                 cmp      eax, edx
       737A                 jae      SHORT G_M13446_IG10
       4863C8               movsxd   rcx, eax
       488D0C49             lea      rcx, [rcx+2*rcx]
       4D8D44CD10           lea      r8, bword ptr [r13+8*rcx+16]
       458B4810             mov      r9d, dword ptr [r8+16]
       4585C9               test     r9d, r9d
       7C35                 jl       SHORT G_M13446_IG07
       418BC9               mov      ecx, r9d
       448BD5               mov      r10d, ebp
       4C0FAFD1             imul     r10, rcx
       418D4E20             lea      ecx, [r14+32]
       49D3EA               shr      r10, cl
       498BCA               mov      rcx, r10
       8BC9                 mov      ecx, ecx             ; bit weird
       0FAFCB               imul     ecx, ebx
       442BC9               sub      r9d, ecx
       453BCC               cmp      r9d, r12d
       7344                 jae      SHORT G_M13446_IG10
       4963C9               movsxd   rcx, r9d
       418B4C8F10           mov      ecx, dword ptr [r15+4*rcx+16]
       41894814             mov      dword ptr [r8+20], ecx
       4963C9               movsxd   rcx, r9d
       4189448F10           mov      dword ptr [r15+4*rcx+16], eax

G_M13446_IG07:
       FFC0                 inc      eax
       3BC6                 cmp      eax, esi
       7CAC                 jl       SHORT G_M13446_IG06
...
; Total bytes of code 316, prolog size 26 for method Dictionary`2:Expand(int):this

// Generate & validate https://github.com/benaadams/Primes
public static readonly PrimeInfo[] PrimesInfo =
{
new PrimeInfo(3, 0x55555556, 0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This initialization pattern produces a lot of code bloat.

Copy link
Member Author

@benaadams benaadams Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better pattern?

I could do an array of int that proceeds in groups of 3
e.g.

= { 3, 0x55555556, 0, 7, 0x92492493, 2, ...

But that seems a bit raw?

Copy link
Member

@jkotas jkotas Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a bit raw - we do it in number of other places.

Number of C# features force an unfortunate trade-off onto you: (pretty, big and slow final code) vs. (raw, small and fast final code).

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint MagicNumberDivide(uint numerator, uint magic, int shift)
{
return (uint)((numerator * (ulong)magic) >> (32 + shift));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like + 32 can be folded into the magic table.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to check if this works well on 32 bit targets. I think RyuJIT x86 handles these particular long operations well but I do not know what happens on ARM32.

@benaadams
Copy link
Member Author

I think the splitting of Resize stands alone as a good change so will add a PR for that.

@benaadams
Copy link
Member Author

Added PR for the split #15455

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint MagicNumberDivide(uint numerator, uint magic, int shift)
{
return (uint)((numerator * (ulong)magic) >> (32 + shift));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth a Debug.Assert using regular division?

@benaadams
Copy link
Member Author

benaadams commented Dec 19, 2017

Last commit a8917c0 adds magic remainder onto #15419

Its definitely faster 5% - 10%

                           Method |   Items |            Current |                New |  Change |
--------------------------------- |-------- |-------------------:|-------------------:|--------:|
    ContainsKey_Single_Yes_IntKey |     100 |           5.909 ns |           5.629 ns |  x 1.04 |
     ContainsKey_Single_No_IntKey |     100 |           5.973 ns |           5.681 ns |  x 1.05 |
           ContainsKey_All_IntKey |     100 |         591.147 ns |         580.291 ns |  x 1.02 |
 ContainsKey_Single_Yes_StringKey |     100 |          20.476 ns |          18.509 ns |  x 1.10 |
  ContainsKey_Single_No_StringKey |     100 |          16.165 ns |          14.430 ns |  x 1.12 |
 ContainsKey_Single_Yes_ObjectKey |     100 |           8.749 ns |           7.798 ns |  x 1.12 |
  ContainsKey_Single_No_ObjectKey |     100 |           8.448 ns |           7.737 ns |  x 1.09 |
                       Add_IntKey |     100 |       3,989.325 ns |       3,565.964 ns |  x 1.11 |
                    Add_StringKey |     100 |       9,967.582 ns |       9,539.857 ns |  x 1.04 |
                    Add_ObjectKey |     100 |       8,678.552 ns |       8,191.180 ns |  x 1.06 |
               PreSizedAdd_IntKey |     100 |       1,105.320 ns |       1,026.360 ns |  x 1.08 |
            PreSizedAdd_StringKey |     100 |       7,460.880 ns |       7,082.505 ns |  x 1.05 |
            PreSizedAdd_ObjectKey |     100 |       6,240.259 ns |       6,046.344 ns |  x 1.03 |

However, there is a lot going on, including extra memory accesses to get the magic and shift (and makes the dictionary 2 fields larger)

Changes 1 idiv

and      ebx, 0xD1FFAB1E           ;  & 0x7FFFFFFF
mov      rax, gword ptr [rsi+8]    
mov      rcx, rax                  
mov      r8d, dword ptr [rax+8]    
mov      eax, ebx                  
cdq                                
idiv     edx:eax, r8d              
cmp      edx, r8d                  
jae      G_M6157_IG09              ; CORINFO_HELP_RNGCHKFAIL

to 11 instuctions (not sure the asm is great?)

and      r15d, 0xD1FFAB1E          ;  & 0x7FFFFFFF
mov      edx, dword ptr [rbx+8]    
mov      r8d, edx                  
mov      ecx, dword ptr [rdi+72]   
mov      r11d, dword ptr [rdi+76]  
mov      eax, r15d                 
mov      ecx, ecx                  ; bit weird
imul     rax, rcx                  
mov      ecx, r11d                 
shr      rax, cl                   
mov      rcx, rax                  
mov      ecx, ecx                  ; bit weird               
imul     ecx, r8d                  
mov      r8d, r15d                 
sub      r8d, ecx                  
cmp      r8d, edx                  
jae      SHORT G_M6159_IG07        ; CORINFO_HELP_RNGCHKFAIL

So there's probably a better way; for more gain with less cost. However, its here as an investigation for posterity to be referred to.

Thoughts?

@mikedn
Copy link

mikedn commented Dec 19, 2017

Its definitely faster 5% - 10%

Hmm, not as much as I would have hopped. But then, you replace div with a memory load, 2 imuls, 1 shift and one subtract - that's something like 15 cycles instead of 26. Not bad but not a very big difference either.

mov ecx, ecx ; bit weird

Those are likely uint->long casts. Well, the first certainly is, not so sure about the second, it seems pointless since the imul that follows is 32 bit, not 64.

not sure the asm is great?

Yeah, on CPU that support it SHRX could be used instead of SHR. Should save 2 cycles on Intel CPUs.

So there's probably a better way; for more gain with less cost

Well, the only other solution is to completely get rid of the need for prime numbers and use power of 2 numbers instead... and deal with all the baggage that such a change would bring.

@benaadams
Copy link
Member Author

I tried the fast range and while it was much faster; the benchmarker (in-proc) started crawling after a while (while collating results) - so they are definitely badly distributed hashes around.

Thought about trying rehashing the hash with Sse42.Crc32 but that's not available everywhere 😢

@benaadams
Copy link
Member Author

benaadams commented Dec 20, 2017

Its definitely faster 5% - 10%

you replace div with a memory load, 2 imuls, 1 shift and one subtract - that's something like 15 cycles instead of 26. Not bad but not a very big difference either.

Would suggest idiv is 9 - 17% of the function; so that's the upper limit on gain?

@mikedn
Copy link

mikedn commented Dec 20, 2017

Would suggest idiv is 9 - 17% of the function; so that's the upper limit on gain?

You mean you counted up all the cycles needed by all the code in that function and you ended up with 9-17%? Could be. The reality is that while idiv is expensive the rest of the code isn't exactly cheap either. It's not like all is just return entries[i % entries.Length];.

@benaadams
Copy link
Member Author

benaadams commented Dec 20, 2017

Inverse cycle change * gain

26 / 15 * 5 = 8.66 %
26 / 15 * 10 = 17.33 %

@damageboy
Copy link

I've been reading the code + agner fog's instruction tables, and something has jumped to my attention.
This might not affect the normal lookup code @benaadams has provided above and in the PR, but could be useful for the resizing functionality, I might be making a complete ass of myself, so please be gentle...

In instruction_table.pdf Agner Fog writes, example about Intel Skylake execution ports that (page 231):

Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address

From this I understand that in Intel's latest and greatest arch (also in previous archs, from skimming over the PDF) there is only one execution port per core that can execute an idiv (port 0), while there are 4, if I count correctly, that can execute an imul/mul (ports 0,1,5,6).

If I understand the implication of this correctly, it would mean that the CPU could do multiple MagicNumberRemainder / MagicNumberDivision in series more efficiently, if instructed to, due to instruction reordering and the ILP "provided" by the 4 execution ports, but this is not true for multiple idiv instructions...

While this probably is less meaningful for the FindEntry() and Remove() methods where the Dictionary, hence the CPU, is dealing with one key at a time, and probably can't really take advantage of this parallelism, it might be useful for the Resize/Rehash operations, where multiple MagicNumberRemainder calls could run in succession (rolled out) which might lead to higher perf improvements for that scenario?

Am I completely wrong?

@benaadams benaadams closed this Dec 21, 2017
@benaadams
Copy link
Member Author

Am I completely wrong?

No, but Resize only occurs approx 2^n TryInserts (at each doubling in size) so its not a dominant factor; and TryInsert has a lot of instructions of which idiv is 9-17% of the time (not insignificant)

There are other code quality issues #15419 probably to fix first

@benaadams
Copy link
Member Author

e.g setting the 2^n buckets to -1 on each Resize isn't great either

@mikedn
Copy link

mikedn commented Dec 21, 2017

Am I completely wrong?

You're not but you missed a little detail - throughput. If you look at the instruction table for Skylake for example, you'll see that the reciprocal throughput of int32 division is only 6, not 26. I'm not sure how accurate that 6 number really is but the fact is that if you try to execute 2 independent divisions the total execution time won't necessarily be twice the time of executing a single division, there's some instruction level parallelism going on. For example:

for (uint i = 1; i < 1000000000; i++)
{
    uint u1 = 1234 / i;
    uint u2 = 1233 / i;
    uint u3 = 1232 / i;
}

This loops takes ~8800ms on my computer. The same loop but with a single division in it takes ~3700ms. So the 3 division loop is around x2.4 slower than the single division loop yet it does 3 divisions, not just 2.

So we're doing 3 billion divisions in 8800ms. With my CPU clock (3.10GHz) that means something like 10 cycles per division, not 26.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants