-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Conversation
Before ; Lcl frame size = 88
...
G_M29783_IG10:
413BD8 cmp ebx, r8d
0F8393000000 jae G_M29783_IG18
4863D3 movsxd rdx, ebx
488D1452 lea rdx, [rdx+2*rdx]
498D4CD510 lea rcx, bword ptr [r13+8*rdx+16]
448B4910 mov r9d, dword ptr [rcx+16]
4585C9 test r9d, r9d
7C1C jl SHORT G_M29783_IG11
418BC1 mov eax, r9d
99 cdq
F7FF idiv edx:eax, edi
413BD6 cmp edx, r14d
7373 jae SHORT G_M29783_IG18
4863C2 movsxd rax, edx
8B448510 mov eax, dword ptr [rbp+4*rax+16]
894114 mov dword ptr [rcx+20], eax
4863C2 movsxd rax, edx
895C8510 mov dword ptr [rbp+4*rax+16], ebx
G_M29783_IG11:
FFC3 inc ebx
8B44244C mov eax, dword ptr [rsp+4CH]
3BD8 cmp ebx, eax
8944244C mov dword ptr [rsp+4CH], eax
7CB8 jl SHORT G_M29783_IG10
...
; Total bytes of code 497, prolog size 29 for method Dictionary`2:Resize(int,bool):this After ; Lcl frame size = 72
...
G_M13446_IG06:
3BC2 cmp eax, edx
737A jae SHORT G_M13446_IG10
4863C8 movsxd rcx, eax
488D0C49 lea rcx, [rcx+2*rcx]
4D8D44CD10 lea r8, bword ptr [r13+8*rcx+16]
458B4810 mov r9d, dword ptr [r8+16]
4585C9 test r9d, r9d
7C35 jl SHORT G_M13446_IG07
418BC9 mov ecx, r9d
448BD5 mov r10d, ebp
4C0FAFD1 imul r10, rcx
418D4E20 lea ecx, [r14+32]
49D3EA shr r10, cl
498BCA mov rcx, r10
8BC9 mov ecx, ecx ; bit weird
0FAFCB imul ecx, ebx
442BC9 sub r9d, ecx
453BCC cmp r9d, r12d
7344 jae SHORT G_M13446_IG10
4963C9 movsxd rcx, r9d
418B4C8F10 mov ecx, dword ptr [r15+4*rcx+16]
41894814 mov dword ptr [r8+20], ecx
4963C9 movsxd rcx, r9d
4189448F10 mov dword ptr [r15+4*rcx+16], eax
G_M13446_IG07:
FFC0 inc eax
3BC6 cmp eax, esi
7CAC jl SHORT G_M13446_IG06
...
; Total bytes of code 316, prolog size 26 for method Dictionary`2:Expand(int):this |
// Generate & validate https://github.com/benaadams/Primes | ||
public static readonly PrimeInfo[] PrimesInfo = | ||
{ | ||
new PrimeInfo(3, 0x55555556, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This initialization pattern produces a lot of code bloat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better pattern?
I could do an array of int that proceeds in groups of 3
e.g.
= { 3, 0x55555556, 0, 7, 0x92492493, 2, ...
But that seems a bit raw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a bit raw - we do it in number of other places.
Number of C# features force an unfortunate trade-off onto you: (pretty, big and slow final code) vs. (raw, small and fast final code).
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static uint MagicNumberDivide(uint numerator, uint magic, int shift) | ||
{ | ||
return (uint)((numerator * (ulong)magic) >> (32 + shift)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like + 32
can be folded into the magic table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to check if this works well on 32 bit targets. I think RyuJIT x86 handles these particular long
operations well but I do not know what happens on ARM32.
I think the splitting of Resize stands alone as a good change so will add a PR for that. |
Added PR for the split #15455 |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static uint MagicNumberDivide(uint numerator, uint magic, int shift) | ||
{ | ||
return (uint)((numerator * (ulong)magic) >> (32 + shift)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth a Debug.Assert using regular division?
9ae4032
to
a8917c0
Compare
Last commit a8917c0 adds magic remainder onto #15419 Its definitely faster 5% - 10%
However, there is a lot going on, including extra memory accesses to get the magic and shift (and makes the dictionary 2 fields larger) Changes 1 and ebx, 0xD1FFAB1E ; & 0x7FFFFFFF
mov rax, gword ptr [rsi+8]
mov rcx, rax
mov r8d, dword ptr [rax+8]
mov eax, ebx
cdq
idiv edx:eax, r8d
cmp edx, r8d
jae G_M6157_IG09 ; CORINFO_HELP_RNGCHKFAIL to 11 instuctions (not sure the asm is great?) and r15d, 0xD1FFAB1E ; & 0x7FFFFFFF
mov edx, dword ptr [rbx+8]
mov r8d, edx
mov ecx, dword ptr [rdi+72]
mov r11d, dword ptr [rdi+76]
mov eax, r15d
mov ecx, ecx ; bit weird
imul rax, rcx
mov ecx, r11d
shr rax, cl
mov rcx, rax
mov ecx, ecx ; bit weird
imul ecx, r8d
mov r8d, r15d
sub r8d, ecx
cmp r8d, edx
jae SHORT G_M6159_IG07 ; CORINFO_HELP_RNGCHKFAIL So there's probably a better way; for more gain with less cost. However, its here as an investigation for posterity to be referred to. Thoughts? |
Hmm, not as much as I would have hopped. But then, you replace
Those are likely
Yeah, on CPU that support it
Well, the only other solution is to completely get rid of the need for prime numbers and use power of 2 numbers instead... and deal with all the baggage that such a change would bring. |
I tried the fast range and while it was much faster; the benchmarker (in-proc) started crawling after a while (while collating results) - so they are definitely badly distributed hashes around. Thought about trying rehashing the hash with |
Would suggest |
You mean you counted up all the cycles needed by all the code in that function and you ended up with 9-17%? Could be. The reality is that while |
Inverse cycle change * gain 26 / 15 * 5 = 8.66 % |
I've been reading the code + agner fog's instruction tables, and something has jumped to my attention. In instruction_table.pdf Agner Fog writes, example about Intel Skylake execution ports that (page 231):
From this I understand that in Intel's latest and greatest arch (also in previous archs, from skimming over the PDF) there is only one execution port per core that can execute an If I understand the implication of this correctly, it would mean that the CPU could do multiple While this probably is less meaningful for the Am I completely wrong? |
No, but Resize only occurs approx 2^n There are other code quality issues #15419 probably to fix first |
e.g setting the 2^n buckets to -1 on each Resize isn't great either |
You're not but you missed a little detail - throughput. If you look at the instruction table for Skylake for example, you'll see that the reciprocal throughput of int32 division is only 6, not 26. I'm not sure how accurate that 6 number really is but the fact is that if you try to execute 2 independent divisions the total execution time won't necessarily be twice the time of executing a single division, there's some instruction level parallelism going on. For example: for (uint i = 1; i < 1000000000; i++)
{
uint u1 = 1234 / i;
uint u2 = 1233 / i;
uint u3 = 1232 / i;
} This loops takes ~8800ms on my computer. The same loop but with a single division in it takes ~3700ms. So the 3 division loop is around x2.4 slower than the single division loop yet it does 3 divisions, not just 2. So we're doing 3 billion divisions in 8800ms. With my CPU clock (3.10GHz) that means something like 10 cycles per division, not 26. |
Split Resize -> Expand and Rehash
Rehashing is only used in
string
key variant so it doesn't need to be compiled in every ExpandRehashing doesn't expand the Dictionary so it can reuse the existing arrays
Use MagicNumberRemainder rather than
idiv
inExpand
andRehash