Queue<T>.Enqueue and Stack<T>.Push performance improvements #36502
Conversation
Performance improvements to Queue<T>.Enqueue and Stack<T>.Push
// Pushes an item to the top of the stack.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
But we shouldn't add AggressiveInlining unless there's a strong performance-motivated scenario for doing so. Getting something to inline isn't a goal in and of itself.
(source: #12094 (comment))
For the microbenchmark this may be good, but as the call-sites get bigger, it may decrease perf in real-world scenarios.
I don't know what the right choice is...
I.e.
this would be "inlined" into every call-site
G_M60461_IG02:
8B4714 mov eax, dword ptr [rdi+20]
8B5718 mov edx, dword ptr [rdi+24]
488B4F08 mov rcx, gword ptr [rdi+8]
448B4108 mov r8d, dword ptr [rcx+8]
443BC2 cmp r8d, edx
7623 jbe SHORT G_M60461_IG04
443BC0 cmp r8d, eax
761E jbe SHORT G_M60461_IG04
4C63C8 movsxd r9, eax
4289748910 mov dword ptr [rcx+4*r9+16], esi
FFC0 inc eax
443BC0 cmp r8d, eax
7502 jne SHORT G_M60461_IG03
33C0 xor eax, eax
G_M60461_IG03:
894714 mov dword ptr [rdi+20], eax
FFC2 inc edx
895718 mov dword ptr [rdi+24], edx
FF471C inc dword ptr [rdi+28]
EB05 jmp SHORT G_M60461_IG05
G_M60461_IG04:
E810F5FFFF call MyQueue`1:EnqueueWithResize(int):this

(I put "inlined" in quotes because the JIT may optimize things, depending on the call-site.)
This is right. @stephentoub @jkotas when deciding where aggressive inlining is reasonable, do you have rules of thumb? E.g., known to be heavily called in hot paths + known to typically fail automatic inlining + not very large + significant help in a micro benchmark?
Oops, this dasm / comment belongs to Queue.
deciding where aggressive is reasonable, do you have rules of thumb?
My rule of thumb is if you're using it, you most likely shouldn't be. I generally only use it when it's a critical-path function, when inlining makes a very measurable impact on consumers (not just a microbenchmark of the method itself), and there's something about it that would fool the JIT's heuristics, such as a ton of IL but most of which will be shaken away from the resulting asm, e.g. lots of different branches for different generic type tests against various struct types, intrinsics usage, etc. I don't think this qualifies. I'm actually skeptical of the existing usage on List.Add.
This aspect of performance optimization is interesting. Is there documentation or research somewhere about it (explaining the issue on "real systems" outside of microbenchmarks), or is it only "experience"?
Inlining a method results in a copy of its code at the call site (subject to any follow-on optimizations that inlining might enable). With lots of call sites, that can significantly bloat code size. That increase in code size is not just an issue in and of itself (e.g. more memory consumption, etc.), but each copy of that code is fundamentally different, recognized as different code for an instruction cache; having many different copies of the same code can thus result in the instruction cache being much less efficient.
I've removed the use of AggressiveInlining on these. That guidance is good, thanks @stephentoub - I was purely running off what had been applied to List here.
@stephentoub thanks for the explanation. I wonder if there are "numbers" somewhere (maybe in the codebase or on the "internet"), or if we should take it as "best practice", similar to "removing this null check here is pointless because the branch predictor likely guesses correctly". I'm aware that this is a very slippery slope, but maybe you (and internal team researchers) have some real numbers.
This is highly dependent on actual code and actual call-sites. What numbers do you expect?
Research-wise this optimization domain is well covered -- you can find a lot of papers -- but the decision whether to inline or not comes back to the actual code. So there won't be any hard rules, just heuristics.
// Adds item to the tail of the queue.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Same comment as on Stack<T>.Push applies here.
For List<T>.Add aggressive inlining makes sense, as it is just a comparison for hot/cold-path, then on the hot-path an array access plus incrementing the version.
Here more work needs to be done, so the call-site gets quite a bit bigger.
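For readers following along, the hot/cold split being described looks roughly like this (an illustrative sketch, not the actual List&lt;T&gt; source):

```csharp
// Hot path: one unsigned bounds comparison, a store, and the bookkeeping.
public void Add(T item)
{
    _version++;
    T[] array = _items;
    int size = _size;
    if ((uint)size < (uint)array.Length)
    {
        _size = size + 1;
        array[size] = item;
    }
    else
    {
        AddWithResize(item); // cold path, kept out of the inlined body
    }
}
```

The point above is that for Queue and Stack the hot path carries more state (head/tail wrap-around, version), so the inlined copy at each call-site is correspondingly larger.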
I've removed AggressiveInlining as per the discussion above.
if (newCapacity < _array.Length + MinimumGrow)
{
    newCapacity = _array.Length + MinimumGrow;
Unfortunately the JIT fetches _array.Length on each use. So cache it, like:
[MethodImpl(MethodImplOptions.NoInlining)]
private void EnqueueWithResize(T item)
{
    int length = _array.Length;
    int newCapacity = length * 2;
    int lengthWithGrow = length + MinimumGrow;
    if (newCapacity < lengthWithGrow)
    {
        newCapacity = lengthWithGrow;
    }
    SetCapacity(newCapacity);
    // ...
Or use Math.Min:
int length = _array.Length;
int newCapacity = Math.Min(length * 2, length + MinimumGrow);
SetCapacity(newCapacity);
PS: I know it's the cold path, but reduced asm-size is also beneficial 😉
Unfortunately the JIT fetches _array.Length on each use.
For the benefit of others: the JIT must do this since it cannot predict access from other threads...
I didn't actually know that, thanks!
I didn't actually know that
The nice part of PRs. You improve the product and may learn something 😄
Hopefully my statement is correct then. @jkotas? I guess it wants to avoid a situation where the body runs indefinitely repeatedly checking the field for some expected external change that actually already happened.
Nothing to do with multiple threads. It is a missed optimization. There is nothing fundamental preventing the JIT from doing the transformation of caching the length in locals that you have done manually.
lengthWithGrow is not strictly needed with current JIT. The JIT is able to do the transformation for you.
Or use
Math.Min
You probably meant Math.Max. That is the most sensible and readable way to write this. It should generate about as good code as the manually inlined version.
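Putting the suggestions together, the resize helper would then read something like this (a sketch assuming the existing SetCapacity method and MinimumGrow constant; the enqueue tail is elided):

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
private void EnqueueWithResize(T item)
{
    // Cache the length so the JIT emits a single field/length fetch.
    int length = _array.Length;

    // Grow by 2x, but never by less than MinimumGrow elements.
    int newCapacity = Math.Max(length * 2, length + MinimumGrow);
    SetCapacity(newCapacity);

    // ... store item at _tail and update _tail/_size/_version as before.
}
```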
lengthWithGrow is not strictly needed with current JIT.
Hm, I saw the JIT emitting two leas for length + MinimumGrow, one in the if and one in the assignment. That's why I did it manually.
You probably meant
Math.Max
Oh, of course.
Made some updates here based on feedback
GrowFactor is not used
BTW:

private const int GrowFactor = 200; // double each time
// ...
int newcapacity = (int)((long)_array.Length * (long)GrowFactor / 100);

Why is it written that way? Instead of just:

private const long GrowFactor = 2; // double each time
// ...
int newcapacity = (int)((long)_array.Length * GrowFactor);

So we'd still have the same overflow behavior as before.
See https://github.com/dotnet/corefx/issues/35515 for a possible answer

This looks over-engineered to me. You should be able to write it like this: and it will have the exact same behavior wrt. overflows as
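A small check of the overflow claim (values chosen so that doubling exceeds int.MaxValue; the default unchecked truncation from long to int wraps both forms the same way):

```csharp
int length = int.MaxValue / 2 + 1;               // 1073741824

// Old form: scale by GrowFactor = 200, divide by 100, in 64-bit.
int oldForm = (int)((long)length * 200 / 100);

// Simplified form: multiply by 2 in 64-bit, then truncate.
int newForm = (int)((long)length * 2L);

// Both truncate 2^31 to the same (negative) int, so the overflow
// behavior is unchanged.
Console.WriteLine(oldForm == newForm);           // True
```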
Removed unused GrowFactor member
Removed Aggressive Inlining
Other minor improvements
}
else
{
    EnqueueWithResize(item);
What's the benefit to separating this out into a non-inlineable method? Is Enqueue inlined with this separated out? Does separating it out make the JIT better at optimizing the rest of the code?
What's the benefit ...
When Enqueue was inlined, it was a hot/cold-path-split. But this doesn't apply anymore.
Is Enqueue inlined with this separated out?
No. At least not now, as AggressiveInlining was removed.
When written as
public void Enqueue(T item)
{
    int tail = _tail;
    int size = _size;
    T[] array = _array;
    if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
    {
        array[tail] = item;
        _tail = MoveNext(ref tail, array);
        _size = size + 1;
        _version++;
    }
    else
    {
        int length = array.Length;
        int newCapacity = Math.Max(length * 2, length + MinimumGrow);
        SetCapacity(newCapacity);
        _array[_tail] = item;
        MoveNext(ref _tail);
        _size++;
        _version++;
    }
}

(so just manually inlined EnqueueWithResize) and with the _version++ in each branch the asm becomes:
G_M9246_IG01:
55 push rbp
4156 push r14
53 push rbx
488D6C2410 lea rbp, [rsp+10H]
488BDF mov rbx, rdi
448BF6 mov r14d, esi
G_M9246_IG02:
8B7B14 mov edi, dword ptr [rbx+20]
8B7318 mov esi, dword ptr [rbx+24]
488B4308 mov rax, gword ptr [rbx+8]
8B5008 mov edx, dword ptr [rax+8]
3BD6 cmp edx, esi
7624 jbe SHORT G_M9246_IG05
3BD7 cmp edx, edi
7620 jbe SHORT G_M9246_IG05
4863CF movsxd rcx, edi
4489748810 mov dword ptr [rax+4*rcx+16], r14d
FFC7 inc edi
3BD7 cmp edx, edi
7502 jne SHORT G_M9246_IG03
33FF xor edi, edi
G_M9246_IG03:
897B14 mov dword ptr [rbx+20], edi
FFC6 inc esi
897318 mov dword ptr [rbx+24], esi
FF431C inc dword ptr [rbx+28]
G_M9246_IG04:
5B pop rbx
415E pop r14
5D pop rbp
C3 ret
G_M9246_IG05:
8BF2 mov esi, edx
D1E6 shl esi, 1
83C204 add edx, 4
3BF2 cmp esi, edx
7D02 jge SHORT G_M9246_IG06
EB02 jmp SHORT G_M9246_IG07
G_M9246_IG06:
8BD6 mov edx, esi
G_M9246_IG07:
488BFB mov rdi, rbx
8BF2 mov esi, edx
E8D3F1FFFF call Queue`1:SetCapacity(int):this
488B4308 mov rax, gword ptr [rbx+8]
8B7B14 mov edi, dword ptr [rbx+20]
8B7008 mov esi, dword ptr [rax+8]
3BFE cmp edi, esi
7323 jae SHORT G_M9246_IG10
4863FF movsxd rdi, edi
448974B810 mov dword ptr [rax+4*rdi+16], r14d
488D4314 lea rax, bword ptr [rbx+20]
8B38 mov edi, dword ptr [rax]
FFC7 inc edi
3BF7 cmp esi, edi
7502 jne SHORT G_M9246_IG08
33FF xor edi, edi
G_M9246_IG08:
8938 mov dword ptr [rax], edi
FF4318 inc dword ptr [rbx+24]
FF431C inc dword ptr [rbx+28]
G_M9246_IG09:
5B pop rbx
415E pop r14
5D pop rbp
C3 ret
G_M9246_IG10:
E80DD9B678 call CORINFO_HELP_RNGCHKFAIL
CC int3

So there is a "fast-return" at G_M9246_IG04 (with the same asm as with a separate EnqueueWithResize), then followed by the "resize"-part. I think this code is good and an explicit EnqueueWithResize can be omitted.
Note: if _version++ is at the end of the method, then there is just one single ret, which may slow down the fast-path.
Alternatively _version++ could be moved to the top of the method, so asm becomes
G_M9246_IG02:
FF431C inc dword ptr [rbx+28] ; _version++
8B7B14 mov edi, dword ptr [rbx+20]
8B7318 mov esi, dword ptr [rbx+24]
488B4308 mov rax, gword ptr [rbx+8]
8B5008 mov edx, dword ptr [rax+8]
3BD6 cmp edx, esi
7621 jbe SHORT G_M9246_IG05
; ...
(with the same asm as with a separate
EnqueueWithResize)
I cheated a bit. In the prolog and epilog there's a difference:
G_M9243_IG01:
55 push rbp
- 488BEC mov rbp, rsp
- 90 nop
+ 4156 push r14
+ 53 push rbx
+ 488D6C2410 lea rbp, [rsp+10H]
+ 488BDF mov rbx, rdi
+ 448BF6 mov r14d, esi
; ...
G_M9246_IG04:
+ 5B pop rbx
+ 415E pop r14
5D pop rbp
C3 ret
; ...

Don't know if this makes much difference -- one may need to measure it, but I assume it's within noise.
This also follows from Stack.Push, which splits those out as well:
// Non-inline from Stack.Push to improve its code quality as uncommon path (added in #26086)
Perhaps I was misunderstanding the purpose behind that in the first place. Even without AggressiveInlining the choice may be made to inline it, so separating out the cold path is beneficial. Any guidance here is definitely appreciated because I might have totally misunderstood this.
@Wraiyth you haven't misunderstood anything. It's a usual part of the review process to question and reason about every change -- even when the change is taken over from an accepted PR -- so it's good to re-think some decisions (made earlier, or made now). That's the case here.
_tail = (_size == capacity) ? 0 : _size;
_version++;
}
private void MoveNext(ref int index) => MoveNext(ref index, _array);
Nit: needs a blank line above this. That said, are there many other call sites to this? Why not just use _array at those call sites rather than adding this wrapper?
My idea was #36502 (comment)
I see there are other places that could use this too.
- Dequeue
- TryDequeue
In these, array and head are already in registers, so there's no need to refetch them from memory. That's where MoveNext(ref int, T[]) comes into play.
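The overload pair being discussed would look something like this (a sketch; the array-taking overload lets call sites that already have _array in a register pass it in):

```csharp
// Increments index with wrap-around at array.Length (ring buffer).
private static void MoveNext(ref int index, T[] array)
{
    int tmp = index + 1;
    if (tmp == array.Length)
    {
        tmp = 0;
    }
    index = tmp;
}

// Convenience wrapper for call sites that don't have _array cached.
private void MoveNext(ref int index) => MoveNext(ref index, _array);
```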
int size = _size;
T[] array = _array;

if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
doesn't (uint)tail < (uint)array.Length just move the check to C# code?
Correct. The only mini-advantage is that the JITed code size is a few bytes smaller.
G_M9243_IG02:
FF471C inc dword ptr [rdi+28]
8B4714 mov eax, dword ptr [rdi+20]
8B5718 mov edx, dword ptr [rdi+24]
488B4F08 mov rcx, gword ptr [rdi+8]
448B4108 mov r8d, dword ptr [rcx+8]
443BC2 cmp r8d, edx
7620 jbe SHORT G_M9243_IG05
- 443BC0 cmp r8d, eax
- 761B jbe SHORT G_M9243_IG05
+ 413BC0 cmp eax, r8d
+ 7323 jae SHORT G_M9243_IG07
4C63C8 movsxd r9, eax
4289748910 mov dword ptr [rcx+4*r9+16], esi
FFC0 inc eax
443BC0 cmp r8d, eax
7502 jne SHORT G_M9243_IG03
33C0 xor eax, eax
G_M9243_IG03:
894714 mov dword ptr [rdi+20], eax
FFC2 inc edx
895718 mov dword ptr [rdi+24], edx
G_M9243_IG04:
5D pop rbp
C3 ret
G_M9243_IG05:
E8E0F4FFFF call Queue`1:EnqueueWithResize(int):this
90 nop
G_M9243_IG06:
5D pop rbp
C3 ret
+G_M9243_IG07:
+ E8882CB778 call CORINFO_HELP_RNGCHKFAIL
+ CC int3
-; Total bytes of code 67, prolog size 5 for method Queue`1:Enqueue(int):this
+; Total bytes of code 73, prolog size 5 for method Queue`1:Enqueue(int):this
; ============================================================

I think this can be reverted, thus improving readability of the code.
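For reference, the pattern under discussion is the usual unsigned-comparison idiom for proving an index in range (a generic sketch, not this PR's exact code):

```csharp
// For a non-negative tail, (uint)tail < (uint)array.Length implies
// 0 <= tail < array.Length, so the subsequent indexing cannot fail.
// Whether the JIT then drops its own range check depends on whether
// it can see the proof, as the asm diff above shows.
if ((uint)tail < (uint)array.Length)
{
    array[tail] = item;
}
```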
@Wraiyth master is now open for 5.0, any update on this PR?
I'm going to close this for age. Thanks for the contribution @Wraiyth. If you want to pick this up again, feel free to create a new PR. If you do it would be good to have benchmark results from perf tests for Queue and Stack (https://github.com/dotnet/performance/). Thanks again.
Addresses #36405
This updates Stack&lt;T&gt;.Push and Queue&lt;T&gt;.Enqueue to have similar optimizations and structure to List&lt;T&gt;.Add.