Queue<T>.Enqueue and Stack<T>.Push performance improvements #36502
Conversation
Performance improvements to Queue<T>.Enqueue and Stack<T>.Push
// Pushes an item to the top of the stack.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
But we shouldn't add AggressiveInlining unless there's a strong performance-motivated scenario for doing so. Getting something to inline isn't a goal in and of itself.
(source: #12094 (comment))
For the microbenchmark this may be good, but as the call-sites get bigger, it may decrease perf in real-world scenarios.
I don't know what the right choice is...
I.e.
this would be "inlined" into every call-site
G_M60461_IG02:
8B4714 mov eax, dword ptr [rdi+20]
8B5718 mov edx, dword ptr [rdi+24]
488B4F08 mov rcx, gword ptr [rdi+8]
448B4108 mov r8d, dword ptr [rcx+8]
443BC2 cmp r8d, edx
7623 jbe SHORT G_M60461_IG04
443BC0 cmp r8d, eax
761E jbe SHORT G_M60461_IG04
4C63C8 movsxd r9, eax
4289748910 mov dword ptr [rcx+4*r9+16], esi
FFC0 inc eax
443BC0 cmp r8d, eax
7502 jne SHORT G_M60461_IG03
33C0 xor eax, eax
G_M60461_IG03:
894714 mov dword ptr [rdi+20], eax
FFC2 inc edx
895718 mov dword ptr [rdi+24], edx
FF471C inc dword ptr [rdi+28]
EB05 jmp SHORT G_M60461_IG05
G_M60461_IG04:
E810F5FFFF call MyQueue`1:EnqueueWithResize(int):this

(I put "inlined" in quotes because the JIT may optimize things, depending on the call-site.)
This is right. @stephentoub @jkotas when deciding where aggressive inlining is reasonable, do you have rules of thumb? E.g., known to be heavily called in hot paths + known to typically fail automatic inlining + not very large + significant help in a micro benchmark?
Oops, this dasm / comment belongs to Queue.
deciding where aggressive is reasonable, do you have rules of thumb?
My rule of thumb is if you're using it, you most likely shouldn't be. I generally only use it when it's a critical-path function, when inlining makes a very measurable impact on consumers (not just a microbenchmark of the method itself), and there's something about it that would fool the JIT's heuristics, such as a ton of IL but most of which will be shaken away from the resulting asm, e.g. lots of different branches for different generic type tests against various struct types, intrinsics usage, etc. I don't think this qualifies. I'm actually skeptical of the existing usage on List.Add.
This aspect of performance optimization is interesting. Is there documentation or research somewhere about it (explaining the issue on "real systems" outside of microbenchmarks), or is it only "experience"?
Inlining a method results in a copy of its code at the call site (subject to any follow-on optimizations that inlining might enable). With lots of call sites, that can significantly bloat code size. That increase in code size is not just an issue in and of itself (e.g. more memory consumption, etc.), but each copy of that code is fundamentally different, recognized as different code for an instruction cache; having many different copies of the same code can thus result in the instruction cache being much less efficient.
I've removed the use of AggressiveInlining on these. That guidance is good, thanks @stephentoub - I was purely running off what had been applied to List here.
@stephentoub thanks for the explanation. I wonder if there are "numbers" somewhere (maybe in the codebase or on the "internet"), or if we should take it as "best practice", similar to "removing this null check here is pointless because the branch predictor likely guesses correctly". I'm aware that this is a very slippery slope, but maybe you (and internal team researchers) have some real numbers.
This is highly dependent on actual code and actual call-sites. What numbers do you expect?
Research-wise this optimization domain is well covered -- you can find a lot of papers -- but the decision whether to inline or not comes back to the actual code. So there won't be any hard rules, just heuristics.
// Adds item to the tail of the queue.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Same comment as on Stack<T>.Push applies here.
For List<T>.Add aggressive inlining makes sense, as it is just a comparison for hot/cold-path, then on the hot-path an array access plus incrementing the version.
Here more work needs to be done, so the call-site gets quite a bit bigger.
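For readers following along, the hot/cold split being described looks roughly like this (an illustrative sketch, not the actual List&lt;T&gt; source):

```csharp
// Hot path: one unsigned bounds comparison, a store, and the bookkeeping.
public void Add(T item)
{
    _version++;
    T[] array = _items;
    int size = _size;
    if ((uint)size < (uint)array.Length)
    {
        _size = size + 1;
        array[size] = item;
    }
    else
    {
        AddWithResize(item); // cold path, kept out of the inlined body
    }
}
```

The point above is that for Queue and Stack the hot path carries more state (head/tail wrap-around, version), so the inlined copy at each call-site is correspondingly larger.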
I've removed AggressiveInlining as per the discussion above.
if (newCapacity < _array.Length + MinimumGrow)
{
    newCapacity = _array.Length + MinimumGrow;
Unfortunately the JIT fetches _array.Length on each use. So cache it, like:
[MethodImpl(MethodImplOptions.NoInlining)]
private void EnqueueWithResize(T item)
{
    int length = _array.Length;
    int newCapacity = length * 2;
    int lengthWithGrow = length + MinimumGrow;
    if (newCapacity < lengthWithGrow)
    {
        newCapacity = lengthWithGrow;
    }
    SetCapacity(newCapacity);
    // ...
Or use Math.Min:
int length = _array.Length;
int newCapacity = Math.Min(length * 2, length + MinimumGrow);
SetCapacity(newCapacity);
PS: I know it's the cold path, but reduced asm-size is also beneficial 😉
Unfortunately the JIT fetches _array.Length on each use.
For the benefit of others: the JIT must do this since it cannot predict access from other threads...
I didn't actually know that, thanks!
I didn't actually know that
The nice part of PRs. You improve the product and may learn something 😄
Hopefully my statement is correct then. @jkotas? I guess it wants to avoid a situation where the body runs indefinitely repeatedly checking the field for some expected external change that actually already happened.
Nothing to do with multiple threads. It is a missed optimization. There is nothing fundamental preventing the JIT from doing the transformation of caching the length in locals that you have done manually.
lengthWithGrow is not strictly needed with current JIT. The JIT is able to do the transformation for you.
Or use
Math.Min
You probably meant Math.Max. That is the most sensible and readable way to write this. It should generate about as good code as the manually inlined version.
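Putting the suggestions together, the resize helper would then read something like this (a sketch assuming the existing SetCapacity method and MinimumGrow constant; the enqueue tail is elided):

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
private void EnqueueWithResize(T item)
{
    // Cache the length so the JIT emits a single field/length fetch.
    int length = _array.Length;

    // Grow by 2x, but never by less than MinimumGrow elements.
    int newCapacity = Math.Max(length * 2, length + MinimumGrow);
    SetCapacity(newCapacity);

    // ... store item at _tail and update _tail/_size/_version as before.
}
```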
lengthWithGrow is not strictly needed with current JIT.
Hm, I saw the JIT emitting two leas for length + MinimumGrow, one in the if and one in the assignment. That's why I did it manually.
You probably meant
Math.Max
Oh, of course.
Made some updates here based on feedback
GrowFactor is not used
BTW:

private const int GrowFactor = 200; // double each time
// ...
int newcapacity = (int)((long)_array.Length * (long)GrowFactor / 100);

Why is it written that way? Instead of just:

private const long GrowFactor = 2; // double each time
// ...
int newcapacity = (int)((long)_array.Length * GrowFactor);

So we'd still have the same overflow behavior as before.
See https://github.com/dotnet/corefx/issues/35515 for a possible answer

This looks over-engineered to me. You should be able to write it like this: and it will have the exact same behavior wrt. overflows as
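A small check of the overflow claim (values chosen so that doubling exceeds int.MaxValue; the default unchecked truncation from long to int wraps both forms the same way):

```csharp
int length = int.MaxValue / 2 + 1;               // 1073741824

// Old form: scale by GrowFactor = 200, divide by 100, in 64-bit.
int oldForm = (int)((long)length * 200 / 100);

// Simplified form: multiply by 2 in 64-bit, then truncate.
int newForm = (int)((long)length * 2L);

// Both truncate 2^31 to the same (negative) int, so the overflow
// behavior is unchanged.
Console.WriteLine(oldForm == newForm);           // True
```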
Removed unused GrowFactor member
Removed Aggressive Inlining
Other minor improvements
}
else
{
    EnqueueWithResize(item);
What's the benefit to separating this out into a non-inlineable method? Is Enqueue inlined with this separated out? Does separating it out make the JIT better at optimizing the rest of the code?
What's the benefit ...
When Enqueue was inlined, it was a hot/cold-path-split. But this doesn't apply anymore.
Is Enqueue inlined with this separated out?
No. At least not now, as AggressiveInlining was removed.
When written as
public void Enqueue(T item)
{
    int tail = _tail;
    int size = _size;
    T[] array = _array;
    if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
    {
        array[tail] = item;
        _tail = MoveNext(ref tail, array);
        _size = size + 1;
        _version++;
    }
    else
    {
        int length = array.Length;
        int newCapacity = Math.Max(length * 2, length + MinimumGrow);
        SetCapacity(newCapacity);
        _array[_tail] = item;
        MoveNext(ref _tail);
        _size++;
        _version++;
    }
}

(so just manually inlined EnqueueWithResize) and with the _version++ in each branch the asm becomes:
G_M9246_IG01:
55 push rbp
4156 push r14
53 push rbx
488D6C2410 lea rbp, [rsp+10H]
488BDF mov rbx, rdi
448BF6 mov r14d, esi
G_M9246_IG02:
8B7B14 mov edi, dword ptr [rbx+20]
8B7318 mov esi, dword ptr [rbx+24]
488B4308 mov rax, gword ptr [rbx+8]
8B5008 mov edx, dword ptr [rax+8]
3BD6 cmp edx, esi
7624 jbe SHORT G_M9246_IG05
3BD7 cmp edx, edi
7620 jbe SHORT G_M9246_IG05
4863CF movsxd rcx, edi
4489748810 mov dword ptr [rax+4*rcx+16], r14d
FFC7 inc edi
3BD7 cmp edx, edi
7502 jne SHORT G_M9246_IG03
33FF xor edi, edi
G_M9246_IG03:
897B14 mov dword ptr [rbx+20], edi
FFC6 inc esi
897318 mov dword ptr [rbx+24], esi
FF431C inc dword ptr [rbx+28]
G_M9246_IG04:
5B pop rbx
415E pop r14
5D pop rbp
C3 ret
G_M9246_IG05:
8BF2 mov esi, edx
D1E6 shl esi, 1
83C204 add edx, 4
3BF2 cmp esi, edx
7D02 jge SHORT G_M9246_IG06
EB02 jmp SHORT G_M9246_IG07
G_M9246_IG06:
8BD6 mov edx, esi
G_M9246_IG07:
488BFB mov rdi, rbx
8BF2 mov esi, edx
E8D3F1FFFF call Queue`1:SetCapacity(int):this
488B4308 mov rax, gword ptr [rbx+8]
8B7B14 mov edi, dword ptr [rbx+20]
8B7008 mov esi, dword ptr [rax+8]
3BFE cmp edi, esi
7323 jae SHORT G_M9246_IG10
4863FF movsxd rdi, edi
448974B810 mov dword ptr [rax+4*rdi+16], r14d
488D4314 lea rax, bword ptr [rbx+20]
8B38 mov edi, dword ptr [rax]
FFC7 inc edi
3BF7 cmp esi, edi
7502 jne SHORT G_M9246_IG08
33FF xor edi, edi
G_M9246_IG08:
8938 mov dword ptr [rax], edi
FF4318 inc dword ptr [rbx+24]
FF431C inc dword ptr [rbx+28]
G_M9246_IG09:
5B pop rbx
415E pop r14
5D pop rbp
C3 ret
G_M9246_IG10:
E80DD9B678 call CORINFO_HELP_RNGCHKFAIL
CC int3

So there is a "fast-return" at G_M9246_IG04 (with the same asm as with a separate EnqueueWithResize), then followed by the "resize"-part. I think this code is good and an explicit EnqueueWithResize can be omitted.
Note: if _version++ is at the end of the method, then there is just one single ret, which may slow down the fast-path.
Alternatively _version++ could be moved to the top of the method, so asm becomes
G_M9246_IG02:
FF431C inc dword ptr [rbx+28] ; _version++
8B7B14 mov edi, dword ptr [rbx+20]
8B7318 mov esi, dword ptr [rbx+24]
488B4308 mov rax, gword ptr [rbx+8]
8B5008 mov edx, dword ptr [rax+8]
3BD6 cmp edx, esi
7621 jbe SHORT G_M9246_IG05
; ...
(with the same asm as with a separate
EnqueueWithResize)
I cheated a bit. In the prolog and epilog there's a difference:
G_M9243_IG01:
55 push rbp
- 488BEC mov rbp, rsp
- 90 nop
+ 4156 push r14
+ 53 push rbx
+ 488D6C2410 lea rbp, [rsp+10H]
+ 488BDF mov rbx, rdi
+ 448BF6 mov r14d, esi
; ...
G_M9246_IG04:
+ 5B pop rbx
+ 415E pop r14
5D pop rbp
C3 ret
; ...

Don't know if this makes much difference -- one may need to measure it, but I assume it's within noise.
This also follows from Stack.Push, which splits those out as well:
// Non-inline from Stack.Push to improve its code quality as uncommon path (added in #26086)
Perhaps I was misunderstanding the purpose behind that in the first place. Even without AggressiveInlining the choice may be made to inline it, so separating out the cold path is beneficial. Any guidance here is definitely appreciated because I might have totally misunderstood this.
@Wraiyth you haven't misunderstood anything. It's a usual part of the review process to question and reason about every change -- even when the change is taken over from an accepted PR -- so it's good to re-think some decisions (made earlier, or made now). That's the case here.
_tail = (_size == capacity) ? 0 : _size;
_version++;
}
private void MoveNext(ref int index) => MoveNext(ref index, _array);
Nit: needs a blank line above this. That said, are there many other call sites to this? Why not just use _array at those call sites rather than adding this wrapper?
My idea was #36502 (comment)
I see there are other places that could use this too.
- Dequeue
- TryDequeue
In these, array and head are already in registers, so there's no need to refetch them from memory. That's where MoveNext(ref int, T[]) comes into play.
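The overload pair being discussed would look something like this (a sketch; the array-taking overload lets call sites that already have _array in a register pass it in):

```csharp
// Increments index with wrap-around at array.Length (ring buffer).
private static void MoveNext(ref int index, T[] array)
{
    int tmp = index + 1;
    if (tmp == array.Length)
    {
        tmp = 0;
    }
    index = tmp;
}

// Convenience wrapper for call sites that don't have _array cached.
private void MoveNext(ref int index) => MoveNext(ref index, _array);
```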
int size = _size;
T[] array = _array;

if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
doesn't (uint)tail < (uint)array.Length just move the check to C# code?
Correct. The only mini-advantage is that the JITed code size is a few bytes smaller.
G_M9243_IG02:
FF471C inc dword ptr [rdi+28]
8B4714 mov eax, dword ptr [rdi+20]
8B5718 mov edx, dword ptr [rdi+24]
488B4F08 mov rcx, gword ptr [rdi+8]
448B4108 mov r8d, dword ptr [rcx+8]
443BC2 cmp r8d, edx
7620 jbe SHORT G_M9243_IG05
- 443BC0 cmp r8d, eax
- 761B jbe SHORT G_M9243_IG05
+ 413BC0 cmp eax, r8d
+ 7323 jae SHORT G_M9243_IG07
4C63C8 movsxd r9, eax
4289748910 mov dword ptr [rcx+4*r9+16], esi
FFC0 inc eax
443BC0 cmp r8d, eax
7502 jne SHORT G_M9243_IG03
33C0 xor eax, eax
G_M9243_IG03:
894714 mov dword ptr [rdi+20], eax
FFC2 inc edx
895718 mov dword ptr [rdi+24], edx
G_M9243_IG04:
5D pop rbp
C3 ret
G_M9243_IG05:
E8E0F4FFFF call Queue`1:EnqueueWithResize(int):this
90 nop
G_M9243_IG06:
5D pop rbp
C3 ret
+G_M9243_IG07:
+ E8882CB778 call CORINFO_HELP_RNGCHKFAIL
+ CC int3
-; Total bytes of code 67, prolog size 5 for method Queue`1:Enqueue(int):this
+; Total bytes of code 73, prolog size 5 for method Queue`1:Enqueue(int):this
; ============================================================

I think this can be reverted, thus improving readability of the code.
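For reference, the pattern under discussion is the usual unsigned-comparison idiom for proving an index in range (a generic sketch, not this PR's exact code):

```csharp
// For a non-negative tail, (uint)tail < (uint)array.Length implies
// 0 <= tail < array.Length, so the subsequent indexing cannot fail.
// Whether the JIT then drops its own range check depends on whether
// it can see the proof, as the asm diff above shows.
if ((uint)tail < (uint)array.Length)
{
    array[tail] = item;
}
```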
@Wraiyth master is now open for 5.0, any update on this PR?
I'm going to close this for age. Thanks for the contribution @Wraiyth. If you want to pick this up again, feel free to create a new PR. If you do it would be good to have benchmark results from perf tests for Queue and Stack (https://github.com/dotnet/performance/). Thanks again.
Addresses #36405
This updates Stack&lt;T&gt;.Push and Queue&lt;T&gt;.Enqueue to have similar optimizations and structure to List&lt;T&gt;.Add.