This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation


@Wraiyth Wraiyth commented Mar 30, 2019

Addresses #36405

This updates Stack.Push and Queue.Enqueue to have optimizations and structure similar to List.Add.

Performance improvements to Queue<T>.Enqueue and Stack<T>.Push

dnfclas commented Mar 30, 2019

CLA assistant check
All CLA requirements met.

}

// Pushes an item to the top of the stack.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

But we shouldn't add AggressiveInlining unless there's a strong performance-motivated scenario for doing so. Getting something to inline isn't a goal in and of itself.

(source: #12094 (comment))

For the microbenchmark this may be good, but as the call-sites get bigger, it may decrease perf in real world scenarios.

I don't know what the right choice is here...

Member

I.e.

this would be "inlined" into every call-site
G_M60461_IG02:
       8B4714               mov      eax, dword ptr [rdi+20]
       8B5718               mov      edx, dword ptr [rdi+24]
       488B4F08             mov      rcx, gword ptr [rdi+8]
       448B4108             mov      r8d, dword ptr [rcx+8]
       443BC2               cmp      r8d, edx
       7623                 jbe      SHORT G_M60461_IG04
       443BC0               cmp      r8d, eax
       761E                 jbe      SHORT G_M60461_IG04
       4C63C8               movsxd   r9, eax
       4289748910           mov      dword ptr [rcx+4*r9+16], esi
       FFC0                 inc      eax
       443BC0               cmp      r8d, eax
       7502                 jne      SHORT G_M60461_IG03
       33C0                 xor      eax, eax

G_M60461_IG03:
       894714               mov      dword ptr [rdi+20], eax
       FFC2                 inc      edx
       895718               mov      dword ptr [rdi+24], edx
       FF471C               inc      dword ptr [rdi+28]
       EB05                 jmp      SHORT G_M60461_IG05

G_M60461_IG04:
       E810F5FFFF           call     MyQueue`1:EnqueueWithResize(int):this

(" " because the JIT may optimize something, depending on the call-site).

Member
@danmoseley danmoseley Mar 30, 2019

This is right. @stephentoub @jkotas when deciding where aggressive is reasonable, do you have rules of thumb? E.g., known to be heavily called in hot paths + known to typically fail automatic inlining + not very large + significant help in micro benchmarks?

Member

Oops, this dasm / comment belongs to Queue.

Member

deciding where aggressive is reasonable, do you have rules of thumb?

My rule of thumb is if you're using it, you most likely shouldn't be. I generally only use it when it's a critical-path function, when inlining makes a very measurable impact on consumers (not just a microbenchmark of the method itself), and there's something about it that would fool the JIT's heuristics, such as a ton of IL but most of which will be shaken away from the resulting asm, e.g. lots of different branches for different generic type tests against various struct types, intrinsics usage, etc. I don't think this qualifies. I'm actually skeptical of the existing usage on List.Add.

Member

This aspect of performance optimization is interesting. Is there documentation or research somewhere about it (explaining the issue on "real systems", outside of microbenchmarks), or is it only "experience"?

Member
@stephentoub stephentoub Mar 30, 2019

Inlining a method results in a copy of its code at the call site (subject to any follow-on optimizations that inlining might enable). With lots of call sites, that can significantly bloat code size. That increase in code size is not just an issue in and of itself (e.g. more memory consumption, etc.), but each copy of that code is fundamentally different, recognized as different code for an instruction cache; having many different copies of the same code can thus result in the instruction cache being much less efficient.

Author

I've removed the use of AggressiveInlining on these. That guidance is good, thanks @stephentoub - I was purely running off what had been applied to List here.

Member

@stephentoub thanks for the explanation. I wonder if there are "numbers" somewhere (maybe in the codebase or on the internet), or if we should take it as a "best practice", similar to "removing the null check here is useless because the branch predictor likely guesses right". I'm aware that this is a very slippery slope, but maybe you (and internal team researchers) have some real numbers.

Member

This is highly dependent on the actual code and actual call-sites. What numbers do you expect?
Research-wise this optimization domain is well covered -- you can find a lot of papers -- but the decision whether or not to inline comes back to the actual code. So there won't be any hard rules, just heuristics.

}

// Adds item to the tail of the queue.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

Same comment as on Stack<T>.Push applies here.
For List<T>.Add aggressive inlining makes sense, as it is just a comparison for hot/cold-path, then on the hot-path an array access plus incrementing the version.

Here more work needs to be done, so the call-site gets quite a bit bigger.

Author

I've removed AggressiveInlining as per the discussion above.


if (newCapacity < _array.Length + MinimumGrow)
{
newCapacity = _array.Length + MinimumGrow;
Member

Unfortunately the JIT fetches _array.Length on each use. So cache it, like:

[MethodImpl(MethodImplOptions.NoInlining)]
private void EnqueueWithResize(T item)
{
    int length = _array.Length;
    int newCapacity = length * 2;
    int lengthWithGrow = length + MinimumGrow;

    if (newCapacity < lengthWithGrow)
    {
        newCapacity = lengthWithGrow;
    }

    SetCapacity(newCapacity);
    // ...

Member

Or use Math.Min:

int length = _array.Length;
int newCapacity = Math.Min(length * 2, length + MinimumGrow);

SetCapacity(newCapacity);

Member

PS: I know it's the cold path, but reduced asm-size is also beneficial 😉

Member

Unfortunately the JIT fetches _array.Length on each use.

For the benefit of others: the JIT must do this since it cannot predict access from other threads...

Author

I didn't actually know that, thanks!

Member

I didn't actually know that

The nice part of PRs. You improve the product and may learn something 😄

Member

Hopefully my statement is correct then. @jkotas? I guess it wants to avoid a situation where the body runs indefinitely, repeatedly checking the field for some expected external change that has actually already happened.

Member

Nothing to do with multiple threads. It is a missed optimization. There is nothing fundamental preventing the JIT from doing the transformation of caching the length in locals that you have done manually.

lengthWithGrow is not strictly needed with current JIT. The JIT is able to do the transformation for you.

Or use Math.Min

You probably meant Math.Max. That is the most sensible and readable way to write this. It should generate about as good code as the manually inlined version.
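For illustration, a minimal standalone sketch (not the corefx source; it assumes `MinimumGrow = 4`, the constant Queue<T> uses) of the Math.Max form, showing why plain doubling alone is not enough for very small arrays:

```csharp
using System;

class CapacityDemo
{
    // Assumed constant; Queue<T> uses MinimumGrow = 4.
    const int MinimumGrow = 4;

    // The Math.Max form suggested above: double the length,
    // but never grow by fewer than MinimumGrow elements.
    public static int NewCapacity(int length) =>
        Math.Max(length * 2, length + MinimumGrow);

    static void Main()
    {
        Console.WriteLine(NewCapacity(0)); // 4  -- doubling alone would stay at 0
        Console.WriteLine(NewCapacity(3)); // 7  -- length + MinimumGrow wins
        Console.WriteLine(NewCapacity(8)); // 16 -- doubling wins
    }
}
```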

Member

lengthWithGrow is not strictly needed with current JIT.

Hm, I saw the JIT emitting two leas for length + MinimumGrow, one in the if and one in the assignment. That's why I did it manually.

You probably meant Math.Max

Oh, of course.

Author

Made some updates here based on feedback

@danmoseley
Member

GrowFactor is not used

Member

gfoidl commented Mar 30, 2019

GrowFactor is not used

BTW:

private const int GrowFactor = 200; // double each time
// ...
int newcapacity = (int)((long)_array.Length * (long)GrowFactor / 100);

Why is it written that way? Instead of just

private const long GrowFactor = 2; // double each time
// ...
int newcapacity = (int)((long)_array.Length * GrowFactor);

So we still have the same overflow behavior as before.
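As an aside, a small standalone check (not part of the PR) that the simplified form keeps the same overflow behavior: both versions widen to long before multiplying, so only the final cast back to int truncates, and it truncates identically:

```csharp
using System;

class GrowFactorDemo
{
    const int GrowFactor = 200; // double each time (original form)

    // Original formulation: multiply in long, divide by 100, truncate to int.
    public static int Original(int length) =>
        (int)((long)length * (long)GrowFactor / 100);

    // Simplified formulation: multiply by 2 in long, truncate to int.
    public static int Simplified(int length) =>
        (int)((long)length * 2);

    static void Main()
    {
        // Both agree everywhere, including where doubling overflows int.
        foreach (int length in new[] { 4, 1_000_000, int.MaxValue / 2 + 1, int.MaxValue })
            Console.WriteLine($"{length}: {Original(length)} == {Simplified(length)}");
    }
}
```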

@Tornhoof
Contributor

See https://github.com/dotnet/corefx/issues/35515 for a possible answer

Member

jkotas commented Mar 30, 2019

int newcapacity = (int)((long)_array.Length * GrowFactor);

This looks over-engineered to me. You should be able to write it like this:

private const long GrowFactor = 2;
int newcapacity = (int)(_array.Length * GrowFactor);

and it will have the exact same behavior wrt. overflows as int newcapacity = (int)((long)_array.Length * (long)GrowFactor / 100);

Removed unused GrowFactor member
Removed Aggressive Inlining
Other minor improvements
}
else
{
EnqueueWithResize(item);
Member
@stephentoub stephentoub Mar 31, 2019

What's the benefit to separating this out into a non-inlineable method? Is Enqueue inlined with this separated out? Does separating it out make the JIT better at optimizing the rest of the code?

Member
@gfoidl gfoidl Apr 1, 2019

What's the benefit ...

When Enqueue was inlined, it was a hot/cold-path-split. But this doesn't apply anymore.

Is Enqueue inlined with this separated out?

No. At least not now, as AggressiveInlining was removed.

When written as

public void Enqueue(T item)
{
    int tail = _tail;
    int size = _size;
    T[] array = _array;

    if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
    {
        array[tail] = item;
        _tail = MoveNext(ref tail, array);
        _size = size + 1;
        _version++;
    }
    else
    {
        int length = array.Length;
        int newCapacity = Math.Max(length * 2, length + MinimumGrow);

        SetCapacity(newCapacity);

        _array[_tail] = item;
        MoveNext(ref _tail);
        _size++;
        _version++;
    }
}

(so just manually inlined EnqueueWithResize) and with the _version++ in each branch the asm becomes:

G_M9246_IG01:
       55                   push     rbp
       4156                 push     r14
       53                   push     rbx
       488D6C2410           lea      rbp, [rsp+10H]
       488BDF               mov      rbx, rdi
       448BF6               mov      r14d, esi

G_M9246_IG02:
       8B7B14               mov      edi, dword ptr [rbx+20]
       8B7318               mov      esi, dword ptr [rbx+24]
       488B4308             mov      rax, gword ptr [rbx+8]
       8B5008               mov      edx, dword ptr [rax+8]
       3BD6                 cmp      edx, esi
       7624                 jbe      SHORT G_M9246_IG05
       3BD7                 cmp      edx, edi
       7620                 jbe      SHORT G_M9246_IG05
       4863CF               movsxd   rcx, edi
       4489748810           mov      dword ptr [rax+4*rcx+16], r14d
       FFC7                 inc      edi
       3BD7                 cmp      edx, edi
       7502                 jne      SHORT G_M9246_IG03
       33FF                 xor      edi, edi

G_M9246_IG03:
       897B14               mov      dword ptr [rbx+20], edi
       FFC6                 inc      esi
       897318               mov      dword ptr [rbx+24], esi
       FF431C               inc      dword ptr [rbx+28]

G_M9246_IG04:
       5B                   pop      rbx
       415E                 pop      r14
       5D                   pop      rbp
       C3                   ret      

G_M9246_IG05:
       8BF2                 mov      esi, edx
       D1E6                 shl      esi, 1
       83C204               add      edx, 4
       3BF2                 cmp      esi, edx
       7D02                 jge      SHORT G_M9246_IG06
       EB02                 jmp      SHORT G_M9246_IG07

G_M9246_IG06:
       8BD6                 mov      edx, esi

G_M9246_IG07:
       488BFB               mov      rdi, rbx
       8BF2                 mov      esi, edx
       E8D3F1FFFF           call     Queue`1:SetCapacity(int):this
       488B4308             mov      rax, gword ptr [rbx+8]
       8B7B14               mov      edi, dword ptr [rbx+20]
       8B7008               mov      esi, dword ptr [rax+8]
       3BFE                 cmp      edi, esi
       7323                 jae      SHORT G_M9246_IG10
       4863FF               movsxd   rdi, edi
       448974B810           mov      dword ptr [rax+4*rdi+16], r14d
       488D4314             lea      rax, bword ptr [rbx+20]
       8B38                 mov      edi, dword ptr [rax]
       FFC7                 inc      edi
       3BF7                 cmp      esi, edi
       7502                 jne      SHORT G_M9246_IG08
       33FF                 xor      edi, edi

G_M9246_IG08:
       8938                 mov      dword ptr [rax], edi
       FF4318               inc      dword ptr [rbx+24]
       FF431C               inc      dword ptr [rbx+28]

G_M9246_IG09:
       5B                   pop      rbx
       415E                 pop      r14
       5D                   pop      rbp
       C3                   ret      

G_M9246_IG10:
       E80DD9B678           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3    

So there is a "fast-return" at G_M9246_IG04 (with the same asm as with a separate EnqueueWithResize), followed by the "resize" part. I think this code is good and an explicit EnqueueWithResize can be omitted.

Note: if _version++ is at the end of the method, then there is just one single ret, which may slow down the fast-path.
Alternatively _version++ could be moved to the top of the method, so asm becomes

G_M9246_IG02:
       FF431C               inc      dword ptr [rbx+28]    ; _version++
       8B7B14               mov      edi, dword ptr [rbx+20]
       8B7318               mov      esi, dword ptr [rbx+24]
       488B4308             mov      rax, gword ptr [rbx+8]
       8B5008               mov      edx, dword ptr [rax+8]
       3BD6                 cmp      edx, esi
       7621                 jbe      SHORT G_M9246_IG05
; ...

Member

(with the same asm as with a separate EnqueueWithResize)

I cheated a bit. In the prolog and epilog there's a difference:

G_M9243_IG01:
       55                   push     rbp
-      488BEC               mov      rbp, rsp
-      90                   nop
+      4156                 push     r14
+      53                   push     rbx
+      488D6C2410           lea      rbp, [rsp+10H]
+      488BDF               mov      rbx, rdi
+      448BF6               mov      r14d, esi

; ...
G_M9246_IG04:
+      5B                   pop      rbx
+      415E                 pop      r14
       5D                   pop      rbp
       C3                   ret

; ...

Don't know if this makes much difference -- one may need to measure it, but I assume it's within noise.

Author

This also follows Stack.Push, which splits these out as well:
// Non-inline from Stack.Push to improve its code quality as uncommon path (added in #26086)

Perhaps I was misunderstanding the purpose behind that in the first place. Even without AggressiveInlining the JIT may still choose to inline it, so separating out the cold path is beneficial. Any guidance here is definitely appreciated, because I might have totally misunderstood this.

Member

@Wraiyth you haven't misunderstood anything. It's usual in the review process to question and reason about every change -- even when the change is taken over from an accepted PR -- so it's good to re-think some decisions (made earlier, or made now). That's the case here.

_tail = (_size == capacity) ? 0 : _size;
_version++;
}
private void MoveNext(ref int index) => MoveNext(ref index, _array);
Member

Nit: needs a blank line above this. That said, are there many other call sites to this? Why not just use _array at those call sites rather than adding this wrapper?

Member

My idea was #36502 (comment)

I see there are other places that could use this too.

  • Dequeue
  • TryDequeue

In these, array and head are already in registers, so there is no need to refetch them from memory. That's where MoveNext(ref, T[]) comes into play.

@karelz karelz added this to the 3.0 milestone Apr 1, 2019
int size = _size;
T[] array = _array;

if ((uint)size < (uint)array.Length && (uint)tail < (uint)array.Length)
Contributor

doesn't (uint)tail < (uint)array.Length just move the check to C# code?

Member

Correct. The only mini-advantage is that the JITed code size is a few bytes smaller.

G_M9243_IG02:
       FF471C               inc      dword ptr [rdi+28]
       8B4714               mov      eax, dword ptr [rdi+20]
       8B5718               mov      edx, dword ptr [rdi+24]
       488B4F08             mov      rcx, gword ptr [rdi+8]
       448B4108             mov      r8d, dword ptr [rcx+8]
       443BC2               cmp      r8d, edx
       7620                 jbe      SHORT G_M9243_IG05
-      443BC0               cmp      r8d, eax
-      761B                 jbe      SHORT G_M9243_IG05
+      413BC0               cmp      eax, r8d
+      7323                 jae      SHORT G_M9243_IG07
       4C63C8               movsxd   r9, eax
       4289748910           mov      dword ptr [rcx+4*r9+16], esi
       FFC0                 inc      eax
       443BC0               cmp      r8d, eax
       7502                 jne      SHORT G_M9243_IG03
       33C0                 xor      eax, eax

G_M9243_IG03:
       894714               mov      dword ptr [rdi+20], eax
       FFC2                 inc      edx
       895718               mov      dword ptr [rdi+24], edx

G_M9243_IG04:
       5D                   pop      rbp
       C3                   ret      

G_M9243_IG05:
       E8E0F4FFFF           call     Queue`1:EnqueueWithResize(int):this
       90                   nop      

G_M9243_IG06:
       5D                   pop      rbp
       C3                   ret      

+G_M9243_IG07:
+      E8882CB778           call     CORINFO_HELP_RNGCHKFAIL
+      CC                   int3     
-; Total bytes of code 67, prolog size 5 for method Queue`1:Enqueue(int):this
+; Total bytes of code 73, prolog size 5 for method Queue`1:Enqueue(int):this
; ============================================================

I think this can be reverted, thus improving readability of the code.
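For readers unfamiliar with the cast, a minimal standalone sketch (not the corefx source) of the unsigned-compare trick discussed here: casting a negative int to uint yields a huge value, so a single comparison covers both index < 0 and index >= length:

```csharp
using System;

class UintCheckDemo
{
    // Single unsigned compare replacing (index >= 0 && index < length).
    public static bool InRange(int index, int length) =>
        (uint)index < (uint)length;

    static void Main()
    {
        Console.WriteLine(InRange(-1, 8)); // False -- (uint)(-1) is 4294967295
        Console.WriteLine(InRange(0, 8));  // True
        Console.WriteLine(InRange(7, 8));  // True
        Console.WriteLine(InRange(8, 8));  // False
    }
}
```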

@karelz karelz modified the milestones: 3.0, 5.0 Jul 17, 2019
@adamsitnik adamsitnik added the tenet-performance Performance related issue label Jul 31, 2019
@maryamariyan

@Wraiyth master is now open for 5.0, any update on this PR?

@danmoseley
Member

I'm going to close this for age. Thanks for the contribution @Wraiyth. If you want to pick this up again, feel free to create a new PR. If you do, it would be good to have benchmark results from the perf tests for Queue and Stack (https://github.com/dotnet/performance/). Thanks again.

@danmoseley danmoseley closed this Aug 6, 2019
