Stack<T> optimization of (Try)Peek, (Try)Pop and Push #26086

gfoidl · 2017-12-28T13:59:12Z

Description

Enabled RCE on array-access and avoided the explicit check for _size == 0 --> this is done implicitly in the "RCE-if". So some cmps can be saved.

By Pop the effect on value types is not so big, than for reference types (two array accesses).

This PR is a kind of extension to https://github.com/dotnet/corefx/issues/17318

Benchmarks

Notes

Code for benchmarks lives here

Due the use of http://benchmarkdotnet.org the benchmarks were done a couple of times, because some crazy results with perf x2 were reported and this seems too strange. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.

Peek

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled	ScaledSD
Peek_Default	1.974 ns	0.1229 ns	0.2020 ns	1.00	0.00
Peek_New	1.600 ns	0.1231 ns	0.2023 ns	0.82	0.13

TryPeek

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Median	Scaled	ScaledSD
TryPeek_Default	5.212 ns	0.1905 ns	0.3977 ns	5.178 ns	1.00	0.00
TryPeek_New	3.982 ns	0.1716 ns	0.4640 ns	3.798 ns	0.77	0.11

Pop

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled	ScaledSD
Pop_Default	1.0609 ns	0.0673 ns	0.0630 ns	1.00	0.00
Pop_New	0.8681 ns	0.1036 ns	0.0918 ns	0.82	0.10

TryPop

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled	ScaledSD
TryPop_Default	3.180 ns	0.0644 ns	0.0571 ns	1.00	0.00
TryPop_New	2.736 ns	0.1200 ns	0.1123 ns	0.86	0.04

Notes

Push

I did some trials for Push too, but the results weren't satisfying. Sometimes it was faster, sometimes slower. On Linux it was nearly always slower. So I reverted the change.
Changes were done in the commit https://github.com/gfoidl/corefx/commit/012333094dd2b2052eaf6daebaaba2a98f25d2b1

Inlining

MethodImplOptions.AggressiveInlining would give a perf win on the benchmarks, because they are micro-benchmarks. Due to inlining the callsite gets bigger and in real world scenarios the perf might decrease. I ran a benchmark where this happended, unfortunately I can't show the code because it's proprietary.

From #12094 (comment)

But we shouldn't add AggressiveInlining unless there's a strong performance-motivated scenario for doing so. Getting something to inline isn't a goal in and of itself.

On the other side List.Add has AggressiveInlining -- cf. https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/Collections/Generic/List.cs#L225

I would stick to not aggressive inline, but what to do?

With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win.

I can't see a real win, sometimes it's faster and sometimes slower. On Linux the tendency is to be slower. Therefore the state as is will be remained.

gfoidl · 2017-12-28T15:13:41Z

A Test fails but this has nothing to do with this PR. What shall / can I do here?

benaadams · 2017-12-28T16:37:51Z

What shall / can I do here?

Record and retest

Windows.7.Amd64.Open-Release-x86

System.IO.Compression.DeflateStreamUnitTests/Dispose_WithUnfinishedWriteAsync

Assert.Throws() Failure
Expected: typeof(System.AggregateException)
Actual:   (No exception was thrown)

   at System.IO.Compression.CompressionStreamUnitTestBase.<Dispose_WithUnfinishedWriteAsync>d__5.MoveNext() in D:\j\workspace\windows-TGrou---f8ac6754\src\Common\tests\System\IO\Compression\CompressionStreamUnitTestBase.cs:line 208
--- End of stack trace from previous location where exception was thrown ---
--- End of stack trace from previous location where exception was thrown ---
--- End of stack trace from previous location where exception was thrown ---

@dotnet-bot test Windows x86 Release Build

gfoidl · 2017-12-28T16:41:12Z

@benaadams thx!

benaadams · 2017-12-28T16:50:53Z

I did some trials for Push too, but the results weren't satisfying.

In the linked code the test if ((uint)size >= (uint)array.Length) doesn't elide the bounds check on the array as the array itself can change (as it returns to the access code after resizing)

gfoidl · 2017-12-28T17:01:50Z

On the dasm I saw that RCE is not applied, for that reason.
I don't know any way how to make this happen. Except the JIT could trace the new array size (the one from Array.Resize), but I'm afraid this kind of flow analysis is going to far.

benaadams · 2017-12-28T17:03:40Z

Move the Resize path completely out of flow?

if
   // ResizeAndPush(val)
else
   // rest of function

gfoidl · 2017-12-28T17:06:32Z

Ah good idea -- one just has to think about it, and that's not always that easy 😉
I'll try it.

karelz · 2017-12-28T17:56:19Z

cc @valenis

benaadams · 2017-12-28T18:54:00Z

Issue raised for Deflate fail https://github.com/dotnet/corefx/issues/26089

benaadams · 2017-12-28T19:19:39Z

src/System.Collections/src/System/Collections/Generic/Stack.cs

+            T[] array = _array;
+
+            // if (_size == 0) is equivalent to if (size == -1), and this case
+            // is covered with (uint)size, thus allowing RCE https://github.com/dotnet/coreclr/pull/9773


Rather than RCE call it "range check elimination" or "bounds check elimination"

RCE more commonly stands for "remote code execution" which will make security people twitchy having it in a comment. Especially since it says it allows RCE 😉

Oh, I didn't know that. Makes absoluteley sense, and I'll update this.

gfoidl · 2017-12-28T19:20:48Z

@benaadams now your approve was a minute to fast 😄

Push

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.30GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	InitialCapacity	Mean	Error	StdDev	Scaled	ScaledSD
Push_Default	0	0.4163 ns	0.0180 ns	0.0150 ns	1.00	0.00
Push_New	0	0.3121 ns	0.0141 ns	0.0132 ns	0.75	0.04
Push_Default	10000	0.4324 ns	0.0209 ns	0.0185 ns	1.00	0.00
Push_New	10000	0.3306 ns	0.0173 ns	0.0153 ns	0.77	0.05

Cf. #26086 (comment)

benaadams · 2017-12-28T19:27:58Z

Nice! - LGTM

gfoidl · 2017-12-28T19:40:51Z

Thank you!

mikedn · 2017-12-29T14:14:23Z

src/System.Collections/src/System/Collections/Generic/Stack.cs

        {
-            if (_size == 0)
+            int size = _size - 1;
+            T[] array = _array;


Is this copy really necessary?

With the copy T[] array = _array it gets:

000007fe`7bafc5d4 8b4110 mov eax,dword ptr [rcx+10h] 000007fe`7bafc5d7 ffc8 dec eax 000007fe`7bafc5d9 488b5108 mov rdx,qword ptr [rcx+8] 000007fe`7bafc5dd 448b4208 mov r8d,dword ptr [rdx+8] 000007fe`7bafc5e1 443bc0 cmp r8d,eax 000007fe`7bafc5e4 760c jbe 000007fe`7bafc5f2 000007fe`7bafc5e6 4863c0 movsxd rax,eax 000007fe`7bafc5e9 8b448210 mov eax,dword ptr [rdx+rax*4+10h] 000007fe`7bafc5ed 4883c428 add rsp,28h 000007fe`7bafc5f1 c3 ret

Without the copy and

if ((uint)size >= (uint)_array.Length) ... return _array[size];

it gets:

000007fe`7bb0c5d4 8b4110 mov eax,dword ptr [rcx+10h] 000007fe`7bb0c5d7 ffc8 dec eax 000007fe`7bb0c5d9 488b5108 mov rdx,qword ptr [rcx+8] 000007fe`7bb0c5dd 448b4208 mov r8d,dword ptr [rdx+8] 000007fe`7bb0c5e1 443bc0 cmp r8d,eax 000007fe`7bb0c5e4 760c jbe 000007fe`7bb0c5f2 000007fe`7bb0c5e6 4863c0 movsxd rax,eax 000007fe`7bb0c5e9 8b448210 mov eax,dword ptr [rdx+rax*4+10h] 000007fe`7bb0c5ed 4883c428 add rsp,28h 000007fe`7bb0c5f1 c3 ret

So basically the same code. Hence the copy can be avoided.

It seems as the JIT elides bound checks even on field access. Is this new? (and is this safe?)

It seems as the JIT elides bound checks even on field access. Is this new? (and is this safe?)

It's safe in the sense that the JIT generates correct code. Note that there's only one field load in the generated code. CSE removed the second one so in effect it did the same optimization that you did manually.

The fact that the range check is actually removed might be relatively new, in the past the JIT did not propagate existing analysis information to variables created by CSE and this usually prevented further optimizations. Some improvements in this area were done in dotnet/coreclr#9615

In general I'd recommend avoiding multiple field loads when arrays are involved, CSE may not be able to always eliminate redundant loads and even if it does there may still be cases where it doesn't work as well as manual optimization.

Still, in trivial cases such as Peek it should work fine and thus there's little reason to complicate the code.

Thanks for the great reply. 👍

I'll update the PR for this change.

For reference in TryPeek this JIT-optimization doesn't apply:

With Copy

000007fe`7b86c610 8b4110 mov eax,dword ptr [rcx+10h] 000007fe`7b86c613 448d40ff lea r8d,[rax-1] 000007fe`7b86c617 488b4908 mov rcx,qword ptr [rcx+8] 000007fe`7b86c61b 44394108 cmp dword ptr [rcx+8],r8d 000007fe`7b86c61f 7705 ja 000007fe`7b86c626 000007fe`7b86c621 33c0 xor eax,eax 000007fe`7b86c623 8902 mov dword ptr [rdx],eax 000007fe`7b86c625 c3 ret 000007fe`7b86c626 4963c0 movsxd rax,r8d 000007fe`7b86c629 8b448110 mov eax,dword ptr [rcx+rax*4+10h] 000007fe`7b86c62d 8902 mov dword ptr [rdx],eax 000007fe`7b86c62f b801000000 mov eax,1

Without Copy

000007fe`7b87c610 8b4110 mov eax,dword ptr [rcx+10h] 000007fe`7b87c613 448d40ff lea r8d,[rax-1] 000007fe`7b87c617 488b4108 mov rax,qword ptr [rcx+8] 000007fe`7b87c61b 8b4008 mov eax,dword ptr [rax+8] 000007fe`7b87c61e 413bc0 cmp eax,r8d 000007fe`7b87c621 7705 ja 000007fe`7b87c628 000007fe`7b87c623 33c0 xor eax,eax 000007fe`7b87c625 8902 mov dword ptr [rdx],eax 000007fe`7b87c627 c3 ret 000007fe`7b87c628 488b4108 mov rax,qword ptr [rcx+8] 000007fe`7b87c62c 4963c8 movsxd rcx,r8d 000007fe`7b87c62f 8b448810 mov eax,dword ptr [rax+rcx*4+10h] 000007fe`7b87c633 8902 mov dword ptr [rdx],eax 000007fe`7b87c635 b801000000 mov eax,1

Hmm, interesting. The range check seems to have been eliminated but there are 2 loads of the _array field. That doesn't seem right, I'll have to look into this.

I'm not yet sure what's going on but the generated code is definitely incorrect. Created https://github.com/dotnet/coreclr/issues/15671

As of PR-feedback #26086 (comment)

Benchmark resulst are not updated, because the same JIT-code is generated.

mikedn · 2018-01-12T18:18:11Z

The JIT bug that results in incorrect range check elimination has been fixed in dotnet/coreclr#15756. Unfortunately the fix makes range check elimination more conservative when fields are involved so you'll need to add back local variables for _array.

stephentoub · 2018-01-12T19:01:05Z

Unfortunately the fix makes range check elimination more conservative when fields are involve

@mikedn, did you happen to notice if there were any existing places in coreclr/corefx that regressed as a result and that we should also fix?

mikedn · 2018-01-12T22:26:07Z

@stephentoub I haven't looked too closely at the jit-diff. I'll check, fortunately it's small enough to look through it.

stephentoub · 2018-01-12T23:20:28Z

Thanks, @mikedn!

mikedn · 2018-01-13T11:13:42Z

did you happen to notice if there were any existing places in coreclr/corefx that regressed as a result and that we should also fix?

It looks like there are very few cases where we now get new range checks as a result of the JIT fix. I haven't looked at all corefx (the diffs is small but there are so many small and similar changes that are not related to range check elimination that after a while it gets boring) but the very few range checks I have found do not seem very interesting:

BitArray's constructor has a loop involving an array field. BitArray has many other CQ issues so it's not worth going only after that little loop.
RegexMatch.Reset has a similar loop. I'm not familiar with that code but I doubt that its perf is relevant.
BigInteger.Equals has some extra range checks. That code is not in a loop so I doubt that it makes a difference.

The common regression is not related to range checks but to redundant load elimination via CSE. Here's an example from one of Stack<T>.Push method:

--- D:\Projects\diffs\dasmset_13\diff\System.Collections.dasm	2018-01-13 09:19:01.000000000 +-0200
+++ D:\Projects\diffs\dasmset_13\base\System.Collections.dasm	2018-01-13 09:19:00.000000000 +-0200
@@ -50779,12 +50768,12 @@
        mov      ecx, dword ptr [rsi+24]
        mov      rax, gword ptr [rsi+8]
-       mov      eax, dword ptr [rax+8] ; load array length in a register
-       cmp      ecx, eax
+       cmp      ecx, dword ptr [rax+8] ; array length
        jne      SHORT G_M62838_IG05
        lea      rcx, bword ptr [rsi+8]
-       test     eax, eax
+       cmp      dword ptr [rax+8], 0 ; array length again, previously the value in eax was reused
        je       SHORT G_M62838_IG03
        mov      rbx, rcx
-       mov      ebp, eax
+       mov      ecx, dword ptr [rax+8] ; array length again, previously the value in eax was reused
+       mov      ebp, ecx
        shl      ebp, 1
        jmp      SHORT G_M62838_IG04

Do such regressions matter? I don't know, I haven't attempted to benchmark this code. It may put up a show in a micro-benchmark but I doubt that it will measurable impact the performance of a real world application, unless that application happens to spend 99% of the time in code affected by this regression 😄

stephentoub · 2018-01-13T12:48:34Z

Gotcha. Thanks for looking!

Cf. #26086 (comment)

gfoidl · 2018-01-13T13:40:33Z

Good (bug fixed) and bad (conservative range checks) 😉
I've updated the PR.

karelz · 2018-01-27T18:34:10Z

@stephentoub @mikedn are you ok with merging the change?

mikedn · 2018-01-27T18:47:39Z

LGTM

stephentoub · 2018-01-29T14:56:01Z

src/System.Collections/src/System/Collections/Generic/Stack.cs

-                Array.Resize(ref _array, (_array.Length == 0) ? DefaultCapacity : 2 * _array.Length);
+                array[size] = item;
+                _version++;
+                _size++;


Is this better than _size = size + 1;?

_size++ results in

inc dword ptr [rdi+24]

_size = size + 1 results in

inc eax mov dword ptr [rdi+24], eax

Have you checked if there's any difference in performance? It may be a single instruction but it has a memory load in it so it's not like it's as cheap as the other version.

I just run some benchmarks on this. It's hard to say which version is faster, because the difference in the magnitude of the noise (several executions). But _size = size + 1 has the tendency to show better numbers. So I'll update the PR to use _size = size + 1.

Cf. #26086 (comment)

gfoidl · 2018-01-30T09:25:56Z

Windows.7.Amd64.Open-Debug-x64

Unhandled Exception of Type Xunit.Sdk.EqualException

Assert.Equal() Failure
Expected: NotSignatureValid
Actual:   NotSignatureValid, UntrustedRoot

   at System.Security.Cryptography.X509Certificates.Tests.ChainTests.InvalidSelfSignedSignature() in D:\j\workspace\windows-TGrou---74aa877a\src\System.Security.Cryptography.X509Certificates\tests\ChainTests.cs:line 673

@dotnet-bot test Windows x64 Debug Build

gfoidl · 2018-01-30T09:51:57Z

NETFX x86 Release Build -- Build #7499 has a build failure, couldn't find the error in it (not so easy to find as in #26086 (comment)

@dotnet-bot test NETFX x86 Release Build

…26086) * Stack.Pop RCE With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win. * Applied same optimizations as for Pop on Peek, TryPeek, TryPop, Push * Revert change for Push I can't see a real win, sometimes it's faster and sometimes slower. On Linux the tendency is to be slower. Therefore the state as is will be remained. * Stack.Push with hot-/cold-path (PushWithResize) * Addressed PR feedback Cf. dotnet/corefx#26086 (comment) * Array-copy in Peek is not necessary, JIT can do the same As of PR-feedback dotnet/corefx#26086 (comment) * Reverted b0bfd83 Cf. dotnet/corefx#26086 (comment) * Addressed PR feedback Cf. dotnet/corefx#26086 (comment) Commit migrated from dotnet/corefx@36ae610

gfoidl added 3 commits December 28, 2017 14:29

Stack.Pop RCE

c124114

With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win.

Applied same optimizations as for Pop on Peek, TryPeek, TryPop, Push

0123330

Revert change for Push

e7308d8

I can't see a real win, sometimes it's faster and sometimes slower. On Linux the tendency is to be slower. Therefore the state as is will be remained.

gfoidl mentioned this pull request Dec 28, 2017

Queue<T> optimization of (Try)Dequeue #26087

Merged

karelz added the area-System.Collections label Dec 28, 2017

karelz assigned gfoidl, ianhays and safern Dec 28, 2017

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 28, 2017

Updated Stack1 and PushBenchmark -- dotnet/corefx#26086 (comment)

303ecf6

benaadams approved these changes Dec 28, 2017

View reviewed changes

Stack.Push with hot-/cold-path (PushWithResize)

d2f5b2c

gfoidl changed the title ~~Stack<T> optimization of (Try)Peek and (Try)Pop~~ Stack<T> optimization of (Try)Peek, (Try)Pop and Push Dec 28, 2017

benaadams reviewed Dec 28, 2017

View reviewed changes

Addressed PR feedback

f337882

Cf. #26086 (comment)

benaadams approved these changes Dec 28, 2017

View reviewed changes

safern approved these changes Dec 29, 2017

View reviewed changes

mikedn reviewed Dec 29, 2017

View reviewed changes

Array-copy in Peek is not necessary, JIT can do the same

b0bfd83

As of PR-feedback #26086 (comment)

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 29, 2017

Updated Stack1.Peek -- dotnet/corefx#26086 (comment)

06cd3b0

Benchmark resulst are not updated, because the same JIT-code is generated.

Reverted b0bfd83

0a2e11e

Cf. #26086 (comment)

stephentoub reviewed Jan 29, 2018

View reviewed changes

stephentoub approved these changes Jan 29, 2018

View reviewed changes

Addressed PR feedback

084acb7

Cf. #26086 (comment)

stephentoub merged commit 36ae610 into dotnet:master Jan 30, 2018

gfoidl deleted the stack-pop branch January 30, 2018 12:53

karelz added this to the 2.1.0 milestone Feb 4, 2018

Wraiyth mentioned this pull request Apr 1, 2019

Queue<T>.Enqueue and Stack<T>.Push performance improvements #36502

Closed

benaadams mentioned this pull request Jan 31, 2020

Compression Dispose_WithUnfinishedWriteAsync Fail dotnet/runtime#24539

Closed

Stack<T> optimization of (Try)Peek, (Try)Pop and Push #26086

Stack<T> optimization of (Try)Peek, (Try)Pop and Push #26086

Uh oh!

Conversation

gfoidl commented Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Benchmarks

Notes

Peek

TryPeek

Pop

TryPop

Notes

Push

Inlining

Uh oh!

gfoidl commented Dec 28, 2017

Uh oh!

benaadams commented Dec 28, 2017

Uh oh!

gfoidl commented Dec 28, 2017

Uh oh!

benaadams commented Dec 28, 2017

Uh oh!

gfoidl commented Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Dec 28, 2017

Uh oh!

gfoidl commented Dec 28, 2017

Uh oh!

karelz commented Dec 28, 2017

Uh oh!

benaadams commented Dec 28, 2017

Uh oh!

benaadams Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfoidl commented Dec 28, 2017

Push

Uh oh!

benaadams commented Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gfoidl commented Dec 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfoidl Dec 29, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

With Copy

Without Copy

Uh oh!

mikedn Dec 29, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikedn commented Jan 12, 2018

Uh oh!

stephentoub commented Jan 12, 2018

Uh oh!

mikedn commented Jan 12, 2018

Uh oh!

stephentoub commented Jan 12, 2018

Uh oh!

mikedn commented Jan 13, 2018

Uh oh!

gfoidl commented Dec 28, 2017 •

edited

Loading

gfoidl commented Dec 28, 2017 •

edited

Loading

benaadams Dec 28, 2017 •

edited

Loading

benaadams commented Dec 28, 2017 •

edited

Loading

gfoidl Dec 29, 2017 •

edited

Loading

mikedn Dec 29, 2017 •

edited

Loading

gfoidl Jan 29, 2018 •

edited

Loading