Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@gfoidl
Copy link
Member

@gfoidl gfoidl commented Dec 28, 2017

Description

Enabled RCE on array-access and avoided the explicit check for _size == 0 --> this is done implicitly in the "RCE-if". So some cmps can be saved.

By Pop the effect on value types is not so big, than for reference types (two array accesses).

This PR is a kind of extension to https://github.com/dotnet/corefx/issues/17318

Benchmarks

Notes

Code for benchmarks lives here

Due the use of http://benchmarkdotnet.org the benchmarks were done a couple of times, because some crazy results with perf x2 were reported and this seems too strange. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.

Peek

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method Mean Error StdDev Scaled ScaledSD
Peek_Default 1.974 ns 0.1229 ns 0.2020 ns 1.00 0.00
Peek_New 1.600 ns 0.1231 ns 0.2023 ns 0.82 0.13

TryPeek

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method Mean Error StdDev Median Scaled ScaledSD
TryPeek_Default 5.212 ns 0.1905 ns 0.3977 ns 5.178 ns 1.00 0.00
TryPeek_New 3.982 ns 0.1716 ns 0.4640 ns 3.798 ns 0.77 0.11

Pop

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method Mean Error StdDev Scaled ScaledSD
Pop_Default 1.0609 ns 0.0673 ns 0.0630 ns 1.00 0.00
Pop_New 0.8681 ns 0.1036 ns 0.0918 ns 0.82 0.10

TryPop

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method Mean Error StdDev Scaled ScaledSD
TryPop_Default 3.180 ns 0.0644 ns 0.0571 ns 1.00 0.00
TryPop_New 2.736 ns 0.1200 ns 0.1123 ns 0.86 0.04

Notes

Push

I did some trials for Push too, but the results weren't satisfying. Sometimes it was faster, sometimes slower. On Linux it was nearly always slower. So I reverted the change.
Changes were done in the commit https://github.com/gfoidl/corefx/commit/012333094dd2b2052eaf6daebaaba2a98f25d2b1

Inlining

MethodImplOptions.AggressiveInlining would give a perf win on the benchmarks, because they are micro-benchmarks. Due to inlining the callsite gets bigger and in real world scenarios the perf might decrease. I ran a benchmark where this happended, unfortunately I can't show the code because it's proprietary.

From #12094 (comment)

But we shouldn't add AggressiveInlining unless there's a strong performance-motivated scenario for doing so. Getting something to inline isn't a goal in and of itself.

On the other side List.Add has AggressiveInlining -- cf. https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/Collections/Generic/List.cs#L225

I would stick to not aggressive inline, but what to do?

With value types the effect is not so big, because there is still one (manual) check for bounds.
For reference types one bounds-check can be saved, so there is a win.
I can't see a real win, sometimes it's faster and sometimes slower. On Linux the tendency is to be slower.
Therefore the state as is will be remained.
@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

A Test fails but this has nothing to do with this PR. What shall / can I do here?

@benaadams
Copy link
Member

What shall / can I do here?

Record and retest

Windows.7.Amd64.Open-Release-x86

System.IO.Compression.DeflateStreamUnitTests/Dispose_WithUnfinishedWriteAsync

Assert.Throws() Failure
Expected: typeof(System.AggregateException)
Actual:   (No exception was thrown)

   at System.IO.Compression.CompressionStreamUnitTestBase.<Dispose_WithUnfinishedWriteAsync>d__5.MoveNext() in D:\j\workspace\windows-TGrou---f8ac6754\src\Common\tests\System\IO\Compression\CompressionStreamUnitTestBase.cs:line 208
--- End of stack trace from previous location where exception was thrown ---
--- End of stack trace from previous location where exception was thrown ---
--- End of stack trace from previous location where exception was thrown ---

@dotnet-bot test Windows x86 Release Build

@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

@benaadams thx!

@benaadams
Copy link
Member

I did some trials for Push too, but the results weren't satisfying.

In the linked code the test if ((uint)size >= (uint)array.Length) doesn't elide the bounds check on the array as the array itself can change (as it returns to the access code after resizing)

@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

On the dasm I saw that RCE is not applied, for that reason.
I don't know any way how to make this happen. Except the JIT could trace the new array size (the one from Array.Resize), but I'm afraid this kind of flow analysis is going to far.

@benaadams
Copy link
Member

Move the Resize path completely out of flow?

if
   // ResizeAndPush(val)
else
   // rest of function

@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

Ah good idea -- one just has to think about it, and that's not always that easy 😉
I'll try it.

@karelz
Copy link
Member

karelz commented Dec 28, 2017

cc @valenis

@benaadams
Copy link
Member

Issue raised for Deflate fail https://github.com/dotnet/corefx/issues/26089

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 28, 2017
@gfoidl gfoidl changed the title Stack<T> optimization of (Try)Peek and (Try)Pop Stack<T> optimization of (Try)Peek, (Try)Pop and Push Dec 28, 2017
T[] array = _array;

// if (_size == 0) is equivalent to if (size == -1), and this case
// is covered with (uint)size, thus allowing RCE https://github.com/dotnet/coreclr/pull/9773
Copy link
Member

@benaadams benaadams Dec 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than RCE call it "range check elimination" or "bounds check elimination"

RCE more commonly stands for "remote code execution" which will make security people twitchy having it in a comment. Especially since it says it allows RCE 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know that. Makes absoluteley sense, and I'll update this.

@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

@benaadams now your approve was a minute to fast 😄

Push

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.30GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method InitialCapacity Mean Error StdDev Scaled ScaledSD
Push_Default 0 0.4163 ns 0.0180 ns 0.0150 ns 1.00 0.00
Push_New 0 0.3121 ns 0.0141 ns 0.0132 ns 0.75 0.04
Push_Default 10000 0.4324 ns 0.0209 ns 0.0185 ns 1.00 0.00
Push_New 10000 0.3306 ns 0.0173 ns 0.0153 ns 0.77 0.05

@benaadams
Copy link
Member

benaadams commented Dec 28, 2017

Nice! - LGTM

@gfoidl
Copy link
Member Author

gfoidl commented Dec 28, 2017

Thank you!

{
if (_size == 0)
int size = _size - 1;
T[] array = _array;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this copy really necessary?

Copy link
Member Author

@gfoidl gfoidl Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the copy T[] array = _array it gets:

000007fe`7bafc5d4 8b4110          mov     eax,dword ptr [rcx+10h]
000007fe`7bafc5d7 ffc8            dec     eax
000007fe`7bafc5d9 488b5108        mov     rdx,qword ptr [rcx+8]
000007fe`7bafc5dd 448b4208        mov     r8d,dword ptr [rdx+8]
000007fe`7bafc5e1 443bc0          cmp     r8d,eax
000007fe`7bafc5e4 760c            jbe     000007fe`7bafc5f2
000007fe`7bafc5e6 4863c0          movsxd  rax,eax
000007fe`7bafc5e9 8b448210        mov     eax,dword ptr [rdx+rax*4+10h]
000007fe`7bafc5ed 4883c428        add     rsp,28h
000007fe`7bafc5f1 c3              ret

Without the copy and

if ((uint)size >= (uint)_array.Length)
...
return _array[size];

it gets:

000007fe`7bb0c5d4 8b4110          mov     eax,dword ptr [rcx+10h]
000007fe`7bb0c5d7 ffc8            dec     eax
000007fe`7bb0c5d9 488b5108        mov     rdx,qword ptr [rcx+8]
000007fe`7bb0c5dd 448b4208        mov     r8d,dword ptr [rdx+8]
000007fe`7bb0c5e1 443bc0          cmp     r8d,eax
000007fe`7bb0c5e4 760c            jbe     000007fe`7bb0c5f2
000007fe`7bb0c5e6 4863c0          movsxd  rax,eax
000007fe`7bb0c5e9 8b448210        mov     eax,dword ptr [rdx+rax*4+10h]
000007fe`7bb0c5ed 4883c428        add     rsp,28h
000007fe`7bb0c5f1 c3              ret

So basically the same code. Hence the copy can be avoided.

It seems as the JIT elides bound checks even on field access. Is this new? (and is this safe?)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems as the JIT elides bound checks even on field access. Is this new? (and is this safe?)

It's safe in the sense that the JIT generates correct code. Note that there's only one field load in the generated code. CSE removed the second one so in effect it did the same optimization that you did manually.

The fact that the range check is actually removed might be relatively new, in the past the JIT did not propagate existing analysis information to variables created by CSE and this usually prevented further optimizations. Some improvements in this area were done in dotnet/coreclr#9615

In general I'd recommend avoiding multiple field loads when arrays are involved, CSE may not be able to always eliminate redundant loads and even if it does there may still be cases where it doesn't work as well as manual optimization.

Still, in trivial cases such as Peek it should work fine and thus there's little reason to complicate the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great reply. 👍

I'll update the PR for this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference in TryPeek this JIT-optimization doesn't apply:

With Copy

000007fe`7b86c610 8b4110          mov     eax,dword ptr [rcx+10h]
000007fe`7b86c613 448d40ff        lea     r8d,[rax-1]
000007fe`7b86c617 488b4908        mov     rcx,qword ptr [rcx+8]
000007fe`7b86c61b 44394108        cmp     dword ptr [rcx+8],r8d
000007fe`7b86c61f 7705            ja      000007fe`7b86c626
000007fe`7b86c621 33c0            xor     eax,eax
000007fe`7b86c623 8902            mov     dword ptr [rdx],eax
000007fe`7b86c625 c3              ret
000007fe`7b86c626 4963c0          movsxd  rax,r8d
000007fe`7b86c629 8b448110        mov     eax,dword ptr [rcx+rax*4+10h]
000007fe`7b86c62d 8902            mov     dword ptr [rdx],eax
000007fe`7b86c62f b801000000      mov     eax,1

Without Copy

000007fe`7b87c610 8b4110          mov     eax,dword ptr [rcx+10h]
000007fe`7b87c613 448d40ff        lea     r8d,[rax-1]
000007fe`7b87c617 488b4108        mov     rax,qword ptr [rcx+8]
000007fe`7b87c61b 8b4008          mov     eax,dword ptr [rax+8]
000007fe`7b87c61e 413bc0          cmp     eax,r8d
000007fe`7b87c621 7705            ja      000007fe`7b87c628
000007fe`7b87c623 33c0            xor     eax,eax
000007fe`7b87c625 8902            mov     dword ptr [rdx],eax
000007fe`7b87c627 c3              ret
000007fe`7b87c628 488b4108        mov     rax,qword ptr [rcx+8]
000007fe`7b87c62c 4963c8          movsxd  rcx,r8d
000007fe`7b87c62f 8b448810        mov     eax,dword ptr [rax+rcx*4+10h]
000007fe`7b87c633 8902            mov     dword ptr [rdx],eax
000007fe`7b87c635 b801000000      mov     eax,1

Copy link

@mikedn mikedn Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting. The range check seems to have been eliminated but there are 2 loads of the _array field. That doesn't seem right, I'll have to look into this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not yet sure what's going on but the generated code is definitely incorrect. Created https://github.com/dotnet/coreclr/issues/15671

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 29, 2017
Benchmark resulst are not updated, because the same JIT-code is generated.
@mikedn
Copy link

mikedn commented Jan 12, 2018

The JIT bug that results in incorrect range check elimination has been fixed in dotnet/coreclr#15756. Unfortunately the fix makes range check elimination more conservative when fields are involved so you'll need to add back local variables for _array.

@stephentoub
Copy link
Member

Unfortunately the fix makes range check elimination more conservative when fields are involve

@mikedn, did you happen to notice if there were any existing places in coreclr/corefx that regressed as a result and that we should also fix?

@mikedn
Copy link

mikedn commented Jan 12, 2018

@stephentoub I haven't looked too closely at the jit-diff. I'll check, fortunately it's small enough to look through it.

@stephentoub
Copy link
Member

Thanks, @mikedn!

@mikedn
Copy link

mikedn commented Jan 13, 2018

did you happen to notice if there were any existing places in coreclr/corefx that regressed as a result and that we should also fix?

It looks like there are very few cases where we now get new range checks as a result of the JIT fix. I haven't looked at all corefx (the diffs is small but there are so many small and similar changes that are not related to range check elimination that after a while it gets boring) but the very few range checks I have found do not seem very interesting:

  • BitArray's constructor has a loop involving an array field. BitArray has many other CQ issues so it's not worth going only after that little loop.
  • RegexMatch.Reset has a similar loop. I'm not familiar with that code but I doubt that its perf is relevant.
  • BigInteger.Equals has some extra range checks. That code is not in a loop so I doubt that it makes a difference.

The common regression is not related to range checks but to redundant load elimination via CSE. Here's an example from one of Stack<T>.Push method:

--- D:\Projects\diffs\dasmset_13\diff\System.Collections.dasm	2018-01-13 09:19:01.000000000 +-0200
+++ D:\Projects\diffs\dasmset_13\base\System.Collections.dasm	2018-01-13 09:19:00.000000000 +-0200
@@ -50779,12 +50768,12 @@
        mov      ecx, dword ptr [rsi+24]
        mov      rax, gword ptr [rsi+8]
-       mov      eax, dword ptr [rax+8] ; load array length in a register
-       cmp      ecx, eax
+       cmp      ecx, dword ptr [rax+8] ; array length
        jne      SHORT G_M62838_IG05
        lea      rcx, bword ptr [rsi+8]
-       test     eax, eax
+       cmp      dword ptr [rax+8], 0 ; array length again, previously the value in eax was reused
        je       SHORT G_M62838_IG03
        mov      rbx, rcx
-       mov      ebp, eax
+       mov      ecx, dword ptr [rax+8] ; array length again, previously the value in eax was reused
+       mov      ebp, ecx
        shl      ebp, 1
        jmp      SHORT G_M62838_IG04

Do such regressions matter? I don't know, I haven't attempted to benchmark this code. It may put up a show in a micro-benchmark but I doubt that it will measurable impact the performance of a real world application, unless that application happens to spend 99% of the time in code affected by this regression 😄

@stephentoub
Copy link
Member

Gotcha. Thanks for looking!

@gfoidl
Copy link
Member Author

gfoidl commented Jan 13, 2018

Good (bug fixed) and bad (conservative range checks) 😉
I've updated the PR.

@karelz
Copy link
Member

karelz commented Jan 27, 2018

@stephentoub @mikedn are you ok with merging the change?

@mikedn
Copy link

mikedn commented Jan 27, 2018

LGTM

Array.Resize(ref _array, (_array.Length == 0) ? DefaultCapacity : 2 * _array.Length);
array[size] = item;
_version++;
_size++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this better than _size = size + 1;?

Copy link
Member Author

@gfoidl gfoidl Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_size++ results in

inc      dword ptr [rdi+24]

_size = size + 1 results in

inc      eax
mov      dword ptr [rdi+24], eax

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked if there's any difference in performance? It may be a single instruction but it has a memory load in it so it's not like it's as cheap as the other version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just run some benchmarks on this. It's hard to say which version is faster, because the difference in the magnitude of the noise (several executions). But _size = size + 1 has the tendency to show better numbers. So I'll update the PR to use _size = size + 1.

@gfoidl
Copy link
Member Author

gfoidl commented Jan 30, 2018

Windows.7.Amd64.Open-Debug-x64

Unhandled Exception of Type Xunit.Sdk.EqualException

Assert.Equal() Failure
Expected: NotSignatureValid
Actual:   NotSignatureValid, UntrustedRoot

   at System.Security.Cryptography.X509Certificates.Tests.ChainTests.InvalidSelfSignedSignature() in D:\j\workspace\windows-TGrou---74aa877a\src\System.Security.Cryptography.X509Certificates\tests\ChainTests.cs:line 673

@dotnet-bot test Windows x64 Debug Build

@gfoidl
Copy link
Member Author

gfoidl commented Jan 30, 2018

NETFX x86 Release Build -- Build #7499 has a build failure, couldn't find the error in it (not so easy to find as in #26086 (comment)

@dotnet-bot test NETFX x86 Release Build

@stephentoub stephentoub merged commit 36ae610 into dotnet:master Jan 30, 2018
@gfoidl gfoidl deleted the stack-pop branch January 30, 2018 12:53
@karelz karelz added this to the 2.1.0 milestone Feb 4, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…26086)

* Stack.Pop RCE

With value types the effect is not so big, because there is still one (manual) check for bounds.
For reference types one bounds-check can be saved, so there is a win.

* Applied same optimizations as for Pop on Peek, TryPeek, TryPop, Push

* Revert change for Push

I can't see a real win, sometimes it's faster and sometimes slower. On Linux the tendency is to be slower.
Therefore the state as is will be remained.

* Stack.Push with hot-/cold-path (PushWithResize)

* Addressed PR feedback

Cf. dotnet/corefx#26086 (comment)

* Array-copy in Peek is not necessary, JIT can do the same

As of PR-feedback dotnet/corefx#26086 (comment)

* Reverted b0bfd83

Cf. dotnet/corefx#26086 (comment)

* Addressed PR feedback

Cf. dotnet/corefx#26086 (comment)


Commit migrated from dotnet/corefx@36ae610
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants