Stack<T> optimization of (Try)Peek, (Try)Pop and Push #26086
With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win.
I can't see a real win; sometimes it's faster and sometimes slower. On Linux the tendency is to be slower. Therefore the state is kept as is.
|
A Test fails but this has nothing to do with this PR. What shall / can I do here? |
Record and retest
@dotnet-bot test Windows x86 Release Build |
|
@benaadams thx! |
In the linked code the test |
|
On the dasm I saw that RCE is not applied, for that reason. |
|
Move the Resize path completely out of flow? |
|
Ah good idea -- one just has to think about it, and that's not always that easy 😉 |
|
cc @valenis |
|
Issue raised for Deflate fail https://github.com/dotnet/corefx/issues/26089 |
| T[] array = _array; | ||
|
|
||
| // if (_size == 0) is equivalent to if (size == -1), and this case | ||
| // is covered with (uint)size, thus allowing RCE https://github.com/dotnet/coreclr/pull/9773 |
Rather than RCE call it "range check elimination" or "bounds check elimination"
RCE more commonly stands for "remote code execution" which will make security people twitchy having it in a comment. Especially since it says it allows RCE 😉
Oh, I didn't know that. That makes absolute sense, and I'll update this.
|
@benaadams now your approval was a minute too fast 😄
Push
BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.30GHz, ProcessorCount=2
.NET Core SDK=2.1.3
[Host] : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
|
|
Nice! - LGTM |
|
Thank you! |
| { | ||
| if (_size == 0) | ||
| int size = _size - 1; | ||
| T[] array = _array; |
Is this copy really necessary?
With the copy T[] array = _array it gets:
000007fe`7bafc5d4 8b4110 mov eax,dword ptr [rcx+10h]
000007fe`7bafc5d7 ffc8 dec eax
000007fe`7bafc5d9 488b5108 mov rdx,qword ptr [rcx+8]
000007fe`7bafc5dd 448b4208 mov r8d,dword ptr [rdx+8]
000007fe`7bafc5e1 443bc0 cmp r8d,eax
000007fe`7bafc5e4 760c jbe 000007fe`7bafc5f2
000007fe`7bafc5e6 4863c0 movsxd rax,eax
000007fe`7bafc5e9 8b448210 mov eax,dword ptr [rdx+rax*4+10h]
000007fe`7bafc5ed 4883c428 add rsp,28h
000007fe`7bafc5f1 c3 ret
Without the copy and
if ((uint)size >= (uint)_array.Length)
...
return _array[size];
it gets:
000007fe`7bb0c5d4 8b4110 mov eax,dword ptr [rcx+10h]
000007fe`7bb0c5d7 ffc8 dec eax
000007fe`7bb0c5d9 488b5108 mov rdx,qword ptr [rcx+8]
000007fe`7bb0c5dd 448b4208 mov r8d,dword ptr [rdx+8]
000007fe`7bb0c5e1 443bc0 cmp r8d,eax
000007fe`7bb0c5e4 760c jbe 000007fe`7bb0c5f2
000007fe`7bb0c5e6 4863c0 movsxd rax,eax
000007fe`7bb0c5e9 8b448210 mov eax,dword ptr [rdx+rax*4+10h]
000007fe`7bb0c5ed 4883c428 add rsp,28h
000007fe`7bb0c5f1 c3 ret
So basically the same code. Hence the copy can be avoided.
It seems that the JIT elides bounds checks even on field accesses. Is this new? (And is this safe?)
It seems that the JIT elides bounds checks even on field accesses. Is this new? (And is this safe?)
It's safe in the sense that the JIT generates correct code. Note that there's only one field load in the generated code. CSE removed the second one so in effect it did the same optimization that you did manually.
The fact that the range check is actually removed might be relatively new, in the past the JIT did not propagate existing analysis information to variables created by CSE and this usually prevented further optimizations. Some improvements in this area were done in dotnet/coreclr#9615
In general I'd recommend avoiding multiple field loads when arrays are involved, CSE may not be able to always eliminate redundant loads and even if it does there may still be cases where it doesn't work as well as manual optimization.
Still, in trivial cases such as Peek it should work fine and thus there's little reason to complicate the code.
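For readers following along, here is a minimal sketch of the pattern under discussion: one load of the array field into a local, plus a single unsigned comparison that covers both the empty-stack case and the bounds check. `SketchStack<T>` and its fixed initial capacity are hypothetical and only for illustration; this is not the corefx source, and whether the JIT actually removes the range check depends on the JIT version.

```csharp
using System;

// Hypothetical minimal stack for illustration only (not the corefx implementation).
public class SketchStack<T>
{
    private T[] _array = new T[4]; // assumed initial capacity, chosen for the sketch
    private int _size;

    public void Push(T item)
    {
        if (_size == _array.Length)
            Array.Resize(ref _array, 2 * _array.Length);
        _array[_size++] = item;
    }

    public T Peek()
    {
        int size = _size - 1;
        T[] array = _array; // single field load; check and access use the same local

        // An empty stack gives size == -1, which becomes uint.MaxValue after the cast,
        // so one unsigned comparison handles "empty" and "index out of range" together.
        if ((uint)size >= (uint)array.Length)
            throw new InvalidOperationException("Stack empty.");

        return array[size];
    }

    public bool TryPeek(out T result)
    {
        int size = _size - 1;
        T[] array = _array;

        if ((uint)size >= (uint)array.Length)
        {
            result = default(T);
            return false;
        }

        result = array[size];
        return true;
    }
}
```

Keeping the local copy makes the code independent of whether CSE manages to merge the two field loads.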
Thanks for the great reply. 👍
I'll update the PR for this change.
For reference in TryPeek this JIT-optimization doesn't apply:
With Copy
000007fe`7b86c610 8b4110 mov eax,dword ptr [rcx+10h]
000007fe`7b86c613 448d40ff lea r8d,[rax-1]
000007fe`7b86c617 488b4908 mov rcx,qword ptr [rcx+8]
000007fe`7b86c61b 44394108 cmp dword ptr [rcx+8],r8d
000007fe`7b86c61f 7705 ja 000007fe`7b86c626
000007fe`7b86c621 33c0 xor eax,eax
000007fe`7b86c623 8902 mov dword ptr [rdx],eax
000007fe`7b86c625 c3 ret
000007fe`7b86c626 4963c0 movsxd rax,r8d
000007fe`7b86c629 8b448110 mov eax,dword ptr [rcx+rax*4+10h]
000007fe`7b86c62d 8902 mov dword ptr [rdx],eax
000007fe`7b86c62f b801000000 mov eax,1
Without Copy
000007fe`7b87c610 8b4110 mov eax,dword ptr [rcx+10h]
000007fe`7b87c613 448d40ff lea r8d,[rax-1]
000007fe`7b87c617 488b4108 mov rax,qword ptr [rcx+8]
000007fe`7b87c61b 8b4008 mov eax,dword ptr [rax+8]
000007fe`7b87c61e 413bc0 cmp eax,r8d
000007fe`7b87c621 7705 ja 000007fe`7b87c628
000007fe`7b87c623 33c0 xor eax,eax
000007fe`7b87c625 8902 mov dword ptr [rdx],eax
000007fe`7b87c627 c3 ret
000007fe`7b87c628 488b4108 mov rax,qword ptr [rcx+8]
000007fe`7b87c62c 4963c8 movsxd rcx,r8d
000007fe`7b87c62f 8b448810 mov eax,dword ptr [rax+rcx*4+10h]
000007fe`7b87c633 8902 mov dword ptr [rdx],eax
000007fe`7b87c635 b801000000 mov eax,1
Hmm, interesting. The range check seems to have been eliminated but there are 2 loads of the _array field. That doesn't seem right, I'll have to look into this.
I'm not yet sure what's going on but the generated code is definitely incorrect. Created https://github.com/dotnet/coreclr/issues/15671
As of PR-feedback #26086 (comment)
Benchmark results are not updated, because the same JIT code is generated.
|
The JIT bug that results in incorrect range check elimination has been fixed in dotnet/coreclr#15756. Unfortunately the fix makes range check elimination more conservative when fields are involved so you'll need to add back local variables for |
@mikedn, did you happen to notice if there were any existing places in coreclr/corefx that regressed as a result and that we should also fix? |
|
@stephentoub I haven't looked too closely at the jit-diff. I'll check, fortunately it's small enough to look through it. |
|
Thanks, @mikedn! |
It looks like there are very few cases where we now get new range checks as a result of the JIT fix. I haven't looked at all of corefx (the diff is small, but there are so many small and similar changes unrelated to range check elimination that after a while it gets boring), but the very few range checks I have found do not seem very interesting:
The common regression is not related to range checks but to redundant load elimination via CSE. Here's an example from one of
--- D:\Projects\diffs\dasmset_13\diff\System.Collections.dasm 2018-01-13 09:19:01.000000000 +-0200
+++ D:\Projects\diffs\dasmset_13\base\System.Collections.dasm 2018-01-13 09:19:00.000000000 +-0200
@@ -50779,12 +50768,12 @@
mov ecx, dword ptr [rsi+24]
mov rax, gword ptr [rsi+8]
- mov eax, dword ptr [rax+8] ; load array length in a register
- cmp ecx, eax
+ cmp ecx, dword ptr [rax+8] ; array length
jne SHORT G_M62838_IG05
lea rcx, bword ptr [rsi+8]
- test eax, eax
+ cmp dword ptr [rax+8], 0 ; array length again, previously the value in eax was reused
je SHORT G_M62838_IG03
mov rbx, rcx
- mov ebp, eax
+ mov ecx, dword ptr [rax+8] ; array length again, previously the value in eax was reused
+ mov ebp, ecx
shl ebp, 1
jmp SHORT G_M62838_IG04
Do such regressions matter? I don't know, I haven't attempted to benchmark this code. It may put up a show in a micro-benchmark but I doubt that it will measurably impact the performance of a real-world application, unless that application happens to spend 99% of the time in code affected by this regression 😄 |
|
Gotcha. Thanks for looking! |
|
Good (bug fixed) and bad (conservative range checks) 😉 |
|
@stephentoub @mikedn are you ok with merging the change? |
|
LGTM |
| Array.Resize(ref _array, (_array.Length == 0) ? DefaultCapacity : 2 * _array.Length); | ||
| array[size] = item; | ||
| _version++; | ||
| _size++; |
Is this better than _size = size + 1;?
_size++ results in
inc dword ptr [rdi+24]
_size = size + 1 results in
inc eax
mov dword ptr [rdi+24], eax
Have you checked if there's any difference in performance? It may be a single instruction but it has a memory load in it so it's not like it's as cheap as the other version.
I just ran some benchmarks on this. It's hard to say which version is faster, because the difference is within the magnitude of the noise (over several executions). But _size = size + 1 tends to show better numbers. So I'll update the PR to use _size = size + 1.
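To make the hot-/cold-path idea concrete, here is a hedged sketch of how Push could be split so that the resize lives in a separate method (PushWithResize, as named in the commit list). The class name `SketchPushStack<T>`, the `Count` and `Peek` members, and the use of `MethodImplOptions.NoInlining` on the cold path are assumptions for illustration, not a copy of the merged code.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical sketch of a hot/cold split for Push (not the corefx source).
public class SketchPushStack<T>
{
    private const int DefaultCapacity = 4;
    private T[] _array = Array.Empty<T>();
    private int _size;
    private int _version;

    public int Count => _size;

    public void Push(T item)
    {
        int size = _size;
        T[] array = _array;

        // Hot path: there is room, so store through the local and
        // write the new size back from the local (_size = size + 1).
        if ((uint)size < (uint)array.Length)
        {
            array[size] = item;
            _version++;
            _size = size + 1;
        }
        else
        {
            PushWithResize(item);
        }
    }

    // Cold path: kept out of line (an assumption here) so the hot path stays small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void PushWithResize(T item)
    {
        Array.Resize(ref _array, (_array.Length == 0) ? DefaultCapacity : 2 * _array.Length);
        _array[_size] = item;
        _version++;
        _size++;
    }

    public T Peek()
    {
        int size = _size - 1;
        T[] array = _array;
        if ((uint)size >= (uint)array.Length)
            throw new InvalidOperationException("Stack empty.");
        return array[size];
    }
}
```

Whether `_size = size + 1` or `_size++` is faster on the hot path was within noise in the measurements above, so treat the choice in this sketch as illustrative.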
|
@dotnet-bot test Windows x64 Debug Build |
|
@dotnet-bot test NETFX x86 Release Build |
…26086)
* Stack.Pop RCE
  With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win.
* Applied same optimizations as for Pop on Peek, TryPeek, TryPop, Push
* Revert change for Push
  I can't see a real win; sometimes it's faster and sometimes slower. On Linux the tendency is to be slower. Therefore the state is kept as is.
* Stack.Push with hot-/cold-path (PushWithResize)
* Addressed PR feedback
  Cf. dotnet/corefx#26086 (comment)
* Array-copy in Peek is not necessary, JIT can do the same
  As of PR-feedback dotnet/corefx#26086 (comment)
* Reverted b0bfd83
  Cf. dotnet/corefx#26086 (comment)
* Addressed PR feedback
  Cf. dotnet/corefx#26086 (comment)
Commit migrated from dotnet/corefx@36ae610
Description
Enabled RCE (range check elimination) on array access and avoided the explicit check for _size == 0; this is done implicitly in the "RCE-if", so some cmps can be saved.
For Pop the effect on value types is not as big as for reference types (two array accesses).
This PR is a kind of extension to https://github.com/dotnet/corefx/issues/17318
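As a rough illustration of why Pop benefits more for reference types: the popped slot is read and, for reference types, also cleared, so there are two array accesses that can share the one explicit unsigned check. The sketch below assumes `RuntimeHelpers.IsReferenceOrContainsReferences<T>()` decides whether the slot is cleared; `SketchPopStack<T>` and its initial capacity are hypothetical, not the code in this PR.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical sketch of Pop with a single explicit bounds test (not the corefx source).
public class SketchPopStack<T>
{
    private T[] _array = new T[4]; // assumed initial capacity, chosen for the sketch
    private int _size;
    private int _version;

    public void Push(T item)
    {
        if (_size == _array.Length)
            Array.Resize(ref _array, 2 * _array.Length);
        _array[_size++] = item;
        _version++;
    }

    public T Pop()
    {
        int size = _size - 1;
        T[] array = _array;

        // Covers the empty stack (size == -1 becomes uint.MaxValue) and the range check.
        if ((uint)size >= (uint)array.Length)
            throw new InvalidOperationException("Stack empty.");

        _version++;
        _size = size;
        T item = array[size]; // first array access

        // Second array access, only needed when T may hold references,
        // so the garbage collector can reclaim the popped object.
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
            array[size] = default(T);

        return item;
    }
}
```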
Benchmarks
Notes
Code for benchmarks lives here
The benchmarks use http://benchmarkdotnet.org and were run a couple of times, because some implausible results (a 2x perf difference) were reported, which seemed too strange. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.
Peek
TryPeek
Pop
TryPop
Notes
Push
I did some trials for Push too, but the results weren't satisfying. Sometimes it was faster, sometimes slower. On Linux it was nearly always slower. So I reverted the change.
Changes were done in the commit https://github.com/gfoidl/corefx/commit/012333094dd2b2052eaf6daebaaba2a98f25d2b1
Inlining
MethodImplOptions.AggressiveInlining would give a perf win on the benchmarks, because they are micro-benchmarks. Due to inlining the call site gets bigger, and in real-world scenarios the perf might decrease. I ran a benchmark where this happened; unfortunately I can't show the code because it's proprietary.
From #12094 (comment)
On the other hand, List.Add has AggressiveInlining -- cf. https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/Collections/Generic/List.cs#L225
I would stick to not using aggressive inlining, but what to do?