Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange Span costs for as Memory.Span -> parameter #32396

Closed
benaadams opened this issue Feb 16, 2020 · 5 comments · Fixed by #32538
Closed

Strange Span costs for as Memory.Span -> parameter #32396

benaadams opened this issue Feb 16, 2020 · 5 comments · Fixed by #32538
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization

Comments

@benaadams
Copy link
Member

benaadams commented Feb 16, 2020

Can costs for passing Spans as parameters be reduced? (e.g. by passing in xmm registers).

The current costs may make passing pointers or refs more attractive; which is undesirable as it discards the bounding safety provided by the Spans.

Noticed in #32371 (comment) where the cost of using Span parameters for the method is higher than the method's time taken to test whether two sets of 4096 bytes are equal (test on Windows)

Created gist benchmark https://gist.github.com/benaadams/56af11cf7f8e0e1da3fed47464414f8a to demonstrate:

|                      Method |      Mean |     Error |    StdDev |
|---------------------------- |----------:|----------:|----------:|
|            PassSpansByParam |  9.982 ns | 0.0492 ns | 0.0460 ns |
|      PassDeconstructedSpans |  3.541 ns | 0.0496 ns | 0.0464 ns |
|          DeferSpansCreation |  3.513 ns | 0.0507 ns | 0.0475 ns |
|       PassSpansByParamTwice | 11.082 ns | 0.0417 ns | 0.0390 ns |
| PassReconstructedSpansParam |  5.014 ns | 0.0207 ns | 0.0172 ns |

All three methods create Spans from the same Memory<byte>

  • PassSpansByParam - passes the created Span<byte> as parameters
  • PassDeconstructedSpans - turns the Span<byte> into ref byte and int and passes those
  • DeferSpansCreation - passes no parameters and creates the Span<byte> in the callee
  • PassSpansByParamTwice - passes the created Span<byte> as parameters; then passes them through to second method in different param positions
  • PassReconstructedSpansParam - turns the Span<byte> into ref byte and int and passes those; then recreates the Spans from the params and passes those created Spans on to second method.

As PassSpansByParamTwice to see if its purely span passing (i.e. is the cost directly additive); it doesn't add the same cost on again; it seems to be more than purely parameter passing.

As PassReconstructedSpansParam does pass the Spans as parameters, just not from the original method that gets them from the Memory<byte> and has a much lower cost; even though it now involves an extra non-inlined method call, its even stranger?

category:cq
theme:optimization
skill-level:expert
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels Feb 16, 2020
@benaadams benaadams changed the title Can costs for passing Spans as parameters be reduced Can costs for passing Spans as parameters be reduced? Feb 16, 2020
@benaadams benaadams changed the title Can costs for passing Spans as parameters be reduced? Strange Span costs for as Memory.Span -> parameter Feb 16, 2020
@benaadams
Copy link
Member Author

Updated with PassReconstructedSpansParam as deconstructing the spans to ref/lengths; passing to an intermediary method that then recreates the spans and calling the desired method is much faster than calling directly with the Spans (created from Memory<T>.Span)

@benaadams
Copy link
Member Author

PassSpansByParamTwice just passes through the stack pointers given as params, so doesn't really do much

@benaadams
Copy link
Member Author

Main difference between PassSpansByParam and PassReconstructedSpansParam looks the be the prologue (just how expensive is rep stosd?)

Slow

G_M41396_IG01:
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 96
       mov      rsi, rcx
       lea      rdi, [rsp+20H]
       mov      ecx, 16
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      rsi, rcx

Fast

G_M24512_IG01:
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 72
       xor      rax, rax
       mov      qword ptr [rsp+38H], rax
       mov      qword ptr [rsp+40H], rax
       mov      qword ptr [rsp+28H], rax
       mov      qword ptr [rsp+30H], rax
       mov      rsi, rcx

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Feb 16, 2020

PassSpansByParam has four spans on the stack: the two it ultimately needs to pass as outgoing args to SequenceEqualsParams, and the two it needs to pass as hidden return value buffers to MemoryManager.GetSpan. These all need to be zeroed in the prolog.

PassReconstructedSpansParam only has two spans, so has less to zero,.

We have an issue #8890 for improving heuristics for prolog zeroing.

Ideally the jit would reverse copy-prop and construct the GetSpan results in the same structs later used as arguments to SequenceEqualsParams; this is beyond what it can do now.

There are also some promoted spans in the mix, so the code is also moving span fields from stack to register pairs in places. The jit would be better off not promoting as the code never computes with the span fields, just passes them around, and they ultimately have to end up on the stack. But those are tricky heuristics to get right.

Can costs for passing Spans as parameters be reduced? (e.g. by passing in xmm registers).

The ABI costs for spans in simple examples like these should be somewhat lower on SysV, however the jit does not yet take full advantage of this.

Since there are GC refs in spans, using xmm here would require a large number of changes throughout the system -- we currently assume GC refs can only live in general purpose registers.. And the performance impact of an ABI change is hard to assess. Invariably some things get better and others worse.

@benaadams
Copy link
Member Author

We have an issue #8890 for improving heuristics for prolog zeroing.

Ah, it hits the 16 byte limit so moves to rep stosd though perhaps using a single xmm reg would be better; as mentioned in that issue.

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization and removed area-System.Memory labels Feb 16, 2020
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Feb 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants