Strange Span costs for as Memory.Span -> parameter #32396

benaadams · 2020-02-16T15:39:32Z

Can costs for passing Spans as parameters be reduced? (e.g. by passing in xmm registers).

The current costs may make passing pointers or refs more attractive, which is undesirable as it discards the bounds safety provided by Spans.

Noticed in #32371 (comment), where the cost of using Span parameters for the method is higher than the time the method takes to test whether two sets of 4096 bytes are equal (tested on Windows).

Created gist benchmark https://gist.github.com/benaadams/56af11cf7f8e0e1da3fed47464414f8a to demonstrate:

|                      Method |      Mean |     Error |    StdDev |
|---------------------------- |----------:|----------:|----------:|
|            PassSpansByParam |  9.982 ns | 0.0492 ns | 0.0460 ns |
|      PassDeconstructedSpans |  3.541 ns | 0.0496 ns | 0.0464 ns |
|          DeferSpansCreation |  3.513 ns | 0.0507 ns | 0.0475 ns |
|       PassSpansByParamTwice | 11.082 ns | 0.0417 ns | 0.0390 ns |
| PassReconstructedSpansParam |  5.014 ns | 0.0207 ns | 0.0172 ns |

All five methods create Spans from the same Memory<byte>:

PassSpansByParam - passes the created Span<byte> as parameters
PassDeconstructedSpans - turns the Span<byte> into ref byte and int and passes those
DeferSpansCreation - passes no parameters and creates the Span<byte> in the callee
PassSpansByParamTwice - passes the created Span<byte> as parameters; then passes them through to a second method in different param positions
PassReconstructedSpansParam - turns the Span<byte> into ref byte and int and passes those; then recreates the Spans from the params and passes those created Spans on to a second method
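
For readers without the gist open, the five shapes can be sketched roughly as below. This is a hypothetical reconstruction from the descriptions above, not the gist's actual code: the helper names (`EqualsSpans`, `EqualsRefs`, `EqualsDeferred`, `Forward`, `Reconstruct`) and the field names are assumptions made for illustration, and the BenchmarkDotNet attributes are left out.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical sketch of the five benchmark shapes; helper names are assumptions.
public static class SpanPassingSketch
{
    private static readonly Memory<byte> s_a = new byte[4096];
    private static readonly Memory<byte> s_b = new byte[4096];

    // Spans are created in the caller and passed as parameters.
    public static bool PassSpansByParam() => EqualsSpans(s_a.Span, s_b.Span);

    // Spans are split into (ref byte, int) pairs before the call.
    public static bool PassDeconstructedSpans()
    {
        Span<byte> a = s_a.Span;
        Span<byte> b = s_b.Span;
        return EqualsRefs(ref MemoryMarshal.GetReference(a), a.Length,
                          ref MemoryMarshal.GetReference(b), b.Length);
    }

    // No parameters; the callee creates the Spans itself.
    public static bool DeferSpansCreation() => EqualsDeferred();

    // Spans are passed to a forwarder, which swaps their positions and passes them on.
    public static bool PassSpansByParamTwice() => Forward(s_a.Span, s_b.Span);

    // (ref byte, int) pairs go to an intermediary that rebuilds the Spans and passes them on.
    public static bool PassReconstructedSpansParam()
    {
        Span<byte> a = s_a.Span;
        Span<byte> b = s_b.Span;
        return Reconstruct(ref MemoryMarshal.GetReference(a), a.Length,
                           ref MemoryMarshal.GetReference(b), b.Length);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool EqualsSpans(Span<byte> a, Span<byte> b) => a.SequenceEqual(b);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool EqualsRefs(ref byte a, int aLength, ref byte b, int bLength)
        => MemoryMarshal.CreateSpan(ref a, aLength)
            .SequenceEqual(MemoryMarshal.CreateSpan(ref b, bLength));

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool EqualsDeferred() => s_a.Span.SequenceEqual(s_b.Span);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Forward(Span<byte> a, Span<byte> b) => EqualsSpans(b, a);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Reconstruct(ref byte a, int aLength, ref byte b, int bLength)
        => EqualsSpans(MemoryMarshal.CreateSpan(ref a, aLength),
                       MemoryMarshal.CreateSpan(ref b, bLength));
}
```

The callees are marked NoInlining so that the parameter passing is not optimized away; whether the gist does exactly this is an assumption.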

Added PassSpansByParamTwice to see whether the cost is purely span passing (i.e. whether it is directly additive). It doesn't add the same cost again, so there seems to be more going on than parameter passing alone.

PassReconstructedSpansParam is stranger still: it does pass the Spans as parameters, just not from the method that originally gets them from the Memory<byte>, and it has a much lower cost even though it involves an extra non-inlined method call.

category:cq
theme:optimization
skill-level:expert
cost:medium

benaadams · 2020-02-16T16:38:35Z

Updated with PassReconstructedSpansParam: deconstructing the spans to refs/lengths, passing those to an intermediary method that recreates the spans, and then calling the desired method is much faster than calling it directly with the Spans (created from Memory<T>.Span).

benaadams · 2020-02-16T17:45:30Z

PassSpansByParamTwice just passes through the stack pointers given as params, so doesn't really do much

benaadams · 2020-02-16T17:50:08Z

The main difference between PassSpansByParam and PassReconstructedSpansParam looks to be the prologue (just how expensive is rep stosd?)

Slow

G_M41396_IG01:
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 96
       mov      rsi, rcx
       lea      rdi, [rsp+20H]
       mov      ecx, 16
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      rsi, rcx

Fast

G_M24512_IG01:
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 72
       xor      rax, rax
       mov      qword ptr [rsp+38H], rax
       mov      qword ptr [rsp+40H], rax
       mov      qword ptr [rsp+28H], rax
       mov      qword ptr [rsp+30H], rax
       mov      rsi, rcx

AndyAyersMS · 2020-02-16T18:41:50Z

PassSpansByParam has four spans on the stack: the two it ultimately needs to pass as outgoing args to SequenceEqualsParams, and the two it needs to pass as hidden return value buffers to MemoryManager.GetSpan. These all need to be zeroed in the prolog.

PassReconstructedSpansParam only has two spans, so has less to zero.
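
The span counts line up with the two prologs shown earlier. A minimal sketch of the arithmetic, assuming a 64-bit process where a Span<byte> occupies 16 bytes (an 8-byte reference plus a 4-byte length, padded); the class and method names are hypothetical:

```csharp
// Hypothetical helper that only illustrates the arithmetic; it is not JIT code.
public static class PrologZeroingArithmetic
{
    // Assumption: 64-bit process, so Span<byte> is an 8-byte ref + 4-byte length, padded to 16.
    public const int SpanSizeBytes = 16;

    // Total stack bytes the prolog must zero for the given number of span locals.
    public static int BytesToZero(int spanCount) => spanCount * SpanSizeBytes;

    // Iteration count for a rep stosd loop (4 bytes per store).
    public static int DwordStores(int spanCount) => BytesToZero(spanCount) / sizeof(int);

    // Number of individual 8-byte mov stores needed instead.
    public static int QwordStores(int spanCount) => BytesToZero(spanCount) / sizeof(long);
}
```

Four spans give 64 bytes, i.e. `DwordStores(4) == 16`, which matches `mov ecx, 16` before `rep stosd` in the slow prolog. Two spans give 32 bytes, i.e. `QwordStores(2) == 4`, which matches the four `mov qword ptr` stores in the fast one.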

We have an issue #8890 for improving heuristics for prolog zeroing.

Ideally the jit would reverse copy-prop and construct the GetSpan results in the same structs later used as arguments to SequenceEqualsParams; this is beyond what it can do now.

There are also some promoted spans in the mix, so the code is also moving span fields from stack to register pairs in places. The jit would be better off not promoting as the code never computes with the span fields, just passes them around, and they ultimately have to end up on the stack. But those are tricky heuristics to get right.

> Can costs for passing Spans as parameters be reduced? (e.g. by passing in xmm registers).

The ABI costs for spans in simple examples like these should be somewhat lower on SysV; however, the jit does not yet take full advantage of this.

Since there are GC refs in spans, using xmm here would require a large number of changes throughout the system -- we currently assume GC refs can only live in general purpose registers. And the performance impact of an ABI change is hard to assess. Invariably some things get better and others worse.

benaadams · 2020-02-16T19:05:58Z

> We have an issue #8890 for improving heuristics for prolog zeroing.

Ah, it hits the 16 byte limit so moves to rep stosd, though perhaps using a single xmm reg would be better, as mentioned in that issue.

Dotnet-GitSync-Bot added the area-System.Memory and untriaged labels Feb 16, 2020

benaadams changed the title ~~Can costs for passing Spans as parameters be reduced~~ Can costs for passing Spans as parameters be reduced? Feb 16, 2020

benaadams mentioned this issue Feb 16, 2020

Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R #32371

Merged

benaadams changed the title ~~Can costs for passing Spans as parameters be reduced?~~ Strange Span costs for as Memory.Span -> parameter Feb 16, 2020

jkotas added the area-CodeGen-coreclr and optimization labels and removed the area-System.Memory label Feb 16, 2020

benaadams mentioned this issue Feb 17, 2020

Use simd for small prolog zeroing (ia32/x64) #32442

Closed

AndyAyersMS removed the untriaged label Feb 18, 2020

AndyAyersMS mentioned this issue Feb 18, 2020

[Jit] Duplicate stores when method calling #32401

Closed

benaadams mentioned this issue Feb 19, 2020

Use xmm for stack prolog zeroing rather than rep stos #32538

Merged

AndyAyersMS closed this as completed in #32538 Mar 5, 2020

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
