Use simd for small prolog zeroing (ia32/x64) #32442
Conversation
At R2R this changes the prologue for `G_M41002_IG01` from:

```asm
sub rsp, 96
mov rsi, rcx
lea rdi, [rsp+28H]
mov ecx, 12
xor rax, rax
rep stosd
mov rcx, rsi
mov qword ptr [rsp+58H], rdx
mov rbx, rcx
mov rsi, rdx
mov rdi, r8
;; bbWeight=1 PerfScore 31.50
```

to:

```asm
sub rsp, 96
xorps xmm4, xmm4
movdqu xmmword ptr [rsp+28H], xmm4
movdqu xmmword ptr [rsp+38H], xmm4
movdqu xmmword ptr [rsp+48H], xmm4
mov qword ptr [rsp+58H], rdx
mov rbx, rcx
mov rsi, rdx
mov rdi, r8
;; bbWeight=1 PerfScore 8.33
```
At Jit time (when it can use the longer ymm registers), from:

```asm
sub rsp, 96
mov rsi, rcx
lea rdi, [rsp+20H]
mov ecx, 16
xor rax, rax
rep stosd
mov rcx, rsi
mov rsi, rcx
;; bbWeight=1 PerfScore 32.00
```

to:

```asm
sub rsp, 96
vzeroupper
vxorps ymm4, ymm4
vmovdqu ymmword ptr[rsp+20H], ymm4
vmovdqu ymmword ptr[rsp+40H], ymm4
mov rsi, rcx
;; bbWeight=1 PerfScore 8.83
```
```asm
sub rsp, 96
mov rsi, rcx
lea rdi, [rsp+20H]
mov ecx, 16
xor rax, rax
rep stosd
mov rcx, rsi
;; bbWeight=1 PerfScore 30.75
```

to:

```asm
sub rsp, 96
xorps xmm4, xmm4
movdqu xmmword ptr [rsp+20H], xmm4
movdqu xmmword ptr [rsp+30H], xmm4
movdqu xmmword ptr [rsp+40H], xmm4
movdqu xmmword ptr [rsp+50H], xmm4
mov rsi, rdx
;; bbWeight=1 PerfScore 8.58
```

Though at Tier1 that would convert to a smaller form of:

```asm
sub rsp, 96
vxorps ymm4, ymm4
movdqu ymmword ptr [rsp+20H], ymm4
movdqu ymmword ptr [rsp+40H], ymm4
mov rsi, rdx
```
Neither looks to be listed for a specific micro-architecture. Since the current cache line size is 64 bytes and I believe we only guarantee 16-byte stack alignment, there would be a decent chance of crossing a cache line boundary for 2x YMM writes.
Only pointer-sized alignment? e.g. 8 bytes on x64.
Still going to hit it with 4x XMM writes; it's only removing one chance per 32-byte segment (by having 2x 16-byte writes, so a split in the middle).

Since the issue happens at page boundaries and the stack is 8-byte aligned, the chance is ever so slightly lower: 2 in 4096 rather than 3 in 4096?
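A quick way to sanity-check those odds (a rough standalone sketch, not part of the change; the names and the 4 KB/8-byte assumptions are mine): it enumerates every 8-byte-aligned placement of one 32-byte segment within a page and counts which placements split across the page for a single 32-byte store versus two 16-byte stores.

```csharp
using System;

class PageSplitOdds
{
    // Does a store of `size` bytes starting at `offset` cross into the next 4 KB page?
    static bool CrossesPage(long offset, int size) => (offset & 0xFFF) + size > 0x1000;

    static void Main()
    {
        int ymmSplits = 0, xmmSplits = 0, placements = 0;

        // Walk every 8-byte-aligned start offset of a 32-byte segment within one page.
        for (long start = 0; start < 0x1000; start += 8)
        {
            placements++;
            if (CrossesPage(start, 32))
                ymmSplits++;                                        // one 32-byte (YMM) store
            if (CrossesPage(start, 16) || CrossesPage(start + 16, 16))
                xmmSplits++;                                        // two 16-byte (XMM) stores
        }

        Console.WriteLine($"{placements} placements: YMM splits = {ymmSplits}, 2x XMM splits = {xmmSplits}");
        // Prints "512 placements: YMM splits = 3, 2x XMM splits = 2" -- i.e. 3 vs 2 chances,
        // matching the "2 in 4096 rather than 3 in 4096" estimate above.
    }
}
```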
Chose 4x SIMD size due to the code-gen size rather than the performance; only using …
No, I believe the stack (for TYP_SIMD) has some special support and will be 16-byte aligned (maybe @CarolEidt could confirm while I try to find the relevant code).
Ah, I see; makes sense.
Could we align to 32 bytes for 64-bit? (And then not use AVX on 32-bit; not sure if it's supported there?) That would also make …
There is some information on the stack frame alignment here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lclvars.cpp#L4327 and here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lclvars.cpp#L6256

IIRC, we only properly align (on the stack and only for x64) …
The ABI guarantees that the stack has 16-byte alignment for x64, both on Windows and SysV, outside of the prolog and epilog, for non-leaf methods. The local var area is likewise guaranteed by the jit to be 16-byte aligned; see e.g. …
So it should be possible to 32-byte align on x64 to avoid any page-split penalties in the …
Change with a larger struct:

```csharp
private readonly Memory<byte> _expected = new byte[5];

public bool ReadOnlySequenceEqual(ReadOnlySequence<byte> ros)
{
    var span = ros.FirstSpan;
    var expected = _expected.Span;
    return (span.Length == expected.Length) ?
        SequenceEqualFast(span, expected) :
        SequenceEqualSlow(ros, expected);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool SequenceEqualFast(ReadOnlySpan<byte> input, ReadOnlySpan<byte> expected) => true;

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool SequenceEqualSlow(ReadOnlySequence<byte> input, ReadOnlySpan<byte> expected) => true;
```

From:

```asm
sub rsp, 168
vzeroupper
mov rsi, rcx
lea rdi, [rsp+28H]
mov ecx, 32
xor rax, rax
rep stosd
mov rcx, rsi
mov rdi, rcx
mov rsi, rdx
;; bbWeight=1 PerfScore 36.25
```

To:

```asm
sub rsp, 168
vzeroupper
vxorps ymm4, ymm4
vmovdqu ymmword ptr[rsp+28H], ymm4
vmovdqu ymmword ptr[rsp+48H], ymm4
vmovdqu ymmword ptr[rsp+68H], ymm4
vmovdqu ymmword ptr[rsp+88H], ymm4
mov rdi, rcx
mov rsi, rdx
;; bbWeight=1 PerfScore 14.08
```
We currently don't have enough guarantees -- we'd need to implement "dynamic alignment" or similar, which comes with its own set of tradeoffs.
So on the one hand crossing a page should be rare-ish; but on the other hand, if you do cross a page and it's in your hot-path range, then it will be continuously crossing a page as the stack unwinds and rewinds, which would then make it very common (i.e. once you are locked into crossing a page, it will happen frequently).

Hmm... @tannergooding, as you highlight, that probably is quite a problem 😢 Some of the structs are pretty big (e.g. …).

Don't want to add a branch to align... Could start stack sizes >= 64 unconditionally with a … However that would over-zero the stack by 16 bytes when unaligned; would that be a problem? (e.g. are there stack markers to detect overruns which would get upset if they were zeroed?)
Talking of `G_M25757_IG01`:

```asm
push rbp
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbx
sub rsp, 216
vzeroupper
lea rbp, [rsp+110H]
mov rsi, rcx
lea rdi, [rbp-D8H]
mov ecx, 40
xor rax, rax
rep stosd
mov rcx, rsi
mov qword ptr [rbp-E8H], rsp
mov gword ptr [rbp+10H], rcx
mov bword ptr [rbp+18H], rdx
mov bword ptr [rbp+28H], r9
mov esi, r8d
;; bbWeight=1 PerfScore 40.50

G_M25757_IG02:
mov rcx, bword ptr [rbp+18H]
mov rdx, gword ptr [rcx]
mov rcx, bword ptr [rbp+18H]
mov rdi, gword ptr [rcx+8]
mov rcx, bword ptr [rbp+18H]
mov ebx, dword ptr [rcx+16]
and ebx, 0xD1FFAB1E
mov rcx, bword ptr [rbp+18H]
mov r14d, dword ptr [rcx+20]
and r14d, 0xD1FFAB1E
cmp rdx, rdi
je SHORT G_M25757_IG10
;; bbWeight=1 PerfScore 13.75
```

Bunch of busy work with …

With ymm:

```asm
sub rsp, 216
vzeroupper
lea rbp, [rsp+110H]
vxorps ymm4, ymm4
vmovdqu ymmword ptr[rbp-D8H], ymm4
vmovdqu ymmword ptr[rbp-B8H], ymm4
vmovdqu ymmword ptr[rbp-98H], ymm4
vmovdqu ymmword ptr[rbp-78H], ymm4
vmovdqu ymmword ptr[rbp-58H], ymm4
```

However, only using xmm:

```asm
sub rsp, 216
vzeroupper
lea rbp, [rsp+110H]
vxorps xmm4, xmm4
vmovdqu xmmword ptr [rbp-D8H], xmm4
vmovdqu xmmword ptr [rbp-C8H], xmm4
vmovdqu xmmword ptr [rbp-B8H], xmm4
vmovdqu xmmword ptr [rbp-A8H], xmm4
vmovdqu xmmword ptr [rbp-98H], xmm4
vmovdqu xmmword ptr [rbp-88H], xmm4
vmovdqu xmmword ptr [rbp-78H], xmm4
vmovdqu xmmword ptr [rbp-68H], xmm4
vmovdqu xmmword ptr [rbp-58H], xmm4
vmovdqu xmmword ptr [rbp-48H], xmm4
```
Now produces:

```asm
sub rsp, 216
vzeroupper
lea rbp, [rsp+110H]
vxorps ymm4, ymm4
vmovdqu xmmword ptr [rbp-D8H], xmm4 ; Zero first 16 bytes
lea rax, [rbp-D8H]                  ; Get block start
and rax, 31                         ; Get offset for 32 byte alignment of block (is 16 byte aligned and first 16 are already zeroed)
add rax, -216                       ; Add back offset
add rax, rbp                        ; Add back stack pointer
vmovdqu ymmword ptr[rax], ymm4
vmovdqu ymmword ptr[rax+32], ymm4
vmovdqu ymmword ptr[rax+64], ymm4
vmovdqu ymmword ptr[rax+96], ymm4
vmovdqu ymmword ptr[rax+128], ymm4
```

@tannergooding would that work?
Probably don't want to use non-temporal stores? Not having visibility from other procs is fine (since it's the thread's stack); however, we likely want it in the cache as it's going to be immediately used?
Maybe? I wonder if this might cause store buffer stalls.
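For reference, this is the distinction being discussed, expressed with the managed intrinsics (a sketch only; the JIT emits these instructions directly rather than calling these APIs, and the code assumes `Sse2.IsSupported` and a 16-byte-aligned pointer for the non-temporal case): a regular store keeps the line in cache, while a non-temporal store (`movntdq`) hints the CPU to bypass it.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class StoreKinds
{
    // Zero 32 bytes with two regular 128-bit stores (movdqu): the cache line stays
    // resident, which is what you want if the freshly zeroed frame is read right away.
    public static void ZeroCached(byte* p)
    {
        Sse2.Store(p, Vector128<byte>.Zero);
        Sse2.Store(p + 16, Vector128<byte>.Zero);
    }

    // Same zeroing with non-temporal stores (movntdq, requires 16-byte-aligned p):
    // bypasses the cache, so the zeroed locals would likely miss on their first use.
    public static void ZeroNonTemporal(byte* p)
    {
        Sse2.StoreAlignedNonTemporal(p, Vector128<byte>.Zero);
        Sse2.StoreAlignedNonTemporal(p + 16, Vector128<byte>.Zero);
    }
}
```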
Also 32-byte aligns the `rep stosd` path, from:

```asm
lea rdi, [rbp-98H]
mov ecx, 26
xor rax, rax
rep stosd
```

to:

```asm
lea rdi, [rbp-98H]                  ; block start
xorps xmm4, xmm4
movdqu xmmword ptr [rbp-98H], xmm4  ; zero first 16 bytes
and rdi, 31                         ; byte offset from 32-byte alignment
mov ecx, 26                         ; dword count
sub rcx, rdi                        ; subtract offset (bytes) from count (dwords)
add rdi, -152
add rdi, rbp                        ; rdi = aligned destination
xor rax, rax
rep stosd
```
Crossgen of corelib (which is non-AVX) is …
I would be ok to take this regression for this fix.
Haven't got it quite right...
Perhaps not everything is fully 16-byte aligned; not able to trigger this failure in a simple program yet. @AndyAyersMS, when you say: …

Is there a way to detect it's a leaf method? Then I can have more variable realignment but keep it to leaf methods (i.e. not assume it's 16-byte aligned).
Think I can use `bool is16byteAligned = (compiler->compLclFrameSize % 16) == 0;`?
For Windows x64, I think you can check if … You could also add explicit tracking in …
I've changed it to assume no alignment and to always fully align with 2x xmm at the start for … It does make it regress further slightly …
Might wait to see if tests pass before breaking it down more 😅 |
Decided to do that 😄 #32538 |
Closing in preference of #32538 |
Ah... might be an issue between bytes and dwords.
> `rep stosd` has low throughput if the memory is not 32-byte aligned; however, it has a compact code size. Use faster-throughput SIMD instructions when the amount to zero in the prologue is low, moving back to `rep stosd` when it becomes too many instructions.

Will do it in the other one.
`rep stosd` has low throughput if the memory is not 32-byte aligned; however, it has a compact code size. Use faster-throughput SIMD instructions when the amount to zero in the prologue is low, moving back to `rep stosd` when it becomes too many instructions.

Also align `rep stosd` to 32 bytes when it is used.

Best checked with "ignore whitespace" (due to new `if`/`else` indenting).

Contributes to #8890

/cc @erozenfeld not entirely sure what I'm doing :)
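To make the size/throughput tradeoff in that description concrete, here is a rough model of the selection (a sketch with made-up thresholds and helper names, not the JIT's actual heuristic): small zeroing blocks get unrolled SIMD stores, and past a certain store count it falls back to the compact `rep stosd` form.

```csharp
using System;

static class PrologZeroingSketch
{
    // Hypothetical cut-off: beyond this many SIMD stores, unrolling stops paying for its code size.
    const int MaxSimdStores = 8;

    // Returns a short description of how a zeroing block of `sizeBytes` might be emitted.
    public static string Plan(int sizeBytes, bool avxAvailable)
    {
        int simdWidth = avxAvailable ? 32 : 16;                 // ymm vs xmm stores
        int stores = (sizeBytes + simdWidth - 1) / simdWidth;   // round up to whole stores

        if (stores <= MaxSimdStores)
            return $"{stores} x {simdWidth}-byte SIMD store(s)";

        // Too many instructions: fall back to the compact rep stosd form,
        // after aligning the destination to 32 bytes as the description suggests.
        return $"align to 32 bytes, then rep stosd of {sizeBytes / 4} dwords";
    }

    static void Main()
    {
        foreach (int size in new[] { 48, 96, 160, 512 })
            Console.WriteLine($"{size,4} bytes: {Plan(size, avxAvailable: true)}");
    }
}
```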