
Use simd for small prolog zeroing (ia32/x64) #32442

Closed
wants to merge 4 commits

Conversation

@benaadams (Member) commented Feb 17, 2020

rep stosd has low throughput when the memory is not 32-byte aligned; however, its code size is compact.

Use higher-throughput SIMD instructions when the amount to zero in the prologue is small, falling back to rep stosd when it would take too many instructions.

Also align rep stosd to 32 bytes when it is used.
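
A rough sketch of the intended selection (hypothetical helper names and output; not the actual JIT emitter code — the discussion further down settles on roughly 4x the SIMD register width as the cutover):

// Hypothetical sketch only: pick SIMD stores for small prolog zeroing and fall
// back to rep stosd past a cutoff of roughly 4x the SIMD register width.
// Offsets and sizes are in bytes; assumes sizeBytes is a multiple of the
// register width for simplicity.
#include <cstdio>

void sketchPrologZeroing(int frameOffset, unsigned sizeBytes, unsigned simdRegBytes)
{
    const unsigned cutoff = 4 * simdRegBytes;           // e.g. 64 for xmm, 128 for ymm

    if (sizeBytes <= cutoff)
    {
        printf("  xorps   xmm4, xmm4\n");               // zero a scratch SIMD register
        for (unsigned done = 0; done < sizeBytes; done += simdRegBytes)
            printf("  movdqu  [fp%+d], xmm4\n", frameOffset + (int)done);
    }
    else
    {
        printf("  lea     rdi, [fp%+d]\n", frameOffset);
        printf("  mov     ecx, %u\n", sizeBytes / 4);   // rep stosd counts dwords
        printf("  xor     rax, rax\n");
        printf("  rep stosd\n");                        // compact, but slower when unaligned
    }
}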

Best reviewed with whitespace ignored (due to the new if/else indenting).

Contributes to #8890

/cc @erozenfeld not entirely sure what I'm doing :)

Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Feb 17, 2020
@benaadams (Member Author) commented Feb 17, 2020

At R2R, this changes the prologue for Memory`1:TryCopyTo(Memory`1):bool:this from

G_M41002_IG01:
       sub      rsp, 96
       mov      rsi, rcx
       lea      rdi, [rsp+28H]
       mov      ecx, 12
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      qword ptr [rsp+58H], rdx
       mov      rbx, rcx
       mov      rsi, rdx
       mov      rdi, r8
						;; bbWeight=1    PerfScore 31.50

to

       sub      rsp, 96
       xorps    xmm4, xmm4
       movdqu   xmmword ptr [rsp+28H], xmm4
       movdqu   xmmword ptr [rsp+38H], xmm4
       movdqu   xmmword ptr [rsp+48H], xmm4
       mov      qword ptr [rsp+58H], rdx
       mov      rbx, rcx
       mov      rsi, rdx
       mov      rdi, r8
						;; bbWeight=1    PerfScore 8.33

@benaadams (Member Author) commented Feb 17, 2020

At JIT time (when it can use the longer ymm registers), this changes the prologue in issue #32396 from

       sub      rsp, 96
       mov      rsi, rcx
       lea      rdi, [rsp+20H]
       mov      ecx, 16
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      rsi, rcx
						;; bbWeight=1    PerfScore 32.00

to

       sub      rsp, 96
       vzeroupper 
       vxorps   ymm4, ymm4
       vmovdqu  ymmword ptr[rsp+20H], ymm4
       vmovdqu  ymmword ptr[rsp+40H], ymm4
       mov      rsi, rcx
						;; bbWeight=1    PerfScore 8.83

@benaadams (Member Author):

Total bytes of diff: 598 (0.02% of base)
    diff is a regression.

Top file regressions (bytes):
         598 : System.Private.CoreLib.dasm (0.02% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions (bytes):
          77 ( 3.19% of base) : System.Private.CoreLib.dasm - Array:<Sort>g__GenericSort|126_0(Array,Array,int,int) (11 methods)
          58 ( 0.31% of base) : System.Private.CoreLib.dasm - ValueTuple`8:GetHashCode():int:this (17 methods)
          36 ( 1.60% of base) : System.Private.CoreLib.dasm - ArraySegment`1:GetEnumerator():Enumerator:this (18 methods)
          26 ( 0.35% of base) : System.Private.CoreLib.dasm - Vector64`1:ToString():String:this (11 methods)
          26 ( 1.77% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<TKey,TValue>>.GetEnumerator():IEnumerator`1:this (11 methods)
          26 ( 1.77% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.IEnumerable.GetEnumerator():IEnumerator:this (11 methods)
          26 ( 1.77% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.IDictionary.GetEnumerator():IDictionaryEnumerator:this (11 methods)
          22 ( 2.00% of base) : System.Private.CoreLib.dasm - Memory`1:TryCopyTo(Memory`1):bool:this (4 methods)
          22 ( 2.00% of base) : System.Private.CoreLib.dasm - ReadOnlyMemory`1:TryCopyTo(Memory`1):bool:this (4 methods)
          18 ( 0.62% of base) : System.Private.CoreLib.dasm - ValueTuple`8:System.Runtime.CompilerServices.ITuple.get_Item(int):Object:this (17 methods)
          18 ( 0.23% of base) : System.Private.CoreLib.dasm - Vector128`1:ToString():String:this (11 methods)
          18 ( 0.32% of base) : System.Private.CoreLib.dasm - EqualityComparer`1:System.Collections.IEqualityComparer.GetHashCode(Object):int:this (42 methods)
          16 ( 1.30% of base) : System.Private.CoreLib.dasm - ConfiguredValueTaskAwaiter:OnCompleted(Action):this (4 methods)
          16 ( 1.34% of base) : System.Private.CoreLib.dasm - ConfiguredValueTaskAwaiter:UnsafeOnCompleted(Action):this (4 methods)
          16 ( 1.95% of base) : System.Private.CoreLib.dasm - Dictionary`2:GetEnumerator():Enumerator:this (8 methods)
          15 ( 0.09% of base) : System.Private.CoreLib.dasm - ValueTuple`8:ToString():String:this (17 methods)
          15 ( 0.10% of base) : System.Private.CoreLib.dasm - ValueTuple`8:System.IValueTupleInternal.ToStringEnd():String:this (17 methods)
          14 ( 2.33% of base) : System.Private.CoreLib.dasm - TimeZoneInfo:GetLocalTimeZone(CachedData):TimeZoneInfo
          14 ( 0.45% of base) : System.Private.CoreLib.dasm - ValueTuple`8:Equals(Object):bool:this (17 methods)
          14 ( 0.14% of base) : System.Private.CoreLib.dasm - ValueTuple`8:System.Collections.IStructuralEquatable.Equals(Object,IEqualityComparer):bool:this (17 methods)
          14 ( 0.28% of base) : System.Private.CoreLib.dasm - ValueTuple`8:System.IComparable.CompareTo(Object):int:this (17 methods)
          14 ( 0.12% of base) : System.Private.CoreLib.dasm - ValueTuple`8:System.Collections.IStructuralComparable.CompareTo(Object,IComparer):int:this (17 methods)
          13 ( 0.64% of base) : System.Private.CoreLib.dasm - DateTimeParse:TryParse(ReadOnlySpan`1,DateTimeFormatInfo,int,byref):bool (2 methods)
          10 ( 1.71% of base) : System.Private.CoreLib.dasm - Environment:get_UserDomainName():String
          10 ( 0.84% of base) : System.Private.CoreLib.dasm - TextSegmentationUtility:GetLengthOfFirstExtendedGraphemeCluster(ReadOnlySpan`1,DecodeFirstRune`1):int (2 methods)

Top method improvements (bytes):
        -132 (-33.93% of base) : System.Private.CoreLib.dasm - PathHelper:PrependDevicePathChars(byref,bool,byref):int
        -122 (-18.91% of base) : System.Private.CoreLib.dasm - MethodBase:AppendParameters(byref,ref,int)
        -119 (-8.98% of base) : System.Private.CoreLib.dasm - Path:GetRelativePath(String,String,int):String
         -67 (-12.93% of base) : System.Private.CoreLib.dasm - RuntimePropertyInfo:ToString():String:this
         -63 (-14.72% of base) : System.Private.CoreLib.dasm - RuntimeConstructorInfo:ToString():String:this
         -59 (-11.50% of base) : System.Private.CoreLib.dasm - KeyValuePair:PairToString(Object,Object):String
         -40 (-6.88% of base) : System.Private.CoreLib.dasm - Exception:ToString():String:this
         -27 (-6.75% of base) : System.Private.CoreLib.dasm - ActivityTracker:NormalizeActivityName(String,String,int):String
         -26 (-7.01% of base) : System.Private.CoreLib.dasm - EventPipeController:get_DefaultProviderConfiguration():ref
         -23 (-0.51% of base) : System.Private.CoreLib.dasm - EventSource:CreateManifestAndDescriptors(Type,String,EventSource,int):ref
         -20 (-5.00% of base) : System.Private.CoreLib.dasm - UTF7Encoding:MakeTables():this
         -16 (-0.31% of base) : System.Private.CoreLib.dasm - Enumerator:System.Collections.IEnumerator.get_Current():Object:this (65 methods)
         -13 (-0.22% of base) : System.Private.CoreLib.dasm - DateTimeParse:ParseByFormat(byref,byref,byref,DateTimeFormatInfo,byref):bool
         -12 (-0.51% of base) : System.Private.CoreLib.dasm - IdnMapping:PunycodeEncode(String):String
         -12 (-0.38% of base) : System.Private.CoreLib.dasm - IdnMapping:PunycodeDecode(String):String
         -10 (-2.03% of base) : System.Private.CoreLib.dasm - EventPipeEventDispatcher:CommitDispatchConfiguration():this
          -7 (-0.62% of base) : System.Private.CoreLib.dasm - Enum:ValueToHexString(Object):String
          -7 (-1.23% of base) : System.Private.CoreLib.dasm - HijriCalendar:GetAdvanceHijriDate():int
          -6 (-0.07% of base) : System.Private.CoreLib.dasm - Vector256`1:ToString():String:this (11 methods)
          -5 (-2.62% of base) : System.Private.CoreLib.dasm - ManifestBuilder:GetLocalizedMessage(String,CultureInfo,bool):String:this
          -3 (-3.06% of base) : System.Private.CoreLib.dasm - Vector64:ToVector128(Vector64`1):Vector128`1
          -3 (-3.26% of base) : System.Private.CoreLib.dasm - Vector128:WithLower(Vector128`1,Vector64`1):Vector128`1
          -3 (-0.93% of base) : System.Private.CoreLib.dasm - AssemblyLoadContext:ResolveSatelliteAssembly(AssemblyName):Assembly:this
          -3 (-0.19% of base) : System.Private.CoreLib.dasm - AssemblyName:EscapeString(String,int,int,ref,byref,bool,ushort,ushort,ushort):ref
          -2 (-0.28% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,String,ref,byref,ref,byref):int (2 methods)
303 total methods with Code Size differences (50 improved, 253 regressed), 20097 unchanged.

PathHelper:PrependDevicePathChars(byref,bool,byref):int still has the same rep stosd prologue; however, more looks to get cleaned up in the inlines for some reason, so it uses 2 fewer temps (because it's importing SIMD; though I thought inlining happens prior to codegen?).

@benaadams (Member Author) commented Feb 17, 2020

Array:<Sort>g__GenericSort|126_0(Array,Array,int,int) x 11 regresses at the edge of the SIMD limit
From

       sub      rsp, 96
       mov      rsi, rcx
       lea      rdi, [rsp+20H]
       mov      ecx, 16
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
						;; bbWeight=1    PerfScore 30.75

To

       sub      rsp, 96
       xorps    xmm4, xmm4
       movdqu   xmmword ptr [rsp+20H], xmm4
       movdqu   xmmword ptr [rsp+30H], xmm4
       movdqu   xmmword ptr [rsp+40H], xmm4
       movdqu   xmmword ptr [rsp+50H], xmm4
       mov      rsi, rdx
						;; bbWeight=1    PerfScore 8.58

Though at Tier1 that would convert to a smaller form of

       sub      rsp, 96
       vxorps   ymm4, ymm4
       movdqu   ymmword ptr [rsp+20H], ymm4
       movdqu   ymmword ptr [rsp+40H], ymm4
       mov      rsi, rdx

@tannergooding (Member):

> At JIT time (when it can use the longer ymm registers), this changes the prologue in issue #32396 from

The Intel® 64 and IA-32 Architectures Optimization Reference Manual has section 12.6.2, "Consider 16-Byte Memory Access when Memory is Unaligned":

> For best results use Intel AVX 32-byte loads and align data to 32-bytes. However, there are cases where
> you cannot align the data, or data alignment is unknown. This can happen when you are writing a library
> function and the input data alignment is unknown. In these cases, using 16-Byte memory accesses may
> be the best alternative. The following method uses 16-byte loads while still benefiting from the 32-byte
> YMM registers.
>
> Consider replacing unaligned 32-byte memory accesses using a combination of VMOVUPS,
> VINSERTF128, and VEXTRACTF128 instructions.

and section 12.6.3, "Prefer Aligned Stores Over Aligned Loads":

> There are cases where it is possible to align only a subset of the processed data buffers. In these cases,
> aligning data buffers used for store operations usually yields better performance than aligning data
> buffers used for load operations.
>
> Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there
> is a very high penalty on stores to a split cache-line that crosses pages. This penalty is estimated at 150
> cycles. Loads that cross a page boundary are executed at retirement. In Example 12-12, unaligned store
> address can affect SAXPY performance for 3 unaligned addresses to about one quarter of the aligned
> case.

Neither looks to be listed for a specific micro-architecture. Since the current cache line size is 64 bytes and I believe we only guarantee 16-byte stack alignment, there would be a decent chance of crossing a cache line boundary for 2x YMM writes.
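
A quick sanity check on that chance (my own arithmetic, not from the comment): with only 16-byte alignment guaranteed, a 32-byte store can start at offset 0, 16, 32 or 48 within a 64-byte line, and only the offset-48 placement splits the line — roughly 1 in 4.

// Count how many 16-byte-aligned start offsets within a 64-byte cache line
// make a 32-byte store straddle the line boundary.
#include <cstdio>

int main()
{
    int crossing = 0, total = 0;
    for (int offset = 0; offset < 64; offset += 16)
    {
        ++total;
        if (offset + 32 > 64)       // the store runs past the end of the line
            ++crossing;
    }
    printf("%d of %d placements split the cache line\n", crossing, total);   // prints "1 of 4"
    return 0;
}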

@tannergooding (Member):

> Array:&lt;Sort&gt;g__GenericSort|126_0(Array,Array,int,int) x 11 regresses at the edge of the SIMD limit

Could this be due to the specific CPU? The optimization manual only lists the numbers for Ivy Bridge but seems to indicate 128 bytes is a good cutoff:
[image: rep stosb size/cutoff numbers from the Intel optimization manual]

That is using rep stosb, but I'd imagine a similar cutoff for rep stosd (128 is also the cutoff that memset looks to use, and it will use non-temporal stores if ERMSB isn't available).
The cutoff might also differ on AMD boxes (where the perf of rep stosb/w/d/q depends greatly on alignment), and that might be worth taking into account.

@benaadams (Member Author):

> we only guarantee 16-byte stack alignment

Only pointer-sized alignment? e.g. 8 bytes on x64

> there would be a decent chance of crossing the cache line boundary for 2x YMM writes.

Still going to hit it with 4x XMM writes; it's only removing one chance in a 32-byte segment (by having 2x 16-byte writes, so a split in the middle).

> there is a very high penalty on stores to a split cache-line that crosses pages

Since the issue happens at page boundaries and the stack is 8-byte aligned, the chance is ever so slightly lower, at 2 in 4096 rather than 3 in 4096?

> Array:&lt;Sort&gt;g__GenericSort|126_0(Array,Array,int,int) x 11 regresses at the edge of the SIMD limit
>
> Could this be due to the specific CPU? The optimization manual only lists the numbers for Ivy Bridge but seems to indicate 128 bytes is a good cutoff:

I chose 4 x the SIMD size due to the codegen size rather than performance; only using xmm means stopping at 64 bytes, whereas using ymm goes to 128 bytes. Higher than that, it moves back to rep stosd because the asm diff gets too large.

@tannergooding (Member):

> Only pointer-sized alignment? e.g. 8 bytes on x64

No, I believe the stack (for TYP_SIMD) has some special support and will be 16-byte aligned (maybe @CarolEidt could confirm while I try to find the relevant code).

> Higher than that, it moves back to rep stosd because the asm diff gets too large

Ah, I see; makes sense.

@benaadams (Member Author) commented Feb 17, 2020

> Only pointer-sized alignment? e.g. 8 bytes on x64

> No, I believe the stack (for TYP_SIMD) has some special support and will be 16-byte aligned

Could we align to 32 bytes for 64-bit? (And then not use AVX on 32-bit; not sure if it's supported there?)

That would also make rep stosd happier, as it prefers 32-byte alignment (though it also clobbers a bunch of registers, and more so on Linux).

@tannergooding (Member):

There is some information on the stack frame alignment here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lclvars.cpp#L4327 and here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lclvars.cpp#L6256

IIRC, we only properly align (on the stack and only for x64) TYP_SIMD16 right now and we don't guarantee alignment for TYP_SIMD32, but someone more familiar with the area should confirm 😄

benaadams force-pushed the rep-stosd branch 4 times, most recently from 5d056d2 to 89ae26b on February 17, 2020 21:13
@AndyAyersMS (Member):

ABI guarantees that the stack has 16 byte alignment for x64, both on Windows and SysV, outside of the prolog and epilog, for non-leaf methods.

The local var area is likewise guaranteed by the JIT to be 16-byte aligned, see e.g. lvaAlignFrame, and this is used to align SIMD locals, e.g. getSIMDTypeAlignment (though reading this just now, it looks like we also think we can align SIMD locals to 32 bytes, which should not be possible).

@benaadams (Member Author):

> The local var area is likewise guaranteed by the JIT to be 16-byte aligned, see e.g. lvaAlignFrame

So it should be possible to 32-byte align on x64 to avoid any page-split penalties in G_xxx_IG01 after the sub rsp, NN (i.e. the zeroing block).

@benaadams (Member Author):

The change with a larger struct, ReadOnlySequence&lt;byte&gt;, and its data:

private readonly Memory<byte> _expected = new byte[5];
public bool ReadOnlySequenceEqual(ReadOnlySequence<byte> ros)
{
    var span = ros.FirstSpan;
    var expected = _expected.Span;
    return (span.Length == expected.Length) ? 
        SequenceEqualFast(span, expected) :
        SequenceEqualSlow(ros, expected);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool SequenceEqualFast(ReadOnlySpan<byte> input, ReadOnlySpan<byte> expected) => true;

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool SequenceEqualSlow(ReadOnlySequence<byte> input, ReadOnlySpan<byte> expected) => true;

From

       sub      rsp, 168
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+28H]
       mov      ecx, 32
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      rdi, rcx
       mov      rsi, rdx
						;; bbWeight=1    PerfScore 36.25

To

       sub      rsp, 168
       vzeroupper 
       vxorps   ymm4, ymm4
       vmovdqu  ymmword ptr[rsp+28H], ymm4
       vmovdqu  ymmword ptr[rsp+48H], ymm4
       vmovdqu  ymmword ptr[rsp+68H], ymm4
       vmovdqu  ymmword ptr[rsp+88H], ymm4
       mov      rdi, rcx
       mov      rsi, rdx
						;; bbWeight=1    PerfScore 14.08

@AndyAyersMS (Member):

> So it should be possible to 32-byte align on x64

We currently don't have enough guarantees -- we'd need to implement "dynamic alignment" or similar, which comes with its own set of tradeoffs.

@benaadams (Member Author) commented Feb 18, 2020

So on the one hand, crossing a page should be rare-ish; but on the other hand, if you do cross a page and it's in your hot-path range, then you will be continuously crossing a page as the stack unwinds and rewinds, which would make it very common (i.e. once you are locked into crossing a page, it happens frequently). Hmm... @tannergooding, as you highlight, that probably is quite a problem 😢

Some of the structs are pretty big (e.g. ReadOnlySequence&lt;T&gt; above in #32442 (comment), with SequenceReader&lt;T&gt; being even bigger), so not using ymm would make the asm very big.

Don't want to add a branch to align... Could start stack sizes >= 64 unconditionally with an xmm store to align; add 0 or 16 (via a mask, so no branch) to a second reg, e.g. rcx, and then zero based on that...

However, that would over-zero the stack by 16 bytes when unaligned; would that be a problem? (e.g. are there stack markers to detect overruns that would get upset if they were zeroed?)

@benaadams (Member Author):

Talking of SequenceReader&lt;T&gt;: a target method would be Kestrel's Http1Connection.TakeMessageHeaders (or a non-dependency version for analysis, in a gist), which starts:

G_M25757_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 216
       vzeroupper 
       lea      rbp, [rsp+110H]
       mov      rsi, rcx
       lea      rdi, [rbp-D8H]
       mov      ecx, 40
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
       mov      qword ptr [rbp-E8H], rsp
       mov      gword ptr [rbp+10H], rcx
       mov      bword ptr [rbp+18H], rdx
       mov      bword ptr [rbp+28H], r9
       mov      esi, r8d
						;; bbWeight=1    PerfScore 40.50
G_M25757_IG02:
       mov      rcx, bword ptr [rbp+18H]
       mov      rdx, gword ptr [rcx]
       mov      rcx, bword ptr [rbp+18H]
       mov      rdi, gword ptr [rcx+8]
       mov      rcx, bword ptr [rbp+18H]
       mov      ebx, dword ptr [rcx+16]
       and      ebx, 0xD1FFAB1E
       mov      rcx, bword ptr [rbp+18H]
       mov      r14d, dword ptr [rcx+20]
       and      r14d, 0xD1FFAB1E
       cmp      rdx, rdi
       je       SHORT G_M25757_IG10
						;; bbWeight=1    PerfScore 13.75

Bunch of busy work with rcx in G_M25757_IG02?

With ymm that becomes

       sub      rsp, 216
       vzeroupper 
       lea      rbp, [rsp+110H]
       vxorps   ymm4, ymm4
       vmovdqu  ymmword ptr[rbp-D8H], ymm4
       vmovdqu  ymmword ptr[rbp-B8H], ymm4
       vmovdqu  ymmword ptr[rbp-98H], ymm4
       vmovdqu  ymmword ptr[rbp-78H], ymm4
       vmovdqu  ymmword ptr[rbp-58H], ymm4

However, only using xmm and allowing the same zeroing size as SIMD, it would be the following, which feels like way too much asm:

       sub      rsp, 216
       vzeroupper 
       lea      rbp, [rsp+110H]
       vxorps   xmm4, xmm4
       vmovdqu  xmmword ptr [rbp-D8H], xmm4
       vmovdqu  xmmword ptr [rbp-C8H], xmm4
       vmovdqu  xmmword ptr [rbp-B8H], xmm4
       vmovdqu  xmmword ptr [rbp-A8H], xmm4
       vmovdqu  xmmword ptr [rbp-98H], xmm4
       vmovdqu  xmmword ptr [rbp-88H], xmm4
       vmovdqu  xmmword ptr [rbp-78H], xmm4
       vmovdqu  xmmword ptr [rbp-68H], xmm4
       vmovdqu  xmmword ptr [rbp-58H], xmm4
       vmovdqu  xmmword ptr [rbp-48H], xmm4

@benaadams (Member Author) commented Feb 18, 2020

Now produces:

       sub      rsp, 216
       vzeroupper 
       lea      rbp, [rsp+110H]
       vxorps   ymm4, ymm4
       vmovdqu  xmmword ptr [rbp-D8H], xmm4 ; Zero first 16 bytes
       lea      rax, [rbp-D8H]              ; Get block start
       and      rax, 31                     ; Get offset for 32 byte alignment of block (is 16 byte aligned and first 16 are already zeroed)
       add      rax, -216                   ; Add back offset
       add      rax, rbp                    ; Add back stack pointer
       vmovdqu  ymmword ptr[rax], ymm4
       vmovdqu  ymmword ptr[rax+32], ymm4
       vmovdqu  ymmword ptr[rax+64], ymm4
       vmovdqu  ymmword ptr[rax+96], ymm4
       vmovdqu  ymmword ptr[rax+128], ymm4
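
Restating the address arithmetic in the middle of that sequence (my reading of it; assumes the block start is at least 16-byte aligned, per the discussion above):

// With a 16-byte-aligned block start, (start & 31) is either 0 or 16, so adding
// it back to the start yields the first 32-byte-aligned address in the block.
// The xmm store has already zeroed the first 16 bytes, so beginning the ymm
// stores at `aligned` leaves no gap (and merely re-zeroes 16 bytes when the
// block was already 32-byte aligned).
#include <cstdint>

uintptr_t firstAlignedYmmStore(uintptr_t blockStart /* assumed 16-byte aligned */)
{
    uintptr_t misalignment = blockStart & 31;   // 0 or 16
    return blockStart + misalignment;           // 32-byte aligned
}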

@tannergooding would that work?

benaadams marked this pull request as ready for review February 18, 2020 12:53
benaadams force-pushed the rep-stosd branch 3 times, most recently from 6856159 to 2f06da8 on February 18, 2020 13:40
@benaadams (Member Author):

> 128 is also the cutoff that memset looks to use and it will use non-temporal store if ERMSB isn't available

Probably don't want to use non-temporal? Not having visibility from other procs is fine (since it's the thread's stack); however, we likely want it in the cache as it's going to be used immediately?

@AndyAyersMS (Member):

Maybe? I wonder if this might cause store buffer stalls.

@benaadams (Member Author):

Also, this 32-byte aligns rep stosd when it does occur (when the size is larger than the SIMD cutover), in a similar way.
From

lea      rdi, [rbp-98H]
mov      ecx, 26
xor      rax, rax
rep stosd 

To

lea      rdi, [rbp-98H]
xorps    xmm4, xmm4
movdqu   xmmword ptr [rbp-98H], xmm4
and      rdi, 31
mov      ecx, 26
sub      rcx, rdi
add      rdi, -152
add      rdi, rbp
xor      rax, rax
rep stosd 

@benaadams (Member Author):

> What is the average code-size increase for the 128-byte cut-off (for the non-AVX case)?

Crossgen of corelib, which is non-AVX, is:

Total bytes of diff: 9569 (0.29% of base)
    diff is a regression.

Top file regressions (bytes):
    9569 : System.Private.CoreLib.dasm (0.29% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions (bytes):
    288 ( 1.80% of base) : ValueTuple`8:ToString():String:this (17 methods)
    288 ( 1.88% of base) : ValueTuple`8:System.IValueTupleInternal.ToStringEnd():String:this (17 methods)
    287 ( 9.15% of base) : ValueTuple`8:Equals(Object):bool:this (17 methods)
    287 ( 2.81% of base) : ValueTuple`8:System.Collections.IStructuralEquatable.Equals(Object,IEqualityComparer):bool:this (17 methods)
    287 ( 5.68% of base) : ValueTuple`8:System.IComparable.CompareTo(Object):int:this (17 methods)
    287 ( 2.38% of base) : ValueTuple`8:System.Collections.IStructuralComparable.CompareTo(Object,IComparer):int:this (17 methods)
    276 ( 2.54% of base) : EqualityComparer`1:System.Collections.IEqualityComparer.Equals(Object,Object):bool:this (42 methods)
    257 ( 2.64% of base) : Comparer`1:System.Collections.IComparer.Compare(Object,Object):int:this (34 methods)
    254 ( 1.37% of base) : ValueTuple`8:GetHashCode():int:this (17 methods)
    253 ( 3.70% of base) : TupleExtensions:ToValueTuple(Tuple`8):ValueTuple`8 (14 methods)
    224 ( 2.94% of base) : ValueTuple`8:Equals(ValueTuple`8):bool:this (17 methods)
    224 ( 2.94% of base) : ValueTuple`8:CompareTo(ValueTuple`8):int:this (17 methods)
    161 ( 7.44% of base) : TupleExtensions:CreateLong(__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`8):ValueTuple`8 (7 methods)
    161 (10.00% of base) : EqualityComparer`1:IndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    161 ( 9.88% of base) : EqualityComparer`1:LastIndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    130 ( 4.46% of base) : ValueTuple`8:System.Runtime.CompilerServices.ITuple.get_Item(int):Object:this (17 methods)
    130 ( 2.33% of base) : EqualityComparer`1:System.Collections.IEqualityComparer.GetHashCode(Object):int:this (42 methods)
    122 ( 7.89% of base) : GenericEqualityComparer`1:IndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    122 ( 7.79% of base) : GenericEqualityComparer`1:LastIndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    119 (12.00% of base) : ValueTuple`8:System.Runtime.CompilerServices.ITuple.get_Length():int:this (17 methods)
     77 ( 3.19% of base) : Array:<Sort>g__GenericSort|126_0(Array,Array,int,int) (11 methods)
     66 ( 0.84% of base) : ArraySortHelper`1:IntroSort(Span`1,int,Comparison`1) (15 methods)
     54 ( 0.66% of base) : Vector256`1:ToString():String:this (11 methods)
     46 ( 4.79% of base) : DateTimeParse:TryParseExactMultiple(ReadOnlySpan`1,ref,DateTimeFormatInfo,int,byref):bool (2 methods)
     36 ( 1.60% of base) : ArraySegment`1:GetEnumerator():Enumerator:this (18 methods)
								
Top method improvements (bytes):
     -2 (-0.28% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,String,ref,byref,ref,byref):int (2 methods)
     -2 (-0.29% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,String,ref,byref,byref,byref):int (2 methods)
     -2 (-0.05% of base) : Utf8Parser:TryParse(ReadOnlySpan`1,byref,byref,ushort):bool (16 methods)
     -1 (-0.54% of base) : ILStubClass:IL_STUB_PInvoke(byref,byref,byref):bool
     -1 (-0.26% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,int,ref,byref,long,ref,ref,ref):int
     -1 (-0.19% of base) : DateTimeFormat:Format(DateTime,String,IFormatProvider,TimeSpan):String
     -1 (-0.18% of base) : MemoryExtensions:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1,int):int
     -1 (-0.15% of base) : Utf8Formatter:TryFormatFloatingPoint(__Canon,Span`1,byref,StandardFormat):bool
     -1 (-0.15% of base) : ExecutionContext:OnValuesChanged(ExecutionContext,ExecutionContext)
     -1 (-0.19% of base) : Task:WaitAllBlockingCore(List`1,int,CancellationToken):bool
     -1 (-0.23% of base) : NativeOrStaticEventRegistrationImpl:GetEventRegistrationTokenTableInternal(Object,Action`1,byref,bool):ConditionalWeakTable`2
     -1 (-0.34% of base) : FileLoadException:FormatFileLoadExceptionMessage(String,int):String
     -1 (-0.17% of base) : <FinishWriteAsync>d__63:MoveNext():this
     -1 (-0.15% of base) : <DisposeAsyncCore>d__99:MoveNext():this
     -1 (-0.12% of base) : Path:Combine(ref):String
     -1 (-0.42% of base) : PathInternal:RemoveRelativeSegments(String,int):String
     -1 (-0.18% of base) : <DisposeAsyncCore>d__33:MoveNext():this
     -1 (-0.13% of base) : EventSource:EnsureDescriptorsInitialized():this
		  
468 total methods with Code Size differences (18 improved, 450 regressed), 19932 unchanged.

@jkotas (Member) commented Feb 18, 2020

> 0.29% of base

I would be ok to take this regression for this fix.

@benaadams (Member Author):

Haven't got it quite right...

exit code 139 means SIGSEGV Illegal memory access. 
Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.

@benaadams (Member Author) commented Feb 19, 2020

Perhaps not everything is fully 16-byte aligned; I'm not able to trigger this failure in a simple program yet.

@AndyAyersMS when you say:

> ABI guarantees that the stack has 16 byte alignment for x64, both on Windows and SysV, outside of the prolog and epilog, for non-leaf methods.

Is there a way to detect if it's a leaf method? Then I can have more variable realignment but keep it to leaf methods (i.e. not assume it's 16-byte aligned).

@benaadams (Member Author):

Think I can use this?

bool is16byteAligned = (compiler->compLclFrameSize % 16) == 0;

@AndyAyersMS (Member):

> Is there a way to detect if it's a leaf method?

For Windows x64, I think you can check if lvaOutgoingArgSpaceSize is zero. Not as sure about SysV, will have to do some digging.

You could also add explicit tracking in fgSimpleLowering.
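
For reference, the minimal form of that check, matching the snippet style above (field name as given in this comment; per the comment it may not be sufficient on SysV, so this is a heuristic sketch only):

// Heuristic only: a zero-sized outgoing-arg area suggests a leaf method on
// Windows x64, per the comment above; SysV still needs separate confirmation.
bool looksLikeLeaf = (compiler->lvaOutgoingArgSpaceSize == 0);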

benaadams force-pushed the rep-stosd branch 4 times, most recently from 5392cf1 to 4f5a326 on February 19, 2020 05:18
@benaadams (Member Author):

I've changed it to assume no alignment and to always fully align with 2x xmm at the start, for both ymm and rep stosd. ymm also finishes with 2x xmm for the last 32 bytes (and sizeof(long), sizeof(int) stores if required).

It does make it regress slightly further:

Total bytes of diff: 12221 (0.38% of base)
    diff is a regression.

Top file regressions (bytes):
    12221 : System.Private.CoreLib.dasm (0.38% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions (bytes):
    396 ( 2.47% of base) : ValueTuple`8:ToString():String:this (17 methods)
    396 ( 2.58% of base) : ValueTuple`8:System.IValueTupleInternal.ToStringEnd():String:this (17 methods)
    396 ( 3.64% of base) : EqualityComparer`1:System.Collections.IEqualityComparer.Equals(Object,Object):bool:this (42 methods)
    395 (12.60% of base) : ValueTuple`8:Equals(Object):bool:this (17 methods)
    395 ( 3.86% of base) : ValueTuple`8:System.Collections.IStructuralEquatable.Equals(Object,IEqualityComparer):bool:this (17 methods)
    395 ( 7.82% of base) : ValueTuple`8:System.IComparable.CompareTo(Object):int:this (17 methods)
    395 ( 3.28% of base) : ValueTuple`8:System.Collections.IStructuralComparable.CompareTo(Object,IComparer):int:this (17 methods)
    377 ( 3.87% of base) : Comparer`1:System.Collections.IComparer.Compare(Object,Object):int:this (34 methods)
    341 ( 4.99% of base) : TupleExtensions:ToValueTuple(Tuple`8):ValueTuple`8 (14 methods)
    331 ( 1.79% of base) : ValueTuple`8:GetHashCode():int:this (17 methods)
    320 ( 4.20% of base) : ValueTuple`8:Equals(ValueTuple`8):bool:this (17 methods)
    320 ( 4.20% of base) : ValueTuple`8:CompareTo(ValueTuple`8):int:this (17 methods)
    245 (11.32% of base) : TupleExtensions:CreateLong(__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`8):ValueTuple`8 (7 methods)
    245 (15.22% of base) : EqualityComparer`1:IndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    245 (15.03% of base) : EqualityComparer`1:LastIndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    154 ( 5.29% of base) : ValueTuple`8:System.Runtime.CompilerServices.ITuple.get_Item(int):Object:this (17 methods)
    154 ( 2.75% of base) : EqualityComparer`1:System.Collections.IEqualityComparer.GetHashCode(Object):int:this (42 methods)
    146 ( 9.44% of base) : GenericEqualityComparer`1:IndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    146 ( 9.32% of base) : GenericEqualityComparer`1:LastIndexOf(ref,ValueTuple`8,int,int):int:this (7 methods)
    143 (14.42% of base) : ValueTuple`8:System.Runtime.CompilerServices.ITuple.get_Length():int:this (17 methods)
     77 ( 3.19% of base) : Array:<Sort>g__GenericSort|126_0(Array,Array,int,int) (11 methods)
     70 ( 7.28% of base) : DateTimeParse:TryParseExactMultiple(ReadOnlySpan`1,ref,DateTimeFormatInfo,int,byref):bool (2 methods)
     66 ( 0.84% of base) : ArraySortHelper`1:IntroSort(Span`1,int,Comparison`1) (15 methods)
     54 ( 0.66% of base) : Vector256`1:ToString():String:this (11 methods)
     48 ( 2.36% of base) : DateTimeParse:TryParse(ReadOnlySpan`1,DateTimeFormatInfo,int,byref):bool (2 methods)

Top method improvements (bytes):
     -2 (-0.28% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,String,ref,byref,ref,byref):int (2 methods)
     -2 (-0.29% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,String,ref,byref,byref,byref):int (2 methods)
     -2 (-0.05% of base) : Utf8Parser:TryParse(ReadOnlySpan`1,byref,byref,ushort):bool (16 methods)
     -1 (-0.54% of base) : ILStubClass:IL_STUB_PInvoke(byref,byref,byref):bool
     -1 (-0.26% of base) : ILStubClass:IL_STUB_PInvoke(SafeRegistryHandle,int,ref,byref,long,ref,ref,ref):int
     -1 (-0.19% of base) : DateTimeFormat:Format(DateTime,String,IFormatProvider,TimeSpan):String
     -1 (-0.18% of base) : MemoryExtensions:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1,int):int
     -1 (-0.15% of base) : Utf8Formatter:TryFormatFloatingPoint(__Canon,Span`1,byref,StandardFormat):bool
     -1 (-0.15% of base) : ExecutionContext:OnValuesChanged(ExecutionContext,ExecutionContext)
     -1 (-0.19% of base) : Task:WaitAllBlockingCore(List`1,int,CancellationToken):bool
     -1 (-0.23% of base) : NativeOrStaticEventRegistrationImpl:GetEventRegistrationTokenTableInternal(Object,Action`1,byref,bool):ConditionalWeakTable`2
     -1 (-0.34% of base) : FileLoadException:FormatFileLoadExceptionMessage(String,int):String
     -1 (-0.17% of base) : <FinishWriteAsync>d__63:MoveNext():this
     -1 (-0.15% of base) : <DisposeAsyncCore>d__99:MoveNext():this
     -1 (-0.12% of base) : Path:Combine(ref):String
     -1 (-0.42% of base) : PathInternal:RemoveRelativeSegments(String,int):String
     -1 (-0.18% of base) : <DisposeAsyncCore>d__33:MoveNext():this
     -1 (-0.13% of base) : EventSource:EnsureDescriptorsInitialized():this

468 total methods with Code Size differences (18 improved, 450 regressed), 19932 unchanged.

@benaadams (Member Author):

Might wait to see if tests pass before breaking it down more 😅

@benaadams (Member Author):

> The simple thing would be to just use XMM for now, with some reasonable size cutoff. It would be a nice improvement without getting into extra complexity for over-aligned frames or portions thereof.

Decided to do that 😄 #32538

@benaadams (Member Author):

Closing in favor of #32538.

benaadams closed this Feb 19, 2020
@benaadams (Member Author):

Ah... might be an issue between bytes and dwords.
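
Possibly this refers to the aligned rep stosd sequence further up, where the byte offset (rdi &amp; 31) is subtracted straight from ecx even though ecx counts dwords; if so, the adjustment would need to be in dwords. My reading only, not confirmed in the thread, with illustrative variable names:

// Hypothetical illustration: the xmm head store covers `misalignBytes` bytes,
// but rep stosd's count is in DWORDs, so the count should shrink by
// misalignBytes / 4 rather than by misalignBytes.
unsigned misalignBytes = (unsigned)(blockStart & 31);
unsigned dwordCount    = zeroSizeBytes / 4;
dwordCount -= misalignBytes / 4;   // not: dwordCount -= misalignBytes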

benaadams reopened this Feb 19, 2020
@benaadams (Member Author):

Will do it in the other one.

benaadams closed this Feb 19, 2020
ghost locked as resolved and limited conversation to collaborators Dec 10, 2020