Sse2 Intrinsics and Vector<T> paths no longer being output at R2R #33544

Closed
benaadams opened this issue Mar 13, 2020 · 6 comments
Labels
area-ReadyToRun-coreclr untriaged New issue has not been triaged by the area owner

Comments

@benaadams
Member

Using jit-diff to produce the output

jit-diff diff --output d:\diffs
-t DIFF -d D:\GitHub\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release-diff
--core_root D:\GitHub\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release-diff -c

SpanHelpers.IndexOf(ref byte searchSpace, byte value, int length) doesn't look like it gets output at all.

The workhorse SpanHelpers.SequenceEqual(ref byte first, ref byte second, nuint length) doesn't output any vectorized asm

; Assembly listing for method SpanHelpers:SequenceEqual(byref,byref,int):bool
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ReadyToRun compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T03] ( 16, 39   )   byref  ->  rcx        
;  ...
;  V96 cse6         [V96,T27] (  3, 12   )    long  ->  r10         "CSE - moderate"
;
; Lcl frame size = 0

G_M12780_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M12780_IG02:
       cmp      rcx, rdx
       je       G_M12780_IG09
						;; bbWeight=1    PerfScore 1.25
G_M12780_IG03:
       xor      rax, rax
       cmp      r8d, 8
       jl       G_M12780_IG06
						;; bbWeight=0.50 PerfScore 0.75
G_M12780_IG04:
       add      r8d, -8
       movzx    r9, word  ptr [rcx+2*rax]
       movzx    r10, word  ptr [rdx+2*rax]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+1]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+2]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+3]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+4]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+5]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+6]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
						;; bbWeight=4    PerfScore 204.00
G_M12780_IG05:
       je       G_M12780_IG11
       lea      r10, [rax+7]
       mov      r9, r10
       movzx    r9, word  ptr [rcx+2*r9]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       add      rax, 8
       cmp      r8d, 8
       jge      G_M12780_IG04
						;; bbWeight=4    PerfScore 40.00
G_M12780_IG06:
       cmp      r8d, 4
       jl       G_M12780_IG08
       add      r8d, -4
       movzx    r9, word  ptr [rcx+2*rax]
       movzx    r10, word  ptr [rdx+2*rax]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       G_M12780_IG11
       lea      r10, [rax+1]
       movzx    r9, word  ptr [rcx+2*r10]
       lea      r10, [rax+1]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       SHORT G_M12780_IG11
       lea      r10, [rax+2]
       movzx    r9, word  ptr [rcx+2*r10]
       lea      r10, [rax+2]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       SHORT G_M12780_IG11
       lea      r10, [rax+3]
       movzx    r9, word  ptr [rcx+2*r10]
       lea      r10, [rax+3]
       movzx    r10, word  ptr [rdx+2*r10]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       SHORT G_M12780_IG11
       add      rax, 4
       test     r8d, r8d
       jle      SHORT G_M12780_IG09
						;; bbWeight=0.50 PerfScore 16.50
G_M12780_IG07:
       movzx    r9, word  ptr [rcx+2*rax]
       movzx    r10, word  ptr [rdx+2*rax]
       cmp      r9d, r10d
       sete     r9b
       movzx    r9, r9b
       test     r9d, r9d
       je       SHORT G_M12780_IG11
       inc      rax
       dec      r8d
						;; bbWeight=2    PerfScore 14.50
G_M12780_IG08:
       test     r8d, r8d
       jg       SHORT G_M12780_IG07
						;; bbWeight=4    PerfScore 5.00
G_M12780_IG09:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.13
G_M12780_IG10:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M12780_IG11:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.13
G_M12780_IG12:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

; Total bytes of code 529, prolog size 0, PerfScore 336.15, (MethodHash=9c43ce13) for method SpanHelpers:SequenceEqual(byref,byref,int):bool

SpanHelpers.IndexOf(ref char searchSpace, char value, int length) no longer outputs vectorized asm

; Assembly listing for method SpanHelpers:IndexOf(byref,ushort,int):int
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ReadyToRun compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] ( 15, 38   )   byref  ->  rcx        
;  ...
;  V92 tmp86        [V92,T20] (  9,  9   )     int  ->  rax         "Single return block return value"
;
; Lcl frame size = 0

G_M36664_IG01:
       mov      dword ptr [rsp+10H], edx
						;; bbWeight=1    PerfScore 1.00
G_M36664_IG02:
       xor      rdx, rdx
       cmp      r8d, 8
       jl       G_M36664_IG04
						;; bbWeight=1    PerfScore 1.50
G_M36664_IG03:
       add      r8d, -8
       movzx    rax, word  ptr [rcx+2*rdx]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG09
       lea      rax, [rdx+1]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG10
       lea      rax, [rdx+2]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG11
       lea      rax, [rdx+3]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG12
       lea      rax, [rdx+4]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG13
       lea      rax, [rdx+5]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG14
       lea      rax, [rdx+6]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG15
       lea      rax, [rdx+7]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       G_M36664_IG16
       add      rdx, 8
       cmp      r8d, 8
       jge      G_M36664_IG03
						;; bbWeight=4    PerfScore 157.00
G_M36664_IG04:
       cmp      r8d, 4
       jl       SHORT G_M36664_IG06
       add      r8d, -4
       movzx    rax, word  ptr [rcx+2*rdx]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       SHORT G_M36664_IG09
       lea      rax, [rdx+1]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       SHORT G_M36664_IG10
       lea      rax, [rdx+2]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       SHORT G_M36664_IG11
       lea      rax, [rdx+3]
       movzx    rax, word  ptr [rcx+2*rax]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       SHORT G_M36664_IG12
       add      rdx, 4
       test     r8d, r8d
       jle      SHORT G_M36664_IG07
						;; bbWeight=0.50 PerfScore 10.75
G_M36664_IG05:
       movzx    rax, word  ptr [rcx+2*rdx]
       movzx    r9, word  ptr [rsp+10H]
       cmp      r9d, eax
       je       SHORT G_M36664_IG09
       inc      rdx
       dec      r8d
						;; bbWeight=2    PerfScore 9.50
G_M36664_IG06:
       test     r8d, r8d
       jg       SHORT G_M36664_IG05
						;; bbWeight=4    PerfScore 5.00
G_M36664_IG07:
       mov      eax, -1
						;; bbWeight=0.50 PerfScore 0.13
G_M36664_IG08:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M36664_IG09:
       mov      eax, edx
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.13
G_M36664_IG10:
       lea      rax, [rdx+1]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG11:
       lea      rax, [rdx+2]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG12:
       lea      rax, [rdx+3]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG13:
       lea      rax, [rdx+4]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG14:
       lea      rax, [rdx+5]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG15:
       lea      rax, [rdx+6]
       jmp      SHORT G_M36664_IG17
						;; bbWeight=0.50 PerfScore 1.25
G_M36664_IG16:
       lea      rax, [rdx+7]
						;; bbWeight=0.50 PerfScore 0.25
G_M36664_IG17:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

; Total bytes of code 382, prolog size 4, PerfScore 232.95, (MethodHash=db3370c7) for method SpanHelpers:IndexOf(byref,ushort,int):int

/cc @jkotas @tannergooding

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-ReadyToRun-coreclr untriaged New issue has not been triaged by the area owner labels Mar 13, 2020
@jkotas
Member

jkotas commented Mar 13, 2020

@davidwrighton Is this related to your recent change (#33090) ?

@davidwrighton
Member

There is a fair bit of confusion here. The Vector<T> code is never compiled with crossgen1, and the vectorized forms of the IndexOf and SequenceEqual methods use Vector<T>. However, the generic variants of these methods do not use vector instructions. One less-than-ideal aspect of jit-diff is that it does not make clear when it is working with a generic method. This is my analysis of what's happening here.

SequenceEqual(ref byte, ref byte, nuint) is marked with AggressiveOptimization, which prohibits generation of R2R code, so it should not be expected to be generated.

SequenceEqual(ref T, ref T, int) ISN'T marked with AggressiveOptimization, and is generated but without any vectorized logic. This isn't surprising, as it isn't vectorized.

IndexOf(ref T, T, int) ISN'T marked with AggressiveOptimization, and is generated but without any vectorized logic. This isn't surprising, as it isn't vectorized.
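
To make the distinction concrete, here is a minimal sketch of the two method shapes involved. It is not the CoreLib source: the class name `SpanHelpersSketch` and the simple loop bodies are assumptions for illustration, and only the signatures and the attribute placement mirror the analysis above. The 2-byte loads in the listings (`word ptr [rcx+2*rax]`) suggest that jit-diff was showing a `char` instantiation of the generic overload rather than the `byte` overload.

```csharp
using System;
using System.Runtime.CompilerServices;

static class SpanHelpersSketch
{
    // Specialized overload. AggressiveOptimization opts the method out of
    // R2R pregeneration, so it is only ever compiled by the JIT at run time.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static bool SequenceEqual(ref byte first, ref byte second, nuint length)
    {
        // The real method is vectorized; a scalar loop stands in for it here.
        for (nuint i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref first, (nint)i) != Unsafe.Add(ref second, (nint)i))
                return false;
        }
        return true;
    }

    // Generic overload. No attribute, so it can be pregenerated, but the body
    // is scalar, which is why the R2R listing contains no vector instructions.
    public static bool SequenceEqual<T>(ref T first, ref T second, int length)
        where T : IEquatable<T>
    {
        for (int i = 0; i < length; i++)
        {
            if (!Unsafe.Add(ref first, i).Equals(Unsafe.Add(ref second, i)))
                return false;
        }
        return true;
    }

    static void Main()
    {
        byte[] a = { 1, 2, 3 }, b = { 1, 2, 3 };
        char[] c = "abc".ToCharArray(), d = "abd".ToCharArray();

        // Binds to the byte overload because the length is a nuint.
        Console.WriteLine(SequenceEqual(ref a[0], ref b[0], (nuint)a.Length));
        // Binds to the generic overload with T = char.
        Console.WriteLine(SequenceEqual(ref c[0], ref d[0], c.Length));
    }
}
```

The two calls in `Main` differ only in element type and length type, which is enough to select different overloads; the method names alone do not tell you which one a listing refers to.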

@benaadams
Member Author

SequenceEqual(ref byte, ref byte, nuint) is marked with AggressiveOptimization, which prohibits generation of R2R code, so it should not be expected to be generated.

So it might be worth removing AggressiveOptimization from the versions that have Sse2 variants, so that they are emitted and pregenerated? AggressiveOptimization was added because not having the inlines etc. makes these methods very poor.

However, that would also hit arm as it wouldn't have a R2R intrinsic variant (at this stage); so perhaps #if the attribute?

@davidwrighton
Member

Unfortunately, that won't work either. The vectorized SequenceEqual(ref byte, ref byte, nuint) uses Vector<T>, and as the size of Vector<T> can vary from run to run, the current rules for crossgen prohibit its usage in R2R code at all, even within a single function body. I'm in the process of building out support for specifying the available instruction set at AOT compile time into crossgen2 (see #33274), but even there it won't have support for Vector<T> unless the code is opted into supporting Avx2.
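
The point about Vector<T> varying in size can be seen with a small sketch. The class name `VectorWidthSketch` is an assumption for illustration; the properties it reads are standard .NET APIs. `Vector<byte>.Count` is decided by the runtime for the machine it is running on (typically 16 when only SSE2 is used and 32 when AVX2 is used), whereas `Vector128<byte>.Count` is fixed by the type.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class VectorWidthSketch
{
    static void Main()
    {
        // Decided at run time, so an ahead-of-time compiler cannot assume it.
        Console.WriteLine($"Vector.IsHardwareAccelerated = {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Vector<byte>.Count           = {Vector<byte>.Count}");

        // Fixed by the type itself: always 16 bytes.
        Console.WriteLine($"Vector128<byte>.Count        = {Vector128<byte>.Count}");
    }
}
```

Code written against a fixed-width type does not have this run-to-run variability, which is the property that matters for pregenerated code.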

We may consider doing future work involving generation of multiple code bodies to cover that case, but that will require significant further engineering work which isn't planned in the .NET 5.0 timeframe. Alternatively, we may raise the baseline CPU in some scenarios to Avx2 which is technically easier, but has its own set of concerns. I consider #33274 to be a building block upon which we will build tech that actually solves these sorts of problems for our customer base, but for this sort of issue it only can be directly used to solve a narrow subset of intrinsics usage problems in AOT code.

@benaadams
Member Author

The vectorized SequenceEqual(ref byte, ref byte, nuint) uses Vector<T>, and as the size of Vector<T> can vary from run to run, the current rules for crossgen prohibit its usage in R2R code at all, even within a single function body.

I have a PR to use intrinsics for SequenceEqual #32371 and it looks to output Sse2 at R2R. I was missing that AggressiveOptimization blocked it.
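
As a rough illustration of the approach (this is a sketch under my own assumptions, not the code in #32371), a byte comparison written against the fixed-width Sse2 intrinsics, without AggressiveOptimization, could look like the following. The class name `Sse2SequenceEqualSketch` and the loop structure are illustrative.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Sse2SequenceEqualSketch
{
    // Deliberately no AggressiveOptimization attribute on this method.
    public static bool SequenceEqual(ref byte first, ref byte second, nuint length)
    {
        nuint i = 0;

        if (Sse2.IsSupported && length >= 16)
        {
            nuint lastVectorStart = length - 16;
            while (i <= lastVectorStart)
            {
                Vector128<byte> x = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref first, (nint)i));
                Vector128<byte> y = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref second, (nint)i));

                // CompareEqual sets each matching byte to 0xFF; MoveMask packs
                // the 16 top bits into an int, so 0xFFFF means all bytes match.
                if (Sse2.MoveMask(Sse2.CompareEqual(x, y)) != 0xFFFF)
                    return false;

                i += 16;
            }
        }

        // Scalar tail; also the whole path when Sse2 is unavailable.
        for (; i < length; i++)
        {
            if (Unsafe.Add(ref first, (nint)i) != Unsafe.Add(ref second, (nint)i))
                return false;
        }
        return true;
    }

    static void Main()
    {
        byte[] a = new byte[40];
        byte[] b = new byte[40];
        Console.WriteLine(SequenceEqual(ref a[0], ref b[0], (nuint)a.Length));
        b[37] = 1; // difference in the scalar tail
        Console.WriteLine(SequenceEqual(ref a[0], ref b[0], (nuint)a.Length));
    }
}
```

Because the vector width here is fixed at 128 bits, this does not hit the Vector<T> restriction quoted above.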

However, I'm not sure of what define to use to only drop it for x86/x64 and not ARM (else ARM will get the Tier0 for the method; which would be unfortunate). TARGET_XARCH seems to only apply for the cpp?
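
For reference, the shape being asked about would be something like the sketch below. `SKETCH_TARGET_XARCH` is a hypothetical symbol made up for this example, not a define I know to exist in the build; the class name and the scalar body are likewise only illustrative. The reflection check in `Main` simply reports whether the attribute ended up on the method.

```csharp
// SKETCH_TARGET_XARCH is a hypothetical symbol for this sketch only.
// Uncomment the next line to simulate an x86/x64 build.
// #define SKETCH_TARGET_XARCH
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class ConditionalAttributeSketch
{
    // Keep AggressiveOptimization everywhere except the (hypothetical)
    // x86/x64 build, where the method should be eligible for R2R instead.
#if !SKETCH_TARGET_XARCH
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
#endif
    public static bool SequenceEqual(ref byte first, ref byte second, nuint length)
    {
        // Scalar stand-in body; the real method is vectorized.
        for (nuint i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref first, (nint)i) != Unsafe.Add(ref second, (nint)i))
                return false;
        }
        return true;
    }

    public static bool HasAggressiveOptimization()
    {
        MethodInfo method = typeof(ConditionalAttributeSketch).GetMethod(nameof(SequenceEqual));
        return (method.MethodImplementationFlags & MethodImplAttributes.AggressiveOptimization) != 0;
    }

    static void Main()
    {
        Console.WriteLine($"AggressiveOptimization applied: {HasAggressiveOptimization()}");
    }
}
```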

@benaadams
Member Author

Anyway, removing AggressiveOptimization solves the issue I was seeing. Thank you

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020