Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT emits unnecessary movsxd instructions when calling into Span indexer #12218

Closed
GrabYourPitchforks opened this issue Mar 8, 2019 · 15 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Mar 8, 2019

When passing a non-constant value into the Span<T> and ReadOnlySpan<T> indexer, the JIT will emit an unnecessary movsxd instruction on x64. The repro is fairly simple:

for (int i = 0; i < ints.Length; i++)
{
    retVal += ints[i];
}

Current codegen:

00007ffd`2a2e7291 85c9            test    ecx,ecx
00007ffd`2a2e7293 7e0f            jle     <AFTER_LOOP>
00007ffd`2a2e7295 4d63d1          movsxd  r10,r9d
00007ffd`2a2e7298 42030492        add     eax,dword ptr [rdx+r10*4]
00007ffd`2a2e729c 41ffc1          inc     r9d
00007ffd`2a2e729f 443bc9          cmp     r9d,ecx
00007ffd`2a2e72a2 7cf1            jl      00007ffd`2a2e7295

I prototyped the below change in my local branch by modifying the logic in importer.cpp to use zero-extension instead of signed-extension for the span indexer and ran a benchmark. The modified code took approximately one-third less time to run. This optimization may be worth investigating if we believe that developers are iterating over spans in hot loops. (Admittedly, any more complex logic within the loop would almost certainly overwhelm these benchmark results.)

            // Element access
            GenTree*             indexIntPtr = gtNewCastNode(TYP_U_IMPL, indexClone, true /* fromUnsigned */, TYP_U_IMPL);   // <-- modified line
            GenTree*             sizeofNode  = gtNewIconNode(elemSize);
            GenTree*             mulNode     = gtNewOperNode(GT_MUL, TYP_U_IMPL, indexIntPtr, sizeofNode);   // <-- modified line
Method Toolchain SpanLength Mean Error StdDev Ratio RatioSD
SumInts baseline 48 2,921.32 us 53.786 us 44.914 us 1.00 0.00
SumInts modified 48 1,964.96 us 38.825 us 43.154 us 0.67 0.02
SumInts baseline 512 35,429.46 us 574.800 us 537.669 us 1.00 0.00
SumInts modified 512 23,219.06 us 457.335 us 698.398 us 0.67 0.02
SumInts baseline 2048 139,664.62 us 1,799.241 us 1,683.011 us 1.00 0.00
SumInts modified 2048 93,175.18 us 1,838.916 us 3,586.665 us 0.66 0.04

/cc @dotnet/jit-contrib

category:cq
theme:basic-cq
skill-level:expert
cost:medium
impact:medium

@gfoidl
Copy link
Member

gfoidl commented Mar 8, 2019

The same movsxd is generated when indexing arrays.

The zero-extension could be omitted, if the JIT recognizes this pattern, and operate with the native int size (i.e. ulong on x64, uint), hence emitting code as follows (for x64):

G_M56184_IG03:
       03048A               add      eax, dword ptr [rdx+4*rdi]
       FFC7                 inc      rdi
       3BFE                 cmp      rdi, rsi
       7CF4                 jl       SHORT G_M56184_IG03

@BruceForstall
Copy link
Member

@AndyAyersMS

@AndyAyersMS
Copy link
Member

Range prop may be able to determine i cannot be negative (definitely for arrays, maybe for spans) and could possibly rewrite the in-loop uses that feed sign-extensions. Might get the simple cases.

Worth taking a quick look.

@mikedn
Copy link
Contributor

mikedn commented Mar 8, 2019

@AndyAyersMS I tried that in the past but for some reason I gave up. I'll check what I did back then.

@GrabYourPitchforks
Copy link
Member Author

@AndyAyersMS I didn't investigate anything quite as complex as flowing context that i must be non-negative or backing i with a native int to begin with. In the quick changes I experimented with, there was still a zero-extension (the movsxd r10, r9d instruction became mov r10d, r9d), and this was sufficient to get the performance gain mentioned. I suspect this was a zero-latency mov within the processor?

@mikedn
Copy link
Contributor

mikedn commented Mar 8, 2019

and this was sufficient to get the performance gain mentioned. I suspect this was a zero-latency mov within the processor?

Yes. I tried that in the past with array indices. It was working but in some cases involving pointer arithmetic there was a regression due to CSE no longer picking up the casts, those from array indices where zero extending and those from pointer arithmetic were sign extending.

@mikedn
Copy link
Contributor

mikedn commented Mar 8, 2019

@AndyAyersMS FWIW I have this in mikedn/coreclr@e0a6e91 It does seem to work (diff corelib -> -368 bytes) but I don't remember if the change was "done" or if there was some kind of problem that made me abandon it. It could be that it was just my general dislike of RangeCheck, I generally don't trust that code. I'll try to figure out what's up with it.

@mikedn
Copy link
Contributor

mikedn commented Mar 9, 2019

Ran a diff on the entire FX (after patching a strange issue with missing assertions):

Total bytes of diff: -1414 (-0.01% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -365 : System.Private.CoreLib.dasm (-0.01% of base)
        -186 : System.Data.Common.dasm (-0.02% of base)
        -160 : System.Private.Xml.dasm (-0.01% of base)
         -87 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
         -82 : Microsoft.CSharp.dasm (-0.03% of base)
75 total files with size differences (75 improved, 0 regressed), 54 unchanged.
Top method improvements by size (bytes):
         -28 (-0.23% of base) : System.Private.CoreLib.dasm - Dictionary`2:OnDeserialization(ref):this (28 methods)
         -16 (-0.20% of base) : System.Data.Common.dasm - XmlTreeGen:SchemaTree(ref,ref,ref,ref,bool):this
         -15 (-0.47% of base) : System.Private.Xml.dasm - XmlSchemaValidator:EndElementIdentityConstraints(ref,ref,ref):this
         -12 (-7.14% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:DeduceDependencies():bool:this
         -12 (-0.79% of base) : System.Private.CoreLib.dasm - EnumeratorToIteratorAdapter`1:GetMany(ref):int:this (13 methods)
Top method improvements by size (percentage):
         -12 (-7.14% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:DeduceDependencies():bool:this
          -6 (-3.02% of base) : Microsoft.CSharp.dasm - MethodSymbol:InferenceMustFail():bool:this
          -6 (-2.67% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PointerTypeSymbol:Equals(ref,bool,bool):bool:this (2 methods)
          -5 (-2.42% of base) : Microsoft.CSharp.dasm - UserStringBuilder:ErrAppendTypeParameters(ref,ref):this
          -2 (-2.06% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:ExactTypeArgumentInference(ref,ref):this
870 total methods with size differences (870 improved, 0 regressed), 123816 unchanged.

Not a lot, as this saves at most one byte per cast (sometimes mov has one byte less than movsxd, sometimes they have the same size). I'd guesstimate that there are ~5000 casts that are transformed.

PIN data doesn't look too good, a 0.5% regression. Didn't check memory usage but I don't think it will be good either. I was afraid of that, RangeCheck is just inefficient.

@AndyAyersMS
Copy link
Member

We're in the 3.0 endgame for the jit, and there doesn't seem to be any cheap / safe way to fix this, so marking as future.

@GrabYourPitchforks
Copy link
Member Author

Related to dotnet/coreclr#21553.

@GrabYourPitchforks
Copy link
Member Author

@GrabYourPitchforks
Copy link
Member Author

@BruceForstall @AndyAyersMS Is this perhaps a dupe of #7312?

@AndyAyersMS
Copy link
Member

Yes, it's a case where IV widening would help.

@GrabYourPitchforks
Copy link
Member Author

@AndyAyersMS I'm still not sure if this is a true dupe of 7312 or a secondary issue. If it's a true dupe then please feel free to close this issue. :)

@tannergooding
Copy link
Member

This was fixed with #81055

; Method Program:Test(System.Span`1[int]):int
G_M000_IG01:                ;; offset=0000H

G_M000_IG02:                ;; offset=0000H
       488B01               mov      rax, bword ptr [rcx]
       8B5108               mov      edx, dword ptr [rcx+08H]
       33C9                 xor      ecx, ecx
       4533C0               xor      r8d, r8d
       85D2                 test     edx, edx
       7E0F                 jle      SHORT G_M000_IG04
                            align    [0 bytes for IG03]

G_M000_IG03:                ;; offset=000FH
       458BC8               mov      r9d, r8d
       42030C88             add      ecx, dword ptr [rax+4*r9]
       41FFC0               inc      r8d
       443BC2               cmp      r8d, edx
       7CF1                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=001EH
       8BC1                 mov      eax, ecx

G_M000_IG05:                ;; offset=0020H
       C3                   ret     

@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Projects
None yet
Development

No branches or pull requests

8 participants