JIT emits unnecessary movsxd instructions when calling into Span indexer #12218

GrabYourPitchforks · 2019-03-08T03:28:24Z

When passing a non-constant value into the Span<T> and ReadOnlySpan<T> indexer, the JIT will emit an unnecessary movsxd instruction on x64. The repro is fairly simple:

for (int i = 0; i < ints.Length; i++)
{
    retVal += ints[i];
}

Current codegen:

00007ffd`2a2e7291 85c9            test    ecx,ecx
00007ffd`2a2e7293 7e0f            jle     <AFTER_LOOP>
00007ffd`2a2e7295 4d63d1          movsxd  r10,r9d
00007ffd`2a2e7298 42030492        add     eax,dword ptr [rdx+r10*4]
00007ffd`2a2e729c 41ffc1          inc     r9d
00007ffd`2a2e729f 443bc9          cmp     r9d,ecx
00007ffd`2a2e72a2 7cf1            jl      00007ffd`2a2e7295

I prototyped the change below in my local branch by modifying the logic in importer.cpp to use zero-extension instead of sign-extension for the span indexer, then ran a benchmark. The modified code took approximately one-third less time to run. This optimization may be worth investigating if we believe that developers are iterating over spans in hot loops. (Admittedly, any more complex logic within the loop would almost certainly overwhelm these benchmark results.)

            // Element access
            GenTree*             indexIntPtr = gtNewCastNode(TYP_U_IMPL, indexClone, true /* fromUnsigned */, TYP_U_IMPL);   // <-- modified line
            GenTree*             sizeofNode  = gtNewIconNode(elemSize);
            GenTree*             mulNode     = gtNewOperNode(GT_MUL, TYP_U_IMPL, indexIntPtr, sizeofNode);   // <-- modified line

Method	Toolchain	SpanLength	Mean	Error	StdDev	Ratio	RatioSD
SumInts	baseline	48	2,921.32 us	53.786 us	44.914 us	1.00	0.00
SumInts	modified	48	1,964.96 us	38.825 us	43.154 us	0.67	0.02

SumInts	baseline	512	35,429.46 us	574.800 us	537.669 us	1.00	0.00
SumInts	modified	512	23,219.06 us	457.335 us	698.398 us	0.67	0.02

SumInts	baseline	2048	139,664.62 us	1,799.241 us	1,683.011 us	1.00	0.00
SumInts	modified	2048	93,175.18 us	1,838.916 us	3,586.665 us	0.66	0.04
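
As a rough illustration of why the swap is safe for span indices, the C++ sketch below models the two widening operations and the unsigned bounds check. The helper names (`sign_extend`, `zero_extend`, `element_offset`) are hypothetical and are not taken from the JIT sources; the sketch only assumes that the index has already passed an unsigned compare against the length.

```cpp
#include <cassert>
#include <cstdint>

// Models movsxd: widen a 32-bit index to 64 bits by copying the sign bit.
constexpr std::uint64_t sign_extend(std::int32_t index) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(index));
}

// Models a 32-bit mov on x64: the upper 32 bits of the result are zero.
constexpr std::uint64_t zero_extend(std::int32_t index) {
    return static_cast<std::uint64_t>(static_cast<std::uint32_t>(index));
}

// Hypothetical model of a bounds-checked element offset computation.
// The unsigned compare rejects negative indices as well as indices >= length,
// so any index that gets past it lies in [0, length) and both extensions agree.
inline std::uint64_t element_offset(std::int32_t index, std::int32_t length,
                                    std::uint64_t elem_size) {
    assert(static_cast<std::uint32_t>(index) < static_cast<std::uint32_t>(length));
    return zero_extend(index) * elem_size;
}
```

The two extensions differ only for negative inputs, and those never reach the address computation once the bounds check has passed.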

/cc @dotnet/jit-contrib

category:cq
theme:basic-cq
skill-level:expert
cost:medium
impact:medium

gfoidl · 2019-03-08T10:10:11Z

The same movsxd is generated when indexing arrays.

The zero-extension could be omitted if the JIT recognized this pattern and operated on the index at native int size (i.e. ulong on x64, uint on 32-bit targets), emitting code such as the following (for x64):

G_M56184_IG03:
       03048A               add      eax, dword ptr [rdx+4*rdi]
       FFC7                 inc      rdi
       3BFE                 cmp      rdi, rsi
       7CF4                 jl       SHORT G_M56184_IG03
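
At source level, the idea of keeping the induction variable at native width can be sketched in C++ as follows. This is only an analogy under stated assumptions: the function names and the sample array are hypothetical, and a C++ compiler is free to make this transformation itself, so the two functions may well compile to the same machine code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sample data used only for illustration.
static const int kSampleInts[] = {1, 2, 3, 4};

// 32-bit induction variable: each access must widen i to pointer size.
inline int sum_narrow_index(const int* data, std::int32_t length) {
    assert(length >= 0);
    int total = 0;
    for (std::int32_t i = 0; i < length; i++) {
        total += data[i];
    }
    return total;
}

// Pointer-sized induction variable: no per-iteration widening is needed.
inline int sum_wide_index(const int* data, std::int32_t length) {
    assert(length >= 0);
    const std::size_t count = static_cast<std::size_t>(length);
    int total = 0;
    for (std::size_t i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

Both loops visit the same elements; the only difference is the width of the counter, which is what decides whether an extension instruction is needed inside the loop body.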

BruceForstall · 2019-03-08T16:58:01Z

@AndyAyersMS

AndyAyersMS · 2019-03-08T17:13:32Z

Range prop may be able to determine i cannot be negative (definitely for arrays, maybe for spans) and could possibly rewrite the in-loop uses that feed sign-extensions. Might get the simple cases.

Worth taking a quick look.
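
The decision described here can be sketched in a few lines of C++. This is a simplified model and not the JIT's actual range-check code: `ValueRange`, `Extension`, `loop_index_range`, and `choose_extension` are hypothetical names, and the loop is assumed to have the shape `for (i = 0; i < length; i++)` with at least one iteration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical closed interval [lo, hi] of values an index may take.
struct ValueRange {
    std::int32_t lo;
    std::int32_t hi;
};

enum class Extension { Sign, Zero };

// Range of i inside the body of `for (i = 0; i < length; i++)`.
inline ValueRange loop_index_range(std::int32_t length) {
    assert(length >= 1);  // the body only runs when length is positive
    return ValueRange{0, length - 1};
}

// If the index is provably non-negative, sign- and zero-extension produce the
// same 64-bit value, so the cheaper zero-extension can be used.
constexpr Extension choose_extension(ValueRange range) {
    return range.lo >= 0 ? Extension::Zero : Extension::Sign;
}
```

When the lower bound cannot be proven non-negative, the sketch falls back to sign-extension, which is always correct for a signed index.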

mikedn · 2019-03-08T18:21:32Z

@AndyAyersMS I tried that in the past but for some reason I gave up. I'll check what I did back then.

GrabYourPitchforks · 2019-03-08T18:39:44Z

@AndyAyersMS I didn't investigate anything quite as complex as flowing context that i must be non-negative or backing i with a native int to begin with. In the quick changes I experimented with, there was still a zero-extension (the movsxd r10, r9d instruction became mov r10d, r9d), and this was sufficient to get the performance gain mentioned. I suspect this was a zero-latency mov within the processor?

mikedn · 2019-03-08T18:42:17Z

> and this was sufficient to get the performance gain mentioned. I suspect this was a zero-latency mov within the processor?

Yes. I tried that in the past with array indices. It worked, but in some cases involving pointer arithmetic there was a regression because CSE no longer picked up the casts: those from array indices were zero-extending, while those from pointer arithmetic were sign-extending.

mikedn · 2019-03-08T19:07:48Z

@AndyAyersMS FWIW I have this in mikedn/coreclr@e0a6e91. It does seem to work (diff corelib -> -368 bytes), but I don't remember if the change was "done" or if there was some kind of problem that made me abandon it. It could be that it was just my general dislike of RangeCheck; I generally don't trust that code. I'll try to figure out what's up with it.

mikedn · 2019-03-09T21:13:09Z

Ran a diff on the entire FX (after patching a strange issue with missing assertions):

Total bytes of diff: -1414 (-0.01% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -365 : System.Private.CoreLib.dasm (-0.01% of base)
        -186 : System.Data.Common.dasm (-0.02% of base)
        -160 : System.Private.Xml.dasm (-0.01% of base)
         -87 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
         -82 : Microsoft.CSharp.dasm (-0.03% of base)
75 total files with size differences (75 improved, 0 regressed), 54 unchanged.
Top method improvements by size (bytes):
         -28 (-0.23% of base) : System.Private.CoreLib.dasm - Dictionary`2:OnDeserialization(ref):this (28 methods)
         -16 (-0.20% of base) : System.Data.Common.dasm - XmlTreeGen:SchemaTree(ref,ref,ref,ref,bool):this
         -15 (-0.47% of base) : System.Private.Xml.dasm - XmlSchemaValidator:EndElementIdentityConstraints(ref,ref,ref):this
         -12 (-7.14% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:DeduceDependencies():bool:this
         -12 (-0.79% of base) : System.Private.CoreLib.dasm - EnumeratorToIteratorAdapter`1:GetMany(ref):int:this (13 methods)
Top method improvements by size (percentage):
         -12 (-7.14% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:DeduceDependencies():bool:this
          -6 (-3.02% of base) : Microsoft.CSharp.dasm - MethodSymbol:InferenceMustFail():bool:this
          -6 (-2.67% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PointerTypeSymbol:Equals(ref,bool,bool):bool:this (2 methods)
          -5 (-2.42% of base) : Microsoft.CSharp.dasm - UserStringBuilder:ErrAppendTypeParameters(ref,ref):this
          -2 (-2.06% of base) : Microsoft.CSharp.dasm - MethodTypeInferrer:ExactTypeArgumentInference(ref,ref):this
870 total methods with size differences (870 improved, 0 regressed), 123816 unchanged.

Not a lot, as this saves at most one byte per cast (sometimes mov has one byte less than movsxd, sometimes they have the same size). I'd guesstimate that there are ~5000 casts that are transformed.

PIN data doesn't look too good, a 0.5% regression. Didn't check memory usage but I don't think it will be good either. I was afraid of that, RangeCheck is just inefficient.

AndyAyersMS · 2019-03-12T17:21:59Z

We're in the 3.0 endgame for the jit, and there doesn't seem to be any cheap / safe way to fix this, so marking as future.

GrabYourPitchforks · 2019-03-21T06:39:05Z

Related to dotnet/coreclr#21553.

GrabYourPitchforks · 2019-04-02T21:52:18Z

Related to https://github.com/dotnet/coreclr/issues/23666.

GrabYourPitchforks · 2020-10-18T18:50:39Z

@BruceForstall @AndyAyersMS Is this perhaps a dupe of #7312?

AndyAyersMS · 2020-10-19T16:33:50Z

Yes, it's a case where IV widening would help.

GrabYourPitchforks · 2020-10-19T19:25:07Z

@AndyAyersMS I'm still not sure if this is a true dupe of 7312 or a secondary issue. If it's a true dupe then please feel free to close this issue. :)

tannergooding · 2023-02-09T17:21:01Z

This was fixed with #81055

; Method Program:Test(System.Span`1[int]):int
G_M000_IG01:                ;; offset=0000H

G_M000_IG02:                ;; offset=0000H
       488B01               mov      rax, bword ptr [rcx]
       8B5108               mov      edx, dword ptr [rcx+08H]
       33C9                 xor      ecx, ecx
       4533C0               xor      r8d, r8d
       85D2                 test     edx, edx
       7E0F                 jle      SHORT G_M000_IG04
                            align    [0 bytes for IG03]

G_M000_IG03:                ;; offset=000FH
       458BC8               mov      r9d, r8d
       42030C88             add      ecx, dword ptr [rax+4*r9]
       41FFC0               inc      r8d
       443BC2               cmp      r8d, edx
       7CF1                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=001EH
       8BC1                 mov      eax, ecx

G_M000_IG05:                ;; offset=0020H
       C3                   ret

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

GrabYourPitchforks mentioned this issue Feb 26, 2020

[WIP] Improve performance of Utf8Parser.TryParseInt32D #32843

Closed

saucecontrol mentioned this issue Apr 30, 2020

Code inefficiencies in loop array indexing #35618

Closed

GrabYourPitchforks mentioned this issue Jul 6, 2020

[WIP] Improve performance of Utf8Parser.TryParseInt32D GrabYourPitchforks/runtime#9

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

gfoidl mentioned this issue Apr 11, 2021

Add internal String.{Try}CopyTo and use in a few places in corelib #51062

Merged

EgorBo mentioned this issue May 9, 2021

JIT: Optimize redundant sign extensions in indexers #52414

Closed

TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022

TIHan mentioned this issue Jan 28, 2023

JIT: inconsistent CQ for div/mod by power of 2 idioms #11442

Open

tannergooding closed this as completed Feb 9, 2023

ghost locked as resolved and limited conversation to collaborators Mar 11, 2023
