This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Eliminate array/Span index sign-extension #21553

Closed
wants to merge 2 commits

Conversation

@pentp commented Dec 14, 2018

The array bounds check guarantees that the index is non-negative, so there is no need for sign extension.

I tried to get the casting right, but for the array/string case I'm only getting half of the gains (movsxd replaced with mov). The Span case looks more promising (movsxd is eliminated entirely), but it's also a bit too eager and does this for non-normalized registers (parameters/function return values), so it's not entirely correct.

@dotnet/jit-contrib What would be the correct solution here?

Code size reduction is up to 15% in some methods, and int.Parse, for example, is about 2-4% faster.

Typical diffs look like this:

        cmp      r10d, 103
        jae      G_M36150_IG23
-       movsxd   rax, r10d
-       mov      r11, 0xD1FFAB1E
-       cmp      byte  ptr [rax+r11], 255
+       mov      rax, 0xD1FFAB1E
+       cmp      byte  ptr [r10+rax], 255

        cmp      eax, dword ptr [rsi+8]
        jae      SHORT G_M26958_IG04
-       movsxd   rax, eax
+       mov      eax, eax
        mov      eax, dword ptr [rsi+4*rax+16]
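For context, a minimal C# sketch (hypothetical code, not taken from this PR) of the kind of bounds-checked load behind the second diff above:

static class Example
{
    // The bounds check in the second diff (cmp eax, dword ptr [rsi+8]; jae ...)
    // already proves that the index is non-negative, so the movsxd before the
    // element load is redundant.
    static int LoadElement(int[] data, int i) => data[i];
}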

@AaronRobinsonMSFT (Member)

cc @dotnet/jit-contrib

@CarolEidt

I haven't spent a lot of time looking at the importation and morphing of casts, but I don't think it's correct to change Compiler::impImplicitIorI4Cast. I think that it would be better to create the cast directly in impIntrinsic, the way you're doing it in fgMorphArrayIndex, e.g.:

GenTree*             indexIntPtr = gtNewCastNode(TYP_I_IMPL, tree, true, TYP_U_IMPL);

However, I've always found the cast semantics a bit tricky, so it would be helpful for another JIT developer to weigh in here as well.

@mikedn commented Dec 15, 2018

I tried the array change in the past (there was no span back then) and it's kind of problematic; in some cases it can cause a 10-20% performance regression:

int[] data = ...
int* pdata = ...
for (int i = 0; i < data.Length; i++)
    data[i] = pdata[i];

Pointer dereferencing generates its own index widening cast, and that one can't be changed to unsigned so easily. If you change the array index cast, then there will be 2 different casts and CSE can no longer remove one of them. You get:

       4C63C9               movsxd   r9, ecx
       468B0C8A             mov      r9d, dword ptr [rdx+4*r9]
       448BD1               mov      r10d, ecx
       46894C9010           mov      dword ptr [rax+4*r10+16], r9d
       FFC1                 inc      ecx
       443BC1               cmp      r8d, ecx
       7FEA                 jg       SHORT G_M850_IG03

instead of

G_M850_IG03:
       4C63C9               movsxd   r9, ecx
       468B148A             mov      r10d, dword ptr [rdx+4*r9]
       4689548810           mov      dword ptr [rax+4*r9+16], r10d
       FFC1                 inc      ecx
       443BC1               cmp      r8d, ecx
       7FED                 jg       SHORT G_M850_IG03

At least for loops, this should be handled either by doing induction variable widening or by recognizing that the induction variable is positive, so that all casts inside the loop can be changed to unsigned, not just those involved in array indexing. But the JIT's loop handling code is kind of messy and I never stared at it long enough to tell whether something like this would be easy to do or not.

Outside of loops, assertion propagation and/or range check could help with this as well. But those are also a bit clunky, especially the latter.

@pentp (Author) commented Dec 16, 2018

What I'm actually trying to achieve is this:

G_M850_IG03:
       4C63C9               movsxd   r9, ecx
       468B148A             mov      r10d, dword ptr [rdx+4*r9]
       4689548810           mov      dword ptr [rax+4*rcx+16], r10d
       FFC1                 inc      ecx
       443BC1               cmp      r8d, ecx
       7FED                 jg       SHORT G_M850_IG03

Instead of using the CSE'd r9, it would use ecx directly in this case. In the more common cases where no pointers are used, it would elide the sign/zero extension entirely when the source register already contains a normalized value, or just emit a mov when it needs normalizing.

Maybe this is somewhat related to #12676? Although cast containment logically doesn't feel like the right place for this, it might be the only option? At that point it should be possible to also detect whether the index is non-normalized (an enregistered parameter/return value), right? Or maybe we could force normalization of such registers at entry/after a call if they are later used directly as an array/span index, since this shouldn't be too costly and would allow us to always elide the cast/extension? This normalization can be done after LSRA, so all the information should be available.

When I experimented by changing the last argument from TYP_U_IMPL to TYP_UINT, it actually worked for spans (although it didn't respect normalization and a bunch of asserts failed), but I guess this isn't even close to the correct solution...

@mikedn commented Dec 16, 2018

If I understood you correctly, you're trying to take advantage of the fact that 32 bit instructions on 64 bit always zero out the upper 32 bits.

Yes, that would be nice to do but unfortunately it is not that simple:

  • you need to ensure that the JIT codegen does not intentionally or accidentally emit a 64 bit instruction when a 32 bit instruction is expected (e.g. add rcx, rdx instead of add ecx, edx)
  • you need to ensure that long to int conversions do actually generate code (e.g. mov eax, eax) rather than assuming that the consumer only cares about the lower 32 bits and skipping codegen (see the sketch after this list)
  • deal with int returns from calls - while you could probably make the JIT ensure that the upper 32 bits are always zeroed, the same is not true of native compilers
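
A hedged C# sketch of the second point (a hypothetical example, not from this PR):

static class CastExample
{
    // Under the proposed scheme the narrowing cast could not be skipped: it
    // must emit something like 'mov eax, eax' so that the resulting 32 bit
    // value is zero extended, because the upper bits of 'i' can be arbitrary.
    static int LoadAt(int[] data, long i) => data[(int)i];
}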

I don't remember seeing native compilers doing such a thing. The typical approach is to widen the induction variable to long. That would give you:

       movsxd   r8, r8d
G_M850_IG03:
       mov      r10d, dword ptr [rdx+4*rcx]
       mov      dword ptr [rax+4*rcx+16], r10d
       inc      rcx
       cmp      r8, rcx
       jg       SHORT G_M850_IG03

It's not ideal because 64 bit instructions can be larger than 32 bit instructions due to the REX prefix. But it's safer than relying on the implicit zero extension performed by 32 bit instructions.
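
In C# terms, the widened-induction-variable version of the earlier array/pointer loop would look roughly like this (a sketch under the same assumptions as the earlier snippet, not actual compiler output):

static unsafe class LoopExample
{
    // A single 64 bit induction variable serves both the array access and the
    // pointer dereference, so no per-iteration sign extension is needed.
    static void Copy(int[] data, int* pdata)
    {
        for (long i = 0; i < data.Length; i++)
            data[i] = pdata[i];
    }
}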

@pentp (Author) commented Dec 16, 2018

Yes, I'd like to take advantage of the implicit zeroing of registers' upper 32 bits. Maybe a new flag indicating a normalization requirement would solve all three obstacles that you pointed out? Then, instead of widening, the JIT would only need to ensure that it doesn't forget to zero out the upper 32 bits (currently the only cases where that is missed are skipped narrowing casts and, theoretically, enregistered parameters/return values, both of which feel quite simple to solve).

Widening is done by hand in some places in CoreLib using nuint, but it feels a bit too heavy-handed as a general JIT solution because of the code size issues.
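
For reference, the hand-widening pattern looks roughly like this (a sketch, not actual CoreLib code):

static class NuintExample
{
    // The index is kept pointer-sized and unsigned from the start, so the
    // address computation needs no extra sign/zero extension, at the cost of
    // 64 bit compare/increment instructions in the loop (on 64 bit targets).
    static int Sum(int[] data)
    {
        int sum = 0;
        for (nuint i = 0; i < (nuint)data.Length; i++)
            sum += data[i];
        return sum;
    }
}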

@mikedn commented Dec 17, 2018

Yes, I'd like to take advantage of the implicit zeroing of registers' upper 32 bits

OK, that's something that alters the existing JIT IR semantics in a perhaps unusual manner - any TYP_INT node is required to produce a 32 bit value that's zero extended to 64 bit (on 64 bit targets, of course). Basically, the JIT IR must copy the x64/arm64 32 bit instruction semantics.

It's tempting. But I have a feeling that the JIT team won't be easily convinced to do this.

@pentp (Author) commented Dec 17, 2018

I was hoping to limit this only to bounds-checked array/span indexers. I don't know how realistic this plan is, but maybe I could do it by attaching a flag to the index tree and checking for it in a few places: narrowing-cast elimination, index-widening-cast elimination, and after LSRA (to normalize parameter/return value registers)?

@mikedn commented Dec 18, 2018

maybe I could do it by attaching a flag

Attaching a flag is trivial, but at the same time it doesn't do anything in itself. This is information that needs to be propagated up the tree (not so difficult) and, considering that the primary use case involves loops, across the SSA graph (not so easy/cheap). And then you'd have to deal with potential fallout from such information becoming stale during IR transforms.

By the time you do that you'll likely discover that:

  • You might as well do loop induction variable widening.
  • Or not add a flag at all, but simply require that all TYP_INT nodes produce zero extended 64 bit values and deal with the few cases that may require explicit zero extension (calls to native code and narrowing casts are probably the only ones). Well, and somehow ensure that existing codegen doesn't mess up.

@maryamariyan (Member)

Thank you for your contribution. As announced in dotnet/coreclr#27549, this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after that date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create a patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@BruceForstall (Member)

Given how stale this PR is, I'm going to close it. Feel free to re-open if work resumes, but it would be better to wait until the new repo opens, and open a new PR.
