[WIP] Eliminate array/Span index sign-extension #21553
Conversation
cc @dotnet/jit-contrib
I haven't spent a lot of time looking at the importation and morphing of casts, but I don't think it's correct to change
However, I've always found the cast semantics a bit tricky, so it would be helpful for another JIT developer to weigh in here as well.
I tried the array change in the past (there was no span back then) and it's kind of problematic; in some cases it can cause a 10-20% regression in performance:

```csharp
int[] data = ...
int* pdata = ...
for (int i = 0; i < data.Length; i++)
    data[i] = pdata[i];
```

Pointer dereferencing generates its own index widening cast and that one can't be changed to unsigned so easily. If you change the array index cast then we'll have 2 different casts and CSE can't remove one of them anymore. You get:

```asm
4C63C9       movsxd   r9, ecx
468B0C8A     mov      r9d, dword ptr [rdx+4*r9]
448BD1       mov      r10d, ecx
46894C9010   mov      dword ptr [rax+4*r10+16], r9d
FFC1         inc      ecx
443BC1       cmp      r8d, ecx
7FEA         jg       SHORT G_M850_IG03
```

instead of:

```asm
G_M850_IG03:
4C63C9       movsxd   r9, ecx
468B148A     mov      r10d, dword ptr [rdx+4*r9]
4689548810   mov      dword ptr [rax+4*r9+16], r10d
FFC1         inc      ecx
443BC1       cmp      r8d, ecx
7FED         jg       SHORT G_M850_IG03
```

At least for loops, this should be handled either by doing induction variable widening or by recognizing that the induction variable is positive, so that all casts inside the loop can be changed to unsigned, not just those involved in array indexing. But the JIT's loop handling code is kind of messy and I never stared at it long enough to tell whether something like this would be easy to do or not. Outside of loops, assertion propagation and/or range check analysis could help with this as well. But those are also a bit clunky, especially the latter.
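To make the second option concrete — treating the induction variable as provably non-negative so every widening cast in the loop can become unsigned — here is a rough source-level analogue. This is my own sketch, not code from this PR, and whether the two now-identical zero-extensions actually get CSE'd is an assumption rather than a guarantee:

```csharp
// Sketch only: i starts at 0 and only increases, so it is provably non-negative.
// Casting it to uint at both uses makes the array index and the pointer offset
// widen the same way (zero-extension), instead of one signed and one unsigned cast.
static unsafe void Copy(int[] data, int* pdata)
{
    for (int i = 0; i < data.Length; i++)
        data[(uint)i] = pdata[(uint)i];
}
```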
What I'm actually trying to achieve is this:

```asm
G_M850_IG03:
4C63C9       movsxd   r9, ecx
468B148A     mov      r10d, dword ptr [rdx+4*r9]
4689548810   mov      dword ptr [rax+4*rcx+16], r10d
FFC1         inc      ecx
443BC1       cmp      r8d, ecx
7FED         jg       SHORT G_M850_IG03
```

Instead of using the CSE'd r9, it would use ecx directly in this case. In more common use cases where no pointers are involved, it would elide the sign/zero-extension entirely when the source register already contains a normalized value, or just do a mov when it needs normalizing. Maybe this is somewhat related to #12676? Although cast containment logically doesn't feel like the right place for this, it might be the only option? At that point it should be possible to detect whether the index is non-normalized (an enregistered parameter/return value) as well, right? Or maybe we could force normalization of such registers at entry/after-call if they are later directly used as an array/span index; this shouldn't be too costly and would allow us to always elide the cast/extension. This normalization can be done after LSRA, so all the information should be available. When I experimented by changing the last argument from
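As a minimal illustration of the non-normalized case mentioned above (my own example, the method name is made up): the index arrives as an argument, so nothing in the method body has cleared its upper 32 bits before it is used.

```csharp
// Sketch of the hazard: 'i' lives in a 64-bit register, but only its low 32 bits
// are meaningful on entry. Reusing that register directly as a 64-bit index,
// without a normalizing mov/zero-extension first, could pick up stale upper bits
// left there by the caller.
static int First(int[] a, int i) => a[i];
```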
If I understood you correctly, you're trying to take advantage of the fact that 32-bit instructions on 64-bit targets always zero out the upper 32 bits. Yes, that would be nice to do, but unfortunately it is not that simple:

I don't remember seeing native compilers doing such a thing. The typical approach is to widen the induction variable to 64 bits:

```asm
             movsxd   r8, r8d
G_M850_IG03:
             mov      r10d, dword ptr [rdx+4*rcx]
             mov      dword ptr [rax+4*rcx+16], r10d
             inc      rcx
             cmp      r8, rcx
             jg       SHORT G_M850_IG03
```

It's not ideal because 64-bit instructions can be larger than 32-bit instructions due to the REX prefix, but it's safer than relying on the implicit zero extension performed by 32-bit instructions.
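A rough source-level analogue of that widening, assuming a 64-bit counter maps onto the codegen shown above (my sketch; the actual transform would happen on the JIT's IR, not in C#):

```csharp
// Sketch: a 64-bit induction variable never needs re-extending at each use,
// at the price of 64-bit inc/cmp instructions (REX prefixes).
static unsafe void CopyWidened(int[] data, int* pdata)
{
    for (long i = 0; i < data.Length; i++)
        data[i] = pdata[i];
}
```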
Yes, I'd like to take advantage of the implicit zeroing of registers' upper 32 bits. Maybe a new flag indicating a normalization requirement would solve all the obstacles that you pointed out? Then, instead of widening, the JIT would only need to ensure that it doesn't forget to implicitly zero out the upper 32 bits (currently this only happens with skipped narrowing casts and, theoretically, with enregistered parameters/return values, both of which feel quite simple to solve). Widening is done in some places of CoreLib by hand using
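One hand-widening pattern along those lines — this snippet is illustrative and assumes the `Unsafe.Add(ref T, IntPtr)` overload; it is not quoted from CoreLib — hoists the widening into an explicit cast at the use site:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class Example
{
    // Illustrative hand widening: the (IntPtr) cast makes the offset arithmetic
    // native-sized in the source, so no per-iteration movsxd is implied for the index.
    static int Sum(ReadOnlySpan<int> span)
    {
        ref int start = ref MemoryMarshal.GetReference(span);
        int sum = 0;
        for (int i = 0; i < span.Length; i++)
            sum += Unsafe.Add(ref start, (IntPtr)i);
        return sum;
    }
}
```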
OK, that's something that alters the existing JIT IR semantics in a perhaps unusual manner - any
It's tempting. But I have a feeling that the JIT team won't be easily convinced to do this.
I was hoping to limit this only to bounds-checked array/span indexers. I don't know how realistic this plan is, but maybe I could do it by attaching a flag to the index tree and checking for it in a few places like narrowing-cast elimination, index-widening-cast elimination, and after LSRA to normalize parameter/return value registers?
Attaching a flag is trivial, but at the same time it doesn't do anything in itself. This is information that needs to be propagated up the tree (not so difficult) and, considering that the primary use case involves loops, across the SSA graph (not so easy/cheap). And then you have to deal with potential fallout from such information becoming stale during IR transforms. By the time you do that you'll likely discover that:
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
Given how stale this PR is, I'm going to close it. Feel free to re-open if work resumes, but it would be better to wait until the new repo opens and open a new PR.
An array bounds check guarantees that the index is non-negative, so there is no need for sign extension.
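To spell the claim out (my own illustration, not code the JIT emits): the range check is effectively a single unsigned comparison, so any index that survives it lies in [0, Length) and its sign extension is equivalent to a zero extension.

```csharp
// Sketch of what the bounds check already proves. The single unsigned compare
// rejects both negative indices and indices >= Length, so after it 'i' is known
// non-negative and widening it needs no movsxd.
static int Get(int[] a, int i)
{
    if ((uint)i >= (uint)a.Length)
        throw new IndexOutOfRangeException();
    return a[i]; // sign bit of 'i' is provably 0 here
}
```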
I tried to get the correct casting, but for the array/string case I'm only getting half of the gains (`movsxd` replaced with `mov`), and while the `Span` case looks more promising (`movsxd` is eliminated entirely), it's also a bit too eager and does this for non-normalized registers (parameters/function return values), so it's not entirely correct.

@dotnet/jit-contrib What would be the correct solution here?
Code size reduction is up to 15% in some methods and `int.Parse`, for example, is about 2-4% faster. Typical diffs look like this: