Do not run stack level setter in release except for x86. #42197

Closed


@sandreenko sandreenko commented Sep 14, 2020

It saves ~0.20% of the Jit time.

This change produces no diffs on arm/arm64 because we always save the frame pointer there, and it does not affect x86.

On x64 the change gives both size improvements (we no longer push and pop rbp, and rbp becomes available to LSRA) and size regressions (we often encode a bigger immediate). For example:

-000013 lea      rbp, [rsp+C0H]
-000104 mov      dword ptr [V04 rbp+30H], r15d
-000108 mov      dword ptr [V06 rbp+40H], r12d
-00010C mov      dword ptr [V07 rbp+48H], r13d
-
+000102 mov      dword ptr [V04 rsp+F0H], r14d
+00010A mov      dword ptr [V06 rsp+100H], r15d
+000112 mov      dword ptr [V07 rsp+108H], r12d

So we encode a bigger immediate for each stack access, which increases code size, but this looks like something that could be optimized in a way similar to COMPlus_JitConstCSE, by looking for many patterns like reg + common_base + positive_small_const. The heuristic should use the number of accesses; it currently uses the number of stack variables, which is very crude. I will try COMPlus_JitConstCSE=3 and update the description, cc @briansull.
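As a rough illustration of why the rsp-based stores in the listing above are longer, here is a minimal sketch of the x64 encoding rules involved. It is a simplified model, not JIT code: `Base`, `dispBytes`, `movStoreSize`, and `checkAgainstListing` are hypothetical names, and only rbp and rsp are modeled as base registers. A displacement in [-128, 127] fits in one byte, anything larger takes four, and rsp as a base additionally requires a SIB byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model for illustration only; this is not code from the JIT.
enum class Base { Rbp, Rsp };

// Number of displacement bytes in a [base + disp] memory operand.
int dispBytes(Base base, int32_t disp)
{
    if (disp == 0)
    {
        // [rsp] can omit the displacement; [rbp] cannot, because that ModRM
        // form is reserved for another addressing mode, so it needs a zero disp8.
        return (base == Base::Rbp) ? 1 : 0;
    }
    // disp8 covers [-128, 127]; everything else needs disp32.
    return (disp >= -128 && disp <= 127) ? 1 : 4;
}

// Size in bytes of `mov dword ptr [base + disp], r32` where r32 is r8d..r15d.
int movStoreSize(Base base, int32_t disp)
{
    int size = 1  // REX prefix (needed for r8d..r15d)
             + 1  // opcode
             + 1; // ModRM
    if (base == Base::Rsp)
    {
        size += 1; // rsp as a base always requires a SIB byte
    }
    return size + dispBytes(base, disp);
}

// Checks the model against the instruction offsets in the listing.
void checkAgainstListing()
{
    assert(movStoreSize(Base::Rbp, 0x30) == 4);  // 000104 -> 000108
    assert(movStoreSize(Base::Rsp, 0xF0) == 8);  // 000102 -> 00010A
    assert(movStoreSize(Base::Rsp, 0x100) == 8); // 00010A -> 000112
}
```

Under this model each store grows from 4 to 8 bytes once the offset no longer fits in a signed byte, which matches the offsets shown in the diff.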

PerfScore ignores such encoding changes (whether immediate size should affect PerfScore is questionable, but that is a separate discussion). It shows improvements, but also many regressions. A typical regression is when critical edge resolution forces more variables onto the stack even though we start from a better position: rbp is available for register allocation.
It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct? Could it be fixed by the PR that combines free/busy register allocation?

Sergey Andreenko added 3 commits September 14, 2020 01:53
@sandreenko sandreenko added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 14, 2020
assert(comp->compCanEncodePtrArgCntMax());
#endif

#if defined(TARGET_X86)
if (maxStackLevel >= sizeof(unsigned))

@sandreenko

This is the actual change. As you can see from the comment, it applies only to x86, where we push arguments (and use EBP), but nobody guarded it as such before the initial CoreCLR PR.

@sandreenko

PTAL @dotnet/jit-contrib

@CarolEidt

> It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct?

Looking at the attached dumps, the register allocation is identical and there are no critical edges, and no resolution is done. Did you intend to attach dumps for a different method?

@sandreenko

sandreenko commented Sep 18, 2020

> > It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct?
>
> Looking at the attached dumps, the register allocation is identical and there are no critical edges, and no resolution is done. Did you intend to attach dumps for a different method?

Yes, sorry, the correct dumps:
base(better, 542).txt
diff(worse, 737).txt

The issue is in "Splitting edge from BB08 to BB13; adding BB20":

base:                                                   
   BB20 bottom: move V04 from r15 to STK (Critical)
   BB20 bottom: move V06 from r12 to STK (Critical)
   BB20 bottom: move V07 from r13 to STK (Critical)
diff:
   BB20 bottom: move V04 from r14 to STK (Critical)
   BB20 bottom: move V06 from r15 to STK (Critical)
   BB20 bottom: move V07 from r12 to STK (Critical)
   BB20 bottom: move V08 from r13 to STK (Critical)

@CarolEidt

The edge splitting in both versions is equally bad. But because there are more available registers in the "diff" version, it allows for even more of a discrepancy between the most registers in use at an edge, and the least, so there are more resolution moves. Although there is no loop in this code, I think that the general idea behind the proposed solution to #9909, which is described in https://github.com/dotnet/runtime/blob/master/docs/design/coreclr/jit/lsra-detail.md#avoid-splitting-loop-backedges, would be the best way to address this. The idea would be to ensure that the allocation matches at critical edges to avoid splitting.
This case also highlights the simplistic heuristic for deciding when to allocate a register for an incoming stack parameter. Note that it decides not to allocate an initial register for any of the incoming stack parameters, as they are considered to have a low reference count. However, once it decides this it would presumably be better not to allocate a register for subsequent references (or, rather, to allocate and immediately spill), which would also reduce the mismatches.
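
To make the terminology concrete, here is a minimal sketch of what a critical edge is and how mismatched variable locations across an edge turn into resolution moves. It is a simplified model under stated assumptions, not LSRA's actual data structures: `Block`, `VarLocations`, `isCriticalEdge`, and `countResolutionMoves` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model; these are not LSRA's data structures.
struct Block
{
    int numSuccs; // number of successor blocks
    int numPreds; // number of predecessor blocks
};

// An edge is critical when its source has several successors and its target
// has several predecessors. Moves cannot go at the end of the source or the
// start of the target without affecting other paths, so the edge is split
// and the moves are placed in a new block.
bool isCriticalEdge(const Block& from, const Block& to)
{
    return from.numSuccs > 1 && to.numPreds > 1;
}

// Maps a variable name to its location: a register name or "STK".
using VarLocations = std::map<std::string, std::string>;

// Counts variables whose location at the end of the predecessor differs from
// the location expected at the start of the successor. Each mismatch needs
// one resolution move.
int countResolutionMoves(const VarLocations& outOfPred, const VarLocations& intoSucc)
{
    int moves = 0;
    for (const auto& entry : outOfPred)
    {
        auto it = intoSucc.find(entry.first);
        if (it != intoSucc.end() && it->second != entry.second)
        {
            moves++;
        }
    }
    return moves;
}
```

With the locations from the dumps above (V04, V06, V07 in registers in the base; V04, V06, V07, V08 in registers in the diff; all expected on the stack at the target), this model counts 3 moves for the base and 4 for the diff. That V08 is already on the stack on both sides of the edge in the base is an assumption; the dump excerpt does not show it.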

@sandreenko

Thanks, Carol, for the analysis. I will try to address the first issue (bigger immediate encoding for bigger offsets with rsp frames) and see how many methods regress because of the edge splitting issue. Maybe the good diffs will compensate for the bad ones; otherwise I will postpone this PR until after #9909.

@sandreenko

I opened an issue to track that but will temporarily close the PR until I have time to work on it.

@sandreenko sandreenko closed this Sep 24, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
@sandreenko sandreenko deleted the stackLevelSetterImp branch December 29, 2020 19:30