Do not run stack level setter in release except for x86. #42197

sandreenko · 2020-09-14T10:56:59Z

It saves ~0.20% of the Jit time.

This change has no diffs on arm/arm64 because we always save the frame pointer there, and it does not affect x86.

On x64 the changes are a mix of size improvements (we no longer push and pop rbp, so it is available to LSRA) and regressions, because we often encode a bigger immediate, for example:

-000013 lea      rbp, [rsp+C0H]
-000104 mov      dword ptr [V04 rbp+30H], r15d
-000108 mov      dword ptr [V06 rbp+40H], r12d
-00010C mov      dword ptr [V07 rbp+48H], r13d
-
+000102 mov      dword ptr [V04 rsp+F0H], r14d
+00010A mov      dword ptr [V06 rsp+100H], r15d
+000112 mov      dword ptr [V07 rsp+108H], r12d

so we encode a bigger immediate for each stack access, which gives us a bigger code size. It looks like something that could be optimized in a way similar to COMPlus_JitConstCSE, when we see many patterns like reg + common_base + positive_small_const. It should use the number of accesses, whereas currently it uses the number of stack variables, which is a very crude heuristic. I will try COMPlus_JitConstCSE=3 and update the description, cc @briansull.
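A rough sketch of why the rsp-based stores above are larger, and of what an access-count estimate could look like. On x64 a displacement in the range -128..127 is encoded in one byte and anything else takes four, and an rsp base needs an extra SIB byte. All names here (`Base`, `dispBytes`, `movStoreSize`, `bytesSavedByRebasing`) are hypothetical and not taken from the JIT sources; the estimate counts only the displacement field and is an illustration, not the heuristic the JIT uses.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not part of the JIT.
enum class Base { RBP, RSP };

// Bytes taken by the displacement field of an x64 memory operand.
// A zero displacement can be omitted, except when rbp is the base.
int dispBytes(Base base, int32_t disp)
{
    if (disp == 0 && base != Base::RBP)
        return 0;
    return (disp >= -128 && disp <= 127) ? 1 : 4;
}

// Encoded size of `mov dword ptr [base+disp], r32`.
// needsRex is true when the source register is r8d..r15d.
int movStoreSize(Base base, int32_t disp, bool needsRex)
{
    int size = 2; // opcode + ModRM
    if (needsRex)
        size += 1; // REX prefix
    if (base == Base::RSP)
        size += 1; // SIB byte is mandatory for an rsp base
    return size + dispBytes(base, disp);
}

// Estimate based on the number of accesses (one entry per access):
// each access that shrinks from disp32 to disp8 after rebasing saves 3 bytes.
// setupCost is the size of the instruction that materializes the new base.
int bytesSavedByRebasing(const std::vector<int32_t>& accessDisps,
                         int32_t commonBase, int setupCost)
{
    int saved = 0;
    for (int32_t disp : accessDisps)
    {
        int32_t rebased = disp - commonBase;
        bool wasWide   = disp < -128 || disp > 127;
        bool nowNarrow = rebased >= -128 && rebased <= 127;
        if (wasWide && nowNarrow)
            saved += 3;
    }
    return saved - setupCost;
}
```

With the listing above, `movStoreSize(Base::RBP, 0x30, true)` gives 4 bytes and `movStoreSize(Base::RSP, 0xF0, true)` gives 8, matching the spacing of the instruction offsets in the two versions.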

If I measure PerfScore, it ignores such changes (whether immediate size should affect PerfScore is questionable, but that is another discussion) and shows improvements, but also many regressions. A typical regression is when critical edge resolution forces more variables onto the stack, even though we start from a better position: rbp is available for register allocation.
It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct? Could it be fixed by the "combine free/busy reg allocation" PR?

For x86 it is required for correctness; for other platforms it only does some checks.

sandreenko · 2020-09-15T23:37:32Z

src/coreclr/src/jit/stacklevelsetter.cpp

+ assert(comp->compCanEncodePtrArgCntMax());
+#endif
+
+#if defined(TARGET_X86)
 if (maxStackLevel >= sizeof(unsigned))


This is the actual change. As you can see from the comment, the check applies only to x86, where we push arguments (and use EBP), but the guard was omitted at some point before the initial CoreCLR PR.
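The gating described here and in the commit titled "Run stack level setter only for x86 or in debug" can be pictured as below. This is only a sketch: `BuildConfig` and `shouldRunStackLevelSetter` are hypothetical stand-ins, whereas the JIT itself selects this at compile time with preprocessor checks such as `#if defined(TARGET_X86)`.

```cpp
// Hypothetical stand-in for the build configuration; the real JIT uses
// preprocessor macros rather than a runtime struct.
struct BuildConfig
{
    bool targetX86; // x86 pushes arguments, so the stack level matters for correctness
    bool debug;     // other targets only use the phase for checks
};

// Run the phase when it is required for correctness (x86)
// or when its checks are wanted (debug builds).
constexpr bool shouldRunStackLevelSetter(BuildConfig cfg)
{
    return cfg.targetX86 || cfg.debug;
}
```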

sandreenko · 2020-09-15T23:38:18Z

PTAL @dotnet/jit-contrib

CarolEidt · 2020-09-18T20:31:11Z

It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct?

Looking at the attached dumps, the register allocation is identical and there are no critical edges, and no resolution is done. Did you intend to attach dumps for a different method?

sandreenko · 2020-09-18T21:03:52Z

It looks like an issue in regAlloc, @CarolEidt could you please look at the dump and say if my analysis is correct?

Looking at the attached dumps, the register allocation is identical and there are no critical edges, and no resolution is done. Did you intend to attach dumps for a different method?

Yes, sorry, the correct dumps:
base(better, 542).txt
diff(worse, 737).txt

The issue is at "Splitting edge from BB08 to BB13; adding BB20":

base:                                                   
   BB20 bottom: move V04 from r15 to STK (Critical)
   BB20 bottom: move V06 from r12 to STK (Critical)
   BB20 bottom: move V07 from r13 to STK (Critical)
diff:
   BB20 bottom: move V04 from r14 to STK (Critical)
   BB20 bottom: move V06 from r15 to STK (Critical)
   BB20 bottom: move V07 from r12 to STK (Critical)
   BB20 bottom: move V08 from r13 to STK (Critical)

CarolEidt · 2020-09-18T22:20:49Z

The edge splitting in both versions is equally bad. But because there are more available registers in the "diff" version, it allows for even more of a discrepancy between the most registers in use at an edge, and the least, so there are more resolution moves. Although there is no loop in this code, I think that the general idea behind the proposed solution to #9909, which is described in https://github.com/dotnet/runtime/blob/master/docs/design/coreclr/jit/lsra-detail.md#avoid-splitting-loop-backedges, would be the best way to address this. The idea would be to ensure that the allocation matches at critical edges to avoid splitting.
This case also highlights the simplistic heuristic for deciding when to allocate a register for an incoming stack parameter. Note that it decides not to allocate an initial register for any of the incoming stack parameters, as they are considered to have a low reference count. However, once it decides this it would presumably be better not to allocate a register for subsequent references (or, rather, to allocate and immediately spill), which would also reduce the mismatches.
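To make the mismatch argument concrete, here is a minimal sketch of how resolution moves arise on an edge: every variable whose location at the end of the source block differs from its location at the start of the target block needs a move, and on a critical edge those moves need a new block. The types and function names are hypothetical and much simpler than what LSRA does. In the usage note, the location of V08 in the base version is an assumption, since the dump excerpt does not show it.

```cpp
#include <map>
#include <string>
#include <vector>

// Variable name -> location ("r14", "STK", ...). Hypothetical representation.
using Locations = std::map<std::string, std::string>;

struct Move
{
    std::string var;
    std::string from;
    std::string to;
};

// An edge is critical when its source block has more than one successor
// and its target block has more than one predecessor.
bool isCriticalEdge(int sourceSuccessorCount, int targetPredecessorCount)
{
    return sourceSuccessorCount > 1 && targetPredecessorCount > 1;
}

// One move per variable whose location differs across the edge.
std::vector<Move> resolutionMoves(const Locations& atSourceEnd,
                                  const Locations& atTargetStart)
{
    std::vector<Move> moves;
    for (const auto& entry : atSourceEnd)
    {
        auto it = atTargetStart.find(entry.first);
        if (it != atTargetStart.end() && it->second != entry.second)
            moves.push_back({entry.first, entry.second, it->second});
    }
    return moves;
}
```

Feeding in the diff locations (V04 in r14, V06 in r15, V07 in r12, V08 in r13, all expected on the stack in the target) yields four moves. The base locations yield three if V08 is assumed to be on the stack on both sides. More registers in use at the source end means more opportunities for a mismatch.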

sandreenko · 2020-09-18T23:01:13Z

Thanks, Carol, for the analysis. I will try to address the first issue (bigger immediate encoding for larger offsets with rsp frames) and see how many methods regress because of the edge splitting issue. Maybe the good diffs will compensate for the bad ones; otherwise I will postpone this PR until after #9909.

sandreenko · 2020-09-24T22:45:22Z

I opened an issue to track that but will temporarily close the PR until I have time to work on it.

Sergey Andreenko added 3 commits September 14, 2020 01:53

Change noway to a debug only assert for maxStackDepth.

e0fe2bc

Use a public getter for fgPtrArgCntMax.

8a09b10

Run stack level setter only for x86 or in debug.

caa9670

For x86 it is required for correctness; for other platforms it only does some checks.

sandreenko added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 14, 2020

sandreenko mentioned this pull request Sep 24, 2020

Do not run stackLevelSetter except for x86. #42673

Closed

sandreenko closed this Sep 24, 2020

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

sandreenko deleted the stackLevelSetterImp branch December 29, 2020 19:30
