General inlining of tail calls #12304
Comments
I should probably point out that the math above is a bit handwavy, as the size of a frame depends on a lot of factors. The post-inline size S(A+B) should be close to A+B but may be more or less. One generally hopes it will be less, as frame size is often a good indicator of prolog cost. Because the jit inliner is scriptable, the idea I have in mind for modelling the stack impact of inlining B into A is simply to conduct a huge number of actual measurements of the impact, similar to what I did a while back when I was building a model of the code size impact of inlines. We can then see how easy it is to estimate the stack frame impact given the facts available to the inliner (more likely, some biased estimator, so that we're more likely to overestimate than underestimate). Using actual measurements also has the benefit that it captures various jit quirks and adapts to different target ISAs and ABIs. Ideally the heuristic is largely independent of these so we get similar behavior everywhere, but we'll have to see what the numbers say.
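As a rough illustration of that measurement-driven approach (everything here is hypothetical: the features, the data, and the fitting choice are stand-ins, not what the jit actually observes), one could fit a simple linear model to measured frame-size deltas and then shift it so it overestimates more often than it underestimates:

```python
# Hypothetical sketch of a deliberately pessimistic (overestimating) model of the
# stack frame growth caused by an inline, fit from before/after frame measurements.
# Features and data are invented for illustration.
import numpy as np

def fit_biased_estimator(features, observed_deltas, quantile=0.9):
    """Least-squares fit, then shift the intercept so that `quantile` of the
    training observations fall at or below the model's prediction."""
    X = np.column_stack([np.ones(len(features)), features])
    coeffs, *_ = np.linalg.lstsq(X, observed_deltas, rcond=None)
    residuals = observed_deltas - X @ coeffs
    coeffs[0] += np.quantile(residuals, quantile)   # bias toward overestimation
    return coeffs

def predict_delta(coeffs, feature_row):
    return coeffs[0] + float(np.dot(coeffs[1:], feature_row))

# Per-candidate features, e.g. [num args, num locals, IL size] (invented numbers).
features = np.array([[2, 1, 20], [4, 6, 120], [1, 0, 8], [3, 3, 60], [5, 8, 200]])
observed = np.array([16, 88, 0, 40, 120])           # measured frame growth, bytes
model = fit_biased_estimator(features, observed)
print(predict_delta(model, [2, 2, 30]))             # conservative estimate for a new candidate
```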
A couple more notes
Preview of the jit changes needed, without any kind of heuristic: InlineExplicitTailCalls. Sample diff on one of the methods from dotnet/fsharp#6329:
;;; before
; Assembly listing for method Logger:Log(ubyte,ref):this
G_M33878_IG01:
nop
G_M33878_IG02:
mov rcx, gword ptr [rcx+8]
mov rdx, r8
mov rax, 0xD1FFAB1E
cmp dword ptr [rcx], ecx
G_M33878_IG03:
rex.jmp rax // tail call List.Add
;;; after
; Assembly listing for method Logger:Log(ubyte,ref):this
G_M33879_IG01:
sub rsp, 40
nop
G_M33879_IG02:
mov rcx, gword ptr [rcx+8]
inc dword ptr [rcx+20] // inline List.Add
mov rdx, gword ptr [rcx+8]
mov eax, dword ptr [rcx+16]
cmp dword ptr [rdx+8], eax
jbe SHORT G_M33879_IG04 // capacity check
lea r9d, [rax+1] // "fast path" add
mov dword ptr [rcx+16], r9d
mov rcx, rdx
mov edx, eax
call CORINFO_HELP_ARRADDR_ST
nop
G_M33879_IG03:
add rsp, 40
ret
G_M33879_IG04:
mov rdx, r8
mov rax, 0xD1FFAB1E
G_M33879_IG05:
add rsp, 40
rex.jmp rax // tail call List.AddWithResize
This note explores expanding jit inlining to handle cases of explicit tail calls.
cc @dotnet/jit-contrib @dsyme
Related issues: #10487, dotnet/fsharp#6329
Current behavior of the jit
We will use the following notation to describe a chain of calls and call sites:
A -(cs1)-> B -(cs2)-> C
Here A, B, and C are methods, and cs1 and cs2 are call sites in A and B respectively. We will also use A, B, and C in formulas where they indicate the stack space used by the respective methods, and S for the total stack needed by some logical chain of calls.
A call site can be:
- ?t : a tail call or a normal call
- xt : a normal call
- it : an implicit tail call
- et : an explicit tail call (can be fast or slow)
Inlining methods that do not have explicit tail call sites
Case 0: Inlining when invoked at an explicit tail call site
A -(cs1:et)-> B
Before stack size is S = max(A, B), since the tail call lets B's frame replace A's.
After stack size is S' = S(A+B), which we expect to be close to A + B.
Worst case, when the jit can't optimize away any stack usage after inlining, the stack delta is S' - S = (A + B) - max(A, B) = min(A, B).
For performance, the before picture may have required a slow tail call, so we can expect a good perf boost from inlining, perhaps better than a typical inline. If we could anticipate this we could even give inlining an extra profitability boost.
[Case 0 Heuristic]: allow inlining if A or B is small stack, boost if cs1 was a slow tail call
Heuristics that depend on estimating the stack size of A or the nature of the tail call to B may not be workable in the current jit; see commentary below. So we may decide to just rely on the size of B:
[Simple Case 0 Heuristic]: allow inlining if B is small stack
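To make the Case 0 arithmetic concrete, here is a tiny worked example; the byte counts are invented and assume the usual model where a fast tail call lets B's frame replace A's:

```python
# Illustrative Case 0 arithmetic with made-up frame sizes (bytes).
A, B = 64, 32

before = max(A, B)            # tail call: B's frame replaces A's frame -> 64
after_worst = A + B           # worst-case S(A+B) with no stack savings -> 96
delta = after_worst - before  # equals min(A, B) -> 32

print(before, after_worst, delta)
```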
Inlining methods that have explicit tail call sites
Case 1: Inlining when invoked from non-tail call site
A -(cs1:xt)-> B -(cs2:et)-> C
If B is inlined at a non-tail call site, then none of B's internal call sites can be tail call sites, so they will all become normal calls.
Before stack usage is S = A + max(B, C).
Stack consumption after the inline depends on what else gets inlined. Generally the jit won't be able to estimate the stack size of C at the time it has to decide on whether or not to inline B.
- If B is inlined and C is not, S' = (A + B) + C, possibly less
- If B and C are inlined, S' = ((A + B) + C), possibly less
Assuming the worst case, where inlining does not allow the jit to optimize away any stack usage, the stack increase is S' - S = (A + B + C) - (A + max(B, C)) = min(B, C).
As noted before, the jit will likely be unable to estimate C.
For perf, if the call to C was a slow tail call, perf will improve beyond just the normal inline boost. But the jit may not be able to anticipate this easily.
[Case 1 Heuristic]: allow inlining if B is small stack
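A similar worked example for Case 1, again with invented sizes; it assumes the before picture is A plus the larger of B and C (because cs2 is a tail call from B) and that after the inline cs2 becomes a normal call:

```python
# Illustrative Case 1 arithmetic with made-up frame sizes (bytes).
A, B, C = 64, 32, 128

before = A + max(B, C)        # cs2 is a tail call from B -> 64 + 128 = 192
after_worst = (A + B) + C     # cs2 becomes a normal call -> 224
delta = after_worst - before  # equals min(B, C) -> 32

print(before, after_worst, delta)
```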
Case 2: Inlining when invoked from an implicit tail call site
A -(cs1:it)-> B -(cs2:et)-> C
If cs1 is an implicit tail call, the jit can inline B and retain cs2 as an explicit tail call.
Before stack usage depends on whether B ends up being tail called or called normally:
- if B is tail called, S = max(A, B, C)
- if B is called normally, S = A + max(B, C)
We will use the former as it is smaller.
After stack usage (assuming C is not inlined) is S' = max(S(A+B), C), roughly max(A + B, C).
So the worst-case stack impact is max(A + B, C) - max(A, B, C), which is at most min(A, B).
So for stack usage, inlining seems ok if either A or B is small stack.
For perf, the jit should only do this if the call to C does not change from a fast tail call to a slow tail call. Generally slow tail calls happen because the caller's incoming arg space is not large enough for the callee's arguments. A rough heuristic for this is to compare the number of args of A and B (some ABIs will need more complex logic).
[Case 2 Heuristic]: allow inlining if B is small stack, and #args(A) >= #args(B)
Case 3: Inlining when invoked from an explicit tail call site
This dovetails with Case 0 and Case 2 above and the reasoning is similar.
[Case 3 Heuristic]: allow inlining if B is small stack, and #args(A) >= #args(B)
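A minimal sketch of the Case 2 / Case 3 gate might look like the following; the size threshold, the function name, and the idea of using raw arg counts are assumptions for illustration, not the jit's actual policy:

```python
# Hypothetical gate modelling the Case 2 / Case 3 heuristics: inline B only if its
# estimated frame is small and A has at least as many args as B, so the retained
# explicit tail call to C can stay a fast tail call.
SMALL_STACK_LIMIT = 32   # invented threshold, in bytes

def allow_tail_call_site_inline(est_stack_b, num_args_a, num_args_b):
    return est_stack_b <= SMALL_STACK_LIMIT and num_args_a >= num_args_b

print(allow_tail_call_site_inline(est_stack_b=24, num_args_a=3, num_args_b=2))  # True
print(allow_tail_call_site_inline(est_stack_b=24, num_args_a=1, num_args_b=4))  # False
```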
Use of the tail call helper
In cases where the jit determines a tail call from A will need a helper, it might be feasible to modify A so that its frame has sufficient arg space to make a fast tail call, by giving A a special prolog. This would be viable provided the tail call site(s) were executed frequently enough that we'd be willing to pay a bit more on every call, and non-tail call paths through A were rare and perhaps did not involve other calls.
Notes
- The jit currently does not have a notion of the stack size of a method or inline candidate, but presumably we could develop something. Using number and type of args and locals might be sufficient, though there are bad cases where jit temps predominate (we could presumably also screen based on IL size to help rule these out).
- Likewise the jit does not currently know whether an inline candidate's tail calls will be slow tail calls. Here the information must be based on an IL prescan. The jit does not resolve inline candidate tokens, and so may not know how many args are consumed at a call site. But again we might be able to guess well enough.
- If none of A, B, or C are part of recursive cycles we are somewhat less worried about an increase in stack size, as it can't be magnified. The jit probably can't reason about this, but some up-front tools perhaps could.
- Any estimate of the stack size of A must take already accepted inlines into account, so "optimal" results given a number of possible candidates depend on the order in which inline candidates are considered. Currently the jit is not smart about this, which is why the simplified heuristic for Case 0 is suggested.
- Another option is to let an inlined tail call body share frame space with the root method, since tail call method locals are lifetime-disjoint from root method locals. So the stack size impact of inlining a tail call would be more like max(A,B) than A + B (see the sketch below). This would remove most restrictions and might end up being a better option long term (it could also potentially help address other problems like excessive prolog zeroing).
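For comparison, a quick sketch of what the frame-sharing idea in the last note would buy, using the same kind of invented sizes as the earlier examples:

```python
# Hypothetical comparison: worst-case stack cost of inlining B at a tail call site,
# with and without sharing frame space between the root method and the inlined body.
A, B = 64, 32

print(A + B)       # today's worst case: disjoint frame space -> 96
print(max(A, B))   # with lifetime-disjoint locals sharing frame space -> 64
```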
Proposal
- Enable inlining of methods with explicit tail calls at non tail-call sites
- Develop heuristics (per the cases above) to enable inlines of methods at tail call sites and of methods with explicit tail calls
Labels: category:cq, theme:tail-call, skill-level:expert, cost:medium, impact:medium