-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
True function inlining: design discussion #653
Comments
This is basically the same idea as Mark's PhD theses called "Deferred object creation"https://theses.gla.ac.uk/2975/1/2011shannonphd.pdf (see section 5.7.1). The only new addition is the sinking of frame reconstruction data into the trace, and the locals interleaving. |
I am super excited about this topic, since my intuition tells me that call inlining is the key to solid perf improvements. I am a little confused about the two alternatives -- is what you describe in your first comment called "sinking reconstruction into side exits"? The last time I chatted with Mark about this, he suggested that the simplest useful example would be to take a call like I think even that is somewhat challenging because you need to detect that there's only a certain kind of thing between Regarding frame reconstruction, it seems clear that in theory some kind of static analysis of the trace should be sufficient to record everything we need so that upon an early exit we can reconstruct the frame, and have some kind of slow code that does the actual reconstruction, either in the form of custom generated code (different for each exit), or in the form of a mini-interpreter for some very simple custom code (think of the way the line number table is used to reconstruct the current line number). It also looks as if we might be able to avoid the need for a fully general frame reconstruction mechanism by only inlining sequences of uops that satisfy some predicate (e.g. "only pure ops"). We might be able to choose a predicate that is less restrictive than "only pure" but that still allows the reconstruction data/code to be finite (and small) in size. In particular, it would be nice if we started by limiting ourselves to ops that cannot lead to a call to Thoughts? What would really get me excited would be a prototype that got rid of the |
Yes. I realised it's the same in Mark's paper as deferred object creation, so I deleted my comment. Sorry for the confusion.
Yup that's my current approach!
That was my initial plan way back, but it's too restrictive. Even something like
My current prototype still has reconstruction for that. That's not very easy at the moment and you need a few passes to remove everything perfectly. Here's how it could look like.
|
Honestly that's fine with me, at least for the first phase. It looks like we may just be disagreeing about what should be the first step. It seems my first step would be to initially limit call inlining to things that need zero reconstruction, and then gradually add more complex cases, whereas your proposed first step is to implement a general algorithm that includes reconstruction but isn't necessarily optimal in all cases, and then iterate on special cases. Do you think that's a fair summary? We should probably put this on the agenda for Wednesday, so we can get other opinions. |
Separately, I don't fully understand how you plan to move from stage 1 to stage 4 in your last comment with examples.
There also seem to be a few things missing after |
I can work on function inlining for the simple case first. Most of the optimizer infrastructure will be shared anyways, so starting with either works.
Add constant evaluation for
That's replacing _PUSH_FRAME and _POP_FRAME with their real stack effects, now that the frames are gone. That is what
Yup I was writing this manually, so I forgot a few things. Hope this helps illustrate the point though! |
Got it, const evaluation is a pretty powerful tool! Also congrats on getting the foundations merged. |
@Fidget-Spinner Please write down the details of the reconstruction design. |
Frame reconstruction details:At the point of a deopt, the operand stack will look like this:
To reconstruct the inlined frames, we need to convert the stack above Note that the stack is already laid out exactly how the localsplus of
The benefits of this reconstruction scheme:
Additional special case for See paragraph above in original post titled |
Paging @pablogsal. Do you think removing frames completely would break |
Not “break” as in they will crash but certainly it will make it less useful and then formation they report will be less accurate and confusing which will certainly impact users. This is the same when compilers inline functions and then getting stack traces in debuggers is challenging when the inline functions don’t appear. I would recommend to think about the debugging experience here because this is the classic example when performance and debugging experience are in direct competition and at the very least we should have that in mind when making decisions. |
There are two different things to think about:
|
Going back to the frame reconstruction details comment.
|
Woops yes they are at
Currently, I write a small stub of various uops containing what we need at the end of the trace, after
Yes I already implemented that on my old branch. It's a run-once uop at the top of the trace. This is why we need something like |
Hm, IIRC in Tier 1 (and I think also in Tier 2 so far) the callable+NULL are popped off the stack some time during the call sequence (I think
That sounds like it's relatively wasteful of space. Is this shareable by all deopts/exits in a trace? |
It's one stub per inlined frame, which is shared by all deopt/exits inside that inlined frame. Considering we only go 5 frames deep at maximum, it's not too much. |
I would still love to see the design of the reconstruction data described in more detail here. |
Will get to it. One more design discussion point: for 3.13, I can't mix non-inlined and inlined frames, so you can't have:
but rather only
The reasoning is that the former makes reconstruction more complex, and isn't feasible for the 3.13 time frame. The main thing is that the data stack chunks need to represent the frames. If we have mixed inlined and non-inlined things, then the frame reconstruction and data stack chunks in the interpreter need a lot more reconstruction juice. |
That makes total sense. |
I've dropped function inlining for 3.13. Instead I propose #657 to get most of the benefits without most of the cons. |
This is all the data we need to reconstruct an inlined frame at the moment https://github.com/Fidget-Spinner/cpython/blob/e1ee2ad076541e1e983a4c056fc93d3a7c5cf568/Python/optimizer_analysis.c#L644 |
True function inlining
This is my proposal for function inlining in 3.13. Feedback and alternative proposals welcome.
CPython 3.11 obtained frame inlining. This eliminated most of the overhead of function calls. However, "_PyInterpreterFrame" were still needed. These take the form of the instructions
_INIT_CALL_PY_EXACT_ARGS
and_PUSH_FRAME
.In CPython 3.13, I plan for true function inlining -- that is, no frame creation at all in the happy case, not even
_PyInterpreterFrame
frames.The uops tracer already projects traces through Python function boundaries. The goal is to eliminate
_INIT_CALL_PY_EXACT_ARGS
,_PUSH_FRAME
and_POP_FRAME
. Doing this is easy, but doing this while maintaining CPython semantics is very hard.Before
After
The cost of a call went from:
Before:
After:
sys._getframe
, see below).Locals interleaving
By interleaving the locals of the new call with the stack of the old function frame, we can achieve zero argument copying.
Before a function call, the stack looks like this:
Notice that the arguments are already laid out as how the locals of the new "frame" will be. We just need to expand the stack to accommodate for more locals. Finally, since we are now pointing into the current stack as our locals, we offset all
LOAD_FAST
andSTORE_FAST
into the current stack.LOAD_FAST 0 + offset
of inlined frameFinally, we must update the attribute names' tuple that the current frame points to, so that the inlined call's attribute loads load the correct attribute names. This is just a simple assignment.
Frame reconstruction
We need to reconstruct frames on every deopt, traceback,
sys._getframe
, and possiblylocals()
too. To reconstruct frames, we need to store the following information in the trace for each inlined frame in the current call stack:For the following call stack:
We reconstruct from the root frame upwards -- so in the order
Inlined Frame 1 -> Inlined Frame 2 -> Inlined Frame 3
. Note that we can statically determine the current stack entries of all inlined frames except the topmost frame --inlined frame 3
. For that, we simply need to calculate at runtime from the current stack pointer.If you want to see how complex this all is, I implemented the reconstruction in roughly 120 lines of code (with lots of debugging code inside), in my previous iteration of my optimizer. This already works with deopts and tracebacks.
sys._getframe
andlocals()
There is one minor inelegance with how we handle
sys._getframe
, but it's also a strength. In the case of deopts and tracebacks, after reconstruction, the recreated frames are now the ones being executed. In the case ofsys._getframe
, the frame being reconstructed is purely forsys._getframe
. That means we do not actually execute the reconstructed frames. The reason behind this is that we have no clue wherestack_pointer
is for the topmost frame, as it is local toceval.c
, so we cannot accurately create the topmost frame's stack. For this reason, we link up the frames as per usual, mark them asinlined virtual
, and clear them on_POST_INLINE
as per normal. This means even in the case ofsys._getframe
, the only cost we incur is the cost of the original function overhead plus a little extra. sys._getframe should have no noticeable slowdown.How will this play with the JIT?
I assume the JIT will get some form of register allocation. If Brandt implements some sort of stack operands cached in register scheme. The locals interleaving will mean Python calls pass their arguments by register automatically, with no additional infrastructure changes required to make the JIT aware of inlining. The only change the JIT would need to be aware of is that it needs to change the stack operands that it caches to the new virtual inlined stack base rather than the original stack base.
The text was updated successfully, but these errors were encountered: