bpo-46939: Specialize calls to Python classes #31707
Conversation
I nearly went insane writing and debugging this. The general idea follows Mark's comments in faster-cpython/ideas#267 (comment). However, since we're specializing
I've added some comments on the things I've tested against and had to debug, for reviewers, in case some parts aren't so clear.
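For context, the behavior being specialized is roughly what type.__call__ does for a plain Python class. A simplified Python-level sketch of the semantics that have to be preserved (an illustration only, not the C code in this PR):

```python
def call_class(cls, *args, **kwargs):
    # Simplified view of calling a class whose __new__ is the default
    # object.__new__ and whose __init__ is a Python function.
    self = cls.__new__(cls)                   # allocate the instance
    if isinstance(self, cls):
        cls.__init__(self, *args, **kwargs)   # runs as an ordinary Python frame
    return self                               # the caller gets self, not __init__'s None
```

The specialized path pushes the __init__ frame directly, so self has to survive until that frame returns; that is presumably what the frame->self field and the call_shape.init_pass_self flag in the hunks below are for.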
```c
@@ -5617,6 +5688,7 @@ MISS_WITH_OPARG_COUNTER(STORE_SUBSCR)
error:
    call_shape.kwnames = NULL;
    call_shape.init_pass_self = false;
```
Note to reviewers: We don't set frame->self = NULL here because that means exceptions will destroy self. E.g. consider this:
```python
class A:
    def __init__(self):
        try:
            A.a  # Kaboom!
        except AttributeError:
            pass

for _ in range(10):
    print(A())
```
```c
@@ -118,6 +119,7 @@ _PyFrame_InitializeSpecials(
    frame->f_state = FRAME_CREATED;
    frame->is_entry = false;
    frame->is_generator = false;
    frame->self = NULL;
```
Note to reviewers: self is tied to frame state instead of some cache/call_shape so that subsequent nested calls don't destroy self (and so we can identify which frame the self belongs to). Consider the following code:
```python
class Tokenizer:
    def __init__(self):
        self.__next()  # Kaboom!

    def __next(self):
        pass

for _ in range(10):
    print(Tokenizer())
```
Nice! Haven't had a chance to review this yet (I can probably get to it tomorrow, though). Unfortunately, I have another, larger PR open that heavily conflicts with this one. Is it okay if this waits until #31709 is merged? The caching in this PR will need to be reworked a bit to use the new inline caching mechanism, but it shouldn't be too difficult.
Python/ceval.c (Outdated)
```c
    DEOPT_IF(cls_t->tp_new != PyBaseObject_Type.tp_new, PRECALL);
    STAT_INC(PRECALL, hit);
    PyObject *self = _PyObject_New_Vector(cls_t, &PEEK(original_oparg),
```
If the specializer only specialized for classes that don't override __new__, you can avoid calling __new__ and just construct the object directly.
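For illustration, the eligibility condition at the Python level is simply that the class still uses the default __new__, which is what the DEOPT_IF(cls_t->tp_new != PyBaseObject_Type.tp_new, ...) guard above checks in C (the class names here are made up):

```python
class Plain:
    def __init__(self, x):
        self.x = x

class Custom:
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)

# Only classes that still use the default __new__ qualify for skipping it:
print(Plain.__new__ is object.__new__)   # True  -> allocation can be done directly
print(Custom.__new__ is object.__new__)  # False -> the overriding __new__ must run
```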
_PyObject_New_Vector does some argument checking: https://github.com/python/cpython/pull/31707/files#diff-1decebeef15f4e0b0ce106c665751ec55068d4d1d1825847925ad4f528b5b872R4525

Come to think of it, if we verify that the argcount is right at specialization time, do we need to re-verify it at runtime? Would it be safe to call tp_alloc directly? It seems that the only thing that could change is the kwds dict, but even then that's only used for argument count checking again.
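For reference, the default __new__ does perform an excess-argument check at the Python level, so skipping it is not entirely behavior-free; whether _PyObject_New_Vector mirrors exactly these checks is in the linked diff, not in this sketch:

```python
class WithInit:
    def __init__(self, x):
        self.x = x

class NoDunders:            # neither __new__ nor __init__ overridden
    pass

WithInit(1)                 # OK: object.__new__ ignores the args because __init__ is overridden
try:
    NoDunders(1)
except TypeError as err:
    print(err)              # "NoDunders() takes no arguments", raised by object.__new__
```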
This feels a bit fragile to me, but it is an interesting alternative to my approach of pushing an additional cleanup frame (https://github.com/python/cpython/compare/main...faster-cpython:specialize-calls-to-normal-python-classes?expand=1). This PR should be faster for calls to Python classes, but the extra frame field will have a cost for calls to Python functions.
Indeed, hence why I nearly went insane :). One very easy way to trip up is that all new

P.S. Do your benchmarks say anything? I'm really sorry, but right now I can't benchmark.

P.P.S. That's an interesting approach that I can't currently wrap my head around. Sorry if I created any duplicate work.
I've updated the first comment with a micro benchmark using pyperf on Windows with PGO. The results show
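The exact benchmark script isn't included in the thread; a pyperf micro-benchmark along these lines (the class and the timed statement here are assumptions, not the actual script) is the usual shape for timing instance creation:

```python
import pyperf

class Point:
    # A "plain" Python class: default __new__, Python-level __init__,
    # i.e. the case this PR specializes.
    def __init__(self, x, y):
        self.x = x
        self.y = y

runner = pyperf.Runner()
runner.timeit("create_instance",
              stmt="Point(1, 2)",
              globals={"Point": Point})
```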
Pyperformance shows a speedup in object-heavy workloads, and a 1% speedup on average.
Should this be closed in favor of #99331?
I'll close it when that one's merged.
CALL_X supporting __init__ inlining:
Micro benchmark using pyperf (Windows, PGO):
https://bugs.python.org/issue46939