Skip to content

bpo-46939: Specialize calls to Python classes #31707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Fidget-Spinner
Copy link
Member

@Fidget-Spinner Fidget-Spinner commented Mar 6, 2022

CALL_X supporting __init__ inlining:

  • CALL_PY_EXACT_ARGS
  • CALL_PY_WITH_DEFAULTS
  • CALL

Micro benchmark using pyperf (Windows, PGO):

-m pyperf timeit -s "
>> class A:
>>  def __init__(self, a):
>>   self.a =a
>> " "A(1)" -o ..\pyperf-output\spec_py_class.json

Mean +- std dev: [spec_py_class_main] 159 ns +- 5 ns -> [spec_py_class] 106 ns +- 3 ns: 1.51x faster

https://bugs.python.org/issue46939

@Fidget-Spinner
Copy link
Member Author

I nearly went insane writing and debugging this. The general idea follows Mark's comments in faster-cpython/ideas#267 (comment). However, since we're specializing __init__, we need to return self instead of the None that __init__ normally returns. This required a significant amount of hackery.

Copy link
Member Author

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments on things I've tested against and had to debug. For reviewers in case some parts aren't so clear.

@@ -5617,6 +5688,7 @@ MISS_WITH_OPARG_COUNTER(STORE_SUBSCR)

error:
call_shape.kwnames = NULL;
call_shape.init_pass_self = false;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: We don't set frame->self = NULL here because that means exceptions will destroy self. E.g. consider this:

class A:
  def __init__(self):
    try:
      A.a # Kaboom!
    except AttributeError:
      pass

for _ in range(10):
  print(A())

@@ -118,6 +119,7 @@ _PyFrame_InitializeSpecials(
frame->f_state = FRAME_CREATED;
frame->is_entry = false;
frame->is_generator = false;
frame->self = NULL;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: tied to frame state instead of some cache/call_shape so that subsequent nested calls don't destroy self (and we can identify which frame the self belongs to). Consider the following code:

class Tokenizer:
    def __init__(self):
        self.__next() # Kaboom!
    def __next(self):
        pass

for _ in range(10):
 print(Tokenizer())

@brandtbucher
Copy link
Member

Nice! Haven't had a chance to review this yet (I can probably get to it tomorrow, though).

Unfortunately, I have another, larger PR open that heavily conflicts with this one. Is it okay if this waits until #31709 is merged? The caching in this PR will need to be reworked a bit to use the new inline caching mechanism, but it shouldn't be too difficult.

@brandtbucher brandtbucher self-requested a review March 7, 2022 00:49
Python/ceval.c Outdated
DEOPT_IF(cls_t->tp_new != PyBaseObject_Type.tp_new, PRECALL);
STAT_INC(PRECALL, hit);

PyObject *self = _PyObject_New_Vector(cls_t, &PEEK(original_oparg),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the specializer only specialized for classes that don't override __new__, you can avoid calling __new__ and just construct the object directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_PyObject_New_Vector does some argument checking. https://github.com/python/cpython/pull/31707/files#diff-1decebeef15f4e0b0ce106c665751ec55068d4d1d1825847925ad4f528b5b872R4525

Come to think of it, if we verify that the argcount is right at specialization time, do we need to re-verify at runtime? Would it be safe to call tp_alloc directly? It seems that the only thing that could change is the kwds dict, but even then that's only used for argument count checking again.

@markshannon
Copy link
Member

This feels a bit fragile to me, but it is an interesting alternative to my approach of pushing an additional cleanup frame (https://github.com/python/cpython/compare/main...faster-cpython:specialize-calls-to-normal-python-classes?expand=1)

This PR should be faster for calls to Python classes, but the extra frame field will have a cost for calls to Python functions.

@Fidget-Spinner
Copy link
Member Author

Fidget-Spinner commented Mar 7, 2022

This feels a bit fragile to me

Indeed, hence why I nearly went insane :). One very easy way to trip over is that all new CALL_PY_* opcodes must also handle this special case, like how the rest are currently doing so. I've added a CALL_PY_FRAME_PASS_SELF() macro so that it's more obvious to people. Most of our test suite will crash (I've tried this) if you forget to add it. Python class creation is an integral part of Python after all.

P.S. Do your benchmarks say anything? I'm really sorry but right now I can't benchmark.

P.P.S. That's an interesting approach that I can't currently wrap my head around. Sorry if I created any duplicate work.

@Fidget-Spinner
Copy link
Member Author

Fidget-Spinner commented Mar 8, 2022

I've updated the first comment with a micro benchmark using pyperf on Windows with PGO. The results show 50% speedup for class creation over main.

@Fidget-Spinner
Copy link
Member Author

Pyperformance shows a speedup in object-heavy workloads. 1% speedup on average:

Slower (10):
- fannkuch: 603 ms +- 6 ms -> 621 ms +- 4 ms: 1.03x slower
- sqlite_synth: 3.39 us +- 0.06 us -> 3.49 us +- 0.13 us: 1.03x slower
- unpickle_list: 7.42 us +- 0.15 us -> 7.63 us +- 0.15 us: 1.03x slower
- mako: 17.7 ms +- 0.1 ms -> 18.1 ms +- 0.1 ms: 1.02x slower
- logging_simple: 8.91 us +- 0.12 us -> 9.09 us +- 0.14 us: 1.02x slower
- django_template: 55.7 ms +- 0.7 ms -> 56.8 ms +- 2.0 ms: 1.02x slower
- pickle: 15.9 us +- 0.5 us -> 16.1 us +- 0.2 us: 1.02x slower
- pidigits: 297 ms +- 1 ms -> 302 ms +- 1 ms: 1.01x slower
- logging_format: 10.1 us +- 0.2 us -> 10.2 us +- 0.2 us: 1.01x slower
- unpickle: 19.9 us +- 0.3 us -> 20.1 us +- 0.2 us: 1.01x slower

Faster (20):
- raytrace: 486 ms +- 4 ms -> 448 ms +- 3 ms: 1.08x faster
- chaos: 113 ms +- 1 ms -> 104 ms +- 1 ms: 1.08x faster
- float: 126 ms +- 1 ms -> 119 ms +- 2 ms: 1.06x faster
- scimark_lu: 193 ms +- 6 ms -> 184 ms +- 1 ms: 1.05x faster
- scimark_sor: 191 ms +- 2 ms -> 184 ms +- 2 ms: 1.04x faster
- chameleon: 11.3 ms +- 0.2 ms -> 10.9 ms +- 0.1 ms: 1.03x faster
- deltablue: 6.57 ms +- 0.07 ms -> 6.38 ms +- 0.06 ms: 1.03x faster
- telco: 10.4 ms +- 0.4 ms -> 10.1 ms +- 0.2 ms: 1.03x faster
- dulwich_log: 123 ms +- 1 ms -> 119 ms +- 1 ms: 1.03x faster
- scimark_fft: 538 ms +- 12 ms -> 523 ms +- 3 ms: 1.03x faster
- json_dumps: 19.6 ms +- 0.3 ms -> 19.2 ms +- 0.2 ms: 1.02x faster
- pathlib: 30.9 ms +- 0.4 ms -> 30.1 ms +- 0.5 ms: 1.02x faster
- nbody: 150 ms +- 2 ms -> 147 ms +- 2 ms: 1.02x faster
- tornado_http: 207 ms +- 4 ms -> 203 ms +- 3 ms: 1.02x faster
- go: 232 ms +- 2 ms -> 228 ms +- 2 ms: 1.02x faster
- unpack_sequence: 75.2 ns +- 0.9 ns -> 74.0 ns +- 0.7 ns: 1.02x faster
- pyflate: 711 ms +- 9 ms -> 700 ms +- 6 ms: 1.02x faster
- regex_effbot: 5.32 ms +- 0.06 ms -> 5.24 ms +- 0.06 ms: 1.02x faster
- regex_compile: 224 ms +- 1 ms -> 221 ms +- 1 ms: 1.01x faster
- pickle_list: 7.18 us +- 0.07 us -> 7.10 us +- 0.08 us: 1.01x faster

Benchmark hidden because not significant (27): 2to3, crypto_pyaes, hexiom, html5lib, json_loads, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_pure_python, python_startup, python_startup_no_site, regex_dna, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult, spectral_norm, sympy_expand, sympy_integrate, sympy_sum, sympy_str, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

@brandtbucher
Copy link
Member

Should this be closed in favor of #99331?

@brandtbucher brandtbucher removed their request for review November 18, 2022 21:17
@Fidget-Spinner
Copy link
Member Author

I'll close it when that one's merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants