gh-89279: In ceval.c, redefine some macros for speed #32387

Merged
merged 27 commits into from
Apr 22, 2022

gvanrossum
Member

@gvanrossum gvanrossum commented Apr 7, 2022

In particular, Py_DECREF, Py_XDECREF and Py_IS_TYPE are redefined as macros that completely replace the inline functions of the same name. These three came out in the top four of functions that (in MSVC) somehow weren't inlined. (There was another, _Py_atomic_load_32bit_impl, but it is more difficult.)

I got the top-N list using the recipe from this bpo message.
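
As an illustration of the replacement described above, here is a minimal sketch with a toy object type. `ToyObject`, `TOY_DECREF` and the helper functions are hypothetical names invented for this example, not CPython code. It shows a `static inline` function being superseded later in the same file by a macro of the same name, so that the preprocessor does the expansion and the compiler's inlining heuristics no longer matter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PyObject; every name in this block is hypothetical. */
typedef struct ToyObject {
    long refcnt;
    void (*dealloc)(struct ToyObject *);
} ToyObject;

static int toy_dealloc_calls = 0;

static void toy_dealloc(ToyObject *op) {
    (void)op;
    toy_dealloc_calls++;
}

/* "Header" version: a static inline function that the compiler is
   free not to inline. */
static inline void TOY_DECREF(ToyObject *op) {
    if (--op->refcnt == 0) {
        op->dealloc(op);
    }
}

/* Drops n references using the inline-function version. */
static int toy_drop_with_function(long start, long n) {
    ToyObject obj = { start, toy_dealloc };
    toy_dealloc_calls = 0;
    for (long i = 0; i < n; i++) {
        TOY_DECREF(&obj);
    }
    return toy_dealloc_calls;
}

/* Hot-file override: from this point on, TOY_DECREF(x) is expanded
   textually by the preprocessor instead of being a function call. */
#define TOY_DECREF(arg)                               \
    do {                                              \
        ToyObject *toy_tmp_ = (ToyObject *)(arg);     \
        if (--toy_tmp_->refcnt == 0) {                \
            toy_tmp_->dealloc(toy_tmp_);              \
        }                                             \
    } while (0)

/* Same loop, but each TOY_DECREF is now the macro expansion. */
static int toy_drop_with_macro(long start, long n) {
    ToyObject obj = { start, toy_dealloc };
    toy_dealloc_calls = 0;
    for (long i = 0; i < n; i++) {
        TOY_DECREF(&obj);
    }
    return toy_dealloc_calls;
}
```

Both helpers report the same number of deallocations for the same inputs. The macro copies its argument into a local first so that the argument is evaluated only once.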

#89279

Member

@vstinner vstinner left a comment

In C with well defined scopes, it's fine to reuse the same variable name in nested scopes ("op" in this case). It's just more annoying to debug.

C code:

#include <stdio.h>

int main()
{
    int x = 1;
    {
        int x = 2;
        {
            int x = 3;
            printf("x = %i\n", x);
        }
        printf("x = %i\n", x);
    }
    printf("x = %i\n", x);
    return 0;
}

Output:

x = 3
x = 2
x = 1

@gvanrossum
Member Author

In C with well defined scopes, it's fine to reuse the same variable name in nested scopes ("op" in this case). It's just more annoying to debug.

And yet, in XDECREF when I renamed op1 to op, things went badly. I think it's because the Py_DECREF(op) call expands to { PyObject *op = op; ... } and that doesn't work. :-)
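
A minimal sketch of the naming problem described above, using a toy type. `Node`, `NODE_DECREF` and `NODE_XDECREF` are hypothetical names for illustration only. When one macro expands another, the temporaries need distinct names, because in `T *op = op;` the right-hand `op` already refers to the new, uninitialized variable.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { long refcnt; } Node;   /* hypothetical toy type */

/* Inner macro: its temporary is named dec_op_. */
#define NODE_DECREF(arg)                 \
    do {                                 \
        Node *dec_op_ = (arg);           \
        dec_op_->refcnt--;               \
    } while (0)

/* Outer macro: its temporary must NOT also be called dec_op_.
   If it were, the inner expansion would read
       Node *dec_op_ = (dec_op_);
   and the initializer would name the variable being declared,
   not the outer one. */
#define NODE_XDECREF(arg)                \
    do {                                 \
        Node *xdec_op_ = (arg);          \
        if (xdec_op_ != NULL) {          \
            NODE_DECREF(xdec_op_);       \
        }                                \
    } while (0)

/* Decrements once through the outer macro; a NULL argument is skipped. */
static long node_demo(long start) {
    Node n = { start };
    Node *none = NULL;
    NODE_XDECREF(&n);
    NODE_XDECREF(none);
    return n.refcnt;
}
```

This differs from the nested-block example earlier in the thread: there each inner `x` is initialized from a constant, so the shadowed outer variable is never needed on the right-hand side.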

I've done the rest you've asked, and merged origin/main.

@gvanrossum
Member Author

gvanrossum commented Apr 7, 2022

PS. Any suggestion on how to handle the -_Py_atomic_load_32bit_impl inline function? It's called via many levels of macros, from COMPUTE_EVAL_BREAKER, _Py_FinishPendingCalls, eval_frame_handle_pending, CHECK_EVAL_BREAKER and the big switch case TARGET(RESUME_QUICK). All these actually call _Py_atomic_load_relaxed.

The compiler says there are 24 separate places where this fails to inline just in _PyEval_EvalFrameDefault (that's the only function for which I collect these stats ATM).

@gvanrossum
Member Author

PS. Any suggestion on how to handle the -_Py_atomic_load_32bit_impl inline function?

I took care of this, it was relatively straightforward since the invocations in ceval.c all go through macros that in the end call _Py_atomic_load_32bit_impl. So I just #define that.

So that the assert() in the macro will be checked in debug mode.
Member

@vstinner vstinner left a comment

The Python/ceval.c changes LGTM, but I cannot review the 2 other modified files.

Python/ceval.c Outdated
Comment on lines 48 to 49
// bpo-45116: The MSVC compiler fails to inline these in PGO build,
// and they're kind of important for performance.
You might explain that MSVC can inline them, but it's just an arbitrary limit which cannot be configured.

neonene wrote "PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO."

I suggest something like:

Suggested change
// bpo-45116: The MSVC compiler fails to inline these in PGO build,
// and they're kind of important for performance.
// bpo-45116: The MSVC compiler does not inline these 3 static inline functions in PGO build
// in _PyEval_EvalFrameDefault(), because this function is over the limit of PGO,
// and these limits cannot be configured. Define them as macros to make sure that
// they are always inlined by the preprocessor.

Maybe tomorrow, MSVC PGO limits will change or become configurable, and it will become possible to remove these macros on more recent MSVC versions.

@vstinner
Member

vstinner commented Apr 8, 2022

I would prefer to not redefine _Py_atomic_load_relaxed() in ceval.c. PEP 7 was updated to require C11 to build Python 3.11, but C11 without optional features: without atomic functions/variables. Python still requires pycore_atomic.h with an implementation which depends on the C compiler and on the platform. Maybe tomorrow we will be able to use C11 atomic variables, but I don't think that it's possible right now :-(

@gvanrossum
Member Author

I would prefer to not redefine _Py_atomic_load_relaxed() in ceval.c. PEP 7 was updated to require C11 to build Python 3.11, but C11 without optional features: without atomic functions/variables. Python still requires pycore_atomic.h with an implementation which depends on the C compiler and on the platform. Maybe tomorrow we will be able to use C11 atomic variables, but I don't think that it's possible right now :-(

Oh, I see. The simple version of _Py_atomic_load_relaxed() that I found by tracing through the macros in pycore_atomic.h only works for MSVC (and only for 32-bit values, which int is on that platform). I'll propose a more restricted version.

@gvanrossum
Member Author

Using the techniques I developed in faster-cpython/ideas#321 I couldn't find any uses of Py_INCREF in _PyEval_EvalFrameDefault in the PGO build from main/head of python311.dll (x64), so that may have been a false alarm? I did find many calls to Py_DECREF and Py_IS_TYPE, and quite a few to _Py_atomic_load_32bit_impl. It's a bit confusing because the disassembly seems to contain three separate definitions of Py_DECREF (but none of Py_INCREF, and only one of Py_IS_TYPE).

@neonene
Contributor

neonene commented Apr 13, 2022

When we take an inlining log at profiling time (PGInstrument):

    Py_INCREF (pgi always inline)
    Py_TYPE (pgi always inline)
    Py_SIZE (pgi always inline)
    -Py_DECREF (initial scan: not a PGO PGI candidate)
    -Py_XDECREF (initial scan: not a PGO PGI candidate)
    -Py_IS_TYPE (initial scan: not a PGO PGI candidate)

Py_INCREF does not need to be profiled and always gets inlined even on PGInstrument builds. With the /OPT:REF linker option, its definitions disappear from the COMDAT, and three definitions remain for Py_DECREF. All definitions are kept when /OPT:NOREF is specified in pyproject.props.

@gvanrossum
Member Author

Thanks for explaining what /OPT:REF and /OPT:NOREF do, that's helpful (I now understand that there are multiple copies of Py_DECREF because the compiler created different static versions of it for different files).

I have a new slight mystery. I wrote a script that scans build.log and prints the top-10 most common non-inlined functions (in my 'inline' branch):

    6. -_PyErr_Clear (pgo hard reject)
    6. -gc_list_remove (pgo hard reject)
    7. -format_exc_check_arg (pgo hard reject)
    8. -PyObject_GetItem (pgo hard reject)
   10. -PyErr_SetString (pgo hard reject)
   11. -_PyErr_SetString (pgo hard reject)
   12. -_PyErr_ExceptionMatches (pgo hard reject)
   15. -_PyErr_Format (initial scan: varargs, not eligible)
   15. -insert_to_freepool (pgo hard reject)
   30. -PyMem_RawCalloc (pgo hard reject)
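
The script itself is not shown in this thread. The following is only a rough sketch of the counting step, under the assumption that each rejected call site appears in the log as an indented line of the form `-name (reason)`, as in the excerpts quoted in this conversation. `Tally`, `tally_rejects`, `tally_lookup` and `tally_demo` are hypothetical names. Sorting and printing the top ten are left out.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAMES 64
#define NAME_LEN  64

typedef struct {
    char name[NAME_LEN];
    int count;
} Tally;

/* Tallies lines shaped like "<indent>-<name> (<reason>)".
   Returns the number of distinct names written to `out`. */
static int tally_rejects(const char **lines, int nlines, Tally *out) {
    int n = 0;
    for (int i = 0; i < nlines; i++) {
        const char *p = lines[i];
        while (*p == ' ' || *p == '\t') {
            p++;                              /* skip indentation */
        }
        if (*p != '-') {
            continue;                         /* assumed: no '-' means inlined */
        }
        p++;
        size_t len = strcspn(p, " (");        /* name ends at a space or '(' */
        if (len == 0 || len >= NAME_LEN) {
            continue;
        }
        int j = 0;
        while (j < n && !(strlen(out[j].name) == len &&
                          strncmp(out[j].name, p, len) == 0)) {
            j++;
        }
        if (j == n) {                         /* first time this name is seen */
            if (n == MAX_NAMES) {
                continue;                     /* table full: ignore new names */
            }
            memcpy(out[n].name, p, len);
            out[n].name[len] = '\0';
            out[n].count = 0;
            n++;
        }
        out[j].count++;
    }
    return n;
}

/* Returns the tally for `name`, or 0 if it never appeared. */
static int tally_lookup(const Tally *t, int n, const char *name) {
    for (int i = 0; i < n; i++) {
        if (strcmp(t[i].name, name) == 0) {
            return t[i].count;
        }
    }
    return 0;
}

/* Runs the tally over a few lines shaped like the excerpts in this thread. */
static int tally_demo(const char *name) {
    const char *lines[] = {
        "    arena_map_get (pgu decision)",
        "      -PyMem_RawCalloc (pgo hard reject)",
        "      -PyMem_RawCalloc (pgo hard reject)",
        "    -_PyErr_Clear (pgo hard reject)",
    };
    Tally t[MAX_NAMES];
    int n = tally_rejects(lines, 4, t);
    return tally_lookup(t, n, name);
}
```

A real build.log may use a different layout, so the parsing rule would need to be checked against actual compiler output.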

Looking at the last one, PyMem_RawCalloc, it seems this results from a long chain of inlining decisions, like this:

    object_dealloc (pgu decision)
      Py_TYPE (pgu decision)
      PyObject_Free (pgu decision)
        _PyObject_Free (pgu decision)
          pymalloc_free (pgu decision)
            address_in_range (pgu decision)
              arena_map_is_used (pgu decision)
                arena_map_get (pgu decision)
                  -PyMem_RawCalloc (pgo hard reject)
                  -PyMem_RawCalloc (pgo hard reject)

This seems rather silly, especially since we're obviously freeing something and there's a code path that allocates. I am at a loss where exactly object_dealloc is called in _PyEval_EvalFrameDefault -- it is a static function in typeobject.c that is only called via the tp_dealloc slot of objects that don't override it. In this branch, in ceval.c, Py_DECREF is a macro that calls Py_TYPE(op)->tp_dealloc(op). So the PGO/LTO must have recognized that many times the tp_dealloc slot points to object_dealloc, and somehow does something to inline the call if it does? I have not been able to find this code in the disassembly (at least not for _PyEval_EvalFrameDefault).

@sweeneyde
Member

Hm, address sanitizer doesn't like ((destructor)PyObject_Free)(op), since PyObject_Free takes void *. Maybe it would prefer destructor d = (destructor)PyObject_Free; d(op);?
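
For context, calling a function through a pointer of an incompatible type is undefined behavior in C, which is why a sanitizer can flag it. One conforming alternative, sketched here with toy types, is a thin adapter whose signature matches the pointer type exactly. This is not necessarily what the PR ended up doing. `Obj`, `destructor_fn`, `raw_free` and `raw_free_as_destructor` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Obj { int freed; } Obj;        /* hypothetical toy object */
typedef void (*destructor_fn)(Obj *);         /* plays the role of `destructor` */

/* Takes void *, like PyObject_Free; here it only records the call. */
static void raw_free(void *p) {
    ((Obj *)p)->freed = 1;
}

/* Adapter with exactly the destructor_fn signature, so no
   function-pointer cast is needed at the call site. */
static void raw_free_as_destructor(Obj *op) {
    raw_free(op);
}

/* Calls the adapter through a destructor_fn pointer. */
static int demo_destructor_call(void) {
    Obj o = { 0 };
    destructor_fn d = raw_free_as_destructor;
    d(&o);
    return o.freed;
}
```

The cost is one extra function that the compiler must itself inline, which may matter in a function where inlining is already the concern.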

@gvanrossum
Member Author

I'm running out of time today. I will apply @vstinner's improvement of the macros later this week. (If you create a comment with a suggestion I can do it quicker.)

@sweeneyde I'm not sure that it matters much whether _Py_Dealloc is inlined or not (and there's no simple way to measure it directly). But what do you think of @neonene's suggestion to add Py_ALWAYS_INLINE to _Py_DECREF_SPECIALIZED?

@sweeneyde
Member

But what do you think of @neonene's suggestion to add Py_ALWAYS_INLINE to _Py_DECREF_SPECIALIZED?

Sounds reasonable to me, and it would be nice to not have to worry about different casting between the function and the macro, but I don't fully understand what (IncPGO create callsites failed) means, or why that case is different from the others.

@gvanrossum
Member Author

I don't fully understand what (IncPGO create callsites failed) means, or why that case is different from the others.

Me neither, but I'm guessing it might have been caused by a shortcut I took where instead of running PCbuild\build.bat --pgo I used PCbuild\build.bat -c PGUpdate which reuses previously collected profiling data. That works (and runs much faster) but may not apply exactly the same inlining as a fresh run. I'm guessing the PGO machinery detects this situation and its internal name is "IncPGO" for "Incremental PGO". I'll try not to report further results from such runs.

@gvanrossum
Member Author

A full run where I had restored the _Py_Dealloc call in my Py_DECREF macro reported this:

  196. -_Py_Dealloc (pgo hard reject)

so I conclude that I should keep that expanded.

@gvanrossum
Member Author

With Py_ALWAYS_INLINE instead of the macro for Py_DECREF_SPECIALIZED, I see these new entries:

   38. -insert_to_freepool (pgo hard reject)
   76. -PyMem_RawCalloc (pgo hard reject)

I conclude that the macro is still a good idea.

So the only thing left on my to-do list for this PR is Victor's beautification of the macros.

@gvanrossum
Member Author

With this PR as is, I just see:

   28. -PyMem_RawCalloc (pgo hard reject)

Member

@vstinner vstinner left a comment

My suggestion to reformat the macros on multiple lines, now as a GitHub suggestion (and with _Py_DECREF_SPECIALIZED).

Co-authored-by: Victor Stinner <vstinner@python.org>
Member

@vstinner vstinner left a comment

LGTM, even if I didn't understand everything. But I trust the ones who ran extensive tests on Python benchmarks with different compiler options.

There are many compiler warnings. Are they related to the change?

GitHub Actions / Address sanitizer
Python/ceval.c#L2408
function called through a non-compatible type

Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
@vstinner
Member

I took the liberty of fixing two obvious bugs.

@vstinner
Member

macOS failure is unrelated:

FAIL: test_get_server_certificate_timeout (test.test_ssl.SimpleBackgroundTests.test_get_server_certificate_timeout)

@neonene
Contributor

neonene commented Apr 22, 2022

I now think that macrofying here is good only for stable APIs. _Py_DECREF_SPECIALIZED can change the spec for best performance, which is difficult to test here.

With Py_ALWAYS_INLINE instead of the macro for Py_DECREF_SPECIALIZED, I see these new entries:

   38. -insert_to_freepool (pgo hard reject)
   76. -PyMem_RawCalloc (pgo hard reject)

Maybe some pragma can fix this. I'll check the entries.

@gvanrossum
Member Author

I don’t see the problem with …DECREF_SPECIALIZED. Its semantics are very straightforward. @sweeneyde WDYT?

@sweeneyde
Member

I don’t see the problem with …DECREF_SPECIALIZED. Its semantics are very straightforward. @sweeneyde WDYT?

I think it would be sufficient to add a comment with something like "if this changes, the macro version in ceval.c should change accordingly".

@sweeneyde
Member

Though I do agree that _Py_DECREF_SPECIALIZED is not likely to change semantics, particularly not the non-debug version.

@neonene
Contributor

neonene commented Apr 22, 2022

Then, I'm fine.
Sorry for the noise.

Member

@vstinner vstinner left a comment

LGTM.

@gvanrossum gvanrossum merged commit 2f233fc into python:main Apr 22, 2022
@gvanrossum
Member Author

Thanks everyone for the help!

Labels
performance Performance or resource usage
7 participants