gh-130704: Strength reduce LOAD_FAST{_LOAD_FAST} #130708

Open · wants to merge 79 commits into main
Conversation

@mpage (Contributor) commented Feb 28, 2025

This PR eliminates most reference counting overhead for references pushed onto the operand stack using LOAD_FAST{_LOAD_FAST} when we can be sure that the reference in the frame outlives the reference that is pushed onto the operand stack. Instructions that meet this criterion are replaced with new variants (LOAD_FAST_BORROW{_LOAD_FAST_BORROW}) that push appropriately tagged borrowed references.
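To make the owned-vs-borrowed distinction concrete, here is a toy tagged-reference model in plain C. This is an illustrative sketch only — the struct, tag value, and function names (`ref_own`, `ref_borrow`, `ref_close`) are hypothetical and do not match CPython's real `_PyStackRef` API, which uses a similar low-bit tag but different helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a mock object with a refcount, and a stack reference
 * whose low pointer bit marks it as borrowed (not counted). */
typedef struct { int refcnt; } Obj;
typedef struct { uintptr_t bits; } StackRef;

#define TAG_BORROWED ((uintptr_t)1)

/* Create an owned reference: pays an incref. */
static StackRef ref_own(Obj *o) {
    o->refcnt++;
    return (StackRef){(uintptr_t)o};
}

/* Create a borrowed reference: just sets the tag bit, no incref. */
static StackRef ref_borrow(StackRef r) {
    return (StackRef){r.bits | TAG_BORROWED};
}

static Obj *ref_obj(StackRef r) {
    return (Obj *)(r.bits & ~TAG_BORROWED);
}

/* Destroy a reference: only owned references pay a decref. */
static void ref_close(StackRef r) {
    if (!(r.bits & TAG_BORROWED)) {
        ref_obj(r)->refcnt--;
    }
}
```

The point of LOAD_FAST_BORROW is exactly the `ref_borrow` path: when the frame's own reference is guaranteed to outlive the stack copy, pushing and popping that copy touches no refcounts at all.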

Performance on the benchmark suite looks good:

  • A ~3% improvement on the free-threaded build. I think this might actually be higher and will continue investigating, but I wanted to put this up for review.
  • A ~2.7% improvement on the default build.

This approach appears to be quite effective, at least on the benchmark suite: roughly 97% of LOAD_FAST{_LOAD_FAST} instructions are optimized, according to pystats. Note that these stats were collected using fastbench, so they may not exactly match those collected using pyperformance.

The main pieces of the PR are:

New bytecodes

This adds two new bytecode instructions: LOAD_FAST_BORROW and its superinstruction form, LOAD_FAST_BORROW_LOAD_FAST_BORROW.

A new optimization pass

This adds a new optimization pass, optimize_load_fast, to the bytecode compiler that identifies and optimizes eligible instructions. Please read the detailed comment in flowgraph.c for a description of how it works.

Runtime support changes

A new function, PyStackRef_Borrow, was added to the stackref API. It creates a new stackref from an existing stackref without incrementing the reference count on the underlying object.

There are a few places in the runtime where we need to convert borrowed references into owned references:

  • When a frame escapes into the heap (i.e. when it is copied into a generator or when it is materialized and unwound).
  • When a reference flows up the call stack (i.e. in RETURN_VALUE or YIELD_VALUE).
  • When someone destroys a reference to the frame "out of band" by poking at f_locals. We place the old reference into a tuple owned by the frame object.
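The conversions above can be sketched with the same kind of toy tagged-reference model. This is a hypothetical illustration of the upgrade step (the real helper in this PR is PyStackRef_MakeHeapSafe; the names and layout below are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the low bit marks a borrowed reference. When a borrowed
 * reference is about to outlive the frame slot backing it (the frame
 * escapes to the heap, the value flows up the call stack, or f_locals
 * clobbers the slot), it must be upgraded to an owned reference. */
typedef struct { int refcnt; } Obj;
typedef struct { uintptr_t bits; } StackRef;
#define TAG_BORROWED ((uintptr_t)1)

static StackRef make_heap_safe(StackRef r) {
    if (r.bits & TAG_BORROWED) {
        Obj *o = (Obj *)(r.bits & ~TAG_BORROWED);
        o->refcnt++;                      /* take ownership... */
        return (StackRef){(uintptr_t)o};  /* ...and clear the tag */
    }
    return r;  /* already owned: nothing to do */
}
```

Upgrading is idempotent: an already-owned reference passes through unchanged, so the conversion can be applied defensively at every escape point.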

The default build also required:

  • Adding support for stackrefs in the GC.
  • Removing reuse of stackrefs in PyFloat_FromDoubleConsumeInputs.

📚 Documentation preview 📚: https://cpython-previews--130708.org.readthedocs.build/

mpage added 30 commits February 28, 2025 11:41
Ref will be 2 if borrowed
Otherwise, it ends up being loaded using `LOAD_FAST_CHECK`, which increfs
and causes the refcount check to fail when it uses `LOAD_FAST_BORROW`.
These need to be tagged appropriately, not just increfed, so that they are decrefed
when the frame is destroyed.
This may be 1 if the `LOAD_FAST` is optimized to a `LOAD_FAST_BORROW`. It's
not clear that this is testing anything useful, so remove it.
The initial value will differ depending on whether an owned or borrowed reference
is loaded onto the operand stack.
These don't push enough values on the stack.
…unconditional_jump_threading`

Make sure we have a statically known stack depth
PyStackRef_AsPyObjectSteal creates a new reference if the stackref
is deferred. This reference is leaked if we deopt before the corresponding
decref.
These may provide support for borrowed references contained in frames
closer to the top of the call stack. Add them to a list attached to the
frame when they are overwritten, to be destroyed when the frame is destroyed.
`STORE_FAST_LOAD_FAST` and `LOAD_FAST_AND_CLEAR` both need to kill the
local.
This ensures we hit all the blocks
@mpage (Contributor, Author) commented Mar 18, 2025

I think the tracerefs buildbot failure is unrelated to this PR. It's failing in the same way for other, unrelated PRs.

@markshannon (Member):

I think this introduces a bug, or rather turns an existing piece of code into a bug.

There is code that assumes that owning a reference to an object whose refcount is one means that the code is free to mutate and reuse the object.
That is fine if the reference is a heap reference, as this PR relies on there being at least one strong reference on the stack, but I'm fairly sure that there is some code (annoyingly, I can't remember which) that makes this assumption for a stack reference.

We should fix that code, not this, but I'd like to have a test for it first.

Comment on lines 1218 to 1219
    _PyStackRef tmp = PyStackRef_MakeHeapSafe(v);
    DEAD(v);
Contributor:

Can we do something like https://gist.github.com/colesbury/ce98264761c831d42d185f758395933e instead?

Having a _PyStackRef that only exists on the C stack worries me especially with the PyObject_CallMethodOneArg() call below.

Contributor (author):

Yeah, good call

Member:

This is a bug in the code generator: you can't write v = PyStackRef_MakeHeapSafe(v);
It's trying to prevent losing a reference in an input variable, but it is too strict.

#131513

Member:

Fixed, you should be able to write

    _PyStackRef tmp = PyStackRef_MakeHeapSafe(v);
    DEAD(v);
    v = tmp;

now.

@mpage (Contributor, Author) commented Mar 20, 2025

Which is fine if that reference is a heap reference, as this PR relies on there being at least one strong reference on the stack, but I'm fairly sure that there is some code (annoyingly I can't remember what code) which makes this assumption for a stack reference.

We should fix that code, not this, but I'd like to have a test for it first.

@markshannon - Are you maybe thinking of this code?

@markshannon (Member) left a comment:

This looks really good overall. A few percent speedup is impressive.

I have a few comments and questions, but nothing blocking.

    static inline _PyStackRef
    PyStackRef_Borrow(_PyStackRef ref)
    {
        return PyStackRef_DUP(ref);
    }
Member:

Is there an issue for adding lifetime checking for this?
If not, could you make one and assign it to either yourself or me?


@@ -801,6 +801,9 @@ def verify_code(self, code_object, *, bytecode_written=False):
data.extend(self.init._pack_uint32(0))
data.extend(self.init._pack_uint32(self.loader.source_mtime))
data.extend(self.init._pack_uint32(self.loader.source_size))
# Make sure theres > 1 reference to code_object so that the
Member:

Suggested change:
-    # Make sure theres > 1 reference to code_object so that the
+    # Make sure there's > 1 reference to code_object so that the

or there is

@@ -1330,15 +1335,6 @@ def test_fold_tuple_of_constants(self):
]
self.cfg_optimization_test(before, after, consts=[], expected_consts=[(1, 2, 3)])

# not enough consts
Member:

Why have you removed these tests?

Contributor (author):

The bytecode is invalid.

("LOAD_FAST", 0, 1),
("LOAD_CONST", 0, 2),
("LOAD_CONST", 0, 3),
("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
Member:

Suggested change:
-    ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
+    ("STORE_FAST_STORE_FAST", ((0 << 4) | 1), 4),

Mentally parsing shifts and bitwise operations is extra work for the reader. Be nice and add parentheses.
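For what it's worth, the two spellings are equivalent in C (and in Python, which has the same relative precedence here): `<<` binds tighter than `|`, so the parentheses are purely for readability. A minimal check, with a hypothetical helper name:

```c
#include <assert.h>

/* In C, << has higher precedence than |, so these two forms match.
 * pack_operands is a hypothetical name for packing two 4-bit
 * operand indices into one oparg byte. */
static int pack_operands(int a, int b) {
    return (a << 4) | b;   /* explicit parentheses */
}
```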

insts = [
("LOAD_FAST", 0, 1),
("LOAD_CONST", 0, 3),
("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
Member:

Suggested change:
-    ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
+    ("STORE_FAST_STORE_FAST", ((0 << 4) | 1), 4),

* non-violating LOAD_FAST{_LOAD_FAST} can be optimized.
*/
static int
optimize_load_fast(cfg_builder *g, bool compute_stackdepth)
Member:

Maybe drop the compute_stackdepth parameter and replace optimize_load_fast(g, true) with calculate_stackdepth(g); optimize_load_fast(g)

}

/*
* Strength reduce LOAD_FAST{_LOAD_FAST} instructions into weaker variants that
Member:

Suggested change:
-    * Strength reduce LOAD_FAST{_LOAD_FAST} instructions into weaker variants that
+    * Strength reduce LOAD_FAST{_LOAD_FAST} instructions into faster variants that

It's true that they are weaker, but the reason we're doing this is that they're faster.


{
if (stack->size == stack->capacity) {
Py_ssize_t new_cap = Py_MAX(32, stack->capacity * 2);
ref *refs = PyMem_Realloc(stack->refs, sizeof(*stack->refs) * new_cap);
Member:

We have a utility for growing arrays in Include/internal/pycore_c_array.h (_Py_c_array_t).

Contributor (author):

I don't think the code ends up being clearer if we use that here.

@@ -41,6 +41,9 @@ PyAPI_FUNC(PyObject*) _PyCompile_OptimizeCfg(
PyObject *consts,
int nlocals);

// Export for '_testinternalcapi' shared extension
PyAPI_FUNC(PyObject*) _PyCompile_OptimizeLoadFast(PyObject *instructions);
Member:

We don't tend to expose individual peephole optimisations, we test them through OptimizeCFG (see DirectCfgOptimizerTests). Is there a reason why you can't do that for this optimization?

@mpage (Contributor, Author) Mar 20, 2025:

Sure, I can add a call to optimize_load_fast into _PyCompile_OptimizeCFG. I wasn't sure if _PyCompile_OptimizeCFG was intended to only test _PyCfg_OptimizeCodeUnit. We want this pass to be called as late as possible in compilation, which is why it is called in _PyCfg_OptimizedCfgToInstructionSequence instead of _PyCfg_OptimizeCodeUnit.

@markshannon (Member):

Which is fine if that reference is a heap reference, as this PR relies on there being at least one strong reference on the stack, but I'm fairly sure that there is some code (annoyingly I can't remember what code) which makes this assumption for a stack reference.
We should fix that code, not this, but I'd like to have a test for it first.

@markshannon - Are you maybe thinking of this code?

No, but that needed fixing as well. Thanks.

I had a look and there doesn't seem to be a way to break anything without using the C API.
There are a couple of places that check whether a parameter has refcnt == 1 and reuse the object.
It was partial_new that I was thinking of.
It would be unsafe if the kwargs were passed directly from the interpreter, but it appears that is impossible, as functools.partial(foo, **d) creates a copy of d before it makes the call.
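The pattern being discussed can be sketched as follows. This is a hypothetical illustration (invented types and names, not partial_new itself) of the "refcnt == 1 means exclusive ownership, so reuse in place" idiom; the hazard is that a borrowed stack reference is invisible to the refcount, so a count of one no longer proves exclusivity for stack-visible objects:

```c
#include <assert.h>

/* Toy illustration of the refcnt == 1 reuse pattern.
 * Obj and reuse_or_copy are hypothetical. */
typedef struct { int refcnt; int data; } Obj;

static Obj *reuse_or_copy(Obj *o, Obj *scratch) {
    if (o->refcnt == 1) {
        /* Assume exclusive ownership: safe to mutate in place
         * ONLY if no uncounted (borrowed) reference exists. */
        return o;
    }
    *scratch = *o;      /* otherwise, work on a private copy */
    scratch->refcnt = 1;
    return scratch;
}
```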

@mpage (Contributor, Author) commented Mar 24, 2025

A note for reviewers: I realized that we need to special case opcodes that leave some of their operands on the stack. This doesn't appear to reduce the effectiveness of the optimization. Stats look roughly the same as the previous version. Performance is a little worse (2.1% for the free-threaded build, 2.5% for the default), but it's hard for me to tell if that's just from noise in the runner or from changes that landed in main since the last time I ran the benchmarks.

@colesbury (Contributor) left a comment:

LGTM

@markshannon (Member) left a comment:

Missing one hint for the cases generator, otherwise looks good.

We could make the analysis more robust by using the cases generator, but that's for a later PR.

@@ -1209,7 +1219,7 @@ dummy_func(
PyGenObject *gen = (PyGenObject *)receiver_o;
_PyInterpreterFrame *gen_frame = &gen->gi_iframe;
STACK_SHRINK(1);
_PyFrame_StackPush(gen_frame, v);
_PyFrame_StackPush(gen_frame, PyStackRef_MakeHeapSafe(v));
Member:

I think you need a DEAD(v); here

break;
}

// We treat opcodes that do not consume all of their inputs on
@markshannon (Member) Mar 26, 2025:

This approach seems fine for now, but the code generator knows exactly how many values are popped and consumed, as opposed to peeked at.

We should add a _PyOpcode_num_peeked function; then we'd have consumed = _PyOpcode_num_popped() - _PyOpcode_num_peeked(), which would be more robust.
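That suggestion could look roughly like this. All names below are hypothetical sketches of the proposal, not existing CPython API (the real popped/pushed counts come out of the cases generator):

```c
#include <assert.h>

/* Hypothetical per-opcode stack-effect metadata: how many items the
 * cases generator reports as popped, and how many of those the opcode
 * merely peeks at and leaves on the stack. The number of references
 * the opcode truly consumes is the difference. */
typedef struct {
    int num_popped;
    int num_peeked;
} OpcodeStackEffect;

static int num_consumed(OpcodeStackEffect e) {
    return e.num_popped - e.num_peeked;
}
```

With this, the optimization pass would not need to special-case opcodes that leave operands on the stack; the table would carry that information directly.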
