gh-130704: Strength reduce LOAD_FAST{_LOAD_FAST}
#130708
base: main
Conversation
Ref will be 2 if borrowed
Otherwise, it ends up being loaded using `LOAD_FAST_CHECK`, which increfs and causes the refcount check to fail when it uses `LOAD_FAST_BORROW`.
These need to be tagged appropriately, not just increfed, so that they are decrefed when the frame is destroyed.
This may be 1 if the `LOAD_FAST` is optimized to a `LOAD_FAST_BORROW`. It's not clear that this is testing anything useful, so remove it.
The initial value will differ depending on whether an owned or borrowed reference is loaded onto the operand stack.
These don't push enough values on the stack.
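The refcount sensitivity these comments describe can be sketched with a hypothetical check (illustrative only; `refcount_of_local` is not from the PR). Tests that assert exact refcounts of locals become fragile once a `LOAD_FAST` may be strength-reduced to a borrowing variant:

```python
import sys

def refcount_of_local():
    obj = object()
    # sys.getrefcount counts its own argument as one extra reference.
    # An exact-value assertion here can break when the interpreter pushes
    # a borrowed reference for `obj` instead of an owned one, so tests
    # should prefer inequalities over exact counts.
    return sys.getrefcount(obj)

assert refcount_of_local() >= 1
```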
…unconditional_jump_threading` Make sure we have a statically known stack depth
…mized to borrowed variants
PyStackRef_AsPyObjectSteal creates a new reference if the stackref is deferred. This reference is leaked if we deopt before the corresponding decref.
These may provide support for borrowed references contained in frames closer to the top of the call stack. Add them to a list attached to the frame when they are overwritten, to be destroyed when the frame is destroyed.
`STORE_FAST_LOAD_FAST` and `LOAD_FAST_AND_CLEAR` both need to kill the local.
This ensures we hit all the blocks.
I think the tracerefs buildbot failure is unrelated to this PR. It's failing in the same way for other, unrelated PRs.
I think this introduces a bug, or rather turns an existing piece of code into a bug. There is code that assumes that owning a reference to an object whose refcount is one means that code is free to mutate and reuse the object. We should fix that code, not this, but I'd like to have a test for it first.
Python/bytecodes.c (outdated):

    _PyStackRef tmp = PyStackRef_MakeHeapSafe(v);
    DEAD(v);
Can we do something like https://gist.github.com/colesbury/ce98264761c831d42d185f758395933e instead?
Having a `_PyStackRef` that only exists on the C stack worries me, especially with the `PyObject_CallMethodOneArg()` call below.
Yeah, good call
This is a bug in the code generator: you can't write `v = PyStackRef_MakeHeapSafe(v);`. It's trying to prevent losing a reference in an input variable, but it is too strict.
Fixed, you should now be able to write:

    _PyStackRef tmp = PyStackRef_MakeHeapSafe(v);
    DEAD(v);
    v = tmp;
@markshannon - Are you maybe thinking of this code?
This looks really good overall. A few percent speedup is impressive.
I have a few comments and questions, but nothing blocking.
static inline _PyStackRef
PyStackRef_Borrow(_PyStackRef ref)
{
    return PyStackRef_DUP(ref);
}
Is there an issue for adding lifetime checking for this?
If not, could you make one and assign it to either yourself or me?
Lib/test/test_importlib/test_abc.py (outdated):

@@ -801,6 +801,9 @@ def verify_code(self, code_object, *, bytecode_written=False):
     data.extend(self.init._pack_uint32(0))
     data.extend(self.init._pack_uint32(self.loader.source_mtime))
     data.extend(self.init._pack_uint32(self.loader.source_size))
+    # Make sure theres > 1 reference to code_object so that the
Suggested change:
-    # Make sure theres > 1 reference to code_object so that the
+    # Make sure there's > 1 reference to code_object so that the
Or "there is".
@@ -1330,15 +1335,6 @@ def test_fold_tuple_of_constants(self):
        ]
        self.cfg_optimization_test(before, after, consts=[], expected_consts=[(1, 2, 3)])

        # not enough consts
Why have you removed these tests?
The bytecode is invalid.
Lib/test/test_peepholer.py (outdated):

    ("LOAD_FAST", 0, 1),
    ("LOAD_CONST", 0, 2),
    ("LOAD_CONST", 0, 3),
    ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
Suggested change:
-    ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
+    ("STORE_FAST_STORE_FAST", ((0 << 4) | 1), 4),
Mentally parsing shifts and bitwise operations is extra work for the reader. Be nice and add parentheses.
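The precedence itself is not in question; `<<` binds tighter than `|` in both C and Python, so the parentheses change only readability:

```python
# Both spellings produce the same oparg; the parenthesized one is
# simply easier to scan.
assert 0 << 4 | 1 == (0 << 4) | 1 == 1
assert 3 << 4 | 1 == (3 << 4) | 1 == 49
```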
Lib/test/test_peepholer.py (outdated):

    insts = [
        ("LOAD_FAST", 0, 1),
        ("LOAD_CONST", 0, 3),
        ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
Suggested change:
-    ("STORE_FAST_STORE_FAST", 0 << 4 | 1, 4),
+    ("STORE_FAST_STORE_FAST", ((0 << 4) | 1), 4),
Misc/NEWS.d/next/Core_and_Builtins/2025-02-28-11-29-35.gh-issue-130704.7RDVLE.rst
Python/flowgraph.c (outdated):

 * non-violating LOAD_FAST{_LOAD_FAST} can be optimized.
 */
static int
optimize_load_fast(cfg_builder *g, bool compute_stackdepth)
Maybe drop the `compute_stackdepth` parameter and replace `optimize_load_fast(g, true)` with `calculate_stackdepth(g); optimize_load_fast(g)`.
Python/flowgraph.c (outdated):

}

/*
 * Strength reduce LOAD_FAST{_LOAD_FAST} instructions into weaker variants that
Suggested change:
- * Strength reduce LOAD_FAST{_LOAD_FAST} instructions into weaker variants that
+ * Strength reduce LOAD_FAST{_LOAD_FAST} instructions into faster variants that
It's true that they are weaker, but the reason we're doing this is that they're faster.
{
    if (stack->size == stack->capacity) {
        Py_ssize_t new_cap = Py_MAX(32, stack->capacity * 2);
        ref *refs = PyMem_Realloc(stack->refs, sizeof(*stack->refs) * new_cap);
We have a utility for growing arrays in Include/internal/pycore_c_array.h (_Py_c_array_t).
I don't think the code ends up being clearer if we use that here.
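For readers unfamiliar with the pattern, the realloc call above implements amortized doubling with a minimum capacity. A rough Python analogue of the growth policy (illustrative only; `RefStack` is not CPython code):

```python
class RefStack:
    """Mirrors the grow-by-doubling policy of the PyMem_Realloc snippet."""

    def __init__(self):
        self.refs = []      # Python lists amortize growth internally anyway
        self.capacity = 0

    def push(self, ref):
        if len(self.refs) == self.capacity:
            # Same policy as Py_MAX(32, capacity * 2) in the C code.
            self.capacity = max(32, self.capacity * 2)
        self.refs.append(ref)

stack = RefStack()
for i in range(100):
    stack.push(i)
assert stack.capacity == 128 and len(stack.refs) == 100
```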
Include/internal/pycore_flowgraph.h (outdated):

@@ -41,6 +41,9 @@ PyAPI_FUNC(PyObject*) _PyCompile_OptimizeCfg(
     PyObject *consts,
     int nlocals);
+
+// Export for '_testinternalcapi' shared extension
+PyAPI_FUNC(PyObject*) _PyCompile_OptimizeLoadFast(PyObject *instructions);
We don't tend to expose individual peephole optimisations, we test them through OptimizeCFG (see DirectCfgOptimizerTests). Is there a reason why you can't do that for this optimization?
Sure, I can add a call to `optimize_load_fast` into `_PyCompile_OptimizeCFG`. I wasn't sure if `_PyCompile_OptimizeCFG` was intended to only test `_PyCfg_OptimizeCodeUnit`. We want this pass to be called as late as possible in compilation, which is why it is called in `_PyCfg_OptimizedCfgToInstructionSequence` instead of `_PyCfg_OptimizeCodeUnit`.
No, but that needed fixing as well. Thanks. I had a look and there doesn't seem to be a way to break anything without using the C API.
Not enough items on stack
A note for reviewers: I realized that we need to special case opcodes that leave some of their operands on the stack. This doesn't appear to reduce the effectiveness of the optimization. Stats look roughly the same as the previous version. Performance is a little worse (2.1% for the free-threaded build, 2.5% for the default), but it's hard for me to tell if that's just from noise in the runner or from changes that landed in main since the last time I ran the benchmarks.
LGTM
Missing one hint for the cases generator, otherwise looks good.
We could make the analysis more robust by using the cases generator, but that's for a later PR.
@@ -1209,7 +1219,7 @@ dummy_func(
     PyGenObject *gen = (PyGenObject *)receiver_o;
     _PyInterpreterFrame *gen_frame = &gen->gi_iframe;
     STACK_SHRINK(1);
-    _PyFrame_StackPush(gen_frame, v);
+    _PyFrame_StackPush(gen_frame, PyStackRef_MakeHeapSafe(v));
I think you need a `DEAD(v);` here.
        break;
    }

    // We treat opcodes that do not consume all of their inputs on
This approach seems fine for now, but the code generator knows exactly how many values are popped and consumed, as opposed to peeked at. We should add a `_PyOpcode_num_peeked` function; then we'd have `consumed = _PyOpcode_num_popped() - _PyOpcode_num_peeked()`, which would be more robust.
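The suggestion above could be sketched like this (hypothetical; `_PyOpcode_num_peeked` does not exist yet, and the per-opcode numbers are illustrative only):

```python
# Hypothetical per-opcode metadata: how many stack items an opcode pops
# versus how many it merely peeks at (i.e. leaves on the stack).
NUM_POPPED = {"BINARY_OP": 2, "SEND": 1}
NUM_PEEKED = {"BINARY_OP": 0, "SEND": 1}

def num_consumed(opname):
    # Only consumed inputs can be borrowed away safely; peeked inputs
    # remain live on the operand stack after the instruction runs.
    return NUM_POPPED[opname] - NUM_PEEKED[opname]

assert num_consumed("BINARY_OP") == 2
assert num_consumed("SEND") == 0
```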
This PR eliminates most reference counting overhead for references pushed onto the operand stack using `LOAD_FAST{_LOAD_FAST}` when we can be sure that the reference in the frame outlives the reference that is pushed onto the operand stack. Instructions that meet this criterion are replaced with new variants (`LOAD_FAST_BORROW{_LOAD_FAST_BORROW}`) that push appropriately tagged borrowed references.

Performance on the benchmark suite looks good. The approach also looks quite effective at optimization, at least on the benchmark suite: roughly 97% of `LOAD_FAST{_LOAD_FAST}` instructions are optimized according to pystats. Note that these stats were collected using fastbench, so they may not match those collected using `pyperformance` exactly.

The main pieces of the PR are:
New bytecodes

This adds two new bytecode instructions: `LOAD_FAST_BORROW` and its superinstruction form, `LOAD_FAST_BORROW_LOAD_FAST_BORROW`.

A new optimization pass
This adds a new optimization pass, `optimize_load_fast`, to the bytecode compiler that identifies and optimizes eligible instructions. Please read the detailed comment in `flowgraph.c` for a description of how it works.

Runtime support changes
A new function, `PyStackRef_Borrow`, was added to the stackref API. It creates a new stackref from an existing stackref without incrementing the reference count on the underlying object.

There are a few places in the runtime where we need to convert borrowed references into owned references: when a reference leaves the frame (e.g. via `RETURN_VALUE` or `YIELD_VALUE`), and when a local is overwritten while visible through `f_locals`. We place the old reference into a tuple owned by the frame object.

The default build also required changes to `PyFloat_FromDoubleConsumeInputs`.

LOAD_FAST variants #130704

📚 Documentation preview 📚: https://cpython-previews--130708.org.readthedocs.build/
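A quick way to see the optimization in action is via `dis` (a sketch; on interpreters without this change the plain `LOAD_FAST` names appear instead):

```python
import dis

def add(x, y):
    return x + y

ops = [i.opname for i in dis.get_instructions(add)]
# On builds with this PR, loading x and y may appear as LOAD_FAST_BORROW
# (or the combined LOAD_FAST_BORROW_LOAD_FAST_BORROW superinstruction);
# every interpreter emits some LOAD_FAST-family opcode here.
assert any(op.startswith("LOAD_FAST") for op in ops)
```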