
gh-118926: Deferred reference counting GC changes for free threading #121318

Closed
wants to merge 28 commits

Conversation

Fidget-Spinner
Member

@Fidget-Spinner Fidget-Spinner commented Jul 3, 2024

This PR mainly introduces changes to the free-threading GC to support deferred reference counting in the future.

To make this work, new stack references must be written to the stack immediately, with no intervening Py_DECREF or escaping code between creating them and putting them on the stack. This ensures they are visible to the GC.

This also NULLs out the rest of the stack, because the GC scans the entire stack: the stack pointer may be inconsistent across escaping calls (including those to Py_DECREF), so the whole stack has to be scanned.
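As a rough illustration of that invariant (a hypothetical Python sketch, not CPython's actual C code): because the stack pointer may be stale during escaping calls, the collector scans every slot of the frame's stack, so every unused slot must be NULL for the scan to be safe.

```python
def scan_frame_stack(stack_slots):
    """Visit every live reference on a frame's stack.

    The stack pointer may be inconsistent during escaping calls, so the
    whole stack is scanned; unused slots must be NULL (None here) so the
    scan can skip them safely.
    """
    visible = []
    for ref in stack_slots:
        if ref is not None:  # NULLed-out slots are skipped safely
            visible.append(ref)
    return visible
```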

This PR removes the temporary immortalization introduced previously in #117783. I wanted to do this in a separate PR, but the only way to test this properly is to remove that hack. So it has to be bundled in this PR.

Finally, this PR fixes a few bugs in steals and borrows. This was only caught by the GC changes, not by the debugger that I was working on. Since these are untestable without the GC changes, I bundled them in.

Temporary perf regressions (6 threads):

object_cfunction PASSED: 3.6x faster
cmodule_function FAILED: 2.1x faster (expected: LOAD_ATTR from dict)
generator PASSED: 3.7x faster
pymethod FAILED: 2.5x faster (expected: LOAD_ATTR from pytype_lookup)
pyfunction FAILED: 2.9x faster (expected: LOAD_GLOBAL)
module_function FAILED: 2.9x faster (expected: LOAD_ATTR from module dict)
load_string_const PASSED: 4.0x faster
load_tuple_const PASSED: 3.8x faster
create_closure FAILED: 2.3x faster (unsure why)
create_pyobject FAILED: 1.5x faster (expected: LOAD_GLOBAL)

@Fidget-Spinner
Member Author

!buildbot nogil

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 8cb139f 🤖

The command will test the builders whose names match the following regular expression: nogil

The builders matched are:

  • AMD64 Ubuntu NoGIL PR
  • aarch64 Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel ASAN NoGIL PR
  • AMD64 Ubuntu NoGIL Refleaks PR
  • AMD64 Fedora Rawhide NoGIL refleaks PR
  • aarch64 Fedora Rawhide NoGIL refleaks PR
  • PPC64LE Fedora Rawhide NoGIL refleaks PR
  • AMD64 Windows Server 2022 NoGIL PR
  • ARM64 MacOS M1 Refleaks NoGIL PR
  • PPC64LE Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel NoGIL PR
  • ARM64 MacOS M1 NoGIL PR
  • AMD64 Fedora Rawhide NoGIL PR

@Fidget-Spinner
Member Author

!buildbot nogil

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 700c2fd 🤖

The command will test the builders whose names match the following regular expression: nogil

The builders matched are:

  • AMD64 Ubuntu NoGIL PR
  • aarch64 Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel ASAN NoGIL PR
  • AMD64 Ubuntu NoGIL Refleaks PR
  • AMD64 Fedora Rawhide NoGIL refleaks PR
  • aarch64 Fedora Rawhide NoGIL refleaks PR
  • PPC64LE Fedora Rawhide NoGIL refleaks PR
  • AMD64 Windows Server 2022 NoGIL PR
  • ARM64 MacOS M1 Refleaks NoGIL PR
  • PPC64LE Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel NoGIL PR
  • ARM64 MacOS M1 NoGIL PR
  • AMD64 Fedora Rawhide NoGIL PR

@Fidget-Spinner
Member Author

!buildbot nogil

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit cde931d 🤖

The command will test the builders whose names match the following regular expression: nogil

The builders matched are:

  • AMD64 Ubuntu NoGIL PR
  • aarch64 Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel ASAN NoGIL PR
  • AMD64 Ubuntu NoGIL Refleaks PR
  • AMD64 Fedora Rawhide NoGIL refleaks PR
  • aarch64 Fedora Rawhide NoGIL refleaks PR
  • PPC64LE Fedora Rawhide NoGIL refleaks PR
  • AMD64 Windows Server 2022 NoGIL PR
  • ARM64 MacOS M1 Refleaks NoGIL PR
  • PPC64LE Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel NoGIL PR
  • ARM64 MacOS M1 NoGIL PR
  • AMD64 Fedora Rawhide NoGIL PR

@markshannon
Member

markshannon commented Jul 15, 2024

One possibility to keep things moving for the free-threading build is to generate a different generated_cases.h for the free-threading interpreter.
For that case, you could write all values to the stack before any decrefs or escaping calls.

@Fidget-Spinner
Member Author

It might be quite a lot of work, but I really think we should mark any escaping calls, so that the code generator knows what escapes. DECREF_INPUTS is already handled by the code generator, so no changes to the code should be necessary in bytecodes.c regarding reference count operations.

Ok I can port your old PR over to this one. I presume we're flushing the values to the stack before every escaping call?

@colesbury
Contributor

@markshannon, writing all values to the stack is going to introduce an unnecessary performance regression.

This is holding up the free-threaded work for what seems like small aesthetic complaints. If the concern is about maintainer confusion, we can add comments or otherwise document the dozen or so places where this is used.

@markshannon
Member

This is not just aesthetics.
Maintainability is important. By degrading maintainability, you are making more work for others. Particularly for my team.

You say that "writing all values to the stack will be slow". That is only true for the parts of the stack that are not already in memory.
Which is why it is important to leave the choice of which parts of the stack to spill to memory and which to keep in registers up to the code generator, as it will differ for different tiers and different platforms.

the dozen or so places where this is used.

There are a hundred or more places where execution can escape from the interpreter, not counting DECREFs which add hundreds more. It seems unlikely that this PR correctly identifies all cases where a garbage collection can occur and correctly spills all the necessary values to the stack memory.

Even if it is correct, it still lays traps for the unwary.
For example, _BINARY_OP contains the code:

    DECREF_INPUTS();
    ERROR_IF(res_o == NULL, error);
    res = PyStackRef_FromPyObjectSteal(res_o);

If this were changed to

    res = PyStackRef_FromPyObjectSteal(res_o);
    DECREF_INPUTS();
    ERROR_IF(PyStackRef_IsNull(res), error);

then it would be unsafe, yet there would be no warning or error from any tools.

@colesbury
Contributor

The rewritten _BINARY_OP example is safe because it uses PyStackRef_FromPyObjectSteal. The concern is with calls to
PyStackRef_FromPyObjectNew.

There are a hundred or more places where execution can escape from the interpreter

We should not be thinking about this in terms of where calls escape from the interpreter. We should be thinking about this in terms of where PyStackRef_FromPyObjectNew() calls occur, and ensure that those are always written to the stack.

If you want to refactor PyStackRef_FromPyObjectNew so that it takes a pointer or similar that's fine.

@Fidget-Spinner
Member Author

Ok I can port your old PR over to this one. I presume we're flushing the values to the stack before every escaping call?

Sorry I forgot that this may introduce another perf regression on the free-threaded build, even if I limit it just to that. So I'm putting a pause on this. I think we should discuss this on Wednesday.

@markshannon
Member

In general, stackrefs must be spilled to the in-memory stack around any escaping call.
To do this automatically, we need to:

  1. Identify all escaping calls
  2. Track assignments to output values
  3. Raise an error if an escaping call occurs unless all or none of the output values have been assigned
  4. Generate the spill around the call.
    • If none of the output values have been defined, we just spill the stack pointer after popping the inputs.
    • If all of the output values have been defined, we save the outputs to memory and save the stack pointer before making the escaping call.

Identifying all escaping calls.

We have a whitelist; all other calls are escaping. The list needs updating, but it's mostly right.

Track assignments to output values

Assignments are easy to spot: name = .... Then all we need to do is check whether name is an output variable. Any assignments in nested code, or multiple assignments, should be treated as an error.

It is easy enough to change the code to move assignments out of branches, and we don't want the code generator to have to do flow analysis.

Generating spills around calls.

Once we have identified the escaping call, we will need to walk backwards and forwards around the call to identify the whole statement.
Once we've done that, we need to emit the spill before the call and the reload after it (if any reload is needed).
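The steps above could be sketched roughly as follows. This is a hypothetical Python sketch: the names, the statement model (a pair of assigned-variable-or-None and the list of called functions), and the whitelist contents are illustrative, not the actual cases-generator code.

```python
# Illustrative whitelist; step 1 treats every call not on it as escaping.
NON_ESCAPING = {"PyStackRef_FromPyObjectSteal", "Py_INCREF"}

def is_escaping(call: str) -> bool:
    return call not in NON_ESCAPING

def check_uop(statements, output_vars):
    """Steps 2 and 3: track assignments to output values, and require that
    at each escaping call either all or none of the outputs are assigned."""
    assigned: set[str] = set()
    for target, calls in statements:
        for call in calls:  # calls happen before the assignment completes
            if is_escaping(call) and assigned and assigned != set(output_vars):
                raise SyntaxError(
                    f"escaping call {call!r} with partially assigned outputs")
        if target in output_vars:
            if target in assigned:
                raise SyntaxError(f"multiple assignments to {target!r}")
            assigned.add(target)
        # Step 4 (not modelled here) would emit the spill around the call:
        # pop the inputs and save the stack pointer if no outputs are
        # assigned yet, or write all outputs to memory first if they all are.
    return assigned
```

For example, an escaping call before any output is assigned passes, while assigning one of two outputs and then making an escaping call raises an error.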

@markshannon
Member

In case that sounds too expensive, don't worry. It shouldn't be.

Spilling the stack pointer is cheap as registers often need to be spilled across calls anyway.
We need to save the output values to memory in most cases anyway, we are just doing it a little earlier.

Although escaping calls are common in bytecodes.c, they are not so common dynamically. Dynamically, only about 10% of instructions include an escaping call, and for many of those the additional cost is negligible for the reasons given above.

@colesbury
Contributor

colesbury commented Jul 19, 2024

@markshannon, that is not what we discussed and agreed to yesterday (Wednesday). What we discussed was tracking PyStackRef_FromPyObjectNew calls in the code generator and spilling those immediately.

  • We do not need or want to track stackrefs in general for the GC, just PyStackRef_FromPyObjectNew
  • There's no point in looking for escaping calls and waiting to write the result of PyStackRef_FromPyObjectNew in multiple places.

@markshannon
Member

Tracking just PyStackRef_FromPyObjectNew will solve your immediate problem, but it doesn't help with a broader deferred reference counting implementation nor with top-of-stack caching.

I think what I proposed handles all the use cases, without a significant performance impact.
Even if it turns out the performance impact is too high, the additional analysis will help make any solution more robust.

There's no point in looking for escaping calls and waiting to write the result of PyStackRef_FromPyObjectNew in multiple places.

Why would anything get written multiple times?

@Fidget-Spinner
Member Author

I think the original goal was to not block TOS caching and full deferred refcounting. With the latest changes, this is now true. The added goal of laying the foundations for those two should be left to another PR. The responsibility of this PR, IMO, is to not burden stack caching and full deferred refcounting (which it should have achieved with the latest commit). Supporting more features is out of scope.

@colesbury
Contributor

@markshannon - I'm most concerned because I thought we were on the same page on this approach after our last meeting.

I don't think starting with this approach precludes future changes, like what you've outlined above. This PR is a bottleneck for a lot of the free-threading deferred reference counting work, so I'd appreciate it if we can figure out how to unblock it.

@markshannon markshannon left a comment

Having the spilling controlled by the code generator is a major improvement.
Thanks for doing that.

I have a few comments

@@ -1428,6 +1436,8 @@
PyObject **items = _PyTuple_ITEMS(seq_o);
for (int i = oparg; --i >= 0; ) {
*values++ = PyStackRef_FromPyObjectNew(items[i]);
#ifdef Py_GIL_DISABLED /* flush specials */
Member

Don't emit anything if there are no values to flush.

emit_to(out, tkn_iter, "SEMI")
out.emit(";\n")
target = uop.body.index(tkn)
assert uop.body[target-1].kind == "EQUALS", (f"{uop.name} Result of a specials that is flushed"
Member

This is going to crash as soon as anyone writes anything but an assignment in front of PyStackRef_FromPyObjectNew.

We might want to insist that the result of PyStackRef_FromPyObjectNew is assigned to a variable, maybe only an output variable, but that should be an error not a crash.

Contributor

I've updated the error handling. For the most part, PyStackRef_FromPyObjectNew is now required to be assigned to an output variable, but we also allow assignment to a dereferenced pointer due to usages like:

  • frame->localsplus[offset + i] = PyStackRef_FromPyObjectNew(o); (in COPY_FREE_VARS)
  • *values++ = PyStackRef_FromPyObjectNew(items[i]) (in UNPACK_SEQUENCE_LIST)
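A minimal sketch of that assignment-target check might look like the following. This is hypothetical: the real check lives in the cases generator and works on its token stream rather than on raw text with a regex.

```python
import re

# Hypothetical rule: the result of PyStackRef_FromPyObjectNew must be
# assigned either to an output variable or through a dereferenced pointer
# (e.g. "*values++ = ..." or "frame->localsplus[i] = ...").
ASSIGN_RE = re.compile(r"(.+?)\s*=\s*PyStackRef_FromPyObjectNew\(")

def valid_new_target(line: str, output_vars: set[str]) -> bool:
    m = ASSIGN_RE.search(line)
    if m is None:
        return False  # result is discarded: not an assignment at all
    target = m.group(1).strip()
    if target in output_vars:
        return True   # plain assignment to an output variable
    # dereferenced-pointer targets are also allowed
    return target.startswith("*") or "->" in target or "[" in target
```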

#ifdef Py_GIL_DISABLED /* flush specials */
stack_pointer[-2 - oparg] = method;
#endif /* flush specials */
stack_pointer[-2 - oparg] = method;
Member

Why is the same value being saved twice?

Contributor

I updated the generator to avoid saving the same value twice. I got rid of the #ifdef Py_GIL_DISABLED guards because I think they would add too much complexity for no real gain.

@@ -177,6 +177,35 @@ def replace_check_eval_breaker(
if not uop.properties.ends_with_eval_breaker:
out.emit_at("CHECK_EVAL_BREAKER();", tkn)

def replace_pystackref_frompyobjectnew(
Member

I don't think that locally replacing PyStackRef_FromPyObjectNew is sufficient in general. It ensures that the references are in memory, but it doesn't ensure that the stack_pointer is saved, so the references might not be visible.

IIRC this is safe for the current FT implementation because you NULL out the unused part of the stack.
Could you add a comment explaining why this is safe for FT even though it doesn't appear to be.

Contributor

I added a comment

target = uop.body.index(tkn)
assert uop.body[target-1].kind == "EQUALS", (f"{uop.name} Result of a specials that is flushed"
" must be written to a stack variable directly")
# Scan to look for the first thing it's assigned to
Member

FYI, this sort of analysis should ideally be performed in the analysis phase, not in the code generation phase.
Don't worry about it for now.

Contributor

I've moved this to analyzer.py since it required a number of changes anyways.

found_valid_assignment = assgn_target.text
break

if found_valid_assignment:
Member

What happens if we haven't found a valid assignment?
That sounds like an error.

Contributor

This now raises an analysis_error.

@@ -213,6 +213,24 @@ def flush(self, out: CWriter, cast_type: str = "uintptr_t", extract_bits: bool =
self.peek_offset.clear()
out.start_line()

def write_variable_to_stack(self, out: CWriter, var_name: str, cast_type: str = "uintptr_t", extract_bits: bool = False) -> None:
Member

It seems odd to be writing individual values to the stack. If any value on the stack needs to be in memory, don't they all?

Would it make more sense to use the flush method?

Contributor

I'm looking at this since Ken is on vacation.

I think that the logic in this PR (after resolving merge conflicts with main) no longer works after #122286.

Before #122286, the tier 1 generator had the outputs on the logical stack when emit_tokens() was called. Now, the inputs are still on the logical stack when emit_tokens() is called, and the outputs are only pushed after emit_tokens().

I think flush() has the same issue: the logical stack still contains the inputs, not the outputs.

@bedevere-app

bedevere-app bot commented Aug 2, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@colesbury
Contributor

@markshannon, I've split off the changes to the interpreter generator into a separate PR:

I think that will make it easier to see the changes to the generated code. (And the merge conflicts were getting unwieldy.)
