gh-115103: Implement delayed memory reclamation (QSBR) #115180

Merged · 9 commits from gh-115103-qsbr into python:main · Feb 16, 2024

Conversation

colesbury (Contributor) commented on Feb 8, 2024:

This adds a safe memory reclamation scheme for the free-threaded build based on FreeBSD's "GUS" and quiescent state based reclamation (QSBR). The API provides a mechanism for callers to detect when it is safe to free memory that may be concurrently accessed by readers.

Related PRs that build on this:


📚 Documentation preview 📚: https://cpython-previews--115180.org.readthedocs.build/

colesbury (Contributor Author):

Notes for reviewers:

  • This PR does not contain any uses of the API; I wanted to keep this PR a manageable size. I'll put up a PR that makes use of these APIs soon (a usage sketch follows this list).
  • The per-thread states (struct _qsbr_thread_state) are stored in a contiguous array to make scanning quicker, but this makes the thread initialization code a bit more complicated.
  • I think this is one of the cases where memory orderings are particularly tricky and hard to reason about intuitively. I found it helpful to verify the orderings and fences with CDSChecker. This is the model I used: https://github.com/colesbury/c11-model-checker/blob/cpython-models/test/qsbr.c.
  • _Py_qsbr_poll() will have scaling issues with large numbers of threads. That's something we'll want to deal with eventually, but possibly not in the 3.13 timeframe.
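
For illustration, a minimal sketch of how a caller might drive this API. Only _Py_qsbr_advance() and _Py_qsbr_poll() (with the signatures visible in this PR) are real; the retire list and both helpers are hypothetical:

// Sketch (not from this PR): retire pointers tagged with the current
// write sequence, and free them once every thread has passed that point.
struct to_free {
    void *ptr;
    uint64_t seq;                 // goal sequence recorded at retirement
    struct to_free *next;
};

static void
retire(struct _qsbr_thread_state *qsbr, struct to_free **list, void *ptr)
{
    struct to_free *node = PyMem_RawMalloc(sizeof(*node));
    if (node == NULL) {
        return;                   // error handling elided in this sketch
    }
    node->ptr = ptr;
    node->seq = _Py_qsbr_advance(qsbr->shared);   // bump the global sequence
    node->next = *list;
    *list = node;
}

static void
process_frees(struct _qsbr_thread_state *qsbr, struct to_free **list)
{
    struct to_free **p = list;
    while (*p != NULL) {
        struct to_free *node = *p;
        if (_Py_qsbr_poll(qsbr, node->seq)) {     // all readers caught up
            *p = node->next;
            PyMem_RawFree(node->ptr);
            PyMem_RawFree(node);
        }
        else {
            p = &node->next;
        }
    }
}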

if (QSBR_LT(rd_seq, min_seq)) {
    // It's okay if the compare-exchange failed: another thread updated it
    (void)_Py_atomic_compare_exchange_uint64(&shared->rd_seq, &rd_seq, min_seq);
    rd_seq = min_seq;
}
DinoV (Contributor): If the compare/exchange fails, is it worth returning the greater of rd_seq and min_seq?

colesbury (Contributor Author):

It doesn't matter for correctness and I'm not sure it's worth any extra complexity for performance.
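
For concreteness, one hypothetical spelling of the reviewer's suggestion (the PR keeps the simpler unconditional assignment; this assumes, as in C11, that a failed compare-exchange reloads rd_seq with the current shared value):

if (!_Py_atomic_compare_exchange_uint64(&shared->rd_seq, &rd_seq, min_seq)) {
    // CAS failed: rd_seq now holds the current shared value, which
    // another thread may have advanced past min_seq. Keep the newer one.
    if (QSBR_LT(rd_seq, min_seq)) {
        rd_seq = min_seq;
    }
}
else {
    rd_seq = min_seq;
}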

Python/qsbr.c Outdated
// Starting size of the array of qsbr thread states
#define MIN_ARRAY_SIZE 8

// The shared write sequence is always odd and incremented by two. Detached
DinoV (Contributor):
Can we document why the shared write sequence is always odd? I'm assuming that's just to avoid issues with it wrapping around and colliding with QSBR_OFFLINE?

colesbury (Contributor Author):
Yeah, that was the original motivation, but as I wrote in another comment, with 64-bit counters we don't really have to worry about wrap around.

I'm still tempted to keep this numbering scheme and the QSBR_LT wrap-around-safe comparisons, but let me know what you think.

DinoV (Contributor): I think it's fine to keep it this way; it's just worth a comment on why it's that way :)
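
For reference, a wrap-around-safe comparison of the kind discussed here can be spelled as follows (a sketch; the macro in the PR may differ in detail):

// Treat the difference as signed: ordering is preserved across 64-bit
// wrap-around as long as the two sequence numbers are within 2^63.
#define QSBR_LT(a, b) ((int64_t)((a) - (b)) < 0)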

@@ -169,6 +169,10 @@ extern PyTypeObject _PyExc_MemoryError;
{ .threshold = 10, }, \
}, \
}, \
.qsbr = { \
.wr_seq = 1, \
DinoV (Contributor):
Can these be inited to QSBR_INITIAL (it seems like including pycore_qsbr.h shouldn't be an issue?)
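
Presumably that constant would look something like this (a hypothetical spelling, assuming the write sequence starts at the first odd value):

// pycore_qsbr.h (sketch): initial write sequence, the first odd value.
#define QSBR_INITIAL 1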

// If there are no free entries, we pause all threads, grow the array,
// and update the pointers in PyThreadState to entries in the new array.
if (qsbr == NULL) {
_PyEval_StopTheWorld(interp);
DinoV (Contributor):
It seems like this isn't safe to do with the mutex locked and not-detached? If another thread is trying to reserve at the same time it won't be detached, but it also won't be responding to the request for the eval breaker.

colesbury (Contributor Author):

You are right -- I mixed up the locking discipline. The outer lock should not use _Py_LOCK_DONT_DETACH. I think the locking discipline with stop-the-world should be:

  1. A lock can be acquired before starting a stop-the-world pause or after starting it, but not both (consistent lock ordering)
  2. If acquired before the stop-the-world, the acquisition should use _Py_LOCK_DETACH (the default)
  3. If acquired within the stop-the-world, the acquisition should use _Py_LOCK_DONT_DETACH

HEAD_LOCK(runtime) is an example of case 3. This is case 2.

The reasoning for case (3) is a bit subtle: locks can be directly handed off to a detached thread, so with a normal PyMutex_Lock(&m) a thread can both hold the mutex m and have paused for another thread's stop-the-world.
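
Put as code, cases (2) and (3) might look roughly like this (a sketch using the lock and stop-the-world helpers named above; exact mutexes and call sites differ):

// Case 2: lock taken before the stop-the-world pause. Use the default
// (detaching) acquire so this thread can still be paused while waiting.
PyMutex_Lock(&shared->mutex);
_PyEval_StopTheWorld(interp);
// ... grow and swap the array ...
_PyEval_StartTheWorld(interp);
PyMutex_Unlock(&shared->mutex);

// Case 3: lock taken inside an existing stop-the-world pause. Do not
// detach, so the lock cannot be handed to us while we are paused.
// (Roughly what HEAD_LOCK(runtime) does.)
PyMutex_LockFlags(&runtime->interpreters.mutex, _Py_LOCK_DONT_DETACH);
// ... critical section ...
PyMutex_Unlock(&runtime->interpreters.mutex);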

Python/qsbr.c Outdated

// Initialize (or reinitialize) the freelist of QSBR thread states
static void
initialize_freelist(struct _qsbr_shared *shared)
DinoV (Contributor):
Maybe initialize_new_array or something like that? It's just that it's not only initing the free lists, it's also initializing all of the thread state pointers into the array (and the new part covers the re-initialization)

uint64_t
_Py_qsbr_advance(struct _qsbr_shared *shared)
{
    return _Py_atomic_add_uint64(&shared->wr_seq, QSBR_INCR) + QSBR_INCR;
}
DinoV (Contributor):
The FreeBSD version has logic to make sure that wr_seq doesn't get too far ahead of rd_seq, blocking until rd_seq catches up, to avoid extreme wrap-around issues... Is there a reason we don't need that? :) Or should we at least add a TODO and/or an assertion?

colesbury (Contributor Author):

We are using 64-bit sequence counters to avoid worrying about wrap around. FreeBSD uses 32-bit counters. (2^62 ns = ~146 years; 2^30 ns = ~1 sec)

We could consider using 32-bit counters and handling wrap around in the future. It wouldn't matter much for x86-64 and aarch64, but would probably be more efficient on some other platforms.
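
A quick check of those numbers (a standalone back-of-envelope program, not part of the PR, assuming the counter advances once per nanosecond):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const double ns_per_year = 365.25 * 24 * 3600 * 1e9;
    printf("2^62 ns = ~%.0f years\n",
           (double)(UINT64_C(1) << 62) / ns_per_year);   // ~146 years
    printf("2^30 ns = ~%.2f seconds\n",
           (double)(UINT32_C(1) << 30) / 1e9);           // ~1.07 seconds
    return 0;
}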

DinoV (Contributor):
Okay, that makes sense, FreeBSD's is buried in a typedef so I hadn't noticed it was 32-bit.

return true;
}

rd_seq = qsbr_poll_scan(qsbr->shared);
DinoV (Contributor):
This is probably just related to future scaling, but I wonder if we could pass in the goal and bail on the full-scan if any thread hasn't reached it yet. It might prevent us from updating the shared for someone else who would benefit from it though...

colesbury (Contributor Author):

It's not clear to me if that case will happen often enough to be worth handling. Unfortunately, we can't really measure or test for that until we disable the GIL. With the GIL, only one thread is ever attached and active.
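
For the record, the suggested early-exit scan might look roughly like this (hypothetical; the field names are guesses at this PR's internals, and the PR deliberately does not do this):

// Sketch of an early-exit scan: bail as soon as any attached thread is
// still behind the goal. Cheaper for this caller, but it never advances
// the shared rd_seq for the benefit of other pollers.
static bool
qsbr_poll_scan_until(struct _qsbr_shared *shared, uint64_t goal)
{
    for (Py_ssize_t i = 0; i < shared->size; i++) {
        uint64_t seq = _Py_atomic_load_uint64(&shared->array[i].qsbr.seq);
        if (seq != QSBR_OFFLINE && QSBR_LT(seq, goal)) {
            return false;    // this thread has not passed the goal yet
        }
    }
    return true;
}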

DinoV (Contributor) left a review:

LGTM!

colesbury (Contributor Author):

@sethmlarson, does this need to be included in the SBOM? This includes code derived from a FreeBSD component, but there isn't really any shared code. In other words, it's derivative enough that I think the copyright notice should be preserved, but different enough that if there were any security vulnerabilities in the FreeBSD component, they would not be relevant here.

sethmlarson (Contributor):

@colesbury Thanks for the ping! Great question: the SBOM for CPython actually doesn't concern itself with copyright or licensing right now; its primary use case is vulnerability management and tracking. Given this, I don't think we need to update the SBOM for this contribution.

colesbury merged commit 5903190 into python:main on Feb 16, 2024 (35 checks passed), then deleted the gh-115103-qsbr branch on February 16, 2024 at 20:25.
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this pull request on Mar 4, 2024.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request on Apr 17, 2024.