
add flag to skip reference pool mutex if the program doesn't use the pool #4174

Closed
davidhewitt wants to merge 2 commits

Conversation

davidhewitt
Member

This is a tiny idea I had to try to move ReferencePool work off the critical path for programs which have avoided using the reference pool.

The idea is that after #4095 we end up with the reference pool only containing deferred drops. We can then use a boolean flag, read and written with relaxed ordering and no heavier synchronization, to record whether we ever wrote to the pool. If this flag is false, we skip the work entirely. This should be extremely friendly to branch prediction: if the program never touches the pool, the check is always predicted correctly and the (heavier) mutex work is never reached.

I wonder if this might be a gentler alternative to the config flag in #4095, one which might be good enough for most users. I would also support keeping that flag but making it abort-only, as a debugging tool for getting a program to the never-touches-pool state this PR relies on. cc @adamreichold
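
For illustration, a minimal sketch of the shape of the change (field and method names such as ever_used and register_op are illustrative rather than the exact PyO3 internals, and usize stands in for the deferred NonNull<ffi::PyObject>):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct ReferencePool {
    // One-time flag, set the first time anything is deferred into the pool.
    ever_used: AtomicBool,
    // Deferred reference-count operations, protected by a mutex.
    pointer_ops: Mutex<Vec<usize>>,
}

impl ReferencePool {
    fn register_op(&self, addr: usize) {
        // Relaxed is enough: the flag only needs to become visible
        // eventually, and it is never reset.
        self.ever_used.store(true, Ordering::Relaxed);
        self.pointer_ops.lock().unwrap().push(addr);
    }

    fn update_counts(&self) {
        // Fast path: a program that never defers drops never sets the flag,
        // so this branch is trivially predictable and the mutex is skipped.
        if !self.ever_used.load(Ordering::Relaxed) {
            return;
        }
        let ops = std::mem::take(&mut *self.pointer_ops.lock().unwrap());
        for _addr in ops {
            // ... apply the deferred reference-count change ...
        }
    }
}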

Measuring the same sample from #3787 (comment) on my machine, this feature does indeed make quite a marked improvement:

# before - rust noop is slower
$ python test.py
py 2.3021166001853998e-08
rust 3.136266599904047e-08

# after - rust noop is slightly faster
$ python test.py
py 2.5457915995502846e-08
rust 2.360866600065492e-08

src/gil.rs Outdated
Comment on lines 285 to 295
#[inline]
pub(crate) fn update_deferred_reference_counts(py: Python<'_>) {
    // Justification for relaxed: worst case this causes already deferred drops to be
    // delayed slightly later, and this is also a one-time flag, so if the program is
    // using deferred drops it is highly likely that branch prediction will always
    // assume this is true and we don't need the atomic overhead.
    if POOL.ever_used.load(sync::atomic::Ordering::Relaxed) {
        POOL.update_counts(py)
    }
}

Member

If we do this, I would prefer to better encapsulate it, e.g. in impl ReferencePool

#[inline]
fn update_counts(&self, py: Python<'_>) {
    if self.ever_used.load(Ordering::Relaxed) {
        self.update_counts_impl(py);
    }
}

#[inline(never)]
fn update_counts_impl(&self, py: Python<'_>) { .. }

Member

We might even get away with not implementing any of this ourselves and get better self-documenting code: just put it inside a once_cell::sync::Lazy, i.e.

static POOL: Lazy<ReferencePool> = Lazy::new(ReferencePool::new);

and call register_decref as before (going via impl Deref for Lazy, i.e. Lazy::force), but for update, do

if let Some(pool) = Lazy::get(&POOL) {
    pool.update_counts();
}

Member

(This would also take care of not updating the flag after the first change, but it might be a bit slower because Lazy tracks its initialization state inside an AtomicU8.)
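
As a self-contained sketch of that suggestion (the Pool type, its field, and the usize payload are placeholders; only the Lazy usage mirrors the proposal):

use once_cell::sync::Lazy;
use std::sync::Mutex;

struct Pool {
    ops: Mutex<Vec<usize>>,
}

static POOL: Lazy<Pool> = Lazy::new(|| Pool { ops: Mutex::new(Vec::new()) });

fn register_decref(addr: usize) {
    // Deref on the Lazy (i.e. Lazy::force) initializes the pool on first use.
    POOL.ops.lock().unwrap().push(addr);
}

fn update_counts() {
    // Lazy::get returns None while the pool is uninitialized, so a program
    // that never defers anything never touches the mutex here.
    if let Some(pool) = Lazy::get(&POOL) {
        let ops = std::mem::take(&mut *pool.ops.lock().unwrap());
        for _addr in ops {
            // ... apply the deferred reference-count change ...
        }
    }
}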

Member Author

Yes, it's a good idea. I'll test out the performance of using Lazy when I get a chance later 👍

            pointer_ops: sync::Mutex::new((Vec::new(), Vec::new())),
        }
    }

    fn register_incref(&self, obj: NonNull<ffi::PyObject>) {
        self.ever_used.store(true, sync::atomic::Ordering::Relaxed);
Member

It might actually help to avoid this memory traffic entirely if the flag is already set, e.g.

let _ = self.ever_used.compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed);

Member Author
@davidhewitt May 12, 2024

Testing this locally (albeit with a single thread), the .store seems to perform better, so I will leave it as-is for now.

Member

albeit with a single thread

To be honest, a single-threaded test cannot really tell us anything about the effects of cache line bouncing, as there is only a single core's cache involved. (In any case, I would be more keen on the Lazy-based approach than open-coding this.)

@davidhewitt
Member Author

davidhewitt commented May 11, 2024

Regarding the encapsulation, I fully agree with making this part of update_counts, but I wasn't keen on conflicting with #4095.

(I started with exactly the if / impl split you propose)

@adamreichold
Member

I fully agree with making this part of update_counts, but I wasn't keen on conflicting with #4095

I would be glad if we could finish #4095 first. It is a soundness fix, and even if we decide here to replace the flag entirely, I think doing it one step at a time would be preferable.

@alex
Contributor

alex commented May 11, 2024

In #1608 I introduced a similar bool, but it got removed at some point. It may be worthwhile to figure out why we removed it.

@adamreichold
Member

adamreichold commented May 11, 2024

In #1608 I introduced a similar bool, but it got removed at some point. It may be worthwhile to figure out why we removed it.

Note that this is slightly different: this bool is a one-time flag that is set when the program first touches the reference pool and never reset. Hence, the cache line holding it will end up permanently in the shared state, provided it stays on a separate cache line (we should make sure to use alignment to prevent any false sharing here).

The flag that was removed tracked whether the reference pool was dirty and hence had the same amount of coherence traffic as the atomic on which all futex-based mutexes depend in any case, i.e. in the uncontended case it would be just as fast/slow as taking the mutex to check whether the pool is dirty.

Now, it might be that the mutex access got slower now that we have removed parking_lot, which guaranteed these futex-like performance characteristics, especially if @davidhewitt ran his benchmark on macOS, which IIRC (c.f. rust-lang/rust#93740) still uses pthread mutexes behind the scenes.
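
On the false-sharing point above: if the open-coded flag is kept, one way to make sure it stays on its own cache line is an alignment wrapper, roughly like the sketch below (names are illustrative, a 64-byte line is assumed, and crossbeam-utils' CachePadded provides the same thing off the shelf):

use std::sync::atomic::AtomicBool;
use std::sync::Mutex;

// Force the wrapped value onto its own cache line so the always-shared flag
// cannot false-share with the mutex or other frequently written data.
// 64 bytes is assumed here; some targets use 128-byte lines.
#[repr(align(64))]
struct CacheAligned<T>(T);

struct ReferencePool {
    ever_used: CacheAligned<AtomicBool>,
    pointer_ops: Mutex<Vec<usize>>,
}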

@adamreichold
Member

adamreichold commented May 11, 2024

Looking at https://github.com/rust-lang/rust/blob/master/library/std/src/sys/sync/mutex/mod.rs, macOS really does still appear to use pthread, so I would assume that moving away from parking_lot to a fair PI mutex has certainly regressed this. But I also don't think that this functionality is worth a dependency like parking_lot.

I really wonder whether we should just use a lock-free stack or queue here, as we really do not need mutual exclusion; we just need to steal a single pointer using a compare-exchange.
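
For illustration, a minimal sketch of that idea: deferred items go onto a push-only intrusive list whose head is a single AtomicPtr, and the update path steals the entire list with one swap. All names are hypothetical, usize stands in for the deferred pointer, and node cleanup on the consumer side is omitted.

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Node in a Treiber-style push-only list of deferred operations.
struct Node {
    addr: usize, // placeholder for the deferred PyObject pointer
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

fn push(addr: usize) {
    let node = Box::into_raw(Box::new(Node { addr, next: ptr::null_mut() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        // Link the new node in front of the current head.
        unsafe { (*node).next = head };
        // Publish the node; Release so the consumer sees its contents.
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => break,
            Err(actual) => head = actual,
        }
    }
}

fn steal_all() -> *mut Node {
    // A single swap takes the whole list; the caller walks and frees the nodes.
    HEAD.swap(ptr::null_mut(), Ordering::Acquire)
}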

@davidhewitt
Member Author

I prefer #4178 as a more complete solution, so I will close this.

@davidhewitt closed this Jun 3, 2024