add flag to skip reference pool mutex if the program doesn't use the pool #4174
Conversation
src/gil.rs
Outdated
```rust
#[inline]
pub(crate) fn update_deferred_reference_counts(py: Python<'_>) {
    // Justification for relaxed: worst case this causes already deferred drops to be
    // delayed slightly later, and this is also a one-time flag, so if the program is
    // using deferred drops it is highly likely that branch prediction will always
    // assume this is true and we don't need the atomic overhead.
    if POOL.ever_used.load(sync::atomic::Ordering::Relaxed) {
        POOL.update_counts(py)
    }
}
```
If we do this, I would prefer to encapsulate it better, e.g. in `impl ReferencePool`:

```rust
#[inline]
fn update_counts(&self, py: Python<'_>) {
    if self.ever_used.load(Ordering::Relaxed) {
        self.update_counts_impl(py);
    }
}

#[inline(never)]
fn update_counts_impl(&self, _py: Python<'_>) { .. }
```
We might even get away with not implementing any of this ourselves, and get better self-documenting code: just put it inside a `once_cell::sync::Lazy`, i.e.

```rust
static POOL: Lazy<ReferencePool> = Lazy::new(ReferencePool::new);
```

and call `register_decref` as before (going via `impl Deref for Lazy`, i.e. `Lazy::force`), but for the update, do

```rust
if let Some(pool) = Lazy::get(&POOL) {
    pool.update_counts();
}
```

(note that `Lazy::get` is an associated function, so it cannot be called with method syntax).
(This would also take care of not updating the flag after the first change, but it might be a bit slower because `Lazy` tracks its initialization state inside an `AtomicU8`.)
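For concreteness, here is a minimal self-contained sketch of this `Lazy`-based approach. The `ReferencePool` here is a stand-in: the real pool in src/gil.rs stores pending operations on `NonNull<ffi::PyObject>` behind a mutex, while this sketch uses `usize` so it compiles on its own.

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;

// Stand-in for the real pool; `usize` replaces `NonNull<ffi::PyObject>`
// purely to keep the example self-contained.
struct ReferencePool {
    pending_decrefs: Mutex<Vec<usize>>,
}

impl ReferencePool {
    fn new() -> Self {
        ReferencePool {
            pending_decrefs: Mutex::new(Vec::new()),
        }
    }

    fn register_decref(&self, ptr: usize) {
        self.pending_decrefs.lock().unwrap().push(ptr);
    }

    fn update_counts(&self) {
        let pending = std::mem::take(&mut *self.pending_decrefs.lock().unwrap());
        for _ptr in pending {
            // ... apply the deferred decref here ...
        }
    }
}

static POOL: Lazy<ReferencePool> = Lazy::new(ReferencePool::new);

fn register_decref(ptr: usize) {
    // Method-call syntax goes through `Deref`, which runs `Lazy::force` and
    // initializes the pool on first use.
    POOL.register_decref(ptr);
}

fn update_deferred_reference_counts() {
    // `Lazy::get` returns `None` until the pool has been initialized, so a
    // program that never touches the pool never takes the mutex at all.
    if let Some(pool) = Lazy::get(&POOL) {
        pool.update_counts();
    }
}
```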
Yes, it's a good idea; I'll test out the performance of using `Lazy` when I get a chance later 👍
```rust
        pointer_ops: sync::Mutex::new((Vec::new(), Vec::new())),
    }
}

fn register_incref(&self, obj: NonNull<ffi::PyObject>) {
    self.ever_used.store(true, sync::atomic::Ordering::Relaxed);
```
It might actually help to avoid this memory traffic entirely if the flag is already set, e.g.

```rust
let _ = self.ever_used.compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed);
```
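A hedged note on this suggestion: on some architectures even a failed compare-exchange still takes the cache line in exclusive mode, so a common variant is to guard the CAS (or a plain store) with a relaxed load, keeping the already-set fast path read-only. A small sketch under that assumption, not the PR's actual code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static EVER_USED: AtomicBool = AtomicBool::new(false);

fn mark_used() {
    // Fast path: once the flag is set, this relaxed load keeps the cache line
    // in the shared state, so repeated calls generate no write traffic.
    if !EVER_USED.load(Ordering::Relaxed) {
        // Slow path, taken at most a handful of times per program. The result
        // is ignored: losing the race just means another thread set the flag.
        let _ = EVER_USED.compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed);
    }
}
```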
Testing this locally (albeit with a single thread), the `.store` seems to perform better, so I will leave it as-is for now.
> albeit with a single thread

To be honest, a single-threaded test cannot really tell us anything about the effects of cache-line bouncing, as there is only a single core's cache involved. (In any case, I would be more keen on the `Lazy`-based approach than open-coding this.)
Regarding the encapsulation, I fully agree that making this part of `impl ReferencePool` is better (I started with exactly the if/impl split you propose).
In #1608 I introduced a similar bool, but it got removed at some point. May be worthwhile to figure out why we removed it.
Note that this is slightly different: this bool is a one-time flag that is set when the program first touches the reference pool and never reset. Hence, it will end up permanently shared across cores if it is able to stay on a separate cache line (we should make sure to use alignment to prevent any false sharing here). The flag that was removed tracked whether the reference pool was dirty, and hence had the same amount of coherence traffic as the atomic on which all futex-based mutexes depend in any case, i.e. in the uncontended case it would be just as fast/slow as the mutex access to check whether the pool is dirty itself. Now, it might be that the mutex access got slower now that we have removed …
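A sketch of the alignment idea mentioned above: padding the one-time flag out to its own cache line so it cannot falsely share a line with the mutex-protected data. The 64-byte line size and the field types are assumptions for illustration; a crate like crossbeam-utils (`CachePadded`) picks the right size per architecture.

```rust
use std::sync::atomic::AtomicBool;
use std::sync::Mutex;

// Force the wrapped value onto its own 64-byte-aligned cache line.
#[repr(align(64))]
struct CacheLinePadded<T>(T);

struct ReferencePool {
    // Read-mostly one-time flag: once set, it stays in the shared state of
    // every core's cache as long as nothing else writes to its line.
    ever_used: CacheLinePadded<AtomicBool>,
    // The frequently written, mutex-protected data lands on other lines.
    pointer_ops: Mutex<(Vec<usize>, Vec<usize>)>,
}
```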
Looking at https://github.com/rust-lang/rust/blob/master/library/std/src/sys/sync/mutex/mod.rs, macOS really does appear to use …

I really wonder whether we should just use a lock-free stack or queue here, as we really do not need mutual exclusion; we just need to steal a single pointer using a compare-exchange.
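To illustrate that lock-free idea: the pending operations can form an intrusive linked list behind a single `AtomicPtr`; producers push with a compare-exchange loop (a Treiber stack) and the consumer steals the entire chain with one `swap`. A hedged sketch of the technique, not what PyO3 actually does; `usize` again stands in for `NonNull<ffi::PyObject>`.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    value: usize, // placeholder for NonNull<ffi::PyObject>
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

fn push(value: usize) {
    let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        // Link the new node in front of the current head, then try to publish it.
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => head = actual, // lost the race; retry with the new head
        }
    }
}

fn steal_and_apply() {
    // A single swap detaches the whole chain; no mutual exclusion is needed.
    let mut node = HEAD.swap(ptr::null_mut(), Ordering::Acquire);
    while !node.is_null() {
        let boxed = unsafe { Box::from_raw(node) };
        // ... apply the deferred operation to boxed.value here ...
        node = boxed.next;
    }
}
```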
I prefer #4178 as a more complete solution, so I will close this.
This is a tiny idea I had to try to move `ReferencePool` work off the critical path for programs which have avoided using the reference pool.

The idea is that after #4095 we end up with the reference pool only containing deferred drops. We can use a boolean flag, read and written with only relaxed atomic ordering, to record whether we ever wrote to the pool. If this flag is false, we skip the work. This should hopefully be extremely friendly to branch prediction: if the program never touches the pool, the branch predictor should always assume the flag is false, and the (heavier) mutex work is skipped.
I wonder if this might be a gentler alternative to the config flag in #4095 that would be good enough for most users. I would also support keeping the flag but making it abort-only, as a debugging tool to get to the never-touches-pool state for this PR. cc @adamreichold
Measuring the same sample from #3787 (comment) on my machine, this feature does indeed make quite a marked improvement: