runtime: correct attribution of internal mutex contention #66999
Comments
At a glance, this plan seems reasonable to me. One note on assumed constraints:
While it is nice to avoid increasing the size of a mutex, I don't think this needs to be a hard requirement. Taking a look at where we use mutex, it looks like timers and channels probably have the most objects and would thus have the biggest impact. But neither of those structures is tiny, so I don't think it would be the end of the world if they get bigger.
Change https://go.dev/cl/585635 mentions this issue:
Change https://go.dev/cl/585639 mentions this issue:
Change https://go.dev/cl/585638 mentions this issue:
Change https://go.dev/cl/585636 mentions this issue:
Change https://go.dev/cl/585637 mentions this issue:
Change https://go.dev/cl/586237 mentions this issue:
Move the nextwaitm field into a small struct, in preparation for additional metadata to track how long Ms need to wait for locks.

For #66999

Change-Id: Ib40e43c15cde22f7e35922641107973d99439ecd
Reviewed-on: https://go-review.googlesource.com/c/go/+/585635
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>

Prepare the futex-based implementation of lock2 to maintain a list of waiting Ms. Beyond storing an muintptr in the mutex's key field, we now must never overwrite that field (even for a moment) without taking its current value into account. The semaphore-based implementation of lock2 already has that behavior. Reuse that structure.

For #66999

Change-Id: I23b6f6bacb276fe33c6aed5c0571161a7e71fe6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/585636
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

When an M unlocks a contended mutex, it needs to consult a list of the Ms that had to wait during its critical section. This allows the M to attribute the appropriate amount of blame to the unlocking call stack. Mirroring the implementation for users' sync.Mutex contention (via sudog), we can (in a future commit) use the time that the head and tail of the wait list started waiting, and the number of waiters, to estimate the sum of the Ms' delays.

When an M acquires the mutex, it needs to remove itself from the list of waiters. Since the futex-based lock implementation leaves the OS in control of the order of M wakeups, we need to be prepared for quickly (constant time) removing any M from the list.

First, have each M add itself to a singly-linked wait list when it finds that its lock call will need to sleep. This case is safe against live-lock, since any delay to one M adding itself to the list would be due to another M making durable progress.

Second, have the M that holds the lock (either right before releasing, or right after acquiring) update metadata on the list of waiting Ms to double-link the list and maintain a tail pointer and waiter count. That work is amortized-constant: we'll avoid contended locks becoming proportionally more contended and undergoing performance collapse.

For #66999

Change-Id: If75cdea915afb59ccec47294e0b52c466aac8736
Reviewed-on: https://go-review.googlesource.com/c/go/+/585637
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>

Go 1.22 promised to remove the setting in a future release once the semantics of runtime-internal lock contention matched that of sync.Mutex. That work is done, remove the setting.

For #66999

Change-Id: I3c4894148385adf2756d8754e44d7317305ad758
Reviewed-on: https://go-review.googlesource.com/c/go/+/585639
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>

Have the test use the same clock (cputicks) as the profiler, and use the test's own measurements as hard bounds on the magnitude to expect in the profile. Compare the depiction of two users of the same lock: one where the critical section is fast, one where it is slow. Confirm that the profile shows the slow critical section as a large source of delay (with #66999 fixed), rather than showing the fast critical section as a large recipient of delay.

For #64253
For #66999

Change-Id: I784c8beedc39de564dc8cee42060a5d5ce55de39
Reviewed-on: https://go-review.googlesource.com/c/go/+/586237
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/587175 mentions this issue:
Change https://go.dev/cl/586796 mentions this issue:
Mutex contention measurements work with two clocks: nanotime for use in runtime/metrics, and cputicks for the runtime/pprof profile. They're subject to different sampling rates: the runtime/metrics view is always enabled, but the profile is adjustable and is turned off by default. They have different levels of overhead: it can take as little as one instruction to read cputicks while nanotime calls are more elaborate (although some platforms implement cputicks as a nanotime call). The use of the timestamps is also different: the profile's view needs to attach the delay in some Ms' lock2 calls to another M's unlock2 call stack, but the metric's view is only an int64.

Treat them differently. Don't bother threading the nanotime clock through to the unlock2 call, measure and report it directly within lock2. Sample nanotime at a constant gTrackingPeriod. Don't consult any clocks unless the mutex is actually contended. Continue liberal use of cputicks for now.

For #66999

Change-Id: I1c2085ea0e695bfa90c30fadedc99ced9eb1f69e
Reviewed-on: https://go-review.googlesource.com/c/go/+/586796
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
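As a rough illustration of that shape (a stand-in sketch using ordinary Go and time.Now in place of the runtime's cputicks and nanotime; none of these names come from the runtime): read no clocks on the uncontended path, always take the cheap timestamp once contended, and only sample the costlier measurement occasionally.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var sampleCounter atomic.Uint32

// lockWithTiming sketches the pattern: the uncontended path reads no clocks,
// the contended path always records the cheap delay used for the profile, and
// the measurement destined for runtime/metrics is only sampled some of the time.
func lockWithTiming(mu *sync.Mutex) (profileDelay, metricDelay time.Duration) {
	if mu.TryLock() {
		return 0, 0 // uncontended: no timestamps at all
	}
	start := time.Now()                    // stand-in for the cheap cputicks read
	sampled := sampleCounter.Add(1)%8 == 0 // stand-in for gTrackingPeriod sampling
	mu.Lock()                              // block until the holder releases
	profileDelay = time.Since(start)
	if sampled {
		metricDelay = profileDelay // stand-in for the separate nanotime measurement
	}
	return profileDelay, metricDelay
}

func main() {
	var mu sync.Mutex
	mu.Lock()
	go func() { time.Sleep(time.Millisecond); mu.Unlock() }()
	fmt.Println(lockWithTiming(&mu))
}
```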
Change https://go.dev/cl/589096 mentions this issue:
Change https://go.dev/cl/589116 mentions this issue:
Change https://go.dev/cl/589098 mentions this issue:
Change https://go.dev/cl/589097 mentions this issue:
Change https://go.dev/cl/589115 mentions this issue:
Change https://go.dev/cl/589117 mentions this issue:
Change https://go.dev/cl/589095 mentions this issue:
This reverts commit 8ab131f (CL 586796).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>

This reverts commit f9ba2cf (CL 586237).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck. This test verifies that the semantics of the mutex profile for runtime.mutex values matches that of sync.Mutex values. Without the rest of the patch series, this test would correctly identify that Go 1.22's semantics are incorrect (issue #66999).

Updates #66999
Updates #67585

Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

This reverts commit 87e930f (CL 585639).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>

This reverts commit ba1c5b2 (CL 585638).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

This reverts commit d881ed6 (CL 585637).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>

This reverts commit dfb7073 (CL 585636).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I3483bf0b85ba0b77204032a68b7cbe93f142703e
Reviewed-on: https://go-review.googlesource.com/c/go/+/589098
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>

This reverts commit be0b569 (CL 585635).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reverted for the Go 1.23 cycle due to #67585
This direction seems right. It is also a thing in OpenMP runtime tools API: https://www.openmp.org/wp-content/uploads/OMPT-SC13-BOF.pdf
Hmm. The blame shifting seems less useful than knowing what was blocked where and for how long. Ideally it seems it would report both - what was blocked for how long and what the lock holder was doing - but this feels like nested profiling which is pretty hard to follow.
Yes, the semantics of the "mutex" profile are to shift blame to the slow critical section. Part of why this didn't make it for the Go 1.23 cycle is that we only have the explicit list of lock waiters for the lock_sema.go platforms, and the extra bookkeeping required to add it for GOOS=linux was an unwelcome slowdown. It is also useful to know what was blocked and for how long—for blocking that takes place outside of the runtime, that's reported on the "block" profile. Do you have an example of where that would have been useful for runtime-internal blocking, @robaho (maybe an open issue that is/was hard to debug)? If so, I think the next step would be to open a new issue to discuss adding it. I think it's possible, but we'd need to get consensus on whether it's worth the costs. Fixing this didn't happen for Go 1.23, I'm planning to try again for Go 1.24.
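To make the mutex-versus-block distinction concrete, here is a small self-contained example (mine, not from the thread) using a user-level sync.Mutex: the mutex profile attributes the delay to the holder's Unlock call stack, while the block profile attributes it to the waiters' Lock call stacks. The plan in this issue gives runtime-internal locks the same blame-the-holder semantics in the mutex profile.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	runtime.SetMutexProfileFraction(1) // record every contention event
	runtime.SetBlockProfileRate(1)     // record every blocking event

	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 20; j++ {
				mu.Lock()
				time.Sleep(time.Millisecond) // deliberately slow critical section
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	pprof.Lookup("mutex").WriteTo(os.Stdout, 1) // delay attributed to the Unlock callers
	pprof.Lookup("block").WriteTo(os.Stdout, 1) // delay attributed to the Lock callers
}
```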
Change https://go.dev/cl/622997 mentions this issue:
As of Go 1.22, the mutex profile measures contention on internal mutex values, but attributes the contention to the function runtime._LostContendedRuntimeLock. This should be improved.

When running with GODEBUG=runtimecontentionstacks=1, the mutex profile attributes to the caller of runtime.unlock all the delay that the M encountered during its preceding runtime.lock call to obtain that lock. This is incorrect: the delay attributed to a runtime.unlock call should instead correspond to the total time that the other Ms were unable to proceed because this M was holding the lock.

I have the start of a plan to resolve this, so I'm seeking feedback from @mknyszek and @prattmic (or @golang/runtime or others) on whether I'm on the right track — both in the start of the plan, and on whether I'm working to the right set of constraints. Thank you!
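For readers who want to see the current behavior, here is a small sketch (mine, not from the issue) that drives contention on a runtime-internal lock through channel operations and dumps the mutex profile. On Go 1.22, running it with and without GODEBUG=runtimecontentionstacks=1 shows the difference between caller-of-runtime.unlock stacks and the runtime._LostContendedRuntimeLock placeholder.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	runtime.SetMutexProfileFraction(1)

	// Unbuffered channel operations take the channel's internal runtime.mutex,
	// so many concurrent senders and receivers contend on a runtime-internal lock.
	ch := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				ch <- j
			}
		}()
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				<-ch
			}
		}()
	}
	wg.Wait()

	// The "mutex" profile includes runtime-internal lock contention as of Go 1.22.
	pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
}
```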
The mutex profile for sync.Mutex uses fields on the sudog to track how many Gs have been waiting and for about how long. The sudogs form a linked list of waiters, and are dequeued under the control of the runtime. When waking a G, the caller of sync.Mutex.Unlock is able to claim responsibility for the delay that it specifically caused, and update the remaining waiters' start time so the next Unlocker can claim the right amount too. We'll need similar information for the Ms that hold runtime-internal locks.
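A minimal sketch of that approximation (my own illustration, assuming the waiters' start times are spread roughly evenly between the head's and the tail's; the runtime's actual arithmetic may differ):

```go
// estimateTotalDelay approximates the sum of all waiters' delays from three
// pieces of bookkeeping: when the head of the wait list went to sleep, when
// the tail went to sleep, and how many waiters there are.
func estimateTotalDelay(now, headSince, tailSince, waiters int64) int64 {
	if waiters == 0 {
		return 0
	}
	averageWait := now - (headSince+tailSince)/2
	return waiters * averageWait
}
```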
I'd like to say a few assumed constraints out loud:

- The futex-based lock implementation leaves the choice of which waiter to wake up to the OS: futex(&l.key, FUTEX_WAKE, 1) can wake any M that was waiting on l.key.
- We should avoid increasing the size of runtime.mutex, it needs to remain uintptr.
- We shouldn't slow down the uncontended fast path (we have less room for per-operation bookkeeping than sync.Mutex).
- There are some low bits available in the muintptr that we can use as flags (as seen in lock_sema.go; see the sketch after this list).
- We can add fields to the M, perhaps grouped into a sudom struct, to represent waiting Ms as the sudog represents waiting Gs.
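A minimal sketch of the flag-bits idea from that list (illustrative only, with invented helper names; lock_sema.go already packs a locked bit into the low bit of the key and keeps the head of the waiting-M list in the remaining bits):

```go
// One machine word holds both the lock state and the head of the waiter list.
const lockedBit uintptr = 1

func isLocked(key uintptr) bool      { return key&lockedBit != 0 }
func headWaiter(key uintptr) uintptr { return key &^ lockedBit } // address of the first waiting M

func packKey(headM uintptr, locked bool) uintptr {
	key := headM // M addresses are word-aligned, so the low bit is free for the flag
	if locked {
		key |= lockedBit
	}
	return key
}
```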
For OSes where we use lock_sema.go, an M that needs to block on a runtime.mutex first pushes itself onto the head of a linked list (runtime.mutex.key, linked via runtime.m.nextwaitm), and then sleeps on its own private semaphore. When the current holder of the lock is done, it pops an M off of the list and wakes that specific M. We could add fields to the M for the number of Ms after it on the list, for the time it went to sleep, and for the time that the tail M went to sleep (or a pointer to the tail M). I think that would be sufficient information to match the approximation we do for sync.Mutex via sudog, and would need only a simple set of invariants to update when pushing and popping Ms to/from this list.
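A sketch of the per-M bookkeeping that paragraph describes (field and function names are mine; the real fields would live on runtime.m, or in the sudom struct floated in the constraints above):

```go
// mWaitInfo stands in for the fields that would hang off runtime.m.
type mWaitInfo struct {
	next       *mWaitInfo // next waiter, like runtime.m.nextwaitm today
	startTicks int64      // when this M went to sleep on the mutex
	tail       *mWaitInfo // tail of the list, so unlock can read its startTicks
	waiters    int32      // number of Ms after this one on the list
}

// push adds m to the head of the wait list kept in the mutex key, keeping the
// aggregate fields on the head up to date so that the eventual unlock can
// estimate the total delay its critical section caused.
func push(head **mWaitInfo, m *mWaitInfo, nowTicks int64) {
	m.next = *head
	m.startTicks = nowTicks
	if *head == nil {
		m.tail = m
		m.waiters = 0
	} else {
		m.tail = (*head).tail
		m.waiters = (*head).waiters + 1
	}
	*head = m
}
```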
For OSes such as Linux where we use lock_futex.go, the kernel-controlled wakeup order means we'll need something more:

- runtime.mutex.key will point to the waiting M at the head of the linked list (as in lock_sema.go).
- Each M that needs to block will add itself to that list before sleeping (in the bottom half of runtime.lock).
- Only one M will be either releasing the lock (in the top half of runtime.unlock) OR removing itself from the list (in the bottom half of runtime.lock) at a time.
- Ms will still try to take the lock first with the existing atomic.Xchg speculative grab.

This uses space on the waiting Ms to store the necessary information; they're not using it for anything else while they're blocked.
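A sketch of how the futex-side list maintenance could look (all names invented for illustration; waiters only push themselves with a CAS, and the lock holder back-fills the double links, tail pointer, and count, matching the amortized-constant scheme described in CL 585637 above):

```go
package waitlist // sketch only; not runtime code

import (
	"sync/atomic"
	"unsafe"
)

// waiter stands in for the per-M linkage fields.
type waiter struct {
	next, prev *waiter
}

// waitList stands in for the metadata the lock holder maintains.
type waitList struct {
	head  unsafe.Pointer // *waiter; waiters push here with a CAS
	fixed *waiter        // value of head the last time the holder ran fixup
	tail  *waiter
	n     int
}

// push is the only list operation a blocking M performs: a lock-free push
// onto the singly-linked head, so a delayed waiter never delays anyone else.
func (l *waitList) push(w *waiter) {
	for {
		old := (*waiter)(atomic.LoadPointer(&l.head))
		w.next = old
		if atomic.CompareAndSwapPointer(&l.head, unsafe.Pointer(old), unsafe.Pointer(w)) {
			return
		}
	}
}

// fixup runs only while holding the lock. It walks the entries pushed since
// the previous fixup, back-fills prev pointers, and refreshes the tail
// pointer and count, so that any M the kernel happens to wake can unlink
// itself in constant time. (Removal, and resetting fixed when the list
// drains, are omitted from this sketch.)
func (l *waitList) fixup() {
	head := (*waiter)(atomic.LoadPointer(&l.head))
	for w := head; w != nil && w != l.fixed; w = w.next {
		l.n++
		if w.next != nil {
			w.next.prev = w
		} else {
			l.tail = w
		}
	}
	l.fixed = head
}
```

The constant-time self-removal this enables depends on the prev pointers being in place before the kernel wakes a given M, which is why the fixup work is described as happening while the lock is held, right after acquiring or right before releasing.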
I'll prepare a CL to implement this plan, but if I've gotten some assumptions wrong or it's otherwise clear that this won't work, it may be easier to spot and to discuss it here rather than in the details of code review. Thanks!