
runtime: correct attribution of internal mutex contention #66999

Open

rhysh opened this issue Apr 23, 2024 · 21 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.), NeedsFix (The path to resolution is known, but the work has not been done.)

@rhysh
Contributor

rhysh commented Apr 23, 2024

As of Go 1.22, the mutex profile measures contention on internal mutex values, but attributes the contention to the function runtime._LostContendedRuntimeLock. This should be improved.

When running with GODEBUG=runtimecontentionstacks=1, the mutex profile attributes to the caller of runtime.unlock all the delay that the M encountered during its preceding runtime.lock call to obtain that lock. This is incorrect: the delay attributed to a runtime.unlock call should instead correspond to the total time that the other Ms were unable to proceed because this M was holding the lock.
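
For concreteness, here is a minimal way to collect the mutex profile (standard runtime and runtime/pprof APIs; the workload itself is left as a placeholder):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every contention event; real programs usually use a larger
	// sampling rate than 1.
	runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(0)

	// ... run a workload that contends on runtime-internal locks ...

	// Write out the "mutex" profile. With Go 1.22 semantics, delay on
	// runtime-internal locks is attributed to
	// runtime._LostContendedRuntimeLock unless
	// GODEBUG=runtimecontentionstacks=1 is set.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```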

I have the start of a plan to resolve this, so I'm seeking feedback from @mknyszek and @prattmic (or @golang/runtime or others) on whether I'm on the right track — both in the start of the plan, and on whether I'm working to the right set of constraints. Thank you!


The mutex profile for sync.Mutex uses fields on the sudog to track how many Gs have been waiting and for about how long. The sudogs form a linked list of waiters, and are dequeued under the control of the runtime. When waking a G, the caller of sync.Mutex.Unlock is able to claim responsibility for the delay that it specifically caused, and update the remaining waiters' start time so the next Unlocker can claim the right amount too. We'll need similar information for the Ms that hold runtime-internal locks.
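
As a rough illustration of that approximation (hypothetical names, not the runtime's actual sudog/sema code): the unlocker charges itself the head waiter's delay, scaled by the number of waiters still queued, and then resets the start time so the next unlocker only claims its own share.

```go
// Illustrative sketch only; field and function names are hypothetical and
// do not match the runtime's sema.go/sudog implementation.
type waitQueue struct {
	waiters   int   // Gs currently blocked on the mutex
	waitStart int64 // when the head of the queue began waiting (ns)
}

// chargeUnlocker runs on behalf of the G that held the lock. It returns the
// delay to attribute to that G's unlock call stack, approximating the sum of
// all waiters' delays without walking the whole list.
func chargeUnlocker(q *waitQueue, now int64) (blame int64) {
	if q.waiters == 0 {
		return 0
	}
	dt := now - q.waitStart
	blame = dt * int64(q.waiters) // pretend every waiter waited as long as the head
	q.waiters--                   // one waiter is woken and dequeued
	q.waitStart = now             // remaining waiters are re-timed from now
	return blame
}
```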

I'd like to say a few assumed constraints out loud:

  • On Linux, futex(&l.key, FUTEX_WAKE, 1) can wake any M that was waiting on l.key.
  • We cannot change the dynamics of locks on Linux (such as by putting the runtime in charge of picking an M to wake, as it is with the semaphore-based implementation).
  • We cannot make interactions with an especially contended lock become O(N) in the number of waiters (any more than they are today, such as from the CPU's cache-coherency work).
  • We cannot change the size of a runtime.mutex; it needs to remain a single uintptr.
  • An M can hold multiple contended locks at the same time.
  • It's OK to grow the M struct by several words.
  • It's OK to approximate the total wait time (as we do with sync.Mutex).
  • The M struct has some alignment (8 bytes?), so we can use a few of the least significant bits of an muintptr as flags (as seen in lock_sema.go).
  • An M can be waiting for only a single lock at a time.
  • There's no appetite for a new sudom struct, to represent waiting Ms as the sudog represents waiting Gs.

For OSes where we use lock_sema.go, an M that needs to block on a runtime.mutex first pushes itself onto the head of a linked list (runtime.mutex.key, linked via runtime.m.nextwaitm), and then sleeps on its own private semaphore. When the current holder of the lock is done, it pops an M off of the list and wakes that specific M. We could add fields to the M for the number of Ms after it on the list, for the time it went to sleep, and for the time that the tail M went to sleep (or a pointer to the tail M). I think that would be sufficient information to match the approximation we do for sync.Mutex via sudog, and would need only a simple set of invariants to update when pushing and popping Ms to/from this list.
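
A sketch of the per-M bookkeeping that paragraph proposes, with hypothetical field names (the real runtime.m struct, muintptr handling, and the atomics in lock_sema.go are omitted):

```go
// Hypothetical sketch; not the runtime's m struct or lock_sema.go code. In
// the real runtime the list head lives in the mutex key and is updated with
// atomic operations.
type waitM struct {
	nextwaitm     *waitM // next M on the mutex's LIFO wait list
	waitersAfter  int32  // how many Ms are queued after this one
	waitStart     int64  // when this M went to sleep
	tailWaitStart int64  // when the current tail of the list went to sleep
}

// push records mp at the head of the list, keeping enough information to
// later approximate the total wait time the way sync.Mutex does via sudog.
func push(head **waitM, mp *waitM, now int64) {
	mp.nextwaitm = *head
	mp.waitStart = now
	if *head == nil {
		mp.waitersAfter = 0
		mp.tailWaitStart = now
	} else {
		mp.waitersAfter = (*head).waitersAfter + 1
		mp.tailWaitStart = (*head).tailWaitStart
	}
	*head = mp
}
```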


For OSes such as Linux where we use lock_futex.go, the kernel-controlled wakeup order means we'll need something more.

  • The contents of runtime.mutex.key will point to the waiting M at the head of the linked list (as in lock_sema.go).
  • The list must be doubly-linked so that an M in the middle (if it's the one the kernel chooses to wake) can remove itself.
  • Any number of Ms may be adding themselves to the list concurrently (in the top half of runtime.lock).
  • But only one M will be reading/editing wait start timestamps and element counts (in the top half of runtime.unlock) OR removing itself from the list (in the bottom half of runtime.lock) at a time.
  • Also, if there are waiting Ms, the mutex must always reflect that fact, which means removing lock_futex.go's initial atomic.Xchg speculative grab.

This uses space on the waiting Ms to store the necessary information; they're not using it for anything else while they're blocked.
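
A sketch of the constant-time self-removal that the doubly-linked list makes possible when the kernel wakes an arbitrary M (hypothetical types; the real list is threaded through fields on the waiting Ms):

```go
// Hypothetical sketch only; the runtime threads these links through M fields
// rather than through a separate node type.
type waiter struct {
	prev, next *waiter
}

type waitList struct {
	head, tail *waiter
	count      int
}

// remove unlinks w in O(1) no matter where it sits in the list, so waking
// any M (the futex kernel's choice) never requires scanning the waiters.
func (l *waitList) remove(w *waiter) {
	if w.prev != nil {
		w.prev.next = w.next
	} else {
		l.head = w.next
	}
	if w.next != nil {
		w.next.prev = w.prev
	} else {
		l.tail = w.prev
	}
	w.prev, w.next = nil, nil
	l.count--
}
```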

I'll prepare a CL to implement this plan, but if I've gotten some assumptions wrong or it's otherwise clear that this won't work, it may be easier to spot and discuss that here rather than in the details of code review. Thanks!

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 23, 2024
@rhysh rhysh self-assigned this Apr 24, 2024
@rhysh rhysh added this to the Go1.23 milestone Apr 24, 2024
@prattmic
Member

At a glance, this plan seems reasonable to me.

One note on assumed constraints:

We cannot change the size of a runtime.mutex; it needs to remain a single uintptr.

While it is nice to avoid increasing the size of a mutex, I don't think this needs to be a hard requirement. Taking a look at where we use mutex, it looks like timers and channels probably have the most objects and would thus have the biggest impact. But neither of those structures is tiny, so I don't think it would be the end of the world if they get bigger.

@gopherbot
Contributor

Change https://go.dev/cl/585635 mentions this issue: runtime: prepare for extensions to waiting M list

@gopherbot
Contributor

Change https://go.dev/cl/585639 mentions this issue: runtime: remove GODEBUG=runtimecontentionstacks

@gopherbot
Contributor

Change https://go.dev/cl/585638 mentions this issue: runtime: profile mutex contention during unlock

@gopherbot
Contributor

Change https://go.dev/cl/585636 mentions this issue: runtime: use semaphore structure for futex locks

@gopherbot
Contributor

Change https://go.dev/cl/585637 mentions this issue: runtime: double-link list of waiting Ms

@gopherbot
Contributor

Change https://go.dev/cl/586237 mentions this issue: runtime: improve runtime-internal mutex profile tests

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label May 21, 2024
gopherbot pushed a commit that referenced this issue May 21, 2024
Move the nextwaitm field into a small struct, in preparation for
additional metadata to track how long Ms need to wait for locks.

For #66999

Change-Id: Ib40e43c15cde22f7e35922641107973d99439ecd
Reviewed-on: https://go-review.googlesource.com/c/go/+/585635
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
Prepare the futex-based implementation of lock2 to maintain a list of
waiting Ms. Beyond storing an muintptr in the mutex's key field, we now
must never overwrite that field (even for a moment) without taking its
current value into account.

The semaphore-based implementation of lock2 already has that behavior.
Reuse that structure.

For #66999

Change-Id: I23b6f6bacb276fe33c6aed5c0571161a7e71fe6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/585636
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
When an M unlocks a contended mutex, it needs to consult a list of the
Ms that had to wait during its critical section. This allows the M to
attribute the appropriate amount of blame to the unlocking call stack.

Mirroring the implementation for users' sync.Mutex contention (via
sudog), we can (in a future commit) use the time that the head and tail
of the wait list started waiting, and the number of waiters, to estimate
the sum of the Ms' delays.

When an M acquires the mutex, it needs to remove itself from the list of
waiters. Since the futex-based lock implementation leaves the OS in
control of the order of M wakeups, we need to be prepared for quickly
(constant time) removing any M from the list.

First, have each M add itself to a singly-linked wait list when it finds
that its lock call will need to sleep. This case is safe against
live-lock, since any delay to one M adding itself to the list would be
due to another M making durable progress.

Second, have the M that holds the lock (either right before releasing,
or right after acquiring) update metadata on the list of waiting Ms to
double-link the list and maintain a tail pointer and waiter count. That
work is amortized-constant: we'll avoid contended locks becoming
proportionally more contended and undergoing performance collapse.

For #66999

Change-Id: If75cdea915afb59ccec47294e0b52c466aac8736
Reviewed-on: https://go-review.googlesource.com/c/go/+/585637
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
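
The amortized-constant claim in the CL description above can be illustrated with a sketch (hypothetical types, not the actual runtime code): waiters push onto the front of a singly-linked list, and the lock holder later double-links only the newly added prefix, so each node is touched a bounded number of times overall.

```go
// Hypothetical sketch of the lazy double-linking; not the runtime's code.
type mWaiter struct {
	next, prev *mWaiter
}

type waitMeta struct {
	linked *mWaiter // head of the portion that is already double-linked
	tail   *mWaiter
	count  int
}

// fixup is called by the M that holds the lock. It walks only the waiters
// pushed since the last call, filling in prev pointers and maintaining the
// tail pointer and count: amortized-constant work per waiter.
func fixup(head *mWaiter, meta *waitMeta) {
	for w := head; w != nil && w != meta.linked; w = w.next {
		if w.next != nil {
			w.next.prev = w
		} else {
			meta.tail = w // the first-ever waiter becomes the tail
		}
		meta.count++
	}
	meta.linked = head
}
```
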
gopherbot pushed a commit that referenced this issue May 21, 2024
Go 1.22 promised to remove the setting in a future release once the
semantics of runtime-internal lock contention matched that of
sync.Mutex. That work is done; remove the setting.

For #66999

Change-Id: I3c4894148385adf2756d8754e44d7317305ad758
Reviewed-on: https://go-review.googlesource.com/c/go/+/585639
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
Have the test use the same clock (cputicks) as the profiler, and use the
test's own measurements as hard bounds on the magnitude to expect in the
profile.

Compare the depiction of two users of the same lock: one where the
critical section is fast, one where it is slow. Confirm that the profile
shows the slow critical section as a large source of delay (with #66999
fixed), rather than showing the fast critical section as a large
recipient of delay.

For #64253
For #66999

Change-Id: I784c8beedc39de564dc8cee42060a5d5ce55de39
Reviewed-on: https://go-review.googlesource.com/c/go/+/586237
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Contributor

Change https://go.dev/cl/587175 mentions this issue: runtime/pprof: shorten TestGoroutineSwitch

@gopherbot
Contributor

Change https://go.dev/cl/586796 mentions this issue: runtime: split mutex profile clocks

gopherbot pushed a commit that referenced this issue May 22, 2024
Mutex contention measurements work with two clocks: nanotime for use in
runtime/metrics, and cputicks for the runtime/pprof profile. They're
subject to different sampling rates: the runtime/metrics view is always
enabled, but the profile is adjustable and is turned off by default.
They have different levels of overhead: it can take as little as one
instruction to read cputicks while nanotime calls are more elaborate
(although some platforms implement cputicks as a nanotime call). The use
of the timestamps is also different: the profile's view needs to attach
the delay in some Ms' lock2 calls to another M's unlock2 call stack, but
the metric's view is only an int64.

Treat them differently. Don't bother threading the nanotime clock
through to the unlock2 call, measure and report it directly within
lock2. Sample nanotime at a constant gTrackingPeriod.

Don't consult any clocks unless the mutex is actually contended.

Continue liberal use of cputicks for now.

For #66999

Change-Id: I1c2085ea0e695bfa90c30fadedc99ced9eb1f69e
Reviewed-on: https://go-review.googlesource.com/c/go/+/586796
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
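
An illustration of the clock split described in the CL above, under stated assumptions: cheapTicks stands in for cputicks, expensiveNanos for nanotime, and samplePeriod plays the role of a gTrackingPeriod-style constant; none of these names are the runtime's.

```go
package clocksketch

import "time"

// Stand-ins for cputicks and nanotime; illustrative only.
func cheapTicks() int64     { return time.Now().UnixNano() }
func expensiveNanos() int64 { return time.Now().UnixNano() }

const samplePeriod = 8 // stand-in for a gTrackingPeriod-style constant

type contentionClocks struct {
	events int64 // contended lock acquisitions seen so far
}

// beginWait is called only when a lock acquisition is actually contended.
// The cheap clock is read every time (it feeds the sampled pprof profile);
// the expensive clock is read only periodically, keeping the always-on
// runtime/metrics path cheap.
func (c *contentionClocks) beginWait() (startTicks, startNanos int64, sampled bool) {
	c.events++
	startTicks = cheapTicks()
	if sampled = c.events%samplePeriod == 0; sampled {
		startNanos = expensiveNanos()
	}
	return
}
```
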
@gopherbot
Contributor

Change https://go.dev/cl/589096 mentions this issue: Revert "runtime: improve runtime-internal mutex profile tests"

@gopherbot
Contributor

Change https://go.dev/cl/589116 mentions this issue: Revert "runtime: double-link list of waiting Ms"

@gopherbot
Contributor

Change https://go.dev/cl/589098 mentions this issue: Revert "runtime: use semaphore structure for futex locks"

@gopherbot
Contributor

Change https://go.dev/cl/589097 mentions this issue: Revert "runtime: remove GODEBUG=runtimecontentionstacks"

@gopherbot
Contributor

Change https://go.dev/cl/589115 mentions this issue: Revert "runtime: profile mutex contention during unlock"

@gopherbot
Contributor

Change https://go.dev/cl/589117 mentions this issue: Revert "runtime: prepare for extensions to waiting M list"

@gopherbot
Contributor

Change https://go.dev/cl/589095 mentions this issue: Revert "runtime: split mutex profile clocks"

gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 8ab131f (CL 586796)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit f9ba2cf (CL 586237)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

This test verifies that the semantics of the mutex profile for
runtime.mutex values matches that of sync.Mutex values. Without the rest
of the patch series, this test would correctly identify that Go 1.22's
semantics are incorrect (issue #66999).

Updates #66999
Updates #67585

Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 87e930f (CL 585639)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit ba1c5b2 (CL 585638).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit d881ed6 (CL 585637).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit dfb7073 (CL 585636).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I3483bf0b85ba0b77204032a68b7cbe93f142703e
Reviewed-on: https://go-review.googlesource.com/c/go/+/589098
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit be0b569 (CL 585635).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@rhysh
Contributor Author

rhysh commented Jun 4, 2024

Reverted for the Go 1.23 cycle due to #67585

@rhysh rhysh reopened this Jun 4, 2024
@chabbimilind

chabbimilind commented Jun 6, 2024

This direction seems right.
We should blame the lock wait time on the lock holder (the concept is called blame shifting).
It should be easier in Go since there is already a list of lock waiters in the runtime.
It has been done with sampling profilers:
https://www.cs.rice.edu/~johnmc/papers/hpctoolkit-ppopp-2010.pdf

It is also part of the OpenMP runtime tools API (OMPT): https://www.openmp.org/wp-content/uploads/OMPT-SC13-BOF.pdf

@robaho

robaho commented Jul 19, 2024

Hmm. The blame shifting seems less useful than knowing what was blocked, where, and for how long. Ideally it would report both - what was blocked and for how long, and what the lock holder was doing - but this feels like nested profiling, which is pretty hard to follow.

@rhysh
Contributor Author

rhysh commented Jul 31, 2024

Yes, the semantics of the "mutex" profile are to shift blame to the slow critical section. Part of why this didn't make it for the Go 1.23 cycle is that we only have the explicit list of lock waiters for the lock_sema.go platforms, and the extra bookkeeping required to add it for GOOS=linux was an unwelcome slowdown.

It is also useful to know what was blocked and for how long—for blocking that takes place outside of the runtime, that's reported on the "block" profile. Do you have an example of where that would have been useful for runtime-internal blocking, @robaho (maybe an open issue that is/was hard to debug)? If so, I think the next step would be to open a new issue to discuss adding it. I think it's possible, but we'd need to get consensus on whether it's worth the costs.

Fixing this didn't happen for Go 1.23; I'm planning to try again for Go 1.24.

@gopherbot
Contributor

Change https://go.dev/cl/622997 mentions this issue: runtime: unify lock2, allow deeper sleep, control tail
