
runtime: correct attribution of internal mutex contention #66999

Open

rhysh opened this issue Apr 23, 2024 · 21 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.), NeedsFix (The path to resolution is known, but the work has not been done.)

@rhysh
Contributor

rhysh commented Apr 23, 2024

As of Go 1.22, the mutex profile measures contention on internal mutex values, but attributes the contention to the function runtime._LostContendedRuntimeLock. This should be improved.

When running with GODEBUG=runtimecontentionstacks=1, the mutex profile attributes to the caller of runtime.unlock all the delay that the M encountered during its preceding runtime.lock call to obtain that lock. This is incorrect: the delay attributed to a runtime.unlock call should instead correspond to the total time that the other Ms were unable to proceed because this M was holding the lock.
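
For concreteness, here is a minimal way to collect the mutex profile (standard runtime and runtime/pprof APIs; the workload itself is left as a placeholder):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every contention event; real programs usually use a larger
	// sampling rate than 1.
	runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(0)

	// ... run a workload that contends on runtime-internal locks ...

	// Write out the "mutex" profile. With Go 1.22 semantics, delay on
	// runtime-internal locks is attributed to
	// runtime._LostContendedRuntimeLock unless
	// GODEBUG=runtimecontentionstacks=1 is set.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```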

I have the start of a plan to resolve this, so I'm seeking feedback from @mknyszek and @prattmic (or @golang/runtime or others) on whether I'm on the right track — both in the start of the plan, and on whether I'm working to the right set of constraints. Thank you!


The mutex profile for sync.Mutex uses fields on the sudog to track how many Gs have been waiting and for about how long. The sudogs form a linked list of waiters, and are dequeued under the control of the runtime. When waking a G, the caller of sync.Mutex.Unlock is able to claim responsibility for the delay that it specifically caused, and update the remaining waiters' start time so the next Unlocker can claim the right amount too. We'll need similar information for the Ms that hold runtime-internal locks.
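
As a rough illustration of that approximation (hypothetical names, not the runtime's actual sudog/sema code): the unlocker charges itself the head waiter's delay, scaled by the number of waiters still queued, and then resets the start time so the next unlocker only claims its own share.

```go
// Illustrative sketch only; field and function names are hypothetical and
// do not match the runtime's sema.go/sudog implementation.
type waitQueue struct {
	waiters   int   // Gs currently blocked on the mutex
	waitStart int64 // when the head of the queue began waiting (ns)
}

// chargeUnlocker runs on behalf of the G that held the lock. It returns the
// delay to attribute to that G's unlock call stack, approximating the sum of
// all waiters' delays without walking the whole list.
func chargeUnlocker(q *waitQueue, now int64) (blame int64) {
	if q.waiters == 0 {
		return 0
	}
	dt := now - q.waitStart
	blame = dt * int64(q.waiters) // pretend every waiter waited as long as the head
	q.waiters--                   // one waiter is woken and dequeued
	q.waitStart = now             // remaining waiters are re-timed from now
	return blame
}
```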

I'd like to say a few assumed constraints out loud:

  • On Linux, futex(&l.key, FUTEX_WAKE, 1) can wake any M that was waiting on l.key.
  • We cannot change the dynamics of locks on Linux (such as by putting the runtime in charge of picking an M to wake, as it is with the semaphore-based implementation).
  • We cannot make interactions with an especially contended lock become O(N) in the number of waiters (any more than they are today, such as from the CPU's cache-coherency work).
  • We cannot change the size of a runtime.mutex; it needs to remain a single uintptr.
  • An M can hold multiple contended locks at the same time.
  • It's OK to grow the M struct by several words.
  • It's OK to approximate the total wait time (as we do with sync.Mutex).
  • The M struct has some alignment (8 bytes?), so we can use a few of the least significant bits of an muintptr as flags (as seen in lock_sema.go).
  • An M can be waiting for only a single lock at a time.
  • There's no appetite for a new sudom struct, to represent waiting Ms as the sudog represents waiting Gs.

For OSes where we use lock_sema.go, an M that needs to block on a runtime.mutex first pushes itself onto the head of a linked list (runtime.mutex.key, linked via runtime.m.nextwaitm), and then sleeps on its own private semaphore. When the current holder of the lock is done, it pops an M off of the list and wakes that specific M. We could add fields to the M for the number of Ms after it on the list, for the time it went to sleep, and for the time that the tail M went to sleep (or a pointer to the tail M). I think that would be sufficient information to match the approximation we do for sync.Mutex via sudog, and would need only a simple set of invariants to update when pushing and popping Ms to/from this list.
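
A sketch of the per-M bookkeeping that paragraph proposes, with hypothetical field names (the real runtime.m struct, muintptr handling, and the atomics in lock_sema.go are omitted):

```go
// Hypothetical sketch; not the runtime's m struct or lock_sema.go code. In
// the real runtime the list head lives in the mutex key and is updated with
// atomic operations.
type waitM struct {
	nextwaitm     *waitM // next M on the mutex's LIFO wait list
	waitersAfter  int32  // how many Ms are queued after this one
	waitStart     int64  // when this M went to sleep
	tailWaitStart int64  // when the current tail of the list went to sleep
}

// push records mp at the head of the list, keeping enough information to
// later approximate the total wait time the way sync.Mutex does via sudog.
func push(head **waitM, mp *waitM, now int64) {
	mp.nextwaitm = *head
	mp.waitStart = now
	if *head == nil {
		mp.waitersAfter = 0
		mp.tailWaitStart = now
	} else {
		mp.waitersAfter = (*head).waitersAfter + 1
		mp.tailWaitStart = (*head).tailWaitStart
	}
	*head = mp
}
```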


For OSes such as Linux where we use lock_futex.go, the kernel-controlled wakeup order means we'll need something more.

  • The contents of runtime.mutex.key will point to the waiting M at the head of the linked list (as in lock_sema.go).
  • The list must be doubly-linked so that an M in the middle (if it's the one the kernel chooses to wake) can remove itself.
  • Any number of Ms may be adding themselves to the list concurrently (in the top half of runtime.lock).
  • But only one M will be reading/editing wait start timestamps and element counts (in the top half of runtime.unlock) OR removing itself from the list (in the bottom half of runtime.lock) at a time.
  • Also, if there are waiting Ms, the mutex must always reflect that fact, which means removing lock_futex.go's initial atomic.Xchg speculative grab.

This uses space on the waiting Ms to store the necessary information; they're not using it for anything else while they're blocked.
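
A sketch of the constant-time self-removal that the doubly-linked list makes possible when the kernel wakes an arbitrary M (hypothetical types; the real list is threaded through fields on the waiting Ms):

```go
// Hypothetical sketch only; the runtime threads these links through M fields
// rather than through a separate node type.
type waiter struct {
	prev, next *waiter
}

type waitList struct {
	head, tail *waiter
	count      int
}

// remove unlinks w in O(1) no matter where it sits in the list, so waking
// any M (the futex kernel's choice) never requires scanning the waiters.
func (l *waitList) remove(w *waiter) {
	if w.prev != nil {
		w.prev.next = w.next
	} else {
		l.head = w.next
	}
	if w.next != nil {
		w.next.prev = w.prev
	} else {
		l.tail = w.prev
	}
	w.prev, w.next = nil, nil
	l.count--
}
```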

I'll prepare a CL to implement this plan, but if I've gotten some assumptions wrong or it's otherwise clear that this won't work, it may be easier to spot and discuss that here rather than in the details of code review. Thanks!

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 23, 2024
@rhysh rhysh self-assigned this Apr 24, 2024
@rhysh rhysh added this to the Go1.23 milestone Apr 24, 2024
@prattmic
Member

At a glance, this plan seems reasonable to me.

One note on assumed constraints:

We cannot change the size of a runtime.mutex; it needs to remain a single uintptr.

While it is nice to avoid increasing the size of a mutex, I don't think this needs to be a hard requirement. Taking a look at where we use mutex, it looks like timers and channels probably have the most objects and would thus have the biggest impact. But neither of those structures is tiny, so I don't think it would be the end of the world if they get bigger.

@gopherbot
Contributor

Change https://go.dev/cl/585635 mentions this issue: runtime: prepare for extensions to waiting M list

@gopherbot
Contributor

Change https://go.dev/cl/585639 mentions this issue: runtime: remove GODEBUG=runtimecontentionstacks

@gopherbot
Contributor

Change https://go.dev/cl/585638 mentions this issue: runtime: profile mutex contention during unlock

@gopherbot
Contributor

Change https://go.dev/cl/585636 mentions this issue: runtime: use semaphore structure for futex locks

@gopherbot
Contributor

Change https://go.dev/cl/585637 mentions this issue: runtime: double-link list of waiting Ms

@gopherbot
Contributor

Change https://go.dev/cl/586237 mentions this issue: runtime: improve runtime-internal mutex profile tests

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label May 21, 2024
gopherbot pushed a commit that referenced this issue May 21, 2024
Move the nextwaitm field into a small struct, in preparation for
additional metadata to track how long Ms need to wait for locks.

For #66999

Change-Id: Ib40e43c15cde22f7e35922641107973d99439ecd
Reviewed-on: https://go-review.googlesource.com/c/go/+/585635
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
Prepare the futex-based implementation of lock2 to maintain a list of
waiting Ms. Beyond storing an muintptr in the mutex's key field, we now
must never overwrite that field (even for a moment) without taking its
current value into account.

The semaphore-based implementation of lock2 already has that behavior.
Reuse that structure.

For #66999

Change-Id: I23b6f6bacb276fe33c6aed5c0571161a7e71fe6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/585636
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
When an M unlocks a contended mutex, it needs to consult a list of the
Ms that had to wait during its critical section. This allows the M to
attribute the appropriate amount of blame to the unlocking call stack.

Mirroring the implementation for users' sync.Mutex contention (via
sudog), we can (in a future commit) use the time that the head and tail
of the wait list started waiting, and the number of waiters, to estimate
the sum of the Ms' delays.

When an M acquires the mutex, it needs to remove itself from the list of
waiters. Since the futex-based lock implementation leaves the OS in
control of the order of M wakeups, we need to be prepared for quickly
(constant time) removing any M from the list.

First, have each M add itself to a singly-linked wait list when it finds
that its lock call will need to sleep. This case is safe against
live-lock, since any delay to one M adding itself to the list would be
due to another M making durable progress.

Second, have the M that holds the lock (either right before releasing,
or right after acquiring) update metadata on the list of waiting Ms to
double-link the list and maintain a tail pointer and waiter count. That
work is amortized-constant: we'll avoid contended locks becoming
proportionally more contended and undergoing performance collapse.

For #66999

Change-Id: If75cdea915afb59ccec47294e0b52c466aac8736
Reviewed-on: https://go-review.googlesource.com/c/go/+/585637
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
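
The amortized-constant claim in the CL description above can be illustrated with a sketch (hypothetical types, not the actual runtime code): waiters push onto the front of a singly-linked list, and the lock holder later double-links only the newly added prefix, so each node is touched a bounded number of times overall.

```go
// Hypothetical sketch of the lazy double-linking; not the runtime's code.
type mWaiter struct {
	next, prev *mWaiter
}

type waitMeta struct {
	linked *mWaiter // head of the portion that is already double-linked
	tail   *mWaiter
	count  int
}

// fixup is called by the M that holds the lock. It walks only the waiters
// pushed since the last call, filling in prev pointers and maintaining the
// tail pointer and count: amortized-constant work per waiter.
func fixup(head *mWaiter, meta *waitMeta) {
	for w := head; w != nil && w != meta.linked; w = w.next {
		if w.next != nil {
			w.next.prev = w
		} else {
			meta.tail = w // the first-ever waiter becomes the tail
		}
		meta.count++
	}
	meta.linked = head
}
```
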
gopherbot pushed a commit that referenced this issue May 21, 2024
Go 1.22 promised to remove the setting in a future release once the
semantics of runtime-internal lock contention matched that of
sync.Mutex. That work is done; remove the setting.

For #66999

Change-Id: I3c4894148385adf2756d8754e44d7317305ad758
Reviewed-on: https://go-review.googlesource.com/c/go/+/585639
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 21, 2024
Have the test use the same clock (cputicks) as the profiler, and use the
test's own measurements as hard bounds on the magnitude to expect in the
profile.

Compare the depiction of two users of the same lock: one where the
critical section is fast, one where it is slow. Confirm that the profile
shows the slow critical section as a large source of delay (with #66999
fixed), rather than showing the fast critical section as a large
recipient of delay.

For #64253
For #66999

Change-Id: I784c8beedc39de564dc8cee42060a5d5ce55de39
Reviewed-on: https://go-review.googlesource.com/c/go/+/586237
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Contributor

Change https://go.dev/cl/587175 mentions this issue: runtime/pprof: shorten TestGoroutineSwitch

@gopherbot
Contributor

Change https://go.dev/cl/586796 mentions this issue: runtime: split mutex profile clocks

gopherbot pushed a commit that referenced this issue May 22, 2024
Mutex contention measurements work with two clocks: nanotime for use in
runtime/metrics, and cputicks for the runtime/pprof profile. They're
subject to different sampling rates: the runtime/metrics view is always
enabled, but the profile is adjustable and is turned off by default.
They have different levels of overhead: it can take as little as one
instruction to read cputicks while nanotime calls are more elaborate
(although some platforms implement cputicks as a nanotime call). The use
of the timestamps is also different: the profile's view needs to attach
the delay in some Ms' lock2 calls to another M's unlock2 call stack, but
the metric's view is only an int64.

Treat them differently. Don't bother threading the nanotime clock
through to the unlock2 call, measure and report it directly within
lock2. Sample nanotime at a constant gTrackingPeriod.

Don't consult any clocks unless the mutex is actually contended.

Continue liberal use of cputicks for now.

For #66999

Change-Id: I1c2085ea0e695bfa90c30fadedc99ced9eb1f69e
Reviewed-on: https://go-review.googlesource.com/c/go/+/586796
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
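
An illustration of the clock split described in the CL above, under stated assumptions: cheapTicks stands in for cputicks, expensiveNanos for nanotime, and samplePeriod plays the role of a gTrackingPeriod-style constant; none of these names are the runtime's.

```go
package clocksketch

import "time"

// Stand-ins for cputicks and nanotime; illustrative only.
func cheapTicks() int64     { return time.Now().UnixNano() }
func expensiveNanos() int64 { return time.Now().UnixNano() }

const samplePeriod = 8 // stand-in for a gTrackingPeriod-style constant

type contentionClocks struct {
	events int64 // contended lock acquisitions seen so far
}

// beginWait is called only when a lock acquisition is actually contended.
// The cheap clock is read every time (it feeds the sampled pprof profile);
// the expensive clock is read only periodically, keeping the always-on
// runtime/metrics path cheap.
func (c *contentionClocks) beginWait() (startTicks, startNanos int64, sampled bool) {
	c.events++
	startTicks = cheapTicks()
	if sampled = c.events%samplePeriod == 0; sampled {
		startNanos = expensiveNanos()
	}
	return
}
```
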
@gopherbot
Contributor

Change https://go.dev/cl/589096 mentions this issue: Revert "runtime: improve runtime-internal mutex profile tests"

@gopherbot
Contributor

Change https://go.dev/cl/589116 mentions this issue: Revert "runtime: double-link list of waiting Ms"

@gopherbot
Contributor

Change https://go.dev/cl/589098 mentions this issue: Revert "runtime: use semaphore structure for futex locks"

@gopherbot
Contributor

Change https://go.dev/cl/589097 mentions this issue: Revert "runtime: remove GODEBUG=runtimecontentionstacks"

@gopherbot
Contributor

Change https://go.dev/cl/589115 mentions this issue: Revert "runtime: profile mutex contention during unlock"

@gopherbot
Contributor

Change https://go.dev/cl/589117 mentions this issue: Revert "runtime: prepare for extensions to waiting M list"

@gopherbot
Contributor

Change https://go.dev/cl/589095 mentions this issue: Revert "runtime: split mutex profile clocks"

gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 8ab131f (CL 586796)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit f9ba2cf (CL 586237)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

This test verifies that the semantics of the mutex profile for
runtime.mutex values matches that of sync.Mutex values. Without the rest
of the patch series, this test would correctly identify that Go 1.22's
semantics are incorrect (issue #66999).

Updates #66999
Updates #67585

Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 87e930f (CL 585639)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit ba1c5b2 (CL 585638).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit d881ed6 (CL 585637).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit dfb7073 (CL 585636).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I3483bf0b85ba0b77204032a68b7cbe93f142703e
Reviewed-on: https://go-review.googlesource.com/c/go/+/589098
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit be0b569 (CL 585635).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@rhysh
Contributor Author

rhysh commented Jun 4, 2024

Reverted for the Go 1.23 cycle due to #67585

@rhysh rhysh reopened this Jun 4, 2024
@chabbimilind

chabbimilind commented Jun 6, 2024

This direction seems right.
We should blame the lock wait time on the lock holder (the concept is called blame shifting).
It should be easier in Go since there is already a list of lock waiters in the runtime.
It has been done with sampling profilers:
https://www.cs.rice.edu/~johnmc/papers/hpctoolkit-ppopp-2010.pdf

It is also part of the OpenMP runtime tools API (OMPT): https://www.openmp.org/wp-content/uploads/OMPT-SC13-BOF.pdf

@robaho

robaho commented Jul 19, 2024

Hmm. The blame shifting seems less useful than knowing what was blocked, where, and for how long. Ideally it would report both - what was blocked and for how long, and what the lock holder was doing - but this feels like nested profiling, which is pretty hard to follow.

@rhysh
Contributor Author

rhysh commented Jul 31, 2024

Yes, the semantics of the "mutex" profile are to shift blame to the slow critical section. Part of why this didn't make it for the Go 1.23 cycle is that we only have the explicit list of lock waiters for the lock_sema.go platforms, and the extra bookkeeping required to add it for GOOS=linux was an unwelcome slowdown.

It is also useful to know what was blocked and for how long—for blocking that takes place outside of the runtime, that's reported on the "block" profile. Do you have an example of where that would have been useful for runtime-internal blocking, @robaho (maybe an open issue that is/was hard to debug)? If so, I think the next step would be to open a new issue to discuss adding it. I think it's possible, but we'd need to get consensus on whether it's worth the costs.

Fixing this didn't happen for Go 1.23; I'm planning to try again for Go 1.24.

@gopherbot
Contributor

Change https://go.dev/cl/622997 mentions this issue: runtime: unify lock2, allow deeper sleep, control tail
