Skip to content

Conversation

@wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented Aug 29, 2025

Part of: #151847
Resolves: #152022
Release note: none


kvserver: track priority inversion in replicate queue metrics

Previously, replicas could be enqueued at a high priority but end up processing
a lower-priority actions, causing priority inversion and unfairness to other
replicas behind them that needs a repair action. This commit adds metrics to
track such cases. In addition, this commit also adds metrics to track when
replicas are requeued in the replicate queue due to a priority inversion from a
repair action to a rebalance action.


kvserver: add TestPriorityInversionRequeue

Previously, we added priority inversion requeuing mechanism.

This commit adds a unit test that forces the race condition we suspected to be
happening in escalations involving priority inversion and asserts that priority
inversion occurs and that the replica is correctly requeued. Test set up:

  1. range’s leaseholder replica is rebalanced from one store to another.
  2. new leaseholder enqueues the replica for repair with high priority (e.g. to
    finalize the atomic replication change or remove a learner replica)
  3. before processing, the old leaseholder completes the change (exits the joint
    config or removes the learner).
  4. when the new leaseholder processes the replica, it computes a
    ConsiderRebalance action, resulting in a priority inversion and potentially
    blocking other high-priority work.

kvserver: delete per action priority inversion metrics

This commit removes per-action priority inversion metrics due to their high
cardinality. We already have logging in place, which should provide sufficient
observability. For now, we care about is priority inversion that leads to
consider rebalance and requeuing the most.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the newobs branch 2 times, most recently from cdbe813 to 6bb7860 Compare August 30, 2025 04:26
@wenyihu6 wenyihu6 changed the title kvserver: add PriorityInversionRequeue kvserver: track priority inversion in replicate queue metrics Aug 30, 2025
@wenyihu6 wenyihu6 marked this pull request as ready for review August 30, 2025 04:29
@wenyihu6 wenyihu6 requested review from a team as code owners August 30, 2025 04:29
@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg August 30, 2025 04:29
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Aug 30, 2025

@sumeerbhola re: the original comments you had on #152624 (review).

TFTR!

labeled metrics

Agree that the high cardinality is unfortunate. But I'm not sure whether there is a better solution. We need to backport this to v24.3. It doesn’t look like it was included yet (#143536). I’m also not certain if this feature is supported by tsdump. We can consider improving this only on master in a follow up pr. For now, I'm leaning towards taking the loss on high cardinality unless we are also interested in something around tracking the priority distribution in the priority queue. Wdyt?

enqueue counter metric

We only have metrics for measuring pending queue length for now

pending *metric.Gauge
. It's a gauge. I plan to add something for enqueue success and enqueue failures in a follow up pr. Update: follow up pr: #152787.

@wenyihu6
Copy link
Contributor Author

I’m adding a new test, TestPriorityInversionRequeue, to try to force the suspected race condition. It fails under stress on CI. I checked the logs, and it was because if n1 removes the learner too quickly, n4 may miss it when enqueuing to the replicate queue. I’m not too concerned, since the goal is mainly to reproduce and confirm that the race could occur. Things pass under stress for me locally. We might need another testing knob. Do you think this test is worth adding? If so, I’ll continue exploring ways to delay n1’s learner removal until n4 has a chance to enqueue.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically looks good, but let's sync on whether we really meant to add the metrics tracking inversions on a per-type basis.
Thank you for adding the test - nice to see the mechanism confirmed 🔥

@tbg reviewed 1 of 1 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, 2 of 3 files at r4, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/replicate_queue_test.go line 2564 at r7 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

I'd suggest you skip this test under duress preemptively.


pkg/kv/kvserver/replicate_queue.go line 322 at r7 (raw file):

		Unit:        metric.Unit_COUNT,
	}
	metaReplicateQueuePriorityInversionForAddVoterCount = metric.Metadata{

I thought we discussed that given that we have (rate-limited) logging and total counters, we could do away with the "per-inversion-type" counters. Am I misremembering this?

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/replicate_queue.go line 322 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I thought we discussed that given that we have (rate-limited) logging and total counters, we could do away with the "per-inversion-type" counters. Am I misremembering this?

😞 Don’t recall reaching a conclusion in that discussion. Might have missed some discussion points. I’ve removed it for now.


pkg/kv/kvserver/replicate_queue_test.go line 2564 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd suggest you skip this test under duress preemptively.

Done.

@wenyihu6 wenyihu6 requested a review from tbg September 2, 2025 01:36
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not approving yet, but approved once unstacked. 👍

@tbg reviewed 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 3, 2025

Rebased on top of master to pick up #152596.

@wenyihu6 wenyihu6 force-pushed the newobs branch 2 times, most recently from 152829d to 3f06ed0 Compare September 3, 2025 16:45
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 3, 2025

The last two pushes fixed a rebase mess-up and a linter warning.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@tbg reviewed 7 of 7 files at r9, 2 of 5 files at r10, 3 of 3 files at r12, 3 of 3 files at r13, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

Previously, replicas could be enqueued at a high priority but end up processing
a lower-priority actions, causing priority inversion and unfairness to other
replicas behind them that needs a repair action. This commit adds metrics to
track such cases. In addition, this commit also adds metrics to track when
replicas are requeued in the replicate queue due to a priority inversion from a
repair action to a rebalance action.
Previously, we added priority inversion requeuing mechanism.

This commit adds a unit test that forces the race condition we suspected to be
happening in escalations involving priority inversion and asserts that priority
inversion occurs and that the replica is correctly requeued. Test set up:

1. range’s leaseholder replica is rebalanced from one store to another.
2. new leaseholder enqueues the replica for repair with high priority (e.g. to
finalize the atomic replication change or remove a learner replica)
3. before processing, the old leaseholder completes the change (exits the joint
   config or removes the learner).
4. when the new leaseholder processes the replica, it computes a
   ConsiderRebalance action, resulting in a priority inversion and potentially
   blocking other high-priority work.
This commit removes per-action priority inversion metrics due to their high
cardinality. We already have logging in place, which should provide sufficient
observability. For now, we care about is priority inversion that leads to
consider rebalance and requeuing the most.
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

Rebased on master to resolve merge conflicts caused by related prs.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Sep 4, 2025

@craig craig bot merged commit d8dec00 into cockroachdb:master Sep 4, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: priority at enqueue time can differ from priority when being processed

3 participants