
Conversation

@wenyihu6 wenyihu6 (Contributor) commented Sep 5, 2025

Epic: none
Release note: none


kvserver: invoke callback when dropping replicas in bq.SetMaxSize

Previously, when SetMaxSize shrank the queue, replicas were dropped from the
priority queue without invoking their callbacks. This commit ensures that the
callbacks are invoked when SetMaxSize drops replicas. Replicas removed via
removeFromReplicaSetLocked (such as when a replica is destroyed) still don't
always have their callbacks invoked. While the natural place to invoke the
callback would be removeFromReplicaSetLocked, invoking callbacks while holding
a lock risks blocking for too long (though addInternal already does this).
This PR handles only the SetMaxSize case because it is intended to be
backported; a more complex but more principled approach can follow on master
if needed.
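
For concreteness, here is a minimal, self-contained Go sketch of the idea, not
the actual CockroachDB implementation: the baseQueue/replicaItem fields and the
callback plumbing are simplified assumptions, and only the names SetMaxSize,
addInternal, onEnqueueResult, and errDroppedDueToFullQueueSize come from the PR
and the review snippet below.

package queue

import "errors"

// Hypothetical, simplified stand-ins for the real types in
// pkg/kv/kvserver/queue.go; the fields and signatures here are
// assumptions for illustration only.
var errDroppedDueToFullQueueSize = errors.New("dropped due to full queue size")

type enqueueCallback interface {
	onEnqueueResult(indexOnHeap int, err error)
}

type replicaItem struct {
	callbacks []enqueueCallback
}

type baseQueue struct {
	maxSize int
	items   []*replicaItem // simplified stand-in for the priority queue
}

// SetMaxSize shrinks the queue to the new maximum and, unlike before this
// change, notifies the callbacks of every replica that gets dropped.
func (bq *baseQueue) SetMaxSize(maxSize int) {
	bq.maxSize = maxSize
	for len(bq.items) > maxSize {
		last := len(bq.items) - 1
		item := bq.items[last]
		bq.items = bq.items[:last]
		// Tell the dropped replica's callbacks why it left the queue. Whether
		// this should run while holding the queue's mutex (as addInternal
		// does) is exactly what the review below discusses.
		for _, cb := range item.callbacks {
			cb.onEnqueueResult(-1 /* indexOnHeap */, errDroppedDueToFullQueueSize)
		}
	}
}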

@cockroach-teamcity (Member) commented:

This change is Reviewable

@wenyihu6 wenyihu6 changed the title from "kvserver: invoke callback SetMaxSize drops replicas" to "kvserver: invoke callback when SetMaxSize drops replicas" on Sep 5, 2025
@wenyihu6 wenyihu6 force-pushed the setqueuesize branch 2 times, most recently from 2b17d01 to 22e614c, on September 5, 2025 04:01
@wenyihu6 wenyihu6 marked this pull request as ready for review September 5, 2025 12:38
@wenyihu6 wenyihu6 requested a review from a team as a code owner September 5, 2025 12:38
@tbg tbg (Member) left a comment

@tbg reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @sumeerbhola)


-- commits line 13 at r1:
This isn't a backport, but it is on master. I guess you're saying that this PR is intended to be backported, and we might do something "more complex" on master in addition to this PR?


pkg/kv/kvserver/queue.go line 655 at r1 (raw file):

		bq.updateMetricsOnDroppedDueToFullQueue()
		for _, cb := range item.callbacks {
			cb.onEnqueueResult(-1 /*indexOnHeap*/, errDroppedDueToFullQueueSize)

in addInternal, we do call it under the lock. It would be good to be consistent if possible, but I'll also take a comment on the callback field that explains that it's sometimes called with and sometimes without the lock and that it's important to be mindful of that.

@wenyihu6 wenyihu6 (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @sumeerbhola, and @tbg)


-- commits line 13 at r1:

Previously, tbg (Tobias Grieger) wrote…

This isn't a backport, but it is on master. I guess you're saying that this PR is intended to be backported, and we might do something "more complex" on master in addition to this PR?

Clarified the language. I don’t plan to revisit this on master unless needed in the future, but lmk if you think we should.


pkg/kv/kvserver/queue.go line 655 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

in addInternal, we do call it under the lock. It would be good to be consistent if possible, but I'll also take a comment on the callback field that explains that it's sometimes called with and sometimes without the lock and that it's important to be mindful of that.

Added comments to onEnqueueResult and onProcessResult.
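
Continuing the simplified sketch from the PR description above, the kind of
comment being asked for might read roughly like the fragment below; the
onProcessResult signature and the exact wording are assumptions for
illustration, not the text that was merged.

// processCallback is an illustrative stand-in for the callback hooks discussed
// above; only the method names come from the PR, and the signatures and doc
// text here are assumed.
type processCallback interface {
	// onEnqueueResult reports how the enqueue attempt resolved, e.g. with
	// errDroppedDueToFullQueueSize when SetMaxSize shrinks the queue. NB: it
	// may be invoked with or without the queue's mutex held (addInternal calls
	// it under the lock), so implementations must be cheap and must not
	// re-enter the queue.
	onEnqueueResult(indexOnHeap int, err error)

	// onProcessResult reports the outcome once the replica has been processed.
	// The same locking caveat applies.
	onProcessResult(err error)
}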

@wenyihu6 wenyihu6 (Contributor, Author) commented Sep 8, 2025

TFTR!

bors r=tbg

@wenyihu6 wenyihu6 (Contributor, Author) commented Sep 8, 2025

bors didn't pick this up for some reason, asked here https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1757347843104249.

bors retry

@wenyihu6 wenyihu6 (Contributor, Author) commented Sep 8, 2025

bors still seems to be stuck - let's overwhelm it more.

bors r+

@wenyihu6 wenyihu6 (Contributor, Author) commented Sep 8, 2025

Revived : )

@craig craig bot merged commit a78d085 into cockroachdb:master Sep 8, 2025
24 checks passed
@craig craig bot (Contributor) commented Sep 8, 2025

@blathers-crl blathers-crl bot commented Sep 8, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-24.3.20-rc-153008: POST https://api.github.com/repos/wenyihu6/cockroach/git/refs: 403 Resource not accessible by integration []

Backport to branch 24.3.20-rc failed. See errors above.


error creating backport branch refs/heads/blathers/backport-release-25.2.6-rc-153008: POST https://api.github.com/repos/wenyihu6/cockroach/git/refs: 403 Resource not accessible by integration []

Backport to branch 25.2.6-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 deleted the setqueuesize branch September 24, 2025 01:28