Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rac2: introduce cluster setting to reset token counters #133202

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

sumeerbhola
Copy link
Collaborator

kvadmission.flow_controller.token_reset_epoch is an escape hatch for cluster operators to reset RACv2 token counters to the full state.

The operator should increment this epoch (or change it to a value different than before). This can be used to counteract a token leakage bug, but note that if there is indeed a bug, the leakage may resume, and tokens may again be exhausted. So it is expected that this will be used together with disabling replication admission control by setting kvadmission.flow_control.enabled=false. Note that disabling replication admission control should be sufficient, since it should unblock work that is waiting-for-eval. But in case there is another bug that is preventing such work from unblocking, this setting may be useful.

Epic: CRDB-37515

Release note (ops change): The cluster setting
kvadmission.flow_controller.token_reset_epoch is an advanced setting that can be used to refill replication admission control v2 tokens. It should only be used after consultation with an expert.

@sumeerbhola sumeerbhola requested review from pav-kv and kvoli October 22, 2024 23:35
@sumeerbhola sumeerbhola requested a review from a team as a code owner October 22, 2024 23:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/kvflowcontrol.go line 129 at r1 (raw file):

// such work from unblocking, this setting may be useful.
var TokenCounterResetEpoch = settings.RegisterIntSetting(
	settings.SystemOnly, "kvadmission.flow_controller.token_reset_epoch",

nit: haven't seen more than one arg on a line for a setting declaration before, consider moving the name to a new line.

kvadmission.flow_controller.token_reset_epoch is an escape hatch for
cluster operators to reset RACv2 token counters to the full state.

The operator should increment this epoch (or change it to a value
different than before). This can be used to counteract a token leakage
bug, but note that if there is indeed a bug, the leakage may
resume, and tokens may again be exhausted. So it is expected that this
will be used together with disabling replication admission control by
setting kvadmission.flow_control.enabled=false. Note that disabling
replication admission control should be sufficient, since it should
unblock work that is waiting-for-eval. But in case there is another bug
that is preventing such work from unblocking, this setting may be useful.

Epic: CRDB-37515

Release note (ops change): The cluster setting
kvadmission.flow_controller.token_reset_epoch is an advanced setting
that can be used to refill replication admission control v2 tokens. It
should only be used after consultation with an expert.
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/kvflowcontrol.go line 129 at r1 (raw file):

Previously, kvoli (Austen) wrote…

nit: haven't seen more than one arg on a line for a setting declaration before, consider moving the name to a new line.

Done

@sumeerbhola sumeerbhola added the backport-24.3.x Flags PRs that need to be backported to 24.3 label Oct 23, 2024
@sumeerbhola
Copy link
Collaborator Author

bors r=kvoli

@craig craig bot merged commit 23adad6 into cockroachdb:master Oct 23, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.3.x Flags PRs that need to be backported to 24.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants