Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: Clean up snapshot rate limits #63728

Closed
bdarnell opened this issue Apr 15, 2021 · 1 comment · Fixed by #102596
Closed

kv: Clean up snapshot rate limits #63728

bdarnell opened this issue Apr 15, 2021 · 1 comment · Fixed by #102596
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team

Comments

@bdarnell
Copy link
Contributor

bdarnell commented Apr 15, 2021

Raft snapshots are currently governed by two rate limit settings, kv.snapshot_rebalance.max_rate and kv.snapshot_recovery.max_rate. They currently default to the same value and we recommend that if one is changed, the other be changed to match. In that case there is no good reason to have two settings; we should consolidate on one and get rid of the other (this has been a longstanding TODO in the code but there doesn't seem to have been an issue for it until now)

On the other hand, things have changed enough that we might want to improve the prioritization scheme and use two different values for the rate limits. The idea behind the two settings is that the higher priority "recovery" limit is used when the snapshot is necessary to bring a range in compliance with its survivability goals, with the lower "rebalance" limit used at other times. We deprecated the use of two separate rate limits for two reasons:

  • The classification was not always accurate and snapshots that should have been "recovery" priority were not getting marked as such (this was largely due to the messiness of the code in this area, and also some ambiguity in some cases due to the lack of explicit survival goals for diversity).
  • Because of the concurrentSnapshotApplyLimit (which is currently 1), we had priority inversion problems in which a slow rebalance snapshot would tie up the one allowed snapshot slot, preventing a higher-priority recovery snapshot from proceeding.

Now that we have streaming snapshots, we could consider raising the concurrentSnapshotApplyLimit (and reserve some slots for recovery snapshots). With atomic replica changes and explicit survival goals for multi-region clusters, we should now be able to accurately classify snapshots based on whether or not they are necessary for survival goals. With those two changes, the rebalance/recovery rate limit settings make sense again and we could leave them as two separate options.

Jira issue: CRDB-6748

@andrewbaptist
Copy link
Collaborator

My plan for this is to deprecate the recovery setting and only leave the rebalance setting. We would apply the recovery rate to all snapshot transfers from 23.2 onwards. Ideally, we will be able to continue to bump this rate (and eventually move it away from public facing) once admission control is able to properly mediate snapshot transfer rates.

@shralex shralex added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Apr 28, 2023
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Apr 28, 2023
Fixes: cockroachdb#63728

There were two settings that controlled snapshot transfer rates, however
this has caused numerous problems when they are set differently as
the code makes timeout assumptions based on this rate.

Epic: none

Release note: This change removes a user configurable setting
`kv.snapshot_recovery.max_rate`. Guidance to customers has been to
always set this equal to `kv.snapshot_rebalance.max_rate` and this
change will start using that setting for all snapshots. If the customer
previously set `kv.snapshot_recovery.max_rate` it will be cleared after
this is released and future attempts to set it will fail with:
"ERROR: unknown cluster setting 'kv.snapshot_recovery.max_rate'
craig bot pushed a commit that referenced this issue May 1, 2023
102596: kv: remove kv.snapshot_recovery.max_rate setting r=tbg a=andrewbaptist

Fixes: #63728

There were two settings that controlled snapshot transfer rates, however, this has caused numerous problems when they are set differently as the code makes timeout assumptions based on this rate.

Epic: none

Release note: This change removes a user configurable setting
`kv.snapshot_recovery.max_rate`. Guidance to customers has been to
always set this equal to `kv.snapshot_rebalance.max_rate` and this
change will start using that setting for all snapshots. If the customer
previously set `kv.snapshot_recovery.max_rate` it will be cleared after
this is released and future attempts to set it will fail with:
"ERROR: unknown cluster setting 'kv.snapshot_recovery.max_rate'

Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
@craig craig bot closed this as completed in 2980cd6 May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants