kv: rebalance snapshots can starve recovery snapshots with asymmetric settings #81832

Closed
nvanbenschoten opened this issue May 25, 2022 · 1 comment · Fixed by #83667
Labels
A-kv: Anything in KV that doesn't belong in a more specific category.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-postmortem: Originated from a Postmortem action item.
T-kv: KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented May 25, 2022

In a customer issue, we saw that rebalance snapshots could starve recovery snapshots. At a high level, this is because a node receives only one snapshot at a time, and rebalance and recovery snapshots share the same receiver-side semaphore. This alone is a problem, because the difference in importance between the two kinds of snapshot is not recognized.
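To make the shared-semaphore behavior concrete, here is a minimal sketch of a receiver that admits one snapshot at a time and ignores the snapshot kind. The names `receiverSem`, `tryAcquire`, and `snapshotKind` are hypothetical and chosen for illustration; this is a model of the behavior described above, not the actual CockroachDB code.

```go
package main

import "fmt"

// snapshotKind is a hypothetical label for the purpose of a snapshot.
type snapshotKind int

const (
	rebalance snapshotKind = iota
	recovery
)

// receiverSem models a receiver that admits a single snapshot at a time.
// It is an illustrative stand-in, not CockroachDB's real semaphore.
type receiverSem struct {
	slot chan struct{}
}

func newReceiverSem() *receiverSem {
	return &receiverSem{slot: make(chan struct{}, 1)}
}

// tryAcquire reports whether the snapshot was admitted. The kind argument is
// deliberately ignored: both kinds compete for the same slot, so the
// receiver cannot favor a recovery snapshot over a rebalance snapshot.
func (s *receiverSem) tryAcquire(_ snapshotKind) bool {
	select {
	case s.slot <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees the slot once the admitted snapshot has finished.
func (s *receiverSem) release() {
	<-s.slot
}

func main() {
	sem := newReceiverSem()
	fmt.Println("rebalance admitted:", sem.tryAcquire(rebalance))
	// The recovery snapshot is rejected while the rebalance holds the slot.
	fmt.Println("recovery admitted:", sem.tryAcquire(recovery))
	sem.release()
	fmt.Println("recovery admitted after release:", sem.tryAcquire(recovery))
}
```

In this model a recovery snapshot is only admitted once the in-progress rebalance snapshot releases the slot, regardless of how urgent the recovery is.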

The issue is more severe when the kv.snapshot_recovery.max_rate and kv.snapshot_rebalance.max_rate settings are given different values, because these values inform the timeouts assigned to snapshots. If the recovery rate is high and the rebalance rate is low, a recovery snapshot can have a timeout shorter than the expected duration of a single rebalance snapshot, so it can time out while waiting for an in-progress rebalance snapshot to finish. This means that any steady rebalance load can starve recovery snapshots. One potential mitigation for this last issue is to set the timeout for a snapshot based on min(kv.snapshot_recovery.max_rate, kv.snapshot_rebalance.max_rate).

Jira issue: CRDB-16088

Epic CRDB-16160

nvanbenschoten added the C-bug and A-kv labels on May 25, 2022
blathers-crl bot added the T-kv label on May 25, 2022
nvanbenschoten changed the title from "kv: rebalance snapshots can starve recovery snapshots" to "kv: rebalance snapshots can starve recovery snapshots with asymmetric settings" on May 25, 2022
lunevalex added the O-postmortem label on May 27, 2022
@bdarnell
Contributor

See also #63728 (and #39200)
