-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-21.1: kv: rationalize load-based range merging #65362
Merged
nvanbenschoten
merged 2 commits into
cockroachdb:release-21.1
from
nvanbenschoten:backport21.1-64199-64201
Jul 16, 2021
Merged
release-21.1: kv: rationalize load-based range merging #65362
nvanbenschoten
merged 2 commits into
cockroachdb:release-21.1
from
nvanbenschoten:backport21.1-64199-64201
Jul 16, 2021
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
irfansharif
approved these changes
May 18, 2021
Thanks for the reviews. I'm going to sit on this for a week or so longer to make sure it has a bit more time to bake on master. |
nvanbenschoten
force-pushed
the
backport21.1-64199-64201
branch
from
July 8, 2021 19:59
c6bfecb
to
e4f3593
Compare
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
Closes cockroachdb#62700. Re-addresses cockroachdb#41317. This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold. The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots. The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time. The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one. I've measured this change on the `clearrange` roachtest that we made heavy use of in cockroachdb#62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly. <todo insert images> Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
nvanbenschoten
force-pushed
the
backport21.1-64199-64201
branch
from
July 16, 2021 16:06
e4f3593
to
b461d2d
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport:
Please see individual PRs for details.
Closes #62700.
Fully addresses #41317.
This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold.
The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in #41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in #62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots.
The commit improves this situation by introducing a form of memory into the load-based split Decider. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time.
The mergeQueue is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one.
I've measured this change on the clearrange roachtest that we made heavy use of in #62700. As expected, it has the same effect as bumping up the kv.range_split.by_load_merge_delay high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly.
Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
Release justification: avoids instability during transient load blips. Baked on master for 2+ months.
/cc @cockroachdb/release