kvserver: actuate load-based replica rebalancing under heterogeneous localities #65379
Conversation
This needs a bit more love. Please hold off on reviewing.
Note: this patch does not deal with constraints-aware lease rebalancing. There are still some flakes I'm in the process of narrowing down, but this patch should be ready for feedback now. I'd suggest reviewing it commit by commit (if y'all weren't already going to do that). If it's still hard to review, please let me know if there's anything I can do to make it easier.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)
pkg/kv/kvserver/allocator_scorer.go, line 846 at r2 (raw file):
// candidate for having a replica removed from it given the candidate store
// list. This method returns true if there are any ranges that are on stores
// that lie _outside of the [underfullThreshold, overfullThreshold] window_ for
I think the comment you've added is not quite right; it doesn't reference the one particular store that this method deals with.
pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):
}

func overfullRangeThreshold(options scorerOptions, mean float64) float64 {
should all these functions error or fatal if the field they each want in `options` isn't set?
pkg/kv/kvserver/allocator_scorer.go, line 1391 at r2 (raw file):
}

func overfullQPSThreshold(options scorerOptions, mean float64) float64 {
nit: I liked this function more with `mean` factored out of the `math.Min`. Same for `underfullQPSThreshold` below.
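To ground the discussion of these threshold helpers, here is a minimal, hedged sketch of their shape: a store is "overfull"/"underfull" when its QPS deviates from the mean by the larger of a relative threshold and an absolute slack. The `scorerOptions` struct and the `minQPSThresholdDifference` constant here are illustrative stand-ins, not the real kvserver definitions.

```go
package main

import (
	"fmt"
	"math"
)

// scorerOptions is an illustrative stand-in for the struct discussed in the
// review; the real one lives in pkg/kv/kvserver/allocator_scorer.go.
type scorerOptions struct {
	qpsRebalanceThreshold float64
}

// minQPSThresholdDifference is an assumed absolute slack so that tiny means
// don't produce degenerate thresholds.
const minQPSThresholdDifference = 100.0

// overfullQPSThreshold: a store above this QPS is considered overfull.
func overfullQPSThreshold(options scorerOptions, mean float64) float64 {
	return mean + math.Max(mean*options.qpsRebalanceThreshold, minQPSThresholdDifference)
}

// underfullQPSThreshold: a store below this QPS is considered underfull.
func underfullQPSThreshold(options scorerOptions, mean float64) float64 {
	return mean - math.Max(mean*options.qpsRebalanceThreshold, minQPSThresholdDifference)
}

func main() {
	opts := scorerOptions{qpsRebalanceThreshold: 0.25}
	fmt.Println(overfullQPSThreshold(opts, 1000))  // 1250
	fmt.Println(underfullQPSThreshold(opts, 1000)) // 750
}
```

With a 25% threshold and a mean of 1000 QPS, the relative term (250) dominates the absolute slack, giving a [750, 1250] window around the mean.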
pkg/kv/kvserver/allocator_scorer.go, line 1399 at r2 (raw file):
}

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
nit: s/a 1 iff/1 if :P
pkg/kv/kvserver/allocator_scorer.go, line 1400 at r2 (raw file):
// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
// `sc` will _not_ converge its range count towards the mean of stores in `sl`.
You've left one "range count" in there. You probably want to adapt it to the new "options".
pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):
) int {
	if options.qpsRebalanceThreshold > 0 {
		// if `qpsRebalanceThreshold` is set, we disable the `convergesScore`.
Can you expand this comment? Why doesn't convergence apply to QPS? Same in the function below.
pkg/kv/kvserver/allocator_scorer.go, line 461 at r3 (raw file):
}

// rankedCandidateListForRemoval creates a candidate list of all existing
nit: I think it'd help if the comment on this method talked a bit about the context. So, someone has decided that one of the replicas of a range needs to go, and this function orders the replicas by load?
It'd also help to comment that `sl` contains the replicas. Perhaps rename it to `replicas`.
pkg/kv/kvserver/allocator_scorer.go, line 491 at r3 (raw file):
})
}
if options.deterministic {
if you can, please add a comment to `options.deterministic` and explain what that field is about
pkg/kv/kvserver/allocator_scorer.go, line 498 at r3 (raw file):
// We compute the converges and balance scores only in relation to stores that
// are the top candidates for removal based on diversity (i.e. only among
// candidates that are non-diverse relative to the rest of the replicas).
Can you expand on this comment, and spell out that, in the example below, we want 1 and 2 to be scored higher than 3 and 4? And then could you explain in prose how come the scoring will end up like that?
Another thing that would help is expanding the comments on the `worst` method. Right now it says something about "sharing the lowest constraint score", but I don't understand what that means.
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
candidates = candidates.worst()
can this be a `copy()`?
pkg/kv/kvserver/store_rebalancer.go, line 209 at r2 (raw file):
}

func (sr *StoreRebalancer) rebalanceStore(
consider giving this method a comment
pkg/kv/kvserver/store_rebalancer.go, line 216 at r2 (raw file):
	qpsRebalanceThreshold: qpsRebalanceThreshold.Get(&sr.st.SV),
}
qpsMinThreshold := underfullQPSThreshold(options, storeList.candidateQueriesPerSecond.mean)
consider putting a comment about these thresholds - how we only look at stores over the MaxThreshold and how we only transfer leases to stores below MinThreshold (right?)
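To make the asymmetry the comment would describe concrete, here is a hedged sketch (the `store` struct and function name are stand-ins, not kvserver types): only stores above the overfull/max threshold are treated as rebalance sources, and only stores below the underfull/min threshold are treated as lease-transfer targets; everything in between is left alone.

```go
package main

import "fmt"

// store is an illustrative stand-in for a kvserver store descriptor.
type store struct {
	id  int
	qps float64
}

// partitionByQPS splits stores into rebalance sources (above the overfull
// threshold) and candidate targets (below the underfull threshold). Stores
// whose QPS falls inside the window are neither sources nor targets.
func partitionByQPS(stores []store, overfull, underfull float64) (sources, targets []store) {
	for _, s := range stores {
		switch {
		case s.qps > overfull:
			sources = append(sources, s)
		case s.qps < underfull:
			targets = append(targets, s)
		}
	}
	return sources, targets
}

func main() {
	stores := []store{{1, 1500}, {2, 1000}, {3, 400}}
	src, tgt := partitionByQPS(stores, 1250, 750)
	fmt.Println(len(src), len(tgt)) // 1 1: store 1 is a source, store 3 a target
}
```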
Exciting to see this!
I am curious to get your take on the backportability of the changes here.
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3, 7 of 7 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)
pkg/kv/kvserver/allocator_scorer.go, line 1361 at r1 (raw file):
// the mean.
func rebalanceFromConvergesScore(sl StoreList, sc roachpb.StoreCapacity) int {
	if rebalanceConvergesOnMean(sl, sc, sc.RangeCount-1) {
minor nit: I think this would be slightly easier to read and contrast with `rebalanceToConvergesScore` as `if !rebalanceConvergesOnMean(...) { return 1 }; return 0`. Up to you.
pkg/kv/kvserver/allocator_scorer.go, line 184 at r2 (raw file):
	return -(2 + float64(o.convergesScore-c.convergesScore)/10.0)
}
if !scoresAlmostEqual(float64(c.balanceScore), float64(o.balanceScore)) {
Any reason to cast to float64 in this line and the next? It seems like we should be using int arithmetic as long as possible, like we do with `convergesScore`.
pkg/kv/kvserver/allocator_scorer.go, line 851 at r2 (raw file):
// NB: If the given `options` have `qpsRebalanceThreshold` set, this method
// makes its determination based on QPS. Otherwise, we fall back on using range
// count as a signal.
So we're only ever considering either QPS or range count, but not both? If so, let's be more explicit about that.
pkg/kv/kvserver/allocator_scorer.go, line 888 at r2 (raw file):
// If we reached this point, we're happy with the range where it is.
Let's add this here as well.
pkg/kv/kvserver/allocator_scorer.go, line 1347 at r2 (raw file):
//
// If the given options have qpsRebalanceThreshold set, we use that for
// computing the balanceScore.
"Otherwise ..."
pkg/kv/kvserver/allocator_scorer.go, line 1349 at r2 (raw file):
// computing the balanceScore.
//
// TODO(aayush): It would be nice to be able to compose the two dimensions of
Have we thought about what effect this PR will have if we merge it without addressing this TODO? Will we no longer make any effort to balance based on range count? Are we relying on some callers setting `qpsRebalanceThreshold` and others not?
pkg/kv/kvserver/allocator_scorer.go, line 1351 at r2 (raw file):
// TODO(aayush): It would be nice to be able to compose the two dimensions of
// balanceScore (QPS or RangeCount) and allow the `options` to simply specify an
// order of precedence. Callers who care about the balancing out QPS across
s/out/of/
pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
should all these functions error or fatal if the field they each want in `options` isn't set?
+1
pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Can you expand this comment? Why doesn't convergence apply to QPS? Same in the function below.
I'm also interested in this.
pkg/kv/kvserver/allocator_scorer.go, line 429 at r3 (raw file):
diversityScore := diversityAllocateScore(s, existingStoreLocalities)
balanceScore := balanceScore(candidateStores, s.Capacity, options)
var convergesScore int
Why did we remove this? I would have expected it to use `rebalanceToConvergesScore`.
pkg/kv/kvserver/allocator_scorer.go, line 514 at r3 (raw file):
// has more data constrained to it, both replicas 1 and 2 would be equally
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
The implicit assumption here that tripped me up for a while is that, by comparing against region B and C, both replicas in region A will get a discrete balance score of `moreThanMean` (or `overfull`), and so their balance will look equivalent. If the balance score was some kind of higher-resolution proportion, then we wouldn't run into this problem, because regardless of the denominator, the replica with more QPS would get a higher score.
Do you mind weaving something about this into the comment?
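The resolution-loss point being raised can be illustrated with a hedged sketch: the enum names mirror the buckets mentioned in this thread (`underfull`, `lessThanMean`, `moreThanMean`, `overfull`), but the bucketing function and thresholds here are assumptions, not the real scorer. Two stores with different QPS land in the same discrete bucket, so the score cannot rank one above the other.

```go
package main

import "fmt"

// balanceStatus mirrors the discrete buckets discussed above; the exact
// numeric values are illustrative.
type balanceStatus int

const (
	underfull balanceStatus = iota
	lessThanMean
	moreThanMean
	overfull
)

// discreteBalanceScore buckets a store's QPS relative to the mean and the
// overfull/underfull thresholds.
func discreteBalanceScore(qps, mean, overfullThresh, underfullThresh float64) balanceStatus {
	switch {
	case qps > overfullThresh:
		return overfull
	case qps > mean:
		return moreThanMean
	case qps < underfullThresh:
		return underfull
	default:
		return lessThanMean
	}
}

func main() {
	// Two region-A replicas with different QPS (1100 vs. 1200) both land in
	// moreThanMean, so the discrete score treats them as equivalent, even
	// though a continuous proportion would rank the 1200-QPS store higher.
	mean, over, under := 1000.0, 1250.0, 750.0
	fmt.Println(discreteBalanceScore(1100, mean, over, under) ==
		discreteBalanceScore(1200, mean, over, under)) // true
}
```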
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can this be a `copy()`?
Or at least a pre-allocated list.
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
candidates = candidates.worst()
So this means that only the worst candidates will be returned, right? So this function is no longer returning a ranked list of all stores in the input, it's returning a ranked list of the stores in the input with the worst diversity score. I think that's ok, given how the return value is used, but is that what you were expecting? Do we need to change the contract here?
pkg/kv/kvserver/allocator_scorer_test.go, line 1533 at r2 (raw file):
}{
	{sEmpty, underfull},
	{sMean, moreThanMean},
Do we want to add a test case that results in lessThanMean?
pkg/kv/kvserver/allocator_scorer_test.go, line 1556 at r2 (raw file):
	expBalanceScore balanceStatus
}{
	{0, 1},
Want to use your new enums here?
pkg/kv/kvserver/allocator_test.go, line 1363 at r2 (raw file):
// We don't expect any QPS based rebalancing when all stores are serving
// the same QPS.
testStores: allStoresEqual,
nit: it's worth being explicit: expectRebalance: false,
pkg/kv/kvserver/allocator_test.go, line 1446 at r3 (raw file):
{
	StoreID: 5,
	Node:    roachpb.NodeDescriptor{NodeID: 5, Locality: region("a")},
nit: any reason this is store 5 but has the same region as store 1? This tripped me up.
pkg/kv/kvserver/store_pool.go, line 657 at r4 (raw file):
}

// excludeInvalid takes a store list and removes stores that would be explicitly invalid
Want to pull this rename into a separate commit? That's probably better than having a complex change mixed in with a refactor.
pkg/kv/kvserver/store_rebalancer.go, line 209 at r4 (raw file):
// NB: The StoreRebalancer only cares about the convergence of QPS across
// stores, not the convergence of range count. So, we don't use the allocator's
// `scorerOptions` here, which set the range count rebalance threshold.
"Instead, we use our own scorerOptions, which sets the qps rebalance threshold."
pkg/kv/kvserver/store_rebalancer.go, line 400 at r4 (raw file):
}
if shouldNotMoveAway(ctx, replWithStats, localDesc, now, minQPS) {
Should we inline this call, now that it's not used down below?
pkg/kv/kvserver/store_rebalancer.go, line 650 at r4 (raw file):
	ctx context.Context, rbCtx rangeRebalanceContext,
) (finalVoterTargets, finalNonVoterTargets []roachpb.ReplicaDescriptor) {
	options := scorerOptions{
`sr.scorerOptions()`?
pkg/kv/kvserver/store_rebalancer_test.go, line 255 at r4 (raw file):
	expectedRebalancedVoters, expectedRebalancedNonVoters []roachpb.StoreID
}{
	{
We seem to be deleting a lot of test cases. Is this because we're expecting the test coverage provided by `TestChooseRangeToRebalance` to be subsumed by `TestChooseRangeToRebalanceRandom`? I wonder if we should keep both. Randomized and table-driven tests that exercise similar logic can live side-by-side. In fact, they're a great combination. One finds new kinds of failures and one protects against regressions on known test cases.
pkg/kv/kvserver/store_rebalancer_test.go, line 391 at r4 (raw file):
	ctx context.Context, allStores, deadStores []*roachpb.StoreDescriptor, meanQPS float64,
) {
	var summary bytes.Buffer
nit: strings.Builder
pkg/kv/kvserver/store_rebalancer_test.go, line 617 at r4 (raw file):
},
// Two replicas are in the hot region, both on relatively heavily loaded
// nodes. We one of those replicas to be moved to a less busy store within
"We expect one"
pkg/kv/kvserver/store_rebalancer_test.go, line 652 at r4 (raw file):
	expRebalancedVoters: []roachpb.StoreID{8, 6, 3},
	qps:                 50,
},
Do you think it's worth adding some test cases that use constraints that look exactly like we'll generate in multi-region databases (for zone and region survival)?
Pinging for next steps here. Get it baking on master and then look at a 21.1 backport?
This needs a few more review cycles. I have not been able to spend any time on this in the last little bit, but hope to in the next couple weeks.
This commit fixes the regression(s) introduced by cockroachdb#65379, where we observed replica thrashing in various workloads (cockroachdb#70396 and cockroachdb#71244).

The following is a description of the differences between the QPS-based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before) and the current one.

** lease rebalancing **

*** release 21.2 and before ***

QPS-based lease rebalancing in CRDB 21.2 considers the overall cluster-level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery).

*** master and this commit ***

In cockroachdb#65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the QPS delta between the hottest and the coldest stores in the replica set.

This commit adds some padding to this logic by ensuring that the QPS difference between the store relinquishing the lease and the store receiving it is at least 200 QPS. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder.

** replica rebalancing **

*** release 21.2 and before ***

QPS-based replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster-level QPS average and overfull and underfull thresholds. Based on these thresholds, we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder.

*** master and this commit ***

The approach implemented by cockroachdb#65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load-based statistics per equivalence class. Previously, this machinery was only used for range-count-based replica rebalancing (performed by the `replicateQueue`), but not for QPS-based rebalancing. This commit implements a similar approach to what we now do for lease rebalancing, which is to determine whether a rebalance action would reduce the QPS delta between the hottest and the coldest store in the equivalence class. It also adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 QPS. Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores.

An important thing to note about the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. This behaves poorly in heterogeneously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull"; this usually means that all stores in those localities would be above the "overfull" threshold for the cluster. The logic described above would then effectively not do anything, since there are no underfull stores to move replicas to.

Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogeneously loaded cluster localities has been fixed. Additionally, a limitation which prevented CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.
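The 200 QPS padding described in the commit message can be sketched as follows. This is a minimal illustration, not the real Allocator API: the function name, signature, and the post-transfer comparison are assumptions; only the 200 QPS minimum delta comes from the text above.

```go
package main

import "fmt"

// minQPSDifferenceForTransfers mirrors the 200 QPS padding described in the
// commit message; the constant name is illustrative.
const minQPSDifferenceForTransfers = 200.0

// shouldTransferLease sketches the two safeguards: the sender must be at
// least 200 QPS hotter than the receiver, and the receiver must not become
// significantly hotter than the (relieved) sender after absorbing the
// lease's QPS.
func shouldTransferLease(senderQPS, receiverQPS, leaseQPS float64) bool {
	if senderQPS-receiverQPS < minQPSDifferenceForTransfers {
		return false // delta too small to justify a transfer
	}
	// After the transfer, the receiver gains the lease's QPS and the sender
	// sheds it; reject transfers that would overshoot.
	if receiverQPS+leaseQPS > senderQPS-leaseQPS+minQPSDifferenceForTransfers {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldTransferLease(1000, 500, 100)) // true: large delta, no overshoot
	fmt.Println(shouldTransferLease(1000, 900, 100)) // false: delta below 200 QPS
}
```

The same shape applies to the replica-rebalancing safeguard: require a minimum delta between the relinquishing and receiving stores, and reject moves that would flip their relative dispositions.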
This commit fixes the regression(s) introduced by cockroachdb#65379 where we observed replica thrashing in various workloads (cockroachdb#70396 and cockroachdb#71244). The following is a description of the differences between the QPS based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before). ** lease rebalancing ** *** release 21.2 and before *** QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload). All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery). *** master and this commit *** In cockroachdb#65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the qps delta between the hottest and the coldest stores in the replica set. 
This commit adds some padding to this logic by ensuring that the qps difference between the store relinquishing the lease and the store receiving the lease is at least 200qps. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder. ** replica rebalancing ** *** release 21.2 and before *** QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster level QPS average, overfull and underfull thresholds. Based on these thresholds we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder. *** master and this commit *** The approach implemented by cockroachdb#65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load based statistics per equivalence class. Previously, this machinery was only used for range count based replica rebalancing (performed by the `replicateQueue`) but not for qps-based rebalancing. This commit implements a similar approach to what we do now for lease rebalancing, which is to determine whether a rebalance action would reduce the qps delta between the hottest and the coldest store in the equivalence class. This commit adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 qps. Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores. An important thing to note with the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. 
This behaves poorly in heterogenously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull". This usually means that all stores in such localities would be above the "overfull" threshold in the cluster. The logic described above would effectively not do anything since there are no underfull stores to move replicas to. Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogenously loaded cluster localities has been fixed. Additionally a limitation which prevent CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.
This commit fixes the regression(s) introduced by cockroachdb#65379 where we observed replica thrashing in various workloads (cockroachdb#70396 and cockroachdb#71244). The following is a description of the differences between the QPS based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before). ** lease rebalancing ** *** release 21.2 and before *** QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload). All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery). *** master and this commit *** In cockroachdb#65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the qps delta between the hottest and the coldest stores in the replica set. 
This commit adds some padding to this logic by ensuring that the qps difference between the store relinquishing the lease and the store receiving the lease is at least 200qps. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder. ** replica rebalancing ** *** release 21.2 and before *** QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster level QPS average, overfull and underfull thresholds. Based on these thresholds we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder. *** master and this commit *** The approach implemented by cockroachdb#65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load based statistics per equivalence class. Previously, this machinery was only used for range count based replica rebalancing (performed by the `replicateQueue`) but not for qps-based rebalancing. This commit implements a similar approach to what we do now for lease rebalancing, which is to determine whether a rebalance action would reduce the qps delta between the hottest and the coldest store in the equivalence class. This commit adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 qps. Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores. An important thing to note with the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. 
This behaves poorly in heterogenously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull". This usually means that all stores in such localities would be above the "overfull" threshold in the cluster. The logic described above would effectively not do anything since there are no underfull stores to move replicas to. Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogenously loaded cluster localities has been fixed. Additionally a limitation which prevent CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.
This commit fixes the regression(s) introduced by cockroachdb#65379 where we observed replica thrashing in various workloads (cockroachdb#70396 and cockroachdb#71244). The following is a description of the differences between the QPS based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before). ** lease rebalancing ** *** release 21.2 and before *** QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload). All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery). *** master and this commit *** In cockroachdb#65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the qps delta between the hottest and the coldest stores in the replica set. 
This commit adds some padding to this logic by ensuring that the qps difference between the store relinquishing the lease and the store receiving the lease is at least 200qps. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder. ** replica rebalancing ** *** release 21.2 and before *** QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster level QPS average, overfull and underfull thresholds. Based on these thresholds we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder. *** master and this commit *** The approach implemented by cockroachdb#65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load based statistics per equivalence class. Previously, this machinery was only used for range count based replica rebalancing (performed by the `replicateQueue`) but not for qps-based rebalancing. This commit implements a similar approach to what we do now for lease rebalancing, which is to determine whether a rebalance action would reduce the qps delta between the hottest and the coldest store in the equivalence class. This commit adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 qps. Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores. An important thing to note with the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. 
This behaves poorly in heterogenously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull". This usually means that all stores in such localities would be above the "overfull" threshold in the cluster. The logic described above would effectively not do anything since there are no underfull stores to move replicas to. Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogenously loaded cluster localities has been fixed. Additionally a limitation which prevent CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.
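The lease-transfer safeguards described in the commit message (the minimum 200 qps delta, and not making the receiver hotter than the current leaseholder) can be sketched as follows. This is a hedged illustration; the constant and function names are not CockroachDB's actual identifiers:

```go
package main

import "fmt"

// minQPSDifference is an illustrative stand-in for the 200 qps padding
// described in the commit message.
const minQPSDifference = 200.0

// shouldTransferLease applies the two safeguards: the source store must
// be at least minQPSDifference hotter than the destination, and the
// transfer must not leave the destination hotter than the source was.
func shouldTransferLease(srcQPS, dstQPS, leaseQPS float64) bool {
	if srcQPS-dstQPS < minQPSDifference {
		return false
	}
	return dstQPS+leaseQPS < srcQPS
}

func main() {
	fmt.Println(shouldTransferLease(1000, 500, 100)) // true
	fmt.Println(shouldTransferLease(600, 500, 100))  // false: delta below the padding
	fmt.Println(shouldTransferLease(1000, 500, 600)) // false: receiver would end up hotter
}
```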
72296: kvserver: rebalance ranges to minimize QPS delta among stores r=aayushshah15 a=aayushshah15

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by #65379 where we observed replica thrashing in various workloads (#70396 and #71244). The following is a description of the differences between the QPS based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**

***release 21.2 and before***

QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload). All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery).
***master and this commit***

In #65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the qps delta between the hottest and the coldest stores in the replica set. This commit adds some padding to this logic by ensuring that the qps difference between the store relinquishing the lease and the store receiving the lease is at least 200qps. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder.

**replica rebalancing**

***release 21.2 and before***

QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster level QPS average, overfull and underfull thresholds. Based on these thresholds we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder.

***master and this commit***

The approach implemented by #65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load based statistics per equivalence class. Previously, this machinery was only used for range count based replica rebalancing (performed by the `replicateQueue`) but not for qps-based rebalancing. This commit implements a similar approach to what we do now for lease rebalancing, which is to determine whether a rebalance action would reduce the qps delta between the hottest and the coldest store in the equivalence class. This commit adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 qps.
Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores.

An important thing to note with the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. This behaves poorly in heterogeneously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull". This usually means that all stores in such localities would be above the "overfull" threshold in the cluster. The logic described above would effectively not do anything since there are no underfull stores to move replicas to.

**Manual testing**

This patch has been stress tested with the follower reads roachtests (~250 iterations of `follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of `follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and `..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we know triggers #31135) a few times and saw average QPS converge among stores fairly quickly.

![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)

Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogeneously loaded cluster localities has been fixed. Additionally, a limitation which prevented CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.

75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.
A Refresh request needs to observe all MVCC versions between its exclusive RefreshFrom time and its inclusive RefreshTo time. If it were to permit MVCC GC between these times then it could miss conflicts that should cause the refresh to fail. This could in turn lead to violations of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is permitted to evaluate after part of the MVCC history it needs to read has been GCed. By considering the RefreshFrom time to be the earliest active timestamp of the request, we avoid this hazard. Instead of being allowed to evaluate, the refresh request in the example would have hit a BatchTimestampBeforeGCError.

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
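The invariant above reduces to a single comparison. The sketch below uses hypothetical names and plain integer timestamps rather than CockroachDB's actual types; the real check lives in the server-side batch evaluation path:

```go
package main

import "fmt"

// refreshAllowed reports whether a Refresh request whose read window is
// (refreshFrom, refreshTo] can safely evaluate given the range's MVCC GC
// threshold. If GC has advanced past refreshFrom, versions the refresh
// must observe may already be gone, so the request has to fail instead
// (analogous to hitting a BatchTimestampBeforeGCError).
func refreshAllowed(refreshFrom, gcThreshold int64) bool {
	return refreshFrom >= gcThreshold
}

func main() {
	// The example from above: txn1 read at ts=10, GC ran at ts=20.
	// The refresh at ts=25 with RefreshFrom=10 must be rejected.
	fmt.Println(refreshAllowed(10, 20)) // false
	// A refresh window starting at or after the GC threshold is safe.
	fmt.Println(refreshAllowed(25, 20)) // true
}
```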
This commit teaches the `StoreRebalancer` to make load-based rebalancing decisions that are meaningful within the context of the replication constraints placed on the ranges being relocated and the set of stores that can legally receive replicas for such ranges.

Previously, the `StoreRebalancer` would compute the QPS underfull and overfull thresholds based on the overall average QPS being served by all stores in the cluster. Notably, this included stores that were in replication zones that would not satisfy required constraints for the range being considered for rebalancing. This meant that the store rebalancer would effectively never be able to rebalance ranges within the stores inside heavily loaded replication zones (since all the valid stores would be above the overfull thresholds).
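The core of the fix is to compute load statistics only over the stores that can legally hold the range, instead of over every store in the cluster. The sketch below is a hedged illustration with made-up types and a caller-supplied constraint predicate; it is not the Allocator's actual code:

```go
package main

import "fmt"

// store is an illustrative stand-in for a store descriptor plus its
// measured QPS.
type store struct {
	qps      float64
	locality string
}

// meanQPS averages QPS over only the stores accepted by the constraint
// predicate, so overfull/underfull thresholds derived from it are
// meaningful within the set of stores that can legally hold the range.
func meanQPS(stores []store, conformsToConstraints func(store) bool) float64 {
	var sum float64
	var n int
	for _, s := range stores {
		if conformsToConstraints(s) {
			sum += s.qps
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	stores := []store{
		{qps: 1000, locality: "us-east"},
		{qps: 900, locality: "us-east"},
		{qps: 100, locality: "eu-west"},
	}
	// A range constrained to us-east is now compared against us-east
	// stores only, rather than against a cluster-wide mean dragged down
	// by the cold eu-west store.
	inUSEast := func(s store) bool { return s.locality == "us-east" }
	fmt.Println(meanQPS(stores, inUSEast)) // 950
}
```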
This patch is a move away from the bespoke relocation logic in the `StoreRebalancer`. Instead, we have the `StoreRebalancer` rely on the rebalancing logic used by the `replicateQueue` that already has the machinery to compute load based signals for candidates relative to other comparable stores. The main difference here is that the `StoreRebalancer` uses this machinery to promote convergence of QPS across stores, whereas the `replicateQueue` uses it to promote convergence of range counts. A series of preceding commits in this patchset generalize the existing replica rebalancing logic, and this commit teaches the `StoreRebalancer` to use it.

This generalization also addresses another key limitation (see #62992) of the `StoreRebalancer` regarding its inability to make partial improvements to a range. Previously, if the `StoreRebalancer` couldn't move a range entirely off of overfull stores, it would give up and not even move the subset of replicas it could. This is no longer the case.
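The convergence criterion described above can be sketched as follows. This is a deliberately simplified model (per-store QPS as a flat slice, a single hottest-to-coldest delta), not the Allocator's actual implementation: a candidate rebalance is only worthwhile if it shrinks the QPS gap among the comparable stores.

```go
package main

import "fmt"

// qpsDelta returns the gap between the hottest and coldest store.
func qpsDelta(qps []float64) float64 {
	min, max := qps[0], qps[0]
	for _, q := range qps[1:] {
		if q < min {
			min = q
		}
		if q > max {
			max = q
		}
	}
	return max - min
}

// rebalanceConverges reports whether moving a replica serving rangeQPS
// from store i to store j reduces the hottest-to-coldest QPS delta
// among the comparable stores.
func rebalanceConverges(qps []float64, i, j int, rangeQPS float64) bool {
	after := append([]float64(nil), qps...)
	after[i] -= rangeQPS
	after[j] += rangeQPS
	return qpsDelta(after) < qpsDelta(qps)
}

func main() {
	qps := []float64{1000, 500, 300}
	fmt.Println(rebalanceConverges(qps, 0, 2, 200)) // true: delta 700 -> 300
	fmt.Println(rebalanceConverges(qps, 2, 0, 200)) // false: delta 700 -> 1100
}
```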
Resolves #61883
Resolves #62992
Resolves #31135
/cc @cockroachdb/kv
Release justification: fixes a set of major limitations behind numerous support escalations
Release note (performance improvement): QPS-based rebalancing is now
aware of different constraints placed on different replication zones. This
means that heterogeneously loaded replication zones (for instance, regions)
will achieve a more even distribution of QPS within the stores inside each
such zone.