
kvserver: actuate load-based replica rebalancing under heterogeneous localities #65379

Merged

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented May 18, 2021

This commit teaches the StoreRebalancer to make load-based rebalancing
decisions that are meaningful within the context of the replication constraints
placed on the ranges being relocated and the set of stores that can legally
receive replicas for such ranges.

Previously, the StoreRebalancer would compute the QPS underfull and overfull
thresholds based on the overall average QPS being served by all stores in the
cluster. Notably, this included stores that were in replication zones that
would not satisfy required constraints for the range being considered for
rebalancing. This meant that the store rebalancer would effectively never be
able to rebalance ranges within the stores inside heavily loaded replication
zones (since all the valid stores would be above the overfull thresholds).

This patch is a move away from the bespoke relocation logic in the
StoreRebalancer. Instead, we have the StoreRebalancer rely on the
rebalancing logic used by the replicateQueue, which already has the machinery
to compute load-based signals for candidates relative to other comparable
stores. The main difference here is that the StoreRebalancer uses this
machinery to promote convergence of QPS across stores, whereas the
replicateQueue uses it to promote convergence of range counts. A series of
preceding commits in this patchset generalize the existing replica rebalancing
logic, and this commit teaches the StoreRebalancer to use it.
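
To make the threshold mechanics concrete, here is a minimal, hedged sketch of how under/overfull QPS thresholds can be computed from the mean QPS of only the comparable (constraint-satisfying) stores rather than the cluster-wide mean. The `overfullQPSThreshold`/`underfullQPSThreshold` names and the `scorerOptions.qpsRebalanceThreshold` field mirror functions touched in this patch, but the bodies and the 0.25 value are illustrative assumptions, not the exact implementation.

```
// Illustrative sketch only (not the actual CockroachDB implementation): the
// thresholds are derived from the mean QPS of the stores that can legally
// hold the range, e.g. the stores inside one heavily loaded region.
package main

import "fmt"

// scorerOptions mirrors the field referenced in this patch; the 0.25 value
// below stands in for a QPS rebalance threshold setting and is assumed.
type scorerOptions struct {
	qpsRebalanceThreshold float64
}

// overfullQPSThreshold: QPS above which a comparable store is overfull.
func overfullQPSThreshold(o scorerOptions, mean float64) float64 {
	return mean * (1 + o.qpsRebalanceThreshold)
}

// underfullQPSThreshold: QPS below which a comparable store is underfull.
func underfullQPSThreshold(o scorerOptions, mean float64) float64 {
	return mean * (1 - o.qpsRebalanceThreshold)
}

func main() {
	opts := scorerOptions{qpsRebalanceThreshold: 0.25}
	// Mean computed over the constraint-satisfying stores only, not the
	// whole cluster.
	regionMean := 2000.0
	fmt.Println(overfullQPSThreshold(opts, regionMean))  // 2500
	fmt.Println(underfullQPSThreshold(opts, regionMean)) // 1500
}
```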

This generalization also addresses another key limitation (see #62992) of the
StoreRebalancer regarding its inability to make partial improvements to a
range. Previously, if the StoreRebalancer couldn't move a range entirely
off of overfull stores, it would give up and not even move the subset of
replicas it could. This is no longer the case.

Resolves #61883
Resolves #62992
Resolves #31135

/cc @cockroachdb/kv

Release justification: fixes a set of major limitations behind numerous support escalations

Release note (performance improvement): QPS-based rebalancing is now
aware of different constraints placed on different replication zones. This
means that heterogeneously loaded replication zones (for instance, regions)
will achieve a more even distribution of QPS within the stores inside each
such zone.

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15
Contributor Author

This needs a bit more love. Please hold off on reviewing.

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 2 times, most recently from 4358675 to c0fd11a Compare May 25, 2021 04:06
@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 4 times, most recently from b236586 to 3d8720d Compare May 25, 2021 09:26
@aayushshah15
Contributor Author

Note: This patch does not deal with constraints-aware lease rebalancing (StoreRebalancer.chooseLeaseToTransfer()). That is a simpler change, but it should probably go in its own PR.

There are still some flakes I'm in the process of narrowing down, but this patch should be ready for feedback now. I'd suggest reviewing it commit by commit (if y'all weren't already going to do that). If it's still hard to review, please let me know if there's anything I can do to make it easier.

Contributor

@andreimatei andreimatei left a comment


Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 846 at r2 (raw file):

// candidate for having a replica removed from it given the candidate store
// list. This method returns true if there are any ranges that are on stores
// that lie _outside of the [underfullThreshold, overfullThreshold] window_ for

I think the comment you've added is not quite right; it doesn't reference the one particular store that this method deals with.


pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):

}

func overfullRangeThreshold(options scorerOptions, mean float64) float64 {

should all these functions error or fatal if the field they each want in options isn't set?


pkg/kv/kvserver/allocator_scorer.go, line 1391 at r2 (raw file):

}

func overfullQPSThreshold(options scorerOptions, mean float64) float64 {

nit: I liked this function more with mean factored out of the math.Min.
Same for underfullQPSThreshold below


pkg/kv/kvserver/allocator_scorer.go, line 1399 at r2 (raw file):

}

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from

nit: s/a 1 iff/1 if :P


pkg/kv/kvserver/allocator_scorer.go, line 1400 at r2 (raw file):

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
// `sc` will _not_ converge its range count towards the mean of stores in `sl`.

You've left one "range count" in there. You probably want to adapt it to the new "options".


pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):

) int {
	if options.qpsRebalanceThreshold > 0 {
		// if `qpsRebalanceThreshold` is set, we disable the `convergesScore`.

Can you expand this comment? Why doesn't convergence apply to QPS?
Same in the function below


pkg/kv/kvserver/allocator_scorer.go, line 461 at r3 (raw file):

}

// rankedCandidateListForRemoval creates a candidate list of all existing

nit: I think it'd help if the comment on this method talked a bit about the context. So, someone has decided that one of the replicas of a range needs to go, and this function orders the replicas by load?

It'd also help to comment that sl contains the replicas. Perhaps rename to replicas.


pkg/kv/kvserver/allocator_scorer.go, line 491 at r3 (raw file):

		})
	}
	if options.deterministic {

if you can, please add a comment to options.deterministic and explain what that field is about


pkg/kv/kvserver/allocator_scorer.go, line 498 at r3 (raw file):

	// We compute the converges and balance scores only in relation to stores that
	// are the top candidates for removal based on diversity (i.e. only among
	// candidates that are non-diverse relative to the rest of the replicas).

Can you expand on this comment, and spell out that, in the example below, we want 1 and 2 to be scored higher than 3 and 4? And then could you explain in prose how come the scoring will end up like that?

Another thing that would help is expanding the comments on the worst method. Right now it says something about "sharing the lowest constraint score", but I don't understand what that means.


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.
	candidates = candidates.worst()

can this be a copy() ?


pkg/kv/kvserver/store_rebalancer.go, line 209 at r2 (raw file):

}

func (sr *StoreRebalancer) rebalanceStore(

consider giving this method a comment


pkg/kv/kvserver/store_rebalancer.go, line 216 at r2 (raw file):

		qpsRebalanceThreshold: qpsRebalanceThreshold.Get(&sr.st.SV),
	}
	qpsMinThreshold := underfullQPSThreshold(options, storeList.candidateQueriesPerSecond.mean)

consider putting a comment about these thresholds - how we only look at stores over the MaxThreshold and how we only transfer leases to stores below MinThreshold (right?)

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch from 3d8720d to 0ecaf1f Compare May 25, 2021 21:09
Member

@nvanbenschoten nvanbenschoten left a comment


Exciting to see this!

I am curious to get your take on the backportability of the changes here.

Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3, 7 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/allocator_scorer.go, line 1361 at r1 (raw file):

// the mean.
func rebalanceFromConvergesScore(sl StoreList, sc roachpb.StoreCapacity) int {
	if rebalanceConvergesOnMean(sl, sc, sc.RangeCount-1) {

minor nit: I think this would be slightly easier to read, and to contrast with rebalanceToConvergesScore, if it were written as `if !rebalanceConvergesOnMean(...) { return 1 }; return 0`. Up to you.


pkg/kv/kvserver/allocator_scorer.go, line 184 at r2 (raw file):

		return -(2 + float64(o.convergesScore-c.convergesScore)/10.0)
	}
	if !scoresAlmostEqual(float64(c.balanceScore), float64(o.balanceScore)) {

Any reason to cast to float64 in this line and the next? It seems like we should be using int arithmetic as long as possible, like we do with convergesScore.


pkg/kv/kvserver/allocator_scorer.go, line 851 at r2 (raw file):

// NB: If the given `options` have `qpsRebalanceThreshold` set, this method
// makes its determination based on QPS. Otherwise, we fall back on using range
// count as a signal.

So we're only ever considering either QPS or range count, but not both? If so, let's be more explicit about that.


pkg/kv/kvserver/allocator_scorer.go, line 888 at r2 (raw file):

// If we reached this point, we're happy with the range where it is.

Let's add this here as well.


pkg/kv/kvserver/allocator_scorer.go, line 1347 at r2 (raw file):

//
// If the given options have qpsRebalanceThreshold set, we use that for
// computing the balanceScore.

"Otherwise ..."


pkg/kv/kvserver/allocator_scorer.go, line 1349 at r2 (raw file):

// computing the balanceScore.
//
// TODO(aayush): It would be nice to be able to compose the two dimensions of

Have we thought about what effect this PR will have if we merge it without addressing this TODO? Will we no longer make any effort to balance based on range count? Are we relying on some callers setting qpsRebalanceThreshold and others not?


pkg/kv/kvserver/allocator_scorer.go, line 1351 at r2 (raw file):

// TODO(aayush): It would be nice to be able to compose the two dimensions of
// balanceScore (QPS or RangeCount) and allow the `options` to simply specify an
// order of precedence. Callers who care about the balancing out QPS across

s/out/of/


pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

should all these functions error of fatal if the field they each want in options isn't set?

+1


pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can you expand this comment? Why doesn't convergence apply to qps ?
Same in the function below

I'm also interested in this.


pkg/kv/kvserver/allocator_scorer.go, line 429 at r3 (raw file):

		diversityScore := diversityAllocateScore(s, existingStoreLocalities)
		balanceScore := balanceScore(candidateStores, s.Capacity, options)
		var convergesScore int

Why did we remove this? I would have expected it to use rebalanceToConvergesScore.


pkg/kv/kvserver/allocator_scorer.go, line 514 at r3 (raw file):

	// has more data constrained to it, both replicas 1 and 2 would be equally
	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.

The implicit assumption here that tripped me up for a while is that, by comparing against region B and C, both replicas in region A will get a discrete balance score of moreThanMean (or overfull), and so their balance will look equivalent. If the balance score was some kind of higher-resolution proportion, then we wouldn't run into this problem, because regardless of the denominator, the replica with more QPS would get a higher score.

Do you mind weaving something about this into the comment?
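
For illustration, a small hedged sketch of what such a discrete (bucketed) balance score looks like and why two differently loaded stores can land in the same bucket. The `balanceStatus` names mirror the enum values referenced in the tests further down; the bucketing math itself is an assumption for illustration, not the actual scorer.

```
// Illustrative only: a coarse, bucketed balance score. Both stores below
// land in the same "overfull" bucket even though one serves twice the QPS
// of the other, so the bucketed score alone cannot break the tie.
package main

import "fmt"

type balanceStatus int

const (
	underfull balanceStatus = iota
	lessThanMean
	moreThanMean
	overfull
)

// qpsBalanceStatus buckets a store's QPS relative to the comparable mean;
// the exact cutoffs here are assumed for illustration.
func qpsBalanceStatus(qps, mean, threshold float64) balanceStatus {
	switch {
	case qps < mean*(1-threshold):
		return underfull
	case qps < mean:
		return lessThanMean
	case qps <= mean*(1+threshold):
		return moreThanMean
	default:
		return overfull
	}
}

func main() {
	mean, threshold := 1000.0, 0.25
	fmt.Println(qpsBalanceStatus(1500, mean, threshold)) // overfull (3)
	fmt.Println(qpsBalanceStatus(3000, mean, threshold)) // overfull (3)
}
```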


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can this be a copy() ?

Or at least a pre-allocated list.


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.
	candidates = candidates.worst()

So this means that only the worst candidates will be returned, right? So this function is no longer returning a ranked list of all stores in the input, it's returning a ranked list of the stores in the input with the worst diversity score. I think that's ok, given how the return value is used, but is that what you were expecting? Do we need to change the contract here?


pkg/kv/kvserver/allocator_scorer_test.go, line 1533 at r2 (raw file):

	}{
		{sEmpty, underfull},
		{sMean, moreThanMean},

Do we want to add a test case that results in lessThanMean?


pkg/kv/kvserver/allocator_scorer_test.go, line 1556 at r2 (raw file):

		expBalanceScore balanceStatus
	}{
		{0, 1},

Want to use your new enums here?


pkg/kv/kvserver/allocator_test.go, line 1363 at r2 (raw file):

			// We don't expect any QPS based rebalancing when all stores are serving
			// the same QPS.
			testStores: allStoresEqual,

nit: it's worth being explicit: expectRebalance: false,


pkg/kv/kvserver/allocator_test.go, line 1446 at r3 (raw file):

		{
			StoreID:  5,
			Node:     roachpb.NodeDescriptor{NodeID: 5, Locality: region("a")},

nit: any reason this is store 5 but has the same region as store 1? This tripped me up.


pkg/kv/kvserver/store_pool.go, line 657 at r4 (raw file):

}

// excludeInvalid takes a store list and removes stores that would be explicitly invalid

Want to pull this rename into a separate commit? That's probably better than having a complex change mixed in with a refactor.


pkg/kv/kvserver/store_rebalancer.go, line 209 at r4 (raw file):

// NB: The StoreRebalancer only cares about the convergence of QPS across
// stores, not the convergence of range count. So, we don't use the allocator's
// `scorerOptions` here, which set the range count rebalance threshold.

"Instead, we use our own scorerOptions, which sets the qps rebalance threshold."


pkg/kv/kvserver/store_rebalancer.go, line 400 at r4 (raw file):

		}

		if shouldNotMoveAway(ctx, replWithStats, localDesc, now, minQPS) {

Should we inline this call, now that it's not used down below?


pkg/kv/kvserver/store_rebalancer.go, line 650 at r4 (raw file):

	ctx context.Context, rbCtx rangeRebalanceContext,
) (finalVoterTargets, finalNonVoterTargets []roachpb.ReplicaDescriptor) {
	options := scorerOptions{

sr.scorerOptions()?


pkg/kv/kvserver/store_rebalancer_test.go, line 255 at r4 (raw file):

		expectedRebalancedVoters, expectedRebalancedNonVoters []roachpb.StoreID
	}{
		{

We seem to be deleting a lot of test cases. Is this because we're expecting the test coverage provided by TestChooseRangeToRebalance to be subsumed by TestChooseRangeToRebalanceRandom? I wonder if we should keep both. Randomized and table-driven tests that exercise similar logic can live side-by-side. In fact, they're a great combination. One finds new kinds of failures and one protects against regressions on known test cases.


pkg/kv/kvserver/store_rebalancer_test.go, line 391 at r4 (raw file):

	ctx context.Context, allStores, deadStores []*roachpb.StoreDescriptor, meanQPS float64,
) {
	var summary bytes.Buffer

nit: strings.Builder


pkg/kv/kvserver/store_rebalancer_test.go, line 617 at r4 (raw file):

		},
		// Two replicas are in the hot region, both on relatively heavily loaded
		// nodes. We one of those replicas to be moved to a less busy store within

"We expect one"


pkg/kv/kvserver/store_rebalancer_test.go, line 652 at r4 (raw file):

			expRebalancedVoters: []roachpb.StoreID{8, 6, 3},
			qps:                 50,
		},

Do you think it's worth adding some test cases that use constraints that look exactly like we'll generate in multi-region databases (for zone and region survival)?

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 8 times, most recently from aad90bf to fe7cd1e Compare June 7, 2021 07:44
@shermanCRL
Contributor

Pinging for next steps here. Get it baking on master and then look at a 21.1 backport?

@aayushshah15
Contributor Author

Pinging for next steps here.

This needs a few more review cycles. I have not been able to spend any time on this in the last little bit, but hope to in the next couple weeks.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 12, 2022
This commit fixes the regression(s) introduced by
cockroachdb#65379 where we observed replica
thrashing in various workloads (cockroachdb#70396 and cockroachdb#71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

** lease rebalancing **
*** release 21.2 and before ***
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).

*** master and this commit ***
In cockroachdb#65379, we moved this decision making into the Allocator by adding a new
mode in `Allocator.TransferLeaseTarget` that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.
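
For reference, a minimal sketch (assumed, not the actual `Allocator.TransferLeaseTarget` code) of the delta-reducing check with a ~200 QPS floor described above; all helper names here are hypothetical.

```
// A minimal sketch of the check described above: only transfer the lease if
// the leaseholder and the candidate differ by at least ~200 QPS and the
// transfer shrinks the gap between the hottest and coldest stores in the
// replica set. minQPSDifference, shouldTransferLease, and qpsDelta are
// hypothetical names, not the actual Allocator API.
package main

import "fmt"

const minQPSDifference = 200.0

// qpsDelta returns hottest minus coldest QPS across the given stores.
func qpsDelta(storeQPS map[int]float64) float64 {
	first := true
	var minQPS, maxQPS float64
	for _, qps := range storeQPS {
		if first {
			minQPS, maxQPS = qps, qps
			first = false
		}
		if qps < minQPS {
			minQPS = qps
		}
		if qps > maxQPS {
			maxQPS = qps
		}
	}
	return maxQPS - minQPS
}

// shouldTransferLease reports whether moving a lease worth leaseQPS from the
// current leaseholder to candidate reduces the hot/cold QPS delta.
func shouldTransferLease(storeQPS map[int]float64, leaseholder, candidate int, leaseQPS float64) bool {
	if storeQPS[leaseholder]-storeQPS[candidate] < minQPSDifference {
		return false // not enough of a gap to be worth the churn
	}
	after := make(map[int]float64, len(storeQPS))
	for id, qps := range storeQPS {
		after[id] = qps
	}
	after[leaseholder] -= leaseQPS
	after[candidate] += leaseQPS
	return qpsDelta(after) < qpsDelta(storeQPS)
}

func main() {
	stores := map[int]float64{1: 1200, 2: 400, 3: 500} // store 1 is the leaseholder
	fmt.Println(shouldTransferLease(stores, 1, 2, 300)) // true: delta goes from 800 to 400
}
```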

** replica rebalancing **
*** release 21.2 and before ***
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

*** master and this commit ***
The approach implemented by cockroachdb#65379 and refined by this commit tries to leverage
machinery in the Allocator that makes rebalancing decisions that converge load
based statistics per equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
`replicateQueue`) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogeneously sized / loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogeneously loaded cluster localities has been fixed. Additionally, a
limitation which prevented CRDB from effectively alleviating extreme QPS hotspots
from nodes has also been fixed.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 15, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 15, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 16, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 17, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 27, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 28, 2022
craig bot pushed a commit that referenced this pull request Jan 29, 2022
72296: kvserver: rebalance ranges to minimize QPS delta among stores  r=aayushshah15 a=aayushshah15

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by
#65379 where we observed replica
thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**
***release 21.2 and before***
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).

***master and this commit***
In #65379, we moved this decision making into the Allocator by adding a new
mode in `Allocator.TransferLeaseTarget` that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.

**replica rebalancing**
***release 21.2 and before***
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

***master and this commit***
The approach implemented by #65379 and refined by this commit tries to leverage
machinery in the Allocator that makes rebalancing decisions that converge load
based statistics per equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
`replicateQueue`) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogeneously sized / loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.

**Manual testing**
This patch has been stress tested with the follower reads roachtests (~250 iterations of 
`follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of 
`follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been 
stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and 
`..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we
know triggers #31135) a few times and
saw average QPS converge among stores fairly quickly.
![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)
  

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogeneously loaded cluster localities has been fixed. Additionally, a
limitation which prevented CRDB from effectively alleviating extreme QPS hotspots
from nodes has also been fixed.


75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>