
storage: Load-based rebalancing should be more willing to make certain incremental improvements #31135

Closed
a-robinson opened this issue Oct 9, 2018 · 1 comment · Fixed by #65379
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@a-robinson (Contributor) commented Oct 9, 2018

This is already called out in a TODO in the code:

// TODO(a-robinson): Support more incremental improvements -- move what we
// can if it makes things better even if it isn't great. For example,
// moving one of the other existing replicas that's on a store with less
// qps than the max threshold but above the mean would help in certain
// locality configurations.

But yesterday I saw a cluster get into a real situation where a greater willingness to move a replica that would make another store overfull would have been the ideal thing to do / what a human would have done. Currently the code is cautious about moving replicas with so much QPS that moving them would over-fill whichever store they get moved to. That's a good decision when there's only one such replica on a store, but when there are two it's less good:

I181009 02:36:43.678116 996 storage/store_rebalancer.go:686  [n27,s28,store-rebalancer] r216919's 2456.46 qps would push s13 over the mean (168.26) with 2531.33 qps afterwards
I181009 02:36:43.678126 996 storage/store_rebalancer.go:597  [n27,s28,store-rebalancer] couldn't find enough rebalance targets for r216919 (2/3)
I181009 02:36:43.678723 996 storage/store_rebalancer.go:680  [n27,s28,store-rebalancer] r216912's 2338.62 qps would push s49 over the max threshold (268.26) with 2353.85 qps afterwards
I181009 02:36:43.678730 996 storage/store_rebalancer.go:597  [n27,s28,store-rebalancer] couldn't find enough rebalance targets for r216912 (2/3)
I181009 02:36:43.678738 996 storage/store_rebalancer.go:307  [n27,s28,store-rebalancer] ran out of replicas worth transferring and qps (4953.34) is still above desired threshold (268.26); will check again soon

In these logs, s28 has way more qps than the mean, mostly accounted for by two hot replicas (IIRC they were hot due to a bunch of retried errors, but that's a separate concern). However, moving either of them fails the check in shouldNotMoveTo that attempts to avoid ping-ponging a lone hot lease around between stores on an otherwise lightly loaded cluster:

newCandidateQPS := storeDesc.Capacity.QueriesPerSecond + replWithStats.qps
// If the candidate store is currently below the min QPS threshold, only
// refuse moves that would push it past the max threshold; otherwise refuse
// any move that would push it past the mean.
if storeDesc.Capacity.QueriesPerSecond < minQPS {
	if newCandidateQPS > maxQPS {
		log.VEventf(ctx, 3,
			"r%d's %.2f qps would push s%d over the max threshold (%.2f) with %.2f qps afterwards",
			replWithStats.repl.RangeID, replWithStats.qps, candidateStore, maxQPS, newCandidateQPS)
		return true
	}
} else if newCandidateQPS > meanQPS {
	log.VEventf(ctx, 3,
		"r%d's %.2f qps would push s%d over the mean (%.2f) with %.2f qps afterwards",
		replWithStats.repl.RangeID, replWithStats.qps, candidateStore, meanQPS, newCandidateQPS)
	return true
}

In cases like this, it would be reasonable to make the move if, for example, the source store would still have more QPS on it than the destination store after the transfer.
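
A minimal sketch of the kind of incremental check this suggests, assuming a hypothetical helper rather than the actual store-rebalancer code:

```go
// moveStillImproves is a hypothetical relaxation of the check above: even if
// the destination would end up above the mean, allow the move as long as the
// source store would still be at least as hot as the destination afterwards,
// so the move strictly narrows the gap between the two stores.
package main

import "fmt"

func moveStillImproves(sourceQPS, destQPS, replQPS float64) bool {
	newSourceQPS := sourceQPS - replQPS
	newDestQPS := destQPS + replQPS
	return newSourceQPS >= newDestQPS
}

func main() {
	// Roughly the numbers from the second log line above: s28 at ~4953 QPS
	// shedding the ~2339 QPS replica r216912 onto s49, which sits near 15 QPS.
	fmt.Println(moveStillImproves(4953.34, 15.23, 2338.62)) // true: worth moving
}
```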

It's worth noting that it's better for 2.1 to be overly cautious than for it to be overly aggressive, so I'm not planning on pushing for any such changes to be backported.

gz#8871

@aayushshah15 (Contributor) commented:

When investigating https://github.com/cockroachlabs/support/issues/1065, I was able to trace the load imbalance we've been seeing in a lot of these "high read amplification" issues to this problem. I'm re-opening this issue until the linked PR is merged.

The linked PR (#65379) will solve this issue as well.

aayushshah15 reopened this Jul 19, 2021
github-actions bot removed the X-stale label Jul 20, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 24, 2021
This commit augments `TransferLeaseTarget()` by adding a mode that picks
the best lease transfer target that would lead to QPS convergence across
the stores that have a replica for a given range.

Resolves cockroachdb#31135

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Sep 8, 2021
This commit augments `TransferLeaseTarget()` by adding a mode that picks
the best lease transfer target that would lead to QPS convergence across
the stores that have a replica for a given range.

This commit implements a strategy that predicates lease transfer decisions on
whether they would serve to reduce the QPS delta between existing replicas'
stores.

Resolves cockroachdb#31135

Release justification: Fixes high priority bug

Release note (bug fix): Previously, the store rebalancer was unable to
rebalance leases for hot ranges that received a disproportionate amount
of traffic relative to the rest of the cluster. This often led to
prolonged single node hotspots in certain workloads that led to hot
ranges. This bug is now fixed.
craig bot pushed a commit that referenced this issue Sep 8, 2021
65379: kvserver: actuate load-based replica rebalancing under heterogeneous localities r=aayushshah15 a=aayushshah15

This commit teaches the `StoreRebalancer` to make load-based rebalancing
decisions that are meaningful within the context of the replication constraints
placed on the ranges being relocated and the set of stores that can legally
receive replicas for such ranges.

Previously, the `StoreRebalancer` would compute the QPS underfull and overfull
thresholds based on the overall average QPS being served by all stores in the
cluster. Notably, this included stores that were in replication zones that
would not satisfy required constraints for the range being considered for
rebalancing. This meant that the store rebalancer would effectively never be
able to rebalance ranges within the stores inside heavily loaded replication
zones (since all the _valid_ stores would be above the overfull thresholds).

This patch is a move away from the bespoke relocation logic in the
`StoreRebalancer`. Instead, we have the `StoreRebalancer` rely on the
rebalancing logic used by the `replicateQueue` that already has the machinery
to compute load based signals for candidates _relative to other comparable
stores_. The main difference here is that the `StoreRebalancer` uses this
machinery to promote convergence of QPS across stores, whereas the
`replicateQueue` uses it to promote convergence of range counts. A series of
preceding commits in this patchset generalize the existing replica rebalancing
logic, and this commit teaches the `StoreRebalancer` to use it.
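
To make the difference concrete, here is a small illustration of why thresholds computed over the whole cluster differ from thresholds computed over only the stores that can legally hold the range. This is a hypothetical sketch, not the allocator's code; the store QPS numbers and the 10% overfull fraction are made up:

```go
package main

import "fmt"

// overfullFraction is a made-up figure: a store is "overfull" if it is more
// than 10% above the mean of the stores it is compared against.
const overfullFraction = 0.10

func qpsThresholds(qpsByStore map[int]float64, stores []int) (mean, overfull float64) {
	var sum float64
	for _, s := range stores {
		sum += qpsByStore[s]
	}
	mean = sum / float64(len(stores))
	return mean, mean * (1 + overfullFraction)
}

func main() {
	qps := map[int]float64{
		1: 2400, 2: 2500, 3: 2300, // stores in a heavily loaded replication zone
		4: 100, 5: 120, 6: 90, // stores that cannot legally hold the range
	}
	// Cluster-wide thresholds make every hot-zone store look overfull, so no
	// rebalance is ever attempted among them.
	mean, over := qpsThresholds(qps, []int{1, 2, 3, 4, 5, 6})
	fmt.Printf("cluster-wide: mean=%.0f overfull=%.0f\n", mean, over)
	// Restricted to the comparable stores, s3 is below the mean and could
	// reasonably absorb load from s2.
	mean, over = qpsThresholds(qps, []int{1, 2, 3})
	fmt.Printf("per-class:    mean=%.0f overfull=%.0f\n", mean, over)
}
```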

This generalization also addresses another key limitation (see #62992) of the
`StoreRebalancer` regarding its inability to make partial improvements to a
range. Previously, if the `StoreRebalancer` couldn't move a range _entirely_
off of overfull stores, it would give up and not even move the subset of
replicas it could. This is no longer the case.

Resolves #61883
Resolves #62992
Resolves #31135

/cc @cockroachdb/kv

Release justification: fixes a set of major limitations behind numerous support escalations

Release note (performance improvement): QPS-based rebalancing is now
aware of different constraints placed on different replication zones. This
means that heterogeneously loaded replication zones (for instance, regions)
will achieve a more even distribution of QPS within the stores inside each
such zone.


Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
craig bot closed this as completed in d61f474 Sep 8, 2021
craig bot pushed a commit that referenced this issue Jan 29, 2022
72296: kvserver: rebalance ranges to minimize QPS delta among stores  r=aayushshah15 a=aayushshah15

This commit fixes the regression(s) introduced by
#65379 where we observed replica
thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**
***release 21.2 and before***
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).
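
A condensed sketch of that pre-22.1 decision chain, with hypothetical type and field names rather than the real kvserver identifiers:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate describes the store of one non-leaseholder voting replica.
type candidate struct {
	storeID                  int
	qps                      float64
	laggingBehindLeaseholder bool
	violatesLeasePreferences bool
}

func pickLeaseTransferTarget(
	localQPS, leaseQPS, underfullQPS, overfullQPS float64,
	keptLocalByFollowTheWorkload bool,
	candidates []candidate,
) (candidate, bool) {
	// Skip this range entirely if shedding its lease would leave the local
	// store underfull, or if follow-the-workload wants the lease to stay put.
	if localQPS-leaseQPS < underfullQPS || keptLocalByFollowTheWorkload {
		return candidate{}, false
	}
	// Walk the other voting replicas' stores in ascending order of QPS.
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].qps < candidates[j].qps })
	for _, c := range candidates {
		switch {
		case c.qps+leaseQPS > overfullQPS: // receiver would become overfull
		case c.laggingBehindLeaseholder: // receiver is too far behind the leaseholder
		case c.violatesLeasePreferences: // receiver violates the lease preferences
		default:
			return c, true
		}
	}
	return candidate{}, false
}

func main() {
	target, ok := pickLeaseTransferTarget(
		2500, 800, 150, 1200, false,
		[]candidate{{storeID: 3, qps: 900}, {storeID: 2, qps: 100}},
	)
	fmt.Println(target.storeID, ok) // s2: the coldest store that stays under the overfull threshold
}
```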

***master and this commit***
In #65379, we moved this decision making into the Allocator by adding a new
mode in `Allocator.TransferLeaseTarget` that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.
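
A hypothetical condensation of those two safeguards (the 200 QPS minimum comes from the description above; the constant names and the overshoot heuristic are illustrative, not the allocator's actual identifiers or values):

```go
package main

import "fmt"

const (
	minQPSDifference        = 200.0 // required gap between the current and candidate leaseholder stores
	maxQPSOvershootFraction = 0.5   // made-up bound on how much hotter the receiver may end up
)

func shouldTransferLease(currentStoreQPS, candidateStoreQPS, leaseQPS float64) bool {
	// Require a meaningful gap between the relinquishing and receiving stores.
	if currentStoreQPS-candidateStoreQPS < minQPSDifference {
		return false
	}
	// Don't let the receiving store become significantly hotter than the
	// current leaseholder is today.
	if candidateStoreQPS+leaseQPS > currentStoreQPS+leaseQPS*maxQPSOvershootFraction {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldTransferLease(2500, 100, 1200)) // true: large delta, receiver stays cooler
	fmt.Println(shouldTransferLease(600, 500, 400))   // false: only a 100 QPS gap
}
```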

**replica rebalancing**
***release 21.2 and before***
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

***master and this commit***
The approach implemented by #65379 and refined by this commit tries to leverage
machinery in the Allocator that makes rebalancing decisions that converge load
based statistics per equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
`replicateQueue`) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.
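
An analogous sketch for replica moves, again with hypothetical names: require a minimum QPS gap between the shedding and receiving stores, and reject moves that would flip which of the two stores is the hotter one:

```go
package main

import "fmt"

const minQPSDifference = 200.0 // required gap between the shedding and receiving stores

func shouldRebalanceReplica(srcQPS, dstQPS, replQPS float64) bool {
	if srcQPS-dstQPS < minQPSDifference {
		return false // the two stores are already roughly balanced
	}
	newSrc, newDst := srcQPS-replQPS, dstQPS+replQPS
	// The move should only narrow the gap, never swap which store is hot.
	return newDst <= newSrc
}

func main() {
	fmt.Println(shouldRebalanceReplica(2600, 300, 900))  // true: the gap shrinks from 2300 to 500
	fmt.Println(shouldRebalanceReplica(2600, 300, 1400)) // false: the receiver would become the hotter store
}
```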

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogeneously sized / loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.

**Manual testing**
This patch has been stress tested with the follower reads roachtests (~250 iterations of 
`follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of 
`follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been 
stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and 
`..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we
know triggers #31135) a few times and
saw average QPS converge among stores fairly quickly.
![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)
  

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogeneously loaded cluster localities has been fixed. Additionally, a
limitation which prevented CRDB from effectively alleviating extreme QPS hotspots
from nodes has also been fixed.


75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.
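
A minimal sketch of that guard, using plain integer timestamps instead of CockroachDB's hlc.Timestamp; the function and error names are illustrative and the boundary handling is deliberately simplified:

```go
package main

import (
	"errors"
	"fmt"
)

var errBatchTimestampBeforeGC = errors.New("batch timestamp preceded GC threshold")

// checkRefreshAgainstGCThreshold treats RefreshFrom as the earliest active
// timestamp of the refresh: if GC has advanced to or past it, versions the
// refresh needed to observe may already be gone, so the request must fail.
func checkRefreshAgainstGCThreshold(refreshFrom, gcThreshold int64) error {
	if refreshFrom <= gcThreshold {
		return errBatchTimestampBeforeGC
	}
	return nil
}

func main() {
	// As in the example above: txn1 read at 10, GC ran at 20, refresh at 25.
	fmt.Println(checkRefreshAgainstGCThreshold(10, 20)) // refused, as desired
}
```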

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>