kvserver: actuate load-based replica rebalancing under heterogeneous localities #65379

Merged

Commits on Sep 8, 2021

  1. kvserver: clean up convergesScore computation during rebalancing

    Release justification: Fixes high priority bug
    Release note: None
    aayushshah15 committed Sep 8, 2021
    5eba047
  2. kvserver: allow computing balanceScore off of QPS for rebalancing

    Previously, the replica rebalancing logic inside the allocator only computed
    `balanceScore` (a score of whether a store is overfull, underfull or balanced
    based on some signal) based on range count. This commit augments the replica
    rebalancing logic to support an option to allow computing `balanceScore` based
    on QPS instead. When the `balanceScore` is being computed off of QPS, we
    disable `convergesScore` (which we can only compute off of RangeCount and
    would typically take precedence over `balanceScore`).
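
    As a rough illustration of scoring the same store on two different signals,
    here is a minimal sketch. The `status` type, the `signal` helpers, and the
    fractional `threshold` are assumptions made for this example and are not
    taken from the allocator code; only the idea of classifying a store as
    overfull, underfull or balanced comes from the commit message.

```go
package main

import "fmt"

// status is a hypothetical three-way classification of a store.
type status int

const (
	underfull status = iota
	balanced
	overfull
)

// store holds the two signals mentioned in the commit message.
type store struct {
	id         int
	rangeCount float64
	qps        float64
}

// signal selects the load dimension to score on (assumed helper type).
type signal func(store) float64

func byRangeCount(s store) float64 { return s.rangeCount }
func byQPS(s store) float64        { return s.qps }

// balanceScore classifies s against the mean of its peers for the chosen
// signal. threshold is an assumed fractional tolerance around that mean.
func balanceScore(s store, peers []store, sig signal, threshold float64) status {
	var sum float64
	for _, p := range peers {
		sum += sig(p)
	}
	mean := sum / float64(len(peers))
	v := sig(s)
	switch {
	case v > mean*(1+threshold):
		return overfull
	case v < mean*(1-threshold):
		return underfull
	default:
		return balanced
	}
}

func main() {
	stores := []store{
		{id: 1, rangeCount: 100, qps: 900},
		{id: 2, rangeCount: 100, qps: 100},
		{id: 3, rangeCount: 100, qps: 200},
	}
	// Same store, two signals: balanced by range count, overfull by QPS.
	fmt.Println(balanceScore(stores[0], stores, byRangeCount, 0.1) == balanced)
	fmt.Println(balanceScore(stores[0], stores, byQPS, 0.1) == overfull)
}
```

    In this sketch, store 1 looks balanced when only range count is considered,
    yet it is clearly overfull once QPS is the signal.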
    
    A future commit in this patchset will leverage this option in the
    `StoreRebalancer` to make zone-aware rebalancing decisions based on QPS.
    
    Release justification: Fixes high priority bug
    
    Release note: None
    aayushshah15 committed Sep 8, 2021
    b88a2b7
  3. kvserver: sharpen computation of load based signals for replica removal

    This commit improves the computation of `convergesScore` and
    `balanceScore` during replica removal by computing these scores only in
    relation to the set of candidates that are the least diverse (i.e. the
    candidates that are actually being considered for removal).
    
    This is necessary for these load based signals to be meaningful in
    heterogeneously loaded localities.
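
    The effect of restricting the comparison set can be sketched as follows.
    The `candidate` type, its `diversity` field, and the helper names are
    hypothetical; the sketch only assumes that a lower diversity value marks a
    replica whose removal costs the least locality diversity.

```go
package main

import "fmt"

// candidate is a hypothetical removal candidate.
type candidate struct {
	storeID   int
	diversity float64 // assumed: lower means cheaper to remove diversity-wise
	qps       float64
}

// leastDiverse returns the candidates that share the lowest diversity value,
// i.e. the ones that would actually be considered for removal.
func leastDiverse(cands []candidate) []candidate {
	if len(cands) == 0 {
		return nil
	}
	lowest := cands[0].diversity
	for _, c := range cands[1:] {
		if c.diversity < lowest {
			lowest = c.diversity
		}
	}
	var out []candidate
	for _, c := range cands {
		if c.diversity == lowest {
			out = append(out, c)
		}
	}
	return out
}

// meanQPS averages QPS over a reference set of candidates.
func meanQPS(ref []candidate) float64 {
	if len(ref) == 0 {
		return 0
	}
	var sum float64
	for _, c := range ref {
		sum += c.qps
	}
	return sum / float64(len(ref))
}

// aboveMean reports whether c carries more QPS than the reference set's mean.
func aboveMean(c candidate, ref []candidate) bool {
	return c.qps > meanQPS(ref)
}

func main() {
	cands := []candidate{
		{storeID: 1, diversity: 0.5, qps: 300},
		{storeID: 2, diversity: 0.5, qps: 100},
		{storeID: 3, diversity: 1.0, qps: 5000}, // lone replica in a hot locality
	}
	fmt.Println(aboveMean(cands[0], cands))               // compared to everyone
	fmt.Println(aboveMean(cands[0], leastDiverse(cands))) // compared to real peers
}
```

    Compared against all three stores, store 1 looks lightly loaded because the
    hot store 3 skews the mean. Compared only against the least diverse set
    (stores 1 and 2), store 1 is the more loaded of the two.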
    
    Release justification: Fixes high priority bug
    
    Release note: None
    aayushshah15 committed Sep 8, 2021
    3f4ed4e
  4. kvserver: actuate load-based replica rebalancing under heterogeneous localities

    This commit teaches the `StoreRebalancer` to make load-based rebalancing
    decisions that are meaningful within the context of the replication constraints
    placed on the ranges being relocated and the set of stores that can legally
    receive replicas for such ranges.
    
    Previously, the `StoreRebalancer` would compute the QPS underfull and overfull
    thresholds based on the overall average QPS being served by all stores in the
    cluster. Notably, this included stores that were in replication zones that
    would not satisfy required constraints for the range being considered for
    rebalancing. This meant that the store rebalancer would effectively never be
    able to rebalance ranges within the stores inside heavily loaded replication
    zones (since all the _valid_ stores would be above the overfull thresholds).
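
    A small numeric sketch of the problem described above, under assumed
    values: the zone names, the QPS figures, the 10% tolerance, and the helper
    names are all made up for illustration. It compares an overfull threshold
    derived from the cluster-wide mean with one derived from the valid stores
    only.

```go
package main

import "fmt"

// store is a hypothetical view of a store: its zone and its current QPS.
type store struct {
	id   int
	zone string
	qps  float64
}

// meanQPS averages QPS over a set of stores.
func meanQPS(stores []store) float64 {
	if len(stores) == 0 {
		return 0
	}
	var sum float64
	for _, s := range stores {
		sum += s.qps
	}
	return sum / float64(len(stores))
}

// inZone keeps only the stores in the given zone.
func inZone(stores []store, zone string) []store {
	var out []store
	for _, s := range stores {
		if s.zone == zone {
			out = append(out, s)
		}
	}
	return out
}

// belowOverfull returns the IDs of valid stores whose QPS is below the
// overfull threshold derived from the mean of the reference set.
func belowOverfull(valid, reference []store, threshold float64) []int {
	limit := meanQPS(reference) * (1 + threshold)
	var ids []int
	for _, s := range valid {
		if s.qps < limit {
			ids = append(ids, s.id)
		}
	}
	return ids
}

func main() {
	all := []store{
		{1, "us-east", 1000}, {2, "us-east", 1400}, {3, "us-east", 1800},
		{4, "eu-west", 100}, {5, "eu-west", 100}, {6, "eu-west", 100},
	}
	valid := inZone(all, "us-east") // the range is constrained to us-east
	fmt.Println(belowOverfull(valid, all, 0.1))   // cluster-wide mean: no targets
	fmt.Println(belowOverfull(valid, valid, 0.1)) // zone-local mean: stores 1 and 2
}
```

    With the cluster-wide mean (750 QPS), every store in the busy zone sits
    above the overfull threshold, so no rebalancing target exists. With the
    mean of the valid stores (1400 QPS), stores 1 and 2 become eligible.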
    
    This patch is a move away from the bespoke relocation logic in the
    `StoreRebalancer`. Instead, we have the `StoreRebalancer` rely on the
    rebalancing logic used by the `replicateQueue` that already has the machinery
    to compute load based signals for candidates _relative to other comparable
    stores_. The main difference here is that the `StoreRebalancer` uses this
    machinery to promote convergence of QPS across stores, whereas the
    `replicateQueue` uses it to promote convergence of range counts. A series of
    preceding commits in this patchset generalize the existing replica rebalancing
    logic, and this commit teaches the `StoreRebalancer` to use it.
    
    This generalization also addresses another key limitation (see cockroachdb#62922) of the
    `StoreRebalancer` regarding its inability to make partial improvements to a
    range. Previously, if the `StoreRebalancer` couldn't move a range _entirely_
    off of overfull stores, it would give up and not even move the subset of
    replicas it could. This is no longer the case.
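
    The difference between the old all-or-nothing behavior and partial
    improvement can be sketched like this. `planMoves`, the `move` type, and
    the `allOrNothing` flag are hypothetical names for this example; the real
    relocation logic is considerably more involved.

```go
package main

import "fmt"

// move describes relocating one replica from one store to another.
type move struct{ from, to int }

// planMoves pairs each replica on an overfull store with the next available
// target. With allOrNothing set, it mimics the old behavior described above
// and returns no moves unless every such replica can be relocated.
func planMoves(replicas []int, overfull map[int]bool, targets []int, allOrNothing bool) []move {
	var moves []move
	needed, next := 0, 0
	for _, r := range replicas {
		if !overfull[r] {
			continue
		}
		needed++
		if next < len(targets) {
			moves = append(moves, move{from: r, to: targets[next]})
			next++
		}
	}
	if allOrNothing && len(moves) < needed {
		return nil
	}
	return moves
}

func main() {
	replicas := []int{1, 2, 3}                 // stores holding the range
	overfull := map[int]bool{1: true, 2: true} // two of them are overfull
	targets := []int{4}                        // only one legal target

	fmt.Println(planMoves(replicas, overfull, targets, true))  // []
	fmt.Println(planMoves(replicas, overfull, targets, false)) // [{1 4}]
}
```

    With a single legal target, the all-or-nothing plan is empty, while the
    partial plan still moves one replica off an overfull store.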
    
    Resolves cockroachdb#61883
    Resolves cockroachdb#62992
    
    Release justification: Fixes high priority bug
    
    Release note (performance improvement): QPS-based replica rebalancing is now
    aware of different constraints placed on different replication zones. This
    means that heterogeneously loaded replication zones (for instance, regions)
    will achieve a more even distribution of QPS within the stores inside each
    such zone.
    
    /cc @cockroachdb/kv
    aayushshah15 committed Sep 8, 2021
    d611828
  5. kvserver: refactor allocator's scorer options

    This commit turns the allocator's `scorerOptions` into an interface that has
    two implementations: one that promotes the balancing of range count across
    comparable stores, and another that promotes the balancing of QPS across
    comparable stores. The replicateQueue uses the former, whereas the
    `StoreRebalancer` uses the latter.
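
    The shape of such a refactor might look like the sketch below. Only the
    name `scorerOptions` comes from the commit message; the method set
    (`name`, `load`), the two implementing types, and `mostLoaded` are
    assumptions chosen to keep the example small.

```go
package main

import "fmt"

// store is a hypothetical view of a store with both signals attached.
type store struct {
	id         int
	rangeCount int
	qps        float64
}

// scorerOptions abstracts over the signal used for balancing. The methods
// here are illustrative and not the actual interface from the patch.
type scorerOptions interface {
	name() string
	load(s store) float64
}

// rangeCountOptions balances on range count (the replicateQueue's signal).
type rangeCountOptions struct{}

func (rangeCountOptions) name() string         { return "range count" }
func (rangeCountOptions) load(s store) float64 { return float64(s.rangeCount) }

// qpsOptions balances on QPS (the StoreRebalancer's signal).
type qpsOptions struct{}

func (qpsOptions) name() string         { return "qps" }
func (qpsOptions) load(s store) float64 { return s.qps }

// mostLoaded returns the ID of the store with the highest load under opts.
// stores must be non-empty.
func mostLoaded(stores []store, opts scorerOptions) int {
	best := stores[0]
	for _, s := range stores[1:] {
		if opts.load(s) > opts.load(best) {
			best = s
		}
	}
	return best.id
}

func main() {
	stores := []store{
		{id: 1, rangeCount: 120, qps: 50},
		{id: 2, rangeCount: 80, qps: 900},
	}
	for _, opts := range []scorerOptions{rangeCountOptions{}, qpsOptions{}} {
		fmt.Printf("%s: store %d\n", opts.name(), mostLoaded(stores, opts))
	}
}
```

    The calling code stays identical for both signals; only the options value
    passed in decides which store counts as the most loaded.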
    
    Release justification: Fixes high priority bug
    
    Release note: None
    aayushshah15 committed Sep 8, 2021
    3efcecf
  6. kvserver: rename StoreList.filter

    This commit renames `StoreList`'s `filter()` method to `excludeInvalid()` as
    the existing name was ambiguous.
    
    Release justification: Fixes high priority bug
    
    Release note: None
    aayushshah15 committed Sep 8, 2021
    153fd6b
  7. kvserver: promote QPS convergence during load-based lease rebalancing

    This commit augments `TransferLeaseTarget()` by adding a mode that picks
    the best lease transfer target that would lead to QPS convergence across
    the stores that have a replica for a given range.
    
    This commit implements a strategy that predicates lease transfer decisions on
    whether they would serve to reduce the QPS delta between existing replicas'
    stores.
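
    One way to picture this strategy is the sketch below. It assumes, for
    simplicity, that the whole QPS of the range follows the lease from the
    current leaseholder's store to the target's store. `replicaStore`,
    `qpsDelta`, and `bestLeaseTarget` are hypothetical names and do not
    reflect the signature of `TransferLeaseTarget()`.

```go
package main

import "fmt"

// replicaStore is a store that holds a replica of the range in question.
type replicaStore struct {
	id  int
	qps float64
}

// qpsDelta returns the gap between the most and least loaded stores.
func qpsDelta(stores []replicaStore) float64 {
	if len(stores) == 0 {
		return 0
	}
	lo, hi := stores[0].qps, stores[0].qps
	for _, s := range stores[1:] {
		if s.qps < lo {
			lo = s.qps
		}
		if s.qps > hi {
			hi = s.qps
		}
	}
	return hi - lo
}

// bestLeaseTarget simulates moving rangeQPS from the leaseholder's store to
// each other replica's store and returns the target that yields the smallest
// delta. It reports false when no transfer strictly reduces the delta.
func bestLeaseTarget(stores []replicaStore, leaseholder int, rangeQPS float64) (int, bool) {
	bestID, bestDelta, found := 0, qpsDelta(stores), false
	for _, target := range stores {
		if target.id == leaseholder {
			continue
		}
		after := make([]replicaStore, len(stores))
		copy(after, stores)
		for i := range after {
			switch after[i].id {
			case leaseholder:
				after[i].qps -= rangeQPS
			case target.id:
				after[i].qps += rangeQPS
			}
		}
		if d := qpsDelta(after); d < bestDelta {
			bestID, bestDelta, found = target.id, d, true
		}
	}
	return bestID, found
}

func main() {
	stores := []replicaStore{{1, 1000}, {2, 400}, {3, 700}}
	fmt.Println(bestLeaseTarget(stores, 1, 300)) // 2 true

	// A single range carrying all the load: moving the lease only moves the hotspot.
	hot := []replicaStore{{1, 1000}, {2, 0}, {3, 0}}
	fmt.Println(bestLeaseTarget(hot, 1, 1000)) // 0 false
}
```

    In the first case, transferring the lease to store 2 evens out all three
    stores at 700 QPS. In the second case no transfer lowers the delta, so the
    sketch declines to move the lease.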
    
    Resolves cockroachdb#31135
    
    Release justification: Fixes high priority bug
    
    Release note (bug fix): Previously, the store rebalancer was unable to
    rebalance leases for hot ranges that received a disproportionate amount
    of traffic relative to the rest of the cluster. This often led to
    prolonged single-node hotspots in workloads that produced hot ranges.
    This bug is now fixed.
    aayushshah15 committed Sep 8, 2021
    d61f474