
allocator: replace read amp with io thresh #97142

Merged
merged 1 commit into from
Feb 22, 2023

Conversation

kvoli
Collaborator

@kvoli kvoli commented Feb 14, 2023

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: #85084

Release note (ops change): We've deprecated two cluster settings:

  • kv.allocator.l0_sublevels_threshold
  • kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.
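
For context, here is a minimal, self-contained sketch of how such a normalized IO overload score can be derived from both L0 signals, so that either one crossing its threshold pushes the score to 1.0 or above. The type and threshold values are illustrative only, not the actual admission-control API:

```go
package main

import "fmt"

// ioThreshold is a hypothetical stand-in for the gossiped IO threshold state:
// current L0 sub-level and file counts alongside the limits they are scored
// against.
type ioThreshold struct {
	l0NumSubLevels, l0NumSubLevelsThreshold int64
	l0NumFiles, l0NumFilesThreshold         int64
}

// score normalizes both L0 signals and takes the worse of the two, so a store
// counts as IO overloaded (score >= 1.0) if either signal exceeds its limit.
func (t ioThreshold) score() float64 {
	sublevelScore := float64(t.l0NumSubLevels) / float64(t.l0NumSubLevelsThreshold)
	fileScore := float64(t.l0NumFiles) / float64(t.l0NumFilesThreshold)
	if fileScore > sublevelScore {
		return fileScore
	}
	return sublevelScore
}

func main() {
	// 5 sub-levels against a limit of 20 normalizes to 0.25 (the example used
	// in the test comment further down); the file count dominates here.
	t := ioThreshold{
		l0NumSubLevels: 5, l0NumSubLevelsThreshold: 20,
		l0NumFiles: 500, l0NumFilesThreshold: 1000,
	}
	fmt.Printf("io overload score: %.2f\n", t.score()) // 0.50
}
```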

@kvoli kvoli self-assigned this Feb 14, 2023
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from d39473c to e8c47d1 Compare February 14, 2023 20:54
@kvoli kvoli marked this pull request as ready for review February 14, 2023 21:04
@kvoli kvoli requested a review from a team as a code owner February 14, 2023 21:04
@kvoli kvoli requested a review from andrewbaptist February 14, 2023 21:04
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from e8c47d1 to fa3ad10 Compare February 14, 2023 21:27
Contributor

@irfansharif irfansharif left a comment

Since most of the code changes are in the allocator package, consider using allocator: in your {PR,commit} title prefix. Also, do bear with my warring against the passive voice, but I think it would help you write better comments/commit messages.

return score
}

// Intiailly sublevels is 5/20 = 0.25, expect that score.
Contributor

Typo.

Collaborator Author

Fixed.

// IOOverloadTrackedRetention period, which is 10 minutes. This serves to
// exclude stores based on historical information and not just
// point-in-time information.
storeHealthTracker struct {
Contributor

@irfansharif irfansharif Feb 16, 2023

[nit, throughout, feel free to ignore] "Store health" seems vague. Why not call it what it is -- storeIOOverloadTracker?

maxL0NumFiles, _ := detail.storeHealthTracker.NumL0FilesTracker.Query(now)
maxL0NumSublevels, _ := detail.storeHealthTracker.NumL0SublevelsTracker.Query(now)
storeDesc.Capacity.IOThreshold.L0NumFiles = int64(maxL0NumFiles)
storeDesc.Capacity.IOThreshold.L0NumSubLevels = int64(maxL0NumSublevels)
Contributor

Something smells fishy here. Are we stamping on follower pausing by updating this store capacity IO threshold state with a rolling 10m max? We're using this same state for follower pausing here:

ioThresholdMap := map[roachpb.StoreID]*admissionpb.IOThreshold{}
for _, sd := range s.cfg.StorePool.GetStores() {
ioThreshold := sd.Capacity.IOThreshold // need a copy
ioThresholdMap[sd.StoreID] = &ioThreshold
}

Which I take is updated for every store every 15s:

ticker := time.NewTicker(ioTokenTickDuration)
done := false
for !done {
select {
case <-ticker.C:
ticks++
if ticks%ticksInAdjustmentInterval == 0 {
metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
if len(metrics) != sgc.numStores {
log.Warningf(ctx,
"expected %d store metrics and found %d metrics", sgc.numStores, len(metrics))
}
for _, m := range metrics {
if unsafeGc, ok := sgc.gcMap.Load(int64(m.StoreID)); ok {
gc := (*GrantCoordinator)(unsafeGc)
gc.pebbleMetricsTick(ctx, m)
iotc.UpdateIOThreshold(roachpb.StoreID(m.StoreID), gc.ioLoadListener.ioThreshold)

Collaborator Author

That is a problem; I didn't check whether follower pausing was using the storepool to get the descriptors. The easiest way to swap in IOThreshold was to use the store descriptor capacity, and the store descriptor is taken directly from the storepool.

I am going to add a new type to encapsulate the descriptor and the derived capacity information, then replace calls to desc.Capacity with the derived capacity everywhere in the allocator and storepool update code.
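
For illustration only, a rough sketch of the kind of wrapper described above, with hypothetical names rather than the eventual implementation: the gossiped descriptor stays untouched for consumers such as follower pausing, while the allocator reads a separately derived view.

```go
package main

import "fmt"

// ioThresholdState and storeDescriptor are simplified stand-ins for the
// gossiped types; the real fields live on the store descriptor's Capacity.
type ioThresholdState struct {
	L0NumFiles, L0NumSubLevels int64
}

type storeDescriptor struct {
	StoreID     int32
	IOThreshold ioThresholdState // point-in-time, as reported by the store
}

// allocatorStore pairs the untouched gossiped descriptor with capacity
// signals derived only for allocation decisions, so other consumers keep
// seeing the point-in-time IO threshold.
type allocatorStore struct {
	Desc         storeDescriptor
	DerivedIOMax ioThresholdState // e.g. a rolling-window max over ~10 minutes
}

func main() {
	s := allocatorStore{
		Desc:         storeDescriptor{StoreID: 1, IOThreshold: ioThresholdState{L0NumSubLevels: 3}},
		DerivedIOMax: ioThresholdState{L0NumSubLevels: 25}, // recent spike retained for the allocator
	}
	fmt.Println("follower pausing sees:", s.Desc.IOThreshold.L0NumSubLevels) // 3
	fmt.Println("allocator scoring sees:", s.DerivedIOMax.L0NumSubLevels)    // 25
}
```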

@andrewbaptist
Contributor

@kvoli and I had a long conversation on this issue this afternoon. The high-level idea is:

  1. Remove the rolling window on storing this stat. It changes slowly enough that this isn't really necessary.
  2. Change the handling of suspect state so that a store that is down for a while transitions through suspect before being considered healthy (should be a separate PR).
  3. Consider moving a store that hits IO overload (>1.0) to suspect (see the sketch below). This accomplishes many of the same goals as a rolling window, but is more understandable.
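
A minimal sketch of idea 3, with hypothetical state and function names (not the actual store pool code): a store whose IO overload score crosses 1.0 is treated as suspect, which keeps it out of allocation targets for a while even after the score recovers.

```go
package main

import "fmt"

type storeStatus int

const (
	storeStatusAvailable storeStatus = iota
	storeStatusSuspect
)

// statusAfterIOCheck is a hypothetical helper: crossing an IO overload score
// of 1.0 demotes the store to suspect, achieving much of what a rolling
// window of the signal would, but in a more legible way.
func statusAfterIOCheck(cur storeStatus, ioOverloadScore float64) storeStatus {
	if ioOverloadScore > 1.0 {
		return storeStatusSuspect
	}
	return cur
}

func main() {
	fmt.Println(statusAfterIOCheck(storeStatusAvailable, 1.3) == storeStatusSuspect) // true
	fmt.Println(statusAfterIOCheck(storeStatusAvailable, 0.4) == storeStatusSuspect) // false
}
```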

@kvoli kvoli force-pushed the 230214.swap-l0-io branch 2 times, most recently from e5f9409 to 66092db Compare February 21, 2023 15:52
@kvoli kvoli changed the title from "kvserver: replace read amp with io threshold" to "allocator: replace read amp with io thresh" Feb 21, 2023
@kvoli kvoli force-pushed the 230214.swap-l0-io branch 2 times, most recently from 87677f2 to 01ac8be Compare February 21, 2023 20:37
Contributor

@irfansharif irfansharif left a comment

LGTM. I've tried to rewrite the commit message/release note to avoid referencing these non-public settings. It's still a good idea to note the deprecation of these old settings, like you've done.

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.

@@ -2607,9 +2607,9 @@ func newStoreMetrics(histogramWindow time.Duration) *StoreMetrics {
// L0SublevelsMax. This is not exported as a metric.
sm.l0SublevelsTracker.swag = slidingwindow.NewMaxSwag(
Contributor

Should we get rid of this? It's only used for metrics, AFAICT, and is surfacing a value we're no longer using in production code.

Collaborator Author

The gossiped value is taken from this. It isn't exposed to metrics. We can remove it in 23.2.
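
For reference, the gossiped maximum comes from a sliding-window max tracker. A minimal self-contained sketch of that idea (not the actual slidingwindow package API) keeps a per-bucket max over a fixed number of rotating buckets and reports the max across all of them:

```go
package main

import (
	"fmt"
	"time"
)

// maxTracker keeps the maximum value observed in each of a fixed number of
// rotating buckets; querying it returns the max across all buckets, i.e. a
// rolling max over roughly bucketCount*bucketWidth of history.
type maxTracker struct {
	buckets     []float64
	bucketWidth time.Duration
	lastRotate  time.Time
}

func newMaxTracker(now time.Time, bucketWidth time.Duration, bucketCount int) *maxTracker {
	return &maxTracker{buckets: make([]float64, bucketCount), bucketWidth: bucketWidth, lastRotate: now}
}

func (t *maxTracker) rotate(now time.Time) {
	for now.Sub(t.lastRotate) >= t.bucketWidth {
		copy(t.buckets[1:], t.buckets[:len(t.buckets)-1]) // age buckets, dropping the oldest
		t.buckets[0] = 0
		t.lastRotate = t.lastRotate.Add(t.bucketWidth)
	}
}

func (t *maxTracker) Record(now time.Time, v float64) {
	t.rotate(now)
	if v > t.buckets[0] {
		t.buckets[0] = v
	}
}

func (t *maxTracker) Query(now time.Time) float64 {
	t.rotate(now)
	max := 0.0
	for _, v := range t.buckets {
		if v > max {
			max = v
		}
	}
	return max
}

func main() {
	start := time.Now()
	// Five 2-minute buckets ~= a 10-minute retention window, similar in
	// spirit to the IOOverloadTrackedRetention period discussed above.
	tr := newMaxTracker(start, 2*time.Minute, 5)
	tr.Record(start, 24)                   // spike in L0 sub-levels
	tr.Record(start.Add(4*time.Minute), 3) // back to healthy
	fmt.Println(tr.Query(start.Add(4 * time.Minute)))  // 24: the spike is still visible
	fmt.Println(tr.Query(start.Add(30 * time.Minute))) // 0: the spike has aged out
}
```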

@kvoli kvoli force-pushed the 230214.swap-l0-io branch from 01ac8be to aae213b Compare February 22, 2023 18:43
@kvoli
Collaborator Author

kvoli commented Feb 22, 2023

LGTM. I've tried to rewrite the commit message/release note to avoid referencing these non-public settings. It's still a good idea to note the deprecation of these old settings, like you've done.

I've updated the commit message to be this, thanks!

TYFTR

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (cockroachdb#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of cockroachdb#83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from aae213b to 4b11002 Compare February 22, 2023 19:18
@kvoli
Collaborator Author

kvoli commented Feb 22, 2023

bors r=irfansharif

@craig
Contributor

craig bot commented Feb 22, 2023

Build succeeded:

kvoli added a commit to kvoli/cockroach that referenced this pull request Feb 24, 2023
Previously, the allocator would transfer leases without considering the
candidates' IO overload. When leases transferred to IO overloaded stores,
service latency tended to degrade.

This commit adds health checks prior to lease transfers. The health
checks are similar to the IO overload checks for allocating replicas in cockroachdb#97142.

The checks work by comparing a candidate store against
`kv.allocator.io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.io_overload_threshold` +
`kv.allocator.io_overload_threshold_enforcement_leases`.

The level of enforcement is controlled by
`kv.allocator.io_overload_threshold_enforcement_leases`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `block_none`: don't exclude stores.
- `block_none_log`: don't exclude stores, log an event.
- `block`: exclude stores from being considered as leaseholder targets
  for a range if they exceed the threshold. The current leaseholder store
  will NOT be excluded as a candidate for its current range leases.
- `shed`: same behavior as `block`, however the current leaseholder store
  WILL BE excluded as a candidate for its current range leases, i.e. the
  lease will always transfer to a healthy and valid store if one exists.

The default enforcement is `block`, with a buffer value of `0.4`.

Resolves: cockroachdb#96508

Release note: None
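
A minimal sketch of the check described in this (earlier) version of the commit message, using hypothetical helper names; the threshold value below is illustrative, and the setting names were revised in the follow-up commit message further down:

```go
package main

import "fmt"

// ioOverloadedCandidate mirrors the described rule: a candidate is treated as
// IO overloaded only if its score is at or above both the absolute threshold
// and the mean score of the candidate set.
func ioOverloadedCandidate(score, threshold, candidateMean float64) bool {
	return score >= threshold && score >= candidateMean
}

// shouldShedLease applies the higher bar for the current leaseholder: it must
// reach the threshold plus an additional buffer before its leases are shed.
func shouldShedLease(leaseholderScore, threshold, buffer float64) bool {
	return leaseholderScore >= threshold+buffer
}

func main() {
	threshold := 0.5 // illustrative value for kv.allocator.io_overload_threshold
	buffer := 0.4    // buffer default per the message above
	fmt.Println(ioOverloadedCandidate(0.7, threshold, 0.3)) // true: above threshold and mean
	fmt.Println(ioOverloadedCandidate(0.7, threshold, 0.9)) // false: below the candidate mean
	fmt.Println(shouldShedLease(0.95, threshold, buffer))   // true: above 0.5 + 0.4
}
```
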
kvoli added a commit to kvoli/cockroach that referenced this pull request Feb 27, 2023
Previously, the allocator would return lease transfer targets without
considering the IO overload of the stores involved. When leases transferred to
IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The IO
overload checks are similar to the IO overload checks for allocating
replicas in cockroachdb#97142.

The checks work by comparing a candidate store against
`kv.allocator.lease_io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded. The default value is 0.5.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.lease_shed_io_overload_threshold`. The default value is
0.9.

The level of enforcement is controlled by
`kv.allocator.lease_io_overload_threshold_enforcement`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `ignore`: ignore IO overload scores entirely during lease transfers
  (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't
  IO overloaded (existing leases on IO overloaded stores are left as
  is);
- `shed`: actively shed leases from IO overloaded stores to less IO
  overloaded stores (this is a super-set of block_transfer_to).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be
prefixed with `Replica`, to avoid confusion between lease and replica
IO overload checks.

Resolves: cockroachdb#96508

Release note (ops change): Range leases will no longer be transferred to
stores which are IO overloaded.
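
To make the enforcement modes concrete, here is a small self-contained sketch (hypothetical function and type names, not the shipped allocator code) of how a lease-transfer decision could branch on the described setting, using the stated defaults of 0.5 for the candidate threshold and 0.9 for shedding:

```go
package main

import "fmt"

type leaseIOEnforcement string

const (
	ignore          leaseIOEnforcement = "ignore"
	blockTransferTo leaseIOEnforcement = "block_transfer_to"
	shed            leaseIOEnforcement = "shed"
)

// excludeAsTransferTarget reports whether a candidate store should be skipped
// as a lease transfer target: only when enforcement is block_transfer_to or
// shed, and the candidate is at or above both the threshold and the mean.
func excludeAsTransferTarget(mode leaseIOEnforcement, score, threshold, candidateMean float64) bool {
	if mode == ignore {
		return false
	}
	return score >= threshold && score >= candidateMean
}

// shedFromLeaseholder reports whether the current leaseholder should actively
// move its leases away; this only happens in shed mode, and only past the
// higher shed threshold.
func shedFromLeaseholder(mode leaseIOEnforcement, leaseholderScore, shedThreshold float64) bool {
	return mode == shed && leaseholderScore >= shedThreshold
}

func main() {
	const threshold, shedThreshold = 0.5, 0.9 // defaults described above
	fmt.Println(excludeAsTransferTarget(blockTransferTo, 0.8, threshold, 0.4)) // true
	fmt.Println(excludeAsTransferTarget(ignore, 0.8, threshold, 0.4))          // false
	fmt.Println(shedFromLeaseholder(shed, 0.95, shedThreshold))                // true
	fmt.Println(shedFromLeaseholder(blockTransferTo, 0.95, shedThreshold))     // false
}
```
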
craig bot pushed a commit that referenced this pull request Mar 7, 2023
97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without
considering the IO overload of the stores involved. When leases transferred to
IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The IO
overload checks are similar to the IO overload checks for allocating
replicas in #97142.

The checks work by comparing a candidate store against
`kv.allocator.lease_io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded. The default value is 0.5.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.lease_shed_io_overload_threshold`. The default value is
0.9.

The level of enforcement is controlled by
`kv.allocator.lease_io_overload_threshold_enforcement`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `ignore`: ignore IO overload scores entirely during lease transfers
  (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't
  IO overloaded (existing leases on IO overloaded stores are left as
  is);
- `shed`: actively shed leases from IO overloaded stores to less IO
  overloaded stores (this is a super-set of block_transfer_to).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be
prefixed with `Replica`, to avoid confusion between lease and replica
IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to
stores which are IO overloaded.

98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic that extends the last flushed file in fileSSTSink does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, as the file path + start key combination in the manifest is assumed to be unique.

Fixes #97953 

Release note: None
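
Purely as an illustration of the described failure mode (hypothetical, simplified logic, not the actual fileSSTSink code), the shape of such an off-by-one looks like this:

```go
package main

import "fmt"

// shouldExtendLastFile is a simplified, hypothetical stand-in for the
// extension check: an incoming span that starts where the last flushed file
// ended should be merged into that file rather than opening a new entry.
// The buggy form only triggers with more than one flushed file, so the first
// flushed file is never extended and a duplicate start key can reach the
// backup manifest.
func shouldExtendLastFile(flushedCount int, lastEndKey, nextStartKey string) bool {
	return flushedCount > 1 && lastEndKey == nextStartKey // fixed form: flushedCount >= 1
}

func main() {
	// With a single flushed file spanning a-a, the next file starting at "a"
	// is not merged, yielding two manifest entries that both start at "a".
	fmt.Println(shouldExtendLastFile(1, "a", "a")) // false under the buggy check
}
```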

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests which ran a 2TB bank restore to
benchmark restore performance across a few hardware configurations. This patch
also replaces the `restoreTPCCInc/nodes=10` test which tested our ability to
handle a backup with a long chain.

This patch also adds:
1. `restore/tpce/400GB/aws/nodes=4/cpus=16` to measure how per-node throughput
scales when the per-node vCPU count doubles relative to the default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8` to measure how per-node throughput
scales when the number of nodes doubles relative to the default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8` to measure
restore reliability and performance on a 48-length backup chain relative to
the default.

A future patch will update the fixtures used in the restore node shutdown
scripts, and add more perf based tests.

Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>

Successfully merging this pull request may close these issues.

kvserver: allocator should use io threshold instead of l0 sublevels
4 participants