
allocator: replace read amp with io thresh #97142

Merged
merged 1 commit into from
Feb 22, 2023

Conversation

kvoli
Collaborator

@kvoli kvoli commented Feb 14, 2023

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: #85084

Release note (ops change): We've deprecated two cluster settings:

  • kv.allocator.l0_sublevels_threshold
  • kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.
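
For context, here is a minimal, self-contained sketch of how such a normalized IO overload score can be derived from both L0 signals, so that either one crossing its threshold pushes the score to 1.0 or above. The type and threshold values are illustrative only, not the actual admission-control API:

```go
package main

import "fmt"

// ioThreshold is a hypothetical stand-in for the gossiped IO threshold state:
// current L0 sub-level and file counts alongside the limits they are scored
// against.
type ioThreshold struct {
	l0NumSubLevels, l0NumSubLevelsThreshold int64
	l0NumFiles, l0NumFilesThreshold         int64
}

// score normalizes both L0 signals and takes the worse of the two, so a store
// counts as IO overloaded (score >= 1.0) if either signal exceeds its limit.
func (t ioThreshold) score() float64 {
	sublevelScore := float64(t.l0NumSubLevels) / float64(t.l0NumSubLevelsThreshold)
	fileScore := float64(t.l0NumFiles) / float64(t.l0NumFilesThreshold)
	if fileScore > sublevelScore {
		return fileScore
	}
	return sublevelScore
}

func main() {
	// 5 sub-levels against a limit of 20 normalizes to 0.25 (the example used
	// in the test comment further down); the file count dominates here.
	t := ioThreshold{
		l0NumSubLevels: 5, l0NumSubLevelsThreshold: 20,
		l0NumFiles: 500, l0NumFilesThreshold: 1000,
	}
	fmt.Printf("io overload score: %.2f\n", t.score()) // 0.50
}
```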

@kvoli kvoli self-assigned this Feb 14, 2023
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from d39473c to e8c47d1 Compare February 14, 2023 20:54
@kvoli kvoli marked this pull request as ready for review February 14, 2023 21:04
@kvoli kvoli requested a review from a team as a code owner February 14, 2023 21:04
@kvoli kvoli requested a review from andrewbaptist February 14, 2023 21:04
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from e8c47d1 to fa3ad10 Compare February 14, 2023 21:27
Contributor

@irfansharif irfansharif left a comment

Since most of the code changes are in the allocator package, consider using allocator: in your {PR,commit} title prefix. Also, do bear with my warring against the passive voice, but I think it would help you write better comments/commit messages.

return score
}

// Intiailly sublevels is 5/20 = 0.25, expect that score.
Contributor

Typo.

Collaborator Author

Fixed.

// IOOverloadTrackedRetention period, which is 10 minutes. This serves to
// exclude stores based on historical information and not just
// point-in-time information.
storeHealthTracker struct {
Contributor

@irfansharif irfansharif Feb 16, 2023

[nit, throughout, feel free to ignore] "Store health" seems vague. Why not call it what it is -- storeIOOverloadTracker?

maxL0NumFiles, _ := detail.storeHealthTracker.NumL0FilesTracker.Query(now)
maxL0NumSublevels, _ := detail.storeHealthTracker.NumL0SublevelsTracker.Query(now)
storeDesc.Capacity.IOThreshold.L0NumFiles = int64(maxL0NumFiles)
storeDesc.Capacity.IOThreshold.L0NumSubLevels = int64(maxL0NumSublevels)
Contributor

Something smells fishy here. Are we stamping on follower pausing by updating this store capacity IO threshold state with a rolling 10m max? We're using this same state for follower pausing here:

ioThresholdMap := map[roachpb.StoreID]*admissionpb.IOThreshold{}
for _, sd := range s.cfg.StorePool.GetStores() {
ioThreshold := sd.Capacity.IOThreshold // need a copy
ioThresholdMap[sd.StoreID] = &ioThreshold
}

Which I take is updated for every store every 15s:

ticker := time.NewTicker(ioTokenTickDuration)
done := false
for !done {
select {
case <-ticker.C:
ticks++
if ticks%ticksInAdjustmentInterval == 0 {
metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
if len(metrics) != sgc.numStores {
log.Warningf(ctx,
"expected %d store metrics and found %d metrics", sgc.numStores, len(metrics))
}
for _, m := range metrics {
if unsafeGc, ok := sgc.gcMap.Load(int64(m.StoreID)); ok {
gc := (*GrantCoordinator)(unsafeGc)
gc.pebbleMetricsTick(ctx, m)
iotc.UpdateIOThreshold(roachpb.StoreID(m.StoreID), gc.ioLoadListener.ioThreshold)

Collaborator Author

That is a problem; I didn't check whether follower pausing was using the storepool to get the descriptors. The easiest way to swap in IOThreshold was to use the store descriptor capacity, and the store descriptor is taken directly from the storepool.

I am going to add a new type to encapsulate the descriptor and the derived capacity information, then replace calls to desc.Capacity with the derived capacity everywhere in the allocator and storepool update code.
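
For illustration only, a rough sketch of the kind of wrapper described above, with hypothetical names rather than the eventual implementation: the gossiped descriptor stays untouched for consumers such as follower pausing, while the allocator reads a separately derived view.

```go
package main

import "fmt"

// ioThresholdState and storeDescriptor are simplified stand-ins for the
// gossiped types; the real fields live on the store descriptor's Capacity.
type ioThresholdState struct {
	L0NumFiles, L0NumSubLevels int64
}

type storeDescriptor struct {
	StoreID     int32
	IOThreshold ioThresholdState // point-in-time, as reported by the store
}

// allocatorStore pairs the untouched gossiped descriptor with capacity
// signals derived only for allocation decisions, so other consumers keep
// seeing the point-in-time IO threshold.
type allocatorStore struct {
	Desc         storeDescriptor
	DerivedIOMax ioThresholdState // e.g. a rolling-window max over ~10 minutes
}

func main() {
	s := allocatorStore{
		Desc:         storeDescriptor{StoreID: 1, IOThreshold: ioThresholdState{L0NumSubLevels: 3}},
		DerivedIOMax: ioThresholdState{L0NumSubLevels: 25}, // recent spike retained for the allocator
	}
	fmt.Println("follower pausing sees:", s.Desc.IOThreshold.L0NumSubLevels) // 3
	fmt.Println("allocator scoring sees:", s.DerivedIOMax.L0NumSubLevels)    // 25
}
```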

@andrewbaptist
Contributor

@kvoli and I had a long conversation on this issue this afternoon. The high-level idea is:

  1. Remove the rolling window on storing this stat. It changes slowly enough that this isn't really necessary.
  2. Change the handling of suspect state so that a store that is down for a while transitions through suspect before being considered healthy (should be a separate PR).
  3. Consider moving a store that hits IO overload (>1.0) to suspect (see the sketch below). This accomplishes many of the same goals as a rolling window, but is more understandable.
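
A minimal sketch of idea 3, with hypothetical state and function names (not the actual store pool code): a store whose IO overload score crosses 1.0 is treated as suspect, which keeps it out of allocation targets for a while even after the score recovers.

```go
package main

import "fmt"

type storeStatus int

const (
	storeStatusAvailable storeStatus = iota
	storeStatusSuspect
)

// statusAfterIOCheck is a hypothetical helper: crossing an IO overload score
// of 1.0 demotes the store to suspect, achieving much of what a rolling
// window of the signal would, but in a more legible way.
func statusAfterIOCheck(cur storeStatus, ioOverloadScore float64) storeStatus {
	if ioOverloadScore > 1.0 {
		return storeStatusSuspect
	}
	return cur
}

func main() {
	fmt.Println(statusAfterIOCheck(storeStatusAvailable, 1.3) == storeStatusSuspect) // true
	fmt.Println(statusAfterIOCheck(storeStatusAvailable, 0.4) == storeStatusSuspect) // false
}
```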

@kvoli kvoli force-pushed the 230214.swap-l0-io branch 2 times, most recently from e5f9409 to 66092db Compare February 21, 2023 15:52
@kvoli kvoli changed the title from "kvserver: replace read amp with io threshold" to "allocator: replace read amp with io thresh" Feb 21, 2023
@kvoli kvoli force-pushed the 230214.swap-l0-io branch 2 times, most recently from 87677f2 to 01ac8be Compare February 21, 2023 20:37
Contributor

@irfansharif irfansharif left a comment

LGTM. I've tried to rewrite the commit message/release note to avoid referencing these non-public settings. It's still a good idea to note the deprecation of these old settings, like you've done.

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.

@@ -2607,9 +2607,9 @@ func newStoreMetrics(histogramWindow time.Duration) *StoreMetrics {
// L0SublevelsMax. This is not exported as a metric.
sm.l0SublevelsTracker.swag = slidingwindow.NewMaxSwag(
Contributor

Should we get rid of this? It's only used for metrics, AFAICT, and is surfacing a value we're no longer using in production code.

Collaborator Author

The gossiped value is taken from this. It isn't exposed to metrics. We can remove it in 23.2.
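
For reference, the gossiped maximum comes from a sliding-window max tracker. A minimal self-contained sketch of that idea (not the actual slidingwindow package API) keeps a per-bucket max over a fixed number of rotating buckets and reports the max across all of them:

```go
package main

import (
	"fmt"
	"time"
)

// maxTracker keeps the maximum value observed in each of a fixed number of
// rotating buckets; querying it returns the max across all buckets, i.e. a
// rolling max over roughly bucketCount*bucketWidth of history.
type maxTracker struct {
	buckets     []float64
	bucketWidth time.Duration
	lastRotate  time.Time
}

func newMaxTracker(now time.Time, bucketWidth time.Duration, bucketCount int) *maxTracker {
	return &maxTracker{buckets: make([]float64, bucketCount), bucketWidth: bucketWidth, lastRotate: now}
}

func (t *maxTracker) rotate(now time.Time) {
	for now.Sub(t.lastRotate) >= t.bucketWidth {
		copy(t.buckets[1:], t.buckets[:len(t.buckets)-1]) // age buckets, dropping the oldest
		t.buckets[0] = 0
		t.lastRotate = t.lastRotate.Add(t.bucketWidth)
	}
}

func (t *maxTracker) Record(now time.Time, v float64) {
	t.rotate(now)
	if v > t.buckets[0] {
		t.buckets[0] = v
	}
}

func (t *maxTracker) Query(now time.Time) float64 {
	t.rotate(now)
	max := 0.0
	for _, v := range t.buckets {
		if v > max {
			max = v
		}
	}
	return max
}

func main() {
	start := time.Now()
	// Five 2-minute buckets ~= a 10-minute retention window, similar in
	// spirit to the IOOverloadTrackedRetention period discussed above.
	tr := newMaxTracker(start, 2*time.Minute, 5)
	tr.Record(start, 24)                   // spike in L0 sub-levels
	tr.Record(start.Add(4*time.Minute), 3) // back to healthy
	fmt.Println(tr.Query(start.Add(4 * time.Minute)))  // 24: the spike is still visible
	fmt.Println(tr.Query(start.Add(30 * time.Minute))) // 0: the spike has aged out
}
```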

@kvoli kvoli force-pushed the 230214.swap-l0-io branch from 01ac8be to aae213b Compare February 22, 2023 18:43
@kvoli
Collaborator Author

kvoli commented Feb 22, 2023

LGTM. I've tried to rewrite the commit message/release note to avoid referencing these non-public settings. It's still a good idea to note the deprecation of these old settings, like you've done.

I've updated the commit message to be this, thanks!

TYFTR

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (cockroachdb#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of cockroachdb#83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

These settings were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has now been replaced by other internal
mechanisms.
@kvoli kvoli force-pushed the 230214.swap-l0-io branch from aae213b to 4b11002 Compare February 22, 2023 19:18
@kvoli
Collaborator Author

kvoli commented Feb 22, 2023

bors r=irfansharif

@craig
Contributor

craig bot commented Feb 22, 2023

Build succeeded:

kvoli added a commit to kvoli/cockroach that referenced this pull request Feb 24, 2023
Previously, the allocator would transfer leases without considering the
candidates' IO overload. When leases transferred to IO overloaded stores,
service latency tended to degrade.

This commit adds health checks prior to lease transfers. The health
checks are similar to the IO overload checks for allocating replicas in cockroachdb#97142.

The checks work by comparing a candidate store against
`kv.allocator.io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.io_overload_threshold` +
`kv.allocator.io_overload_threshold_enforcement_leases`.

The level of enforcement is controlled by
`kv.allocator.io_overload_threshold_enforcement_leases`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `block_none`: don't exclude stores.
- `block_none_log`: don't exclude stores, log an event.
- `block`: exclude stores from being considered as leaseholder targets
  for a range if they exceed the threshold. The current leaseholder store
  will NOT be excluded as a candidate for its current range leases.
- `shed`: same behavior as `block`, however the current leaseholder store
  WILL BE excluded as a candidate for its current range leases, i.e. the
  lease will always transfer to a healthy and valid store if one exists.

The default enforcement is `block`, with a buffer value of `0.4`.

Resolves: cockroachdb#96508

Release note: None
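
A minimal sketch of the check described in this (earlier) version of the commit message, using hypothetical helper names; the threshold value below is illustrative, and the setting names were revised in the follow-up commit message further down:

```go
package main

import "fmt"

// ioOverloadedCandidate mirrors the described rule: a candidate is treated as
// IO overloaded only if its score is at or above both the absolute threshold
// and the mean score of the candidate set.
func ioOverloadedCandidate(score, threshold, candidateMean float64) bool {
	return score >= threshold && score >= candidateMean
}

// shouldShedLease applies the higher bar for the current leaseholder: it must
// reach the threshold plus an additional buffer before its leases are shed.
func shouldShedLease(leaseholderScore, threshold, buffer float64) bool {
	return leaseholderScore >= threshold+buffer
}

func main() {
	threshold := 0.5 // illustrative value for kv.allocator.io_overload_threshold
	buffer := 0.4    // buffer default per the message above
	fmt.Println(ioOverloadedCandidate(0.7, threshold, 0.3)) // true: above threshold and mean
	fmt.Println(ioOverloadedCandidate(0.7, threshold, 0.9)) // false: below the candidate mean
	fmt.Println(shouldShedLease(0.95, threshold, buffer))   // true: above 0.5 + 0.4
}
```
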
kvoli added a commit to kvoli/cockroach that referenced this pull request Feb 27, 2023
Previously, the allocator would return lease transfer targets without
considering the IO overload of the stores involved. When leases transferred to
IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The IO
overload checks are similar to the IO overload checks for allocating
replicas in cockroachdb#97142.

The checks work by comparing a candidate store against
`kv.allocator.lease_io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded. The default value is 0.5.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.lease_shed_io_overload_threshold`. The default value is
0.9.

The level of enforcement is controlled by
`kv.allocator.lease_io_overload_threshold_enforcement`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `ignore`: ignore IO overload scores entirely during lease transfers
  (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't
  IO overloaded (existing leases on IO overloaded stores are left as
  is);
- `shed`: actively shed leases from IO overloaded stores to less IO
  overloaded stores (this is a super-set of block_transfer_to).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be
prefixed with `Replica`, to avoid confusion between lease and replica
IO overload checks.

Resolves: cockroachdb#96508

Release note (ops change): Range leases will no longer be transferred to
stores which are IO overloaded.
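
To make the enforcement modes concrete, here is a small self-contained sketch (hypothetical function and type names, not the shipped allocator code) of how a lease-transfer decision could branch on the described setting, using the stated defaults of 0.5 for the candidate threshold and 0.9 for shedding:

```go
package main

import "fmt"

type leaseIOEnforcement string

const (
	ignore          leaseIOEnforcement = "ignore"
	blockTransferTo leaseIOEnforcement = "block_transfer_to"
	shed            leaseIOEnforcement = "shed"
)

// excludeAsTransferTarget reports whether a candidate store should be skipped
// as a lease transfer target: only when enforcement is block_transfer_to or
// shed, and the candidate is at or above both the threshold and the mean.
func excludeAsTransferTarget(mode leaseIOEnforcement, score, threshold, candidateMean float64) bool {
	if mode == ignore {
		return false
	}
	return score >= threshold && score >= candidateMean
}

// shedFromLeaseholder reports whether the current leaseholder should actively
// move its leases away; this only happens in shed mode, and only past the
// higher shed threshold.
func shedFromLeaseholder(mode leaseIOEnforcement, leaseholderScore, shedThreshold float64) bool {
	return mode == shed && leaseholderScore >= shedThreshold
}

func main() {
	const threshold, shedThreshold = 0.5, 0.9 // defaults described above
	fmt.Println(excludeAsTransferTarget(blockTransferTo, 0.8, threshold, 0.4)) // true
	fmt.Println(excludeAsTransferTarget(ignore, 0.8, threshold, 0.4))          // false
	fmt.Println(shedFromLeaseholder(shed, 0.95, shedThreshold))                // true
	fmt.Println(shedFromLeaseholder(blockTransferTo, 0.95, shedThreshold))     // false
}
```
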
craig bot pushed a commit that referenced this pull request Mar 7, 2023
97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without
considering the IO overload of the stores involved. When leases transferred to
IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The IO
overload checks are similar to the IO overload checks for allocating
replicas in #97142.

The checks work by comparing a candidate store against
`kv.allocator.lease_io_overload_threshold` and the mean of other candidates.
If the candidate store is equal to or greater than both these values, it
is considered IO overloaded. The default value is 0.5.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater or equal to
`kv.allocator.lease_shed_io_overload_threshold`. The default value is
0.9.

The level of enforcement is controlled by
`kv.allocator.lease_io_overload_threshold_enforcement`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded.

- `ignore`: ignore IO overload scores entirely during lease transfers
  (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't
  IO overloaded (existing leases on IO overloaded stores are left as
  is);
- `shed`: actively shed leases from IO overloaded stores to less IO
  overloaded stores (this is a super-set of block_transfer_to).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be
prefixed with `Replica`, to avoid confusion between lease and replica
IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to
stores which are IO overloaded.

98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic that extends the last flushed file in fileSSTSink does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, as the file path + start key combination in the manifest is assumed to be unique.

Fixes #97953 

Release note: None
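
Purely as an illustration of the described failure mode (hypothetical, simplified logic, not the actual fileSSTSink code), the shape of such an off-by-one looks like this:

```go
package main

import "fmt"

// shouldExtendLastFile is a simplified, hypothetical stand-in for the
// extension check: an incoming span that starts where the last flushed file
// ended should be merged into that file rather than opening a new entry.
// The buggy form only triggers with more than one flushed file, so the first
// flushed file is never extended and a duplicate start key can reach the
// backup manifest.
func shouldExtendLastFile(flushedCount int, lastEndKey, nextStartKey string) bool {
	return flushedCount > 1 && lastEndKey == nextStartKey // fixed form: flushedCount >= 1
}

func main() {
	// With a single flushed file spanning a-a, the next file starting at "a"
	// is not merged, yielding two manifest entries that both start at "a".
	fmt.Println(shouldExtendLastFile(1, "a", "a")) // false under the buggy check
}
```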

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests which ran a 2TB bank restore to
benchmark restore performance across a few hardware configurations. This patch
also replaces the `restoreTPCCInc/nodes=10` test which tested our ability to
handle a backup with a long chain.

This patch also adds:
1. `restore/tpce/400GB/aws/nodes=4/cpus=16` to measure how per-node throughput
scales when the per-node vCPU count doubles relative to the default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8` to measure how per-node throughput
scales when the number of nodes doubles relative to the default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8` to measure
restore reliability and performance on a 48-length backup chain relative to
the default.

A future patch will update the fixtures used in the restore node shutdown
scripts, and add more perf based tests.

Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>

Successfully merging this pull request may close these issues.

kvserver: allocator should use io threshold instead of l0 sublevels
4 participants