allocator: replace read amp with io thresh #97142
Conversation
Force-pushed d39473c to e8c47d1 (Compare)
Force-pushed e8c47d1 to fa3ad10 (Compare)
Since most of the code changes are in the allocator package, consider using allocator: in your {PR,commit} title prefix. Also, do bear with my warring against the passive voice, but I think it would help you write better comments/commit messages.
	return score
}

// Intiailly sublevels is 5/20 = 0.25, expect that score.
Typo.
Fixed.
// IOOverloadTrackedRetention period, which is 10 minutes. This serves to
// exclude stores based on historical information and not just
// point-in-time information.
storeHealthTracker struct {
[nit, throughout, feel free to ignore] "Store health" seems vague. Why not call it what it is -- storeIOOverloadTracker?
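For readers skimming the thread: whatever it ends up being named, the tracker quoted above keeps a rolling maximum over the IOOverloadTrackedRetention window (10 minutes), so a transient L0 spike keeps excluding a store for a while after it subsides. A minimal sketch of that rolling-max idea in Go, assuming hypothetical names and a fixed-bucket scheme (the actual code uses the slidingwindow package, e.g. slidingwindow.NewMaxSwag):

// Hypothetical sketch of a rolling-max tracker over a retention window.
// Names and the fixed-bucket scheme are illustrative, not the actual
// slidingwindow package API.
package main

import (
	"fmt"
	"time"
)

type maxTracker struct {
	bucketDur time.Duration // width of each bucket
	buckets   []float64     // per-bucket maxima; index 0 is the current bucket
	lastTick  time.Time
}

func newMaxTracker(retention time.Duration, numBuckets int) *maxTracker {
	return &maxTracker{
		bucketDur: retention / time.Duration(numBuckets),
		buckets:   make([]float64, numBuckets),
		lastTick:  time.Now(),
	}
}

// rotate expires buckets that have aged out of the retention window.
func (t *maxTracker) rotate(now time.Time) {
	for now.Sub(t.lastTick) >= t.bucketDur {
		copy(t.buckets[1:], t.buckets[:len(t.buckets)-1]) // shift; oldest falls off
		t.buckets[0] = 0
		t.lastTick = t.lastTick.Add(t.bucketDur)
	}
}

// Record folds a sample into the current bucket.
func (t *maxTracker) Record(now time.Time, v float64) {
	t.rotate(now)
	if v > t.buckets[0] {
		t.buckets[0] = v
	}
}

// Query returns the maximum sample seen over the retention window.
func (t *maxTracker) Query(now time.Time) float64 {
	t.rotate(now)
	m := 0.0
	for _, v := range t.buckets {
		if v > m {
			m = v
		}
	}
	return m
}

func main() {
	tr := newMaxTracker(10*time.Minute, 10)
	now := time.Now()
	tr.Record(now, 25) // transient L0 sublevel spike
	tr.Record(now.Add(time.Minute), 5)
	// The spike is still within the 10m window, so it dominates.
	fmt.Println(tr.Query(now.Add(time.Minute))) // 25
}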
maxL0NumFiles, _ := detail.storeHealthTracker.NumL0FilesTracker.Query(now)
maxL0NumSublevels, _ := detail.storeHealthTracker.NumL0SublevelsTracker.Query(now)
storeDesc.Capacity.IOThreshold.L0NumFiles = int64(maxL0NumFiles)
storeDesc.Capacity.IOThreshold.L0NumSubLevels = int64(maxL0NumSublevels)
Something smells fishy here. Are we stamping on follower pausing by updating this store capacity IO threshold state with a rolling 10m max? We're using this same state for follower pausing here:
cockroach/pkg/kv/kvserver/store_raft.go, lines 787 to 791 in 2111f9b:

ioThresholdMap := map[roachpb.StoreID]*admissionpb.IOThreshold{}
for _, sd := range s.cfg.StorePool.GetStores() {
	ioThreshold := sd.Capacity.IOThreshold // need a copy
	ioThresholdMap[sd.StoreID] = &ioThreshold
}
Which, I take it, is updated for every store every 15s:
cockroach/pkg/util/admission/grant_coordinator.go, lines 100 to 116 in 2cac11b:

ticker := time.NewTicker(ioTokenTickDuration)
done := false
for !done {
	select {
	case <-ticker.C:
		ticks++
		if ticks%ticksInAdjustmentInterval == 0 {
			metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
			if len(metrics) != sgc.numStores {
				log.Warningf(ctx,
					"expected %d store metrics and found %d metrics", sgc.numStores, len(metrics))
			}
			for _, m := range metrics {
				if unsafeGc, ok := sgc.gcMap.Load(int64(m.StoreID)); ok {
					gc := (*GrantCoordinator)(unsafeGc)
					gc.pebbleMetricsTick(ctx, m)
					iotc.UpdateIOThreshold(roachpb.StoreID(m.StoreID), gc.ioLoadListener.ioThreshold)
That is a problem; I didn't check whether follower pausing was using the storepool to get the descriptors. The easiest way to swap in IOThreshold was to use the store descriptor capacity, and the store descriptor is taken directly from the storepool.
I am going to add a new type to encapsulate the descriptor and the derived capacity information, then replace calls to desc.Capacity with the derived capacity everywhere in the allocator and storepool update code.
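Roughly what such an encapsulating type could look like — a hedged sketch with hypothetical names, not the PR's actual API (the underlying descriptor and capacity types are roachpb.StoreDescriptor and roachpb.StoreCapacity):

// Hypothetical sketch: pair the gossiped descriptor with locally
// derived capacity signals so allocator code stops reading
// desc.Capacity directly. Names are illustrative, not the PR's API.
package storepool

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// storeDetail bundles the raw gossiped descriptor with capacity fields
// the storepool derives itself (e.g. the rolling-max IOThreshold),
// so consumers like follower pausing still see the raw gossiped state.
type storeDetail struct {
	// desc is the last gossiped descriptor, left unmodified.
	desc *roachpb.StoreDescriptor
	// derivedCapacity starts as a copy of desc.Capacity and is then
	// overlaid with locally tracked signals, for allocator use only.
	derivedCapacity roachpb.StoreCapacity
}

// Capacity returns the derived capacity for allocator decisions.
func (sd *storeDetail) Capacity() roachpb.StoreCapacity {
	return sd.derivedCapacity
}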
Force-pushed fa3ad10 to 8cea573 (Compare)
@kvoli and I had a long conversation on this issue this afternoon. The high-level idea is
Force-pushed e5f9409 to 66092db (Compare)
Force-pushed 87677f2 to 01ac8be (Compare)
LGTM. I've tried to rewrite the commit message/release note to avoid referencing these non-public settings. It's still a good idea to note the deprecation of these old settings, like you've done.
We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.
Resolves: cockroachdb#85084
Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

Both were used to control rebalancing and upreplication behavior in the face of IO overloaded stores. This has now been replaced by other internal mechanisms.
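For reference, the normalized IO overload score is, roughly, the worse of the two L0 fractions. A sketch of the scoring shape in Go; the sublevel threshold of 20 matches the "5/20 = 0.25" test comment earlier in the thread, while the file-count threshold of 1000 is an assumption for illustration, not necessarily the shipped default:

// Sketch of a normalized IO overload score: the max of each L0 signal
// over its threshold. The sublevel threshold of 20 matches the
// "5/20 = 0.25" comment above; the file-count threshold of 1000 is an
// assumption for illustration.
package main

import "fmt"

func ioOverloadScore(l0Sublevels, l0Files int64) float64 {
	const (
		sublevelThreshold  = 20.0
		fileCountThreshold = 1000.0
	)
	sublevelScore := float64(l0Sublevels) / sublevelThreshold
	fileScore := float64(l0Files) / fileCountThreshold
	if fileScore > sublevelScore {
		return fileScore // the file count is the more overloaded signal
	}
	return sublevelScore
}

func main() {
	fmt.Println(ioOverloadScore(5, 100)) // 0.25: sublevels dominate
	fmt.Println(ioOverloadScore(5, 600)) // 0.6: file count dominates
}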
@@ -2607,9 +2607,9 @@ func newStoreMetrics(histogramWindow time.Duration) *StoreMetrics {
	// L0SublevelsMax. this is not exported to as metric.
	sm.l0SublevelsTracker.swag = slidingwindow.NewMaxSwag(
Should we get rid of this? It's only used for metrics, AFAICT, and is surfacing a value we're no longer using in production code.
The gossiped value is taken from this. It isn't exposed to metrics. We can remove it in 23.2.
Force-pushed 01ac8be to aae213b (Compare)
I've updated the commit message to be this, thanks! TYFTR
We previously checked stores' L0-sublevels to exclude IO overloaded stores from being allocation targets (cockroachdb#78608). This commit replaces the signal with the normalized IO overload score instead, which also factors in the L0-filecount. We started gossiping this value as of cockroachdb#83720. We continue gossiping L0-sublevels for mixed-version compatibility; we can stop doing this in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

Both were used to control rebalancing and upreplication behavior in the face of IO overloaded stores. This has now been replaced by other internal mechanisms.
Force-pushed aae213b to 4b11002 (Compare)
bors r=irfansharif
Build succeeded:
Previously, the allocator would transfer leases without considering a candidate's IO overload. When leases transferred to IO overloaded stores, service latency tended to degrade. This commit adds health checks prior to lease transfers. The health checks are similar to the IO overload checks for allocating replicas in cockroachdb#97142.

The checks work by comparing a candidate store against `kv.allocator.io_overload_threshold` and the mean of other candidates. If the candidate store is equal to or greater than both these values, it is considered IO overloaded. The current leaseholder has to meet a higher bar to be considered IO overloaded: it must have an IO overload score greater than or equal to `kv.allocator.io_overload_threshold` + `kv.allocator.io_overload_threshold_enforcement_leases`.

`kv.allocator.io_overload_threshold_enforcement_leases` controls the action taken when a candidate store for a lease transfer is IO overloaded:
- `block_none`: don't exclude stores.
- `block_none_log`: don't exclude stores, but log an event.
- `block`: exclude stores from being considered as leaseholder targets for a range if they exceed the threshold. The current leaseholder store will NOT be excluded as a candidate for its current range leases.
- `shed`: same behavior as `block`, except the current leaseholder store WILL be excluded as a candidate for its current range leases, i.e. the lease will always transfer to a healthy and valid store if one exists.

The default is `block` and a buffer value of `0.4`.

Resolves: cockroachdb#96508

Release note: None
Previously, the allocator would return lease transfer targets without considering the IO overload of the stores involved. When leases transferred to IO overloaded stores, service latency tended to degrade. This commit adds IO overload checks prior to lease transfers. The IO overload checks are similar to the IO overload checks for allocating replicas in cockroachdb#97142.

The checks work by comparing a candidate store against `kv.allocator.lease_io_overload_threshold` and the mean of other candidates. If the candidate store is equal to or greater than both these values, it is considered IO overloaded. The default value is 0.5. The current leaseholder has to meet a higher bar to be considered IO overloaded: it must have an IO overload score greater than or equal to `kv.allocator.lease_shed_io_overload_threshold`. The default value is 0.9.

`kv.allocator.lease_io_overload_threshold_enforcement` controls the action taken when a candidate store for a lease transfer is IO overloaded:
- `ignore`: ignore IO overload scores entirely during lease transfers (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't IO overloaded (existing leases on IO overloaded stores are left as is);
- `shed`: actively shed leases from IO overloaded stores to less IO overloaded stores (this is a super-set of `block_transfer_to`).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be prefixed with `Replica`, to avoid confusion between lease and replica IO overload checks.

Resolves: cockroachdb#96508

Release note (ops change): Range leases will no longer be transferred to stores which are IO overloaded.
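In code, the described rule is a two-part comparison plus a higher shed bar. A minimal sketch with hypothetical helper names (the thresholds are the defaults from the commit message; the surrounding allocator plumbing is assumed):

// Sketch of the lease-transfer IO overload checks described above.
// Function and parameter names are illustrative, not the PR's API.
package main

import "fmt"

// candidateIOOverloaded mirrors the described rule: a transfer target is
// excluded only if its score meets both the cluster-setting threshold
// and the mean score of the other candidates.
func candidateIOOverloaded(score, threshold, candidateMean float64) bool {
	return score >= threshold && score >= candidateMean
}

// shouldShedLease applies the higher bar for the current leaseholder,
// used only under the `shed` enforcement mode.
func shouldShedLease(leaseholderScore, shedThreshold float64) bool {
	return leaseholderScore >= shedThreshold
}

func main() {
	// Defaults from the commit message: 0.5 for transfers, 0.9 for shedding.
	fmt.Println(candidateIOOverloaded(0.6, 0.5, 0.4)) // true: exceeds both
	fmt.Println(candidateIOOverloaded(0.6, 0.5, 0.7)) // false: below the mean
	fmt.Println(shouldShedLease(0.95, 0.9))           // true: shed the leases
}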
97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without considering the IO overload of the stores involved. When leases transferred to IO overloaded stores, service latency tended to degrade. This commit adds IO overload checks prior to lease transfers. The IO overload checks are similar to the IO overload checks for allocating replicas in #97142.

The checks work by comparing a candidate store against `kv.allocator.lease_io_overload_threshold` and the mean of other candidates. If the candidate store is equal to or greater than both these values, it is considered IO overloaded. The default value is 0.5. The current leaseholder has to meet a higher bar to be considered IO overloaded: it must have an IO overload score greater than or equal to `kv.allocator.lease_shed_io_overload_threshold`. The default value is 0.9.

`kv.allocator.lease_io_overload_threshold_enforcement` controls the action taken when a candidate store for a lease transfer is IO overloaded:
- `ignore`: ignore IO overload scores entirely during lease transfers (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't IO overloaded (existing leases on IO overloaded stores are left as is);
- `shed`: actively shed leases from IO overloaded stores to less IO overloaded stores (this is a super-set of `block_transfer_to`).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be prefixed with `Replica`, to avoid confusion between lease and replica IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to stores which are IO overloaded.

98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic that extends the last flushed file in fileSSTSink does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, as the file path + start key combinations in the manifest are assumed to be unique.

Fixes #97953

Release note: None

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests, which ran a 2TB bank restore to benchmark restore performance across a few hardware configurations. It also replaces the `restoreTPCCInc/nodes=10` test, which tested our ability to handle a backup with a long chain.

This patch also adds:
1. `restore/tpce/400GB/aws/nodes=4/cpus=16`, to measure how per-node throughput scales when the per-node vcpu count doubles relative to default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8`, to measure how per-node throughput scales when the number of nodes doubles relative to default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8`, to measure restore reliability and performance on a 48-length backup chain relative to default.

A future patch will update the fixtures used in the restore node shutdown scripts, and add more perf-based tests.
Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
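As a side note on the 98041 fix: the described off-by-one is the classic "extend only when at least two files exist" mistake. A hypothetical reconstruction of its shape, not the actual fileSSTSink code:

// Hypothetical reconstruction of the 98041 off-by-one shape: extension
// of the last flushed file must also fire when exactly one file has
// been flushed. Not the actual fileSSTSink code.
package main

import "fmt"

type backupFile struct{ start, end string }

func extendLast(files []backupFile, newEnd string) []backupFile {
	// Buggy form: `if len(files) > 1` skips the single-file case, so a
	// later file could reuse the same start key in the manifest.
	if len(files) >= 1 { // fixed: extend even when only one file exists
		files[len(files)-1].end = newEnd
		return files
	}
	return append(files, backupFile{start: newEnd, end: newEnd})
}

func main() {
	files := []backupFile{{start: "a", end: "a"}}
	fmt.Println(extendLast(files, "c")) // [{a c}]: no duplicate start key
}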
We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.
Resolves: #85084
Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

Both were used to control rebalancing and upreplication behavior in the face of IO overloaded stores. This has now been replaced by other internal mechanisms.