kvserver: check L0 sub-levels on allocation #78608
Conversation
Force-pushed from 027c043 to c9cd413
Force-pushed from 00c06d4 to ec1084d
Very partial review (just came here to see the high-level picture).
I think there's an opportunity - somewhere folks will find it, so maybe on the cluster setting comment - to link to this issue, because not everyone may mentally connect "L0Sublevels" and "Read-Amp".
I also wonder if, in light of recent events, we need to consider widening the heuristic from "read amp" to generally high L0 file counts (if you have 500k files in L0, even with a read-amp of 5, things can be pretty bad). But I assume you have considered that; it would be nice to read about the rationale somewhere.
pkg/kv/kvserver/allocator_scorer.go (Outdated)
settings.SystemOnly,
"kv.allocator.l0_sublevels_threshold_enforce",
"the level of enforcement when a candidate disk has read amplification exceeding"+
"`kv.allocator.l0_sublevels_threshold`; disabled will take no log will take no action,"+
something is garbled here.
Updated to be clearer.
// excluded when a candidate disk is considered unhealthy.
type storeHealthEnforcement int64

const (
Generally the verbiage here and on the const declarations is a bit ambiguous: what are "actions to be included or excluded"? Maybe phrase it in terms of how many safeguards are applied. Off = apply no safeguards, log = apply no safeguards but log when safeguards might make sense, rebalanceOnly = apply safeguards when picking a rebalance target, allocate = apply during rebalancing and ... I'm not exactly sure what the other cases are. If a store goes unhealthy, would we be proactively moving replicas off of it? I don't think so, but then I'm not sure what you mean. Is this referring to pure up/down-replication? Could be clearer.
> Generally the verbiage here and on the const declarations is a bit ambiguous: what are "actions to be included or excluded"? Maybe phrase it in terms of how many safeguards are applied. Off = apply no safeguards, log = apply no safeguards but log when safeguards might make sense, rebalanceOnly = apply safeguards when picking a rebalance target, allocate = apply during rebalancing

You're right. I've updated this to be clearer about the action taken at each level of enforcement. It should now make clear the distinction between excluding candidates as rebalance targets vs. excluding candidates as targets entirely (for both allocation and rebalancing).

> I'm not exactly sure what the other cases are. If a store goes unhealthy, would we be proactively moving replicas off of it? I don't think so,

You're right: this patch explicitly avoids removing replicas based on the read amplification of the store. I've added a comment where this is done to make it explicit.
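For readers skimming the thread, here is a minimal sketch of the four enforcement levels in the spirit of the clarified comments. The identifier names and comments are illustrative assumptions, not necessarily those used in the patch:

```go
// A sketch of the enforcement levels discussed above, phrased in terms of
// safeguards. Names are hypothetical.
type storeHealthEnforcement int64

const (
	// storeHealthNoAction: apply no safeguards; the health signal is ignored.
	storeHealthNoAction storeHealthEnforcement = iota
	// storeHealthLogOnly: apply no safeguards, but log when a safeguard
	// would have excluded a candidate store.
	storeHealthLogOnly
	// storeHealthBlockRebalanceTo: exclude unhealthy stores only when
	// picking rebalance targets.
	storeHealthBlockRebalanceTo
	// storeHealthBlockAll: exclude unhealthy stores as targets of both
	// rebalancing and allocation. Note that no level proactively sheds
	// replicas off an unhealthy store.
	storeHealthBlockAll
)
```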
// maxL0SublevelThreshold: if the number of l0 sublevels of a store descriptor
// is greater than this value, it will never be used as a rebalance target.
maxL0SublevelThreshold = 20
I think it says 10 on the PR release note, just a heads up to make sure the numbers match at the end.
I believe this is 20 in the release note. I've updated this to be clearer and reworked the language.
Force-pushed from ec1084d to e83ad40
bors r-
I mostly agree that the PR in its current incantation (i.e. with …
Well, yes, but just because this PR doesn't yet let …
I think we want it to be log-only for now, while we do more testing. Assuming sufficient testing, avoiding sending more work to already overloaded nodes doesn't seem like a very risky notion. It's a bit hard to discuss the pros and cons of an algorithm that wasn't designed or implemented (shedding replicas and moving them back based on read amp). Of course, thrashing would need to be avoided by such a design. But it still seems worthwhile doing, see e.g. https://cockroachlabs.slack.com/archives/CJCR55PUG/p1648781917543139?thread_ts=1648744729.730979&cid=CJCR55PUG
It is designed and implemented in this PR, just statically disabled, so I don't think it's hard to discuss the pros and cons here.
For sure, it is a problem that needs to be solved. I'm just questioning using our current decentralized allocator to do it.
Build failed (retrying...):
Build failed (retrying...):
Build failed:
bors r-
It won't shed them in this PR, even with the highest enforcement setting: kvoli/cockroach@a1f4ca6/pkg/kv/kvserver/allocator_test.go#L4346-L4367. Unless I'm missing something? No matter what the setting, it won't shed replicas due to L0 sub-levels - that was what we agreed from the beginning. Unrelated, but I'm going to remove the cluster version gate and target the version after L0 sub-levels started to be gossiped - that way it can backport to 22.1.1 or 22.1.2.
Force-pushed from 9bab3e7 to 945de32
Previously, the only store health signal used as a hard allocation and rebalancing constraint was disk capacity. This patch introduces L0 sub-levels as an additional constraint, to avoid allocation and rebalancing of replicas to stores which are unhealthy, indicated by a high number of L0 sub-levels.

A store's sub-level count must exceed both (1) the threshold and (2) the cluster mean in order to be considered unhealthy. The mean check ensures that a cluster full of moderately high read amplification stores is not unable to make progress, whilst still ensuring that positively skewed distributions exclude the positive tail.

A simulation of the effect on candidate exclusion under different L0 sub-level distributions, using the mean as an additional check vs. percentiles, can be found here: https://gist.github.com/kvoli/be27efd4662e89e8918430a9c7117858

The threshold corresponds to the cluster setting `kv.allocator.l0_sublevels_threshold`: the number of L0 sub-levels above which a candidate store will potentially be excluded as a target for rebalancing, or for both rebalancing and allocation of replicas.

Enforcement of this threshold can be applied at 4 different levels of strictness, configured by the cluster setting `kv.allocator.l0_sublevels_threshold_enforce`. The 4 levels are:

`block_none`: L0 sub-levels are ignored entirely.
`block_none_log`: L0 sub-levels are logged if the threshold is exceeded. Both states below also log as above.
`block_rebalance_to`: L0 sub-levels are considered when excluding stores as rebalance targets.
`block_all`: L0 sub-levels are considered when excluding stores as both rebalance and allocation targets.

By default, `kv.allocator.l0_sublevels_threshold` is `20`, which corresponds to admission control's threshold, above which it begins limiting admission of work to a store based on store health. The default enforcement level of `kv.allocator.l0_sublevels_threshold_enforce` is `block_none_log`.

resolves cockroachdb#73714

Release justification: low risk, high benefit during high read amplification scenarios, where an operator may limit rebalancing to high read amplification stores, to stop fueling the flame.

Release note (ops change): introduce cluster settings `kv.allocator.l0_sublevels_threshold` and `kv.allocator.l0_sublevels_threshold_enforce`, which enable excluding stores as targets for allocation and rebalancing of replicas when they have high read amplification, indicated by the number of L0 sub-levels in the store's LSM. When both `kv.allocator.l0_sublevels_threshold` and the cluster average are exceeded, the action corresponding to `kv.allocator.l0_sublevels_threshold_enforce` is taken. `block_none` will exclude no candidate stores, `block_none_log` will exclude no candidates but log an event, `block_rebalance_to` will exclude candidate stores from being targets of rebalance actions, and `block_all` will exclude candidate stores from being targets of both allocation and rebalancing. Default: `kv.allocator.l0_sublevels_threshold` is set to `20` and `kv.allocator.l0_sublevels_threshold_enforce` is set to `block_none_log`.
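As a rough sketch of the dual check described in the message above (the function and parameter names are hypothetical, and the final commit below uses a watermark slightly above the mean rather than the raw mean):

```go
// storeIsUnhealthy sketches the dual check: a store is excluded as a
// candidate only when its L0 sub-level count exceeds both the configured
// threshold and the cluster mean (or a watermark slightly above it).
// All names here are hypothetical.
func storeIsUnhealthy(l0Sublevels, threshold int64, meanWatermark float64) bool {
	exceedsThreshold := l0Sublevels > threshold
	exceedsMean := float64(l0Sublevels) > meanWatermark
	// Both conditions must hold: the mean check keeps a uniformly loaded
	// cluster from excluding every candidate, while the threshold check
	// ignores skew when absolute read amplification is still low.
	return exceedsThreshold && exceedsMean
}
```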
Force-pushed from 945de32 to 83deaaa
bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 83deaaa to blathers/backport-release-22.1-78608: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
We previously checked stores' L0-sublevels to exclude IO-overloaded stores from being allocation targets (cockroachdb#78608). This commit replaces that signal with the normalized IO overload score instead, which also factors in the L0 file count. We started gossiping this value as of cockroachdb#83720. We continue gossiping L0-sublevels for mixed-version compatibility; we can stop doing this in 23.2.

Resolves: cockroachdb#85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce
The pair of them were used to control rebalancing and upreplication behavior in the face of IO-overloaded stores. This has now been replaced by other internal mechanisms.
96914: sql: add spaces to delimiter in uniqueness violation error r=rharding6373 a=michae2

Whenever a mutation statement violates a uniqueness constraint, we return a specific error which shows the affected row. This error is formatted to match the corresponding error in Postgres. We create this error in two places. In one place (`pkg/sql/opt/exec/execbuilder/mutation.go`) we were using ", " as the delimiter between values, which matches Postgres. But in the other place (`pkg/sql/row/errors.go`) we were using "," (without a space). This patch adds the space.

Epic: None
Release note (bug fix): Fix formatting of uniqueness violation errors to match the corresponding errors from PostgreSQL.

97065: schemachanger: Support dropping index cascade with dependent inbound FK r=healthy-pod a=Xiang-Gu

Dropping a unique index cascade will need to drop any inbound FK constraint if this index is the only uniqueness provider. This commit enables the declarative schema changer to support this behavior.

Fixes: #96731
Epic: None

97142: allocator: replace read amp with io thresh r=irfansharif a=kvoli

We previously checked stores' L0-sublevels to exclude IO-overloaded stores from being allocation targets (#78608). This commit replaces that signal with the normalized IO overload score instead, which also factors in the L0 file count. We started gossiping this value as of #83720. We continue gossiping L0-sublevels for mixed-version compatibility; we can stop doing this in 23.2.

Resolves: #85084
Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce
The pair of them were used to control rebalancing and upreplication behavior in the face of IO-overloaded stores. This has now been replaced by other internal mechanisms.

97495: roachprod: run scheduled backup init without timeout r=renatolabs a=msbutler

Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests. To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with an infinite timeout and only on the first node in the cluster.

Fixes #97010, #97232
Release note: None
Epic: none

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Previously, the only store health signal used as a hard allocation and rebalancing constraint was disk capacity. This patch introduces L0 sub-levels as an additional constraint, to avoid allocation and rebalancing of replicas to stores which are unhealthy, indicated by a high number of L0 sub-levels.

A store's sub-level count must exceed both (1) the threshold and (2) the cluster mean watermark (10% above the mean) in order to be considered unhealthy. The mean check ensures that a cluster full of moderately high read amplification stores is not unable to make progress, whilst still ensuring that positively skewed distributions exclude the positive tail.

A simulation of the effect on candidate exclusion under different L0 sub-level distributions, using the mean as an additional check vs. percentiles, can be found here: https://gist.github.com/kvoli/be27efd4662e89e8918430a9c7117858

The L0 sub-level value for each store descriptor is smoothed to be the maximum over a 5 minute window.

The threshold corresponds to the cluster setting `kv.allocator.l0_sublevels_threshold`: the number of L0 sub-levels above which a candidate store will potentially be excluded as a target for rebalancing, or for both rebalancing and allocation of replicas.

Enforcement of this threshold can be applied at 4 different levels of strictness, configured by the cluster setting `kv.allocator.l0_sublevels_threshold_enforce`. The 4 levels are:

`block_none`: L0 sub-levels are ignored entirely.
`block_none_log`: L0 sub-levels are logged if the threshold is exceeded. Both states below also log as above.
`block_rebalance_to`: L0 sub-levels are considered when excluding stores as rebalance targets.
`block_all`: L0 sub-levels are considered when excluding stores as both rebalance and allocation targets.

By default, `kv.allocator.l0_sublevels_threshold` is `20`, which corresponds to admission control's threshold, above which it begins limiting admission of work to a store based on store health. The default enforcement level of `kv.allocator.l0_sublevels_threshold_enforce` is `block_none_log`.

resolves #73714

Release justification: low risk, high benefit during high read amplification scenarios, where an operator may limit rebalancing to high read amplification stores, to stop fueling the flame.

Release note (ops change): introduce cluster settings `kv.allocator.l0_sublevels_threshold` and `kv.allocator.l0_sublevels_threshold_enforce`, which enable excluding stores as targets for allocation and rebalancing of replicas when they have high read amplification, indicated by the number of L0 sub-levels in the store's LSM. When both `kv.allocator.l0_sublevels_threshold` and the cluster average are exceeded, the action corresponding to `kv.allocator.l0_sublevels_threshold_enforce` is taken. `block_none` will exclude no candidate stores, `block_none_log` will exclude no candidates but log an event, `block_rebalance_to` will exclude candidate stores from being targets of rebalance actions, and `block_all` will exclude candidate stores from being targets of both allocation and rebalancing. Default: `kv.allocator.l0_sublevels_threshold` is set to `20` and `kv.allocator.l0_sublevels_threshold_enforce` is set to `block_none_log`.
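The commit message notes that the per-store L0 sub-level value is smoothed to the maximum over a 5 minute window. A minimal sketch of that kind of smoothing follows, assuming timestamped samples are retained; all names are hypothetical, and the real mechanism rides on the store descriptor gossip machinery:

```go
package main

import (
	"fmt"
	"time"
)

// windowedMax retains timestamped samples and reports the maximum over a
// trailing window. This only illustrates the "maximum over a 5 minute
// window" smoothing described above.
type windowedMax struct {
	window  time.Duration
	samples []sample
}

type sample struct {
	t time.Time
	v int64
}

// record adds a sample and drops samples older than the window.
func (w *windowedMax) record(now time.Time, v int64) {
	w.samples = append(w.samples, sample{t: now, v: v})
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.samples) && w.samples[i].t.Before(cutoff) {
		i++
	}
	w.samples = w.samples[i:]
}

// max returns the largest sample still inside the window.
func (w *windowedMax) max() int64 {
	var m int64
	for _, s := range w.samples {
		if s.v > m {
			m = s.v
		}
	}
	return m
}

func main() {
	w := windowedMax{window: 5 * time.Minute}
	now := time.Now()
	// A transient spike to 25 sub-levels keeps the smoothed value at 25
	// until the spike ages out of the 5 minute window.
	w.record(now, 25)
	w.record(now.Add(time.Minute), 5)
	fmt.Println(w.max()) // prints 25
}
```

This matters for the allocator because a store whose read amplification briefly dips is still treated as unhealthy until the window expires, which damps thrashing between include/exclude decisions.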