kvserver: lower priority level for mvcc gc work #85823
Conversation
Force-pushed from 3712eea to 09a21bf.
@@ -115,6 +116,20 @@ var MaxIntentKeyBytesPerCleanupBatch = settings.RegisterIntSetting(
	settings.NonNegativeInt,
)

// AdmissionPriority determines the admission priority level to use for MVCC GC
// work.
var AdmissionPriority = settings.RegisterEnumSetting(
nit: I don't think a setting is necessary, but also have no objections to adding one.
Should we follow this up with a change that makes mvccGCQueueTimerDuration a cluster setting? We have precedent for that with kv.range_merge.queue_interval.
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @lunevalex, @sumeerbhola, and @tbg)
-- commits
line 16 at r1:
I suspect that we will eventually need additional pacing in the keyspace iteration that the MVCC GC queue performs (e.g. processReplicatedKeyRange) to compute which keys to GC. In cases where many keys are being garbage collected, it is probably the case that we will issue batches of GCRequests frequently enough for this change to serve as a form of pacing. However, for MVCC GC queue passes where few keys are actually GCed, this probably won't be enough. In those cases, we'll loop quickly through the entire range with no intermittent GC requests, which you've demonstrated elsewhere to be a CPU hog with adverse effects on goroutine scheduling latency.
MVCC GC work is CPU intensive
I think it's also worth considering that a future revision of MVCC GC might do away with most of the CPU intensive work. In such a world, the synchronous keyspace iteration might be the only work left in the MVCC GC queue, assuming we want accurate stats even with delayed GC during compaction.
That doesn't change anything in this PR, but it might shift how we think about the solution space around disruptive MVCC GC going forward.
pkg/kv/kvserver/gc/gc.go
line 121 at r1 (raw file):
Previously, lunevalex wrote…
nit: I don't think a setting is necessary, but also have no objections to adding one.
I think it's a good idea to add one, given the potential for a negative feedback loop with starved MVCC GC and foreground traffic whose efficiency depends on timely MVCC GC. We might as well build the tool that we would need to combat this if it ever comes up in a support escalation.
pkg/kv/kvserver/gc/gc.go
line 126 at r1 (raw file):
"the admission priority to use for mvcc gc work", "low_pri", map[int64]string{
Not for now, but I wouldn't be surprised if we follow this pattern in other places and decide to package it up as a known enum in the settings package.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @lunevalex, @sumeerbhola, and @tbg)
pkg/kv/kvserver/gc/gc.go
line 127 at r1 (raw file):
"low_pri", map[int64]string{ int64(admissionpb.LowPri): "low_pri",
Actually, are these the correct priority levels, and is this the correct default? We use BulkNormalPri for all other bulk jobs. LowPri doesn't seem to be used outside of the admission package.
Force-pushed from 09a21bf to cec0265.
Should we follow this up with a change that makes mvccGCQueueTimerDuration a cluster setting?
Sure, just added a commit here. Will merge on green.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex, @nvanbenschoten, @sumeerbhola, and @tbg)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I suspect that we will eventually need additional pacing in the keyspace iteration that the MVCC GC queue performs (e.g. processReplicatedKeyRange) to compute which keys to GC. In cases where many keys are being garbage collected, it is probably the case that we will issue batches of GCRequests frequently enough for this change to serve as a form of pacing. However, for MVCC GC queue passes where few keys are actually GCed, this probably won't be enough. In those cases, we'll loop quickly through the entire range with no intermittent GC requests, which you've demonstrated elsewhere to be a CPU hog with adverse effects on goroutine scheduling latency.
Agreed, more experimentation here will tell us. Added a sentence below.
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
MVCC GC work is CPU intensive
I think it's also worth considering that a future revision of MVCC GC might do away with most of the CPU intensive work. In such a world, the synchronous keyspace iteration might be the only work left in the MVCC GC queue, assuming we want accurate stats even with delayed GC during compaction.
That doesn't change anything in this PR, but it might shift how we think about the solution space around disruptive MVCC GC going forward.
Doing MVCC GC as part of compactions makes a lot of sense (I need to periodically sift through X-noremind PRs to dig up such things). Added a sentence.
pkg/kv/kvserver/gc/gc.go
line 121 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it's a good idea to add one, given the potential for a negative feedback loop with starved MVCC GC and foreground traffic whose efficiency depends on timely MVCC GC. We might as well build the tool that we would need to combat this if it ever comes up in a support escalation.
Added it for that same reason, especially given it's something we're looking to backport to 22.1. We should want this knob to go away though after more experimentation work, if that's what you're getting at Alex.
pkg/kv/kvserver/gc/gc.go
line 127 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Actually, are these the correct priority levels, and is this the correct default? We use BulkNormalPri for all other bulk jobs. LowPri doesn't seem to be used outside of the admission package.
LowPri was mentioned in an earlier revision of a comment, but I think BulkNormalPri wasn't a thing then. Changed, and removed LowPri. The important thing is to have something lower than NormalPri.
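For concreteness, a minimal sketch of the registration after switching the default to BulkNormalPri might look as follows; the setting name kv.gc.admission_priority and the exact set of selectable levels are illustrative assumptions, not a quote of the final diff.

package gc

import (
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/util/admission/admissionpb"
)

// AdmissionPriority determines the admission priority level to use for MVCC GC
// work. Defaulting to BulkNormalPri keeps GC work below NormalPri foreground
// traffic, per the review discussion. (Sketch only: the setting name and the
// exact enum values here are assumptions, not the verbatim final code.)
var AdmissionPriority = settings.RegisterEnumSetting(
	settings.SystemOnly,
	"kv.gc.admission_priority", // assumed name; not shown in this excerpt
	"the admission priority to use for mvcc gc work",
	"bulk_normal_pri",
	map[int64]string{
		int64(admissionpb.BulkNormalPri): "bulk_normal_pri",
		int64(admissionpb.NormalPri):     "normal_pri",
		int64(admissionpb.HighPri):       "high_pri", // HighPri vs. UserHighPri is questioned below
	},
)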
Force-pushed from cec0265 to 4d0d98b.
Reviewed 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @sumeerbhola, and @tbg)
pkg/kv/kvserver/mvcc_gc_queue.go
line 72 at r3 (raw file):
var mvccGCQueueInterval = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"kv.mvcc_gc_merge.queue_interval",
s/kv.mvcc_gc_merge.queue_interval/kv.mvcc_gc.queue_interval/
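For reference, a minimal sketch of the registration with the suggested rename applied (assuming it sits in mvcc_gc_queue.go next to the existing settings and time imports); the description string and the NonNegativeDuration validator are illustrative assumptions, while the 1s default comes from the commit message quoted later in this thread.

// Sketch only: rename applied per the comment above; the description and
// validator are assumed, not verbatim.
var mvccGCQueueInterval = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"kv.mvcc_gc.queue_interval", // was "kv.mvcc_gc_merge.queue_interval"
	"how long the mvcc gc queue waits between processing replicas",
	1*time.Second, // previously a hardcoded 1s; that value remains the default
	settings.NonNegativeDuration,
)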
pkg/kv/kvserver/gc/gc.go
line 127 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
LowPri was mentioned in an earlier revision of a comment, but I think BulkNormalPri wasn't a thing then. Changed, and removed LowPri. The important thing is to have something lower than NormalPri.
Should HighPri also be changed to UserHighPri?
Force-pushed from 4d0d98b to 7119010.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/mvcc_gc_queue.go
line 72 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/kv.mvcc_gc_merge.queue_interval/kv.mvcc_gc.queue_interval/
🤦
Reviewed 1 of 3 files at r1, 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/mvcc_gc_queue.go
line 465 at r4 (raw file):
// BulkNormalPri.
//
// After we implement dynamic priority adjustment, it's not clear
It is good to have this knob to start with. But I hope we never have to tune it, or do dynamic priority adjustment in the code (it will just make things harder to understand). If someone has not provisioned well enough for GC to function, then it is their problem to add more resources.
Build failed (retrying...)
bors r-
Canceled.
Cluster setting that controls how long the MVCC GC queue waits between
processing replicas. It was previously hardcoded to 1s (which is the
default value), but changing it would've come in handy in support
incidents as a form of manual pacing of MVCC GC work (we have a similar
useful knob for the merge queue).

Release note: None
Force-pushed from 7119010 to f00b130.
bors r+
🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from d5e7e0a to blathers/backport-release-22.1-85823: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
GC could be expected to be LowPri, so that it does not impact
user-facing traffic when resources (e.g. CPU, write capacity of the
store) are scarce. However, long delays in GC can slow down user-facing
traffic due to more versions in the store, and can increase write
amplification of the store since there is more live data. Ideally, we
should adjust this priority based on how far behind we are with respect
to GC-ing a particular range. Keeping the priority level static at
NormalPri proved disruptive when a large volume of MVCC GC work is
suddenly accrued (for example, if an old protected timestamp record was
just released following a long-paused backup job being
completed/canceled, or an old, long-running backup job finishing).
After dynamic priority adjustment, it's not yet clear whether we need
additional pacing mechanisms to provide better latency isolation,
similar to ongoing work for backups. MVCC GC work is CPU intensive:
#82955. This patch is also speculative in nature and in response to
observed incidents where NormalPri proved too disruptive. Fuller
treatment would entail working off of reliable reproductions of this
behaviour.
We also added a cluster setting (kv.mvcc_gc.queue_interval)
that controls how long the MVCC GC queue waits between
processing replicas. It was previously hardcoded to 1s (which is the
default value), but changing it would've come in handy in support
incidents as a form of manual pacing of MVCC GC work (we have a similar
useful knob for the merge queue).
Release note (performance improvement): Previously, if there was a
sudden increase in the volume of pending MVCC GC work, there was an
impact on foreground latencies. These sudden increases commonly occurred
when:
- gc.ttlseconds was reduced dramatically over tables/indexes that accrue
  a lot of MVCC garbage (think "rows being frequently deleted")
- a paused backup job from a while ago (think > 1 day) was
  canceled/failed
- a backup job that started a while ago (think > 1 day) just finished
An indicator of a large increase in the volume of pending MVCC GC work
is a steep climb in the "GC Queue" graph in the DB Console (navigate to
'Metrics' and select the 'Queues' dashboard). With this patch, the
effect on foreground latencies as a result of this sudden build-up
should be reduced.
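As a usage sketch for the pacing knob described above, kv.mvcc_gc.queue_interval can be adjusted at runtime with a regular SET CLUSTER SETTING statement over any SQL connection. The Go snippet below is hypothetical: the connection string and the 5s value are placeholders, not recommendations.

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string; point this at your own cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Wait 5s between processing replicas (manual pacing of MVCC GC work);
	// setting it back to '1s' restores the default mentioned above.
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.mvcc_gc.queue_interval = '5s'`,
	); err != nil {
		log.Fatal(err)
	}
}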