admission,kv,bulk: unify (local) store overload protection via admission control #75066
The store write admission control path now uses a StoreWorkQueue which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest. This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so token consumption can be adjusted.

The main framework change is that a single work item can consume multiple (byte) tokens, which ripples through the various interfaces including requester, granter. There is associated cleanup: kvGranter, which was handling both slots and tokens, is eliminated since in practice it was only doing one or the other. Instead, for the slot case the slotGranter is reused, and for the token case the kvStoreTokenGranter is created.

The main logic change is in ioLoadListener, which computes byte tokens and various estimates.

There are TODOs to fix tests that will fail.

Informs cockroachdb#75066

Release note: None
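To make the shape of this concrete, here is a minimal sketch of the token-based store granter described above (hypothetical type and field names, not the actual pkg/util/admission code): admitted work declares an up-front byte estimate, and reports the bytes that actually landed in L0 when done so the granter can true up its accounting.

```go
package admission

// storeWriteWork is a sketch of the per-request metadata the store work
// queue needs: an up-front byte estimate and whether this is an ingest.
// (Hypothetical names, for illustration only.)
type storeWriteWork struct {
	writeBytes int64 // estimated bytes this work will write
	ingest     bool  // sstable ingestion vs. a normal WAL batch write
}

// storeTokenGranter hands out byte tokens. Unlike the slot-based CPU path,
// a single work item can consume many tokens.
type storeTokenGranter struct {
	availableTokens int64
}

// tryGet consumes tokens at admission time, based on the estimate.
func (g *storeTokenGranter) tryGet(tokens int64) bool {
	if g.availableTokens < tokens {
		return false
	}
	g.availableTokens -= tokens
	return true
}

// workDone reconciles the estimate against the bytes that actually landed in
// L0, returning tokens if we over-charged and debiting if we under-charged.
func (g *storeTokenGranter) workDone(estimatedBytes, actualL0Bytes int64) {
	g.availableTokens += estimatedBytes - actualL0Bytes
}
```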
I've created a WIP PR for the admission control changes, for early opinions: #75120 (it lacks tests and I need to fix existing tests, hence WIP). The most familiar reviewers for the admission control code are @RaduBerinde and @ajwerner, but (a) it may be premature for them to review the PR while we aren't settled on the wider approach, and (b) if we go ahead with this we should probably build some expertise with that code base in KV and Storage, since those are the teams who will be diagnosing how this behaves in production settings.
Overall, I'm very supportive of this proposal -- I don't see how we can properly avoid overload while maximizing throughput without having all work pass through a common throttling mechanism. As you point out, we'll likely need prioritization here too.
There can be a significant time delay between a request being admitted in …

Compression complicates this. Most request payloads (e.g. …

This could actually come in handy for …
It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking? We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.
The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (cockroach/pkg/util/admission/granter.go, line 1590 at bed5b79) for `AddSSTableRequest`. If there are going to be other heavy-weight write requests (`GCRequest`?) then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) in `MVCCGarbageCollect` make a tighter guess on the size by applying key prefix compression (and use this to return tokens post-work).
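As a rough illustration of option (b), a tighter guess for a GC-style request could discount shared key prefixes when summing key sizes instead of charging full key bytes (hypothetical helper, not the actual MVCCGarbageCollect code):

```go
// estimateGCRequestBytes sums key sizes while discounting the prefix each key
// shares with its predecessor, roughly approximating the prefix compression an
// sstable block would apply. Hypothetical helper, for illustration only;
// assumes keys are sorted so consecutive keys share prefixes.
func estimateGCRequestBytes(keys [][]byte) int64 {
	var total int64
	var prev []byte
	for _, k := range keys {
		total += int64(len(k) - sharedPrefixLen(prev, k))
		prev = k
	}
	return total
}

func sharedPrefixLen(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	return i
}
```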
Latching/locking is a major one, and in the case of long-running transactions these can be blocked for a substantial amount of time (e.g. minutes). If admission control does not take this into account it may be vulnerable to thundering herds. But there are also others, e.g. …
Yeah, I've been wondering if we might want a scheme which checks for any throttling right before or after latches have been acquired -- if throttled, the request would release its latches (if acquired) and go back to wait for resources and latches again. Of course, this could be vulnerable to starvation (depending on the queue policy), but it would avoid thundering herds as well as throttling below latches.
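A minimal sketch of that idea (hypothetical names, not an existing code path): the request re-checks admission right after acquiring latches and, if it would be throttled, drops the latches and re-queues rather than blocking conflicting work while it waits.

```go
package kvserver // sketch only; all identifiers below are hypothetical stand-ins

import "context"

type Request struct{}
type Response struct{}
type latchGuard struct{}

func (latchGuard) Release() {}

func acquireLatches(ctx context.Context, r Request) (latchGuard, error) { return latchGuard{}, nil }
func tryAdmit(ctx context.Context, r Request) bool                      { return true }
func waitForAdmission(ctx context.Context, r Request) error             { return nil }
func evaluate(ctx context.Context, r Request) (Response, error)         { return Response{}, nil }

// executeWithLatchRetry sketches the idea above: check admission after latches
// are acquired; if throttled, release the latches and go back to waiting, so we
// never sit in an admission queue while holding latches.
func executeWithLatchRetry(ctx context.Context, req Request) (Response, error) {
	for {
		lat, err := acquireLatches(ctx, req)
		if err != nil {
			return Response{}, err
		}
		if !tryAdmit(ctx, req) {
			lat.Release() // don't block conflicting work while we wait
			if err := waitForAdmission(ctx, req); err != nil {
				return Response{}, err
			}
			continue // re-acquire latches and re-check admission
		}
		res, err := evaluate(ctx, req)
		lat.Release()
		return res, err
	}
}
```

As noted, the queue policy would need to guard against starvation of a request that repeatedly loses its latches.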
I don't think we necessarily need to do anything about this now, but we should ideally have a common, comparable estimate of the amount of storage work required for a given request.
Nice -- we'll know the concurrency, so we can adjust for it. However, these requests will be split into a CPU-bound evaluation phase (proposer only) and an IO-bound replication/application phase. I believe this is generally the case for most other requests too, as we only replicate/apply the resulting Pebble batch which is IO bound. Does the work queue release the slot after evaluation, or only after the request returns? CC @dt.
It returns the slot after evaluation.
The discussion on https://github.com/cockroachlabs/support/issues/1374 overlaps with the one here -- copy-pasting some stuff here about below-raft throttling, which seems to be generally necessary (due to the high rate of applying raft entries when a node restarts and catches up via the log). There is a debate about whether admission control should be involved.

(@erikgrinaker let's continue that discussion on this issue)
Do we think that a request is the unit of work we want to choose to process now vs later? I wonder if requests like … Like say we get an … Do we want to try to somehow let that …
Yeah, this does make sense when considering multiple ranges and stores, and work priorities between them. We'll need to avoid double-counting above/below Raft though. Do you have any thoughts on cross-node admission control? I.e. could it replace the Raft quota pool with knowledge about followers' store health and Raft log state, or would we need to combine/augment them?
We should try to be smarter than that. We have a similar problem with range load metrics, where we currently consider QPS to be the rate of …

We now have multiple differing measures of "work" (QPS and WPS, RUs, admission control tokens), which could get confusing and hard to reason about. We should try to harmonize these somehow -- even though e.g. QPS has the luxury of being after-the-fact and can rely on actual measurements.
I have no objection to being smarter :), provided (a) we find that we need to be, based on experiments, and (b) we can figure out something effective.
We may want to include disk usage protection here as well, to avoid running the node out of disk with e.g. index backfills. Wrote up a proposal in #79210.
What is remaining in scope for this issue after #79092 (comment) is addressed (i.e. at which point bulk requests are "just" throttled in admission control and nowhere else)? I am trying to clean up the many overlapping conversations we are having.

My understanding is that this issue only deals with unifying bulk requests with local admission control (i.e. removing any special casing outside of admission control for these requests). The other big issue we have is #79215, for which I just completed cleaning up the initial post and breaking out sub-issues; that issue tracks the short-term (i.e. this release) plan for dealing with appends and snapshots. Then, there is also #79755, which is roughly about how to do "distributed admission control", i.e. taking follower health into account in a less ad-hoc way; this is for now unplanned (past what's required as part of #79755).

In my understanding this all fits together, but if what I think this issue is about isn't correct, please correct me.
This issue has a lot of really good discussion, and there were a lot of follow-on issues filed for specific takeaways. I've read through it a few more times to make sure nothing fell through the cracks.
#86638 addresses the backup case using a form of cooperative scheduling, bounding how much work a single export request can do.
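The shape of that cooperative scheduling is roughly as follows (a sketch with hypothetical names, not the actual ExportRequest code; wall time stands in for the on-CPU time that elastic CPU tokens actually meter): the request periodically checks whether it has used up its grant and, if so, stops early and returns a resume point instead of running to completion.

```go
package bulk // sketch only; hypothetical identifiers throughout

import "time"

func exportKey(k []byte) {} // stand-in for the per-key export work

// exportSome sketches cooperative scheduling for a long-running export: the
// per-request work is bounded by an elastic CPU grant, and the request
// returns a resume key when the grant is exhausted so the caller can
// re-admit and continue later.
func exportSome(keys [][]byte, grant time.Duration) (exported int, resumeKey []byte) {
	start := time.Now()
	for i, k := range keys {
		exportKey(k)
		// Check the budget periodically rather than on every key.
		if i%128 == 127 && time.Since(start) > grant && i+1 < len(keys) {
			return i + 1, keys[i+1] // stop early; resume from the next key
		}
	}
	return len(keys), nil
}
```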
The detail around AddSST key-rewriting post MVCC-ification, and it being CPU-intensive, is also something that can integrate with the elastic CPU tokens machinery introduced above. There are other factors around disk use noted above and in the relevant issues (a bunch of which are being collected in https://github.com/orgs/cockroachdb/projects/32/views/1).

The next set of things (till mid-Jan, after 22.2 stability and EOY holidays) we're planning to look at are focused on index backfills, which will include taking a better look at AddSSTs. We're not looking at using AC for disk storage control soon, or for snapshots (the hope is that cockroachdb/pebble#1683, which is being worked on in 23.1, makes it less of a problem). The outstanding follower-writes throttling is being tracked on the Repl side for now. Throttling of the replica/MVCC GC queue is not in near-term scope (and is perhaps something #42514 can push further down the line). All the other discussion around replica load is interesting and can be continued elsewhere.

Closing this issue. We should continue removing throttling knobs, keeping them around just for escalations. Let's file specific issues for specific ones if we've missed any.
83020: stmtdiagnostics: support continuous bundle collection r=irfansharif a=irfansharif

..until expiry. Informs #82896 (more specifically, this is a short-term alternative to the part pertaining to continuous tail capture). The issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected over time when investigating tail latencies. #82750 introduced a probabilistic mechanism to capture a single tail event for individual stmts with bounded overhead (determined by the sampling probability, trading off how long until a single capture is obtained). This PR introduces `sql.stmt_diagnostics.collect_continuously_until_expired` to collect captures continuously over some period of time for aggregate analysis.

Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over some period of time;
- Eviction of older bundles, assuming they're less relevant, making room for newer captures.

To safeguard against misuse (in this current form we should only use it for experiments or escalations under controlled environments), we only act on this setting provided the diagnostics request has an expiration timestamp and a specified probability, crude measures to prevent unbounded growth.

---

To get some idea of how this can be used, consider the kinds of experiments we're running as part of #75066. Specifically we have a reproduction where we can observe spikes in latencies for foreground traffic in the presence of concurrent backups (incremental/full). In an experiment with incremental backups running every 10m and full backups running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`), we observe latency spikes during overlap periods. With this cluster setting we were able to set up trace captures over a 10h window to get a set of representative outlier traces to investigate further.

```sql
SELECT crdb_internal.request_statement_bundle(
  'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
  0.05,                                  -- 5% sampling probability
  '30ms'::INTERVAL,                      -- 30ms target (p99.9)
  '10h'::INTERVAL                        -- capture window
);

WITH histogram AS (
  SELECT extract('minute', collected_at) AS minute, count(*)
  FROM system.statement_diagnostics
  GROUP BY minute
)
SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
FROM histogram
ORDER BY count DESC
LIMIT 10;

  minute |             freq
---------+---------------------------------
  36     | ******************************
  38     | *********************
  35     | *********************
  00     | *********************
  37     | ********************
  30     | ********************
  40     | *****************
  20     | **************
  10     | *************
  50     | ***********
(10 rows)
```

We see that we captured just the set of bundles/traces we were interested in.

Release note: None

86591: kvserver: sync checksum computation with long poll r=erikgrinaker a=pavelkalinnikov

Previously, the checksum computation would run until completion unconditionally (unless the collection request comes before it). This is not the best spend of the limited pool capacity, because the result of this computation may never be requested.

After this commit, the checksum computation task is synchronized with the checksum collection request. Both wait at most 5 seconds until the other party has joined. Once joined, the computation starts; otherwise it is skipped. If any party abandons the request, the `replicaChecksum` record is preserved in the state and is scheduled for a GC later. This is to help the other party fail fast, instead of waiting, if it arrives late. This change also removes the no-longer-needed concurrency limit for these tasks, because tasks are cancelled reliably and will not pile up.

Fixes #77432

Release note (performance improvement): consistency checks are now properly cancelled on timeout, preventing them from piling up.

88768: ci: add MacOS ARM CI config r=jlinder a=rail

Previously, the MacOS ARM64 platform was added, but CI wouldn't run it. This PR adds a CI platform to build MacOS ARM64 binaries.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
@daniel-crlabs, that issue talks about latency spikes during backups. That indeed was work done under this admission control issue, specifically this PR: #86638. It cannot be backported to 22.1 since it relies on a patched Go runtime and a newer Go release. Feel free to send them this blog post we wrote about this work specifically: https://www.cockroachlabs.com/blog/rubbing-control-theory/.
Sounds great, thank you for the response, I'll pass this along to the CEA and the customer.
95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif

We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities, permitting work only under periods of low load. In working through #86638 (as part of #75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently to the latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots, and we've since replaced uses of soft slots with elastic CPU tokens. This PR just removes the now-dead code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package).

Fixes #88032.

Release note: None

---

First commit is from #95007.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Consider a store with the capacity to accept writes at a rate of R bytes/s. This is a thought exercise in that R is not fixed, and is affected by various factors like disk provisioning (which can dynamically change), whether the write was a batch written via the WAL or an ingested sstable (and how many bytes are landing in L0), and compaction concurrency adjustment. We have two categories of mechanisms that attempt to prevent store overload:

Capacity unaware mechanisms: These include

- `Engine.PreIngestDelay`: This applies a delay per "file" that is being ingested if we are over `rocksdb.ingest_backpressure.l0_file_count_threshold` (default of 20). The delay is proportional to how far above the threshold the store is, and uses `rocksdb.ingest_backpressure.max_delay`. It is unaware of the size of the file, i.e., the number of bytes being written. The bytes being written can vary significantly for bulk operations based on how many ranges are being buffered in `BufferingAdder` before generating sstables. This delay is applied (a) above raft to `AddSSTableRequest` even if it is being written as a batch (`IngestAsWrites` is true), (b) below raft in `addSSTablePreApply`, if the `AddSSTableRequest` was `!IngestAsWrites`.
- `AddSSTableRequest`: Applied at proposal time, using `kv.bulk_io_write.concurrent_addsstable_requests`, `kv.bulk_io_write.concurrent_addsstable_as_writes_requests`.
- `Store.snapshotApplySem`.

Capacity aware mechanisms: Admission control uses two overload thresholds, `admission.l0_sub_level_count_overload_threshold` (also 20, like the bulk back-pressuring threshold) and `admission.l0_file_count_overload_threshold` (1000), to decide when to limit admission control tokens for writes. The code estimates the capacity R based on how fast compactions are removing bytes from L0. It is unaware of the exact bytes that will be added by individual requests and computes an estimate per request based on past behavior. It is used only at proposal time (`Node.Batch` calls `KVAdmissionController.AdmitKVWork`). A simplified sketch of this token computation follows.
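The capacity-aware side is conceptually along these lines (a simplified sketch, not the actual ioLoadListener code, which smooths these values and spreads tokens across the interval): when L0 is over the sub-level/file thresholds, pace incoming writes to roughly the rate at which compactions removed bytes from L0 over the previous 15s interval.

```go
package admission // sketch only

import "math"

// computeByteTokens is a simplified sketch of the capacity-aware mechanism
// above: if L0 is not overloaded, writes are effectively unlimited; if it is,
// admit roughly as many bytes per interval as compactions drained out of L0
// in the previous interval. Thresholds mirror the cluster settings named
// above; everything else is illustrative.
func computeByteTokens(l0SubLevels, l0Files int, l0BytesCompactedLastInterval int64) int64 {
	const (
		subLevelThreshold = 20   // admission.l0_sub_level_count_overload_threshold
		fileThreshold     = 1000 // admission.l0_file_count_overload_threshold
	)
	if l0SubLevels < subLevelThreshold && l0Files < fileThreshold {
		return math.MaxInt64 // not overloaded: don't limit writes
	}
	return l0BytesCompactedLastInterval // overloaded: match the drain rate
}
```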
This setup has multiple deficiencies. We have existing issues for …

We propose to unify all these overload protection mechanisms such that there is one source of byte tokens representing what can be admitted and one queue of requests waiting for admission, with the `admission.WorkQueue` providing ordering across tenants. Admission control does not currently have support for different tenant weights, but that is easy to add if needed.

Deficiencies: `rocksdb.ingest_backpressure.l0_file_count_threshold` and `admission.l0_sub_level_count_overload_threshold` both default to 20. There is a way to address this in the future via a hierarchical token bucket scheme: the `admission.ioLoadListener` would produce high_overload_tokens and low_overload_tokens, where background operations have to consume both while foreground operations only use the former (a minimal sketch of this two-level scheme is included at the end of this description).

cc: @erikgrinaker @dt @nvanbenschoten
Jira issue: CRDB-12450
Epic CRDB-14607
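To make the hierarchical token idea above concrete, here is a minimal sketch (hypothetical names, not the actual admission package): foreground work needs only the high-overload tokens, while background (bulk) work must also obtain low-overload tokens, so bulk operations get paced at a lower level of store overload.

```go
package admission // sketch only; hypothetical names throughout

// hierarchicalGranter sketches the two-level token scheme described above:
// the ioLoadListener would replenish highOverloadTokens until the store is
// truly overloaded, and lowOverloadTokens only while the store is lightly
// loaded.
type hierarchicalGranter struct {
	highOverloadTokens int64
	lowOverloadTokens  int64
}

// admitForeground only needs high-overload tokens.
func (g *hierarchicalGranter) admitForeground(bytes int64) bool {
	if g.highOverloadTokens < bytes {
		return false
	}
	g.highOverloadTokens -= bytes
	return true
}

// admitBackground must consume both kinds, so bulk work backs off earlier.
func (g *hierarchicalGranter) admitBackground(bytes int64) bool {
	if g.highOverloadTokens < bytes || g.lowOverloadTokens < bytes {
		return false
	}
	g.highOverloadTokens -= bytes
	g.lowOverloadTokens -= bytes
	return true
}
```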