
admission,kv,bulk: unify (local) store overload protection via admission control #75066

Closed
sumeerbhola opened this issue Jan 18, 2022 · 18 comments
Labels
A-admission-control A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Jan 18, 2022

Consider a store with the capacity to accept writes at a rate of R bytes/s. This is a thought exercise in that R is not fixed, and is affected by various factors like disk provisioning (which can dynamically change), whether the write was a batch written via the WAL or an ingested sstable (and how many bytes are landing in L0), and compaction concurrency adjustment. We have two categories of mechanisms that attempt to prevent store overload:

  • Capacity unaware mechanisms: These include

    • Engine.PreIngestDelay: This applies a delay per "file" that is being ingested if we are over rocksdb.ingest_backpressure.l0_file_count_threshold (default of 20). The delay is proportional to how far above the threshold the store is, and uses rocksdb.ingest_backpressure.max_delay. It is unaware of the size of the file, i.e., the number of bytes being written. The bytes being written can vary significantly for bulk operations based on how many ranges are being buffered in BufferingAdder before generating sstables. This delay is applied (a) above raft to AddSSTableRequest even if it is being written as a batch (IngestAsWrites is true), (b) below raft in addSSTablePreApply, if the AddSSTableRequest was !IngestAsWrites.
    • Concurrency limit on AddSSTableRequest: Applied at proposal time, using kv.bulk_io_write.concurrent_addsstable_requests, kv.bulk_io_write.concurrent_addsstable_as_writes_requests.
    • Concurrency limit on snapshot application (which is done by ingesting sstables): Store.snapshotApplySem.
  • Capacity aware mechanisms: Admission control uses two overload thresholds, admission.l0_sub_level_count_overload_threshold (also 20, like the bulk back-pressuring threshold) and admission.l0_file_count_overload_threshold (1000), to decide when to limit admission control tokens for writes. The code estimates the capacity R based on how fast compactions are removing bytes from L0. It is unaware of the exact bytes that will be added by individual requests and computes an estimate per request based on past behavior. It is used only at proposal time (Node.Batch calls KVAdmissionController.AdmitKVWork). A sketch of this token computation follows the list.
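
To make the capacity-aware side concrete, here is a minimal Go sketch of how byte tokens could be derived from L0 state at each 15s adjustment interval. The type and function names are made up for illustration; the real ioLoadListener smooths these measurements and continually re-estimates rather than acting on a single snapshot.

    package main

    import "fmt"

    // l0Stats is a hypothetical snapshot of L0 state sampled at each 15s
    // adjustment interval; the field names are illustrative only.
    type l0Stats struct {
        subLevels      int   // current L0 sub-level count
        files          int   // current L0 file count
        bytesCompacted int64 // bytes removed from L0 by compactions last interval
    }

    // computeByteTokens returns the byte tokens to hand out over the next
    // interval. While L0 is healthy, writes are effectively unlimited; once
    // either overload threshold is crossed, admitted bytes are tied to the
    // measured compaction outflow so that L0 stops growing.
    func computeByteTokens(s l0Stats) int64 {
        const (
            subLevelThreshold = 20   // cf. admission.l0_sub_level_count_overload_threshold
            fileThreshold     = 1000 // cf. admission.l0_file_count_overload_threshold
            unlimited         = int64(1) << 50
        )
        if s.subLevels < subLevelThreshold && s.files < fileThreshold {
            return unlimited
        }
        // Overloaded: admit roughly half of what compactions removed last
        // interval, so the existing backlog can drain.
        return s.bytesCompacted / 2
    }

    func main() {
        overloaded := l0Stats{subLevels: 25, files: 300, bytesCompacted: 512 << 20}
        fmt.Println(computeByteTokens(overloaded)) // 268435456 (~256 MiB for the interval)
    }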

This setup has multiple deficiencies:

  • Tuning delay or concurrency knobs is not practical. The knobs will either be (a) too conservative, leaving throughput unused and prompting customer questions about why some bulk operation is running slowly while the nodes are underutilized, or (b) too aggressive, which we won't even know until a rare workload pushes against those knob settings and causes a node to overload.
  • Different knobs control different kinds of work. Someone may validate that stressing the system along each kind of work individually does not cause overload, but it may be that when all the work kinds spike at once (e.g. building a secondary index, backup/restore, and snapshot application) a node becomes overloaded.
  • Admission control tunes itself but isn't aware of the bytes that will be written by each work item. Estimates are fine for small writes but proper accounting for larger writes is preferable.

We have existing issues for

  • Using store health of all replicas for bulk operations (kvserver: remove below-raft throttling #57247) -- we treat that as orthogonal here and assume that we will continue to need a lower-level mechanism that serves as a final safeguard to protect an individual store. That is, we may continue to do below-raft throttling even though there are concerns with doing so, due to blocking a worker in the raft scheduler (which should be ok since the worker pool is per store) or holding latches for too long.
  • kv: add throttling for background GC operations based on store health #57248 discusses throttling of GC operations, which may do many writes -- we already subject these to admission control at the proposer, but without knowledge of the bytes being written. If each GCRequest can do a large batch write, we should compute the bytes being written so that admission control can utilize that information.

We propose to unify all these overload protection mechanisms such that there is one source of byte tokens representing what can be admitted and one queue of requests waiting for admission.

  • Admission control logic will be enhanced to compute byte-based tokens for store writes. Those requests that provide their byte size (which should be all large requests) will consume these tokens. Estimates will be used for two purposes: (a) for requests that don't provide their byte size, the estimate decides how many tokens to consume; (b) for ingest requests, an estimate of the fraction that will end up in L0 adjusts the token consumption. Just like the token computation, these estimates are continually adjusted based on stats (at every 15s interval).
  • All the paths that currently use capacity unaware mechanisms will call into the admission control WorkQueue for admission (they will not call the subsequent WorkQueue that throttles KV work based on cpu). After the admitted work is done, each ingest request will also report how many bytes were added to L0 (this will need a small Pebble API change), so that the token consumption can be corrected and we have data for future estimates; see the sketch after this list.
    • We will need to prevent double counting: requests that went through admission control above raft at the leaseholder should not be subject to admission control below raft at the same node.
  • Priority or tenant assignment: An open question is how "background" work like snapshot application or index construction should be prioritized in this unified setup. One could give such work a lower priority if it is indeed lower priority than foreground traffic. There may be exceptions, e.g. we may not want a lower priority for below-raft throttling. Alternatively, we could also use a made-up TenantID and rely on the proportional scheduling of tokens done in admission.WorkQueue across tenants. Admission control does not currently have support for different tenant weights, but that is easy to add if needed.
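
As referenced in the list above, here is a minimal Go sketch of the proposed byte-token accounting. WorkInfo, WorkDoneInfo, and storeWorkQueue are hypothetical stand-ins for the StoreWorkQueue described in the WIP PR, not its actual API.

    package main

    import (
        "context"
        "fmt"
    )

    // WorkInfo, WorkDoneInfo and storeWorkQueue are illustrative stand-ins for
    // the proposed StoreWorkQueue interface, not the actual CockroachDB types.
    type WorkInfo struct {
        WriteBytes int64 // 0 if the request cannot predict its size
        IsIngest   bool  // AddSSTable-style ingest vs. regular batch write
    }

    type WorkDoneInfo struct {
        ActualL0Bytes int64 // bytes that actually landed in L0
    }

    type storeWorkQueue struct {
        availableTokens  int64
        estimatePerWork  int64   // used when WriteBytes == 0
        ingestL0Fraction float64 // estimated fraction of an ingest landing in L0
    }

    // Admit deducts byte tokens for the work item. Requests that know their size
    // consume tokens proportional to it (scaled by the L0 fraction for ingests);
    // others consume the running per-work estimate. A real queue would block
    // here until tokens are available instead of letting the count go negative.
    func (q *storeWorkQueue) Admit(_ context.Context, w WorkInfo) int64 {
        tokens := w.WriteBytes
        switch {
        case tokens == 0:
            tokens = q.estimatePerWork
        case w.IsIngest:
            tokens = int64(float64(tokens) * q.ingestL0Fraction)
        }
        q.availableTokens -= tokens
        return tokens
    }

    // AdmittedWorkDone reconciles the upfront deduction against the bytes that
    // actually reached L0, returning (or charging) the difference.
    func (q *storeWorkQueue) AdmittedWorkDone(charged int64, d WorkDoneInfo) {
        q.availableTokens += charged - d.ActualL0Bytes
    }

    func main() {
        q := &storeWorkQueue{availableTokens: 64 << 20, estimatePerWork: 32 << 10, ingestL0Fraction: 0.5}
        charged := q.Admit(context.Background(), WorkInfo{WriteBytes: 8 << 20, IsIngest: true})
        q.AdmittedWorkDone(charged, WorkDoneInfo{ActualL0Bytes: 2 << 20})
        fmt.Println(q.availableTokens >> 20) // 62: only 2 MiB was ultimately charged
    }

The key property is that admission charges an upfront estimate and then reconciles it once the true L0 contribution is known, which also feeds the estimates used for requests that cannot predict their size.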

Deficiencies:

  • The overload threshold at which we want to start constraining writes could be different for user-facing and background operations: by sharing a single queue and a single source of tokens for that queue, we also share the same overload thresholds. This is probably not an immediate problem since rocksdb.ingest_backpressure.l0_file_count_threshold and admission.l0_sub_level_count_overload_threshold both default to 20. There is a way to address this in the future via a hierarchical token bucket scheme (sketched below): the admission.ioLoadListener would produce high_overload_tokens and low_overload_tokens, where background operations have to consume both while foreground operations only use the former.
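
A minimal sketch of that hierarchical scheme, with made-up names; in a real implementation the admission.ioLoadListener would replenish both buckets at each adjustment interval, sized according to their respective overload thresholds.

    package main

    import "fmt"

    // tokenBuckets is a made-up illustration of the hierarchical idea: foreground
    // work draws only from the bucket tied to the higher overload threshold,
    // while background (bulk) work must also draw from a bucket tied to a lower
    // threshold, so it backs off earlier.
    type tokenBuckets struct {
        highOverloadTokens int64 // constrains all work, only when badly overloaded
        lowOverloadTokens  int64 // additionally constrains background work, earlier
    }

    func (b *tokenBuckets) tryAdmit(bytes int64, background bool) bool {
        if b.highOverloadTokens < bytes {
            return false
        }
        if background && b.lowOverloadTokens < bytes {
            return false
        }
        b.highOverloadTokens -= bytes
        if background {
            b.lowOverloadTokens -= bytes
        }
        return true
    }

    func main() {
        b := &tokenBuckets{highOverloadTokens: 100 << 20, lowOverloadTokens: 10 << 20}
        fmt.Println(b.tryAdmit(32<<20, true))  // false: background throttled early
        fmt.Println(b.tryAdmit(32<<20, false)) // true: foreground still admitted
    }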

cc: @erikgrinaker @dt @nvanbenschoten

Jira issue: CRDB-12450

Epic CRDB-14607

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. A-admission-control labels Jan 18, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 19, 2022
The store write admission control path now uses a StoreWorkQueue
which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest.
  This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so
  token consumption can be adjusted.

The main framework change is that a single work item can consume
multiple (byte) tokens, which ripples through the various
interfaces including requester, granter. There is associated
cleanup: kvGranter that was handling both slots and tokens is
eliminated since in practice it was only doing one or the other.
Instead for the slot case the slotGranter is reused. For the token
case the kvStoreTokenGranter is created.

The main logic change is in ioLoadListener which computes byte
tokens and various estimates.

There are TODOs to fix tests that will fail.

Informs cockroachdb#75066

Release note: None
@sumeerbhola
Collaborator Author

I've created a WIP PR for the admission control changes (it lacks tests and I need to fix existing tests, hence WIP), for early opinions #75120

The most familiar reviewers for the admission control code are @RaduBerinde and @ajwerner, but (a) it may be premature for them to review the PR while we aren't settled on the wider approach, (b) if we go ahead with this we should probably build some expertise in KV and storage with that code base since they are the ones who are going to be diagnosing how this is behaving in production settings.

@erikgrinaker
Contributor

erikgrinaker commented Jan 19, 2022

Overall, I'm very supportive of this proposal -- I don't see how we can properly avoid overload while maximizing throughput without having all work pass through a common throttling mechanism. As you point out, we'll likely need prioritization here too.

We will need to prevent double counting: requests that went through admission control above raft at the leaseholder should not be subject to admission control below raft at the same node.

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

Admission control logic will be enhanced to compute byte-based tokens for store writes. Those requests that provide their byte size (which should be all large requests) will consume these tokens.

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

they will not call the subsequent WorkQueue that throttles KV work based on cpu

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that. CC @dt.

@sumeerbhola
Collaborator Author

sumeerbhola commented Jan 19, 2022

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking? We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (`bytesAddedPerWork := float64(bytesAdded) / float64(admitted)`), so it is comparable with AddSSTableRequest. If there are going to be other heavy-weight write requests (GCRequest?) then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) in MVCCGarbageCollect make a tighter guess on the size by applying key prefix compression (and use this to return tokens post-work).

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.

@erikgrinaker
Contributor

erikgrinaker commented Jan 23, 2022

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking?

Latching/locking is a major one, and in the case of long-running transactions these can be blocked for a substantial amount of time (e.g. minutes). If admission control does not take this into account it may be vulnerable to thundering herds. But there are also others, e.g. AddSSTable currently uses a semaphore to limit concurrent requests below the AdmitKVWork() call in Node.Batch, and this can wait for a long time as well.

We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.

Yeah, I've been wondering if we might want a scheme which checks for any throttling right before or after latches have been acquired -- if throttled, the request would release its latches (if acquired) and go back to wait for resources and latches again. Of course, this could be vulnerable to starvation (depending on the queue policy), but it would avoid thundering herds as well as throttling below latches.
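
To make the control flow concrete, here is a rough Go sketch of that idea; every type in it is a hypothetical stand-in. The only point is that a request never sits in the admission queue while holding latches, and the retry loop is where a starvation bound or queue-policy decision would live.

    package main

    import (
        "context"
        "errors"
        "fmt"
    )

    // Everything below is a hypothetical stand-in; only the control flow matters.
    type latches struct{ held bool }

    func (l *latches) acquire() { l.held = true }
    func (l *latches) release() { l.held = false }

    type admissionQueue struct{ checks int }

    // wait blocks until tokens are available (elided here).
    func (q *admissionQueue) wait(ctx context.Context) error { return ctx.Err() }

    // overloadedNow reports store overload; for the example it reports overload
    // on the first check only, so the request re-queues exactly once.
    func (q *admissionQueue) overloadedNow() bool {
        q.checks++
        return q.checks == 1
    }

    func executeWithRequeue(ctx context.Context, q *admissionQueue, l *latches) error {
        for attempt := 0; attempt < 10; attempt++ {
            if err := q.wait(ctx); err != nil { // wait for admission latch-free
                return err
            }
            l.acquire()
            if q.overloadedNow() { // store became overloaded while waiting for latches
                l.release() // drop latches and go back to the admission queue
                continue
            }
            fmt.Println("evaluating request under latches, attempt", attempt)
            l.release()
            return nil
        }
        return errors.New("starved: too many re-queue attempts") // the starvation risk noted above
    }

    func main() {
        _ = executeWithRequeue(context.Background(), &admissionQueue{}, &latches{})
    }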

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (`bytesAddedPerWork := float64(bytesAdded) / float64(admitted)`), so it is comparable with AddSSTableRequest. If there are going to be other heavy-weight write requests (GCRequest?) then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) in MVCCGarbageCollect make a tighter guess on the size by applying key prefix compression (and use this to return tokens post-work).

I don't think we necessarily need to do anything about this now, but we should ideally have a common, comparable estimate of the amount of storage work required for a given request.

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.

Nice -- we'll know the concurrency, so we can adjust for it. However, these requests will be split into a CPU-bound evaluation phase (proposer only) and an IO-bound replication/application phase. I believe this is generally the case for most other requests too, as we only replicate/apply the resulting Pebble batch which is IO bound. Does the work queue release the slot after evaluation, or only after the request returns? CC @dt.

@sumeerbhola
Collaborator Author

Does the work queue release the slot after evaluation, or only after the request returns?

It returns the slot after evaluation.

  • This is meant to be robust to varying IO time/CPU time ratios, since the total number of slots is dynamically adjusted to keep the CPU busy.
  • That said, the assumption in the current integration was that most heavy CPU work would be reads, so proposal evaluation and request return are the same. For writes, we could improve the integration with admission control to call AdmittedWorkDone on the CPU WorkQueue after proposal evaluation and later call AdmittedWorkDone on the IO queue on request return (see the sketch below).
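
A rough sketch of what that split accounting could look like, using made-up queue types rather than the admission package's actual interfaces: the CPU slot is released as soon as evaluation finishes, while the store queue's byte tokens are settled only when the request returns.

    package main

    import "fmt"

    // cpuQueue and storeQueue are illustrative, not the admission package's
    // actual interfaces.
    type cpuQueue struct{ usedSlots int }

    func (q *cpuQueue) admit()    { q.usedSlots++ }
    func (q *cpuQueue) workDone() { q.usedSlots-- }

    type storeQueue struct{ tokens int64 }

    func (q *storeQueue) admit(estimate int64) int64 { q.tokens -= estimate; return estimate }

    func (q *storeQueue) workDone(charged, actual int64) { q.tokens += charged - actual }

    func handleWrite(cpu *cpuQueue, store *storeQueue, estimatedBytes int64) {
        charged := store.admit(estimatedBytes)
        cpu.admit()

        evaluateProposal() // phase 1: CPU-bound (e.g. AddSSTable key rewriting)
        cpu.workDone()     // slot freed as soon as evaluation completes

        actual := applyToStore()        // phase 2: IO-bound replication/application
        store.workDone(charged, actual) // settle byte tokens on request return
    }

    func evaluateProposal()   {}
    func applyToStore() int64 { return 4 << 20 }

    func main() {
        cpu, store := &cpuQueue{}, &storeQueue{tokens: 64 << 20}
        handleWrite(cpu, store, 8<<20)
        fmt.Println(cpu.usedSlots, store.tokens>>20) // 0 60: 4 MiB net charged
    }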

@blathers-crl blathers-crl bot added the T-storage Storage Team label Jan 25, 2022
@sumeerbhola
Collaborator Author

The discussion on https://github.com/cockroachlabs/support/issues/1374 overlaps with the one here -- copy-pasting some stuff here about below-raft throttling, which seems to be generally necessary (due to the high rate of applying raft entries when a node restarts and catches up via the log). There is a debate about whether admission control should be involved.

I'm not sure if admission control is the right tool here though. We're talking about catching up on work that has already been accepted, not throttling new work. Since the only thing we can really do here is to detect overload and slow down log replay, it seems like Pebble should be able to do this on its own by backpressuring writers

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.

(@erikgrinaker let's continue that discussion on this issue)

@dt
Member

dt commented Jan 27, 2022

Do we think that a request is the unit of work we want to choose to process now vs later?

I wonder if requests like RevertRange or ExportRequest or even just larger vs smaller ScanRequests complicate things?

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it evaluates, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our cpu/disk bandwidth/iops/whatever, that Get() request will be negatively impacted.

Do we want to try to somehow let that Get() ask for priority? Should the longer-running ExportRequest be asking for some sort of quota as it runs e.g. each time its underlying LSM iterator wants to open another SST, or it has used some amount of computation time, or something like that?

@erikgrinaker
Contributor

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.

Yeah, this does make sense when considering multiple ranges and stores, and work priorities between them. We'll need to avoid double-counting above/below Raft though.

Do you have any thoughts on cross-node admission control? I.e. could it replace the Raft quota pool with knowledge about followers' store health and Raft log state, or would we need to combine/augment them?

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that.

We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620). This is obviously pretty bad -- both because the number of individual requests in a batch varies significantly, and also because the work done by any individual request varies significantly (e.g. Get vs Scan). This is what the allocator uses to decide e.g. replica placement and load-based splits, leading to some pretty bad decisions at times.

We now have multiple differing measures of "work" (QPS and WPS, RUs, admission control tokens), which could get confusing and hard to reason about. We should try to harmonize these somehow -- even though e.g. QPS has the luxury of being after-the-fact and can rely on actual measurements.

@sumeerbhola
Collaborator Author

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that.
We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620).

I have no objection to being smarter :), provided (a) we find that we need to be, based on experiments, and (b) we can figure out something effective.

  • Making a request ask for resources as it is executing is like yielding in cooperative scheduling. My initial thinking when working on admission control was that it didn't seem necessary for KV and storage work because (a) requests will do bounded work based on bytes etc. to return, (b) the bad case that @dt describes, with a lot of scanning work but no result and hence a long run time, is hopefully rare (and will get rarer with better filtering of irrelevant versions, like with block property filters, and the planned work on an sstable format change to separate old versions), (c) there is typically enough resource capacity to run many concurrent requests (e.g. for cpu tokens the system typically ends up with 3+ slots per cpu), so there should be enough concurrency for some user-facing work to get in. There are caveats, like (b) above, and making sure the priority assignment is actually sane (or use synthetic tenants if we want some measure of fair sharing). There has been some discussion about yielding for SQL work in admission control, which can be less bounded in terms of cpu -- it hasn't happened yet (see https://docs.google.com/document/d/1Hg1nrpEm3Cj0eSFRLHgDEJEVzQBKfEUs0WyYfI59i24/edit?disco=AAAAM15RbCE).
  • I believe that admission control is doing better than range load metrics for 2 reasons:
    • The cpu accounting is done via slots (not request rate), so a longer running request holds onto the slot for longer.
    • The write IO accounting is computing the actual mean of bytes added and using it for forecasting per request. We can of course do even better here, as this issue outlined earlier.
  • Regarding what we should do for range load metrics, this seems to be a perennial topic. We should finally nail down what we want here. Read and write bytes from the store, per replica, are definitely within reach. CPU-seconds per replica has been the hard one, without golang changes. Perhaps admission slot-seconds can be a proxy, with the caveat that different replicas may have different amounts of io-time or contention-wait-time (I don't mention scheduler-wait-time since admission control reduces this, and it is the same across all replicas at a node). Subtracting contention-wait-time is not hard since there is one place to instrument (concurrency package) and it happens once per request. We'll still end up with something that can be critiqued as "little better than walltime", but I suspect that it is substantially better for the common case. A rough sketch of that proxy follows.
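
As referenced in the last bullet, a small sketch of the "slot-seconds minus contention-wait" proxy; all names here are hypothetical.

    package main

    import (
        "fmt"
        "time"
    )

    // replicaLoad is a hypothetical accumulator for the "slot-seconds minus
    // contention-wait" proxy; none of these names exist in the codebase.
    type replicaLoad struct {
        slotSeconds       float64 // wall time admission slots were held for this replica's requests
        contentionSeconds float64 // time spent waiting on latches/locks while holding a slot
    }

    func (r *replicaLoad) record(slotHeld, contentionWait time.Duration) {
        r.slotSeconds += slotHeld.Seconds()
        r.contentionSeconds += contentionWait.Seconds()
    }

    // cpuProxy approximates CPU-seconds consumed on behalf of this replica. It
    // still includes IO wait, so it over-counts for IO-heavy replicas, which is
    // the caveat noted above.
    func (r *replicaLoad) cpuProxy() float64 {
        return r.slotSeconds - r.contentionSeconds
    }

    func main() {
        var r replicaLoad
        r.record(120*time.Millisecond, 80*time.Millisecond)
        fmt.Printf("%.3f proxy cpu-seconds\n", r.cpuProxy()) // 0.040
    }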

@erikgrinaker
Contributor

erikgrinaker commented Mar 26, 2022

We may want to include disk usage protection here as well, to avoid running the node out of disk with e.g. index backfills. Wrote up a proposal in #79210.

@tbg
Member

tbg commented May 31, 2022

What is remaining in scope for this issue after #79092 (comment) is addressed (i.e. at which point bulk requests are "just" throttled in admission control and nowhere else)? I am trying to clean up the many overlapping conversations we are having.

My understanding is that this issue only deals with unifying bulk requests with local admission control (i.e. remove any special casing outside of admission control for these requests). The other big issue we have is #79215, for which I just completed cleaning up the initial post and breaking out sub-issues. That issue tracks the short-term (i.e. this release) plan for dealing with appends and snapshots.

Then, there is also #79755 which is roughly about how to do "distributed admission control", i.e. taking follower health into account in a less ad-hoc way, and this is for now unplanned (past what's required as part of #79755).

In my understanding this all fits together then, but if what I think this issue is about isn't correct, please correct me.

@tbg tbg changed the title admission,kv,bulk: unify store overload protection via admission control admission,kv,bulk: unify (local) store overload protection via admission control May 31, 2022
@irfansharif
Contributor

irfansharif commented Sep 16, 2022

This issue has a lot of really good discussion, and there were a lot of follow-on issues filed for specific takeaways. I've read through it a few more times to make sure nothing fell through the cracks.

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it evaluates, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our cpu/disk bandwidth/iops/whatever, that Get() request will be negatively impacted.

#86638 addresses the backup case using a form of cooperative scheduling, bounding how much work a single export request can do.

To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately.

The detail around AddSST key-rewriting post MVCC-ification, and it being CPU-intensive, is also something that can integrate with the elastic CPU tokens machinery introduced above. There are other factors around disk use noted above and in the relevant issues (a bunch of which are being collected in https://github.com/orgs/cockroachdb/projects/32/views/1). The next set of things (till mid-Jan, after 22.2 stability and EOY holidays) we're planning to look at is focused on index backfills, which will include taking a better look at AddSSTs. We're not looking at using AC for disk storage control soon, or for snapshots (the hope is cockroachdb/pebble#1683 makes that less of a problem; it is being worked on in 23.1). The outstanding follower-writes throttling is being tracked on the Repl side for now. Throttling of the replica/MVCC GC queues is not in near-term scope (and is perhaps something #42514 can push further down the line). All the other discussion around replica load is interesting and can be continued elsewhere.

Closing this issue. We should continue removing throttling knobs, keeping them around just for escalations. Let's file specific issues for specific ones if we've missed any.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 24, 2022
..until expiry. Informs cockroachdb#82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. cockroachdb#82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a sql.stmt_diagnostics.collect_continuously.enabled to
collect captures continuously over some period of time for aggregate
analysis. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.

To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.

---

To get some idea of how this can be used, consider the kinds of
experiments we're running as part of cockroachdb#75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.

    > SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    > WITH histogram AS
         (SELECT extract('minute', collected_at) AS minute,
                 count(*) FROM system.statement_diagnostics
          GROUP BY minute)
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |              freq
    ---------+---------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were interested in.

Release note: None
craig bot pushed a commit that referenced this issue Sep 27, 2022
83020: stmtdiagnostics: support continuous bundle collection r=irfansharif a=irfansharif

..until expiry. Informs #82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. #82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a `sql.stmt_diagnostics.collect_continuously_until_expired`
to collect captures continuously over some period of time for aggregate
analysis. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.

To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.

---

To get some idea of how this can be used, consider the kinds of
experiments we're running as part of #75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.

    > SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    > WITH histogram AS
         (SELECT extract('minute', collected_at) AS minute,
                 count(*) FROM system.statement_diagnostics
          GROUP BY minute)
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |              freq
    ---------+---------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were interested in.

Release note: None

86591: kvserver: sync checksum computation with long poll r=erikgrinaker a=pavelkalinnikov

Previously, the checksum computation would run to completion unconditionally
(unless the collection request arrived first). This is not the best use of
the limited pool capacity, because the result of this computation may never be
requested.

After this commit, the checksum computation task is synchronized with the
checksum collection request. Each waits at most 5 seconds for the other party
to join. Once joined, the computation starts; otherwise it is skipped.

If either party abandons the request, the `replicaChecksum` record is preserved
in the state and scheduled for GC later. This helps the other party fail fast,
instead of waiting, if it arrives late.

This change also removes the no-longer-needed concurrency limit for these tasks,
because tasks are now canceled reliably and will not pile up.

Fixes #77432

Release note (performance improvement): consistency checks are now properly
cancelled on timeout, preventing them from piling up.

88768: ci: add MacOS ARM CI config r=jlinder a=rail

Previously, the MacOS ARM64 platform was added, but CI wouldn't run it.

This PR adds a CI platform to build MacOS ARM64 binaries.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
@daniel-crlabs
Contributor

@lancel66 (Lance Lierheimer) has asked the following question (affecting ZD 15188) about this GH issue:

Question on this: What is the effort to backport #75066 to v22.1? Customer will not be moving to v22.2 for another 6 months or so.

@irfansharif
Contributor

@daniel-crlabs, that issue talks about latency spikes during backups. That was indeed work done under this admission control issue, specifically this PR: #86638. It cannot be backported to 22.1 since it relies on a patched Go runtime and a newer release of Go. Feel free to send them this blog post we wrote about this work specifically: https://www.cockroachlabs.com/blog/rubbing-control-theory/.

@daniel-crlabs
Contributor

Sounds great, thank you for the response, I'll pass this along to the CEA and the customer.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 20, 2023
We originally introduced these notions in admission control (cockroachdb#78519) for
additional threads for Pebble compaction compression. We envisioned
granting these "squishy" slots to background activities and permitting
work only under periods of low load. In working through cockroachdb#86638 (as part
of cockroachdb#75066), we observed experimentally that the moderate-slots count was
not sensitive enough to scheduling latency, and consequently to the
latency observed by foreground traffic. Elastic CPU tokens, the kind now
being used for backups, offer an alternative to soft slots. We've since
replaced uses of soft slots with elastic CPU tokens.

This PR just removes the now-dead code around soft/moderate load
slots (it's better to minimize the number of mechanisms in the admission
package).

Release note: None
craig bot pushed a commit that referenced this issue Jan 23, 2023
95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif

We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities and permitting work only under periods of low load. In working through #86638 (as part of #75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently to the latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots. We've since replaced uses of soft slots with elastic CPU tokens.

This PR just removes the now-dead code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package). Fixes #88032.

Release note: None

---

First commit is from #95007.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>