tracing,stmtdiagnostics: bounded overhead, continuous tail trace capture #82896

Closed
irfansharif opened this issue Jun 14, 2022 · 4 comments
Labels
A-sql-observability Related to observability of the SQL layer A-tracing Relating to tracing in CockroachDB. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item.

Comments

@irfansharif
Contributor

irfansharif commented Jun 14, 2022

Is your feature request related to a problem? Please describe.

It's difficult today to diagnose the causes of high-tail executions. It's doubly difficult to do so when the high-tail events occurred in the past, since we don't do any form of always-on/continuous profiling.

Describe alternatives you've considered

Keep the status quo; we have two ways to get verbose traces for tail events today:

  • sql.trace.stmt.enable_threshold, a somewhat blunt hammer that enables verbose tracing for every execution, all the time (see the snippet after this list). Because it's so coarse-grained, enabling it imposes a large performance penalty; it's likely that enabling the setting itself drives high-tail events, which is not what we want. Secondly, since it's not scoped to specific statements (outlier events may only apply to specific statements, and different statements can have different tail behaviours), it creates a very verbose haystack to sift through in CRDB logs to find the execution we're interested in. In previous escalations we've approached this setting somewhat timidly, manually switching it on and off at specific points in time in the hope of capturing the right tail event.
  • Specifying a minimum latency target when issuing a diagnostics bundle request. This is nice in that it's scoped to a specific fingerprint, but it has two limitations (exactly what this issue proposes we address):
    • When a latency target is set, verbose tracing is enabled for all future executions until that target is met or the request expires. This can be expensive, especially when trying to capture a tail event for a frequently executed statement.
    • It captures only the first event that meets the criteria (i.e. it's not continuous).
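
For concreteness, the former is just a cluster setting; turning it on looks roughly like the following ('100ms' here is an arbitrary, illustrative threshold):

    > SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '100ms';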

Describe the solution you'd like

As a first step, I want roughly the following ability:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold above
  which a trace is persisted/logged/etc. With a given stmt rate R (say
  1000/s) and a given percentile I'm trying to capture (say p99.9), we
  have 0.001R stmt/s in the 99.9th percentile (1/s in our example). We
  should be able to set a sampling probability P such that, with high
  likelihood (>95%), we capture at least one trace over the next S
  seconds. The sampling rate lets you control how much overhead you're
  introducing for those statements in aggregate; dialing it up lets you
  lower S. You might want to do that for infrequently executed
  statements.

This gives us a mechanism for bounded-overhead tail trace capture (bounded by probability and scope, targeting only a single stmt). A prototype of such a thing can be found here: #82750 (backportable form for 21.2, 22.1: #82864).
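
To make the arithmetic concrete (a back-of-the-envelope model that assumes a steady statement rate and independent per-execution sampling; it's not something CRDB computes for you): tail events arrive at roughly 0.001R per second, so the likelihood of capturing at least one trace over S seconds is 1 - (1 - P)^(0.001 * R * S). With R = 1000/s, P = 1%, and S = 300s that works out to about 95%:

    > SELECT 1 - power(1 - 0.01, 0.001 * 1000 * 300); -- P = 1%, R = 1000/s, S = 300s; ≈ 0.95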

As a second step, I'd like us to capture such outlier traces continuously, so we have traces on hand for high-tail events that occurred in the past (often what we want during escalations). For that I think we'd need to specify the maximum number of traces we want to capture over some period of time for a given fingerprint, plus an eviction policy so we don't accrue garbage indefinitely and keep the more valuable traces around (probably just evict the oldest; all things equal, we likely only care about recent traces). A crude approximation of this (capturing multiple bundles for a given request as long as the request hasn't expired) can be found here: #83020.
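
As a rough sketch of the "evict the oldest" idea (purely illustrative: it assumes bundles keep living as rows in system.statement_diagnostics with a collected_at timestamp as in the queries further down, the statement_fingerprint column name and the limit of 50 are assumptions, and in practice this would be done by the diagnostics machinery rather than by hand):

    -- hypothetical cleanup: keep only the 50 newest bundles for one fingerprint
    DELETE FROM system.statement_diagnostics
     WHERE statement_fingerprint = 'INSERT INTO new_order(no_o_id, ...)'
       AND id NOT IN (
             SELECT id FROM system.statement_diagnostics
              WHERE statement_fingerprint = 'INSERT INTO new_order(no_o_id, ...)'
              ORDER BY collected_at DESC
              LIMIT 50);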

Additional context

Very closely relates to #80666.

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-tracing Relating to tracing in CockroachDB. A-sql-observability Related to observability of the SQL layer O-postmortem Originated from a Postmortem action item. labels Jun 14, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 15, 2022
We enable the following ability (the first half of cockroachdb#82896):

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold for
  which a trace is persisted.

With a given stmt rate (say 1000/s) and a given percentile we're trying
to capture (say p99.9), we have 0.001R stmt/s in the 99.9th percentile
(1/s in our example). We should be able to set a sampling probability P
such that with high likelihood (>95%) we capture at least one trace over
the next S seconds. The sampling rate lets you control how much overhead
you're introducing for those statements in aggregate; dialing it up lets
you lower S. You might want to do such a thing for infrequently executed
statements. We do all this using the existing statement diagnostics
machinery. It looks roughly as follows:

   > SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO ...', -- fingerprint
      0.01,              -- sampling probability
      '120ms'::INTERVAL, -- latency target
      '15m'::INTERVAL    -- request expiration
     );

   $ cockroach statement-diag list --insecure
    No statement diagnostics bundles available.
    Outstanding activation requests:
      ID                  Activation time          Statement        Sampling probability  Min execution latency  Expires at
      770367610894417921  2022-06-14 00:52:17 UTC  INSERT INTO ...  0.0100                90ms                   2022-06-13 21:07:17.503954 -0400 EDT

   # wait for an eventual capture..

   $ cockroach statement-diag list --insecure
    Statement diagnostics bundles:
      ID                  Collection time          Statement
      770367073624621057  2022-06-14 00:49:33 UTC  INSERT INTO ...

   $ cockroach statement-diag download 770367073624621057 --insecure
   Bundle saved to "stmt-bundle-770367073624621057.zip"

   $ unzip stmt-bundle-770367073624621057.zip stmt
   $ cat stmt/trace.txt
    ...
        0.846ms      0.017ms      event:kv/kvserver/spanlatch/manager.go:532
                                    [n1,s1,r76/1:/Table/10{7/1-8}] waiting to
                                    acquire read latch /Table/107/1/41/7/0@0,0,
                                    held by write latch
                                    /Table/107/1/41/7/0@1655167773.362147000,0
        98.776ms     97.930ms     event:kv/kvserver/concurrency/concurrency_manager.go:301
                                    [n1,s1,r76/1:/Table/10{7/1-8}] scanning
                                    lock table for conflicting locks

We leave wiring this up to the UI and continuous capture (second half of
cockroachdb#82896) to future PRs.

Release note: None
craig bot pushed a commit that referenced this issue Jun 16, 2022
82750: stmtdiagnostics: support probabilistic bundle collection r=irfansharif a=irfansharif

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 17, 2022
..until expiry. Informs cockroachdb#82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. cockroachdb#82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a `sql.stmt_diagnostics.collect_continuously_until_expired`
cluster setting to collect captures continuously over some period of time for aggregate
analysis. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.

To safeguard against misuse (in its current form we should only use it
for experiments or escalations in controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.
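
As a rough illustration (assuming the setting is a boolean and using the name as it appears in this PR), enabling continuous collection before issuing the bundle request would look like:

    > SET CLUSTER SETTING sql.stmt_diagnostics.collect_continuously_until_expired = true;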

---

To get some idea of how this can be used, consider the kinds of
experiments we're running as part of cockroachdb#75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.

    > SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    > WITH histogram AS
         (SELECT extract('minute', collected_at) AS minute,
                 count(*) FROM system.statement_diagnostics
          GROUP BY minute)
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |              freq
    ---------+---------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were interested in.

Release note: None
@kevin-v-ngo

Hi @irfansharif, for UI enhancements do we only need to allow the user to configure a sampling probability? Any other improvements for this version? Adding @Annebirzin

Note: We discussed being able to view all bundles for a fingerprint. Today, users can already view all bundles on the Diagnostics tab in the statement fingerprint details view.

Future improvements we discussed, which would also require additional backend plumbing:

  1. Provide a time-based retention policy and a bundle-count limit governing how long bundles are kept
  2. We have a timer-based countdown for when tracing will be disabled. We should also allow users to specify the number of bundles to collect before tracing is disabled, or a specific expiration time. We should also include an option to keep tracing on until it's manually disabled.
  3. Surface the capture time in the UI (we only show the activation time today)

Missing anything?

craig bot pushed a commit that referenced this issue Sep 27, 2022
83020: stmtdiagnostics: support continuous bundle collection r=irfansharif a=irfansharif

86591: kvserver: sync checksum computation with long poll r=erikgrinaker a=pavelkalinnikov

Previously, the checksum computation would run to completion unconditionally
(unless the collection request arrived first). This is not the best use of
the limited pool capacity, because the result of the computation may never be
requested.

After this commit, the checksum computation task is synchronized with the
checksum collection request. Each waits at most 5 seconds for the other party
to join. Once joined, the computation starts; otherwise it is skipped.

If any party abandons the request, the `replicaChecksum` record is preserved
in the state and scheduled for GC later. This helps the other party fail
fast, instead of waiting, if it arrives late.

This change also removes the no longer needed concurrency limit for the tasks,
because tasks are canceled reliably and will not pile up.

Fixes #77432

Release note (performance improvement): consistency checks are now properly
cancelled on timeout, preventing them from piling up.

88768: ci: add MacOS ARM CI config r=jlinder a=rail

Previously, the macOS ARM64 platform was added, but CI wouldn't run it.

This PR adds a CI platform to build MacOS ARM64 binaries.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 11, 2022
This commit introduces a backportable alternative to cockroachdb#82750 and cockroachdb#83020,
to help address cockroachdb#82896. It gives a more palatable alternative to
sql.trace.stmt.enable_threshold. See those PRs/issues for verbose
commentary, but roughly we can do the following:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold for
  which a trace is logged. With a given stmt rate (say 1000/s) and a
  given percentile we're trying to capture (say p99.9), we have N stmt/s
  in the 99.9th percentile (1 in our example). We should be able to set
  a sampling probability P such that with high likelihood (>95%) we
  capture at least one trace over the next S seconds. The sampling rate
  lets you control how much overhead you're introducing for those
  statements in aggregate; dialing it up lets you lower S. You might
  want to do such a thing for infrequently executed statements.

This commit does the simplest thing: ask for all input through shoddy
cluster settings and just log traces to our usual files. Below we
outline an example of how to use these settings to catch outlier
executions for writes to the history table in TPC-C. When using it, we
did not see an overall increase in p99s as a result of the sampling. It
also took only about 10s to capture this data, which showed clearly that
the outlier latency was due to latch waits.

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY COUNT DESC limit 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799'; -- fingerprint
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;   -- 1% sampling probability
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';   -- latency threshold

     0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
     0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
    99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
    99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Compare this to sql.trace.stmt.enable_threshold which enables verbose
tracing for all statements with 100% probability. It introduces far too
much overhead for it to be used in reasonable production clusters. The
overhead also often masks the problems we're looking to capture.

Release note (general change): We introduced three cluster settings to
collect trace data for outlier executions with low overhead.
This is only going to be available in 22.1; in 22.2 and beyond we have
other mechanisms to collect outlier traces. Traces come in handy when
looking to investigate latency spikes, and these three settings are
intended to supplant most uses of sql.trace.stmt.enable_threshold. That
setting enables verbose tracing for all statements with 100% probability
which can cause a lot of overhead in production clusters, and also a lot
of logging pressure. Instead we introduce the following:
- trace.fingerprint
- trace.fingerprint.probability
- trace.fingerprint.threshold
Put together (all three have to be set), they enable tracing only for the
statement with the given hex-encoded fingerprint, do so probabilistically
(with whatever probability trace.fingerprint.probability is set to), and
log the trace only if the latency threshold is exceeded (configured using
trace.fingerprint.threshold). To obtain a hex-encoded fingerprint,
consider looking at the contents of system.statement_statistics. For
example:

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY COUNT DESC limit 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)
@irfansharif
Contributor Author

irfansharif commented Oct 13, 2022

The PRs above provide this ability, albeit through manual cluster settings (in v22.1) and builtins (in v22.2+). @kevin-v-ngo, #80666 (comment) is relevant. I'm not sure yet how this could go UI-wise: you could either have users specify through our UI what statements need to be traced and for what target latencies, or you could have CRDB automatically always collect traces with some low probability, keeping around those that are considered "outliers" now that we've introduced such notions.

I think that would be cool: "pick a statement, and see why the slowest executions in the last X hours took that long. Is it due to latches being held? Is it due to network hops? Is it due to excessive MVCC garbage scanning?" These traces can be a gold mine. Within CRDB we can include structured events in traces for things that are interesting, like we do for contention events. You can imagine this auto-collected trace data per statement automatically bubbling up as nudges to the user indicating that optimizing a certain statement or set of statements would require doing X, like "reduce the amount of garbage accumulated" or "reduce the amount of contention somehow". But the auto-collection of traces is what's key; it's something the system is always doing in the background with low overhead, and the only user action needed is a reaction to actionable data.

The things you note above still hold true: we need sane retention policies to not collect traces unboundedly, and to somehow surface relevant trace snippets to users (maybe). I'll leave this issue open for general brain-dumping, similar to #80666, but if we want to take this forward we should use more targeted issues that are less meandering.

@irfansharif
Contributor Author

Just spitballing: imagine we always collected traces, and in our latency graphs we annotated sudden spikes with relevant trace captures at that point in time, if we happened to collect them. That'd be awesome -- you can go from metric to trace/statement bundle without having done anything upfront.

@irfansharif
Contributor Author

Follow-up issues: #90292, #90288.
