outliers: incorporate tracing information #80666
Internal Slack thread about difficulties in tail execution diagnosis; this will keep coming up in the form of support escalations that are difficult to root-cause. +cc @nkodali, perhaps of interest to you and the KV-obs team given where tracing sits.
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to the following ability: pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold above which a trace is persisted/logged/etc. Maybe even a count of how many traces I actually want recorded? Some of these controls could be in the UI, or traces made visible in the UI, but I'd happily take a set of dubious cluster settings or table writes to install these to-be-traced-with-some-probability fingerprints, or just vanilla log outputs.

With a given stmt rate (say 1000/s) and a given percentile I'm trying to capture (say p99.9), we have 1 stmt/s in the 99.9th percentile. I need to make this math less handwavey, but I should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling probability lets you control the overhead you're introducing for those statements in aggregate; dialing it up lets you lower S. You might want to do such a thing for infrequently executed statements.

Still not sure how to figure out what statement to trace (we don't capture per-stmt %-iles -- too expensive? Maybe we will in some form after cockroachdb#82473?). Also not sure what we can do better with captured traces than just logging them (stash them under /tracez somehow? Or in some system table capped in # of rows, with row-level TTL to evict old captures and make room for new, more recent ones?). Anyway, this first stab does the simplest thing: ask for all input through shoddy cluster settings and just log stuff.

A quick and dirty attempt to catch an outlier execution for the history table write in TPC-C. Didn't see an overall increase in p99s as a result of the sampling, and it took about 10s to capture this, showing the outlier was due to latch waits.

```sql
> SELECT encode(fingerprint_id, 'hex'),
         (statistics -> 'statistics' ->> 'cnt')::INT AS count,
         metadata ->> 'query' AS query
  FROM system.statement_statistics
  ORDER BY count DESC LIMIT 10;

       encode       | count |                                                        query
--------------------+-------+-----------------------------------------------------------------------------------------------------------------------
  ...
  4e4214880f87d799  |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

> SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
> SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
> SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';
```

```
 0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch
```

Release note: None
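To make the capture-probability math above a little less handwavey, here's a rough back-of-the-envelope sketch (my own, not part of the prototype), assuming each execution of the fingerprint is sampled independently and outliers arrive at a steady rate r: the chance of tracing at least one outlier within S seconds is 1 - (1 - P)^(r*S), which can be solved for the required P at a given confidence.

```go
package main

import (
	"fmt"
	"math"
)

// requiredSamplingProb returns the per-execution sampling probability P needed
// to capture, with probability `confidence`, at least one execution beyond the
// given percentile within `windowSec` seconds, assuming `qps` executions/sec of
// the fingerprint and independent sampling of each execution.
func requiredSamplingProb(qps, percentile, confidence, windowSec float64) float64 {
	outliersPerSec := qps * (1 - percentile) // e.g. 1000/s * 0.001 = 1 outlier/s
	n := outliersPerSec * windowSec          // expected outlier executions in the window
	// P(at least one sampled) = 1 - (1-P)^n >= confidence
	//   =>  P >= 1 - (1-confidence)^(1/n)
	return 1 - math.Pow(1-confidence, 1/n)
}

func main() {
	// 1000 stmt/s, chasing the p99.9 tail, >=95% chance of a capture within 60s.
	fmt.Printf("P >= %.4f\n", requiredSamplingProb(1000, 0.999, 0.95, 60)) // ~0.0487
}
```

With the numbers from the comment (1000 stmt/s, p99.9, 95% confidence, a 60s window) this comes out to P of roughly 0.05, i.e. on the order of 50 traced executions/s of overhead; dialing P down stretches S, which is the trade-off described above.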
With the work underneath #82896 we have manual mechanisms to probabilistically enable tracing for specific fingerprints (the probability is what controls the overhead) and to collect traces when they exceed some specified threshold. This meets the spirit of what this issue was meant to capture, I think, but it's all too manual. We've also developed machinery to automatically recognize what latencies are to be considered outliers for specific fingerprints over at #82473. Is there a way now to combine the two by probabilistically collecting traces and recording them if they're automatically recognized as outliers?

This could realize our desire for "always-on tracing" in that sense and could be immensely powerful, something @tbg, @knz, and I last tried during #55584 but fell short of. Those motivations still hold true. Often we have escalations where, by the time things have escalated, we're debugging some latency event in the past that we didn't have traces for back then. We're stuck trying to observe future traces by mucking with various thresholds in the hope of seeing that incident again. If instead we had some way to continuously collect traces for outlier executions, with high likelihood we'd have captured traces from when the incident occurred, which would make diagnosis much easier, and also self-serve, which should be the goal. We're displaying ever more tracing data in our UI; this could be a nice addition for all kinds of users.

+cc @andreimatei, @abarganier in case this is the kind of thing we could ingest as obs-service events. This is a brain-dump and perhaps not what this issue was intended for, but I'm also hoping we make something of it. Since it bridges two obs teams, I'm struggling to find a better place for this conversation.
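As a minimal sketch of what that combination could look like, purely illustrative and with hypothetical types (`outlierDetector` and `traceSink` stand in for the #82473 machinery and whatever store/obs-service sink we'd use; none of these are real APIs): sample a small fraction of executions per fingerprint with verbose tracing, and only persist the trace if the detector flags the finished execution's latency as an outlier.

```go
package outliertrace

import (
	"math/rand"
	"time"
)

// outlierDetector stands in for the per-fingerprint latency model being built
// in #82473; IsOutlier is a hypothetical method, not the real interface.
type outlierDetector interface {
	IsOutlier(fingerprintID uint64, latency time.Duration) bool
}

// traceSink stands in for wherever captured traces would go (a capped system
// table, /tracez, obs-service events, ...). Also hypothetical.
type traceSink interface {
	Persist(fingerprintID uint64, latency time.Duration, trace string)
}

// maybeRecordOutlierTrace sketches the proposed flow: decide up front (with
// probability prob) whether to run the statement with verbose tracing; once
// the execution finishes, keep the trace only if the detector considers the
// latency an outlier for that fingerprint.
func maybeRecordOutlierTrace(
	fingerprintID uint64,
	prob float64,
	run func(traced bool) (latency time.Duration, trace string),
	det outlierDetector,
	sink traceSink,
) {
	traced := rand.Float64() < prob // sampling bounds the tracing overhead
	latency, trace := run(traced)
	if traced && det.IsOutlier(fingerprintID, latency) {
		sink.Persist(fingerprintID, latency, trace) // keep only the interesting ones
	}
	// Non-outlier (or untraced) executions pay no persistence cost.
}
```

The nice property is that tracing overhead is bounded by the sampling probability while persistence cost is bounded by the outlier rate, which is what would make this viable as always-on.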
Filed #90292.
This issue is a placeholder for the conversations to be had. We'd like to find a good balance point where we can automatically gather useful trace information for "outlier" queries without slowing everything else down.
Andrew:
Andrei:
Irfan:
Jira issue: CRDB-15559