outliers: incorporate tracing information #80666
Internal Slack thread about difficulties in tail execution diagnosis; this will keep coming up in the form of support escalations that are difficult to root-cause. +cc @nkodali, perhaps of interest to you and the KV-obs team given where tracing sits.
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to the following ability: pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold above which a trace is persisted/logged/etc. Maybe even a count of how many traces I actually want recorded? Some of these controls could be in the UI, or traces made visible in the UI, but I'd happily take a set of dubious cluster settings or table writes to install these to-be-traced-with-some-probability fingerprints, or just vanilla log outputs.

With a given stmt rate (say 1000/s) and a given percentile I'm trying to capture (say p99.9), we have 1 stmt/s in the 99.9th percentile. I need to make this math less handwavey, but I should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling probability lets you control the overhead you're introducing for those statements in aggregate; dialing it up lets you lower S. You might want to do such a thing for infrequently executed statements.

Still not sure how to figure out what statement to trace (we don't capture per-stmt %-iles -- too expensive? Maybe we will in some form after cockroachdb#82473?). Also not sure what we can do better with captured traces than just logging them (stash them under /tracez somehow? Or in some system table capped in # of rows, with row-level TTL to evict old captures and make room for new, more recent ones?). Anyway, this first stab does the simplest thing: ask for all input through shoddy cluster settings and just log stuff.

A quick and dirty attempt to catch an outlier execution for the history table write in TPC-C. Didn't see an overall increase in p99s as a result of the sampling, and it took about 10s to capture this, showing the outlier was due to latch waits.

```sql
> SELECT encode(fingerprint_id, 'hex'),
         (statistics -> 'statistics' ->> 'cnt')::INT AS count,
         metadata ->> 'query' AS query
  FROM system.statement_statistics
  ORDER BY count DESC LIMIT 10;

       encode       | count |                                                        query
--------------------+-------+-----------------------------------------------------------------------------------------------------------------------
  ...
  4e4214880f87d799  |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

> SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
> SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
> SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';
```

```
 0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch
```

Release note: None
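To make the capture-probability math above a little less handwavey, here's a rough back-of-the-envelope sketch (my own, not part of the prototype), assuming each execution of the fingerprint is sampled independently and outliers arrive at a steady rate r: the chance of tracing at least one outlier within S seconds is 1 - (1 - P)^(r*S), which can be solved for the required P at a given confidence.

```go
package main

import (
	"fmt"
	"math"
)

// requiredSamplingProb returns the per-execution sampling probability P needed
// to capture, with probability `confidence`, at least one execution beyond the
// given percentile within `windowSec` seconds, assuming `qps` executions/sec of
// the fingerprint and independent sampling of each execution.
func requiredSamplingProb(qps, percentile, confidence, windowSec float64) float64 {
	outliersPerSec := qps * (1 - percentile) // e.g. 1000/s * 0.001 = 1 outlier/s
	n := outliersPerSec * windowSec          // expected outlier executions in the window
	// P(at least one sampled) = 1 - (1-P)^n >= confidence
	//   =>  P >= 1 - (1-confidence)^(1/n)
	return 1 - math.Pow(1-confidence, 1/n)
}

func main() {
	// 1000 stmt/s, chasing the p99.9 tail, >=95% chance of a capture within 60s.
	fmt.Printf("P >= %.4f\n", requiredSamplingProb(1000, 0.999, 0.95, 60)) // ~0.0487
}
```

With the numbers from the comment (1000 stmt/s, p99.9, 95% confidence, a 60s window) this comes out to P of roughly 0.05, i.e. on the order of 50 traced executions/s of overhead; dialing P down stretches S, which is the trade-off described above.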
With the work underneath #82896 we have manual mechanisms to probabilistically enable tracing for specific fingerprints (the probability is what controls the overhead) and to collect traces when they exceed some specified threshold. This meets the spirit of what this issue was meant to capture, I think, but it's all too manual. We've also developed machinery to automatically recognize what latencies are to be considered outliers for specific fingerprints over at #82473. Is there a way now to combine the two by probabilistically collecting traces and recording them if they're automatically recognized as outliers?

This could realize our desire for "always-on tracing" in that sense and could be immensely powerful, something @tbg, @knz, and I last tried during #55584 but fell short of. Those motivations still hold true. Often we have escalations where, by the time things have escalated, we're debugging some latency event in the past that we didn't have traces for back then. We're stuck trying to observe future traces by mucking with various thresholds in the hope of seeing that incident again. If instead we had some way to continuously collect traces for outlier executions, with high likelihood we'd have captured traces from when the incident occurred, which would make diagnosis much easier, and also self-serve, which should be the goal. We're displaying ever more tracing data in our UI; this could be a nice addition for all kinds of users.

+cc @andreimatei, @abarganier in case this is the kind of thing we could ingest as obs-service events. This is a brain-dump and perhaps not what this issue was intended for, but I'm also hoping we make something of it. Since it bridges two obs teams, I'm struggling to find a better place for this conversation.
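As a minimal sketch of what that combination could look like, purely illustrative and with hypothetical types (`outlierDetector` and `traceSink` stand in for the #82473 machinery and whatever store/obs-service sink we'd use; none of these are real APIs): sample a small fraction of executions per fingerprint with verbose tracing, and only persist the trace if the detector flags the finished execution's latency as an outlier.

```go
package outliertrace

import (
	"math/rand"
	"time"
)

// outlierDetector stands in for the per-fingerprint latency model being built
// in #82473; IsOutlier is a hypothetical method, not the real interface.
type outlierDetector interface {
	IsOutlier(fingerprintID uint64, latency time.Duration) bool
}

// traceSink stands in for wherever captured traces would go (a capped system
// table, /tracez, obs-service events, ...). Also hypothetical.
type traceSink interface {
	Persist(fingerprintID uint64, latency time.Duration, trace string)
}

// maybeRecordOutlierTrace sketches the proposed flow: decide up front (with
// probability prob) whether to run the statement with verbose tracing; once
// the execution finishes, keep the trace only if the detector considers the
// latency an outlier for that fingerprint.
func maybeRecordOutlierTrace(
	fingerprintID uint64,
	prob float64,
	run func(traced bool) (latency time.Duration, trace string),
	det outlierDetector,
	sink traceSink,
) {
	traced := rand.Float64() < prob // sampling bounds the tracing overhead
	latency, trace := run(traced)
	if traced && det.IsOutlier(fingerprintID, latency) {
		sink.Persist(fingerprintID, latency, trace) // keep only the interesting ones
	}
	// Non-outlier (or untraced) executions pay no persistence cost.
}
```

The nice property is that tracing overhead is bounded by the sampling probability while persistence cost is bounded by the outlier rate, which is what would make this viable as always-on.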
Filed #90292.
This issue is a placeholder for the conversations to be had. We'd like to find a good balance point where we can automatically gather useful trace information for "outlier" queries without slowing everything else down.
Andrew:
Andrei:
Irfan:
Jira issue: CRDB-15559