outliers: incorporate tracing information #80666

Closed
matthewtodd opened this issue Apr 27, 2022 · 3 comments
Labels
A-kv-observability A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@matthewtodd
Contributor

matthewtodd commented Apr 27, 2022

This issue is a placeholder for the conversations to be had. We'd like to find a good balance point where we can automatically gather useful trace information for "outlier" queries without slowing everything else down.

Andrew:

andrei, we’ve been talking a bit about this SQL outliers thing that todd is working on, and one immediate thing that came up from a previous conversation I had with arul was that it’d be nice to get some tracing information from a slow query.

You and I have long talked about the value of “retrospective” tracing where you try to trace everything in a cheap way and then, when some interesting condition occurs, go back and fish that info out of some ring-buffers or something.

The question I have is: can we instead do something cheap where we don’t enable tracing out of the box, but instead shove some timestamp into the context and, once a certain duration has passed since that timestamp, start accumulating tracing information?

I imagine there are some technical hurdles, but it feels like just having the tail of a trace for an outlier would be illuminating in many circumstances. Potentially we could also, at that point, dump any accumulated tracez info that we might have.
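
A minimal sketch of that idea, using a hypothetical, simplified span type rather than CockroachDB's actual tracing API: stash a timestamp when the operation starts, and only begin keeping trace events once the operation has already run longer than a threshold, so fast queries pay almost nothing and outliers still end up with the tail of their trace.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // lazySpan is a hypothetical, simplified stand-in for a tracing span
    // that starts out not recording. It remembers when the operation began
    // and only starts accumulating events once the operation has already
    // exceeded a threshold.
    type lazySpan struct {
        mu        sync.Mutex
        start     time.Time     // timestamp stashed at span creation
        threshold time.Duration // how long before recording kicks in
        events    []string
    }

    func startLazySpan(threshold time.Duration) *lazySpan {
        return &lazySpan{start: time.Now(), threshold: threshold}
    }

    // record keeps the event only if the operation has been running longer
    // than the threshold; earlier events are dropped cheaply.
    func (s *lazySpan) record(event string) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if time.Since(s.start) < s.threshold {
            return
        }
        s.events = append(s.events, event)
    }

    func main() {
        sp := startLazySpan(50 * time.Millisecond)
        sp.record("acquiring latches") // too early: dropped
        time.Sleep(75 * time.Millisecond)
        sp.record("waiting on lock") // past the threshold: kept
        fmt.Println(sp.events)
    }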

Andrei:

There are various things I believe we could do, including what you're suggesting: ask a span to start recording at some point in the future, if it's still around by then. At least half the work for any feature in this area is figuring out how the trace is collected and persisted or presented.

Another thing we could do is always enable structured recording (as opposed to verbose recording); I've been wondering what the cost of that is. We could combine this with more smarts about what we do when a trace exceeds its limits (on the number of messages per span and the number of children per span): instead of simply throwing out data, we should be aggregating it. Aditya is adding some capabilities to the tracing library related to doing such aggregation.

There were two things on my list to discuss with the KV Obs team, which is nominally taking over the tracing library. Number one is making the trace-registry info collection done by /tracez work across the cluster. Number two is this issue of SQL query tail latency and getting some exemplars. But perhaps SQL Obs is a better victim.
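
A rough sketch of the aggregate-instead-of-drop idea, assuming an invented spanEvents buffer rather than the tracing library's actual structures: once the per-span event limit is hit, overflowing events are folded into per-event-type counts rather than discarded.

    package main

    import "fmt"

    // spanEvents is a hypothetical per-span event buffer with a hard cap.
    // Once the cap is reached, instead of throwing newer events away, it
    // aggregates them into per-event-type counts so the trace still conveys
    // what happened, just with less detail.
    type spanEvents struct {
        limit      int
        events     []string
        aggregated map[string]int // event type -> count of overflowed events
    }

    func newSpanEvents(limit int) *spanEvents {
        return &spanEvents{limit: limit, aggregated: make(map[string]int)}
    }

    func (se *spanEvents) add(eventType, msg string) {
        if len(se.events) < se.limit {
            se.events = append(se.events, msg)
            return
        }
        // Over the limit: keep an aggregate instead of dropping the data.
        se.aggregated[eventType]++
    }

    func main() {
        se := newSpanEvents(2)
        se.add("latch-wait", "waiting to acquire read latch ...")
        se.add("lock-scan", "scanning lock table for conflicting locks")
        se.add("latch-wait", "waiting to acquire read latch ...")
        se.add("latch-wait", "waiting to acquire read latch ...")
        fmt.Println(se.events)     // first two kept verbatim
        fmt.Println(se.aggregated) // map[latch-wait:2]
    }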

Irfan:

The story I think I’m after is: I’m seeing a high p99.9 for my txns. I might be able to figure out which one if I’m maintaining good client-side metrics for each point of access, but it would also be useful for the DB to tell me what the latency profiles are for individual statements and where the tail is sitting (is this possible today?). Once I have that, I want to look at a verbose trace for a statement execution that took longer than some threshold, to see whether it’s due to waiting on locks, being blocked on lease acquisitions, or something else.

Jira issue: CRDB-15559

@matthewtodd matthewtodd added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-observability A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. labels Apr 27, 2022
@matthewtodd matthewtodd self-assigned this Apr 27, 2022
@matthewtodd matthewtodd changed the title Incorporate tracing information into outliers reporting outliers: incorporate tracing information May 5, 2022
@matthewtodd matthewtodd added A-sql-observability Related to observability of the SQL layer and removed A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. labels May 5, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@irfansharif
Contributor

irfansharif commented Jun 2, 2022

Internal Slack thread discussing the difficulties of diagnosing tail executions; this will keep coming up in the form of support escalations that are difficult to root-cause. +cc @nkodali, perhaps of interest to you and the KV-obs team given where tracing sits.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 10, 2022
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to
the following ability:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold for
  which a trace is persisted/logged/etc. Maybe even a count of how many
  traces I actually want recorded? Some of these controls could be in the
  UI, or traces made visible in the UI, but I’d happily take a set of
  dubious cluster settings or table writes to install these
  to-be-traced-with-some-probability fingerprints, or just vanilla log
  outputs. With a given stmt rate (say 1000/s) and a given percentile I’m
  trying to capture (say p99.9), we have 1 stmt/s in the 99.9th
  percentile. I need to make this math less handwavey (a worked sketch
  follows after the commit message), but I should be able to set a
  sampling probability P such that, with high likelihood (>95%), we
  capture at least one trace over the next S seconds. The sampling
  rate lets you control the overhead you're introducing for those
  statements in aggregate; dialing it up higher lets you lower S. You
  might want to do such a thing for infrequently executed statements.

Still not sure how to figure out which statement to trace (we don't
capture per-stmt %-iles -- too expensive? Maybe we will in some form
after cockroachdb#82473?). Also not sure what we can do better than
just logging the captured traces (stash them under /tracez somehow? or
put them in a system table capped in # of rows, with row-level TTL to
evict old captures and make room for newer ones?). Anyway, this first
stab does the simplest thing: ask for all input through shoddy cluster
settings and just log stuff. Below is a quick and dirty attempt to
catch an outlier execution for the history table write in TPC-C. I
didn't see an overall increase in p99s as a result of the sampling,
and it took about 10s to capture the trace below, which showed the
latency was due to latch waits.

    > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY COUNT DESC limit 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';

 0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Release note: None
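
To make the sampling math in the commit message above concrete (a back-of-the-envelope sketch, not code from the prototype): if outlier-percentile statements arrive at roughly r per second and each execution is traced independently with probability P, the chance of capturing at least one trace within S seconds is 1 - (1 - P)^(r*S); solving for a >=95% likelihood gives P = 1 - 0.05^(1/(r*S)).

    package main

    import (
        "fmt"
        "math"
    )

    // minSamplingProbability returns the smallest per-execution tracing
    // probability P such that, with outlier statements arriving at
    // ratePerSec, we capture at least one trace within windowSec seconds
    // with the given confidence. Derivation:
    // P(no capture) = (1-P)^(ratePerSec*windowSec) <= 1-confidence.
    func minSamplingProbability(ratePerSec, windowSec, confidence float64) float64 {
        n := ratePerSec * windowSec // expected outlier executions in the window
        return 1 - math.Pow(1-confidence, 1/n)
    }

    func main() {
        // Numbers from the commit message: ~1000 stmt/s overall, chasing
        // p99.9, so roughly 1 outlier stmt/s. As an illustrative choice,
        // capture a trace within 60s with >=95% likelihood:
        p := minSamplingProbability(1.0, 60, 0.95)
        fmt.Printf("sample with probability >= %.3f\n", p) // ~0.049
    }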
@irfansharif irfansharif added the O-postmortem Originated from a Postmortem action item. label Jun 14, 2022
@irfansharif irfansharif removed the O-postmortem Originated from a Postmortem action item. label Jun 14, 2022
@matthewtodd matthewtodd removed their assignment Aug 26, 2022
@irfansharif
Contributor

irfansharif commented Oct 12, 2022

With the work underneath #82896 we have manual mechanisms to probabilistically enable tracing for specific fingerprints (the probability is what controls the overhead) and to collect traces when an execution exceeds some specified threshold. This meets the spirit of what this issue was meant to capture, I think, but it's all too manual. We've also developed machinery, over at #82473, to automatically recognize which latencies should be considered outliers for specific fingerprints.

Is there a way now to combine the two, probabilistically collecting traces and persisting them only if the execution is automatically recognized as an outlier? This could realize our desire for "always-on tracing" in that sense and could be immensely powerful, something @tbg, @knz, and I last tried during #55584 but fell short of. Those motivations still hold true. Often, by the time something has escalated, we're debugging a latency event in the past that we don't have traces for, and we're stuck trying to observe future occurrences by mucking around with various thresholds in the hope of seeing the incident again. If instead we continuously collected traces for outlier executions, with high likelihood we'd already have traces from when the incident occurred, which would make diagnosis much easier and also self-serve, which should be the goal. We're displaying ever more tracing information in our UI; this could be a nice addition for all kinds of users (a rough sketch of how the pieces might compose follows at the end of this comment). +cc @andreimatei, @abarganier in case this is the kind of thing we could ingest as obs-service events.

This is a brain-dump and perhaps not what this issue was intended for, but I'm also hoping we make something of it. Since it bridges two obs-teams, I'm struggling to find a better place for this conversation.
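
A hedged sketch of how the two pieces might compose; the detector interface and names below are invented for illustration, not the actual interfaces from #82896 or #82473. The idea: decide up front (with probability p) whether to trace an execution, then consult the outlier detector when it finishes and keep the trace only if the execution was an outlier.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // outlierDetector stands in for the per-fingerprint latency model being
    // built under cockroachdb#82473; the interface is invented here.
    type outlierDetector interface {
        isOutlier(fingerprint string, latency time.Duration) bool
    }

    // fixedThreshold is a toy detector: any execution slower than the
    // threshold counts as an outlier.
    type fixedThreshold time.Duration

    func (t fixedThreshold) isOutlier(_ string, latency time.Duration) bool {
        return latency > time.Duration(t)
    }

    // maybeTraceAndKeep decides up front, with probability p, whether to
    // trace an execution (the sampling controls the tracing overhead), runs
    // it, and keeps the trace only if the detector flags the execution as an
    // outlier. run returns the collected trace (empty if untraced) and the
    // observed latency.
    func maybeTraceAndKeep(
        fingerprint string, p float64, det outlierDetector,
        run func(traced bool) (trace string, latency time.Duration),
    ) (kept string, ok bool) {
        traced := rand.Float64() < p
        trace, latency := run(traced)
        if traced && det.isOutlier(fingerprint, latency) {
            return trace, true // persist, log, or surface in the UI
        }
        return "", false // discard: either untraced or not an outlier
    }

    func main() {
        det := fixedThreshold(90 * time.Millisecond)
        trace, ok := maybeTraceAndKeep("4e4214880f87d799", 0.01, det,
            func(traced bool) (string, time.Duration) {
                // Pretend this execution was slow and, if traced, produced
                // a verbose recording.
                if traced {
                    return "waiting to acquire read latch ...", 100 * time.Millisecond
                }
                return "", 100 * time.Millisecond
            })
        fmt.Println(ok, trace)
    }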

@irfansharif
Contributor

Filed #90292.
