Surface total time and contention time for each plan step in EXPLAIN ANALYZE #64200
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-sql-observability

kevin-v-ngo added the C-enhancement and T-sql-observability labels on Apr 26, 2021
craig bot pushed a commit that referenced this issue on Jun 8, 2021:
65559: tracing,tracingservice: adds a trace service to pull clusterwide trace spans r=irfansharif,abarganier a=adityamaru

Previously, every node in the cluster had a local inflight span registry that was aware of all the spans rooted on that particular node. Child spans of a given traceID executing on a remote node would only become visible to the local registry once execution completed and the span pushed its recordings over gRPC to the "client" node.

This change introduces a `tracingservice` package, which contains a gRPC service for remote inflight span access. It is used to pull inflight spans from all CockroachDB nodes. Each node runs a trace service, which serves the inflight spans from the local span registry on that node. Each node also has a trace client dialer, which uses the nodedialer to connect to another node's trace service and access its inflight spans. The trace client dialer is backed by a remote trace client or a local trace client, which serves as the point of entry to this service. Both clients support the `TraceClient` interface, which includes the following functionality:

- GetSpanRecordings

The spans for a traceID are sorted by `StartTime` before they are returned. The per-node trace dialer has yet to be hooked up to an appropriate location, depending on where we intend to use it.

Resolves: #60999
Informs: #64992

Release note: None
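To make the shape of that interface concrete, here is a minimal Go sketch of a trace client along the lines the commit message describes. Only the `GetSpanRecordings` method name and the sort-by-`StartTime` contract come from the commit message; every other name and signature here is an assumption, not CockroachDB's actual API:

```go
package tracingservice

import (
	"context"
	"sort"
	"time"
)

// RecordedSpan is a simplified, hypothetical stand-in for the recorded-span
// payload a node's trace service would return; the real type is a protobuf.
type RecordedSpan struct {
	TraceID   uint64
	SpanID    uint64
	Operation string
	StartTime time.Time
	Duration  time.Duration
}

// TraceClient is the point of entry described in the commit message: both
// the local and the remote client implement it. GetSpanRecordings returns
// the inflight spans for a trace, pulled from a node's span registry.
type TraceClient interface {
	GetSpanRecordings(ctx context.Context, traceID uint64) ([]RecordedSpan, error)
}

// sortSpansByStartTime mirrors the documented contract that the spans for a
// traceID are sorted by StartTime before they are returned to the caller.
func sortSpansByStartTime(spans []RecordedSpan) {
	sort.Slice(spans, func(i, j int) bool {
		return spans[i].StartTime.Before(spans[j].StartTime)
	})
}
```

Under this design, a remote implementation would dial the target node (per the commit message, via the nodedialer) and issue the gRPC call, while a local implementation would read the node's own span registry directly.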
66149: cloud: fix gcs to use the resuming reader r=dt a=adityamaru

This change does a few things:

1. gcs_storage was not returning a resuming reader, as a result of which the resuming reader's Read method, which contains the logic to retry on certain kinds of errors, was never invoked.
2. Changes the resuming reader to take a storage-specific function that defines which errors are retryable (sketched after the commit message below). All storage providers use the same deciding function at the moment, so behavior is unchanged.

Release note: None

66152: storage: Disable read sampling and read compactions r=sumeerbhola a=itsbilal

Read-triggered compactions are already disabled on 21.1. As the fixes to address known shortcomings with read-triggered compactions are a bit involved (see cockroachdb/pebble#1143), disable the feature on master until that issue is fixed. That prevents this known issue from getting in the way of performance experiments.

Release note: None.

66155: sql: drop "cluster" from EXPLAIN ANALYZE to improve readability r=maryliag a=maryliag

Remove the word "cluster" from "cluster nodes" and "cluster regions" in EXPLAIN ANALYZE output to improve readability.

Release note: None

66157: sql: add time & contention time to EXPLAIN ANALYZE r=matthewtodd a=matthewtodd

The new fields are labeled `KV time` and `KV contention time`:

```
> EXPLAIN ANALYZE
-> UPDATE users SET name = 'Bob Loblaw'
-> WHERE id = '32a962b7-8440-4b81-97cd-a7d7757d6eac';
                                             info
--------------------------------------------------------------------------------------------
  planning time: 353µs
  execution time: 3ms
  distribution: local
  vectorized: true
  rows read from KV: 52 (5.8 KiB)
  cumulative time spent in KV: 2ms
  maximum memory usage: 60 KiB
  network usage: 0 B (0 messages)
  cluster regions: us-east1

  • update
  │ cluster nodes: n1
  │ cluster regions: us-east1
  │ actual row count: 1
  │ table: users
  │ set: name
  │ auto commit
  │
  └── • render
      │ cluster nodes: n1
      │ cluster regions: us-east1
      │ actual row count: 1
      │ estimated row count: 0
      │
      └── • filter
          │ cluster nodes: n1
          │ cluster regions: us-east1
          │ actual row count: 1
          │ estimated row count: 0
          │ filter: id = '32a962b7-8440-4b81-97cd-a7d7757d6eac'
          │
          └── • scan
                cluster nodes: n1
                cluster regions: us-east1
                actual row count: 52
                KV time: 2ms
                KV contention time: 0µs
                KV rows read: 52
                KV bytes read: 5.8 KiB
                estimated row count: 50 (100% of the table; stats collected 3 minutes ago)
                table: users@primary
                spans: FULL SCAN
(42 rows)

Time: 4ms total (execution 4ms / network 0ms)
```

Resolves #64200

Release note (sql change): EXPLAIN ANALYZE output now includes, for each plan step, the total time spent waiting for KV requests as well as the total time those KV requests spent contending with other transactions.

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
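Returning to the resuming-reader change in 66149 above: the pattern is a reader that re-opens the underlying stream at the current byte offset and consults a storage-specific predicate to decide whether a failed read should be retried. The following Go sketch illustrates that idea; the type and field names (`ResumingReader`, `Open`, `RetryOnErr`) are assumptions for illustration, and the real CockroachDB cloud-storage code differs in detail:

```go
package cloud

import (
	"context"
	"io"
)

// ResumingReader wraps a stream that can be re-opened at a byte offset and
// retries reads when a provider-specific predicate says the error is
// transient. All names here are illustrative.
type ResumingReader struct {
	Ctx        context.Context
	Open       func(ctx context.Context, offset int64) (io.ReadCloser, error)
	RetryOnErr func(error) bool // storage-specific retry decision
	reader     io.ReadCloser
	pos        int64
}

// Read implements io.Reader. On a retryable error it drops the current
// stream and re-opens it at the current offset instead of failing.
func (r *ResumingReader) Read(p []byte) (int, error) {
	for {
		if r.reader == nil {
			rc, err := r.Open(r.Ctx, r.pos)
			if err != nil {
				return 0, err
			}
			r.reader = rc
		}
		n, err := r.reader.Read(p)
		r.pos += int64(n)
		if err == nil || err == io.EOF || !r.RetryOnErr(err) {
			return n, err
		}
		// Retryable error: close the broken stream and loop to re-open at
		// r.pos. Return any bytes already read first, per the io.Reader
		// contract; the retry happens on the next Read call.
		r.reader.Close()
		r.reader = nil
		if n > 0 {
			return n, nil
		}
	}
}
```

Because the predicate is injected, each storage provider can classify its own transient errors while sharing the resume logic; per the commit message, all providers currently pass the same deciding function, so behavior is unchanged.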
Issue description:

We introduced a new view for EXPLAIN ANALYZE and received feedback asking us to surface the total time spent and the contention time for each plan step. These metrics are already available in the DistSQL plan viewer for each plan node. Please add time and contention time to EXPLAIN ANALYZE as well.