
jobs: Jobs Observability tracking issue #64992

Closed
adityamaru opened this issue May 11, 2021 · 3 comments
Assignees
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@adityamaru
Contributor

adityamaru commented May 11, 2021

This is a tracking issue for the investigation and work that is being done to improve observability into job execution. The initial brainstorming document can be found at https://docs.google.com/document/d/19VEvNNW8wdep5cYe4TJQY5_WvrmDqTgnh3V2Bq33gag/edit?usp=sharing.

Tracing related work

  • Aggregating Finish()'ed and in-flight spans from all nodes in the cluster.

    • Issue: sql: allow inflight traces to be collected across the cluster #60999
    • Job execution is distributed across nodes in the cluster. A consistent view of what a job is doing, or has done during execution, is only possible if we are able to aggregate the spans in the execution path across the cluster.
    • Recordings can be split between persistent storage and in-memory data structures; care must be taken when thinking about aggregation scan semantics.
    • Could we get away with not relying on a service that fetches inflight spans from the entire cluster? If every node flushed to the system table often enough, then our introspection into the job could just rely on looking at the system table. This might simplify things by not relying on an RPC fanout.
  • Mutable/ephemeral payloads for more observability into operations that can take a long time or hang indefinitely.

  • Filtering based on payload type to prevent job recordings from spamming traces included in reports (see the sketch after this list).

    • Jobs are likely to record several structured entries during the course of their execution.
    • Audit places in the code where we pull recordings to see if the high write rate of jobs could result in unrelated traces becoming hard to parse.
    • Investigate whether job-related structured entries should only be exposed via job-specific tools and builtins.
  • Memory footprint of long-lived spans (e.g. CDC, which could run for the lifetime of the cluster).

    • Issue: tracing: memory budgeting #59188
    • Ties into how often we drain the spans and write the information to persistent storage.
    • Would prefer to eagerly flush rather than drop traces during job execution.
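
A rough sketch of what the payload-type filter could look like (editorial illustration; the `StructuredRecord` type and the `cockroach.sql.jobs.` type-URL prefix below are made-up stand-ins, not existing tracing types): job-emitted structured entries are dropped from a general-purpose report and left to job-specific tools.

```
// Package-level sketch: filter job-emitted structured entries out of a
// report so high-volume job recordings do not drown out unrelated traces.
package main

import (
    "fmt"
    "strings"
)

// StructuredRecord is a stand-in for a span's structured entry; the real
// type in the tracing package differs.
type StructuredRecord struct {
    PayloadType string // e.g. "cockroach.sql.jobs.BackupProgress" (hypothetical)
    Payload     string
}

// filterForReport drops job-specific payloads; they would instead be exposed
// via job-specific builtins and tools.
func filterForReport(records []StructuredRecord) []StructuredRecord {
    var out []StructuredRecord
    for _, r := range records {
        if strings.HasPrefix(r.PayloadType, "cockroach.sql.jobs.") {
            continue // surfaced only via job-specific introspection
        }
        out = append(out, r)
    }
    return out
}

func main() {
    recs := []StructuredRecord{
        {PayloadType: "cockroach.sql.jobs.BackupProgress", Payload: "exported span 17"},
        {PayloadType: "cockroach.kv.SlowRequest", Payload: "request outstanding 12s"},
    }
    fmt.Println(filterForReport(recs)) // only the non-job entry remains
}
```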

Jobs related work

  • Persistent storage to deal with node shutdown and stopped SQL pods, via asynchronous flushing of tracing spans from RAM to a system table.

    • Enables job execution to be inspectable even after the job has reached a terminal state.
    • Helps bound the memory overhead of the in-memory data structures in long-lived spans, which in turn prevents dropping of future recordings.
    • Presumably, every registry will be responsible for periodic flushing; therefore, the organization and aggregation of entries in the system table will need to be worked out.
  • Builtins, tables, and introspection tools built on top of the system table and inflight span aggregation infrastructure.

    • A job could have both persisted and inflight spans, so we must scan the system table for spans related to the job before aggregating its inflight spans (see the sketch below).
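
A rough sketch of that read path, with made-up types (the real system table schema and tracing types are TBD): persisted rows are scanned first, then inflight spans are merged in, deduplicating on span ID so a span flushed mid-scan is not counted twice.

```
// Package-level sketch of combining persisted and inflight spans for a job.
package jobtrace

import "sort"

// SpanRecord is a simplified stand-in for a span recording row.
type SpanRecord struct {
    SpanID     uint64
    StartNanos int64
    Message    string
}

// collectJobTrace merges records already flushed to the (hypothetical)
// system table with inflight records fetched from the cluster.
func collectJobTrace(persisted, inflight []SpanRecord) []SpanRecord {
    seen := make(map[uint64]struct{}, len(persisted))
    out := make([]SpanRecord, 0, len(persisted)+len(inflight))
    for _, r := range persisted {
        seen[r.SpanID] = struct{}{}
        out = append(out, r)
    }
    for _, r := range inflight {
        if _, ok := seen[r.SpanID]; ok {
            continue // already flushed to the system table
        }
        out = append(out, r)
    }
    // Present a single chronological view of the job's execution.
    sort.Slice(out, func(i, j int) bool { return out[i].StartNanos < out[j].StartNanos })
    return out
}
```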

Epic CRDB-7076

Jira issue: CRDB-7375

@adityamaru adityamaru added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label May 11, 2021
@adityamaru
Contributor Author

Created this issue as an attempt to scope out the work stemming from the comments on the document. I would guess that some of these tasks will require their own design documents once they are better defined.

@adityamaru
Contributor Author

Summarizing a discussion the BulkIO team had regarding our next steps:

The problem can be broken down into three categories, listed in order of priority for the team:

  1. Inspect what a job is doing at the moment.

Solving this problem is the team's first priority, and tracing can help us here. At a point in time, a user should be able to fetch all inflight spans for a job that is currently executing. This involves fetching inflight spans from the node on which the job was adopted, the nodes on which the job planned its processors, and the nodes serving the KV requests issued by those processors. Doing so will allow engineers to identify what each node is doing at a point in time, and understand whether the job is slow, hanging on a request, or happily executing.

  2. Expose information to the end-user.

This is where we might consider introducing a system table for persisting information that has been aggregated during job execution. We do not believe this feature will rely on tracing directly, because trace events are not meaningful to expose to the end-user. BulkIO will likely run custom aggregation and parsing of information it receives during job execution, and identify what is notable to expose to the end-user.
Jobs and DistSQL already rely on remote nodes sending metas to the coordinator node. It is likely that we will send execution information via metas to the coordinator node, after which the aggregator will parse this information and persist notable events in the system table. We might use tracing to gather this information during execution (a sketch of this flow follows this list).

  3. Persist all trace events during job execution.

This would be really nice to have, as it would persist historical information about job execution, but the team does not think that a system table is the answer. We have seen the system.jobs table grow to hundreds of gigabytes and, considering the volume of trace events that a job could emit, we might need to think about persisting this information elsewhere (logs?). BulkIO is unsure whether this is in scope for the team's milestones, or whether it would be a valuable feature for CRDB infrastructure in general.
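
For the second category, a very rough sketch of the meta-driven flow described above (all types here are hypothetical placeholders, not existing DistSQL or jobs APIs): processors piggyback execution details on the metadata they already return, and a coordinator-side aggregator retains only the notable events for persistence.

```
// Package-level sketch of a coordinator-side aggregator for job execution metas.
package jobaggregator

import "time"

// ExecutionMeta is a stand-in for the execution details a processor might
// piggyback on its DistSQL metadata; the real wire format would differ.
type ExecutionMeta struct {
    NodeID   int32
    Event    string
    Duration time.Duration
}

// Aggregator collects per-node events and decides what is worth persisting.
type Aggregator struct {
    slowThreshold time.Duration
    notable       []ExecutionMeta
}

// Ingest keeps only events that are interesting enough to show the end-user.
func (a *Aggregator) Ingest(m ExecutionMeta) {
    if m.Duration >= a.slowThreshold {
        a.notable = append(a.notable, m)
    }
}

// FlushNotable persists the retained events, e.g. via an INSERT into a
// (hypothetical) job-info system table supplied by the caller.
func (a *Aggregator) FlushNotable(persist func([]ExecutionMeta) error) error {
    if len(a.notable) == 0 {
        return nil
    }
    err := persist(a.notable)
    a.notable = nil
    return err
}
```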

@adityamaru adityamaru self-assigned this May 17, 2021
craig bot pushed a commit that referenced this issue May 18, 2021
64977: colexec: implement vectorized ntile window function r=DrewKimball a=DrewKimball

This patch implements the `ntile` window function in the vectorized
engine. `ntile` takes in an integer argument `num_buckets` and then
distributes all rows in a partition equally between the buckets,
outputting the bucket number for each row.

In the vectorized implementation, batches are buffered until the end
of a partition is reached, at which point the `ntile` bucket values can
be calculated. The batches are emitted in a streaming fashion; as soon
as a batch is fully processed, it is returned and work is paused until
the next call to `Next()`.

See #37035

Release note (sql change): the vectorized engine now supports the ntile window function.
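
For reference, a minimal row-at-a-time sketch of the `ntile` bucket-assignment semantics (not the vectorized implementation; the standard SQL behavior where the first `numRows % numBuckets` buckets receive one extra row is assumed):

```
// Package-level sketch of ntile bucket assignment.
package ntile

// bucketFor returns the 1-based ntile bucket for rowIdx (0-based) in a
// partition of numRows rows split into numBuckets buckets.
func bucketFor(rowIdx, numRows, numBuckets int) int {
    base := numRows / numBuckets   // minimum rows per bucket
    extra := numRows % numBuckets  // first `extra` buckets get one more row
    boundary := (base + 1) * extra // rows covered by the larger buckets
    if rowIdx < boundary {
        return rowIdx/(base+1) + 1
    }
    if base == 0 {
        // Fewer rows than buckets: every row already fell in the branch above.
        return numBuckets
    }
    return extra + (rowIdx-boundary)/base + 1
}
```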

65322: jobs: surface the trace_id linked to job execution r=adityamaru a=adityamaru

Please see individual commit messages.

Informs: #64992

65415: authors: add annezhu98 to authors r=annezhu98 a=annezhu98

Release note: None

65418: authors: add <IsaDash> to authors r=IsaDash a=IsaDash

Release note: None

Co-authored-by: Drew Kimball <drewkimball@Drews-MacBook-Pro.local>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Anne Zhu <anne.zhu@cockroachlabs.com>
Co-authored-by: Isa Dash <isa.dash@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jun 1, 2021
65324: cli: add cockroach debug job-trace r=pbardea,knz a=adityamaru

This change adds a new cli command - `cockroach debug job-trace`
that can be used to grab information from inflight trace spans
associated with the execution of a particular job.
The command takes a `jobID` and a file destination to which the
information will be dumped. It also requires a `--url` pointing to
the node from which to pull the inflight spans.

In the future we will have a means of pulling inflight spans from
the cluster, but at the moment this command must be run against every
node to get a complete picture of job execution.

Informs: #64992

Release note (cli change): Adds a `cockroach debug job-trace` command
that takes two arguments, a <jobID> and a file destination, along with
a `--url` pointing to the node on which to execute this command.
The command pulls information about inflight trace spans associated
with the job and dumps it to the file destination.


Co-authored-by: Aditya Maru <adityamaru@gmail.com>
craig bot pushed a commit that referenced this issue Jun 7, 2021
66089: cli: order span payloads by start_time for readability r=knz a=adityamaru

During job execution, there can be several root spans on a node,
e.g. the resumer and processor spans.

While the recordings within a span are sorted by start time, the root
spans are stored in an unordered map, so the recordings across root
spans might not be sorted by start_time. When viewing the output files
for a job, this results in the processor span printing its payload
before the job payload, which is not intuitive.

We might change how we display recordings in the future, but for the
time being this fix makes the "ordering" of events deterministic.

Informs: #64992

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
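
A simplified sketch of the ordering fix (types here are stand-ins for the tracing package's): recordings pulled out of the unordered map of root spans are sorted by the root span's start time before being printed.

```
// Sketch: sort per-root-span recordings deterministically by start time.
package main

import (
    "fmt"
    "sort"
    "time"
)

type RecordedSpan struct {
    Operation string
    StartTime time.Time
}

// A Recording is the chain of spans rooted at its first element.
type Recording []RecordedSpan

func sortRecordings(recs []Recording) {
    sort.Slice(recs, func(i, j int) bool {
        return recs[i][0].StartTime.Before(recs[j][0].StartTime)
    })
}

func main() {
    now := time.Now()
    recs := []Recording{
        {{Operation: "backup processor", StartTime: now.Add(time.Second)}},
        {{Operation: "backup resumer", StartTime: now}},
    }
    sortRecordings(recs)
    fmt.Println(recs[0][0].Operation, "is printed before", recs[1][0].Operation)
}
```
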
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 8, 2021
…rded

This change wraps the existing Structured record in a proto that
has a `time` field, indicating the time at which the record was recorded.

This is similar to how we wrap free-form logs in LogRecord and will
help interleave child recordings with parent recordings at the time
when they occurred, rather than at the span start time.

This change was motivated by the fact that jobs are going to be using
structured recordings to indicate which phase of execution they are
currently in. When consuming this information, it is more intuitive to
see interleaved events, rather than all events for the parent span
followed by all events for the child span.

Currently we see:
```
ProcSentExportRequest <- processor level
ProcReceivedExportResponse
ExportRequestEvalStarted <- export.go level
ExportRequestEvalFinished
```

We would like to see:
```
ProcSentExportRequest <- processor level
ExportRequestEvalStarted <- export.go level
ExportRequestEvalFinished
ProcReceivedExportResponse
```

Informs: cockroachdb#64992
craig bot pushed a commit that referenced this issue Jun 8, 2021
65559: tracing,tracingservice: adds a trace service to pull clusterwide trace spans r=irfansharif,abarganier a=adityamaru

Previously, every node in the cluster had a local inflight span
registry that was aware of all the spans rooted on that particular
node. Child spans of a given traceID executing on a remote node would
only become visible to the local registry once execution completed and
the span pushed its recordings over gRPC to the "client" node.

This change introduces a `tracingservice` package.
Package tracingservice contains a gRPC service to be used for
remote inflight span access.

It is used for pulling inflight spans from all CockroachDB nodes.
Each node will run a trace service, which serves the inflight spans from the
local span registry on that node. Each node will also have a trace client
dialer, which uses the nodedialer to connect to another node's trace service,
and access its inflight spans. The trace client dialer is backed by a remote
trace client or a local trace client, which serves as the point of entry to this
service. Both clients support the `TraceClient` interface, which includes the
following functionalities:
  - GetSpanRecordings

The spans for a traceID are sorted by `StartTime` before they are
returned. The per-node trace dialer has yet to be hooked up to an
appropriate location depending on where we intend to use it.

Resolves: #60999
Informs: #64992

Release note: None
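
An illustrative sketch of the fan-out described in #65559, with hypothetical interfaces (the real `tracingservice` signatures may differ): each node's trace client exposes `GetSpanRecordings`, and the collector dials every live node and merges the results by start time.

```
// Sketch of a cluster-wide inflight-span fan-out.
package tracecollect

import (
    "context"
    "sort"
    "time"
)

type RecordedSpan struct {
    TraceID   uint64
    StartTime time.Time
    Operation string
}

// TraceClient mirrors the interface sketched in the commit message; the real
// signature may differ.
type TraceClient interface {
    GetSpanRecordings(ctx context.Context, traceID uint64) ([]RecordedSpan, error)
}

// Dialer is a stand-in for the nodedialer-backed trace client dialer.
type Dialer interface {
    LiveNodes(ctx context.Context) ([]int32, error)
    Dial(ctx context.Context, nodeID int32) (TraceClient, error)
}

// ClusterSpanRecordings fans out to every live node and returns the combined
// inflight spans for traceID, ordered by start time.
func ClusterSpanRecordings(ctx context.Context, d Dialer, traceID uint64) ([]RecordedSpan, error) {
    nodes, err := d.LiveNodes(ctx)
    if err != nil {
        return nil, err
    }
    var all []RecordedSpan
    for _, n := range nodes {
        c, err := d.Dial(ctx, n)
        if err != nil {
            return nil, err // a production version would likely skip unreachable nodes
        }
        spans, err := c.GetSpanRecordings(ctx, traceID)
        if err != nil {
            return nil, err
        }
        all = append(all, spans...)
    }
    sort.Slice(all, func(i, j int) bool { return all[i].StartTime.Before(all[j].StartTime) })
    return all, nil
}
```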

66149: cloud: fix gcs to resuming reader r=dt a=adityamaru

This change does a few things:

1. gcs_storage was not returning a resuming reader, as a result of
which the Read method of the resuming reader, which contains the logic
to retry on certain kinds of errors, was not being invoked.

2. Changes the resuming reader to take a storage-specific function
that can define which errors are retryable (a sketch of this pattern
follows this section). All storage providers use the same deciding
function at the moment, so behavior is unchanged.

Release note: None
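
A simplified sketch of the resuming-reader pattern from #66149, with made-up types (`openAtFunc`, `isRetryable`): on an error the storage-specific predicate deems retryable, the reader reopens the underlying stream at the current offset instead of failing the read.

```
// Sketch of a resuming reader with a pluggable retryability predicate.
package resuming

import "io"

// openAtFunc reopens the underlying object starting at the given offset.
type openAtFunc func(offset int64) (io.ReadCloser, error)

type Reader struct {
    openAt      openAtFunc
    isRetryable func(error) bool // storage-specific decision function
    r           io.ReadCloser
    offset      int64
    maxRetries  int
}

func (rr *Reader) Read(p []byte) (int, error) {
    for attempt := 0; ; attempt++ {
        if rr.r == nil {
            r, err := rr.openAt(rr.offset)
            if err != nil {
                return 0, err
            }
            rr.r = r
        }
        n, err := rr.r.Read(p)
        rr.offset += int64(n)
        if err == nil || err == io.EOF {
            return n, err
        }
        if !rr.isRetryable(err) || attempt >= rr.maxRetries {
            return n, err
        }
        // Retryable error: drop the current stream and reopen at the current
        // offset on the next iteration (or on the next Read call if n > 0).
        rr.r.Close()
        rr.r = nil
        if n > 0 {
            return n, nil // hand back the bytes we already have
        }
    }
}
```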

66152: storage: Disable read sampling and read compactions r=sumeerbhola a=itsbilal

Read-triggered compactions are already disabled on 21.1.
As the fixes to address known shortcomings with read-triggered
compactions are a bit involved (see
cockroachdb/pebble#1143), disable
the feature on master until that issue is fixed. That prevents
this known issue from getting in the way of performance
experiments.

Release note: None.

66155: sql: drop "cluster" from EXPLAIN ANALYZE to improve readability r=maryliag a=maryliag

Remove the word "cluster" from "cluster nodes" and "cluster regions"
on EXPLAIN ANALYZE to improve readability.

Release note: None

66157: sql: add time & contention time to EXPLAIN ANALYZE. r=matthewtodd a=matthewtodd

The new fields are labeled `KV time` and `KV contention time`:

```
 > EXPLAIN ANALYZE
-> UPDATE users SET name = 'Bob Loblaw'
-> WHERE id = '32a962b7-8440-4b81-97cd-a7d7757d6eac';
                                            info
--------------------------------------------------------------------------------------------
  planning time: 353µs
  execution time: 3ms
  distribution: local
  vectorized: true
  rows read from KV: 52 (5.8 KiB)
  cumulative time spent in KV: 2ms
  maximum memory usage: 60 KiB
  network usage: 0 B (0 messages)
  cluster regions: us-east1

  • update
  │ cluster nodes: n1
  │ cluster regions: us-east1
  │ actual row count: 1
  │ table: users
  │ set: name
  │ auto commit
  │
  └── • render
      │ cluster nodes: n1
      │ cluster regions: us-east1
      │ actual row count: 1
      │ estimated row count: 0
      │
      └── • filter
          │ cluster nodes: n1
          │ cluster regions: us-east1
          │ actual row count: 1
          │ estimated row count: 0
          │ filter: id = '32a962b7-8440-4b81-97cd-a7d7757d6eac'
          │
          └── • scan
                cluster nodes: n1
                cluster regions: us-east1
                actual row count: 52
                KV time: 2ms
                KV contention time: 0µs
                KV rows read: 52
                KV bytes read: 5.8 KiB
                estimated row count: 50 (100% of the table; stats collected 3 minutes ago)
                table: users@primary
                spans: FULL SCAN
(42 rows)

Time: 4ms total (execution 4ms / network 0ms)
```

Resolves #64200

Release note (sql change): EXPLAIN ANALYZE output now includes, for each plan step, the total time spent waiting for KV requests as well as the total time those KV requests spent contending with other transactions.

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jun 21, 2021
66147: tracing: wrap structured recordings with time at which they were recorded r=abarganier a=adityamaru

This change wraps the existing Structured record in a proto that
has a `time` field, indicating the time at which the record was recorded.

This is similar to how we wrap free-form logs in LogRecord and will
help interleave child recordings with parent recordings at the time
when they occurred, rather than at the span start time.

This change was motivated by the fact that jobs are going to be using
structured recordings to indicate which phase of execution they are
currently in. When consuming this information, it is more intuitive to
see interleaved events, rather than all events for the parent span followed by
all events for the child span.

Currently, we see:
```
ProcSentExportRequest <- processor level
ProcReceivedExportResponse
ExportRequestEvalStarted <- export.go level
ExportRequestEvalFinished
```

We would like to see:
```
ProcSentExportRequest <- processor level
ExportRequestEvalStarted <- export.go level
ExportRequestEvalFinished
ProcReceivedExportResponse
```

Informs: #64992

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
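
A minimal sketch of the interleaving that the `time` field enables (simplified types, not the actual tracing protos): parent and child structured records are merged by recorded-at time, producing the "would like to see" ordering above.

```
// Sketch: interleave parent and child structured records chronologically.
package main

import (
    "fmt"
    "sort"
    "time"
)

// StructuredRecord wraps a payload with the time it was recorded, analogous
// to how free-form logs carry a timestamp in LogRecord.
type StructuredRecord struct {
    Time    time.Time
    Payload string // the proto payload, shown here as a string
}

// interleave merges parent and child records by recorded-at time rather than
// grouping all parent entries before all child entries.
func interleave(parent, child []StructuredRecord) []StructuredRecord {
    all := append(append([]StructuredRecord{}, parent...), child...)
    sort.Slice(all, func(i, j int) bool { return all[i].Time.Before(all[j].Time) })
    return all
}

func main() {
    t0 := time.Now()
    parent := []StructuredRecord{
        {Time: t0, Payload: "ProcSentExportRequest"},
        {Time: t0.Add(3 * time.Second), Payload: "ProcReceivedExportResponse"},
    }
    child := []StructuredRecord{
        {Time: t0.Add(1 * time.Second), Payload: "ExportRequestEvalStarted"},
        {Time: t0.Add(2 * time.Second), Payload: "ExportRequestEvalFinished"},
    }
    for _, r := range interleave(parent, child) {
        fmt.Println(r.Payload) // prints in the interleaved order shown above
    }
}
```
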
craig bot pushed a commit that referenced this issue Jun 22, 2021
63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach

This PR adds renovated Grafana dashboards that aim for feature parity
with the DB Console metrics page.

Dashboards:
- Overview
- Hardware
- Runtime
- SQL
- Storage
- Replication
- Distributed
- Queues
- Slow Requests
- Changefeeds

These dashboards can be previewed by following the instructions in
the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick grafana instance.

Release note (ops change): The grafana dashboards have been updated to
more closely resemble DB console metrics.

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting
all the root spans in the node's inflight registry, and placing
`tracingpb.RecordedSpans` into a flat slice. This caused a loss of
information about which spans belonged to a chain
rooted at a fixed root span. Such a chain is referred to as a
`tracing.Recording`. Every node can have multiple `tracing.Recording`s
with the same `trace_id`, and they each represent a traced remote
operation.

This change maintains the `tracing.Recording` grouping of spans
by getting the collector to return a `[]tracing.Recording` for each
node. The collector's unit of iteration consequently becomes a
`tracing.Recording`. This makes more sense when you think about
how we want to consume these traces. Every `tracing.Recording` is
a new traced remote operation, and should be visualized as such in
Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID
of the node that the current `tracing.Recording` belongs to.

Informs: #64992

Release note: None
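
A short sketch of the grouping this fix preserves, with simplified stand-in types: rather than flattening every span on a node into one slice, spans are grouped into chains (`Recording`s) rooted at spans that have no local parent.

```
// Sketch: group a node's flat recorded spans into per-root Recordings.
package tracegroup

type RecordedSpan struct {
    SpanID       uint64
    ParentSpanID uint64 // 0 when this span is a root
    Operation    string
}

type Recording []RecordedSpan

// groupByRoot attaches every span to the chain of its root ancestor,
// returning one Recording per root span on the node.
func groupByRoot(spans []RecordedSpan) []Recording {
    parent := make(map[uint64]uint64, len(spans))
    for _, s := range spans {
        parent[s.SpanID] = s.ParentSpanID
    }
    rootOf := func(id uint64) uint64 {
        for parent[id] != 0 {
            id = parent[id]
        }
        return id
    }
    byRoot := make(map[uint64]Recording)
    var roots []uint64
    for _, s := range spans {
        r := rootOf(s.SpanID)
        if _, ok := byRoot[r]; !ok {
            roots = append(roots, r)
        }
        byRoot[r] = append(byRoot[r], s)
    }
    out := make([]Recording, 0, len(roots))
    for _, r := range roots {
        out = append(out, byRoot[r])
    }
    return out
}
```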

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: sai-roach <sai@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jun 22, 2021
66042: cloud: add trace recording to retry loops in external storage r=pbardea a=adityamaru

This change adds some tracing to the retry loops we have in
external storage.

Informs: #64992

Release note: None

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting
all the root spans in the node's inflight registry, and placing
`tracingpb.RecordedSpans` into a flat slice. This caused a loss of
information about which spans belonged to a chain
rooted at a fixed root span. Such a chain is referred to as a
`tracing.Recording`. Every node can have multiple `tracing.Recording`s
with the same `trace_id`, and they each represent a traced remote
operation.

This change maintains the `tracing.Recording` grouping of spans
by getting the collector to return a `[]tracing.Recording` for each
node. The collector's unit of iteration consequently becomes a
`tracing.Recording`. This makes more sense when you think about
how we want to consume these traces. Every `tracing.Recording` is
a new traced remote operation, and should be visualized as such in
Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID
of the node that the current `tracing.Recording` belongs to.

Informs: #64992

Release note: None

66691: cloud: bump orchestrator version to 21.1.2 r=celiala a=celiala

Release note: None


66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Celia La <celiala456@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jun 28, 2021
66679: sql,tracing: introduce crdb_internal.cluster_inflight_traces r=abarganier a=adityamaru

This change adds a new indexed, virtual table
`crdb_internal.cluster_inflight_traces`. This table surfaces
cluster-wide inflight traces for the trace_id specified via
an index constraint.

Each row in the virtual table corresponds to a
`tracing.Recording` on a particular node for the given
trace ID. A `tracing.Recording` is the trace of a single
operation rooted at a root span on that node. Under the hood,
the virtual table contacts all "live" nodes in the cluster
via the trace collector, which streams back one recording at a time.

The table has three additional columns that surface the raw JSON,
string, and Jaeger JSON representations of the recording. These
formats are what we dump in a stmt bundle as well, and have been
considered the best way to consume traces. This table is not meant
to be consumed directly via the SQL shell, but will have a CLI wrapper
built on top of it that will assemble the traces and write them to
files, similar to how we dump a stmt bundle.

This change also tweaks some of the recording->string methods
to include StructuredRecords.

Informs: #64992

Release note (sql change): adds a virtual table
`crdb_internal.cluster_inflight_traces` which surfaces
cluster-wide inflight traces for the trace_id specified via
an index constraint. The output of this table is not appropriate
to consume over a SQL connection; follow-up changes will add
CLI wrappers to make the interaction more user-friendly.
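
A rough sketch of the CLI-wrapper idea mentioned above (editorial illustration; the column names `node_id` and `trace_str`, the connection string, and the choice of the lib/pq pgwire driver are assumptions and may not match the final schema or tooling): query `crdb_internal.cluster_inflight_traces` for one trace_id and write each node's recording to a file.

```
// Sketch: dump each node's recording for a trace_id to a file.
package main

import (
    "database/sql"
    "fmt"
    "log"
    "os"

    _ "github.com/lib/pq" // any pgwire-compatible driver works; this is one choice
)

func main() {
    // Connection string is illustrative only.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const traceID = 123456789 // hypothetical trace_id associated with the job
    rows, err := db.Query(
        `SELECT node_id, trace_str FROM crdb_internal.cluster_inflight_traces WHERE trace_id = $1`,
        traceID)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var nodeID int
        var trace string
        if err := rows.Scan(&nodeID, &trace); err != nil {
            log.Fatal(err)
        }
        fname := fmt.Sprintf("trace-%d-node-%d.txt", traceID, nodeID)
        if err := os.WriteFile(fname, []byte(trace), 0644); err != nil {
            log.Fatal(err)
        }
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}
```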


66837: sql/schemachanger: reorder args on Build, clone nodes, minor renaming r=fqazi a=ajwerner

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
craig bot pushed a commit that referenced this issue Jun 29, 2021
65725: backupccl: add some span recordings to the backup resumer r=pbardea a=adityamaru

This change adds some initial span recordings to the backup
resumer that might be interesting when inspecting an executing
backup. The processor and KV level requests will be dealt
with in subsequent commits.

Informs: #64992

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
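
A hedged sketch, not the actual resumer code, of what adding span recordings to a resumer can look like (the `Span` interface and `BackupPhaseEvent` stand-in below are made up): record a structured event on the job's span at notable phases so it shows up in the job's trace.

```
// Sketch: emit structured events from a resumer at notable phases.
package backupsketch

import (
    "context"
    "time"
)

// Span is a simplified stand-in for the tracing package's span.
type Span interface {
    RecordStructured(event interface{})
}

// BackupPhaseEvent is a stand-in for a proto event emitted by the resumer.
type BackupPhaseEvent struct {
    Phase string
    At    time.Time
}

func resumeBackup(ctx context.Context, sp Span) error {
    sp.RecordStructured(&BackupPhaseEvent{Phase: "planning distsql flow", At: time.Now()})
    // ... plan and run the backup flow ...
    sp.RecordStructured(&BackupPhaseEvent{Phase: "writing backup manifest", At: time.Now()})
    return nil
}
```
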
@adityamaru
Contributor Author

This is now being tracked in https://cockroachlabs.atlassian.net/browse/CRDB-8964
