
tracing: collector was incorrectly flattening recordings #66678

Merged 1 commit into cockroachdb:master on Jun 22, 2021

Conversation

@adityamaru (Contributor) commented Jun 21, 2021

Previously, the trace collector was dialing up a node, visiting
all the root spans in the node's inflight registry, and placing
tracingpb.RecordedSpans into a flat slice. This lost the
information about which spans belonged to a chain rooted at a
fixed root span. Such a chain is referred to as a
tracing.Recording. Every node can have multiple tracing.Recordings
with the same trace_id, and each one represents a traced remote
operation.

This change maintains the tracing.Recording grouping of spans
by having the collector return a []tracing.Recording for each
node. The collector's unit of iteration consequently becomes a
tracing.Recording. This makes more sense when you think about
how we want to consume these traces: every tracing.Recording is
a new traced remote operation, and should be visualized as such in
Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID
of the node that the current tracing.Recording belongs to.
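The per-root grouping described above can be sketched in Go. RecordedSpan and Recording below are simplified stand-ins for tracingpb.RecordedSpan and tracing.Recording, and groupByRoot is an illustrative helper, not the collector's actual code:

```go
package main

import "fmt"

// RecordedSpan is a simplified stand-in for tracingpb.RecordedSpan;
// only the fields needed for grouping are shown.
type RecordedSpan struct {
	SpanID       uint64
	ParentSpanID uint64 // 0 means this span is a root
	Op           string
}

// Recording stands in for tracing.Recording: the chain of spans
// rooted at a single root span.
type Recording []RecordedSpan

// groupByRoot returns one Recording per root span instead of a single
// flat slice, preserving which spans belong to which remote operation.
func groupByRoot(spans []RecordedSpan) []Recording {
	children := make(map[uint64][]RecordedSpan)
	var roots []RecordedSpan
	for _, s := range spans {
		if s.ParentSpanID == 0 {
			roots = append(roots, s)
		} else {
			children[s.ParentSpanID] = append(children[s.ParentSpanID], s)
		}
	}
	var out []Recording
	for _, root := range roots {
		rec := Recording{root}
		// Walk the span tree under this root depth-first.
		stack := []uint64{root.SpanID}
		for len(stack) > 0 {
			id := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			for _, c := range children[id] {
				rec = append(rec, c)
				stack = append(stack, c.SpanID)
			}
		}
		out = append(out, rec)
	}
	return out
}

func main() {
	spans := []RecordedSpan{
		{SpanID: 1, Op: "backup"},                  // root of one operation
		{SpanID: 2, ParentSpanID: 1, Op: "export"}, // child of span 1
		{SpanID: 3, Op: "restore"},                 // root of another operation
	}
	for _, rec := range groupByRoot(spans) {
		fmt.Println(len(rec), rec[0].Op)
	}
}
```

Flattening these three spans into one slice would lose the fact that span 2 belongs to the "backup" operation; grouping keeps the two operations distinct.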

Informs: #64992

Release note: None

@cockroach-teamcity (Member) commented:

This change is Reviewable

@@ -20,7 +20,10 @@ message SpanRecordingRequest {
}

message SpanRecordingResponse {
  repeated tracing.tracingpb.RecordedSpan span_recordings = 1 [(gogoproto.nullable) = false];
  message Recording {
Contributor:

Instead of defining this, is it possible to unify this with tracing.Recording instead? Either through gogoproto.casttype or by making tracing.Recording a pb type (I think the former option is preferable).

Contributor Author (@adityamaru):

As discussed offline I tried the latter, but it is a pretty big diff. Still debating if it is worth it over in #66692.

Contributor:

Yea, fair, sorry for the goose chase. Leaving it as is seems fine to me.

pkg/util/tracing/service/service.go (outdated thread, resolved)
return i.recordedSpans[i.recordedSpanIndex]

func (i *Iterator) Value() NodeRecording {
  return NodeRecording{
    NodeID: i.curNode,
irfansharif (Contributor):

Why return the NodeID at all? Do we expect to use it? I imagined the value type would've simply been a trace.Recording instead.

Contributor Author (@adityamaru):

The way I envision it is that in #66679, the vtable is going to store rows with the node_id. This will allow the CLI wrapper built on top of it to dump all the inflight trace spans of a particular node in a single file.
In the case of jobs, you then get a per node view of job execution.

Contributor:

Could return two values instead of introducing a wrapper type: (roachpb.NodeID, tracing.Recording)
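A minimal sketch of that two-value shape; NodeID and Recording here are illustrative stand-ins for roachpb.NodeID and tracing.Recording, and the Iterator fields are hypothetical, not the collector's actual code:

```go
package main

import "fmt"

type NodeID int32       // stand-in for roachpb.NodeID
type Recording []string // stand-in for tracing.Recording

// Iterator walks recordings buffered per node. Value returns both the
// node a Recording came from and the Recording itself, so no wrapper
// struct like NodeRecording is needed.
type Iterator struct {
	nodes      []NodeID // nodes[i] is the origin of recordings[i]
	recordings []Recording
	idx        int
}

func (i *Iterator) Valid() bool { return i.idx < len(i.recordings) }
func (i *Iterator) Next()       { i.idx++ }

func (i *Iterator) Value() (NodeID, Recording) {
	return i.nodes[i.idx], i.recordings[i.idx]
}

func main() {
	it := &Iterator{
		nodes:      []NodeID{1, 1, 2},
		recordings: []Recording{{"root-a"}, {"root-b"}, {"root-c"}},
	}
	for ; it.Valid(); it.Next() {
		nodeID, rec := it.Value()
		fmt.Println(nodeID, rec[0])
	}
}
```

Callers that don't care about the origin node can simply ignore the first return value, which is the appeal of the two-value shape over a wrapper type.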

pkg/util/tracing/collector/collector.go (outdated thread, resolved)
@abarganier (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @irfansharif)


pkg/util/tracing/collector/collector.go, line 76 at r1 (raw file):

	// recordings and buffered them in `recordings` for consumption via the
	// iterator.
	curNode roachpb.NodeID

Just noting: same thread as below re: NodeIDs


pkg/util/tracing/collector/collector.go, line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Why return the NodeID at all? Do we expect to use it? I imagined the value type would've simply been a trace.Recording instead.

I suppose with BulkIO jobs it might be useful to know which node a span came from, but could that be included in the RecordedSpan itself?

@adityamaru (Contributor Author) commented Jun 21, 2021:


pkg/util/tracing/collector/collector.go, line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Why return the NodeID at all? Do we expect to use it? I imagined the value type would've simply been a trace.Recording instead.

I suppose with BulkIO jobs it might be useful to know which node a span came from, but could that be included in the RecordedSpan itself?

Hmm, I'm not sure how I feel about making the nodeID a property of the span, especially because of how lightweight it currently is to create a span throughout the code. I could be wrong, but the idea of having to get at the current node's ID from everywhere in the code base sounds scary? I think some subsystems in CRDB write a "tag" on the span with key = "node". The only place I really see a nodeID being used is when converting the trace to Jaeger JSON. Jaeger has a concept of "processes", which are almost like file directories. They are represented by a different colour and trace hierarchy in the Jaeger UI.

I have the same reasoning as you about why we might need nodeIDs; see my comment on irfan's review!

@adityamaru force-pushed the status-server-prototype branch 2 times, most recently from c6a94ef to 4989ed6, on June 22, 2021 03:12

@@ -15,14 +15,19 @@ option go_package = "tracingservicepb";
import "gogoproto/gogo.proto";
import "util/tracing/tracingpb/recorded_span.proto";

message SpanRecordingRequest {
// GetSpanRecordingsRequest requests a nodes' inflight span registry to return
Contributor:

Move this comment to the RPC definition; it'll show up when jumping to the interface/method, instead of having to jump to the request type. "GetSpanRecordings returns a node's ...", and refer to trace_id as TraceID instead.

Contributor Author (@adityamaru):

done.


@adityamaru (Contributor Author):

TFTRs!

bors r=irfansharif,abarganier

craig bot pushed a commit that referenced this pull request Jun 22, 2021
63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach

This PR adds in renovated grafana dashboards that aim for
feature parity with DB Console metrics page.

Dashboards:
- Overview
- Hardware
- Runtime
- SQL
- Storage
- Replication
- Distributed
- Queues
- Slow Requests
- Changefeeds

These dashboards can be previewed by following the instructions in
the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick grafana instance.

Release note (ops change): The grafana dashboards have been updated to
more closely resemble DB console metrics.

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru


66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: sai-roach <sai@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@craig (bot) commented Jun 22, 2021:

Build failed (retrying...):

@craig (bot) commented Jun 22, 2021:

Build succeeded:

@craig craig bot merged commit 1bd0fe9 into cockroachdb:master Jun 22, 2021
@adityamaru adityamaru deleted the status-server-prototype branch June 22, 2021 19:29