-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: collector was incorrectly flattening recordings #66678
Conversation
@@ -20,7 +20,10 @@ message SpanRecordingRequest { | |||
} | |||
|
|||
message SpanRecordingResponse { | |||
repeated tracing.tracingpb.RecordedSpan span_recordings = 1 [(gogoproto.nullable) = false]; | |||
message Recording { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of defining this, is it possible to unify this with tracing.Recording instead? Either through gogoproto.casttype or by making tracing.Recording a pb type (I think the former option is preferable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline I tried the latter, but it is a prettty big diff. Still debating if it is worth it over in #66692.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, fair, sorry for the goose chase. Leaving it as is seems fine to me.
return i.recordedSpans[i.recordedSpanIndex] | ||
func (i *Iterator) Value() NodeRecording { | ||
return NodeRecording{ | ||
NodeID: i.curNode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return the NodeID at all? Do we expect to use it? I imagined the value type would've simply been a trace.Recording instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I envision it is that in #66679, the vtable is going to store rows with the node_id. This will allow the CLI wrapper built on top of to dump all the inflight trace spans of a particular node in a single file.
In the case of jobs, you then get a per node view of job execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could return two values instead of introducing a wrapper type: (roachpb.NodeID, tracing.Recording)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @irfansharif)
pkg/util/tracing/collector/collector.go, line 76 at r1 (raw file):
// recordings and buffered them in `recordings` for consumption via the // iterator. curNode roachpb.NodeID
Just noting: same thread as below re: NodeIDs
pkg/util/tracing/collector/collector.go, line 168 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Why return the NodeID at all? Do we expect to use it? I imagined the value type would've simply been a trace.Recording instead.
I suppose with BulkIO jobs it might be useful to know which node a span came from, but could that be included in the RecordedSpan itself?
5933711
to
062c674
Compare
Hmm I'm not sure how I feel about making the nodeID a property of the span, especially because of how lightweight it is to currently create a span throughout the code. I could be wrong, but the idea of having to get to the current nodes' ID from everywhere in the code base sounds scary? I think some subsystems in CRDB write a "tag" on the span with I have the same reasoning as you about why we might need nodeIDs, see my comment on irfan's review! |
062c674
to
de4fc2b
Compare
c6a94ef
to
4989ed6
Compare
@@ -20,7 +20,10 @@ message SpanRecordingRequest { | |||
} | |||
|
|||
message SpanRecordingResponse { | |||
repeated tracing.tracingpb.RecordedSpan span_recordings = 1 [(gogoproto.nullable) = false]; | |||
message Recording { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, fair, sorry for the goose chase. Leaving it as is seems fine to me.
@@ -15,14 +15,19 @@ option go_package = "tracingservicepb"; | |||
import "gogoproto/gogo.proto"; | |||
import "util/tracing/tracingpb/recorded_span.proto"; | |||
|
|||
message SpanRecordingRequest { | |||
// GetSpanRecordingsRequest requests a nodes' inflight span registry to return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this comment to the RPC definition, it'll show up when jumping to interface/method, instead of having to jump to the request type. "GetSpanRecordings returns a nodes' ...", and refer to trace_id as TraceID instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
return i.recordedSpans[i.recordedSpanIndex] | ||
func (i *Iterator) Value() NodeRecording { | ||
return NodeRecording{ | ||
NodeID: i.curNode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could return two values instead of introducing a wrapper type: (roachpb.NodeID, tracing.Recording)
4989ed6
to
fc50e27
Compare
Previously, the trace collector was dialing up a node, visiting all the root spans on the nodes inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and they each represent a traced remote operation. This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collectors' unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces. Every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaegar, JSON etc. This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs too. Release note: None
fc50e27
to
804ec07
Compare
TFTRs! bors r=irfansharif,abarganier |
63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach This PR adds in renovated grafana dashboards that aim for feature parity with DB Console metrics page. Dashboards: - Overview - Hardware - Runtime - SQL - Storage - Replication - Distributed - Queues - Slow Requests - Changefeeds These dashboards can be previewed by following the instructions in the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick grafana instance. Release note (ops change): The grafana dashboards have been updated to more closely resemble DB console metrics. 66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru Previously, the trace collector was dialing up a node, visiting all the root spans on the nodes inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and they each represent a traced remote operation. This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collectors' unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces. Every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaegar, JSON etc. This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs too. Informs: #64992 Release note: None 66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss fixes #66683 Release note: None Co-authored-by: sai-roach <sai@cockroachlabs.com> Co-authored-by: Aditya Maru <adityamaru@gmail.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Build failed (retrying...): |
Build succeeded: |
Previously, the trace collector was dialing up a node, visiting
all the root spans on the nodes inflight registry, and placing
tracingpb.RecordedSpans
into a flat slice. This caused loss ofinformation about which spans belonged to a chain
rooted at a fixed root span. Such a chain is referred to as a
tracing.Recording
. Every node can have multipletracing.Recording
swith the same
trace_id
, and they each represent a traced remoteoperation.
This change maintains the
tracing.Recording
grouping of spansby getting the collector to return a
[]tracing.Recording
for eachnode. The collectors' unit of iteration consequently becomes a
tracing.Recording
. This makes more sense when you think abouthow we want to consume these traces. Every
tracing.Recording
isa new traced remote operation, and should be visualized as such in
Jaegar, JSON etc.
This change also augments the collector iterator to return the nodeID
of the node that the current
tracing.Recording
belongs too.Informs: #64992
Release note: None