tracing,tracingservice: adds a trace service to pull clusterwide trace spans #65559
Conversation
Force-pushed from a7bd2d0 to d67bb84.
Nice work, very straightforward and easy to reason about. Just a few nits, but mostly LGTM!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @irfansharif)
pkg/util/tracingservice/client.go, line 59 at r1 (raw file):
Quoted 9 lines of code…
```go
err := c.tracer.VisitSpans(func(span *tracing.Span) error {
	if span.TraceID() != traceID {
		return nil
	}
	for _, rec := range span.GetRecording() {
		resp.SpanRecordings = append(resp.SpanRecordings, rec)
	}
	return nil
})
```
nit: any chance we can union this anon. function with what's used in tracingservice/service.go's GetSpanRecordings without too much hassle?
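For illustration, here is one way the two call sites could share the visitor: a minimal sketch assuming GetRecording returns a slice of tracingpb.RecordedSpan (as the quoted snippet suggests); the helper name is hypothetical.

```go
package tracingservice

import (
	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingpb"
)

// spanRecordingsForTrace collects the recordings of every registered span
// belonging to traceID. Both the client's and the service's
// GetSpanRecordings could delegate to this instead of duplicating the
// anonymous visitor. (Hypothetical helper, not the PR's code.)
func spanRecordingsForTrace(
	tracer *tracing.Tracer, traceID uint64,
) ([]tracingpb.RecordedSpan, error) {
	var recordings []tracingpb.RecordedSpan
	err := tracer.VisitSpans(func(span *tracing.Span) error {
		if span.TraceID() != traceID {
			return nil // not part of the requested trace
		}
		recordings = append(recordings, span.GetRecording()...)
		return nil
	})
	return recordings, err
}
```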
pkg/util/tracingservice/service.go, line 43 at r1 (raw file):
"blob service"

nit: trace service
Sorry for the massive review lag! I've left a few comments, but this is mostly what I expected.
What motivated the separated package structure? Circular dependencies? I'd personally prefer just putting everything into pkg/util/tracing. tracing.Service or tracing/service feel more concise than tracingservice.Service. None of this precludes us from pulling it out into its own thing later down the line.
It would also be worth considering limiting the amount of in-use memory for this tracing service. I'm not sure what that would look like, but the fanout here + RPC-ing over blobs of data could mean a lot of trace data held in memory for all the GetSpanRecordingsFromCluster calls happening concurrently. Is it something we can attach a budget for?
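To make the budget idea concrete, here is a rough sketch using pkg/util/mon; the monitor wiring, function name, and fan-out shape below are all assumptions, not the PR's code.

```go
package tracingservice

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/mon"
)

// collectWithBudget registers every fetched blob of trace data against a
// mon.BoundAccount before retaining it, so a concurrent cluster-wide
// fan-out errors out instead of holding unbounded trace data in memory.
// (Hypothetical sketch; fetch stands in for one node's RPC response.)
func collectWithBudget(
	ctx context.Context,
	monitor *mon.BytesMonitor,
	nodeCount int,
	fetch func() ([]byte, error),
) ([][]byte, error) {
	acct := monitor.MakeBoundAccount()
	// NB: a real implementation would keep the account open for as long as
	// the collected data is referenced; we close it here for brevity.
	defer acct.Close(ctx)

	var out [][]byte
	for i := 0; i < nodeCount; i++ {
		blob, err := fetch()
		if err != nil {
			return nil, err
		}
		// Reserve memory for this response; Grow fails once over budget.
		if err := acct.Grow(ctx, int64(len(blob))); err != nil {
			return nil, err
		}
		out = append(out, blob)
	}
	return out, nil
}
```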
```go
}

func TestTraceClientGetSpanRecordings(t *testing.T) {
	localNodeID := roachpb.NodeID(1)
```
Have you considered using testcluster instead of "raw tracers"? I think that'd be the better thing to do for these fake-distributed tests, because you're interested in making sure that the "wiring" is all correct (i.e. we find the right set of nodes, etc.), and that the right service endpoints are registered with grpc/etc. Right now it feels like it wants to be an integration test, but is setting up all the integration by hand.
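A sketch of the suggested shape, assuming the standard testcluster helpers; the assertions themselves are elided since they depend on the service's API.

```go
package tracingservice_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestTraceServiceWiring(t *testing.T) {
	ctx := context.Background()
	// Two real nodes, each with its own tracer, gRPC server, and registered
	// trace service endpoint, so the test exercises the actual wiring.
	tc := testcluster.StartTestCluster(t, 2 /* nodes */, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// From here the test would start spans via tc.Server(0)/tc.Server(1)
	// and assert on the cluster-wide recordings (elided).
	_ = tc.Server(0)
}
```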
Yeah, good point, changed to using a TC with 2 nodes. Unfortunately, we can't have tracingservice depend on testcluster or serverutils because we get into a:

tracingservice -> testcluster -> server -> tracingservice

So I had to move the tests into their own subpackage.
What's customary in these cases is using package tracingservice_test for the test files, and keeping them in the same package. Is that possible here?
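For reference, the convention looks like this: the file stays in pkg/util/tracingservice but compiles as a separate package, so importing testcluster or server from it cannot form a cycle back into tracingservice. The file name is illustrative; the constructor is the one shown in the diffs in this review.

```go
// pkg/util/tracingservice/service_test.go (illustrative file name)
package tracingservice_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/cockroachdb/cockroach/pkg/util/tracingservice"
)

func TestServiceConstructs(t *testing.T) {
	// The production package is imported explicitly, like any other client.
	svc := tracingservice.NewTracingService(tracing.NewTracer())
	if svc == nil {
		t.Fatal("expected a non-nil service")
	}
}
```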
pkg/util/tracingservice/client.go
Outdated
```go
	return res, err
}
var mu syncutil.Mutex
res = append(res, localSpanRecordings.SpanRecordings...)
```
Is this a pattern we use elsewhere? Sidestepping an RPC to talk to the local service directly? Given it's just observability, I think it's fine to not distinguish the local node and issue RPCs everywhere. I think gRPC is smart enough to short circuit the local request.
Yeah, we use this in the blob service for nodelocal, but I agree probably not worth the additional complexity for observability.
Yeah, I had tried putting all of this in … edit: just saw your comment about not depending on migrationcluster, so that isn't a problem, but I think the other two still are.
Force-pushed from d67bb84 to d125299.
the refactor seems to have gotten rid of this.
done.

@irfansharif I have not addressed the memory monitor concern yet. I'm thinking about it but want to limit the size of this PR, so might do it as a follow-up if that's okay?
Force-pushed from d125299 to 7d14a8a.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif)
pkg/util/tracingservice/client.go, line 128 at r1 (raw file):
"I think it's fine to not distinguish the local node and issue RPCs everywhere. I think gRPC is smart enough to short circuit the local request."
Worth leaving a comment on the corresponding for-loop noting this assumption?
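Something along these lines, say; only the comment placement matters here, and the loop below is a stand-in for the PR's actual fan-out, not its code.

```go
package tracingservice

import "context"

// fanOut issues the span-recordings RPC to every node in the cluster.
// (Stand-in shape; getSpanRecordingsFromNode abstracts the per-node RPC.)
func fanOut(
	ctx context.Context,
	nodeIDs []int32,
	getSpanRecordingsFromNode func(context.Context, int32) error,
) error {
	// NB: we deliberately do not special-case the local node by calling the
	// service directly; gRPC is assumed to short-circuit loopback requests,
	// so issuing the RPC uniformly to every node keeps this loop simple.
	for _, nodeID := range nodeIDs {
		if err := getSpanRecordingsFromNode(ctx, nodeID); err != nil {
			return err
		}
	}
	return nil
}
```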
```go
	}
}

func setupTraces(t1, t2 *tracing.Tracer) (uint64, uint64, func()) {
```
I'd benefit from a top-level comment here, perhaps with a diagram, that explains the trace structure being set up here. It'd then be easy to answer at a quick glance what the trace should look like from each observer (t1, t2, for each traceID/span).
Done, hope the diagram is along the lines of what you were thinking.
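For readers of this thread, the comment presumably looks something like the following; the span names follow the snippets quoted in this review, but the exact shape and node placement in the PR's diagram may differ.

```go
package tracingservice_test

// A diagram along the requested lines (illustrative, not the PR's verbatim
// comment). "remotechild" spans use manual collection, i.e. they live in the
// other tracer's registry until their recordings are imported:
//
//	"node 1" tracer t1                 "node 2" tracer t2
//	------------------                 ------------------
//	root (trace 1)
//	  └─ root.child                    root.child.remotechild
//
//	root2 (trace 2)
//	  └─ root2.child                   root2.child.remotechild
//	  └─ root2.child.remotechild2
```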
```go
traceNode1 := tracing.NewTracer()
traceNode2 := tracing.NewTracer()
tID1, _, cleanup := setupTraces(traceNode1, traceNode2)
```
s/tID1/trace1
done.
```go
// Start another remote child span on "node 1".
anotherChild2RemoteChild := t1.StartSpan("root2.child.remotechild2",
	tracing.WithParentAndManualCollection(child2.Meta()))
return root.TraceID(), root2.TraceID(), func() {
	root.Finish()
```
[nit] `for _, span := range ... { span.Finish() }`
done.
pkg/util/tracingservice/service.go
Outdated
```go
	_ context.Context, request *tracingservicepb.SpanRecordingRequest,
) (*tracingservicepb.SpanRecordingResponse, error) {
	var resp tracingservicepb.SpanRecordingResponse
	err := s.tracer.VisitSpans(func(span *tracing.Span) error {
```
This is really a comment for TestTracingServiceGetSpanRecordings, but just looking at this code I think the corresponding test for it should be very simple. Set up a tracer and start two or more spans with the same trace ID, and one other with a different trace ID. It's important that these two spans are not derived from one another: we want two or more entries in the tracer's map corresponding to two forked spans from the same trace both ending up on the same node. In the test we'll then ensure that the recordings for spans with the requested trace ID are present, and the span from the other trace isn't.

This RPC is a purely "local" one; it doesn't internally fan out to other nodes, so our tests should be similar. The current test for this guy is doing something funky: it's setting up a remote tracer and manually importing some remote child span into a local one. I don't think this code deals with any of that, so we're testing something more than necessary (which already has coverage in the tests written for GetSpanRecordingsFromCluster).
Makes sense, done.
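A minimal sketch of that simplified shape, assuming WithForceRealSpan keeps the spans in the registry; the PR's actual test will differ in the assertions it makes against the service itself.

```go
package tracingservice_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/stretchr/testify/require"
)

func TestGetSpanRecordingsLocalShape(t *testing.T) {
	tracer := tracing.NewTracer()

	root := tracer.StartSpan("root", tracing.WithForceRealSpan())
	defer root.Finish()
	// Forked from root: same trace ID, but its own registry entry since it
	// is not linked back to the parent.
	fork := tracer.StartSpan("fork",
		tracing.WithParentAndManualCollection(root.Meta()))
	defer fork.Finish()
	// A second trace that GetSpanRecordings must filter out.
	other := tracer.StartSpan("other", tracing.WithForceRealSpan())
	defer other.Finish()

	// Walk the registry the same way service.go does and count the spans
	// that belong to root's trace.
	var matched int
	require.NoError(t, tracer.VisitSpans(func(sp *tracing.Span) error {
		if sp.TraceID() == root.TraceID() {
			matched++
		}
		return nil
	}))
	require.Equal(t, 2, matched)
	require.NotEqual(t, root.TraceID(), other.TraceID())
}
```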
pkg/server/BUILD.bazel
Outdated
```diff
@@ -171,6 +171,8 @@ go_library(
         "//pkg/util/timeutil",
         "//pkg/util/tracing",
         "//pkg/util/tracing/tracingpb",
+        "//pkg/util/tracingservice",
```
I think just a slightly better package structure could be defining this entire package under pkg/util/tracing/service -- it's fine for a nested package to depend on a top-level directory, we use this pattern in other places. Importing packages could import it as tracingservice or whatever. I just want to try and avoid two top-level tracing packages, when they could be just the one with multiple sub-packages.
Also see comment elsewhere about the tests subpackage, I don't think we need it.
done, thanks.
"github.com/stretchr/testify/require" | ||
) | ||
|
||
func TestTracingClientGetSpanRecordings(t *testing.T) { |
move setupTraces to this file, use a much simpler (inlined) setup in the other file.
done.
```go
}

func TestTraceClientGetSpanRecordings(t *testing.T) {
	localNodeID := roachpb.NodeID(1)
```
What's customary in these cases is using package tracingservice_test for the test files, and keeping them in the same package. Is that possible here?
LGTM.
"I have not addressed the memory monitor concern yet. I'm thinking about it but want to limit the size of this PR, so might do it as a follow-up if that's okay?"
Yup, looks like this isn't wired up to anything (?), so fine by me.
pkg/util/tracingservice/nodes.go
Outdated
```go
if err != nil {
	return nil, err
}
if !live {
```
I don't think you care about this for observability. If a node is not live, do we want to error out, or do we want to skip over it (though maybe mentioning somewhere that we did)? Even then, we'd time out if issuing RPCs to dead nodes. This filter here is basically making sure we don't even try. Perhaps that's what we want, but timing out would give us a convenient place to capture the error and propagate it somehow -- if that's what we want.
I was thinking we'd just use node liveness to retrieve a list of non-decommissioned nodes.
good point, changed to only returning non-decommissioned nodes. Also added a comment explaining that we will rely on the RPC timing out so we can surface something along the lines of "trace for node x is missing".
pkg/util/tracingservice/nodes.go
Outdated
if !live { | ||
return nil, errors.Newf("n%d required, but unavailable", l.NodeID) | ||
} | ||
ns = append(ns, node{id: l.NodeID, epoch: l.Epoch}) |
I don't think you need the epoch either. Simply returning a list of roachpb.NodeIDs seems sufficient for our purposes -- all we need is a list of nodes to reach out to.
Done.
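After both changes, the helper might reduce to something like this hypothetical sketch; the liveness accessor and struct are stand-ins for however the real code reads node liveness records.

```go
package tracingservice

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// liveness is a stand-in for the fields this helper needs from a node
// liveness record.
type liveness struct {
	NodeID         roachpb.NodeID
	Decommissioned bool
}

// nodesInCluster returns the IDs of all non-decommissioned nodes. No epoch,
// no liveness check: dead-but-not-decommissioned nodes stay in the list, and
// their RPCs are allowed to time out so the caller can report something
// like "trace for node x is missing".
func nodesInCluster(
	ctx context.Context, getLivenesses func(context.Context) ([]liveness, error),
) ([]roachpb.NodeID, error) {
	livenesses, err := getLivenesses(ctx)
	if err != nil {
		return nil, err
	}
	var ns []roachpb.NodeID
	for _, l := range livenesses {
		if l.Decommissioned {
			continue // skip decommissioned nodes only
		}
		ns = append(ns, l.NodeID)
	}
	return ns, nil
}
```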
pkg/util/tracingservice/client.go
Outdated
```go
}

// NewTraceClientDialer returns a TraceClientDialer.
func NewTraceClientDialer(
```
[nit] tracingservice.NewTraceClientDialer doesn't feel like a good/descriptive name. I'm not sure what's better. Something collector maybe? tracing/service.New() could serve local traces through an RPC, tracing/collector.New() could reach out to all the various tracing/services?
Changed the struct to TraceCollector, and the methods to New() now that they are in their own packages.
pkg/util/tracingservice/service.go
Outdated
```go
var _ tracingservicepb.TracingServer = &Service{}

// NewTracingService instantiates a tracing service server.
func NewTracingService(tracer *tracing.Tracer) *Service {
```
Usually an anti-pattern to have repetition between package and symbols: https://blog.golang.org/package-names. tracing/service.New seems fine to me.
Done.
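Putting the two naming comments together, call sites end up reading roughly like this; the sub-package import path and constructor signature are assumptions based on this discussion.

```go
package main

import (
	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	// Assumed sub-package path per the review discussion above.
	"github.com/cockroachdb/cockroach/pkg/util/tracing/service"
)

func main() {
	tracer := tracing.NewTracer()
	// No stutter: "service.New", not "tracingservice.NewTracingService".
	svc := service.New(tracer)
	_ = svc
}
```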
Force-pushed from 40636f1 to 9aa7cbf.
Previously, every node in the cluster had a local inflight span registry that was aware of all the spans that were rooted on that particular node. Child spans of a given traceID executing on a remote node would only become visible to the local registry once execution completes, and the span pushes its recordings over gRPC to the "client" node.

This change introduces a `collector` and `service` package that together contain a gRPC service to be used for remote inflight span access. It is used for pulling inflight spans from all CockroachDB nodes. Each node will run a trace service, which serves the inflight spans from the local span registry on that node. Each node will also have a trace collector which uses the nodedialer to connect to another node's trace service and access its inflight spans. The spans for a traceID are sorted by `StartTime` before they are returned.

The per-node trace dialer has yet to be hooked up to an appropriate location depending on where we intend to use it.

Release note: None
Force-pushed from 9aa7cbf to 4b4c209.
TFTRs! bors r=irfansharif,abarganier

Build succeeded.
Previously, every node in the cluster had a local inflight span
registry that was aware of all the spans that were rooted on that
particular node. Child spans of a given traceID executing on a remote
node would only become visible to the local registry once execution
completes, and the span pushes its recordings over gRPC to the
"client" node.
This change introduces a tracingservice package. Package tracingservice
contains a gRPC service to be used for remote inflight span access.
It is used for pulling inflight spans from all CockroachDB nodes.
Each node will run a trace service, which serves the inflight spans from the
local span registry on that node. Each node will also have a trace client
dialer, which uses the nodedialer to connect to another node's trace service,
and access its inflight spans. The trace client dialer is backed by a remote
trace client or a local trace client, which serves as the point of entry to this
service. Both clients support the TraceClient interface, which includes the
following functionalities:
The spans for a traceID are sorted by StartTime before they are
returned. The per-node trace dialer has yet to be hooked up to an
appropriate location depending on where we intend to use it.
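To illustrate the ordering guarantee described above, a tiny self-contained example; RecordedSpan here is a stand-in type with just the relevant field, not the proto-generated one.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// RecordedSpan is a stand-in for a span recording with a start timestamp.
type RecordedSpan struct {
	Operation string
	StartTime time.Time
}

// sortByStartTime orders recordings the way the service returns them.
func sortByStartTime(spans []RecordedSpan) {
	sort.Slice(spans, func(i, j int) bool {
		return spans[i].StartTime.Before(spans[j].StartTime)
	})
}

func main() {
	now := time.Now()
	spans := []RecordedSpan{
		{Operation: "root.child", StartTime: now.Add(time.Second)},
		{Operation: "root", StartTime: now},
	}
	sortByStartTime(spans)
	fmt.Println(spans[0].Operation, "->", spans[1].Operation) // root -> root.child
}
```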
Resolves: #60999
Informs: #64992
Release note: None