
sql: allow inflight traces to be collected across the cluster #60999

Closed
tbg opened this issue Feb 23, 2021 · 6 comments · Fixed by #65559
Labels: A-sql-observability (Related to observability of the SQL layer) · A-tracing (Relating to tracing in CockroachDB) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

Comments

@tbg
Member

tbg commented Feb 23, 2021

Is your feature request related to a problem? Please describe.
Our internal traces follow a nonstandard collection pattern in which the trace data is returned to the caller along with the responses to its requests. The standard approach is for traces to be streamed to a collection endpoint "as they happen".

We chose our model because a) it's just what we started with and b) in a collector model, it is difficult to determine when to stop listening for new changes.

Either way, as a result of the current model, we can't easily look at traces before they are complete. The clearest case of this being an issue is when a request is hanging somewhere (or at least not returning in a timely fashion); that request's trace will not be easily observable until it finishes.

At the time of writing, we have per-node SQL tables crdb_internal.node_inflight_trace_spans (#55733) which in principle can be used to get hold of these spans, and there are ways to extract information from them (though, at the time of writing, not verbose information).

Actually doing this in practice amounts to a wild goose chase, though, since there is no mechanism that takes a traceID and gathers all of the data across the nodes. It is that mechanism which is tracked in this issue.

Describe the solution you'd like
I'd like there to be a standard way to pull a trace recording including in-flight spans. Initially, this would be used by TSE/SRE/eng during incidents, but it would provide the basis for end-user functionality. For example, you could imagine that next to each job (see #60998) there is a button that, when clicked, gives you the up-to-date trace as a Jaeger JSON file (or even renders a UI! One can dream).

Describe alternatives you've considered

Additional context

@tbg tbg added the C-enhancement, A-sql-observability, and A-tracing labels Feb 23, 2021
@adityamaru
Contributor

adityamaru commented May 18, 2021

A rough outline of the design I'm currently thinking of for a lightweight service that returns inflight spans for a given trace_id. I'm taking inspiration from how BulkIO implemented nodelocal, which required a similar gRPC-backed, inter-node file-sharing service.

  • Every node would have a tracing service that serves the recordings of all inflight spans tracked in the inflight registry on that node. We already have a means of collecting the recordings of all inflight spans on the current node via Tracer.VisitSpans. The returned RecordedSpans are able to cross RPC boundaries and so the tracing service would send back a response that could look something like:
message TracingServiceResponse {
	repeated RecordedSpan recorded_spans = 1;
}
  • Every node would also have a tracing client that uses a nodedialer.Dialer to connect to every other node's tracing service. The tracing client will be the entry point for the users of this service who wish to pull inflight traces from across the cluster.

  • The logic to pull all spans from the cluster can resolve all the nodes in the cluster via the NodeStatusServer, dial each node via the above-mentioned tracing client, and get recordings from every node. The imported recordings can then be filtered, sorted, and used as necessary.

An example of how this might look can be found in the BlobClient/BlobServer.
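To make the outline above more concrete, here is a rough, self-contained Go sketch of the per-node service plus the cluster-wide fanout client. It is only an illustration of the shape being proposed: RecordedSpan, SpanRegistry, and the dial callback are simplified stand-ins for the real pkg/util/tracing and nodedialer types, and none of the names below are actual CockroachDB APIs.

package tracingsvc

import (
	"context"
	"sync"
)

// RecordedSpan stands in for the proto-encoded span recording that can
// cross RPC boundaries. Operation name, timestamps, tags, logs, etc.
// would live here too.
type RecordedSpan struct {
	TraceID uint64
	SpanID  uint64
}

// SpanRegistry is the subset of the node-local tracer this sketch needs:
// a way to visit every in-flight span on this node (cf. Tracer.VisitSpans).
type SpanRegistry interface {
	VisitSpans(visit func(RecordedSpan) error) error
}

// Service is the per-node tracing service.
type Service struct {
	registry SpanRegistry
}

// GetSpanRecordings returns the recordings of all in-flight spans on this
// node that belong to the given trace.
func (s *Service) GetSpanRecordings(
	ctx context.Context, traceID uint64,
) ([]RecordedSpan, error) {
	var spans []RecordedSpan
	err := s.registry.VisitSpans(func(sp RecordedSpan) error {
		if sp.TraceID == traceID {
			spans = append(spans, sp)
		}
		return nil
	})
	return spans, err
}

// Client fans out to every node's tracing service and merges the results.
// dial stands in for a nodedialer-backed connection to a remote node's
// service; the set of node IDs would come from cluster membership
// (NodeStatusServer or node liveness).
type Client struct {
	dial func(ctx context.Context, nodeID int32) (*Service, error)
}

func (c *Client) GetClusterSpanRecordings(
	ctx context.Context, nodeIDs []int32, traceID uint64,
) ([]RecordedSpan, error) {
	var (
		mu       sync.Mutex
		all      []RecordedSpan
		firstErr error
		wg       sync.WaitGroup
	)
	for _, id := range nodeIDs {
		id := id // copy out of the loop variable
		wg.Add(1)
		go func() {
			defer wg.Done()
			svc, err := c.dial(ctx, id)
			if err == nil {
				var spans []RecordedSpan
				if spans, err = svc.GetSpanRecordings(ctx, traceID); err == nil {
					mu.Lock()
					all = append(all, spans...)
					mu.Unlock()
					return
				}
			}
			mu.Lock()
			if firstErr == nil {
				firstErr = err
			}
			mu.Unlock()
		}()
	}
	wg.Wait()
	return all, firstErr
}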

Some notes:

  • The service would be relying on an O(nodes) RPC fanout. Should we have some sort of memory protection around the number of RPC responses we buffer in memory?

  • I haven't thought about how this service would work in a multi-tenant model, where the SQL pods presumably cannot dial up the KV server. My knowledge in this area is very shallow so I am hoping to get some pointers about this!

cc: @irfansharif @abarganier interested to hear your thoughts.

@adityamaru adityamaru self-assigned this May 18, 2021
@abarganier
Contributor

Nice write-up, the outline LGTM overall.

Should we have some sort of memory protection around the number of RPC responses we buffer in memory?

In the context of the ideas discussed elsewhere to someday flush traces for long-running BulkIO jobs to disk, I was wondering about this too, since it sounds like that data could be large enough for this to be an issue when combined across nodes. Perhaps we can define an acceptable buffer size for the combined responses at the client level, and then include a request param of bufferSizeLimit / # of nodes in the request, informing each node to return a subset of the data that aims not to exceed that threshold?
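Purely as a hypothetical illustration of that per-node budget idea (extending the stand-in types from the sketch earlier in this thread; none of these names exist in CockroachDB), each node could stop collecting once its share of the budget is exhausted and flag the response as truncated:

// approxSize stands in for a real estimate of a span's encoded size.
func approxSize(RecordedSpan) int64 { return 256 }

type SpanRecordingRequest struct {
	TraceID     uint64
	BytesBudget int64 // e.g. bufferSizeLimit / number of nodes, computed by the client
}

type SpanRecordingResponse struct {
	Spans     []RecordedSpan
	Truncated bool // true if this node hit its budget and returned only a subset
}

func (s *Service) GetSpanRecordingsWithBudget(
	ctx context.Context, req SpanRecordingRequest,
) (SpanRecordingResponse, error) {
	var resp SpanRecordingResponse
	var used int64
	err := s.registry.VisitSpans(func(sp RecordedSpan) error {
		if sp.TraceID != req.TraceID {
			return nil
		}
		if used += approxSize(sp); used > req.BytesBudget {
			resp.Truncated = true // budget hit: keep what we have, drop the rest
			return nil
		}
		resp.Spans = append(resp.Spans, sp)
		return nil
	})
	if err != nil {
		return SpanRecordingResponse{}, err
	}
	return resp, nil
}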

From a broader perspective, I wonder how we could make the entirety of the retained tracing data for long running jobs across all nodes accessible given memory constraints like this. Perhaps the upcoming Observability Server (see #65141) could be useful here, polling each node periodically for updates and storing on the obs. server node's disk?

@irfansharif
Contributor

irfansharif commented May 18, 2021

Yea, the broad shape here sounds good to me. Traces are hardly larger than a few kilobytes in size; even for long running ones (for days) I'd be surprised if we crossed a few MBs. So I don't think we'll need to worry about flushing traces or anything to disk for size concerns.

If you want trace data to persist across job resumptions where you've lost a handle on the given trace, that's a different problem. Is that really that important? (Probably shouldn't be for an MVP.) Important enough to write something out to disk, keyed by job ID? In any case, that can/should sit outside of pkg/tracing. I think it should be understood as metadata the jobs infrastructure is persisting to track across job resumptions (and it just so happens to be earlier trace data).

@abarganier, I think we should probably decouple this work from the obs server for now; we want to have jobs observability whether or not we have an obs server running. Given that the obs server proposal is going to be using shared server code, it'll be easy to port something like this over, but it'd be easier to decouple these things for now (I think).


The logic to pull all spans from the cluster can resolve all the nodes in the cluster via the NodeStatusServer

Each server (ignoring multi-tenant for now) has a view of node liveness, through which it can view all current members of the cluster. I think that's better suited here to get a handle on all node IDs to reach out to.

// Each server holds a handle on node liveness:
nodeLiveness *liveness.NodeLiveness

// which exposes the cluster's current membership:
func (nl *NodeLiveness) GetLivenesses() []livenesspb.Liveness {
	nl.mu.RLock()
	defer nl.mu.RUnlock()
	livenesses := make([]livenesspb.Liveness, 0, len(nl.mu.nodes))
	for _, l := range nl.mu.nodes {
		livenesses = append(livenesses, l.Liveness)
	}
	return livenesses
}

// and each liveness record carries the node's ID:
type Liveness struct {
	NodeID github_com_cockroachdb_cockroach_pkg_roachpb.NodeID `protobuf:"varint,1,opt,name=node_id,json=nodeId,proto3,casttype=github.com/cockroachdb/cockroach/pkg/roachpb.NodeID" json:"node_id,omitempty"`
	// ... (remaining fields elided)
}
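A small hypothetical glue step (not actual CockroachDB code) would reduce those liveness records to the list of node IDs the tracing client fans out to; it assumes the liveness, livenesspb, and roachpb packages referenced in the excerpt above:

// liveNodeIDs collects the IDs of all nodes known to liveness so the
// tracing client can dial each one and ask for its in-flight spans.
func liveNodeIDs(nl *liveness.NodeLiveness) []roachpb.NodeID {
	livenesses := nl.GetLivenesses()
	ids := make([]roachpb.NodeID, 0, len(livenesses))
	for _, l := range livenesses {
		ids = append(ids, l.NodeID)
	}
	return ids
}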


The service would be relying on an O(nodes) RPC fanout. Should we have some sort of memory protection around the number of RPC responses we buffer in memory?

No, but make sure to rate-limit the total number of RPCs you're sending out in parallel. Like I said above, probably traces don't exceed a few KBs in size, and looks like we'd only collect them cluster wide when polling for a live trace (vs. always collecting them).

include a request param of bufferSizeLimit / # of nodes in the request, informing each node to return a subset of the data that aims to not exceed that threshold?

Same as above, these seem like distant concerns.

@irfansharif
Contributor

I haven't thought about how this service would work in a multi-tenant model, where the SQL pods presumably cannot dial up the KV server. My knowledge in this area is very shallow so I am hoping to get some pointers about this!

If we want to open this up to tenants, we'd want something like a SQL pod reaching out to a newly introduced KV API to return trace data for a given trace ID (this API would internally fan out etc.). I'm not sure about this though, and it seems fraught. Traces have no idea which "tenant" they belong to, so I'm not sure how/if we'd do anything to prevent tenants from retrieving traces for other tenants. Maybe that's not a threat model we care about if tenants only request data for trace IDs they know about (through the job records they were allowed to create: #65322). Dunno, out of my depth here. This is also something we'll want to rate limit/admission control because of the potential RPC fanout.

@abarganier
Contributor

Traces are hardly larger than a few kilobytes in size; even for long running ones (for days) I'd be surprised if we crossed a few MBs

Good news - thanks for providing some clarity around this.

we should probably decouple this work from the obs server for now

👍 Sounds good to me - if memory constraints aren't the issue we originally thought they might be, then there's probably no reason to involve the obs. server.

rate-limit the total number of RPCs you're sending out in parallel

Is there any rough guide on how many nodes we can make parallel RPCs to, before things become problematic?

@irfansharif
Contributor

Is there any rough guide on how many nodes we can make parallel RPCs to, before things become problematic?

Not really; it's more something to do out of an abundance of caution. This isn't too complicated either; it'd look something like this:

// We'll want to rate limit outgoing RPCs (limit pulled out of thin air).
qp := quotapool.NewIntPool("every-node", 25)
log.Infof(ctx, "executing %s on nodes %s", redact.Safe(op), ns)
grp := ctxgroup.WithContext(ctx)
for _, node := range ns {
	id := node.ID // copy out of the loop variable
	alloc, err := qp.Acquire(ctx, 1)
	if err != nil {
		return err
	}
	grp.GoCtx(func(ctx context.Context) error {
		defer alloc.Release()
		conn, err := c.c.Dialer.Dial(ctx, id, rpc.DefaultClass)
		if err != nil {
			return err
		}
		client := serverpb.NewMigrationClient(conn)
		return fn(ctx, client)
	})
}
return grp.Wait()
