sql: allow inflight traces to be collected across the cluster #60999
A rough outline of the design I'm currently thinking of for a lightweight service that returns in-flight spans for a given trace_id, taking inspiration from how BulkIO implemented something similar.
An example of how this might look can be found in the BlobClient/BlobServer; a rough sketch of the interface is at the end of this comment. Some notes:
cc: @irfansharif @abarganier, interested to hear your thoughts.
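To make the shape concrete, here is a minimal sketch of what such a pair might look like, loosely mirroring the BlobClient/BlobServer split. All names here (TraceServer, TraceClient, GetInflightSpans, RecordedSpan) are hypothetical placeholders, not the actual implementation.

```go
// Hypothetical sketch only: a per-node service that returns in-flight spans
// for a given trace ID, loosely mirroring BulkIO's BlobClient/BlobServer
// split. None of these names are from the actual implementation.
package tracecollector

import "context"

// RecordedSpan stands in for whatever proto each node returns per span.
type RecordedSpan struct {
	TraceID   uint64
	SpanID    uint64
	Operation string
}

// TraceServer is implemented by every node; it consults the node-local
// span registry for spans belonging to the requested trace.
type TraceServer interface {
	GetInflightSpans(ctx context.Context, traceID uint64) ([]RecordedSpan, error)
}

// TraceClient is used by the node serving the request: it dials every other
// node's TraceServer and merges the per-node responses into one recording.
type TraceClient interface {
	CollectInflightSpans(ctx context.Context, traceID uint64) ([]RecordedSpan, error)
}
```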
Nice write-up, the outline LGTM overall.
In the context of the ideas discussed elsewhere to some day flush traces for long-running BulkIO jobs to disk, I was wondering about this too, since it sounds like that data could be large enough for this to be an issue when combined across nodes. Perhaps we can define an acceptable buffer size for the combined responses at the client level, and then include a request param of

From a broader perspective, I wonder how we could make the entirety of the retained tracing data for long-running jobs across all nodes accessible given memory constraints like this. Perhaps the upcoming Observability Server (see #65141) could be useful here, polling each node periodically for updates and storing it on the obs. server node's disk?
Yea, the broad shape here sounds good to me. Traces are rarely larger than a few kilobytes in size; even for long-running ones (running for days) I'd be surprised if we crossed a few MBs. So I don't think we'll need to worry about flushing traces to disk for size concerns.

If you want trace data to persist across job resumptions where you've lost a handle on the given trace, that's a different problem. Is that really that important? (Probably shouldn't be for an MVP.) Important enough to write something out to disk, keyed by job ID? In any case, that can/should sit outside of pkg/tracing. I think it should be understood as metadata the jobs infrastructure is persisting to track across job resumptions (it just so happens to be earlier trace data).

@abarganier, I think we should probably decouple this work from the obs server for now; we want to have jobs observability whether or not we have an obs server running. Given that the obs server proposal is going to be using shared server code, it'll be easy to port something like this over, but it'd be easier to decouple these things for now (I think).
Each server (ignoring multi-tenant for now) has a view of node liveness, through which it can see all current members of the cluster. I think that's better suited here to get a handle on all the node IDs to reach out to. See:

- cockroach/pkg/server/server.go, line 141 (at 19e670f)
- cockroach/pkg/kv/kvserver/liveness/liveness.go, lines 999 to 1007 (at dd325f0)
- cockroach/pkg/kv/kvserver/liveness/livenesspb/liveness.pb.go, lines 156 to 157 (at bb6ebca)
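For illustration, a sketch of how the fan-out might source its node list from liveness and then collect spans node by node. NodeLiveness and spansForNode are placeholders (RecordedSpan comes from the earlier sketch), not the real liveness API:

```go
// Sketch: enumerate live nodes via a (placeholder) liveness interface and
// collect each node's in-flight spans for the trace. RecordedSpan is the
// placeholder type from the earlier sketch; spansForNode stands in for the
// actual per-node RPC.
package tracecollector

import "context"

// NodeID is a placeholder for roachpb.NodeID.
type NodeID int32

// NodeLiveness is a placeholder for the server's view of node liveness.
type NodeLiveness interface {
	// GetLiveNodeIDs returns the IDs of all nodes currently considered live.
	GetLiveNodeIDs() []NodeID
}

// collectClusterSpans asks every live node for its in-flight spans for
// traceID and merges the results into a single slice.
func collectClusterSpans(
	ctx context.Context,
	nl NodeLiveness,
	traceID uint64,
	spansForNode func(context.Context, NodeID, uint64) ([]RecordedSpan, error),
) ([]RecordedSpan, error) {
	var all []RecordedSpan
	for _, id := range nl.GetLiveNodeIDs() {
		spans, err := spansForNode(ctx, id, traceID)
		if err != nil {
			return nil, err
		}
		all = append(all, spans...)
	}
	return all, nil
}
```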
No, but make sure to rate-limit the total number of RPCs you're sending out in parallel. Like I said above, traces probably don't exceed a few KBs in size, and it looks like we'd only collect them cluster-wide when polling for a live trace (vs. always collecting them).
Same as above, these seem like distant concerns.
If we want to open this up to tenants, we'd want something like a SQL pod reaching out to a newly introduced KV API to return trace data for a given trace ID (this API would internally fan out, etc.). I'm not sure about this though, and it seems fraught. Traces have no idea which "tenant" they belong to, so I'm not sure how/if we'd do anything to prevent tenants from retrieving traces for other tenants. Maybe that's not a threat model we care about if tenants only request data for trace IDs they know about (through the jobs records they were allowed to create: #65322). Dunno, out of my depth here. Also something we'll want to rate limit/admission control because of this potential RPC fanout.
Good news - thanks for providing some clarity around this.
👍 Sounds good to me. If the memory constraints aren't an actual issue as we originally thought, then there's probably no reason to involve the obs. server.
Is there any rough guide on how many nodes we can make parallel RPCs to, before things become problematic?
Not really; it's more something to do out of an abundance of caution. This isn't too complicated either, it'd look something like this:

- cockroach/pkg/migration/migrationcluster/cluster.go, lines 106 to 129 (at 126ea38)
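As a rough stand-in for that pattern, here is a sketch of a fan-out helper with bounded parallelism. The referenced code uses a quota pool; this placeholder just uses a buffered channel as a semaphore, and NodeID is the placeholder type from the earlier sketches.

```go
// Sketch of bounded-parallelism fan-out, in the spirit of the
// migrationcluster helper referenced above. The real code uses a quota
// pool; this placeholder uses a buffered channel as a semaphore.
package tracecollector

import (
	"context"
	"sync"
)

// forEveryNode runs fn for every node ID with at most maxParallel
// invocations in flight at once, returning the first error encountered.
func forEveryNode(
	ctx context.Context,
	nodes []NodeID,
	maxParallel int,
	fn func(context.Context, NodeID) error,
) error {
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for _, id := range nodes {
		id := id          // capture the loop variable for the goroutine
		sem <- struct{}{} // acquire a slot; blocks once maxParallel are running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := fn(ctx, id); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return firstErr
}
```

Callers would pick a small, fixed maxParallel rather than dialing every node in the cluster at once, which is the caution being suggested above.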
Is your feature request related to a problem? Please describe.
Our internal traces follow a nonstandard collection pattern in which the trace data is returned to the caller along with the responses to its requests. The standard way is for traces to be streamed to a collection endpoint "as they happen".
We chose our model because a) it's just what we started with and b) in a collector model, it is difficult to determine when to stop listening for new changes.
Either way, as a result of the current model, we can't easily look at traces before they are complete. The clearest case of this being an issue is when a request is hanging somewhere (or at least not returning in a timely fashion); that request's trace will not be easily observable until it finishes.
At the time of writing, we have a per-node SQL table, crdb_internal.node_inflight_trace_spans (#55733), which in principle can be used to grab a hold of these spans, and there are ways to get information out of these spans (though, at the time of writing, not verbose information). Actually doing this in practice amounts to a wild goose chase, though, since there is no mechanism that takes a traceID and gathers all of the data across the nodes. It is that mechanism which is tracked in this issue.
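For a sense of what that wild goose chase looks like today, a hedged sketch of the manual approach: querying each node's crdb_internal.node_inflight_trace_spans separately and stitching the results together by hand. Node addresses, credentials, and the exact column set used below are assumptions.

```go
// Illustrative only: today, collecting the in-flight spans for a trace means
// querying crdb_internal.node_inflight_trace_spans on each node separately,
// because each node's virtual table only reflects its own span registry.
// Node addresses, credentials, and column names are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	const traceID = 1234567890 // the trace being chased down
	// One SQL address per node; you have to know/enumerate these yourself.
	nodeAddrs := []string{"node1:26257", "node2:26257", "node3:26257"}

	for _, addr := range nodeAddrs {
		db, err := sql.Open("postgres",
			fmt.Sprintf("postgresql://root@%s/?sslmode=disable", addr))
		if err != nil {
			log.Fatal(err)
		}
		rows, err := db.Query(
			`SELECT span_id, operation
			   FROM crdb_internal.node_inflight_trace_spans
			  WHERE trace_id = $1`, traceID)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var spanID int64
			var operation string
			if err := rows.Scan(&spanID, &operation); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("%s: span %d: %s\n", addr, spanID, operation)
		}
		rows.Close()
		db.Close()
	}
}
```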
Describe the solution you'd like
I'd like there to be a standard way to pull a trace recording including in-flight spans. Initially, this would be used by TSE/SRE/eng during incidents, but it would provide the basis for end-user functionality. For example, you could imagine that next to each job (see #60998) there is a button that, when clicked, gives you the up-to-date trace as a Jaeger JSON file (or even renders a UI! One can dream).
Describe alternatives you've considered
Additional context