Allow a way to query for large traces that exceeded the ingester limit #3862
Comments
So there are questions you can ask Tempo using TraceQL to get an understanding of what is creating your enormous trace. These features are in main now.
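The specific queries aren't quoted in this thread; as a hedged sketch, TraceQL along these lines (using the `count()` span aggregate available in recent Tempo releases) can surface traces with unusually many spans. The first query finds traces with a very large span count; the second checks whether a particular service is the main contributor. The thresholds and the service name are illustrative:

```
{ } | count() > 10000
{ resource.service.name = "my-service" } | count() > 5000
```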
The above can help, but what I'd really like to do is create an endpoint for retrieving these partial traces directly.
There is currently no good way to tell a client that we are returning only part of the trace; perhaps an HTTP status code could be used to signal a partial response.
Thanks for your response @joe-elliott. Having the endpoint you described sounds like a good long-term direction, but I was still surprised that there's any query-level enforcement of trace size limits.
Thanks for sharing these! I'm excited to try them out. Are these currently scoped to (or do they exist in some form in) a release?
The primary reason to enforce this on the query path is to prevent instability from parsing and combining hundreds of MBs of trace data. Thank you for pointing out that issue! As you noticed, I closed it :).
I think splitting this limit into two is a fine idea. We would definitely accept that PR.
I think we are in luck: they should be doable in 2.5.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Is your feature request related to a problem? Please describe.
We currently have a deployment with `global.max_bytes_per_trace` configured. Occasionally, some traces exceed this limit, which we see in the distributor/ingester logs as:

```
level=warn ... desc = TRACE_TOO_LARGE: max size of trace (...) exceeded while adding 55431 bytes to trace <trace-id> for tenant ...
```
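For reference, a minimal sketch of how this limit is typically configured, assuming the per-tenant overrides format of recent Tempo versions; the value is illustrative:

```yaml
overrides:
  defaults:
    global:
      max_bytes_per_trace: 5000000  # ~5 MB; spans pushing a trace past this size are refused with TRACE_TOO_LARGE
```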
However, when we want to troubleshoot what was in those traces leading up to the exceedance, the query fails in Grafana with an error.
In Grafana, the TraceQL expression used is the exact trace ID reported in the distributor/ingester warning logs.
In a large distributed system, not being able to see the content of these traces makes it difficult to troubleshoot what led to their large size and repair the cause.
It is also unexpected that a limit is enforced at the query layer at all, because #1225, which proposes query-layer enforcement, is still open.
Describe the solution you'd like
If a trace size limit is already configured and enforced at the ingest layer, I would like the option to disable trace size limit enforcement at the query layer, or to have a separate configurable limit for queries.
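A purely hypothetical sketch of what such a split might look like; `query_max_bytes_per_trace` does not exist in Tempo today, and the name is invented here only for illustration:

```yaml
overrides:
  defaults:
    global:
      max_bytes_per_trace: 5000000         # existing ingest-time limit
      query_max_bytes_per_trace: 52428800  # hypothetical query-time limit (e.g. 0 = disabled)
```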
Describe alternatives you've considered
We've adjusted the `server.grpc_server_max_(send|recv)_msg_size` limits to be higher than the trace size limits, though this had no effect (and the error messages didn't indicate a gRPC limit being hit first).

We have also increased the trace size limit to try to find large traces using metrics-generator metrics before they hit the previous limit; but going from a metric to a specific large trace is toilsome, and it fails to catch cases where a bug quickly creates a large number or volume of spans in a short period of time. It also required us to set the limit higher than we'd like as a steady state, partially defeating the purpose of the limit.
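For concreteness, the server settings mentioned above look roughly like this (these are existing Tempo server options; values are illustrative):

```yaml
server:
  grpc_server_max_recv_msg_size: 104857600  # 100 MiB
  grpc_server_max_send_msg_size: 104857600  # 100 MiB
```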
Additional context
Our Tempo deployment is configured via Helm chart, so any configuration change would also need to be expressible through Helm values.
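As a sketch, assuming the grafana/tempo-distributed chart's structuredConfig passthrough (key paths should be verified against the chart version in use):

```yaml
tempo:
  structuredConfig:
    overrides:
      defaults:
        global:
          max_bytes_per_trace: 5000000  # illustrative
```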