Log more information when a trace is too large to compact #1931
Hi @mdisibio, I'd like to take this issue as my first contribution to the Tempo project. I have a few considerations before submitting a PR:
As far as logging is concerned, IMHO, people usually look up specific traces in a tracing system when debugging or trying to identify something wrong or faulty, and they expect the system to be a completely reliable source of truth. So a log like this should be as complete as possible, and the log level should be warning (to warn developers that they may want to reduce the size of the trace because they are losing spans).
Hi, thanks for looking into this, and for the very thorough research already. Your suggestions to log as a warning and add parameters to
I generally agree. However, I have a concern about how valuable logs would be in the extreme cases where this logic is likely to trigger. In our workloads it is not uncommon to have traces with 1 million or more spans, and our compactors discard up to 100K spans/s. A log of every span ID doesn't seem valuable at first glance unless it includes additional info such as the span name or service. At a minimum we could log the trace ID and the count of discarded spans, which would be valuable for debugging without adding significant overhead. The changes discussed sound sufficient to accomplish that. Thoughts?
I totally agree that updating the
Sounds good to me. Maybe we can log a warning, as discussed before, and log verbose details at trace level? For this verbose log we could opt for a nested count, something like:
Though I'm not sure whether that would be overkill. This option would also need to be documented somewhere, otherwise it would be useless. What do you think? Should we keep it simple and just log, as a warning, the total number of dropped spans per trace ID?
We may also want to log the following:
Tempo already logs this info when an ingester exceeds the
So maybe we should include those values in the compactor logs as well? (This could also be done in a follow-up PR after we add basic logging; I'm just trying to get a sense of what's ideal.) Also, the ingester logs this at the error level. Should the compactor log it at the error level as well?
Also, if folks aren't actively working on this, I'm interested in picking it up :)
I will take this one.
When a trace exceeds max_bytes_per_trace, the compactor drops spans. Currently this is tracked with a metric, but it would be nice to also log the trace ID and/or the spans that were dropped. We need to decide on the log level; a case could be made for warning, info, or debug. This looks straightforward to do, and adding it here and here would cover both the v2 and parquet formats (and should be future-proof for other formats).