Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log more information when a trace is too large to compact #1931

Closed
mdisibio opened this issue Dec 2, 2022 · 6 comments · Fixed by #2105
Closed

Log more information when a trace is too large to compact #1931

mdisibio opened this issue Dec 2, 2022 · 6 comments · Fixed by #2105
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@mdisibio
Copy link
Contributor

mdisibio commented Dec 2, 2022

When a trace exceeds max_bytes_per_trace the compactor will drop spans. Currently this is tracked with a metric but it would be nice to log the trace ID and/or spans that were dropped. Need to decide on log level, a case could be made for warning, info, or debug.

This looks straightforward to do and adding here and here would cover both v2 and parquet formats (and should be future-proof for other formats).

@mdisibio mdisibio added the good first issue Good for newcomers label Dec 2, 2022
@joe-elliott joe-elliott added the enhancement New feature or request label Dec 2, 2022
@mattiaforc
Copy link

Hi @mdisibio, I'd like to take this issue as my first contribution to the Tempo project.

I have a few considerations before submitting a PR:

  • We could log the spans/traces while we iterate here, this function is called only here when calculating the number of spans that will be dropped before incrementing the metric. Or, we could do that after the countSpans invocation on line 239 by doing the same kind of for loops, obtaining the ids for both traces and spans and logging them all.
  • Here, unless we change the interface, we can't log here which trace IDs and span IDs are going to be dropped.

As far as logging concerns, IMHO, one usually looks after specific traces on tracing systems when debugging/trying to identify something wrong or faulty, and expects the system to be a totally reliable source of truth - so a log like this should be as complete as possible, and the log level should be warning (to warn developers that maybe they should try to reduce the size of the trace because they are losing some spans).

@mdisibio
Copy link
Contributor Author

mdisibio commented Dec 6, 2022

Hi, thanks for looking into this and very thorough research already. Your suggestions to log as a warning and add parameters to Compactor.RecordDiscardedSpans() and CompactorOptions.SpansDiscarded sound good. Since countSpans is only called in one place as you discovered, updating this method sounds good and we could rename it for clarity. I lean towards this approach rather than after the countSpans invocation ... by doing the same kind of for loops since that would perform double proto unmarshaling to access to spans.

a log like this should be as complete as possible

I definitely agree generally. However I have a concern about how valuable logs would be in the extreme cases where this logic is likely to trigger. In our workloads it is not uncommon to have traces with 1 million or more spans and our compactors discard up to 100K spans/s. A log of every span ID doesn't seem valuable at first glance, unless including additional info such as the span name or service. At the minimum we could log the trace ID and count of discarded spans which would be valuable for debugging and not add significant overhead. The changes discussed sound sufficient to accomplish that.

Thoughts?

@mattiaforc
Copy link

I totally agree that updating the countSpans method is the best solution, I'll implement it that way.

At the minimum we could log the trace ID and count of discarded spans

Sounds good to me.
Only thing that concerns me is that there would be no way to actually identify the missing spans - that is probably just fine for the vast majority of cases - but I think that the option to turn a more complete log on should be present for the user.

Maybe we can log as a warning as we stated before, and log verbose details at a trace level? For this verbose log we could opt for a nested count - something like:

TraceID
  |
  |
  Service name 
    |
    | 
    Span name 1: COUNT
    Span name n: COUNT

Even though I'm not sure it would be overkill. This option should also be documented somewhere, otherwise it would be useless.

What do you think? Should we keep it simple and simply log as a warning the total of failed spans - by traceID?

@modulitos
Copy link
Contributor

modulitos commented Dec 31, 2022

We may also want to log the following:

  1. the size of the traces (or the combined size, ie: len(combinedObj))
  2. the maxBytes value, and the tenantID values

Tempo already logs this info when an ingester exceeds the max_bytes_per_trace config:

TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 421754 bytes to trace 97fc466bbb4839c2d7276f78dc37e31e for tenant single-tenant

So maybe we should include those values in the compactor logs as well? (it can be done as a followup PR as well after we add basic logging - I'm just trying to get a sense of what's ideal)

Also, the ingester logs this at the error log level. Should the compactor log it at the error level as well?

@modulitos
Copy link
Contributor

modulitos commented Dec 31, 2022

Also, if folks aren't actively working on this, I'm interested in picking it up :)

@scalalang2
Copy link
Contributor

I will take this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants