Log more information when a trace is too large to compact #1931

mdisibio · 2022-12-02T16:00:38Z

When a trace exceeds max_bytes_per_trace, the compactor drops spans. Currently this is tracked with a metric, but it would be nice to also log the trace ID and/or the spans that were dropped. We need to decide on the log level; a case could be made for warning, info, or debug.

This looks straightforward to do, and adding it here and here would cover both the v2 and parquet formats (and should be future-proof for other formats).

mattiaforc · 2022-12-05T19:05:05Z

Hi @mdisibio, I'd like to take this issue as my first contribution to the Tempo project.

I have a few considerations before submitting a PR:

We could log the spans/traces while we iterate here; this function is called only here, when calculating the number of spans that will be dropped before incrementing the metric. Alternatively, we could do it after the countSpans invocation on line 239 by running the same kind of for loops, collecting the IDs of both traces and spans and logging them all.
Here, unless we change the interface, we can't log which trace IDs and span IDs are going to be dropped.

As far as logging is concerned, IMHO, one usually looks for specific traces in a tracing system when debugging or trying to identify something wrong or faulty, and expects the system to be a totally reliable source of truth. So a log like this should be as complete as possible, and the log level should be warning (to warn developers that they should perhaps reduce the size of the trace because they are losing some spans).

mdisibio · 2022-12-06T12:48:29Z

Hi, thanks for looking into this, and for the very thorough research already. Your suggestions to log as a warning and add parameters to Compactor.RecordDiscardedSpans() and CompactorOptions.SpansDiscarded sound good. Since countSpans is only called in one place, as you discovered, updating this method sounds good, and we could rename it for clarity. I lean towards this approach rather than after the countSpans invocation ... by doing the same kind of for loops, since that would unmarshal the proto twice to access the spans.

a log like this should be as complete as possible

I definitely agree in general. However, I have a concern about how valuable the logs would be in the extreme cases where this logic is likely to trigger. In our workloads it is not uncommon to have traces with 1 million or more spans, and our compactors discard up to 100K spans/s. A log of every span ID doesn't seem valuable at first glance, unless it includes additional info such as the span name or service. At a minimum we could log the trace ID and the count of discarded spans, which would be valuable for debugging and would not add significant overhead. The changes discussed sound sufficient to accomplish that.

Thoughts?
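
A minimal sketch of the "trace ID and count" log line discussed above, using only the Go standard library. The helper name formatDiscardedSpans, the message text, and the field names are assumptions made for illustration; they are not Tempo's actual logging code.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// formatDiscardedSpans builds a logfmt-style warning line carrying only the
// trace ID and the number of discarded spans. Hypothetical helper: the
// message and field names are illustrative, not taken from Tempo.
func formatDiscardedSpans(traceID []byte, discarded int) string {
	return fmt.Sprintf("level=warn msg=%q traceID=%s discardedSpans=%d",
		"trace too large to compact", hex.EncodeToString(traceID), discarded)
}

func main() {
	// Shortened, made-up trace ID for the example.
	id := []byte{0x97, 0xfc, 0x46, 0x6b}
	fmt.Println(formatDiscardedSpans(id, 1200))
}
```

One line per oversized trace keeps the overhead independent of how many spans are dropped, which matters at the discard rates mentioned above.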

mattiaforc · 2022-12-06T20:54:13Z

I totally agree that updating the countSpans method is the best solution; I'll implement it that way.

At the minimum we could log the trace ID and count of discarded spans

Sounds good to me.
The only thing that concerns me is that there would be no way to actually identify the missing spans. That is probably fine in the vast majority of cases, but I think the user should have the option to turn on a more complete log.

Maybe we can log as a warning as we stated before, and log verbose details at a trace level? For this verbose log we could opt for a nested count - something like:

TraceID
  |
  |
  Service name 
    |
    | 
    Span name 1: COUNT
    Span name n: COUNT

That said, I'm not sure whether this would be overkill. The option should also be documented somewhere, otherwise it would be useless.

What do you think? Should we keep it simple and just log, as a warning, the total number of discarded spans per trace ID?
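
The nested count sketched in this comment could be aggregated roughly as follows. This is an illustration under assumptions: the discardedSpan type and the countByServiceAndName function are hypothetical names, and in practice the service and span names would come from the unmarshaled trace rather than a hand-built slice.

```go
package main

import "fmt"

// discardedSpan is a hypothetical, minimal view of a dropped span:
// only the fields needed for the nested count.
type discardedSpan struct {
	Service string
	Name    string
}

// countByServiceAndName groups discarded spans as
// service name -> span name -> count, for a single trace.
func countByServiceAndName(spans []discardedSpan) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	for _, s := range spans {
		if counts[s.Service] == nil {
			counts[s.Service] = make(map[string]int)
		}
		counts[s.Service][s.Name]++
	}
	return counts
}

func main() {
	// Made-up data for the example.
	spans := []discardedSpan{
		{Service: "checkout", Name: "GET /cart"},
		{Service: "checkout", Name: "GET /cart"},
		{Service: "checkout", Name: "POST /pay"},
		{Service: "db", Name: "SELECT"},
	}
	for service, names := range countByServiceAndName(spans) {
		for name, n := range names {
			fmt.Printf("service=%s span=%s count=%d\n", service, name, n)
		}
	}
}
```

The output size grows with the number of distinct service and span names rather than with the number of spans, so it stays bounded even for very large traces.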

modulitos · 2022-12-31T07:40:58Z

We may also want to log the following:

the size of the traces (or the combined size, i.e. len(combinedObj))
the maxBytes value and the tenantID value

Tempo already logs this info when an ingester exceeds the max_bytes_per_trace config:

TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 421754 bytes to trace 97fc466bbb4839c2d7276f78dc37e31e for tenant single-tenant

So maybe we should include those values in the compactor logs as well? (This could be done in a follow-up PR after we add basic logging; I'm just trying to get a sense of what's ideal.)

Also, the ingester logs this at the error log level. Should the compactor log it at the error level as well?
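
As a rough sketch, a compactor message carrying the same fields as the ingester message quoted above might be assembled like this. The struct, its field names, and the message wording are assumptions for illustration only; the combined size and span count in the example are made-up values, and nothing here reflects what was eventually merged.

```go
package main

import "fmt"

// compactorTraceTooLarge is a hypothetical record of the values suggested
// in this comment: tenant, trace ID, configured limit, and combined size.
type compactorTraceTooLarge struct {
	TenantID       string
	TraceID        string // hex-encoded trace ID
	MaxBytes       int    // configured max_bytes_per_trace
	CombinedBytes  int    // size of the combined object, e.g. len(combinedObj)
	DiscardedSpans int
}

// String renders the record in a shape similar to the ingester message.
func (e compactorTraceTooLarge) String() string {
	return fmt.Sprintf(
		"max size of trace (%d) exceeded while compacting: combined size %d bytes, discarded %d spans from trace %s for tenant %s",
		e.MaxBytes, e.CombinedBytes, e.DiscardedSpans, e.TraceID, e.TenantID)
}

func main() {
	e := compactorTraceTooLarge{
		TenantID:       "single-tenant",
		TraceID:        "97fc466bbb4839c2d7276f78dc37e31e",
		MaxBytes:       5000000,
		CombinedBytes:  5421754, // illustrative value
		DiscardedSpans: 1200,    // illustrative value
	}
	fmt.Println(e)
}
```

Keeping the field set aligned with the ingester message would let operators search both components' logs with the same trace ID and tenant filters.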

modulitos · 2022-12-31T07:41:16Z

Also, if folks aren't actively working on this, I'm interested in picking it up :)

scalalang2 · 2023-02-07T14:25:20Z

I will take this one.

mdisibio added the good first issue label Dec 2, 2022

joe-elliott added the enhancement label Dec 2, 2022

mapno assigned scalalang2 Feb 7, 2023

scalalang2 mentioned this issue Feb 14, 2023

Log when a trace is too large to compact #2105

Merged

3 tasks

mapno closed this as completed in #2105 Feb 24, 2023
