filter out stale spans from metrics generator #1612
Conversation
This looks good! I've left a bunch of comments, no major stuff just some suggestions/nitpicks.
modules/generator/instance.go
Outdated
@@ -261,14 +266,27 @@ func (i *instance) pushSpans(ctx context.Context, req *tempopb.PushSpansRequest)
func (i *instance) updatePushMetrics(req *tempopb.PushSpansRequest) {
I suggest renaming this method since its scope has changed. This method used to only read the spans, but now it will also modify the received request. Maybe change it to something like preprocessSpans? (Better suggestions are welcome.)
agree that the function needs to be renamed if it's going to mutate the contents of the push request.
Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
Yeah, that should be possible. Both the span metrics and the service graphs processor loop through the batches anyway and they process spans one by one.
It will be a bit tricky how we deal with the resource attributes though. We currently extract some data out of the resource attributes before looping through the instrumentation library spans individually.
Oh, that is kind of gross. We could nil out the span pointers and make it an expectation of the processors that some spans may be nil.
I think ideally we leave the processors alone and do some clever in place manipulation of the slices to remove the "bad" spans. This logic could get rather gross though.
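For illustration, a minimal sketch of that in-place approach: compact the span slice into its own backing array so nothing is reallocated, and return how many spans were dropped. It assumes the *v1.Span type already used in instance.go; filterSpansInPlace and keep are hypothetical names, not part of this PR.
// Sketch only: filter a batch's spans in place. The slice is compacted into
// its existing backing array, so no new slice is allocated, and the number
// of dropped spans is returned for the discarded-spans metric.
func filterSpansInPlace(spans []*v1.Span, keep func(*v1.Span) bool) ([]*v1.Span, int) {
	kept := spans[:0]
	for _, span := range spans {
		if keep(span) {
			kept = append(kept, span)
		}
	}
	return kept, len(spans) - len(kept)
}
The caller would assign the returned slice back to ils.Spans and add the dropped count to the discarded-spans counter.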
I could filter out the outdated spans inside the aggregateMetricsForSpan function and the consume function for service graph. But then we wouldn't be keeping track of the number of spans dropped in this situation.
But that would require every processor to implement the same logic, which will lead to duplicated work and code. I'd be interested to see how many spans we drop in practice; if only a small number of batches have to be reallocated, the impact will be fine. If we are constantly dropping spans that are too old we might have to re-evaluate.
If we need to get rid of these reallocations we could change the interface of Processor so it passes the resource attributes of the batch next to each span.
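Roughly, that interface change could look like the sketch below. The method name ConsumeSpan, the per-span signature, and the resource_v1/trace_v1 aliases (standing for the OTLP resource and trace proto packages) are assumptions, not the current Processor interface.
// Hypothetical shape of a per-span Processor interface. Handing each span to
// the processor together with its batch's resource avoids reallocating the
// span slices when individual spans are filtered out upstream.
type Processor interface {
	Name() string

	// ConsumeSpan receives one span along with the resource attributes of
	// the batch it belongs to.
	ConsumeSpan(ctx context.Context, resource *resource_v1.Resource, span *trace_v1.Span)

	Shutdown(ctx context.Context)
}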
@@ -44,6 +45,11 @@ var (
Name: "metrics_generator_bytes_received_total",
Help: "The total number of proto bytes received per tenant",
}, []string{"tenant"})
metricSpansDiscarded = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "tempo",
Name: "metrics_generator_spans_discarded_total",
I'd consider renaming to metrics_generator_discarded_spans_total just to make it similar to this other metric: https://github.com/grafana/tempo/blob/main/modules/overrides/discarded_spans.go#L12. Not a big deal though, both should show up in grafana 🤷🏻
I think it's good to have separate metrics since a span discarded in the metrics-generator is very different from a span discarded in the ingester/compactor.
Hmm, should we make it similar to the other discarded span name or should we keep it similar to the other metrics in the same space?
https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L39
Oh I see 🙃 Err, either is fine I guess? Maybe a slight preference for keeping it consistent with the other tempo_metrics_generator_ metrics then.
Naming is hard 😅
modules/generator/config.go
Outdated
// setting default for max span age before discarding to 30 sec
cfg.MaxSpanAge = 30
I'm curious how this default behaves in practice. I honestly have no clue what a typical latency is between span creation and ingestion by Tempo.
These are the ingestion latency numbers on ops for a few days. Do we think setting it at 30s is right, or is it too aggressive? @kvrhdn @joe-elliott
30s as a default looks good to me. It seems this should include 99% of the data while excluding the 1% that is lagging behind.
It's also configurable, so other people can switch it up.
modules/generator/instance.go
Outdated
var newSpansArr []*v1.Span
timeNow := time.Now().UnixNano()
for _, span := range ils.Spans {
	if span.EndTimeUnixNano >= uint64(timeNow-i.cfg.MaxSpanAge*1000000000) {
both sides of this time range need to be checked. If the user sends a span that's 5 days in the future it should not impact metrics.
we have similar code in the wal:
tempo/tempodb/wal/append_block.go, line 235 (commit 5c885f3):
func (a *AppendBlock) adjustTimeRangeForSlack(start uint32, end uint32, additionalStartSlack time.Duration) (uint32, uint32) {
Is there a reason why we pick 5 days?
I think 5 days was just an example. I think we can start with a symmetrical time range, i.e. use the same duration before and after time.Now(). If this doesn't work well we could still break it out into two config options.
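A sketch of that symmetrical check, applying the same slack on both sides of time.Now(); spanWithinSlack and its parameter names are hypothetical, and the *v1.Span type is assumed from the diff above.
// spanWithinSlack reports whether the span's end time falls inside
// [now-slack, now+slack], so spans too far in the past or in the future are
// both filtered out.
func spanWithinSlack(span *v1.Span, now time.Time, slack time.Duration) bool {
	lower := uint64(now.Add(-slack).UnixNano())
	upper := uint64(now.Add(slack).UnixNano())
	return span.EndTimeUnixNano >= lower && span.EndTimeUnixNano <= upper
}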
Doc updates look good.
Nice work!
What this PR does: This PR adds a configurable variable under the metrics generator to filter out any span that is older than the configured "metrics_ingestion_time_range_slack" before metrics are aggregated. The current default is set to 30s.
Which issue(s) this PR fixes:
Fixes #1537
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]