metrics-generator: add custom registry #1340
Conversation
//serviceGraphUnpairedSpansTotal registry.Counter
//serviceGraphDroppedSpansTotal registry.Counter
These metrics weren't updated before, so I don't know if we should keep them around. I'll check if Grafana actually uses these in the service graphs view.
We already expose these as Tempo's own operational metrics:
metrics_generator_processor_service_graphs_dropped_spans
metrics_generator_processor_service_graphs_unpaired_edges
So a Tempo operator can see these stats already.
We probably need to have these per tenant so that individual tenants can get a feel for the quality of their metrics. Alternatively, instead of exposing these directly, it might be better to use the new tempo_warnings_total metric with a specific reason label for these situations.
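A minimal sketch of what that could look like with client_golang; the metric name comes from the comment above, but the label names and the reason value are assumptions for illustration, not necessarily what Tempo uses:

```go
package servicegraphs

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical warnings counter: the "tenant" and "reason" label names and
// the reason value below are assumptions; only the metric name is taken
// from the discussion above.
var warningsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "tempo_warnings_total",
	Help: "Total number of warnings, partitioned by tenant and reason.",
}, []string{"tenant", "reason"})

func init() {
	prometheus.MustRegister(warningsTotal)
}

// recordUnpairedEdge is a hypothetical helper showing how an unpaired edge
// could be surfaced as a warning instead of a dedicated metric.
func recordUnpairedEdge(tenant string) {
	warningsTotal.WithLabelValues(tenant, "service_graphs_unpaired_edge").Inc()
}
```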
We currently keep track of dropped spans and unpaired edges per tenant, so we already collect the data; I think it will just be a matter of exposing it to the end user.

tempo_warnings_total looks interesting, I'll take a look at what warnings we want to share from the metrics-generator in general.
I think I made a design mistake implementing this: looking at some profiling data, this implementation spends a lot of time in … Issues:
These last two commits change the design and seem to tackle most performance issues. Changes:

This PR is still not as efficient as before, but it's not horrible:

I'll explore further improvements, but I don't expect major design changes.
This generally looks fine to me. As discussed offline, there are a few performance improvements to consider:
- Precalculate the hash for a set of label values (see the sketch after this list)
- Don't bother incrementing histogram buckets by 0
- Consider where pointers could be dropped in favor of label values.
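A rough sketch of the first suggestion: since the Inc path in this PR already calls labelValues.getHash(), the hash could be computed once and memoized on LabelValues so that repeated lookups across counters and histograms reuse it. The FNV-1a hash and the memoization details here are assumptions, not necessarily what the PR ends up doing:

```go
package registry

import "hash/fnv"

// LabelValues caches the hash of its values so it is computed at most once,
// even when the same label set is handed to several metrics.
type LabelValues struct {
	values []string
	hash   uint64
}

func newLabelValues(values []string) *LabelValues {
	return &LabelValues{values: values}
}

// getHash lazily computes and memoizes an FNV-1a hash over the values.
// A zero hash is used as the "not yet computed" sentinel, which is fine
// for a sketch but would need care in real code.
func (l *LabelValues) getHash() uint64 {
	if l.hash != 0 {
		return l.hash
	}
	h := fnv.New64a()
	for _, v := range l.values {
		h.Write([]byte(v))
		h.Write([]byte{255}) // separator to avoid ambiguous concatenations
	}
	l.hash = h.Sum64()
	return l.hash
}
```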
Switched design a bit and this can now do 40MB/s or 200k spans/sec per instance, which is on par with performance before this PR 🎉
Some stuff I tried that didn't work out:
```go
hash := labelValues.getHash()

m.seriesMtx.Lock()
defer m.seriesMtx.Unlock()

// Fast path: the series already exists, bump it in place.
s, ok := m.series[hash]
if ok {
	s.value += value
	s.lastUpdated = time.Now().UnixMilli()
	m.series[hash] = s
	return
}

// New series: ask the registry whether we are still within the limits.
if !m.onAddSerie() {
	return
}

// Copy the label values so the caller can safely reuse its slice.
labelValuesCopy := make([]string, len(labelValues.values))
copy(labelValuesCopy, labelValues.values)

m.series[hash] = serie{
	labelValues: labelValuesCopy,
	value:       value,
	lastUpdated: time.Now().UnixMilli(),
}
```

by using …
Trying to get GitHub Actions to run...
func (c *counter) Inc(labelValues *LabelValues, value float64) {
	if value < 0 {
		panic("counter can only increase")
Should we panic here? Would an error log and early return be sufficient?
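For reference, the non-panicking alternative suggested here would look roughly like the sketch below. It is only an illustration (the discussion that follows keeps the panic), and the stub types exist just to make the snippet compile on its own:

```go
package registry

import "log"

// Minimal stubs so this sketch compiles on its own; the real types live in
// the PR's registry package.
type LabelValues struct{ values []string }
type counter struct{}

// Inc, rewritten to log and drop a negative increment instead of panicking.
// The PR itself keeps the panic, matching prometheus/client_golang.
func (c *counter) Inc(labelValues *LabelValues, value float64) {
	if value < 0 {
		log.Printf("counter can only increase, ignoring increment of %f", value)
		return
	}
	// ... normal increment path as in the PR ...
}
```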
I think it's ok, the Prometheus client does the same, although the method there is the more general Add vs Inc, which takes no params.
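For reference, this is the split in client_golang: Inc() takes no arguments, while the general Add(float64) is the one that panics on a negative argument:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Example counter.",
	})
	c.Inc()    // no parameters, always +1
	c.Add(2.5) // general form; panics if the argument is negative
}
```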
Yeah, we could handle this a bit more elegantly, but I didn't want to bother adding a log.Logger field to the counter or returning an error just in case someone misuses the counter. A panic isn't nice, but it should ensure that whoever tries to decrease the counter detects it before it ships.

Prometheus client_golang also just panics btw: https://github.com/prometheus/client_golang/blob/main/prometheus/counter.go#L108-L110
Edit: didn't see Marty already answered the same 😅
Overall LGTM, great improvements. Limiting active series and pruning inactive ones will be very useful. Left a few comments.
	seriesMtx sync.RWMutex
	series    map[uint64]*counterSeries

	onAddSeries func(count uint32) bool
I think calling these callbacks can be a bit confusing since they intervene in the creation of the metrics (i.e. the result of onAddSeries will determine whether the series is created or not).
I agree these things are confusing. I don't like that we have to use anonymous functions to couple the counters/histograms to the registry, but I'm not sure what would work better. I wanted to keep these functions somewhat generic so they're easier to test; injecting the full managedRegistry means we have to test them together all the time.

Part of the complexity is that onAddSeries is used both for checking whether we can add a series and for keeping count of them. We could split this up into two callbacks, canAddSeries and incActiveSeries, but that is 1) more work and 2) might introduce a race condition between asking whether you can add series and reporting that you added them.
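To illustrate the race-condition point: if the registry side of the callback does the limit check and the bookkeeping in one atomic step, there is no window between "can I add?" and "I added" for two goroutines to both slip past the limit. The sketch below is an assumption about how that side could look, not the PR's actual code; only the callback names come from the diff:

```go
package registry

import "sync/atomic"

// managedRegistry is a sketch; only the onAddSeries/onRemoveSeries names
// are taken from the diff, the fields and limit check are assumptions.
type managedRegistry struct {
	activeSeries    uint32
	maxActiveSeries uint32 // 0 means "no limit"
}

// onAddSeries checks the limit and reserves the new series in one
// compare-and-swap loop, so checking and counting cannot race.
func (r *managedRegistry) onAddSeries(count uint32) bool {
	for {
		current := atomic.LoadUint32(&r.activeSeries)
		if r.maxActiveSeries > 0 && current+count > r.maxActiveSeries {
			return false
		}
		if atomic.CompareAndSwapUint32(&r.activeSeries, current, current+count) {
			return true
		}
		// Another goroutine changed the count in the meantime; retry.
	}
}

func (r *managedRegistry) onRemoveSeries(count uint32) {
	atomic.AddUint32(&r.activeSeries, ^uint32(count-1)) // atomic subtract
}
```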
	h.onRemoveSerie = onRemoveSeries
}

func (h *histogram) Observe(labelValues *LabelValues, value float64) {
Should it check that the value is not negative too?
I'm using the Prometheus client_golang library as a reference and they seem to allow negative values: https://github.com/prometheus/client_golang/blob/main/prometheus/histogram.go#L307-L309

I've never seen a histogram with negative values though, so I'm not sure if it works as intended 🤷🏻
Nice work!
What this PR does:
This introduces a custom registry to store metrics generated in the metrics-generator. The registry can scrape itself and enforce limits.
So far the metrics-generator has relied on the prometheus.Registry from prometheus/client_golang. This has a number of issues:

This PR introduces a new Registry which loosely mirrors prometheus.Registerer. It supports counters and histograms. There are two implementations:

- ManagedRegistry: a registry that can scrape itself and write data into a storage.Appender. It will also enforce limits and remove stale series.
- TestRegistry: a simple implementation to verify the correctness of processors using Registry.
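A rough sketch of the shape of this API, inferred from the snippets in this diff; the constructor names and signatures are assumptions and the actual interface in the PR may differ:

```go
package registry

// LabelValues is defined in the PR; it is stubbed here only so the sketch
// is self-contained.
type LabelValues struct{ values []string }

// Counter and Histogram mirror the behaviour visible in the diff: Inc and
// Observe both take a pre-built set of label values.
type Counter interface {
	Inc(labelValues *LabelValues, value float64)
}

type Histogram interface {
	Observe(labelValues *LabelValues, value float64)
}

// Registry hands out metrics. ManagedRegistry and TestRegistry would both
// implement it; the method names below are assumptions.
type Registry interface {
	NewCounter(name string, labelNames []string) Counter
	NewHistogram(name string, labelNames []string, buckets []float64) Histogram
}
```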
The config will look like this:

The overrides will look like this:

To see the ManagedRegistry in practice: this is the amount of active series when we are not removing stale series. As new data shows up, the number of metric series being tracked constantly increases, until the instance is restarted. With the ManagedRegistry, stale series are removed after 15 minutes, resulting in a stable system. This will ensure the amount of active series in the downstream TSDB remains fairly constant over time.

Which issue(s) this PR fixes:
Related to #1303
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]