metrics-generator: add custom registry #1340

Merged · 15 commits merged into grafana:main from kvrhdn/metrics-generator-registry on Mar 25, 2022

Conversation

@yvrhdn (Member) commented Mar 10, 2022

What this PR does:
This introduces a custom registry to store metrics generated in the metrics-generator. The registry can scrape itself and enforce limits.

Until now the metrics-generator has relied on prometheus.Registry from prometheus/client_golang, which has a number of issues:

  • it is not possible to remove series: if the instance runs for a long time, it keeps resending metrics that haven't been updated in a long time
    • -> this increases the number of active series in the downstream TSDB while adding little value
  • it is not possible to enforce a maximum number of series per tenant
    • -> such a limit is useful to protect Tempo and the downstream TSDB against a cardinality spike in the ingested traces

This PR introduces a new Registry which loosely mirrors prometheus.Registerer. It supports counters and histograms; a rough sketch of the API follows the list below. There are two implementations:

  • ManagedRegistry: a registry that can scrape itself and write data into a storage.Appender. It will also enforce limits and remove stale series.
  • TestRegistry: a simple implementation to verify the correctness of processors using Registry.
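
For illustration, the public surface of such a registry might look roughly like the sketch below. This is a hedged sketch: method names and signatures are assumptions based on this thread (the Inc and Observe signatures are taken from snippets further down), not necessarily the exact API in the PR.

package registry

// LabelValues is a stand-in stub here; later in this thread it grows a cached hash.
type LabelValues struct {
	values []string
}

// Registry loosely mirrors prometheus.Registerer: processors ask it for
// counters and histograms instead of registering prometheus collectors.
type Registry interface {
	NewCounter(name string, labelNames []string) Counter
	NewHistogram(name string, labelNames []string, buckets []float64) Histogram
}

// Counter only ever increases (Inc panics on negative values, see the
// review discussion below).
type Counter interface {
	Inc(labelValues *LabelValues, value float64)
}

// Histogram records observations into count, sum and bucket series.
type Histogram interface {
	Observe(labelValues *LabelValues, value float64)
}

Processors would request a Counter or Histogram once and then update it per span, which is why the update path needs to be as cheap as possible (see the performance discussion further down).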

The config will look like this:

metrics_generator:
    registry:
        collection_interval: 15s
        stale_duration: 30m
        external_labels:
          - source: tempo

The overrides will look like this:

metrics_generator_registry_scrape_interval: 1m
metrics_generator_registry_max_active_series: 10000

To see the ManagedRegistry in practice: this is the number of active series when stale series are not removed. As new data shows up, the number of series being tracked grows continuously until the instance is restarted.

[Screenshot 2022-03-10 13:50: active series growing continuously without stale-series removal]

With the ManagedRegistry, stale series are removed after 15 minutes, resulting in a stable system. This ensures the number of active series in the downstream TSDB remains fairly constant over time.

[Screenshot 2022-03-10 13:49: active series remaining stable with stale-series removal enabled]
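
Conceptually, the stale-series removal works roughly like the sketch below, run once per collection cycle: any series that has not been updated within stale_duration is dropped and no longer sent downstream. This is a standalone illustration with assumed type and field names; the PR's implementation differs in detail.

package registry

import (
	"sync"
	"time"
)

// series is a minimal stand-in for one tracked metric series.
type series struct {
	labelValues []string
	value       float64
	lastUpdated int64 // unix milliseconds of the last update
}

// metric is a minimal stand-in for a counter or histogram that owns its series.
type metric struct {
	mtx    sync.Mutex
	series map[uint64]*series
}

// removeStaleSeries drops every series that has not been updated within
// staleDuration, so it stops being appended to the downstream TSDB.
func (m *metric) removeStaleSeries(staleDuration time.Duration) (removed int) {
	cutoff := time.Now().Add(-staleDuration).UnixMilli()

	m.mtx.Lock()
	defer m.mtx.Unlock()

	for hash, s := range m.series {
		if s.lastUpdated < cutoff {
			delete(m.series, hash)
			removed++
		}
	}
	return removed
}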

Which issue(s) this PR fixes:
Related to #1303

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yvrhdn yvrhdn force-pushed the kvrhdn/metrics-generator-registry branch 3 times, most recently from 5c883ba to bdb0161 on March 11, 2022 14:38
@yvrhdn yvrhdn force-pushed the kvrhdn/metrics-generator-registry branch from bdb0161 to 67d2df0 on March 11, 2022 15:13
@yvrhdn yvrhdn marked this pull request as ready for review March 11, 2022 15:33
Comment on lines 69 to 70
//serviceGraphUnpairedSpansTotal registry.Counter
//serviceGraphDroppedSpansTotal registry.Counter
@yvrhdn (Member Author) commented:

These metrics weren't updated before, so I don't know if we should keep them around. I'll check if Grafana actually uses these in the service graphs view.

We already expose these counts as Tempo's own operational metrics:

  • metrics_generator_processor_service_graphs_dropped_spans
  • metrics_generator_processor_service_graphs_unpaired_edges

So a Tempo operator can see these stats already.

Member commented:

We probably need to have these per tenant so that individual tenants can get a sense of the quality of their metrics. Alternatively, instead of exposing these directly, it might be better to use the new tempo_warnings_total metric with a specific reason label for these situations.

@yvrhdn (Member Author) replied:

We currently keep track of dropped spans and unpaired edges per tenant. So we already collect the data, I think it will just be a matter of exposing them to the end-user.
tempo_warnings_total looks interesting, I'll take a look at what warnings we want to share from the metrics-generator in general.
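
For illustration, routing these situations through tempo_warnings_total with a reason label could look roughly like this hypothetical sketch using prometheus/client_golang; the package, label and reason names here are assumptions, not actual Tempo code.

package servicegraphs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// metricWarnings is a shared warnings counter with a reason label, instead of
// dedicated dropped-spans/unpaired-edges metrics per processor.
var metricWarnings = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tempo",
	Name:      "warnings_total",
	Help:      "Total number of warnings per tenant, with a reason label.",
}, []string{"tenant", "reason"})

// Hypothetical helpers showing how the processor would record these
// situations per tenant.
func reportUnpairedEdge(tenant string) {
	metricWarnings.WithLabelValues(tenant, "service_graphs_unpaired_edge").Inc()
}

func reportDroppedSpan(tenant string) {
	metricWarnings.WithLabelValues(tenant, "service_graphs_dropped_span").Inc()
}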

@yvrhdn yvrhdn mentioned this pull request Mar 13, 2022
@yvrhdn (Member Author) commented Mar 14, 2022

I think I made a design mistake implementing this: looking at some profiling data, this implementation spends a lot of time in labels.Builder allocating and growing slices.

Issues:

  • we do a lot more work to update metrics than to scrape them, even though updating is far more frequent (it is correlated with spans ingested/sec) while scraping happens only once every 15s or 1m
    • updating metrics should be as cheap as possible and allocate as little as possible
  • labels.Labels is inefficient since we often grow slices as we add additional labels (like __name__) and then have to sort them again
    • I think we should avoid building labels.Labels until it's actually necessary (i.e. when appending data)

@yvrhdn yvrhdn marked this pull request as draft March 14, 2022 11:02
@yvrhdn (Member Author) commented Mar 14, 2022

These last two commits change the design and seem to tackle most performance issues. Changes:

  • Instead of using Prometheus' labels.Labels to calculate the hash of a series, we now hash only the label values. We already know the label names and they are fixed per metric (see the sketch after this list).
  • Instead of storing all series together in the registry, we store them inside the counter/histogram: this makes updating more efficient because we have more context, but it makes managing the active series a bit more awkward (I had to introduce the onAddMetricSeries callbacks for that)
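
A minimal sketch of hashing only the label values, assuming the cespare/xxhash/v2 package; the PR's exact hashing code may differ.

package registry

import (
	"github.com/cespare/xxhash/v2"
)

// hashLabelValues hashes only the label values of a series. The label names
// are fixed per metric, so they do not need to be part of the hash. The
// separator byte avoids collisions between e.g. {"ab", "c"} and {"a", "bc"}.
func hashLabelValues(values []string) uint64 {
	h := xxhash.New()
	for _, v := range values {
		_, _ = h.WriteString(v)
		_, _ = h.Write([]byte{255})
	}
	return h.Sum64()
}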

This PR is still not as efficient as before, but it's not horrible:

  • before: we could ingest 180k spans/sec with an instance using 2 CPU
  • now: we can only ingest 110k spans/sec (also at 2 CPU)

I'll explore further improvements, but I don't expect major design changes.

@yvrhdn yvrhdn marked this pull request as ready for review March 14, 2022 14:30
@joe-elliott (Member) left a comment:

This generally looks fine to me. As discussed offline there are a few performance improvements to consider:

  • Precalculate the hash for a set of label values
  • Don't bother incrementing 0 on histogram buckets
  • Consider where pointers could be dropped in favor of label values.

@yvrhdn (Member Author) commented Mar 21, 2022

Switched design a bit and this can now do 40MB/s or 200k spans/sec per instance, which is on par with performance before this PR 🎉

  • introduced a LabelValues struct that caches the hash of the label values. This means we only calculate the hash once, even if 2-4 metrics are updated with the same values (see the sketch after this list).
  • split up metric into a dedicated counter and histogram type: this results in some duplicated code, but each type is optimised for its own kind of series. For histograms this means we only manage one series entry for the count, sum and bucket metrics.
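
A hedged sketch of such a LabelValues struct, building on the hashLabelValues helper sketched earlier (same assumed package; field and method names are illustrative).

// LabelValues wraps a set of label values and memoizes their hash, so the
// hash is computed only once even when the same values are used to update
// several metrics.
type LabelValues struct {
	values []string
	hash   uint64
	hashed bool
}

func NewLabelValues(values []string) *LabelValues {
	return &LabelValues{values: values}
}

func (l *LabelValues) getHash() uint64 {
	if !l.hashed {
		l.hash = hashLabelValues(l.values) // helper from the earlier sketch
		l.hashed = true
	}
	return l.hash
}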

Some stuff I tried that didn't work out:

  • changing map[uint64]*serie to map[uint64]serie lowers memory usage a tad (~100 MB), but makes CPU and P99 latency spiky. This is kind of expected, since we need to lock the entire map to update a serie:
	hash := labelValues.getHash()

	// the entire map stays locked for the whole update, since the serie
	// value has to be written back into the map
	m.seriesMtx.Lock()
	defer m.seriesMtx.Unlock()

	s, ok := m.series[hash]
	if ok {
		s.value += value
		s.lastUpdated = time.Now().UnixMilli()
		m.series[hash] = s
		return
	}

	// new serie: ask the registry whether we are allowed to add one
	if !m.onAddSerie() {
		return
	}

	labelValuesCopy := make([]string, len(labelValues.values))
	copy(labelValuesCopy, labelValues.values)

	m.series[hash] = serie{
		labelValues: labelValuesCopy,
		value:       value,
		lastUpdated: time.Now().UnixMilli(),
	}

By using *serie we only need to lock the map to look up the serie; all other updates to the serie can happen concurrently.
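
A sketch of that pointer-based fast path, reusing the LabelValues sketch from above and go.uber.org/atomic for the per-serie fields. This is an illustration under those assumptions; the PR's actual code may differ.

package registry

import (
	"sync"
	"time"

	"go.uber.org/atomic"
)

// serie with atomic fields: once we hold the *serie, updating it no longer
// requires the map lock.
type serie struct {
	labelValues []string
	value       *atomic.Float64
	lastUpdated *atomic.Int64
}

type counter struct {
	seriesMtx   sync.RWMutex
	series      map[uint64]*serie
	onAddSeries func(count uint32) bool
}

func (c *counter) Inc(labelValues *LabelValues, value float64) {
	if value < 0 {
		panic("counter can only increase")
	}
	hash := labelValues.getHash()

	// fast path: read-lock the map only to fetch the pointer
	c.seriesMtx.RLock()
	s, ok := c.series[hash]
	c.seriesMtx.RUnlock()

	if ok {
		s.value.Add(value)
		s.lastUpdated.Store(time.Now().UnixMilli())
		return
	}

	// slow path: write-lock the map to insert a new serie
	c.seriesMtx.Lock()
	defer c.seriesMtx.Unlock()

	if s, ok := c.series[hash]; ok { // another goroutine may have inserted it meanwhile
		s.value.Add(value)
		s.lastUpdated.Store(time.Now().UnixMilli())
		return
	}
	if !c.onAddSeries(1) { // the registry refused: per-tenant series limit reached
		return
	}

	labelValuesCopy := make([]string, len(labelValues.values))
	copy(labelValuesCopy, labelValues.values)

	c.series[hash] = &serie{
		labelValues: labelValuesCopy,
		value:       atomic.NewFloat64(value),
		lastUpdated: atomic.NewInt64(time.Now().UnixMilli()),
	}
}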

@yvrhdn (Member Author) commented Mar 21, 2022

Performance comparison without and with this PR. This looks good now ✅

This is with every metrics-generator instance processing ~40MB/s and 170k spans/s

[Screenshot 2022-03-21 19:10: performance comparison dashboards]

@yvrhdn yvrhdn requested a review from joe-elliott March 21, 2022 20:28
modules/generator/registry/histogram.go (2 outdated review threads, resolved)
@yvrhdn yvrhdn closed this Mar 24, 2022
@yvrhdn yvrhdn reopened this Mar 24, 2022
@yvrhdn (Member Author) commented Mar 24, 2022

Trying to get GitHub Actions to run...


func (c *counter) Inc(labelValues *LabelValues, value float64) {
	if value < 0 {
		panic("counter can only increase")
Contributor commented:

Should we panic here? Would an error log and early return be sufficient?

Contributor replied:

I think it's ok, the Prometheus client does the same, although its method is the more general Add (vs Inc, which takes no params).

@yvrhdn (Member Author) replied Mar 24, 2022:

Yeah, we could handle this a bit more elegantly, but I didn't want to bother adding a log.Logger field to the counter or returning an error just in case someone misuses the counter.
A panic isn't nice, but it should ensure whoever tries to decrease the counter detects it before it ships.

Prometheus client_golang also just panics btw: https://github.com/prometheus/client_golang/blob/main/prometheus/counter.go#L108-L110

Edit: didn't see Marty already answered the same 😅

@mapno (Member) left a comment:

Overall LGTM, great improvements. Limiting active series and pruning inactive ones will be very useful. Left a few comments.

seriesMtx sync.RWMutex
series map[uint64]*counterSeries

onAddSeries func(count uint32) bool
Member commented:

I think calling these "callbacks" can be a bit confusing, since they intervene in the creation of the metrics (i.e. the result of onAddSeries determines whether the series is created or not).

@yvrhdn (Member Author) replied:

I agree these things are confusing. I don't like that we have to use anonymous functions to couple the counters/histograms to the registry, but I'm not sure what would work better. I wanted to keep these functions somewhat generic so they're easier to test; injecting the full managedRegistry means we have to test them together all the time.

Part of the complexity is that onAddSeries is used both for checking whether we can add series and for keeping count of them. We could split this up into two callbacks, canAddSeries and incActiveSeries, but this is 1) more work and 2) might introduce a race condition between asking whether you can add series and reporting that you added them (see the sketch below).
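
A minimal sketch of how a single combined callback could both enforce the limit and keep the count atomically. The names and the optimistic-rollback approach are assumptions, not necessarily the PR's exact code.

package registry

import "go.uber.org/atomic"

// managedRegistry hands one callback to every counter/histogram. Doing
// "may I add?" and "I added" in a single call avoids the race that a split
// canAddSeries/incActiveSeries pair would introduce.
type managedRegistry struct {
	activeSeries    *atomic.Uint32
	maxActiveSeries func() uint32 // per-tenant override, 0 means unlimited
}

func (r *managedRegistry) onAddSeries(count uint32) bool {
	limit := r.maxActiveSeries()

	newTotal := r.activeSeries.Add(count) // optimistic increment
	if limit != 0 && newTotal > limit {
		r.activeSeries.Sub(count) // roll back, the series will not be created
		return false
	}
	return true
}

func (r *managedRegistry) onRemoveSeries(count uint32) {
	r.activeSeries.Sub(count)
}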

modules/generator/registry/histogram.go (review thread, resolved)
	h.onRemoveSerie = onRemoveSeries
}

func (h *histogram) Observe(labelValues *LabelValues, value float64) {
Member commented:

Should it check that the value is not negative too?

@yvrhdn (Member Author) replied:

I'm using Prometheus client_golang library as reference and they seem to allow negative values: https://github.com/prometheus/client_golang/blob/main/prometheus/histogram.go#L307-L309

I've never seen a histogram with negative values though, so I'm not sure if it works as intended 🤷🏻

modules/generator/registry/counter.go (outdated review thread, resolved)
@mdisibio (Contributor) left a comment:

Nice work!

@yvrhdn yvrhdn merged commit e9008c5 into grafana:main Mar 25, 2022
@yvrhdn yvrhdn deleted the kvrhdn/metrics-generator-registry branch March 25, 2022 16:12
@yvrhdn yvrhdn mentioned this pull request Apr 5, 2022