
Conversation

@mnpw (Contributor) commented Feb 17, 2024

What

  • add purge_timeout option to PrometheusBuilder
  • run a purger that purges based on the purge_timeout

Implements the third approach prescribed here to purge old histogram data:

  • update the builder to generate a future which both drives the Hyper server future as well as a call to get_recent_metrics on an interval

Fixes #245
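
Roughly, the intended shape is a future that drives the server alongside an interval task that triggers the purge. The sketch below is only an illustration: `Inner`, `serve`, and `purge` are hypothetical stand-ins for the exporter's internals (a Tokio runtime with the `time` and `macros` features is assumed), not the crate's actual API.

```rust
use std::sync::Arc;
use std::time::Duration;

struct Inner; // stand-in for the exporter's shared state

impl Inner {
    fn purge(&self) {
        // in the real exporter this is where old histogram data would be drained
    }
}

async fn serve(_inner: Arc<Inner>) {
    // stand-in for the Hyper server future
}

// The builder would return a future like this: it drives the server and,
// on every `purge_timeout` tick, triggers a purge.
async fn exporter_future(inner: Arc<Inner>, purge_timeout: Duration) {
    let purger = {
        let inner = Arc::clone(&inner);
        async move {
            let mut interval = tokio::time::interval(purge_timeout);
            loop {
                interval.tick().await;
                inner.purge();
            }
        }
    };

    tokio::select! {
        _ = serve(Arc::clone(&inner)) => {}
        _ = purger => {}
    }
}
```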

tobz and others added 30 commits May 16, 2021 10:47
Rework static metric names and add new routing layer
…her_fix

metrics-exporter-prometheus: sanitize matchers the same as input keys
…enhancements

metrics-util: make generation tracking configurable in Registry
The Atomic::compare_exchange function only appears in 0.9.2.
t1ha causes a global-buffer-overflow when testing under ASAN. This
change switches the default hash implementation to aHash, which takes a
similar approach to performance.

Switching to aHash removes the global-buffer-overflow, and the tests
complete successfully.

In general, cargo bench shows performance improvements as high as 31% on
specific benches.

There are 3 regressions and 29 improvements in total.
@mnpw mnpw force-pushed the purge-old-histogram-data branch 2 times, most recently from dbbb01d to 139bdd0 on February 17, 2024 at 21:12
- add purge_timeout option to PrometheusBuilder
- run a purger that purges based on the purge_timeout
@mnpw mnpw force-pushed the purge-old-histogram-data branch from 139bdd0 to f4abc32 on February 18, 2024 at 18:32
@tobz (Member) commented Feb 20, 2024

As I read back through what I originally wrote... I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

I think what we'd actually have to do here is split the histogram-draining logic out of get_recent_metrics. Essentially, take this code and move it into a new method on the struct -- let's call it run_upkeep for this example -- and trim it down so it only iterates through each histogram, finds/creates the entry in self.distributions, and drains samples into that entry. After that, we'd call that method right before this block, and remove the histogram-draining logic from that block.
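
A rough sketch of what that split might look like, using simplified stand-in types (plain maps of samples) rather than the crate's real `Inner`/`Registry`/`Distribution` structures:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Inner {
    // pending histogram samples, keyed by metric name
    histograms: Mutex<HashMap<String, Vec<f64>>>,
    // accumulated distributions that rendering reads from
    distributions: Mutex<HashMap<String, Vec<f64>>>,
}

impl Inner {
    /// Drain pending histogram samples into `distributions`. This is the part
    /// that is safe to run on a timer: it never applies idle-timeout logic.
    fn run_upkeep(&self) {
        let mut histograms = self.histograms.lock().unwrap();
        let mut distributions = self.distributions.lock().unwrap();
        for (name, samples) in histograms.iter_mut() {
            let entry = distributions.entry(name.clone()).or_default();
            entry.extend(samples.drain(..));
        }
    }

    /// Rendering calls `run_upkeep` first and then reads `distributions`;
    /// the recency / idle-timeout checks would live only here.
    fn get_recent_metrics(&self) -> String {
        self.run_upkeep();
        let distributions = self.distributions.lock().unwrap();
        format!("{:?}", *distributions) // placeholder for real rendering
    }
}
```

The point is that run_upkeep only moves samples into distributions, so running it on a timer never expires anything; the idle-timeout handling stays tied to rendering.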

Does that make sense?

@mnpw (Contributor, Author) commented Feb 22, 2024

Thanks for the update!

> I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so, then I think eagerly calling that method would be fine. It would be helpful to understand your concern with an example.

@tobz (Member) commented Feb 28, 2024

> Thanks for the update!
>
> > I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.
>
> Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so, then I think eagerly calling that method would be fine. It would be helpful to understand your concern with an example.

(Sorry for the delayed response here.)

Yes, you're right that this would be the crux of the concern.

The example would be something like:

  • the exporter is configured with an idle timeout of 10s and a purger interval of 5s
  • the exporter is scraped at t=0, which includes metric A
  • metric A is updated at t=5 (meaning it will go idle at t=15 if not updated again by then)
  • the purger runs at t=5 and t=10; since nothing is idle yet, all it does is drain the histogram buckets
  • the purger runs again at t=15, but now metric A is considered idle and is removed from the registry
  • the exporter is scraped at t=30, which no longer includes metric A

Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A; if we were only rendering during scrapes, it would have been included in the second scrape.
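
For intuition, the kind of check involved is roughly the following (a simplified illustration only; the real exporter tracks per-metric "generations" via metrics-util's Recency rather than raw timestamps):

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for the idle check that rendering applies: a metric
// that hasn't been updated within the idle timeout is dropped.
fn should_store(last_update: Instant, idle_timeout: Duration) -> bool {
    last_update.elapsed() < idle_timeout
}
```

Running that kind of check from a timer-driven purger would remove metric A at t=15 even though no scrape ever observed it going idle, which is why the purger should only drain histograms.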

@mnpw (Contributor, Author) commented Mar 7, 2024

> Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A; if we were only rendering during scrapes, it would have been included in the second scrape.

Ah, this makes sense now that I've checked the documentation for PrometheusBuilder::idle_timeout:

> This behavior is driven by requests to generate rendered output, and so metrics will not be removed unless a request has been made recently enough to prune the idle metrics.

I have now separated out the purge action so that it only drains the histograms.

@mnpw mnpw closed this Mar 8, 2024
@mnpw mnpw force-pushed the purge-old-histogram-data branch from b87237b to 568e0fb on March 8, 2024 at 07:38


Development

Successfully merging this pull request may close these issues.

Clearing/Expiration of old values in histogram/limit to histogram