We recently experienced a Prometheus cardinality spike: our `gin` exporter attaches a `path` label to the requests it serves, which produced half a million time series in less than a week. As a consequence, the director's memory usage grew to 5 GiB.
Since the director handles all object access requests and scrapes all other servers, high cardinality across various metric labels is expected. Although #1276 reduces the known sources of metric cardinality, we should also add limits to Prometheus that cap the memory, time series, and labels it can use, so that the Prometheus process cannot accidentally blow up the director (see the sketch below).
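As a starting point, here is a minimal sketch of the per-scrape limits Prometheus already exposes in its scrape configuration. The job name and all numeric values are placeholders for illustration only; since the director embeds Prometheus, these options would need to be surfaced through however the director builds its scrape configs.

```yaml
scrape_configs:
  - job_name: origins            # hypothetical job name, for illustration
    scrape_interval: 15s
    # Reject a scrape entirely if the target exposes more than this many
    # samples after metric relabeling (0 means unlimited).
    sample_limit: 10000
    # Cap the number of labels per sample and their name/value lengths.
    label_limit: 30
    label_name_length_limit: 200
    label_value_length_limit: 200
    # Cap how many targets this scrape job may have.
    target_limit: 500
    # Cap the uncompressed response body size per scrape.
    body_size_limit: 10MB
```

When `sample_limit` or one of the label limits is exceeded, Prometheus drops the whole scrape and marks the target as failed, so the limits act as a circuit breaker rather than silently truncating data.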
The Cloudflare blog has a good reference on which knobs to turn: https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale. We should adapt its recommendations accordingly.
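That post also recommends alerting on cardinality before limits are hit. A hedged sketch of such a rule is below; the alert name, threshold, and duration are placeholders, while `prometheus_tsdb_head_series` is the built-in metric for in-memory series count.

```yaml
groups:
  - name: director-cardinality            # hypothetical rule group name
    rules:
      - alert: DirectorTooManyTimeSeries  # hypothetical alert name
        # Placeholder threshold: warn when the number of in-memory series
        # approaches a level we consider unsafe for the director host.
        expr: prometheus_tsdb_head_series > 500000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus head series count on the director is unexpectedly high"
```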