Promtail generates very high cardinality prometheus metrics #5553
Comments
Thanks for reporting this @m1keil. It's a fair point, but I don't think we will make the change you're suggesting and export metrics at different verbosity levels based on a flag. It adds some complexity and begs the unanswerable question: "what is a detailed metric?". We can't answer that for everyone in every scenario, so we'd prefer to export everything and have users filter out what they don't need. On that note: How does that sound?
We use the match stage, but I don't see how it can help here, given that:
If there's something I'm missing here, let me know.
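For context, a `match` stage operates on log entries selected by their stream labels and runs nested stages only on those entries, which is why it cannot reduce the per-file series Promtail exports about itself on `/metrics`. A minimal, illustrative sketch (the job name, labels, selector, and expression are hypothetical, not taken from this issue):

```yaml
scrape_configs:
  - job_name: example                      # hypothetical job
    static_configs:
      - targets: [localhost]
        labels:
          app: nginx                       # hypothetical stream label
          __path__: /var/log/nginx/*.log   # hypothetical log path
    pipeline_stages:
      # match selects entries whose labels satisfy the selector and applies
      # the nested stages only to those log lines.
      - match:
          selector: '{app="nginx"}'
          stages:
            # drop discards matching log lines; it has no effect on the
            # per-file, path-labelled metrics Promtail exposes on /metrics.
            - drop:
                expression: ".*debug.*"
```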
I guess that's fair enough, but I do think you need to consider sane defaults for Promtail first. High-cardinality Prometheus metrics are a known issue that should be handled with care. Some people might enable this without realising it is a problem.
OK, then I think you need to drop the metrics at scrape time; I can't think of an alternative solution for you.
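Dropping at scrape time would look roughly like the sketch below on the Prometheus side, assuming the high-cardinality series are the ones carrying a per-file `path` label; the job name and target address are illustrative:

```yaml
scrape_configs:
  - job_name: promtail                # hypothetical job scraping Promtail's /metrics
    static_configs:
      - targets: ["promtail:9080"]    # assumed Promtail HTTP listen address
    metric_relabel_configs:
      # Drop every scraped series that has a non-empty `path` label, i.e. the
      # per-file metrics that drive the cardinality discussed in this issue.
      - source_labels: [path]
        regex: ".+"
        action: drop
```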
The problem is choosing what to cut out. You never know which metric will be critically useful to an operator. In this case, Promtail could not expose any fewer metrics without understanding the specifics of your system.
I think in this case the answer can be simple: detailed = potentially unbounded cardinality. This is a common idiom in the Prometheus world; for example, cAdvisor and node_exporter don't enable all of their metadata or metrics by default.
This is true regardless of how Promtail behaves by default. But sane defaults can prevent mistakes that operators (will) make, the same way we don't ship software to prod with DEBUG logging.
We're missing each other on a fundamental point:
Well, yes, but I'm not saying don't allow exposing them at all; all I'm saying is "maybe you shouldn't do that by default". Looking at the existing per-file metrics:
Now, are these metrics helpful? Probably. Are they so important that they should be on by default? Maybe a subset. It would be nice if… Maybe as a compromise, we can at least add a note about the potential high cardinality in the docs?
Seems like a reasonable compromise to me. Can you submit the PR? I want to give you credit for this suggestion.
Closing this one, we can reference it in the PR 👍
* Hint about potential high cardinality metrics

  Add a warning about potentially high-cardinality metrics due to the file path being included in some of the metrics. Depending on the setup, such a config can result in a very large number of label values, which is not recommended by Prometheus. This can result in increased storage, slower queries, extra costs, and so on. Addresses what has been discussed in #5553.

* Reword observability.md

Co-authored-by: Karen Miller <84039272+KMiller-Grafana@users.noreply.github.com>
Describe the bug
We use the `consulagent_sd_configs` scrape config to automatically detect and tail logs on our instances. These instances run Nomad clients that schedule all the workloads.

While inspecting the metrics exported by Promtail via `/metrics`, I can see large sets of data points labelled with the file path being tracked by Promtail:

I have only 5 services running on this node, but this generates up to 153 different label sets (due to the metric being a summary/histogram).
Each service can have:
Here's a verbose example:
To make matters worse, the final number of possible label values is multiplied by the N nodes that run Nomad.
This can generate so many labels that it can bring a decent Prometheus cluster to its knees.
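For reference, the kind of setup described above is roughly the sketch below; the Consul agent address, relabel rules, and log path are assumptions for illustration, and the meta label name is assumed to mirror the standard Consul SD labels:

```yaml
scrape_configs:
  - job_name: nomad-services
    consulagent_sd_configs:
      - server: "localhost:8500"             # assumed local Consul agent address
    relabel_configs:
      # Use the discovered service name as the job label
      # (meta label name assumed; it mirrors the usual Consul SD labels).
      - source_labels: [__meta_consulagent_service]
        target_label: job
      # Point Promtail at the service's log files; every file matched here
      # becomes a distinct `path` label value on the exported metrics.
      - replacement: /var/log/services/*.log  # hypothetical log location
        target_label: __path__
```

With a glob like this per service and several files per allocation, the number of distinct `path` values grows with services × files per service × nodes, which is the cardinality described above.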
As far as I can tell, there is no way to disable this detailed monitoring, and the only option I have at the moment would be to drop these metrics at scrape time. It would be useful to have such detailed monitoring enabled on demand rather than by default (`---server.detailed-instrumentation`).