Compactor: Prometheus TODO metrics will disappear when everything is done #5587
Comments
This behavior is expected because the metrics are gauges. We need to reset the labels every iteration to clean up stale metrics from previous iterations. If you need a 0 value, maybe you could try …
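A common PromQL pattern for getting a 0 value when the gauge currently exports no series is roughly the following (a sketch; it may not be the exact query this comment refers to):

```promql
# Sum the todo gauge and fall back to 0 when no series exist at all.
sum(thanos_compact_todo_compaction_blocks) or vector(0)
```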
Compactor not working is usually tracked by some other, more meaningful metrics like …
Yeah, that does not help. We had that check and it stopped working in December. We only noticed in August this year. You need to monitor at least the halted metric, but even that is not reliable, as it becomes unhalted on every restart. Monitoring the rate of complete iterations is the best we currently have. However, it's far from optimal. It would be nice to properly monitor outstanding work to make sure we do not backlog too much. I understand that those are labeled gauges. Can we have an overall gauge which becomes 0? That would be more helpful to us than the labeled ones.
Yep, so you can monitor the halt metrics along with the progress. For the overall gauge, you can use the query I mentioned to turn it to 0 when no compaction is required.
Please have a look at my initial issue: this metric is also null when the compactor is not running. We have real trouble monitoring the compactor without noise because the metrics are unstable. This eventually leads to fatigue, and people stop trusting the monitoring. We had a broken compactor for almost a year because people mostly ignored the issue, since restarting the pod fixed it for a while. So no, that is not a proper solution in my opinion.
As I stated above: the halted metric suffers a similar issue. It becomes 0 after a restart, making people believe that everything is working now. We are looking for a robust way to monitor the compactor. We have workarounds at the moment, but we would still like a reliable metric without false positives and flapping.
That's a totally different question. You could use something like … You have to combine multiple metrics together to get a complete view. There is no perfect solution right now that lets you look at just one metric. If you have any ideas, feel free to contribute.
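A rough sketch of such a combined check, assuming the halted gauge exposed by the compactor (verify the metric name against your Thanos version):

```promql
# Alert if the compactor reports it has halted on an error, or if the halted
# metric is missing entirely (for example because the compactor is not running).
thanos_compact_halted == 1 or absent(thanos_compact_halted)
```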
Hello 👋 Looks like there was no activity on this issue for the last two months.
Unfortunately, that metric also exists when the compactor is broken. For now we settled on an ugly workaround and check whether we got at least one complete iteration within the last two hours. It isn't nice, but with such a low limit we don't get false positives.
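That workaround could look roughly like the following (a sketch; the iterations counter name is an assumption and should be checked against the deployed Thanos version):

```promql
# Fire when not even one complete compaction iteration finished
# within the last two hours.
increase(thanos_compact_iterations_total[2h]) < 1
```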
I wonder if this is somewhat fixed now, there is a … We refactored the compaction metrics to remove the …
We switched to compactor cronjobs. Those either succeed or fail, so it became a non-issue for us. Unfortunately, we no longer scrape the metric, so I cannot tell.
I verified with some of my clusters that we now see zero values for these metrics, rather than absent ones.
Thanks for verifying!
Thanos, Prometheus and Golang version used:
Everything up to current master.
Object Storage Provider: irrelevant here
What happened:
The compactor records its todos in thanos_compact_todo_compaction_blocks, thanos_compact_todo_deletion_blocks and thanos_compact_todo_downsample_blocks. Those metrics are very helpful when something is left to do. However, they disappear when everything is done, which makes it hard to differentiate between (a) everything is done and (b) the compactor is not working properly.

What you expected to happen:
Those metrics should return 0 when there is nothing to do.
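For illustration, with the current behavior only an absence-based check is possible, and it is ambiguous, whereas zero-valued gauges would allow a plain threshold (a PromQL sketch):

```promql
# Today this returns 1 both when all work is done and when the compactor is
# broken or not running, so the two cases cannot be told apart.
absent(thanos_compact_todo_compaction_blocks)

# If the gauges were exported as 0 instead, a simple threshold would be
# enough to alert on a growing backlog.
thanos_compact_todo_compaction_blocks > 0
```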
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know: