
Compactor: Prometheus TODO metrics will disappear when everything is done #5587

Closed

jabdoa2 opened this issue Aug 11, 2022 · 11 comments


jabdoa2 commented Aug 11, 2022

Thanos, Prometheus and Golang version used:

Everything up to current master.

Object Storage Provider: irrelevant here

What happened:

The compactor records its outstanding work in thanos_compact_todo_compaction_blocks, thanos_compact_todo_deletion_blocks and thanos_compact_todo_downsample_blocks. Those metrics are very helpful while there is work to do. However, they disappear once everything is done, which makes it hard to differentiate between (a) everything is done and (b) the compactor is not working properly.

What you expected to happen:

Those metrics should return 0 when there is nothing to do.

How to reproduce it (as minimally and precisely as possible):

  1. Start compactor
  2. Give it something to do
  3. Check the metrics endpoint. The metrics are present.
  4. Wait until everything is done.
  5. Check the metrics endpoint again. The metrics are now gone.

Full logs to relevant components:

Anything else we need to know:


yeya24 commented Aug 15, 2022

This behavior is expected because the metrics are gauges. We need to reset the labels every iteration to clean up stale metrics from previous iterations. If you need a 0 value, maybe you could try sum(thanos_compact_todo_compactions) or vector(0) as a single query.
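A minimal sketch of that query as one expression (metric names as mentioned in this thread; they may vary between Thanos versions):

```promql
# Falls back to 0 once the gauge has been removed because no work is left;
# the same pattern works for the other todo gauges from the issue.
sum(thanos_compact_todo_compactions) or vector(0)
```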

However, they disappear once everything is done, which makes it hard to differentiate between (a) everything is done and (b) the compactor is not working properly.

A compactor that is not working is usually tracked by other, more meaningful metrics like up{job="thanos-compactor"}.


jabdoa2 commented Aug 15, 2022

A compactor that is not working is usually tracked by other, more meaningful metrics like up{job="thanos-compactor"}.

Yeah, that does not help. We had that check, it stopped working in December, and we only noticed in August this year. You need to monitor at least the halted metric, but even that is not reliable, as it resets to unhalted on every restart.

Monitoring the rate of completed iterations is the best we currently have. However, it's far from optimal. It would be nice to properly monitor outstanding work to make sure we do not build up too much of a backlog. I understand that those are labeled gauges. Can we have an overall gauge which becomes 0? That would be more helpful to us than the labeled ones.
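To make the two signals discussed here concrete, a rough sketch assuming the metric names thanos_compact_halted and thanos_compact_iterations_total that current Thanos versions expose (adjust to your deployment):

```promql
# Halt signal: 1 while the compactor is halted, but it resets to 0 on restart,
# which is exactly the flapping problem described above.
max(thanos_compact_halted) == 1

# Rate of completed iterations, the workaround signal mentioned above.
rate(thanos_compact_iterations_total[1h])
```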


yeya24 commented Aug 15, 2022

Yep, so you can monitor the halt metric along with the progress.

For the overall gauge, you can use the query I mentioned to turn it into 0 when no compaction is required.


jabdoa2 commented Aug 15, 2022

For the overall gauge, you can use the query I mentioned to turn it into 0 when no compaction is required.

Please have a look at my initial issue: this metric is also null when the compactor is not running. We really have trouble monitoring the compactor without noise because the metrics are unstable. This eventually leads to fatigue, and people will stop trusting the monitoring. We had a broken compactor for almost a year because people mostly ignored the issue, since restarting the pod fixed it for a while. So no, that is not a proper solution in my opinion.

Yep, so you can monitor the halt metric along with the progress.

As I stated above: the halt metric suffers from a similar issue. It becomes 0 after a restart, making people believe that everything is working now. We are looking for a robust way to monitor the compactor. We have workarounds at the moment but would still like a reliable metric without false positives and flapping.


yeya24 commented Aug 15, 2022

Please have a look at my initial issue: this metric is also null when the compactor is not running.

That's a totally different question. You could use something like absent(compact_xxx_build_info) to make sure the compactor is running. If you want to ensure it is not halting, you could combine the two.

You have to combine multiple metrics to get a complete view. There is currently no perfect solution that lets you look at only one metric. If you have any ideas, feel free to contribute.
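A minimal sketch of the absent() check, assuming the compactor is scraped as job="thanos-compactor" and exposes the standard thanos_build_info metric (compact_xxx_build_info above is a placeholder for whatever naming your deployment uses); it can be combined with the halted check sketched earlier:

```promql
# Fires when no build-info series from the compactor is present at all.
absent(thanos_build_info{job="thanos-compactor"})
```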


stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Nov 13, 2022

jabdoa2 commented Nov 13, 2022

Please have a look at my initial issue: this metric is also null when the compactor is not running.

That's a totally different question. You could use something like absent(compact_xxx_build_info) to make sure the compactor is running. If you want to ensure it is not halting, you could combine the two.

Unfortunately, that metric also exists when the compactor is broken.

For now we have settled on an ugly workaround and check whether we got at least one complete iteration within the last two hours. It isn't nice, but with such a low limit we don't get false positives.
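A rough sketch of that workaround as an alert expression, again assuming the thanos_compact_iterations_total counter mentioned earlier:

```promql
# Fires if the compactor has not completed a single iteration in the last two hours.
increase(thanos_compact_iterations_total[2h]) < 1
```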


SuperQ commented Apr 18, 2024

I wonder if this is somewhat fixed now; there is a Set(0) in the compaction code.

We refactored the compaction metrics to remove the group label.

#6049

stale bot removed the stale label Apr 18, 2024

jabdoa2 commented Apr 18, 2024

We switched to compactor cronjobs. Those either succeed or fail, so it became a non-issue for us. Unfortunately, we no longer scrape the metric, so I cannot tell.


SuperQ commented Apr 18, 2024

I verified with some of my clusters that we now see zero-values for these metrics, rather than absent.


jabdoa2 commented Apr 18, 2024

Thanks for verifying!

jabdoa2 closed this as completed Apr 18, 2024