
Compactor: Prometheus TODO metrics will disappear when everything is done #5587

Closed

jabdoa2 opened this issue Aug 11, 2022 · 11 comments


jabdoa2 commented Aug 11, 2022

Thanos, Prometheus and Golang version used:

Everything up to current master.

Object Storage Provider: irrelevant here

What happened:

The compactor records its outstanding work in thanos_compact_todo_compaction_blocks, thanos_compact_todo_deletion_blocks and thanos_compact_todo_downsample_blocks. Those metrics are very helpful while there is work to do. However, they disappear once everything is done, which makes it hard to differentiate between (a) everything is done and (b) the compactor is not working properly.

What you expected to happen:

Those metrics should return 0 when there is nothing to do.

How to reproduce it (as minimally and precisely as possible):

  1. Start compactor
  2. Give it something to do
  3. Check the metrics endpoint. The metrics are present.
  4. Wait until everything is done.
  5. Check the metrics endpoint again. The metrics are now gone.

Full logs to relevant components:

Anything else we need to know:


yeya24 commented Aug 15, 2022

This behavior is expected because the metrics are gauges. We need to reset the labels every iteration to clean up stale metrics from previous iterations. If you need a 0 value, maybe you could try sum(thanos_compact_todo_compactions) or vector(0) as a single query.
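A minimal sketch of that query as one expression (metric names as mentioned in this thread; they may vary between Thanos versions):

```promql
# Falls back to 0 once the gauge has been removed because no work is left;
# the same pattern works for the other todo gauges from the issue.
sum(thanos_compact_todo_compactions) or vector(0)
```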

However, they disappear once everything is done, which makes it hard to differentiate between (a) everything is done and (b) the compactor is not working properly.

A compactor that is not working is usually tracked by other, more meaningful metrics like up{job="thanos-compactor"}.


jabdoa2 commented Aug 15, 2022

A compactor that is not working is usually tracked by other, more meaningful metrics like up{job="thanos-compactor"}.

Yeah, that does not help. We had that check, it stopped working in December, and we only noticed in August this year. You need to monitor at least the halted metric, but even that is not reliable, as it resets to unhalted on every restart.

Monitoring the rate of completed iterations is the best we currently have. However, it's far from optimal. It would be nice to properly monitor outstanding work to make sure we do not build up too much of a backlog. I understand that those are labeled gauges. Can we have an overall gauge which becomes 0? That would be more helpful to us than the labeled ones.
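To make the two signals discussed here concrete, a rough sketch assuming the metric names thanos_compact_halted and thanos_compact_iterations_total that current Thanos versions expose (adjust to your deployment):

```promql
# Halt signal: 1 while the compactor is halted, but it resets to 0 on restart,
# which is exactly the flapping problem described above.
max(thanos_compact_halted) == 1

# Rate of completed iterations, the workaround signal mentioned above.
rate(thanos_compact_iterations_total[1h])
```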


yeya24 commented Aug 15, 2022

Yep, so you can monitor the halt metric along with the progress.

For the overall gauge, you can use the query I mentioned to turn it into 0 when no compaction is required.


jabdoa2 commented Aug 15, 2022

For the overall gauge, you can use the query I mentioned to turn it into 0 when no compaction is required.

Please have a look at my initial issue: this metric is also null when the compactor is not running. We really have trouble monitoring the compactor without noise because the metrics are unstable. This eventually leads to fatigue, and people will stop trusting the monitoring. We had a broken compactor for almost a year because people mostly ignored the issue, since restarting the pod fixed it for a while. So no, that is not a proper solution in my opinion.

Yep, so you can monitor the halt metric along with the progress.

As I stated above: the halt metric suffers from a similar issue. It becomes 0 after a restart, making people believe that everything is working now. We are looking for a robust way to monitor the compactor. We have workarounds at the moment but would still like a reliable metric without false positives and flapping.


yeya24 commented Aug 15, 2022

Please have a look at my initial issue: this metric is also null when the compactor is not running.

That's a totally different question. You could use something like absent(compact_xxx_build_info) to make sure the compactor is running. If you want to ensure it is not halting, you could combine the two.

You have to combine multiple metrics to get a complete view. There is currently no perfect solution that lets you look at only one metric. If you have any ideas, feel free to contribute.
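A minimal sketch of the absent() check, assuming the compactor is scraped as job="thanos-compactor" and exposes the standard thanos_build_info metric (compact_xxx_build_info above is a placeholder for whatever naming your deployment uses); it can be combined with the halted check sketched earlier:

```promql
# Fires when no build-info series from the compactor is present at all.
absent(thanos_build_info{job="thanos-compactor"})
```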


stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Nov 13, 2022

jabdoa2 commented Nov 13, 2022

Please have a look at my initial issue: this metric is also null when the compactor is not running.

That's a totally different question. You could use something like absent(compact_xxx_build_info) to make sure the compactor is running. If you want to ensure it is not halting, you could combine the two.

Unfortunately, that metric also exists when the compactor is broken.

For now we have settled on an ugly workaround and check whether we got at least one complete iteration within the last two hours. It isn't nice, but with such a low limit we don't get false positives.
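A rough sketch of that workaround as an alert expression, again assuming the thanos_compact_iterations_total counter mentioned earlier:

```promql
# Fires if the compactor has not completed a single iteration in the last two hours.
increase(thanos_compact_iterations_total[2h]) < 1
```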


SuperQ commented Apr 18, 2024

I wonder if this is somewhat fixed now; there is a Set(0) in the compaction code.

We refactored the compaction metrics to remove the group label.

#6049

stale bot removed the stale label Apr 18, 2024

jabdoa2 commented Apr 18, 2024

We switched to compactor cronjobs. Those either succeed or fail, so it became a non-issue for us. Unfortunately, we no longer scrape the metric, so I cannot tell.


SuperQ commented Apr 18, 2024

I verified with some of my clusters that we now see zero-values for these metrics, rather than absent.


jabdoa2 commented Apr 18, 2024

Thanks for verifying!

jabdoa2 closed this as completed Apr 18, 2024