
ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy #3990

Closed
dgrisonnet opened this issue Mar 29, 2021 · 12 comments · Fixed by #4342


@dgrisonnet
Contributor

Thanos, Prometheus and Golang version used:

Thanos mixins main.

What happened:

The ThanosSidecarUnhealthy alert never fires if the sidecar is never healthy.

What you expected to happen:

I would've expected the alert to fire 10 minutes after the sidecar starts.

How to reproduce it (as minimally and precisely as possible):

Create a sidecar with a wrong prometheus.url so that it is never able to reach Prometheus and thus never becomes healthy.

Anything else we need to know:

The problem lies in the Prometheus query used by the alert:

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240

If the Thanos sidecar is unhealthy from startup and remains in this state, the thanos_sidecar_last_heartbeat_success_time_seconds metric will not be initialized, so calling timestamp() on it will not return any value. As a result, the ThanosSidecarUnhealthy alert will not fire even though the sidecar is unhealthy.
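A Prometheus alerting rule only fires while its expression returns at least one sample, so a selector that matches nothing silently disables the rule. A minimal walk-through of the query above under that assumption (the "=> empty result" annotations are illustrative, not output copied from a real instance):

# No heartbeat time has ever been recorded, so the selector matches no series:
thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}    # => empty result
# timestamp() over an empty instant vector is still empty:
timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})    # => empty result
# ...and so is the final comparison, hence the alert never fires:
time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240    # => empty result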

@dgrisonnet
Contributor Author

Based on how the thanos_sidecar_last_heartbeat_success_time_seconds metric works today, I don't think this can be fixed just by changing the PromQL query. As a matter of fact, I gave it a couple of tries and was always blocked by the fact that the metric isn't initialized, so we can't really know whether the sidecar was unhealthy during startup.

That being said, I can see a couple of other solutions:

  • Initialize thanos_sidecar_last_heartbeat_success_time_seconds upon startup
  • Use thanos_sidecar_prometheus_up or a new up-like metric instead of thanos_sidecar_last_heartbeat_success_time_seconds (see the sketch below)
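A rough sketch of what the second option could look like as an alert expression, assuming thanos_sidecar_prometheus_up carries the same job/pod labels as the heartbeat metric and reports 0 while Prometheus is unreachable (both assumptions, not verified against the Thanos code):

# Hypothetical expression for the up-like option: fire for every sidecar that
# currently reports no working connection to Prometheus (label set assumed):
max(thanos_sidecar_prometheus_up{job="thanos-sidecar"}) by (job, pod) == 0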

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@dgrisonnet
Contributor Author

dgrisonnet commented Jun 2, 2021

This issue is still valid.

cc @arajkumar @slashpai can you please have a look once you have some time on your hands?

@stale stale bot removed the stale label Jun 2, 2021
@slashpai

slashpai commented Jun 3, 2021

@dgrisonnet ya I can take a look at this one

/assign

@paulfantom
Contributor

Wouldn't the following resolve this:

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

?

@arajkumar
Contributor

@paulfantom Yes, it should fix the problem. I will test it.

@arajkumar
Contributor

arajkumar commented Jun 11, 2021

IMHO, the timestamp(..) function should be removed, because the timestamp is updated on every scrape regardless of whether the value of thanos_sidecar_last_heartbeat_success_time_seconds changes.

The following query should ideally work:

time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

I will test it before raising a PR.

@arajkumar
Contributor

arajkumar commented Jun 14, 2021

@paulfantom @dgrisonnet

I tested with local instances using the following configuration:

prometheus#0

./prometheus --config.file=config.yaml --web.listen-address=0.0.0.0:9090 --storage.tsdb.path=/tmp/data0

prometheus#1

./prometheus --config.file=config.yaml --web.listen-address=0.0.0.0:9091 --storage.tsdb.path=/tmp/data1

thanos sidecar#0

./thanos sidecar --tsdb.path /tmp/data0 --prometheus.url http://localhost:9090

thanos sidecar#1

./thanos sidecar --tsdb.path /tmp/data1 --prometheus.url http://localhost:9091

Note: Prometheus is configured to scrape all instances of itself and Thanos.
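For reference, a minimal config.yaml matching that note could look like the sketch below. The sidecar targets are assumptions: 10902 is the default Thanos HTTP port, and the second sidecar would need a non-default --http-address (not shown in the commands above) to avoid a port clash.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090', 'localhost:9091']
  - job_name: thanos-sidecar
    static_configs:
      - targets: ['localhost:10902', 'localhost:10912']   # assumed sidecar HTTP ports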

Even after I suspend prometheus#0 (kill -TSTP <pid>), timestamp(thanos_sidecar_last_heartbeat_success_time_seconds) keeps changing with every scrape interval. IMHO, this is incorrect and the expression has to be modified to use the value of thanos_sidecar_last_heartbeat_success_time_seconds as is, because it already carries the last heartbeat timestamp.

Proposed expression

time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

@dgrisonnet
Contributor Author

This issue can't be solved by adding absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) since the thanos_sidecar_last_heartbeat_success_time_seconds metric is initialized to 0 by default. The actual issue is that the metric is present, but it may always be 0 if the sidecar never ends up being healthy. This can be tested by connecting the sidecar to a nonexistent Prometheus instance.

Also, according to #3204, the timestamp primitive was added to wrap thanos_sidecar_last_heartbeat_success_time_seconds to handle the case where a heartbeat hasn't yet succeeded and the metric is initialized to 0.

If we were to continue with the same metric for this alert, may I suggest:

thanos_sidecar_last_heartbeat_success_time_seconds == bool 0
OR
time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240

@arajkumar
Contributor

@dgrisonnet IMHO, even without thanos_sidecar_last_heartbeat_success_time_seconds == bool 0 it is going to yield the same result.

@dgrisonnet
Contributor Author

No, you will end up in a situation where the alert will fire immediately since the query will be evaluated as time() >= 240.

@arajkumar
Contributor

arajkumar commented Jun 14, 2021

No, you will end up in a situation where the alert will fire immediately since the query will be evaluated as time() >= 240.

Okay. Do you think adding a for attribute would help here? The upstream rule doesn't have a for, which would cause the alert to be triggered immediately. It is good to keep the duration out of the alert expression.

In our case, CMO has a 1h duration for this alert.

EDIT:
I'm a bit skeptical about testing against 0 because it would hide the real symptom where Prometheus is not responding from startup for a prolonged duration.
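To make the for suggestion concrete, a rule along these lines would keep the duration outside the expression, reusing the proposed query from above. The group name, severity label, annotation wording, and the 1h duration borrowed from the CMO remark are all assumptions for illustration, not the change that eventually landed in #4342.

groups:
  - name: thanos-sidecar          # group name is an assumption
    rules:
      - alert: ThanosSidecarUnhealthy
        expr: |
          time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
          OR
          absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})
        for: 1h                   # duration moved out of the expression
        labels:
          severity: warning       # assumed
        annotations:
          description: Thanos Sidecar {{ $labels.pod }} has been unhealthy for more than 1h.   # assumed wording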
