
ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy #3990

Closed
dgrisonnet opened this issue Mar 29, 2021 · 12 comments · Fixed by #4342


@dgrisonnet
Contributor

Thanos, Prometheus and Golang version used:

Thanos mixins main.

What happened:

The ThanosSidecarUnhealthy alert never fires if the sidecar is never healthy.

What you expected to happen:

I would've expected the alert to fire 10 minutes after the sidecar starts.

How to reproduce it (as minimally and precisely as possible):

Create a sidecar with a wrong prometheus.url so that it is never able to reach Prometheus and thus never becomes healthy.

Anything else we need to know:

The problem lies in the Prometheus query used by the alert:

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240

If the Thanos sidecar is unhealthy from startup and remains in this state, the thanos_sidecar_last_heartbeat_success_time_seconds metric will not be initialized, so calling timestamp() on it will not return any value. As a result, the ThanosSidecarUnhealthy alert will not fire even though the sidecar is unhealthy.
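A Prometheus alerting rule only fires while its expression returns at least one sample, so a selector that matches nothing silently disables the rule. A minimal walk-through of the query above under that assumption (the "=> empty result" annotations are illustrative, not output copied from a real instance):

# No heartbeat time has ever been recorded, so the selector matches no series:
thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}    # => empty result
# timestamp() over an empty instant vector is still empty:
timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})    # => empty result
# ...and so is the final comparison, hence the alert never fires:
time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240    # => empty result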

@dgrisonnet
Contributor Author

Based on how the thanos_sidecar_last_heartbeat_success_time_seconds metric works today, I don't think this can be fixed just by changing the PromQL query. As a matter of fact, I gave it a couple of tries and was always blocked by the fact that the metric isn't initialized, so we can't really know whether the sidecar was unhealthy during startup.

That being said, I can see a couple of other solutions:

  • Initialize thanos_sidecar_last_heartbeat_success_time_seconds upon startup
  • Use thanos_sidecar_prometheus_up or a new up-like metric instead of thanos_sidecar_last_heartbeat_success_time_seconds (see the sketch below)
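A rough sketch of what the second option could look like as an alert expression, assuming thanos_sidecar_prometheus_up carries the same job/pod labels as the heartbeat metric and reports 0 while Prometheus is unreachable (both assumptions, not verified against the Thanos code):

# Hypothetical expression for the up-like option: fire for every sidecar that
# currently reports no working connection to Prometheus (label set assumed):
max(thanos_sidecar_prometheus_up{job="thanos-sidecar"}) by (job, pod) == 0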

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@dgrisonnet
Contributor Author

dgrisonnet commented Jun 2, 2021

This issue is still valid.

cc @arajkumar @slashpai can you please have a look once you have some time on your hands?

@stale stale bot removed the stale label Jun 2, 2021
@slashpai

slashpai commented Jun 3, 2021

@dgrisonnet ya I can take a look at this one

/assign

@paulfantom
Contributor

Wouldn't the following resolve this:

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

?

@arajkumar
Contributor

@paulfantom Yes, it should fix the problem. I will test it.

@arajkumar
Contributor

arajkumar commented Jun 11, 2021

IMHO, the timestamp(..) function should be removed, because the timestamp is updated on every scrape regardless of whether the value of thanos_sidecar_last_heartbeat_success_time_seconds changes.

The following query should ideally work:

time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

I will test it before raising a PR.

@arajkumar
Contributor

arajkumar commented Jun 14, 2021

@paulfantom @dgrisonnet

I tested with local instances using the following configuration:

prometheus#0

./prometheus --config.file=config.yaml --web.listen-address=0.0.0.0:9090 --storage.tsdb.path=/tmp/data0

prometheus#1

./prometheus --config.file=config.yaml --web.listen-address=0.0.0.0:9091 --storage.tsdb.path=/tmp/data1

thanos sidecar#0

./thanos sidecar --tsdb.path /tmp/data0 --prometheus.url http://localhost:9090

thanos sidecar#1

./thanos sidecar --tsdb.path /tmp/data1 --prometheus.url http://localhost:9091

Note: Prometheus is configured to scrape all instances of itself and Thanos.
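For reference, a minimal config.yaml matching that note could look like the sketch below. The sidecar targets are assumptions: 10902 is the default Thanos HTTP port, and the second sidecar would need a non-default --http-address (not shown in the commands above) to avoid a port clash.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090', 'localhost:9091']
  - job_name: thanos-sidecar
    static_configs:
      - targets: ['localhost:10902', 'localhost:10912']   # assumed sidecar HTTP ports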

Even after I suspend prometheus#0 (kill -TSTP <pid>), timestamp(thanos_sidecar_last_heartbeat_success_time_seconds) keeps changing with every scrape interval. IMHO, this is incorrect and the expression has to be modified to use the value of thanos_sidecar_last_heartbeat_success_time_seconds as is, because it already carries the last heartbeat timestamp.

Proposed expression

time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
OR
absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})

@dgrisonnet
Contributor Author

This issue can't be solved by adding absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) since the thanos_sidecar_last_heartbeat_success_time_seconds metric is initialized to 0 by default. The actual issue is that the metric is present, but it may always be 0 if the sidecar never ends up being healthy. This can be tested by connecting the sidecar to a nonexistent Prometheus instance.

Also, according to #3204, the timestamp primitive was added to wrap thanos_sidecar_last_heartbeat_success_time_seconds to handle the case where a heartbeat hasn't yet succeeded and the metric is initialized to 0.

If we were to continue with the same metric for this alert, may I suggest:

thanos_sidecar_last_heartbeat_success_time_seconds == bool 0
OR
time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240

@arajkumar
Contributor

@dgrisonnet IMHO, even without thanos_sidecar_last_heartbeat_success_time_seconds == bool 0 it is going to yield the same result.

@dgrisonnet
Contributor Author

No, you will end up in a situation where the alert will fire immediately since the query will be evaluated as time() >= 240.

@arajkumar
Contributor

arajkumar commented Jun 14, 2021

No, you will end up in a situation where the alert will fire immediately since the query will be evaluated as time() >= 240.

Okay. Do you think adding a for attribute would help here? The upstream rule doesn't have a for, which would cause the alert to be triggered immediately. It is good to keep the duration out of the alert expression.

In our case, CMO has a 1h duration for this alert.

EDIT:
I'm a bit skeptical about testing against 0 because it would hide the real symptom where Prometheus is not responding from startup for a prolonged duration.
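To make the for suggestion concrete, a rule along these lines would keep the duration outside the expression, reusing the proposed query from above. The group name, severity label, annotation wording, and the 1h duration borrowed from the CMO remark are all assumptions for illustration, not the change that eventually landed in #4342.

groups:
  - name: thanos-sidecar          # group name is an assumption
    rules:
      - alert: ThanosSidecarUnhealthy
        expr: |
          time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job,pod) >= 240
          OR
          absent(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})
        for: 1h                   # duration moved out of the expression
        labels:
          severity: warning       # assumed
        annotations:
          description: Thanos Sidecar {{ $labels.pod }} has been unhealthy for more than 1h.   # assumed wording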
