
mixin: Use sidecar's metric timestamp for healthcheck #3204

Merged
merged 1 commit into thanos-io:main on Mar 5, 2021

Conversation

@hwoarang (Contributor) commented on Sep 22, 2020

During Prometheus updates the alert was firing because the metric was
initialized with a value of '0' before the first heartbeat was sent. As a
result, the alert evaluation effectively took only the value of time() into
account, which gave a misleading picture of the sidecar's health.

Since the thanos_sidecar_last_heartbeat_success_time_seconds metric is
effectively just a timestamp that resets on new deployments, we can simply
wrap it in the timestamp() function, which returns almost the same value as
the metric itself, with the added benefit that heartbeat resets are ignored.

Signed-off-by: Markos Chandras markos@chandras.me

  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@hwoarang force-pushed the add-2m-delay-sidecar branch 2 times, most recently from 8f7b458 to e097018 on September 22, 2020 12:37
@hwoarang changed the title from "mixin: Use 2m interval for Thanos sidecar healthcheck" to "mixin: Use sidecar's metric timestamp for healthcheck" on Sep 24, 2020
@hwoarang requested a review from kakkoyun on October 1, 2020 08:28
stale bot commented Nov 30, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Nov 30, 2020
@stale stale bot removed the stale label Nov 30, 2020
@hwoarang (Contributor, Author) commented

Updated the PR to match the latest master branch.

@kakkoyun any thoughts here?

stale bot commented Jan 30, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jan 30, 2021
@stale stale bot removed the stale label Feb 1, 2021
@hwoarang (Contributor, Author) commented on Feb 1, 2021

@kakkoyun @bwplotka would it be possible to merge this? If not, do you have any feedback on how to bring it into a mergeable state? :) Thank you

@kakkoyun (Member) left a comment


LGTM

@kakkoyun (Member) commented on Feb 1, 2021

@hwoarang This looks good to me in principle; however, you need to generate the docs and make sure this fulfils the expected behaviour with tests. CI already points out where it falls short. You can find all the necessary tasks as Make actions.

@kakkoyun (Member) left a comment


Requesting changes as commented above.

@hwoarang force-pushed the add-2m-delay-sidecar branch 3 times, most recently from 7dba321 to 4568191 on February 12, 2021 12:26
@hwoarang (Contributor, Author) commented

@hwoarang This looks good to me in principle; however, you need to generate the docs and make sure this fulfils the expected behaviour with tests. CI already points out where it falls short. You can find all the necessary tasks as Make actions.

@kakkoyun thank you for your input and apologies for taking a while to address your concerns. Tests are passing now; I had to refactor them a little since we are now effectively testing for a different kind of alert. Please let me know your thoughts. Thank you

@hwoarang (Contributor, Author) commented

The only test that seems to fail in Circle CI is --- FAIL: TestBucketStore_ManyParts_e2e (0.20s), which I believe is not related to this PR.

Base automatically changed from master to main February 26, 2021 16:30
@kakkoyun kakkoyun enabled auto-merge (squash) March 1, 2021 08:52
@kakkoyun (Member) commented on Mar 1, 2021

Hey @hwoarang, I enabled auto-merge and approved. Please fix the conflicts in the Changelog.

During Prometheus updates the alert was firing because the metric was
initialized with a value of '0' before the first heartbeat was sent. As a
result, the alert evaluation effectively took only the value of time() into
account, which gave a misleading picture of the sidecar's health.

Since the thanos_sidecar_last_heartbeat_success_time_seconds metric is
effectively just a timestamp that resets on new deployments, we can simply
wrap it in the timestamp() function, which returns almost the same value as
the metric itself, with the added benefit that heartbeat resets are ignored.

This also refactors the relevant tests and drops the timeout to 4 minutes to
ensure that we are not hit by stale data if the sidecar takes longer to start.

Signed-off-by: Markos Chandras <markos@chandras.me>
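
A rough sketch of what the 4-minute threshold mentioned above might look like in the generated rule (illustrative values; the actual rule, its labels, and the exact threshold are defined by the mixin):

```yaml
- alert: ThanosSidecarUnhealthy
  # Keeping the threshold at 4 minutes (240s) lets the rule react before stale
  # heartbeat samples from a slowly (re)starting sidecar can distort the
  # result, per the commit message above.
  expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds)) by (job) >= 240
  labels:
    severity: warning
```
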
@hwoarang (Contributor, Author) commented on Mar 5, 2021

Hey @hwoarang, I enabled auto-merge and approved. Please fix the conflicts in the Changelog.

Hello @kakkoyun. Thank you so much. I will have this ready today.

@kakkoyun kakkoyun merged commit 80e0257 into thanos-io:main Mar 5, 2021
kakkoyun added a commit that referenced this pull request Mar 5, 2021
@hwoarang deleted the add-2m-delay-sidecar branch on March 5, 2021 12:52
@kakkoyun (Member) commented on Mar 5, 2021

@hwoarang This PR has introduced regressions around the pod-to-instance label renaming. It was my bad to mark this for auto-merge; I didn't anticipate this. I'll send a follow-up PR to fix the issues.
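
For readers following along, a plausible shape of the regression, sketched here with label names inferred from the comment above rather than taken from the actual mixin diff: the merged rule still aggregated by the old pod label while the rest of the mixin had been renamed to use instance.

```yaml
# Illustrative only; not the literal mixin output.
rules:
# As merged, the alert still grouped by the old `pod` label:
- alert: ThanosSidecarUnhealthy
  expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds)) by (job, pod) >= 240
# ...whereas the rest of the mixin had moved to `instance`, so the follow-up
# fix would switch the aggregation label accordingly:
- alert: ThanosSidecarUnhealthy
  expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds)) by (job, instance) >= 240
```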

@hwoarang (Contributor, Author) commented on Mar 5, 2021

@kakkoyun really sorry that I missed that locally. I assumed the green CI was a good indication :)

andrejbranch pushed a commit to andrejbranch/thanos that referenced this pull request Mar 11, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
kakkoyun pushed a commit that referenced this pull request Mar 26, 2021
onprem pushed a commit that referenced this pull request Jun 23, 2021
fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy (#4342)

* Revert "mixin: Use sidecar's metric timestamp for healthcheck (#3204) (#3979)"

This reverts commit 5139e33.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>