mixin: Use sidecar's metric timestamp for healthcheck #3204
Conversation
Force-pushed from 8f7b458 to e097018.
Force-pushed from e097018 to 47cb8d8.
Force-pushed from 47cb8d8 to 50f8f99.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
Force-pushed from 50f8f99 to 88b2e6e.
Update PR to match latest master branch. @kakkoyun any thoughts here?
Force-pushed from 88b2e6e to 7313f0a.
LGTM
@hwoarang This looks good to me in principle; however, you need to generate the docs and make sure this fulfils the expected behaviour with tests. CI already points out where it falls short. You can find all the necessary tasks as Make targets.
Requesting changes as commented above.
Force-pushed from 7dba321 to 4568191.
@kakkoyun thank you for your input, and apologies for taking a while to address your concerns. Tests are passing now; I had to refactor them a little bit since we are now effectively testing for a different kind of alert. Please let me know your thoughts. Thank you.
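For reference, mixin alert tests of this kind are usually written in promtool's rule unit-test format. The sketch below is illustrative only: the rule file name, series, label values, and expectations are assumptions rather than the actual Thanos test fixtures.

```yaml
# Run with: promtool test rules tests.yaml   (sketch, not the repo's actual fixture)
rule_files:
  - alerts.yaml          # assumed name of the generated alert rules file
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Freshly (re)started sidecar: the heartbeat metric sits at 0 before the
      # first successful heartbeat is recorded.
      - series: 'thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar", pod="sidecar-0"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      # With the timestamp()-based expression the alert is expected to stay
      # silent during this start-up window (the exact expectation depends on
      # the real rule and its threshold).
      - eval_time: 3m
        alertname: ThanosSidecarUnhealthy
        exp_alerts: []
```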
The only test that seems to fail in CircleCI is …
Hey @hwoarang, I've enabled auto-merge and approved. Please fix the conflicts in the Changelog.
Force-pushed from 4568191 to b4bf820.
During Prometheus updates the alert was firing because the metric was initialized with a value of '0' before the first heartbeat was sent. As a result, the alert evaluation effectively took only the value of time() into consideration, which led to misleading information about the health of the sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is effectively just a timestamp that resets on new deployments, we can simply wrap it in the timestamp() function, which should return almost the same value as the metric itself with the added benefit that heartbeat resets will be ignored.

This also refactors the relevant tests and drops the timeout to 4 minutes in order to ensure that we do not get hit by stale data if the sidecar takes longer to start.

Signed-off-by: Markos Chandras <markos@chandras.me>
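For context, here is a rough sketch of the kind of alert-rule change the commit message describes. This is not the actual mixin output: the selector, aggregation labels, severity, and annotations are assumptions; only the metric name, the timestamp() wrapping, and the 4-minute (240s) threshold come from the description above.

```yaml
groups:
  - name: thanos-sidecar
    rules:
      - alert: ThanosSidecarUnhealthy
        # Old form (sketch): compares time() against the raw metric value. Right
        # after a restart the metric is 0, so the expression degenerates to
        # time() >= 240 and fires spuriously:
        #   time() - max by (job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) >= 240
        #
        # New form (sketch): timestamp() returns the sample's scrape timestamp
        # rather than its value, so resets to 0 on redeploy are ignored.
        expr: |
          time()
          - max by (job, pod) (
              timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"})
            )
          >= 240
        labels:
          severity: critical
        annotations:
          message: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than 4 minutes.'
```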
Force-pushed from b4bf820 to 95b2fe9.
@hwoarang This PR has introduced regressions around the pod-to-instance label renaming. It's my bad for marking it as auto-merge; I didn't anticipate this. I'll send a follow-up PR to fix the issues.
@kakkoyun really sorry that I missed that locally. I assumed the green CI was a good indication :)
During prometheus updates the alert was firing because the metric was initialized with a value of '0' before the first heartbeat was sent. As such, the evaluation of the alert results into actually taking just the value of time() into consideration which led to misleading information about the health of the sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is effectively just a timestamp that resets on new deployments, we can simply wrap it around the timestamp() function which should return almost the same value of the metric itself with the added benefit that heartbeat resets will be ignored.

This also refactors the relevant tests and drops the timeout to 4 minutes in order to ensure that we do not get hit by stale data if the sidecar takes longer to start.

Signed-off-by: Markos Chandras <markos@chandras.me>
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
Co-authored-by: Markos Chandras <hwoarang@users.noreply.github.com>
fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy (#4342)

* Revert "mixin: Use sidecar's metric timestamp for healthcheck (#3204) (#3979)"

This reverts commit 5139e33.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
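Roughly why the revert was needed (an illustration with made-up values, not taken from the follow-up PR): timestamp() reports when a sample was scraped, regardless of the sample's value, so a sidecar that never becomes healthy still yields a fresh timestamp on every scrape and the alert can never fire.

```yaml
# Illustration only; values are made up.
#   thanos_sidecar_last_heartbeat_success_time_seconds{pod="sidecar-0"} = 0
#     (the sidecar has never had a successful heartbeat)
#   timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{...})  = 1623840000
#     (the time of the most recent scrape, refreshed on every scrape)
#
# With the timestamp() wrapper the alert expression evaluates to roughly the
# scrape staleness (a few seconds), so it never crosses the 240s threshold even
# though the sidecar has never been healthy:
#   time() - timestamp(thanos_sidecar_last_heartbeat_success_time_seconds) ~= scrape interval < 240
```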
During Prometheus updates the alert was firing because the metric was initialized with a value of '0' before the first heartbeat was sent. As a result, the alert evaluation effectively took only the value of time() into consideration, which led to misleading information about the health of the sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is effectively just a timestamp that resets on new deployments, we can simply wrap it in the timestamp() function, which should return almost the same value as the metric itself with the added benefit that heartbeat resets will be ignored.
Signed-off-by: Markos Chandras <markos@chandras.me>
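To make the failure mode concrete, a small worked example with made-up epoch values (not from the PR):

```yaml
# Worked example; all numbers are invented.
#   time()                                                        = 1600000300
#   thanos_sidecar_last_heartbeat_success_time_seconds            = 0            # sidecar just restarted
#   timestamp(thanos_sidecar_last_heartbeat_success_time_seconds) = 1600000285   # time of the last scrape
#
# Raw-value form:   time() - 0          = 1600000300 -> far above any threshold, spurious alert.
# timestamp() form: time() - 1600000285 = 15         -> below the 240s threshold, no spurious alert.
```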
Changes
Verification