Add a /-/healthy endpoint for monitoring component health #2197
Conversation
Looks nice, thanks! Just minor comments.
I added a few suggestions here to expand and clarify some doc things. :-)
Hi, I wanted to share some concerns with using component health for liveness probes. Kubernetes intends liveness probes to indicate when a pod is "stuck" and needs to be killed, and it will eventually kill the pod if the liveness probe fails enough times. By that definition of liveness, component health is not a good indicator: components can be unhealthy for reasons that restarting the process won't fix. Killing the pod when components are unhealthy will not only fail to fix the problem, but since Alloy requires all components to be healthy on startup, it will also completely halt all telemetry collection until the problem is manually fixed.
Oh yeah, I 100% agree, and I would not recommend using it as a liveness probe. But people still wanted it. I think adding this endpoint is harmless, because it can be used for other health-checking purposes or diagnostics, not just as a liveness probe. We do know people who want to use it as a liveness probe, but they would be doing so against our advice. They may have special requirements, though, and it may work well enough for them.
Hi @rfratto, thank you for weighing in! I agree that component health is not a good indicator of whether Alloy as a whole should be restarted. Apparently, the issue that originally prompted this feature request is an informer timeout like the one referenced in #2161. I do think there is a benefit in the /-/healthy endpoint.
I opened another PR to make the k8s informers keep retrying, as mentioned in the comment above.
Minor tweaks to the docs and all good.
Co-authored-by: Clayton Cornell <131809008+clayton-cornell@users.noreply.github.com>
PR Description
Adding a /-/healthy endpoint which returns an error if at least one component is not healthy. The /-/ready endpoint doesn't check component health, so this could be useful for Kubernetes liveness probes.

Which issue(s) this PR fixes
Fixes #2061
Notes to the Reviewer
Prometheus also has a /-/healthy endpoint, and so does Grafana Agent Static mode. However, neither is very useful because they always return HTTP 200. There is an open issue in the Prometheus repo to make it more useful.

PR Checklist