
Add a /-/healthy endpoint for monitoring component health #2197

Merged

Conversation

@ptodev (Contributor) commented Nov 29, 2024

PR Description

Adding a /-/healthy endpoint which returns an error if at least one component is not healthy. The /-/ready endpoint doesn't check component health, so this could be useful for Kubernetes liveness probes.
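
For illustration, here is a minimal Go sketch of how such an endpoint can aggregate per-component health. This is not the code in this PR; the `Component` type, the `listComponents` callback, and the listen address are hypothetical stand-ins:

```go
// Sketch only: aggregate per-component health behind a /-/healthy endpoint.
// Component and listComponents are hypothetical stand-ins, not Alloy's API.
package main

import (
	"fmt"
	"net/http"
)

// Component is a simplified view of a running component's state.
type Component struct {
	ID      string
	Healthy bool
}

func healthyHandler(listComponents func() []Component) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var unhealthy []string
		for _, c := range listComponents() {
			if !c.Healthy {
				unhealthy = append(unhealthy, c.ID)
			}
		}
		if len(unhealthy) > 0 {
			// A single unhealthy component fails the whole check.
			http.Error(w, fmt.Sprintf("unhealthy components: %v", unhealthy),
				http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "All components are healthy.")
	}
}

func main() {
	http.HandleFunc("/-/healthy", healthyHandler(func() []Component {
		// A real server would read live component health here.
		return []Component{{ID: "prometheus.scrape.default", Healthy: true}}
	}))
	http.ListenAndServe("127.0.0.1:12345", nil)
}
```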

Which issue(s) this PR fixes

Fixes #2061

Notes to the Reviewer

Prometheus also has a /-/healthy endpoint, and so does Grafana Agent Static mode. However, neither is very useful, because both always return HTTP 200. There is an open issue in the Prometheus repo to make it more useful.

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@ptodev requested review from clayton-cornell and a team as code owners on November 29, 2024 13:22
@ptodev linked an issue on Nov 29, 2024 that may be closed by this pull request
@ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 818a1b8 to c698048 on November 29, 2024 13:35
@thampiotr (Contributor) left a comment

Looks nice, thanks! Just minor comments.

Review threads:
  • docs/sources/reference/_index.md
  • docs/sources/reference/http/_index.md (one thread outdated)
  • internal/service/http/http.go
  • internal/service/http/http_test.go
@clayton-cornell (Contributor) left a comment

I added a few suggestions here to expand and clarify some doc things. :-)

Review threads:
  • docs/sources/reference/_index.md
  • docs/sources/reference/http/_index.md (several threads outdated)
@clayton-cornell added the type/docs label (Docs Squad label across all Grafana Labs repos) on Nov 29, 2024
@rfratto (Member) commented Nov 29, 2024

Hi, I wanted to share some concerns with using component health for liveness probes.

Kubernetes intends for liveness probes to be an indication of when the pod is "stuck" and needs to be killed. Kubernetes will eventually kill the pod if the liveness probe fails enough times.

With this definition of liveness, component health is not a good indicator. Components like remote.s3 can report themselves as unhealthy if an object in S3 gets deleted, and custom components can fail if you give them an invalid configuration (such as when someone rolls out a bad config from fleet management).

Killing the pod when components are unhealthy not only fails to fix the problem; because Alloy requires all components to be healthy on startup, it also completely halts all telemetry collection until the problem is fixed manually.

@thampiotr (Contributor) commented:

> Hi, I wanted to share some concerns with using component health for liveness probes.

Oh yeah, I 100% agree, and I would not recommend using it as a liveness probe. But people still wanted it. I think adding this endpoint is harmless, because it can be used for other health-checking purposes or diagnostics, not just as a liveness probe. We do know people who want to use it as a liveness probe, but that would be going against our advice. They may have special requirements, though, and it may work okay for them.

@ptodev (Contributor, Author) commented Dec 2, 2024

Hi @rfratto, thank you for weighing in! I agree that component health is not a good indicator of whether Alloy as a whole should be restarted.

Apparently, the issue which originally prompted this feature request is an informer timeout like the one referenced in #2161.
@thampiotr, if that timeout does happen, maybe the real issue is that the component should keep retrying to sync the informers rather than giving up? Judging by the code, the prometheus.operator components appear to only sync the informers when the config is updated.

I do think there is a benefit in the /-/healthy endpoint as a whole, although I agree that we shouldn't recommend using it as a liveness probe. It could be useful in situations where users update the Alloy config and then want to double-check that it works as expected. They might do rolling deployments - e.g. update 1% of Alloy instances first, and if those are healthy, continue the deployment to more instances. I mistakenly thought that liveness probes in k8s could do similar things, but apparently that's not the case.
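
As a hypothetical sketch of that rollout use case (the address and timings are assumptions, not part of this PR), a deployment script could gate on the endpoint like this:

```go
// Hypothetical post-rollout gate: poll /-/healthy until the instance
// reports healthy or a deadline passes. Illustration only.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func waitHealthy(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // all components healthy
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("instance not healthy after %s", timeout)
}

func main() {
	// Assumed default Alloy HTTP listen address; adjust for your deployment.
	if err := waitHealthy("http://127.0.0.1:12345/-/healthy", 2*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("healthy; safe to continue the rollout")
}
```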

@ptodev (Contributor, Author) commented Dec 2, 2024

I opened another PR for the k8s informers to keep retrying, as mentioned in the comment above.
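
For reference, the retry idea could look roughly like the sketch below: wrap client-go's cache.WaitForCacheSync in a loop with a per-attempt timeout instead of giving up after one failure. This is an illustration of the concept, not the code in that follow-up PR:

```go
// Sketch: keep retrying informer cache sync rather than failing once.
// hasSynced would come from a real informer; this is not the actual fix.
package informerretry

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/tools/cache"
)

func syncWithRetries(ctx context.Context, hasSynced cache.InformerSynced) error {
	for {
		// Bound each attempt so a stuck sync becomes a retry, not a hang.
		attempt, cancel := context.WithTimeout(ctx, time.Minute)
		ok := cache.WaitForCacheSync(attempt.Done(), hasSynced)
		cancel()
		if ok {
			return nil
		}
		if ctx.Err() != nil {
			return ctx.Err() // component is shutting down
		}
		log.Println("informer cache sync timed out; retrying")
	}
}
```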

@clayton-cornell (Contributor) left a comment

Minor tweaks to the docs and all good

@ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from a32e9b4 to 91659a0 on December 2, 2024 19:37
@ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 3e6c716 to cc86481 on December 11, 2024 11:51
ptodev and others added 8 commits on December 11, 2024 12:05, three co-authored by Clayton Cornell.
@ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 13e43bf to e095bb3 on December 11, 2024 12:05
@ptodev merged commit b97d2b6 into main on Dec 11, 2024
17 of 18 checks passed
@ptodev deleted the 2061-expose-alloy-overall-component-health-via-http-endpoint branch on December 11, 2024 15:54
Labels: type/docs (Docs Squad label across all Grafana Labs repos)
Linked issue: Expose Alloy overall component health via HTTP endpoint (#2061)
4 participants