Liveness Probe HTTP Endpoint #390
Comments
@jlind23 @nimarezainia We should discuss the priority of this with @norrietaylor for the endpoint on the k8s project.
Hey folks, we are also hitting an issue in the synthetics-service (https://github.com/elastic/synthetics-service/issues/435) where the agent fails to acquire the leader lease.
We think that this issue might be caused by the API server being too slow to provide the lease. We are digging into metrics to confirm that, but for now the only solution we have to re-establish the agent is to manually restart the daemonsets. It would be great if we could set a liveness/readiness endpoint in the k8s template so k8s takes care of restarting the pods for us. For the deployment of the agent itself we have crafted a Helm chart (src), but unfortunately no liveness/readiness probe is set (src). Since we are reaching the public beta phase, it would be really great if you could raise the priority on this if possible.
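For illustration, a minimal sketch of what such a probe could look like in the DaemonSet pod spec, assuming the agent exposed an HTTP health endpoint at /liveness on port 6791 (both path and port are assumptions, not current agent behavior):

```yaml
# Sketch only: assumes the agent exposes an HTTP health endpoint at
# /liveness on port 6791, which it does not do today.
containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.4.0
    livenessProbe:
      httpGet:
        path: /liveness
        port: 6791
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
```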
I would like to add to @maneta's comment that we would need an auto-recovery mechanism for this leader lease loss. That is, we should retry until the API is available again, so the degraded metricset starts working again.
@michel-laterman @pierrehilbert I know some pieces of this are blocked on v2 work being completed. Could we update the issue description with what's been completed and what's left to close out this issue?
For v8.4.0 the fleet-gateway component is capable of setting the elastic-agent health to degraded if it fails 2 consecutive check-ins. This will lead to cases where the health in Fleet and the health reported by the local agent can differ.
No additional liveness endpoints are in v8.4.0. For v8.5.0 this will likely need to be reimplemented in order to work with the new v2 architecture.
I would also be interested in a liveness probe endpoint for managed fleet agents. |
Can't we just use the status code of the agent?
@blakerouse where are we with this now that we've switched over to v2? |
I know the only HTTP endpoint that the elastic-agent runs at the moment is for metrics. That only gets turned on if the metrics endpoint is enabled, and I don't know if we want to be at the same endpoint or use a different one with a different configuration. I also think we should think about the path that is used to determine the status of a running component.
Makes sense to offer the ability to have liveness on a more granular level. This would allow users to decide which components need to be up for their use case. I do think we should offer an overall one as well, which only returns 200 if all components are healthy. This would improve our experience on Cloud quite a bit. Today, from the orchestrator's point of view, the Agent container can be healthy even if one of the processes inside crashed and Agent didn't or couldn't restart it for whatever reason (bug, expired key, etc.). Having this endpoint would help us signal to the orchestrator that the container needs to be restarted.
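A rough sketch of what that route layout could look like: an overall /liveness path plus per-component subpaths. The path names and the componentHealth map are illustrative assumptions, not existing agent APIs; in the real agent this state would come from the coordinator.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// componentHealth is a stand-in for however the coordinator exposes
// per-component state; the IDs here are made up for the sketch.
var componentHealth = map[string]bool{
	"filebeat-default":   true,
	"metricbeat-default": true,
}

// overallLiveness returns 200 only if every component is healthy.
func overallLiveness(w http.ResponseWriter, r *http.Request) {
	for _, healthy := range componentHealth {
		if !healthy {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}

// componentLiveness reports liveness for a single component,
// e.g. GET /liveness/filebeat-default.
func componentLiveness(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/liveness/")
	if healthy, ok := componentHealth[id]; !ok || !healthy {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/liveness", overallLiveness)
	mux.HandleFunc("/liveness/", componentLiveness)
	log.Fatal(http.ListenAndServe("localhost:6791", mux))
}
```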
So, I just got assigned to this and I'm missing a bit of context.
EDIT: some quick googling suggests that k8s liveness probes just look for an HTTP response in the range 200-399. So it sounds like what we want is to turn the health state into a relevant response code, and anything on top of that (JSON with additional state info, etc.) is extra?
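If that's the approach, the core of it is just a mapping from the agent's aggregate state to a status code that k8s treats as pass (200-399) or fail. A minimal sketch follows; the State type and its values are placeholders, not the actual coordinator types, and whether degraded should pass or fail is left as a design choice.

```go
// Sketch only: State and its values stand in for whatever status the
// agent actually reports; the point is just the mapping to HTTP codes.
package liveness

import "net/http"

type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// livenessCode maps an agent state to an HTTP status code. Kubernetes
// treats any response in the 200-399 range as a passing probe, so only
// states that should trigger a restart map to a 5xx code.
func livenessCode(s State) int {
	switch s {
	case Healthy:
		return http.StatusOK
	case Degraded:
		// Design choice, not settled here: a degraded-but-running agent
		// passes, so k8s does not restart it.
		return http.StatusOK
	default:
		return http.StatusServiceUnavailable
	}
}
```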
There is a catch to this, which is that I don't think the agent's HTTP endpoint is enabled by default.
I would have the liveness probe fail anytime the agent is unhealthy (component or unit); otherwise it isn't providing much additional value. There is actually some already-written code around this you can revive or rewrite as needed: elastic-agent/internal/pkg/agent/application/coordinator/handler.go, lines 15 to 22 in c9bb164.
There is some additional context in #1157.
Yeah, this is the part that kinda bugs me a bit. I feel like this should "just work" if we're running under k8s, unless there's some security reason why we don't want it on-by-default when the user is running under k8s? |
I think it is fair not to want an HTTP interface on the agent by default outside of k8s, and doing it only on k8s would make the configuration in the default elastic-agent.yml file runtime-environment dependent, which we generally don't do right now. No matter which default we go with, the prerequisite is to make turning the HTTP endpoint on and off from Fleet work properly.
So, my thinking was that we could enable the endpoint by default when the agent is running under k8s. Unless there's a reason why a user might be running under k8s but want the HTTP endpoint disabled?
Missed this detail the first time; is there something the agent-side code needs to care about with regards to working under fleet? My understanding is that the config backend kind of abstracts that away. |
I have another weird idea: The k8s config lets users set headers for the liveness probe:
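For context, this is the kind of k8s configuration being referred to. The path, port, and header name below are purely illustrative; only the httpGet/httpHeaders structure is standard Kubernetes.

```yaml
livenessProbe:
  httpGet:
    # The path would be whichever existing agent endpoint gets reused;
    # the port and header name are purely illustrative.
    path: /liveness
    port: 6791
    httpHeaders:
      - name: X-Agent-Liveness-Mode
        value: all-components
  periodSeconds: 30
```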
Instead of creating a second HTTP endpoint, can we look for a header in the HTTP request and change the behavior of the existing endpoint?
Having to set headers to make a liveness endpoint work properly is not idiomatic for k8s. It is much more common to have a liveness/health/healthz/whatever endpoint purely dedicated to this purpose.
Describe the enhancement:
Currently the Elastic Agent running in a container does not have a liveness HTTP endpoint where Kubernetes can check the overall health of the Elastic Agent container. This needs to be added so that, in the case that the Elastic Agent is not working correctly, it can be restarted by Kubernetes.
Some items that would be good for the liveness probe to be alerted to on failure:
Liveness probe should have some subpaths defined for inputs that need to monitor their own liveness:
/liveness/endpoint
- Checks if endpoint should be alive (see https://github.com/elastic/security-team/issues/3449#issuecomment-1112559420 for more details)
Describe a specific use case for the enhancement or feature:
This needs to be added so that in the case that the Elastic Agent (or an integration that runs in a sidecar) is not working correctly it can be restarted by Kubernetes.
Additional Requirements
Enabling the liveness endpoint requires the ability to enable and possibly modify the agent HTTP configuration. This is currently not reloadable and cannot be configured from Fleet. For Fleet managed users to benefit from this we should make sure this can be turned on from Fleet.
elastic-agent/elastic-agent.yml, lines 75 to 82 in b3e8275
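The referenced lines are not reproduced here. As a rough sketch of the kind of configuration this refers to (the keys and the default port are recalled from memory and may not match the file at that commit exactly):

```yaml
# Rough sketch only; check the referenced lines for the exact keys and defaults.
agent.monitoring:
  enabled: true
  logs: true
  metrics: true
  http:
    enabled: false   # the agent's HTTP endpoint is off by default
    host: localhost
    port: 6791
```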