Liveness Probe HTTP Endpoint #390

Closed
blakerouse opened this issue Apr 28, 2022 · 17 comments · Fixed by #4499
Labels
estimation:Month Task that represents a month of work. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@blakerouse
Contributor

blakerouse commented Apr 28, 2022

Describe the enhancement:

Currently, the Elastic Agent running in a container does not have a liveness HTTP endpoint that Kubernetes can use to check the overall health of the Elastic Agent container. This needs to be added so that if the Elastic Agent is not working correctly, it can be restarted by Kubernetes.

Some conditions that the liveness probe should report as failures:

  • Not able to connect to Fleet Server (in managed mode)
  • Overall bad state of inputs after a period of time

The liveness probe should have some subpaths defined for inputs that need to monitor their own liveness:

/liveness/endpoint - Checks if endpoint should be alive (see https://github.com/elastic/security-team/issues/3449#issuecomment-1112559420 for more details)

Describe a specific use case for the enhancement or feature:

This needs to be added so that in the case that the Elastic Agent (or an integration that runs in a sidecar) is not working correctly it can be restarted by Kubernetes.

Additional Requirements

Enabling the liveness endpoint requires the ability to enable and possibly modify the agent HTTP configuration. This is currently not reloadable and cannot be configured from Fleet. For Fleet managed users to benefit from this we should make sure this can be turned on from Fleet.

# http:
# # enables http endpoint
# enabled: false
# # The HTTP endpoint will bind to this hostname, IP address, unix socket or named pipe.
# # When using IP addresses, it is recommended to only use localhost.
# host: localhost
# # Port on which the HTTP endpoint will bind. Default is 0 meaning feature is disabled.
# port: 6791

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 28, 2022
@ph
Contributor

ph commented Apr 28, 2022

@jlind23 @nimarezainia We should discuss this with our priority with @norrietaylor for the endpoint on k8s project.

@maneta

maneta commented May 3, 2022

Hey folks, we are also hitting an issue in the synthetics-service (https://github.com/elastic/synthetics-service/issues/435) where the agent fails to acquire the leader lease:

May 2, 2022 @ 14:56:46.498	elastic-agent-6qrjw	I0502 14:56:46.498250       7 leaderelection.go:278] failed to renew lease kube-system/elastic-agent-cluster-leader: timed out waiting for the condition

May 2, 2022 @ 14:56:46.498	elastic-agent-6qrjw	E0502 14:56:46.498350       7 leaderelection.go:301] Failed to release lock: resource name may not be empty

We think that this issue might be caused by the API server being too slow to provide the lease. We are digging into metrics to confirm that, but for now the only way we have to re-establish the agent is to manually restart the DaemonSets. It would be great if we could set a liveness/readiness endpoint in the k8s template so that k8s takes care of restarting the pods for us.

For the deployment of the agent itself we have crafted a Helm chart (src), but unfortunately no liveness/readiness probe is set (src).

Since we are reaching the public beta phase, it would be really great if you could raise the priority on this if possible.

@rbrunan

rbrunan commented May 3, 2022

I would like to add to @maneta's comment that we would need an auto-recover mechanism for this leader lease loss. I mean, we should retry until the API is available again, so that the degraded metricset starts working again.

@joshdover
Contributor

@michel-laterman @pierrehilbert I know some pieces of this are blocked on v2 work being completed. Could we update the issue description with what's been completed and what's left to close out this issue?

@michel-laterman
Contributor

For v8.4.0 the fleet-gateway component is capable of setting the elastic-agent health to degraded if it fails 2 consecutive check-ins. This will lead to cases where the health in Fleet and the health reported by the local agent (through elastic-agent status) differ.
The status command can now show the failing component or app in the message field, and also when the status last changed (through update_timestamp).

The liveness endpoint for v8.4.0 (GET /liveness) returns the health status (OK -> 200, DEGRADED/FAILURE -> 503), along with a JSON body that contains:

id # agent ID
status # status as a string
message # update message
update_timestamp # when the health status changed

No additional liveness endpoints are in v8.4.0.
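To illustrate how a caller might consume that endpoint, here is a hedged sketch of decoding the response body and applying the status-to-code mapping described above. The struct fields mirror the listed JSON keys; the concrete payload values are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LivenessBody mirrors the JSON fields listed above.
type LivenessBody struct {
	ID              string    `json:"id"`
	Status          string    `json:"status"`
	Message         string    `json:"message"`
	UpdateTimestamp time.Time `json:"update_timestamp"`
}

// httpCode maps a status string to the HTTP code the endpoint is
// described as returning: OK -> 200, anything else -> 503.
func httpCode(status string) int {
	if status == "OK" {
		return 200
	}
	return 503
}

func main() {
	// Example payload; values are made up for illustration.
	raw := `{"id":"agent-1","status":"DEGRADED","message":"2 consecutive check-in failures","update_timestamp":"2022-09-01T12:00:00Z"}`
	var body LivenessBody
	if err := json.Unmarshal([]byte(raw), &body); err != nil {
		panic(err)
	}
	fmt.Println(body.Status, httpCode(body.Status))
}
```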

For v8.5.0 it will likely need to be reimplemented in order to work with the new v2 architecture

@michel-laterman michel-laterman removed their assignment Sep 14, 2022
@rgarcia89

I would also be interested in a liveness probe endpoint for managed fleet agents.
Especially for when an agent stays in an Unhealthy state for too long

@rgarcia89

can't we just use the status code of the agent?

          livenessProbe:
            exec:
              command:
                - /usr/share/elastic-agent/elastic-agent
                - status
            initialDelaySeconds: 20
            successThreshold: 1
            failureThreshold: 3
            periodSeconds: 10
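An exec probe like the one above depends on the `elastic-agent status` command exiting non-zero when the agent is unhealthy (an assumption worth verifying). If the HTTP liveness endpoint discussed in this issue is enabled, an equivalent httpGet probe might look like the following sketch, with the port taken from the config snippet earlier in the issue and the /liveness path from the v8.4.0 comment:

```yaml
livenessProbe:
  httpGet:
    path: /liveness
    port: 6791
  initialDelaySeconds: 20
  successThreshold: 1
  failureThreshold: 3
  periodSeconds: 10
```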

@joshdover
Contributor

@blakerouse where are we with this now that we've switched over to v2?

@blakerouse
Contributor Author

The ServeHTTP handler was created on the coordinator, but it doesn't look like it was wired in.

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/coordinator/handler.go#L26

I know the only HTTP endpoint that the elastic-agent will run at the moment is for metrics, here:

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/monitoring/server.go#L49

That only gets turned on if the metrics endpoint is enabled; I don't know if we want this to be on the same endpoint or a different one with its own configuration.

I also think we should think about the path that is used to determine status of a running component:

  • /liveness - General overall check to ensure that the Elastic Agent is healthy
  • /liveness/by-id/${component_id} - Check liveness of specific component including all units
  • /liveness/by-id/${component_id}/${unit_id} - Check liveness of a specific unit within a component
  • /liveness/by-type/${input_type} - Allow checking without needing to know the exact computed component ID, by input type (this would return an array, since multiple units per input type are possible, though in some cases, such as endpoint, it would always be an array of 1)

@joshdover
Contributor

Makes sense to offer the ability to have liveness checks at a more granular level. This would allow users to decide which components need to be up for their use case. I do think we should offer an overall one as well, which only returns 200 if all components are healthy.

This would improve our experience on Cloud quite a bit. Today from the orchestration's point of view, the Agent container can be healthy even if one of the processes inside crashed and Agent didn't or couldn't restart it for whatever reason (bug, expired key, etc.). Having this endpoint would help us signal to the orchestrator that the container needs to be restarted.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 25, 2024

So, I just got assigned to this and I'm missing a bit of context;

  1. How is the user going to enable/disable this? Should it be enabled automatically if we're running under k8s?
  2. What are the actual requirements for the endpoints? Do we need a certain OpenAPI spec? Should they return some kind of non-200 HTTP code if the component is unhealthy?

EDIT: some quick googling suggests that k8s liveness probes just look for an HTTP response in the range 200-399. So it sounds like what we want is to map the health state to a relevant response code, and anything on top of that (JSON with additional state info, etc.) is extra?

@cmacknz
Member

cmacknz commented Mar 25, 2024

  1. Serve the liveness endpoint whenever the user has agent.monitoring.http.enabled: true:
    # http:
    # # enables http endpoint
    # enabled: false
    # # The HTTP endpoint will bind to this hostname, IP address, unix socket or named pipe.
    # # When using IP addresses, it is recommended to only use localhost.
    # host: localhost
    # # Port on which the HTTP endpoint will bind. Default is 0 meaning feature is disabled.
    # port: 6791
    # # Metrics buffer endpoint
    # buffer.enabled: false
    # # Configuration for the diagnostics action handler

There is a catch to this, which is that I don't think the agent.monitoring.http.enabled configuration is reloadable, which really limits how useful this can be. I think you need to make it so that you can set agent.monitoring.http.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

  2. In general, follow the conventions for a Kubernetes HTTP liveness probe. This is primarily going to be used as a liveness probe on k8s and by things that want it to work similarly (load balancers, for example).

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

I would have the liveness probe fail anytime the agent is unhealthy (component or unit). Otherwise it isn't providing any value over GET /stats or GET /debug/pprof, which already exist today. This won't consider Fleet connectivity at all, but ideally we'd want it to report an error if the agent considers itself offline from Fleet (offline after X minutes). This can be a future extension.

There is actually some already written code around this you can revive or rewrite as needed.

// LivenessResponse is the response body for the liveness endpoint.
type LivenessResponse struct {
	ID         string    `json:"id"`
	Status     string    `json:"status"`
	Message    string    `json:"message"`
	UpdateTime time.Time `json:"update_timestamp"`
}

There is some additional context in #1157

@fearful-symmetry
Contributor

I think you need to make it so that you can set agent.monitoring.http.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

Yeah, this is the part that kinda bugs me. I feel like this should "just work" if we're running under k8s, unless there's some security reason why we don't want it on by default when the user is running under k8s?

@cmacknz
Member

cmacknz commented Mar 26, 2024

I think it is fair not to want an HTTP interface on the agent by default outside of k8s, and doing it on k8s would make the configuration in the default elastic-agent.yml file runtime environment dependent, which we generally don't do right now.

We could default agent.monitoring.http.enabled: true, but we are going to run into the same problem where, for agents already enrolled into Fleet, this has no effect; and given it is a change in default behavior, it should be easy to turn off in Fleet, where it also isn't exposed.

So no matter which default we go with, the prerequisite is to make turning the HTTP endpoint on and off from Fleet work properly.

@fearful-symmetry
Contributor

I think it is fair not to want an HTTP interface on the agent by default outside of k8s, and doing it on k8s would make the configuration in the default elastic-agent.yml file runtime environment dependent, which we generally don't do right now.

So, my thinking was something like this:

if in_k8s() {
    serve_liveness_endpoint()
}

Unless there's a reason why a user might be running under k8s but want the /liveness endpoint disabled by default. We could also separate the config for the "normal" HTTP debug interface and the /liveness interface. I'm just assuming that this is something a user would want on by default if they're running agent in k8s, but I'm not a k8s expert so I could be wrong.

I think you need to make it so that you can set agent.monitoring.HTTP.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

Missed this detail the first time; is there something the agent-side code needs to care about with regards to working under fleet? My understanding is that the config backend kind of abstracts that away.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 26, 2024

I have another weird idea:

The k8s config lets users set headers for the liveness probe:

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome

Instead of creating a second HTTP endpoint, can we look for a header in the HTTP request and change the behavior of the existing /processes endpoint? It's also possible that k8s sets a default User-Agent header or something we could check. This means we wouldn't need any special or additional HTTP endpoints that we/users need to care about; instead we'd have a single API that works for liveness probes and general-purpose status information. Provides a bit of a cleaner API.

@cmacknz
Member

cmacknz commented Mar 26, 2024

Having to set headers to have a liveness endpoint work properly is not idiomatic for k8s. Much, much more common to have a liveness/health/healthz/whatever endpoint purely dedicated to this purpose.
