Liveness Probe HTTP Endpoint #390

Closed
blakerouse opened this issue Apr 28, 2022 · 17 comments · Fixed by #4499
Labels
estimation:Month Task that represents a month of work. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@blakerouse
Contributor

blakerouse commented Apr 28, 2022

Describe the enhancement:

Currently, the Elastic Agent running in a container does not have a liveness HTTP endpoint that Kubernetes can use to check the overall health of the Elastic Agent container. This needs to be added so that if the Elastic Agent is not working correctly, it can be restarted by Kubernetes.

Some conditions that the liveness probe should report as failures:

  • Not able to connect to Fleet Server (in managed mode)
  • Overall bad state of inputs after a period of time

The liveness probe should have some subpaths defined for inputs that need to monitor their own liveness:

/liveness/endpoint - Checks if endpoint should be alive (see https://github.com/elastic/security-team/issues/3449#issuecomment-1112559420 for more details)

Describe a specific use case for the enhancement or feature:

This needs to be added so that in the case that the Elastic Agent (or an integration that runs in a sidecar) is not working correctly it can be restarted by Kubernetes.

Additional Requirements

Enabling the liveness endpoint requires the ability to enable and possibly modify the agent HTTP configuration. This is currently not reloadable and cannot be configured from Fleet. For Fleet managed users to benefit from this we should make sure this can be turned on from Fleet.

# http:
# # enables http endpoint
# enabled: false
# # The HTTP endpoint will bind to this hostname, IP address, unix socket or named pipe.
# # When using IP addresses, it is recommended to only use localhost.
# host: localhost
# # Port on which the HTTP endpoint will bind. Default is 0 meaning feature is disabled.
# port: 6791

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 28, 2022
@ph
Contributor

ph commented Apr 28, 2022

@jlind23 @nimarezainia We should discuss this with our priority with @norrietaylor for the endpoint on k8s project.

@maneta

maneta commented May 3, 2022

Hey folks, we are also hitting an issue in the synthetics-service (https://github.com/elastic/synthetics-service/issues/435) where the agent fails to acquire the leader lease:

May 2, 2022 @ 14:56:46.498	elastic-agent-6qrjw	I0502 14:56:46.498250       7 leaderelection.go:278] failed to renew lease kube-system/elastic-agent-cluster-leader: timed out waiting for the condition

May 2, 2022 @ 14:56:46.498	elastic-agent-6qrjw	E0502 14:56:46.498350       7 leaderelection.go:301] Failed to release lock: resource name may not be empty

We think that this issue might be caused by the API server being too slow to provide the lease. We are digging into metrics to confirm that, but for now the only way we have to re-establish the agent is to manually restart the DaemonSets. It would be great if we could set a liveness/readiness endpoint in the k8s template so that k8s takes care of restarting the pods for us.

For the deployment of the agent itself we have crafted a Helm chart (src), but unfortunately no liveness/readiness probe is set (src).

Since we are reaching the public beta phase, it would be really great if you could raise the priority on this if possible.

@rbrunan

rbrunan commented May 3, 2022

I would like to add to @maneta's comment that we would need an auto-recover mechanism for this leader lease loss. I mean, we should retry until the API is available again, so that the degraded metricset starts working again.

@joshdover
Contributor

@michel-laterman @pierrehilbert I know some pieces of this are blocked on v2 work being completed. Could we update the issue description with what's been completed and what's left to close out this issue?

@michel-laterman
Contributor

For v8.4.0 the fleet-gateway component is capable of setting the elastic-agent health to degraded if it fails 2 consecutive check-ins. This will lead to cases where the health in Fleet and the health reported by the local agent (through elastic-agent status) differ.
The status command can now show the failing component or app in the message field, and also when the status last changed (through update_timestamp).

The liveness endpoint for v8.4.0 (GET /liveness) returns the health status (OK -> 200, DEGRADED/FAILURE -> 503), along with a JSON body that contains:

id # agent ID
status # status as a string
message # update message
update_timestamp # when the health status changed

No additional liveness endpoints are in v8.4.0.
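To illustrate how a caller might consume that endpoint, here is a hedged sketch of decoding the response body and applying the status-to-code mapping described above. The struct fields mirror the listed JSON keys; the concrete payload values are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LivenessBody mirrors the JSON fields listed above.
type LivenessBody struct {
	ID              string    `json:"id"`
	Status          string    `json:"status"`
	Message         string    `json:"message"`
	UpdateTimestamp time.Time `json:"update_timestamp"`
}

// httpCode maps a status string to the HTTP code the endpoint is
// described as returning: OK -> 200, anything else -> 503.
func httpCode(status string) int {
	if status == "OK" {
		return 200
	}
	return 503
}

func main() {
	// Example payload; values are made up for illustration.
	raw := `{"id":"agent-1","status":"DEGRADED","message":"2 consecutive check-in failures","update_timestamp":"2022-09-01T12:00:00Z"}`
	var body LivenessBody
	if err := json.Unmarshal([]byte(raw), &body); err != nil {
		panic(err)
	}
	fmt.Println(body.Status, httpCode(body.Status))
}
```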

For v8.5.0 it will likely need to be reimplemented in order to work with the new v2 architecture

@michel-laterman michel-laterman removed their assignment Sep 14, 2022
@rgarcia89

I would also be interested in a liveness probe endpoint for managed fleet agents.
Especially for when an agent stays in an Unhealthy state for too long

@rgarcia89

can't we just use the status code of the agent?

          livenessProbe:
            exec:
              command:
                - /usr/share/elastic-agent/elastic-agent
                - status
            initialDelaySeconds: 20
            successThreshold: 1
            failureThreshold: 3
            periodSeconds: 10
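An exec probe like the one above depends on the `elastic-agent status` command exiting non-zero when the agent is unhealthy (an assumption worth verifying). If the HTTP liveness endpoint discussed in this issue is enabled, an equivalent httpGet probe might look like the following sketch, with the port taken from the config snippet earlier in the issue and the /liveness path from the v8.4.0 comment:

```yaml
livenessProbe:
  httpGet:
    path: /liveness
    port: 6791
  initialDelaySeconds: 20
  successThreshold: 1
  failureThreshold: 3
  periodSeconds: 10
```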

@joshdover
Contributor

@blakerouse where are we with this now that we've switched over to v2?

@blakerouse
Contributor Author

The ServeHTTP handler was created on the coordinator, but it doesn't look like it was wired in.

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/coordinator/handler.go#L26

I know the only HTTP endpoint that the elastic-agent will run at the moment is for metrics, here:

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/monitoring/server.go#L49

That only gets turned on if the metrics endpoint is enabled; I don't know if we want this to be on the same endpoint or a different one with its own configuration.

I also think we should think about the path that is used to determine status of a running component:

  • /liveness - General overall check to ensure that the Elastic Agent is healthy
  • /liveness/by-id/${component_id} - Check liveness of specific component including all units
  • /liveness/by-id/${component_id}/${unit_id} - Check liveness of a specific unit within a component
  • /liveness/by-type/${input_type} - Allow checking without needing to know the exact computed component ID, by input type (this would return an array, since multiple units per input type are possible, though in some cases, such as endpoint, it would always be an array of 1)

@joshdover
Contributor

Makes sense to offer the ability to have liveness checks at a more granular level. This would allow users to decide which components need to be up for their use case. I do think we should offer an overall one as well, which only returns 200 if all components are healthy.

This would improve our experience on Cloud quite a bit. Today from the orchestration's point of view, the Agent container can be healthy even if one of the processes inside crashed and Agent didn't or couldn't restart it for whatever reason (bug, expired key, etc.). Having this endpoint would help us signal to the orchestrator that the container needs to be restarted.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 25, 2024

So, I just got assigned to this and I'm missing a bit of context;

  1. How is the user going to enable/disable this? Should it be enabled automatically if we're running under k8s?
  2. What are the actual requirements for the endpoints? Do we need a certain OpenAPI spec? Should they return some kind of non-200 HTTP code if the component is unhealthy?

EDIT: some quick googling suggests that k8s liveness probes just look for an HTTP response in the range 200-399. So it sounds like what we want is to map the health state to a relevant response code, and anything on top of that (JSON with additional state info, etc.) is extra?

@cmacknz
Member

cmacknz commented Mar 25, 2024

  1. Serve the liveness endpoint whenever the user has agent.monitoring.http.enabled: true:
    # http:
    # # enables http endpoint
    # enabled: false
    # # The HTTP endpoint will bind to this hostname, IP address, unix socket or named pipe.
    # # When using IP addresses, it is recommended to only use localhost.
    # host: localhost
    # # Port on which the HTTP endpoint will bind. Default is 0 meaning feature is disabled.
    # port: 6791
    # # Metrics buffer endpoint
    # buffer.enabled: false
    # # Configuration for the diagnostics action handler

There is a catch to this, which is that I don't think the agent.monitoring.http.enabled configuration is reloadable, which really limits how useful this can be. I think you need to make it so that you can set agent.monitoring.http.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

  2. In general, follow the conventions for a Kubernetes HTTP liveness probe. This is primarily going to be used as a liveness probe on k8s and by things that want it to work similarly (load balancers, for example).

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

I would have the liveness probe fail anytime the agent is unhealthy (component or unit). Otherwise it isn't providing any value over GET /stats or GET /debug/pprof, which already exist today. This won't consider Fleet connectivity at all, but ideally we'd want it to report an error if the agent considers itself offline from Fleet (offline after X minutes). This can be a future extension.

There is actually some already written code around this you can revive or rewrite as needed.

// LivenessResponse is the response body for the liveness endpoint.
type LivenessResponse struct {
	ID         string    `json:"id"`
	Status     string    `json:"status"`
	Message    string    `json:"message"`
	UpdateTime time.Time `json:"update_timestamp"`
}

There is some additional context in #1157

@fearful-symmetry
Contributor

I think you need to make it so that you can set agent.monitoring.http.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

Yeah, this is the part that kinda bugs me. I feel like this should "just work" if we're running under k8s, unless there's some security reason why we don't want it on by default when the user is running under k8s?

@cmacknz
Member

cmacknz commented Mar 26, 2024

I think it is fair not to want an HTTP interface on the agent by default outside of k8s, and doing it on k8s would make the configuration in the default elastic-agent.yml file runtime environment dependent, which we generally don't do right now.

We could default agent.monitoring.http.enabled: true, but we are going to run into the same problem where, for agents already enrolled into Fleet, this has no effect; and given it is a change in default behavior, it should be easy to turn off in Fleet, where it also isn't exposed.

So no matter which default we go with, the prerequisite is to make turning the HTTP endpoint on and off from Fleet work properly.

@fearful-symmetry
Contributor

I think it is fair not to want an HTTP interface on the agent by default outside of k8s, and doing it on k8s would make the configuration in the default elastic-agent.yml file runtime environment dependent, which we generally don't do right now.

So, my thinking was something like this:

if in_k8s() {
    serve_liveness_endpoint()
}

Unless there's a reason why a user might be running under k8s but want the /liveness endpoint disabled by default. We could also separate the config for the "normal" HTTP debug interface and the /liveness interface. I'm just assuming that this is something a user would want on by default if they're running agent in k8s, but I'm not a k8s expert so I could be wrong.

I think you need to make it so that you can set agent.monitoring.HTTP.enabled: true using the Fleet override API for an agent that was previously enrolled and have it turn on.

Missed this detail the first time; is there something the agent-side code needs to care about with regards to working under fleet? My understanding is that the config backend kind of abstracts that away.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 26, 2024

I have another weird idea:

The k8s config lets users set headers for the liveness probe:

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome

Instead of creating a second HTTP endpoint, can we look for a header in the HTTP request and change the behavior of the existing /processes endpoint? It's also possible that k8s sets a default User-Agent header or something we could check. This means we wouldn't need any special or additional HTTP endpoints that we/users need to care about; instead we'd have a single API that works for liveness probes and general-purpose status information. Provides a bit of a cleaner API.

@cmacknz
Member

cmacknz commented Mar 26, 2024

Having to set headers to have a liveness endpoint work properly is not idiomatic for k8s. Much, much more common to have a liveness/health/healthz/whatever endpoint purely dedicated to this purpose.
