Liveness endpoint behaviour when unable to check in with fleet server #1157

michel-laterman · 2022-09-12T16:38:10Z

As a fix for the issue in #1148, made in #1152, the elastic-agent will no longer report a degraded state if the checkin to fleet-server fails.

This degraded state reporting was used by the liveness endpoint to ensure that the agent reported a 200 status when healthy.
The original changes were added as part of #569.

We need to decide what the liveness endpoint should report and how that interacts with what the agent reports to Fleet.

michel-laterman · 2022-09-12T16:39:32Z

cc @blakerouse, I believe the original liveness endpoint behaviour made it into v2

cmacknz · 2022-09-12T19:28:28Z

I think the issue with the original implementation was the possibility for the agent to check in and report a degraded state because of the inability to check in, which is a bit of a paradoxical situation given that you have to check in to do it.

To me it makes the most sense to align the agent's local liveness endpoint with what Fleet considers the agent's offline state.

We released this functionality in v8.4.0 so we shouldn't break it by not reporting when the agent cannot connect to Fleet server. I think we should address this in v8.5.0 by:

1. Reporting an error from the agent's liveness endpoint when the agent is offline, using the same threshold and logic that Fleet itself uses to determine when an agent is offline. The Fleet logic for offline is defined here: https://github.com/elastic/kibana/blob/342e7a17839dc78a4b29d7770eaad3138a8bddb8/x-pack/plugins/fleet/common/services/agent_status.ts#L24-L42
2. Only reporting this offline state via the liveness endpoint, and never reporting it to Fleet. Hopefully we can accomplish this by filtering a new offline state from the state reporter or something similarly easy.
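
To make the proposal concrete, here is a minimal sketch of a liveness handler that fails once the last successful checkin is older than a threshold. Everything in it is an assumption for illustration: `checkinTracker`, `livenessHandler`, `probe`, and `simulate` are hypothetical names rather than elastic-agent code, the `/liveness` path and the 503 failure code are placeholders, and the 5 minute `offlineAfter` value is arbitrary. A real implementation would take its threshold from the Fleet logic linked above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// checkinTracker remembers when the agent last checked in successfully.
// Hypothetical type for illustration; not part of elastic-agent.
type checkinTracker struct {
	mu           sync.Mutex
	lastSuccess  time.Time
	offlineAfter time.Duration    // assumed threshold; should mirror Fleet's offline logic
	now          func() time.Time // injectable clock so the sketch can be exercised
}

// markSuccess records a successful checkin at the current time.
func (c *checkinTracker) markSuccess() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastSuccess = c.now()
}

// offline reports whether the last successful checkin is older than the threshold.
func (c *checkinTracker) offline() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.now().Sub(c.lastSuccess) > c.offlineAfter
}

// livenessHandler answers 200 while the agent is online and 503 (an assumed
// failure code) once it is considered offline. The offline state is exposed
// only through this local endpoint; nothing here is sent to Fleet.
func livenessHandler(c *checkinTracker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if c.offline() {
			http.Error(w, "offline: no recent checkin with Fleet Server", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe issues one in-memory GET against the handler and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/liveness", nil))
	return rec.Code
}

// simulate probes right after a checkin and again after the threshold has elapsed.
func simulate() (fresh, stale int) {
	clock := time.Now()
	tr := &checkinTracker{offlineAfter: 5 * time.Minute, now: func() time.Time { return clock }}
	tr.markSuccess()
	h := livenessHandler(tr)
	fresh = probe(h)
	clock = clock.Add(10 * time.Minute) // no checkin for longer than the threshold
	stale = probe(h)
	return fresh, stale
}

func main() {
	fresh, stale := simulate()
	fmt.Println(fresh, stale) // 200 503
}
```

Because the tracker is only consulted by the handler, a failed checkin never has to be reported through a checkin, which avoids the paradox described above.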

@michel-laterman thoughts on this?

blakerouse · 2022-09-13T15:04:49Z

Yes, I think what we need here is two different statuses. The first is a local status: what the Elastic Agent's status is as seen locally on the machine. The second is the status that is reported to Fleet Server.

When elastic-agent status is called or the liveness probe is used, the fact that communication with Fleet Server is failing is important information.
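
One way to picture the two statuses is a reporter in which every update carries a local flag. The sketch below is only an illustration under that assumption: `State`, `reporter`, `Update`, `LocalStatus`, and `FleetStatus` are hypothetical names, not the actual elastic-agent status interfaces. The local view aggregates every report, while the Fleet view skips anything marked local.

```go
package main

import "fmt"

// State is a simplified health level; higher values are more severe.
// Hypothetical type for illustration only.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// report is one source's latest state plus a flag marking it as local-only.
type report struct {
	state State
	local bool
}

// reporter keeps the latest report per source. Not safe for concurrent use;
// a real implementation would need locking.
type reporter struct {
	reports map[string]report
}

func newReporter() *reporter {
	return &reporter{reports: make(map[string]report)}
}

// Update stores the latest state for a source. Reports with local set to true
// are meant for the liveness endpoint and status command, never for Fleet.
func (r *reporter) Update(source string, s State, local bool) {
	r.reports[source] = report{state: s, local: local}
}

// worst returns the most severe state among the reports accepted by include.
func (r *reporter) worst(include func(report) bool) State {
	out := Healthy
	for _, rep := range r.reports {
		if include(rep) && rep.state > out {
			out = rep.state
		}
	}
	return out
}

// LocalStatus considers every report, including local-only ones.
func (r *reporter) LocalStatus() State {
	return r.worst(func(report) bool { return true })
}

// FleetStatus filters out local-only reports before aggregating.
func (r *reporter) FleetStatus() State {
	return r.worst(func(rep report) bool { return !rep.local })
}

func main() {
	r := newReporter()
	r.Update("fleet-checkin", Degraded, true) // checkin failing: local-only information
	fmt.Println(r.LocalStatus() == Degraded, r.FleetStatus() == Healthy) // true true
}
```

With this split, a failing checkin shows up in the local status without changing what is sent to Fleet Server on the next successful checkin.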

michel-laterman · 2022-09-13T18:45:47Z

@cmacknz, that sounds good. It should also appear in elastic-agent status, as that would be an easy indicator for customers.

jlind23 · 2022-09-21T15:43:58Z

@michel-laterman would you be able to take this one on your plate as it seems you understand the whole picture here?

michel-laterman · 2022-09-22T20:34:32Z

@jlind23 Would this need to be backported to 8.5? The changes we're making for 8.6 mean that a fix for 8.5 would be a completely separate fix.

jlind23 · 2022-09-23T07:51:28Z

@michel-laterman as the previous fix landed in 8.5, I think it would be great to have a fix backported to 8.5 too.
@cmacknz thoughts?

cmacknz · 2022-09-23T14:19:54Z

Yes, let's backport to 8.5 so as not to break anything in that release.

michel-laterman added the enhancement, Team:Elastic-Agent, and 8.6-candidate labels Sep 12, 2022

cmacknz changed the title ~~Liveness endpoint behaviour~~ Liveness endpoint behaviour when unable to check in with fleet server Sep 12, 2022

cmacknz added the v8.5.0 label Sep 12, 2022

cmacknz assigned michalpristas Sep 13, 2022

cmacknz mentioned this issue Sep 13, 2022

Avoid reporting Unhealthy on fleet connectivity issues #1152

Merged

6 tasks

cmacknz unassigned michalpristas Sep 21, 2022

jlind23 assigned michel-laterman Sep 21, 2022

michel-laterman mentioned this issue Sep 23, 2022

Expand status reporter/controller interfaces to allow local reporters #1285

Merged

4 tasks

michel-laterman closed this as completed in #1285 Sep 26, 2022

michel-laterman mentioned this issue Sep 26, 2022

Use error.TypeLocal to report local issue for liveness endpoint #1306

Closed

6 tasks

michel-laterman mentioned this issue Oct 17, 2022

Change the stater to include a local flag. #1308

Merged

4 tasks

cmacknz mentioned this issue Mar 25, 2024

Liveness Probe HTTP Endpoint #390

Closed
