elastic-agent docker: support healthcheck for the container #24503

Closed · mtojek opened this issue Mar 11, 2021 · 9 comments · Fixed by #24856

mtojek (Contributor) commented Mar 11, 2021

With the Fleet Server enabled in the agent's Docker container, we need to find a way to signal that the container is healthy. Before 7.13.0-SNAPSHOT we used the following healthcheck: https://github.com/elastic/elastic-package/blob/master/internal/install/static_snapshot_yml.go#L85

Do you have any recommendation on how to signal that the container is healthy, i.e. that it has a default policy assigned? Do you think you could add a healthcheck definition to the official Docker image?
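
For reference, a healthcheck of the kind requested here could look roughly like the sketch below. This is not the official image's healthcheck; it assumes the container exposes the Fleet Server status endpoint on port 8220 (the endpoint probed later in this thread) and that curl is available inside the image:

# Hypothetical Dockerfile sketch, not the official image's healthcheck.
# Assumes the Fleet Server status endpoint on port 8220 and curl in the image.
HEALTHCHECK --interval=10s --timeout=5s --retries=30 \
  CMD curl --silent --fail http://localhost:8220/api/status || exit 1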

mtojek added the Team:Elastic-Agent label on Mar 11, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/agent (Team:Agent)

ruflin (Member) commented Mar 16, 2021

@simitt How are you handling this for the Cloud container?

simitt (Contributor) commented Mar 16, 2021

@ruflin no special handling for the healthcheck yet; instead, if the legacy APM Server cannot be started or dies, it signals the Elastic Agent to shut down, which terminates the whole container.

For sub-processes managed by Elastic Agent, we have discussed in the past that the Agent should provide a healthcheck endpoint exposing per-process details as well as an overall health indicator.
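
To illustrate the kind of endpoint discussed here, a response could combine an overall indicator with per-process detail. Everything below (port, path, and response shape) is invented for illustration; no such API is confirmed in this thread:

$ curl -s http://localhost:6789/health    # hypothetical port and path
{
  "status": "DEGRADED",
  "processes": [
    {"name": "fleet-server", "status": "HEALTHY"},
    {"name": "filebeat", "status": "STARTING"}
  ]
}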

ruflin (Member) commented Mar 17, 2021

@michalpristas @ph Do we have this overall healthcheck tracked somewhere already? I remember we discussed this in the past. Does Agent already have some HTTP endpoint or similar?

ph (Contributor) commented Mar 17, 2021

We track health status internally, and we should be able to expose it in any way necessary. @ruflin I believe this is linked to #24091.

mtojek (Contributor, Author) commented Mar 17, 2021

From our perspective (as users), it's valuable if the healthcheck signals green once the default policy has been assigned for the first time.

ruflin (Member) commented Mar 31, 2021

I think an Agent should be considered healthy as soon as the first policy is received and acked. This does not have to be the default policy.

We should improve this healthcheck later on to provide more fine-grained status information based on the status of individual processes and inputs.

simitt (Contributor) commented Mar 31, 2021

IMO if the agent is started in Fleet Server mode, then the Agent's health should consider the Fleet Server's health, ideally by consuming a health endpoint from the Fleet Server.
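
As a sketch, a container healthcheck consuming such an endpoint could gate on the reported status field. The endpoint and response shape match the /api/status probes shown in the next comment, but the script itself is hypothetical:

#!/bin/sh
# Hypothetical healthcheck script: exit 0 only when the endpoint answers
# and reports HEALTHY; a connection failure or any other status fails the check.
response=$(curl --silent --max-time 5 http://localhost:8220/api/status) || exit 1
echo "$response" | grep -q '"status":"HEALTHY"'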

mtojek (Contributor, Author) commented Mar 31, 2021

As we're observing some flakiness when booting the agent, I did a short exercise to check responses from /api/status:

$ for I in `seq 1 1 10000`; do curl http://localhost:8220/api/status ; sleep 1; done
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
{"name":"fleet-server","version":"","status":"STARTING"}{"name":"fleet-server","version":"","status":"HEALTHY"}curl: (52) Empty reply from server
curl: (52) Empty reply from server
curl: (52) Empty reply from server
{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}{"name":"fleet-server","version":"","status":"HEALTHY"}ć{"name":"fleet-server","version":"","status":"HEALTHY"}

Please mind the gap between the HEALTHY states. It seems that the Fleet Server got restarted in between, which means we need a different workaround :)
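
For what it's worth, a timestamped variant of the same loop would make the length of that gap measurable; this is plain shell, nothing agent-specific:

$ for I in `seq 1 1 10000`; do printf '%s ' "$(date +%T)"; curl --silent http://localhost:8220/api/status; echo; sleep 1; done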
