
Add /liveness endpoint to elastic-agent #4499

Merged: fearful-symmetry merged 32 commits into elastic:main on Apr 15, 2024

Conversation

@fearful-symmetry (Contributor) commented Mar 28, 2024

What does this PR do?

Closes #390

Why is it important?

This adds a /liveness endpoint to Elastic Agent that we enable along with the processes endpoint. However, unlike the /processes endpoint, /liveness will return a 500 if any components are in an Unhealthy or Degraded state.

This is a pretty simple change as-is, since I figured it would be easier to discuss other theoretical changes when we can actually see code. In particular:

  1. Should this be toggled by something other than the monitoring.http flag? Should we enable it based on some k8s autodiscover config, an env variable, or something else? If this feature is fundamentally tied to a k8s workflow, it seems like enabling it should also be tied to k8s in some way.
  2. Should we fail the health check based on any other conditions?

I'm also holding off on changing any docs until we're sure we have what we want with regards to config.

How to test this PR

  • set monitoring.http.enabled to true
  • Check /liveness/ and /liveness/[component-id] using either curl or a k8s health check (see the curl sketch below).
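
A minimal curl sketch for checking the endpoint locally (this assumes the default metrics address 127.0.0.1:6791 that appears in the logs later in this thread; the component ID is hypothetical and only illustrates the per-component path):

# with monitoring.http.enabled set to true and the agent running
curl -i http://localhost:6791/liveness                      # expect 200 when all components are healthy, 500 if any are Unhealthy or Degraded
curl -i http://localhost:6791/liveness/filestream-default   # per-component check; "filestream-default" is a made-up component ID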

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Mar 28, 2024
@fearful-symmetry fearful-symmetry self-assigned this Mar 28, 2024
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner March 28, 2024 19:54
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)


mergify bot commented Mar 28, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert pierrehilbert requested a review from pchila March 29, 2024 09:01
@pchila (Member) left a comment:

Code looks ok, just a couple of questions about when we want to fail the liveness probe and copying potentially big structures in memory

internal/pkg/agent/application/monitoring/process.go (outdated review thread, resolved)

mergify bot commented Apr 4, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b liveness-endpoint upstream/liveness-endpoint
git merge upstream/main
git push upstream liveness-endpoint

@fearful-symmetry (Author) commented:

So, I'm still doing manual testing, but updating this as-is while I figure out how integration tests should work.

@pierrehilbert pierrehilbert requested a review from leehinman April 8, 2024 16:03
@@ -977,6 +999,8 @@ func (c *Coordinator) runLoopIteration(ctx context.Context) {
case upgradeDetails := <-c.upgradeDetailsChan:
c.setUpgradeDetails(upgradeDetails)

case c.heartbeatChan <- struct{}{}:
Reviewer comment (Contributor):

A question on this: it looks like we will put something on the channel as soon as possible, so if the coordinator gets blocked after that, you would read from heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

If it is, I'd rather see something that records a timestamp every time runLoopIteration is called, then we can check to see if that timestamp is within our timeout window.

@fearful-symmetry (Author) replied:

so if the coordinator gets blocked after that, you would read from the heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

So, if I understand you correctly, yes. We could end up in a state where the coordinator blocks right after a heartbeat call. Because the liveness endpoint is meant to be polled at some regular interval, I'm not too worried about that.

I specifically didn't go with some kind of timestamp mechanism, on @faec's advice, since she was worried about the added complexity/edge cases of a time comparison compared to a simple signal like this.

Reviewer reply (Contributor):

I think this just tells us that the runLoopIteration function can run and the select can hit the heartbeat case. It doesn't tell us if the coordinator can/has processed any of the other cases.

I was thinking that to be "alive", we want to know that one of the case statements besides the heartbeat statement has run. We would probably need to bound that comparison with the check-in interval. Or to put it another way, the runLoopIteration function should happen at least every check-in interval (2x is probably safer), and it might happen more frequently.
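
A rough Go sketch of the timestamp-based alternative described here; this is not the code in this PR, and the names (markIteration, Alive, checkinInterval) are hypothetical:

// The idea: runLoopIteration records a timestamp every time it runs, and the
// liveness handler checks that the last iteration happened within a bound
// derived from the check-in interval (2x, per the suggestion above).
package coordinator

import (
	"sync/atomic"
	"time"
)

type Coordinator struct {
	lastIteration atomic.Int64 // unix nanoseconds of the most recent run loop iteration
}

// markIteration would be called at the top of every runLoopIteration.
func (c *Coordinator) markIteration() {
	c.lastIteration.Store(time.Now().UnixNano())
}

// Alive reports whether the run loop has iterated within twice the check-in interval.
func (c *Coordinator) Alive(checkinInterval time.Duration) bool {
	last := time.Unix(0, c.lastIteration.Load())
	return time.Since(last) <= 2*checkinInterval
}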

@fearful-symmetry (Author) replied:

Yeah, I sort of agree; this only really works as a basic heartbeat.

My concern is that "has the coordinator done anything else?" would be a bit flaky, as we can't really guarantee what state other sub-components will be in at any given time. @faec can comment more, but for that to work we might need to refactor part of the coordinator to allow for a more sophisticated health check.

internal/pkg/agent/application/monitoring/stats.go (outdated review thread, resolved)
@cmacknz (Member) left a comment:

Mostly LGTM, just a couple of minor things; in particular, the changelog has a typo.

changelog/fragments/1711653910-add-liveness-endpoint.yaml (three outdated review threads, resolved)
internal/pkg/agent/application/monitoring/liveness.go (outdated review thread, resolved)
)

func TestConfigUpdateOnReload(t *testing.T) {
Reviewer comment (Member):

Am I blind or is this test just starting the server and not actually reloading anything?

@fearful-symmetry (Author) replied:

Yeah, the name is a tad deceptive; the integration test is the only one doing the reloading.

SonarQube Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

  • 2 new issues
  • 0 security hotspots
  • 68.3% coverage on new code
  • 0.0% duplication on new code

See analysis details on SonarQube

@cmacknz (Member) left a comment:

Thanks, sanity checked in a local kind cluster to make sure it interoperates with k8s as expected.

Using

          livenessProbe:
            httpGet:
              path: /liveness
              port: 6792
            initialDelaySeconds: 3
            periodSeconds: 3

With a Fleet managed agent configured with:

PUT kbn:/api/fleet/agent_policies/3de09e84-54ec-479e-851b-f4947ff95262
{
  "name": "Policy",
  "namespace": "default",
  "overrides": {
      "agent": {
        "monitoring": {
          "http": {
            "enabled":true,
            "host": "0.0.0.0",
            "port": 6792
          }
        }
      }
  }
}

Works as expected.

@fearful-symmetry fearful-symmetry merged commit 29ce53e into elastic:main Apr 15, 2024
9 checks passed
cmacknz added a commit to cmacknz/elastic-agent that referenced this pull request Apr 16, 2024
cmacknz added a commit that referenced this pull request Apr 16, 2024
fearful-symmetry added a commit to fearful-symmetry/elastic-agent that referenced this pull request Apr 16, 2024
fearful-symmetry added a commit that referenced this pull request Apr 24, 2024
* Reapply "Add `/liveness` endpoint to elastic-agent (#4499)" (#4583)

This reverts commit eca5bc7.

* add behavior to not disable http monitor on reload with nil config, add tests

* improve comments

* linter

* more linter...

* fix spelling

* check original config state when reloading config

* change behavior of config set from overrides

* fix tests

* add second test to make sure old behavior with hard-coded monitoring config still works

* rename method
@rgarcia89 commented:

@cmacknz is there also a possibility to enable this monitoring via the agent policies UI?

@cmacknz (Member) commented May 3, 2024

Logs and Metrics monitoring has always been possible to configure in the UI.

This PR makes it possible to configure the agent to expose the liveness endpoint via the UI, but what polls the /liveness endpoint is deployment-specific, and Fleet itself can't/won't monitor it since it is local to the machine the agent is on.

@rgarcia89 replied:

Thanks, I am well aware of how the liveness and readiness probes work. However, adding the probe definition to the DaemonSet is not enough: by default the agents are not listening on the port used in your snippet. Therefore, I am trying to figure out where to enable exposing it so that it can be used for probing. I might have overlooked the section in the UI that enables it. Could you maybe provide a screenshot?

PS: I am talking about fleet managed agents.

@cmacknz (Member) commented May 6, 2024

The first problem you have is that the code in this PR is not released yet, and won't be until 8.15.0. The steps in #4499 (review) won't work until then.

The second thing is that with a Fleet managed agent prior to 8.15.0, the agent.monitoring.http settings need to be set in the local configuration when the agent container first starts up, because you can't change them from Fleet yet. Following the process in https://www.elastic.co/guide/en/fleet/current/advanced-kubernetes-managed-by-fleet.html to modify the initial contents of elastic-agent.yml should get you around this problem, but the /liveness endpoint won't exist until 8.15.0.

The third thing is that you likely need to change the monitoring server's default bind host from localhost to 0.0.0.0 for the liveness probe to work.
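
For example, a minimal sketch of the relevant section of the local elastic-agent.yml (the values are illustrative; 6791 is the default metrics port shown in the logs further down, and the rest of your configuration will differ):

agent.monitoring:
  enabled: true
  http:
    enabled: true
    host: 0.0.0.0
    port: 6791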

juliaElastic added a commit to elastic/kibana that referenced this pull request May 14, 2024
…180922)

## Summary

Related to elastic/ingest-dev#2471

With elastic/elastic-agent#4499 merged, it became possible to reload monitoring
settings changes in a running agent, so this change enables those settings in the UI.

To verify:
- Create an agent policy and enroll an agent with latest 8.15-SNAPSHOT
- edit agent policy, and change monitoring settings, e.g. change port
- verify that the metrics endpoint is running on the new port (see the curl sketch below)
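
A quick way to spot-check this from the host the agent runs on (a sketch; the port is whatever you set in the policy, and /stats is the stats endpoint referenced in the agent logs below):

```
curl -s http://localhost:6792/stats   # should respond on the new port once the policy change has been applied
```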

(Two screenshots of the UI were attached here.)

@cmacknz Does it make sense to allow changing the host name? It seems to me that monitoring can only work on localhost.
Another question: how can we verify that the `buffer.enabled` setting is applied correctly?

```
15:07:40.054
elastic_agent
[elastic_agent][error] Failed creating a server: failed to create api server: listen tcp 192.168.178.217:6791: bind: cannot assign requested address
```

Also, I'm not sure if switching off the `enabled` flag did anything,
seeing this again in the logs:
```
15:13:15.167
elastic_agent
[elastic_agent][info] Starting server
15:13:15.168
elastic_agent
[elastic_agent][info] Starting stats endpoint
15:13:15.168
elastic_agent
[elastic_agent][debug] Server started
15:13:15.168
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
```

---------

Co-authored-by: Jen Huang <its.jenetic@gmail.com>
Labels: backport-skip, Team:Elastic-Agent (label for the Agent team)
Projects: none yet
Development: successfully merging this pull request may close the issue "Liveness Probe HTTP Endpoint"
9 participants