
Add /liveness endpoint to elastic-agent #4499

Merged: fearful-symmetry merged 32 commits into elastic:main on Apr 15, 2024

Conversation

@fearful-symmetry (Contributor) commented Mar 28, 2024

What does this PR do?

Closes #390

Why is it important?

This adds a /liveness endpoint to Elastic Agent that we enable along with the processes endpoint. However, unlike the /processes endpoint, /liveness will return a 500 if any components are in an Unhealthy or Degraded state.

This is a pretty simple change as-is, since I figured it would be easier to discuss other theoretical changes when we can actually see code. In particular:

  1. Should this be toggled by something other than the monitoring.http flag? Should we enable it based on some k8s autodiscover config, an env variable, or something else? If this feature is fundamentally tied to a k8s workflow, it seems like enabling it should also be tied to k8s in some way.
  2. Should we fail the health check based on any other conditions?

I'm also holding off on changing any docs until we're sure we have what we want with regards to config.

How to test this PR

  • set monitoring.http.enabled to true
  • Check /liveness/ and /liveness/[component-id] using either curl or a k8s health check (see the curl sketch below).
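
A minimal curl sketch for checking the endpoint locally (this assumes the default metrics address 127.0.0.1:6791 that appears in the logs later in this thread; the component ID is hypothetical and only illustrates the per-component path):

# with monitoring.http.enabled set to true and the agent running
curl -i http://localhost:6791/liveness                      # expect 200 when all components are healthy, 500 if any are Unhealthy or Degraded
curl -i http://localhost:6791/liveness/filestream-default   # per-component check; "filestream-default" is a made-up component ID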

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Mar 28, 2024
@fearful-symmetry fearful-symmetry self-assigned this Mar 28, 2024
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner March 28, 2024 19:54
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)


mergify bot commented Mar 28, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert pierrehilbert requested a review from pchila March 29, 2024 09:01
@pchila (Member) left a comment:

Code looks ok, just a couple of questions about when we want to fail the liveness probe and copying potentially big structures in memory

internal/pkg/agent/application/monitoring/process.go (outdated review thread, resolved)

mergify bot commented Apr 4, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b liveness-endpoint upstream/liveness-endpoint
git merge upstream/main
git push upstream liveness-endpoint

@fearful-symmetry (Author) commented:

So, I'm still doing manual testing, but updating this as-is while I figure out how integration tests should work.

@pierrehilbert pierrehilbert requested a review from leehinman April 8, 2024 16:03
@@ -977,6 +999,8 @@ func (c *Coordinator) runLoopIteration(ctx context.Context) {
case upgradeDetails := <-c.upgradeDetailsChan:
c.setUpgradeDetails(upgradeDetails)

case c.heartbeatChan <- struct{}{}:
Reviewer comment (Contributor):

A question on this: it looks like we will put something on the channel as soon as possible, so if the coordinator gets blocked after that, you would read from heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

If it is, I'd rather see something that records a timestamp every time runLoopIteration is called, then we can check to see if that timestamp is within our timeout window.

@fearful-symmetry (Author) replied:

so if the coordinator gets blocked after that, you would read from the heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

So, if I understand you correctly, yes. We could end up in a state where the coordinator blocks right after a heartbeat call. Because the liveness endpoint is meant to be polled at some regular interval, I'm not too worried about that.

I specifically didn't go with some kind of timestamp mechanism, on @faec's advice, since she was worried about the added complexity/edge cases of a time comparison compared to a simple signal like this.

Reviewer reply (Contributor):

I think this just tells us that the runLoopIteration function can run and the select can hit the heartbeat case. It doesn't tell us if the coordinator can/has processed any of the other cases.

I was thinking that to be "alive", we want to know that one of the case statements besides the heartbeat statement has run. We would probably need to bound that comparison with the check-in interval. Or to put it another way, the runLoopIteration function should happen at least every check-in interval (2x is probably safer), and it might happen more frequently.
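
A rough Go sketch of the timestamp-based alternative described here; this is not the code in this PR, and the names (markIteration, Alive, checkinInterval) are hypothetical:

// The idea: runLoopIteration records a timestamp every time it runs, and the
// liveness handler checks that the last iteration happened within a bound
// derived from the check-in interval (2x, per the suggestion above).
package coordinator

import (
	"sync/atomic"
	"time"
)

type Coordinator struct {
	lastIteration atomic.Int64 // unix nanoseconds of the most recent run loop iteration
}

// markIteration would be called at the top of every runLoopIteration.
func (c *Coordinator) markIteration() {
	c.lastIteration.Store(time.Now().UnixNano())
}

// Alive reports whether the run loop has iterated within twice the check-in interval.
func (c *Coordinator) Alive(checkinInterval time.Duration) bool {
	last := time.Unix(0, c.lastIteration.Load())
	return time.Since(last) <= 2*checkinInterval
}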

@fearful-symmetry (Author) replied:

Yeah, I sort of agree; this only really works as a basic heartbeat.

My concern is that "has the coordinator done anything else?" would be a bit flaky, as we can't really guarantee what state other sub-components will be in at any given time. @faec can comment more, but for that to work we might need to refactor part of the coordinator to allow for a more sophisticated health check.

internal/pkg/agent/application/monitoring/stats.go (outdated review thread, resolved)
@cmacknz (Member) left a comment:

Mostly LGTM, just a couple of minor things; in particular, the changelog has a typo.

changelog/fragments/1711653910-add-liveness-endpoint.yaml (three outdated review threads, resolved)
internal/pkg/agent/application/monitoring/liveness.go (outdated review thread, resolved)
)

func TestConfigUpdateOnReload(t *testing.T) {
Reviewer comment (Member):

Am I blind or is this test just starting the server and not actually reloading anything?

@fearful-symmetry (Author) replied:

Yeah, the name is a tad deceptive; the integration test is the only one doing the reloading.

SonarQube Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

  • 2 new issues
  • 0 security hotspots
  • 68.3% coverage on new code
  • 0.0% duplication on new code

See analysis details on SonarQube

@cmacknz (Member) left a comment:

Thanks, sanity checked in a local kind cluster to make sure it interoperates with k8s as expected.

Using

          livenessProbe:
            httpGet:
              path: /liveness
              port: 6792
            initialDelaySeconds: 3
            periodSeconds: 3

With a Fleet managed agent configured with:

PUT kbn:/api/fleet/agent_policies/3de09e84-54ec-479e-851b-f4947ff95262
{
  "name": "Policy",
  "namespace": "default",
  "overrides": {
      "agent": {
        "monitoring": {
          "http": {
            "enabled":true,
            "host": "0.0.0.0",
            "port": 6792
          }
        }
      }
  }
}

Works as expected.

@fearful-symmetry fearful-symmetry merged commit 29ce53e into elastic:main Apr 15, 2024
9 checks passed
cmacknz added a commit to cmacknz/elastic-agent that referenced this pull request Apr 16, 2024
cmacknz added a commit that referenced this pull request Apr 16, 2024
fearful-symmetry added a commit to fearful-symmetry/elastic-agent that referenced this pull request Apr 16, 2024
fearful-symmetry added a commit that referenced this pull request Apr 24, 2024
* Reapply "Add `/liveness` endpoint to elastic-agent (#4499)" (#4583)

This reverts commit eca5bc7.

* add behavior to not disable http monitor on reload with nil config, add tests

* improve comments

* linter

* more linter...

* fix spelling

* check original config state when reloading config

* change behavior of config set from overrides

* fix tests

* add second test to make sure old behavior with hard-coded monitoring config still works

* rename method
@rgarcia89 commented:

@cmacknz is there also a possibility to enable this monitoring via the agent policies UI?

@cmacknz (Member) commented May 3, 2024

Logs and Metrics monitoring has always been possible to configure in the UI.

This PR makes it possible to configure the agent to expose the liveness endpoint via the UI, but what polls the /liveness endpoint is deployment-specific, and Fleet itself can't/won't monitor it since it is local to the machine the agent is on.

@rgarcia89 replied:

Thanks, I am well aware of how the liveness and readiness probes work. However, adding the probe definition to the DaemonSet is not enough: by default the agents are not listening on the port used in your snippet. Therefore, I am trying to figure out where to enable exposing it so that it can be used for probing. I might have overlooked the section in the UI that enables it. Could you maybe provide a screenshot?

PS: I am talking about fleet managed agents.

@cmacknz (Member) commented May 6, 2024

The first problem you have is that the code in this PR is not released yet, and won't be until 8.15.0. The steps in #4499 (review) won't work until then.

The second thing is that with a Fleet managed agent prior to 8.15.0, the agent.monitoring.http settings need to be set in the local configuration when the agent container first starts up, because you can't change them from Fleet yet. Following the process in https://www.elastic.co/guide/en/fleet/current/advanced-kubernetes-managed-by-fleet.html to modify the initial contents of elastic-agent.yml should get you around this problem, but the /liveness endpoint won't exist until 8.15.0.

The third thing is that you likely need to change the monitoring server's default bind host from localhost to 0.0.0.0 for the liveness probe to work.
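
For example, a minimal sketch of the relevant section of the local elastic-agent.yml (the values are illustrative; 6791 is the default metrics port shown in the logs further down, and the rest of your configuration will differ):

agent.monitoring:
  enabled: true
  http:
    enabled: true
    host: 0.0.0.0
    port: 6791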

juliaElastic added a commit to elastic/kibana that referenced this pull request May 14, 2024
…180922)

## Summary

Related to elastic/ingest-dev#2471

With elastic/elastic-agent#4499 merged, it became possible to reload monitoring
settings changes in a running agent, so this change enables those settings in the UI.

To verify:
- Create an agent policy and enroll an agent with latest 8.15-SNAPSHOT
- edit agent policy, and change monitoring settings, e.g. change port
- verify that the metrics endpoint is running on the new port (see the curl sketch below)
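
A quick way to spot-check this from the host the agent runs on (a sketch; the port is whatever you set in the policy, and /stats is the stats endpoint referenced in the agent logs below):

```
curl -s http://localhost:6792/stats   # should respond on the new port once the policy change has been applied
```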

(Two screenshots of the UI were attached here.)

@cmacknz Does it make sense to allow changing the host name? It seems to me that monitoring can only work on localhost.
Another question: how can we verify that the `buffer.enabled` setting is applied correctly?

```
15:07:40.054
elastic_agent
[elastic_agent][error] Failed creating a server: failed to create api server: listen tcp 192.168.178.217:6791: bind: cannot assign requested address
```

Also, I'm not sure if switching off the `enabled` flag did anything,
seeing this again in the logs:
```
15:13:15.167
elastic_agent
[elastic_agent][info] Starting server
15:13:15.168
elastic_agent
[elastic_agent][info] Starting stats endpoint
15:13:15.168
elastic_agent
[elastic_agent][debug] Server started
15:13:15.168
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
```

---------

Co-authored-by: Jen Huang <its.jenetic@gmail.com>
Labels: backport-skip, Team:Elastic-Agent (label for the Agent team)
Projects: none yet
Development: successfully merging this pull request may close the issue "Liveness Probe HTTP Endpoint"
9 participants