Add `/liveness` endpoint to elastic-agent #4499
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
Code looks OK, just a couple of questions about when we want to fail the liveness probe and about copying potentially big structures in memory.
This pull request is now in conflicts. Could you fix it? 🙏
So, I'm still doing manual testing, but updating this as-is while I figure out how integration tests should work.
```diff
@@ -977,6 +999,8 @@ func (c *Coordinator) runLoopIteration(ctx context.Context) {
 	case upgradeDetails := <-c.upgradeDetailsChan:
 		c.setUpgradeDetails(upgradeDetails)

+	case c.heartbeatChan <- struct{}{}:
```
Question on this: this looks like we will put something on the channel as soon as possible, so if the coordinator gets blocked after that, you would read from the `heartbeatChan` and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

If it is, I'd rather see something that records a timestamp every time `runLoopIteration` is called, then we can check to see if that timestamp is within our timeout window.
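A minimal sketch of that timestamp idea, assuming hypothetical names like `lastIteration` and `IsAlive` (this is not the PR's code):

```go
package coordinator

import (
	"context"
	"sync/atomic"
	"time"
)

// Illustrative only: the real Coordinator has many more fields.
type Coordinator struct {
	lastIteration atomic.Int64 // unix nanoseconds of the last run-loop pass
}

// runLoopIteration records a timestamp on every pass before doing its work.
func (c *Coordinator) runLoopIteration(ctx context.Context) {
	c.lastIteration.Store(time.Now().UnixNano())
	// ... existing select over the coordinator's channels ...
}

// IsAlive reports whether the run loop iterated within the given window,
// e.g. twice the check-in interval.
func (c *Coordinator) IsAlive(window time.Duration) bool {
	last := time.Unix(0, c.lastIteration.Load())
	return time.Since(last) <= window
}
```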
> so if the coordinator gets blocked after that, you would read from the heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?
So, if I understand you correctly, yes. We could end up in a state where the coordinator blocks right after a heartbeat call. Because the liveness endpoint is meant to be repeated on some kind of regular period, I'm not too worried about that.
I specifically didn't go with some kind of timestamp mechanism on @faec's advice, since she was worried about the added complexity/edge cases of a time comparison compared to a simple signal like this.
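For reference, a rough sketch of how the signal-based check can be consumed on the liveness side; `isAlive` and the trimmed-down `Coordinator` struct here are illustrative, not the PR's actual implementation:

```go
package coordinator

import (
	"context"
	"time"
)

// Illustrative only: the real Coordinator has many more fields.
type Coordinator struct {
	heartbeatChan chan struct{}
}

// isAlive tries to receive the heartbeat that the run loop's select sends.
// If the loop is blocked, nothing arrives, the timeout fires, and the
// liveness endpoint can report failure.
func (c *Coordinator) isAlive(ctx context.Context, timeout time.Duration) bool {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case <-c.heartbeatChan:
		return true
	case <-ctx.Done():
		return false
	case <-t.C:
		return false
	}
}
```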
I think this just tells us that the `runLoopIteration` function can run and the select can hit the heartbeat case. It doesn't tell us if the coordinator can/has processed any of the other cases.

I was thinking that to be "alive", we want to know that one of the case statements besides the heartbeat statement has run. We would probably need to bound that comparison with the check-in interval. Or to put it another way, the `runLoopIteration` function should happen at least every check-in interval (2x is probably safer), and it might happen more frequently.
Yeah, I sort of agree, this only really works as a basic heartbeat.
My concern is that "has the coordinator done anything else?" would be a bit flaky, as we can't really guarantee what state other sub-components will be in at any given time. @faec can comment more, but for that to work we might need to refactor part of the coordinator to allow for a more sophisticated health check.
Mostly LGTM, a couple of minor things; in particular, the changelog has a typo.
```go
)

func TestConfigUpdateOnReload(t *testing.T) {
```
Am I blind or is this test just starting the server and not actually reloading anything?
Yeah, the name is a tad deceptive; the integration test is the only one doing the reloading.
Thanks, sanity checked in a local kind cluster to make sure it interoperates with k8s as expected.
Using

```yaml
livenessProbe:
  httpGet:
    path: /liveness
    port: 6792
  initialDelaySeconds: 3
  periodSeconds: 3
```
With a Fleet managed agent configured with:
```
PUT kbn:/api/fleet/agent_policies/3de09e84-54ec-479e-851b-f4947ff95262
{
  "name": "Policy",
  "namespace": "default",
  "overrides": {
    "agent": {
      "monitoring": {
        "http": {
          "enabled": true,
          "host": "0.0.0.0",
          "port": 6792
        }
      }
    }
  }
}
```
Works as expected.
This reverts commit 29ce53e.
…lastic#4583) This reverts commit eca5bc7.
* Reapply "Add `/liveness` endpoint to elastic-agent (#4499)" (#4583)

  This reverts commit eca5bc7.
* add behavior to not disable http monitor on reload with nil config, add tests
* improve comments
* linter
* more linter...
* fix spelling
* check original config state when reloading config
* change behavior of config set from overrides
* fix tests
* add second test to make sure old behavior with hard-coded monitoring config still works
* rename method
@cmacknz is there also a way to enable monitoring via the agent policies UI?
Logs and Metrics monitoring has always been possible to configure in the UI. This PR makes it possible to configure the agent to expose the liveness endpoint via the UI, but what polls the `/liveness` endpoint is deployment specific, and Fleet itself can't/won't monitor it since it is local to the machine the agent is on.
Thanks, I am well aware of the functionality of the liveness and readiness probes. However, adding the probe definition to the daemonset is not enough. By default the agents are not listening on the port used in your snippet. Therefore, I am trying to figure out where to enable the exposure of it, so that it can be used for the probing. I might have overlooked the section in the UI to enable it. Could you maybe provide a screenshot? PS: I am talking about fleet managed agents.
The first problem you have is that the code in this PR is not released yet, and won't be until 8.15.0. The steps in #4499 (review) won't work until then.

The second thing with a Fleet managed agent prior to 8.15.0 is that the […].

The third thing is that you likely need to change the default host the monitoring server binds to to `0.0.0.0`, as in the policy override example above, since it binds to localhost by default.
…180922)

## Summary

Related to elastic/ingest-dev#2471

With elastic/elastic-agent#4499 merged, it became possible to reload monitoring settings changes in a running agent, so enabling these settings on the UI.

To verify:
- Create an agent policy and enroll an agent with latest 8.15-SNAPSHOT
- edit agent policy, and change monitoring settings, e.g. change port
- verify that the metrics endpoint is running on the new port

<img width="958" alt="image" src="https://github.com/elastic/kibana/assets/90178898/91e3d2ec-8275-40c3-b5a6-7cdbb6b07cd3">
<img width="1109" alt="image" src="https://github.com/elastic/kibana/assets/90178898/83be9610-5095-485f-83fd-bf4dbe5cb44a">

@cmacknz Does it make sense to allow changing the host name? It seems to me that monitoring can only work on localhost. Another question: how can we verify that the `buffer.enabled` setting is applied correctly?

```
15:07:40.054 elastic_agent [elastic_agent][error] Failed creating a server: failed to create api server: listen tcp 192.168.178.217:6791: bind: cannot assign requested address
```

Also, I'm not sure if switching off the `enabled` flag did anything, seeing this again in the logs:

```
15:13:15.167 elastic_agent [elastic_agent][info] Starting server
15:13:15.168 elastic_agent [elastic_agent][info] Starting stats endpoint
15:13:15.168 elastic_agent [elastic_agent][debug] Server started
15:13:15.168 elastic_agent [elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
```

---------

Co-authored-by: Jen Huang <its.jenetic@gmail.com>
What does this PR do?
Closes #390
Why is it important?
This adds a `/liveness` endpoint to elastic agent that we enable along with the `processes` endpoint. However, unlike the `/processes` endpoint, `liveness` will return a `500` if any components are in an Unhealthy or Degraded state.

This is a pretty simple change as-is, since I figured it would be easier to discuss other theoretical changes when we can actually see code. In particular: should enabling this stay tied to the `monitoring.http` flag? Should we enable it based on some k8s autodiscover config or env variable or something else? If this feature is fundamentally tied to a k8s workflow, it seems like enabling it should also be tied to k8s in some way.

I'm also holding off on changing any docs until we're sure we have what we want with regards to config.
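A rough sketch of the behavior described above, with `State` and `componentStates` as assumed stand-ins for the agent's real component-state types (not the PR's actual API):

```go
package monitoring

import "net/http"

// State is a stand-in for the agent's component state values.
type State int

const (
	Healthy State = iota
	Degraded
	Unhealthy
)

// livenessHandler returns 200 while every component is healthy and 500 as soon
// as any component reports Degraded or Unhealthy, mirroring the behavior
// described above.
func livenessHandler(componentStates func() map[string]State) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for id, st := range componentStates() {
			if st == Degraded || st == Unhealthy {
				http.Error(w, "component "+id+" is not healthy", http.StatusInternalServerError)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```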
How to test this PR
- Set `monitoring.http.enabled` to `true`
- Check `/liveness/` and `/liveness/[component-id]` using either curl or a k8s health check (a minimal Go check is sketched below)
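For a quick manual check outside of k8s, something like the following is roughly equivalent to `curl -i http://localhost:6792/liveness`; the port assumes the monitoring override shown earlier in this thread:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Port 6792 assumes the monitoring override used earlier in this thread.
	resp, err := http.Get("http://localhost:6792/liveness")
	if err != nil {
		fmt.Println("agent unreachable:", err)
		return
	}
	defer resp.Body.Close()
	// 200 means alive; 500 means a component is Unhealthy or Degraded.
	fmt.Println("liveness status:", resp.StatusCode)
}
```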
Checklist

- I have added an entry in `./changelog/fragments` using the changelog tool