Filebeat running under Elastic-Agent not harvesting logs after restart #30533
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
I've been having this issue for a while now, combined with a different one where the agent wouldn't update its policy, although it was otherwise reported as healthy.
This particular workaround was suggested by Elastic support, and it seemed to help with the update issue, but it won't reliably fix the harvesting issue.
@vladvasiliu you are right; this is why we are currently trying to find a proper fix for this issue.
Hi, we're also hitting this issue, which is quite problematic since we have 300+ agents connected.
Not really, we have both 7.14.1 (older agents) and 7.16.3 running on CentOS 7 and Ubuntu, both hitting this issue. Yesterday we upgraded some agents from 7.14.1 to 7.16.3, and they started sending logs again after the upgrade. Another agent (7.14.3), which we didn't upgrade, started sending logs again after changing the agent's log level to debug.
@rlapond one alternative is to modify the policy, for example by adding a dummy custom log integration. It'll cause the elastic-agents to receive a new configuration and propagate it to filebeat. I believe it'd be the easiest way to fix several failing filebeats at once.
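For anyone who wants to script this rather than click through Kibana, below is a minimal sketch of forcing a new policy revision by updating the agent policy through the Fleet API. This is only an illustration: the environment variables (KIBANA_URL, KIBANA_USER, KIBANA_PASS, POLICY_ID, POLICY_NAME) are hypothetical, and the endpoint path and required fields should be checked against the Fleet API docs for your stack version.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// Bump an agent policy so that enrolled elastic-agents receive a fresh
// configuration and push it down to filebeat. Sketch only: fields and
// paths may differ between stack versions.
func main() {
	kibana := os.Getenv("KIBANA_URL")  // e.g. https://kibana.example.com:5601
	policyID := os.Getenv("POLICY_ID") // the agent policy to bump

	body, _ := json.Marshal(map[string]string{
		"name":        os.Getenv("POLICY_NAME"), // must match the existing policy name
		"namespace":   "default",
		"description": fmt.Sprintf("revision bump %s", time.Now().Format(time.RFC3339)),
	})

	req, err := http.NewRequest(http.MethodPut,
		fmt.Sprintf("%s/api/fleet/agent_policies/%s", kibana, policyID),
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // Kibana requires this header on write requests
	req.SetBasicAuth(os.Getenv("KIBANA_USER"), os.Getenv("KIBANA_PASS"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Fleet API response:", resp.Status)
}
```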
@rlapond @vladvasiliu Would you say this issue is often related to either a machine restart or a service restart?
@ph Not confirmed on all agents, but on one agent (Linux) it definitely happened because of a machine restart after updating and restarting at 04:00 in the morning. On the Elasticsearch side we didn't receive any logs from that exact moment. After the machine restart, Elastic-agent and Filebeat were running with no errors. I had to restart the Elastic-agent service for filebeat to start shipping logs again.
@AndersonQ: this didn't help in my case, running 8.0.0. The fleet server reports the agents are still out of date after more than an hour. Changing policies back and forth didn't help; agents are still out of date (but otherwise healthy). @ph This seems to be triggered upon restart, yes. Most of my agents are running on POS systems which are turned off at night; I don't restart the agent otherwise. But it seems that restarting only the agent service or the whole system doesn't reliably fix the issue. However, once it starts and it works, I've never noticed it stop working.
Thanks, I am trying to reproduce that exact case.
I've been trying to reproduce the issue locally with a fleet server under load closer to your current situation (around 1200 agents, 99% of which are simulated). I've been killing, restarting gracefully, and changing the configuration between these operations, and the agents always come back up cleanly, ack the new configuration, and report the logs. I've looked at the code path, and we have added more logging for future versions to give more details. But I want to get to the end of this story. Let's start differently: which logs are you monitoring that show you the logs aren't coming in? Are you watching the collected syslog? I want to replicate your situation more closely.
If you look at /opt/Elastic/Agent/data/elastic-agent*/filebeat*, is there any error there?
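In case it helps others check the same thing, here is a small helper (a sketch, not an official tool) that globs those log files and prints every line containing a given substring, for example `"log.level":"error"`. The glob in the usage comment is just an example path; adjust it to your install.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Usage (glob and substring are examples, adjust to your install):
//   go run scanlogs.go '/opt/Elastic/Agent/data/elastic-agent*/filebeat*' '"log.level":"error"'
func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: scanlogs <glob> <substring>")
		os.Exit(1)
	}
	files, err := filepath.Glob(os.Args[1])
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		fh, err := os.Open(f)
		if err != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", f, err)
			continue
		}
		sc := bufio.NewScanner(fh)
		sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long JSON log lines
		for sc.Scan() {
			if strings.Contains(sc.Text(), os.Args[2]) {
				fmt.Printf("%s: %s\n", f, sc.Text())
			}
		}
		fh.Close()
	}
}
```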
@ph, in my case, the agents are running on Windows 2016 LTSB and Server 2019. The 2016 ones are the ones that are shut down for the night, and which have the problems. They're collecting JSON logs from a bunch of files, using the custom log integration. The 2019 ones are collecting only Windows logs, with the system integration. I can't guarantee it 100%, since those servers rarely reboot, but I would say that the issue doesn't happen on the 2019 / system integration agents. (They do have the policy-update-not-working issue, but I suppose that's different.) At one point, I had the system integration activated on the 2016 agents, too, and it seemed to work even when the custom log didn't.
Another point: the agents are configured to send their own logs. When the custom log integration doesn't work, I usually don't get the agent's logs either.
There are no errors in the filebeat logs. But they don't say anything about starting any inputs, either. Then they go on about non-zero metrics every 30 seconds.
If you have a build with enhanced logging, I'd be happy to deploy it on a bunch of machines and report back.
We were affected by this as soon as we started testing Elastic Agent with the Azure logs integration. We are using v7.16.3. It happened when we either restarted the agent or updated the config, which triggered a restart. The Azure integration dropped, and the only error I can see in the logs is:
{"log.level":"debug","@timestamp":"2022-02-22T10:44:55.535Z","log.logger":"add_cloud_metadata","log.origin":{"file.name":"add_cloud_metadata/providers.go","file.line":166},"message":"add_cloud_metadata: received disposition for azure after 2.434822ms. result=[provider:azure, error=failed with http status code 404, metadata={}]","service.name":"filebeat","ecs.version":"1.6.0"}
Several other integrations fail to start, but the agent appears healthy in the Kibana UI. Contact support if you want to do a screen share; I've provided about 20 diagnostics to support, and they all contain the Azure logs integration in the config file.
We're having related issues as well, mostly on Linux, and not only after a restart: we also have endpoints not sending logs after an agent upgrade (about 50 stopped sending logs after an agent upgrade this week). Sometimes changing the policy or changing the logging level helps, but that is not always the case. Most systems are not in our control, but I do have one with this issue right now, with debugging enabled and terminal access. If I can help in any way (setting up a call to speed things up is a possibility), let me know.
What additional information:
Comparing with a working system:
A bit more digging in the filebeat logging. I changed the logging level (which causes a config reload) on a working and a defective system. On a working endpoint this results in "centralmgmt" events for filebeat.inputs, filebeat.outputs and filebeat.modules. On the defunct endpoint only filebeat.outputs is present, but it does show the correct URL for that in the logs. Maybe somebody else can scan their logs for "centralmgmt" events and verify if it's the same issue.
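For anyone who wants to run the same comparison, the log-scanning sketch a few comments above can be pointed at the string centralmgmt instead of an error level. Based on the behaviour described here, a working host should show reload events for filebeat.inputs, filebeat.outputs and filebeat.modules, while an affected host only shows filebeat.outputs.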
@renekalff Can you verify how many filebeat processes run on a problematic host? I was able to reproduce it 2 times.
There are two processes running: one for the agent logs, the other for the system logs.
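For Linux hosts, a quick way to double-check this is to count the processes whose command line mentions filebeat by reading /proc. This is only a sketch for Linux; on the Windows hosts mentioned earlier you would check Task Manager or the service processes instead.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// List and count processes whose command line contains "filebeat".
func main() {
	matches, _ := filepath.Glob("/proc/[0-9]*/cmdline")
	count := 0
	for _, p := range matches {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process may have exited while we were scanning
		}
		// /proc/<pid>/cmdline is NUL-separated; make it readable.
		cmd := strings.ReplaceAll(string(bytes.TrimRight(data, "\x00")), "\x00", " ")
		if strings.Contains(cmd, "filebeat") {
			pid := filepath.Base(filepath.Dir(p))
			fmt.Printf("pid %s: %s\n", pid, cmd)
			count++
		}
	}
	fmt.Printf("%d filebeat process(es) found\n", count)
}
```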
This is really good, because it looks like the registry points for the inputs and modules aren't there.
@renekalff Thanks for your help. I know where the problem is and I will work on a fix, but I cannot reproduce it reliably because it's a timing/visibility issue. The problem is that, in some circumstances, the manager that takes care of the reloadable parts of Beats is not completely initialized by the time it is ready to accept the configuration.
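To illustrate the kind of race described above, here is a standalone sketch (not the actual Beats code; the names Registry, Reloadable and Apply are hypothetical). If a reload manager starts accepting configuration before all reloadable parts have registered, the sections whose parts registered late are dropped, which would match the symptom of only the output being applied after a restart.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Reloadable is anything that can accept a new configuration section.
type Reloadable interface {
	Reload(cfg string)
}

// Registry holds the reloadable parts (output, inputs, modules, ...).
// In this sketch, sections that arrive before the matching part has
// registered are silently dropped.
type Registry struct {
	mu    sync.Mutex
	parts map[string]Reloadable
}

func NewRegistry() *Registry {
	return &Registry{parts: make(map[string]Reloadable)}
}

func (r *Registry) Register(name string, p Reloadable) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.parts[name] = p
}

// Apply pushes each config section to its registered part.
func (r *Registry) Apply(cfg map[string]string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for section, c := range cfg {
		if p, ok := r.parts[section]; ok {
			p.Reload(c)
		} else {
			fmt.Printf("section %q dropped: nothing registered yet\n", section)
		}
	}
}

type printer string

func (p printer) Reload(cfg string) { fmt.Printf("%s reloaded with %q\n", p, cfg) }

func main() {
	reg := NewRegistry()

	// The output registers early...
	reg.Register("output", printer("output"))

	// ...while inputs and modules finish registering a bit later during startup.
	go func() {
		time.Sleep(50 * time.Millisecond)
		reg.Register("inputs", printer("inputs"))
		reg.Register("modules", printer("modules"))
	}()

	// The policy from the elastic-agent arrives before registration finished:
	// only "output" is applied; "inputs" and "modules" are lost until the next
	// policy change triggers another Apply.
	reg.Apply(map[string]string{
		"output":  "elasticsearch",
		"inputs":  "log collection",
		"modules": "system",
	})

	time.Sleep(100 * time.Millisecond)
}
```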
It would be awesome if that fixes the issue. We tried to reproduce this last week and also weren't able to.
I will work on a fix and get that released asap. |
I have PR #30694 open; I will get it reviewed and tested ASAP.
Hi Team
Build details:
Logs:
Thanks!
After a restart, a filebeat running under the elastic-agent doesn't start harvesting logs. Upon restart, filebeat receives the config from the elastic-agent and processes it, but only the output section is applied. It happens inconsistently; so far it has been reported on Linux and Windows endpoints. In a fleet of agents enrolled to the same fleet-server, only a few will show this behaviour. Changing the elastic-agent's policy or restarting it fixes the problem.
Even when showing this behaviour, there are no errors in the logs of either filebeat or elastic-agent.
Workarounds:
tl;dr:
the long version:
The elastic-agent needs to receive a new configuration/policy from fleet-server and forward it to filebeat. Any of the above methods will cause the elastic-agent to receive a new policy and update filebeat's configuration.
Known facts:
This means that the system is correctly configured and sane, and that it is able to recover from the situation.
Only the output block is modified after a restart.
What has been tried:
We might have been able to reproduce once, but we do not have the log to inspect.
Related PRs: