Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat running under Elastic-Agent not harvesting logs after restart #30533

Closed
AndersonQ opened this issue Feb 22, 2022 · 26 comments · Fixed by #30694
Closed

Filebeat running under Elastic-Agent not harvesting logs after restart #30533

AndersonQ opened this issue Feb 22, 2022 · 26 comments · Fixed by #30694
Assignees
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify blocker bug QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team v8.2.0

Comments

@AndersonQ
Copy link
Member

AndersonQ commented Feb 22, 2022

After a restart a filebeat running under the elastic-agent doesn't start harvesting logs. Upon restart filebeat receive the config from the elastic-agent, it's processed, however only the output is applied.It happens inconsistently, so far reported on Linux and Windows endpoints.

In a fleet of agents enrolled to the same fleet-server only a few will show this behaviour. A changing the elastic-agent's policy or restarting it fixes the problem.

Even when showing this behaviour there are no errors in the logs of either filebeat or elastic-agent.

Workarounds:

tl;dr:

  • modify the policy. E.g. add a dummy custom log integration OR
  • swap the policy assigned to the affected agent. E.g. assign a dummy policy and then assign back the proper policy OR
  • restart the elastic-agent

the long version:
The elastic-agent needs to receive a new configuration/policy from fleet-server and forward it to filebeat. Any of the above methods will cause the elastic-agent to receive a new policy and update filebeat's configuration.

Known facts:

  • When a new configuration with changes is send to the Agent, it will restart sending events.
    This mean that the system is correctly configured and sane and it is able to recover from the situation.
  • We can confirm the configuration is available it's retrieved from the diagnostic command.
  • We can confirm in the logs that the output receive the Elasticsearch configuration, this is the only
    block that is modified after a restart.
  • We know that Filebeat is up and running, this means that the Agent detect input size is > 0.
  • If there is a problem in the configuration block the manager in libbeat will just ignore the error and continue.
  • Filebeat load the registrar state.
  • I know that Filebeat doesn't start any log harvester from the periodic metric dump.
  • I know that Filebeat is not processing any events from the periodic metric dump.
  • It's a random issue, if you have a fleet of agent some agent can be healthy and send events and others not.
  • It's confirmed to happen on Linux and Windows

What have been tried:

  • We have tried to reproduce the issues by simulating the configuration changes with nginx or custom log harvester.
    We might have been able to reproduce once, but we do not have the log to inspect.

Related PRs:

@AndersonQ AndersonQ added the bug label Feb 22, 2022
@AndersonQ AndersonQ self-assigned this Feb 22, 2022
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 22, 2022
@ph ph added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 22, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 22, 2022
@jlind23 jlind23 added backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify v8.2.0 labels Feb 23, 2022
@vladvasiliu
Copy link

I've been having this issue for a while, now, combined with a different one where the agent wouldn't update its policy, although it was otherwise reported as healthy.

When a new configuration with changes is send to the Agent, it will restart sending events.

This particular workaround was suggested by Elastic support, and it seemed to help with the update issue, but it won't reliably fix the harvesting issue.

@jlind23
Copy link
Collaborator

jlind23 commented Feb 28, 2022

@vladvasiliu you are right, this is why we are currently trying to find a proper fix for this issue.

@rlapond
Copy link

rlapond commented Mar 2, 2022

Hi,

We're also hitting this issue, which is quite problematic since we have 300+ agents connected.

@jlind23
Copy link
Collaborator

jlind23 commented Mar 2, 2022

@rlapond do you face this issue on some particular agents?
cc @ph

@rlapond
Copy link

rlapond commented Mar 2, 2022

Not really, we have both 7.14.1 (older agents) and 7.16.3 running on Centos7 and Ubuntu both hitting this issue. Yesterday we've upgraded some agents from 7.14.1 to 7.16.3, which started sending logs again after the upgrade.

Another agent (7.14.3) which we didn't upgrade started sending logs again after changing the log level for the agent to debug.

@AndersonQ
Copy link
Member Author

@rlapond one alternative it to modify the policy, for example, adding a dummy custom log integration. It'll cause the elastic-agents to receive a new configuration and propagate it to filebeat. I believe it'd be the easiest to fix several failing filebeats at once.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

@rlapond @vladvasiliu Would you say this issue is often related to either a machine restart or a service restart?

@rlapond
Copy link

rlapond commented Mar 2, 2022

@ph Not confirmed on all Agents, but on 1 Agent (Linux) it for sure happened because of a machine restart after updating and restarting at 04:00 in the morning. On Elasticsearch side we didn't receive any logs from that exact moment.

After the machine restart Elastic-agent and Filebeat were running with no errors. I had restart the Elastic-agent service for filebeat to start shipping logs again.

@vladvasiliu
Copy link

@AndersonQ : this didn't help in my case, running 8.0.0. The fleet server reports the agents are still out of date after more than an hour. Changing policies back and forth didn't help, agents are still out of date (but otherwise healthy).

@ph This seems to be triggered upon restart, yes. Most of my agents are running on POS systems which are turned off at night. I don't restart the agent otherwise. But it seems that restarting only the agent service or the whole system doesn't reliably fix the issue. However, once it starts, and it works, I've never noticed it stop working.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

Thanks, I am trying to reproduce that exact case.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

@vladvasiliu @rlapond

I've been trying to reproduce the issue locally with a fleet server under load closer to your current situation (around 1200 agents where 99% are simulated), I've been killing, restarting gracefully, changing the configuration between these operations and the agent always come back up, cleanly and ack the new configuration and report the logs. I've looked a the code path and we have added more logging for a future versions to give more details.

But I want to get to the end of this story.

Let's start differently, which logs you are monitoring that show you the behavior that the logs aren't coming? Are you watching the collected syslog? I am to replicate more closely your situation.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

If you look at the /opt/Elastic/Agent/data/elastic-agent*/filebeat* is there any error there?

@vladvasiliu
Copy link

@ph, in my case, the agents are running on Windows 2016 LTSB and Server 2019.

The 2016 ones are the ones that are shut down for the night, and which have the problems. They're collecting JSON logs from a bunch of files, using the custom log integration.

The 2019 ones are collecting only Windows logs, with the system integration.

I can't guarantee it 100%, since those servers rarely reboot, but I would say that the issue doesn't happen on the 2019 / system integration agents. (They do have the policy update not working issue, but I suppose that's different).

At one point, I had the system integration activated on the 2016 agents, too, and it seemed to work even when the custom log didn't.

Another point: The agents are configured to send their own logs. When the custom log integration doesn't work, I usually don't get the agent's logs either.

There are no errors in the filebeat logs. But they don't say anything about starting any inputs, either. Then they go on about non-zero metrics every 30 seconds.

If you have a build with enhanced logging, I'd be happy to deploy it on a bunch of machines and report back.

@mbudge
Copy link
Contributor

mbudge commented Mar 3, 2022

We were affected by this as soon as we started testing Elastic Agent with the Azure logs integration. We are using v7.16.3

It was when we either restarted the agent or updated the config which triggered a restart. The Azure integration dropped and the only error I can see in the logs is

{"log.level":"debug","@timestamp":"2022-02-22T10:44:55.535Z","log.logger":"add_cloud_metadata","log.origin":{"file.name":"add_cloud_metadata/providers.go","file.line":166},"message":"add_cloud_metadata: received disposition for azure after 2.434822ms. result=[provider:azure, error=failed with http status code 404, metadata={}]","service.name":"filebeat","ecs.version":"1.6.0"}

Several other integrations fail to start but the agent appears healthy in the Kibana UI.

Contact support if you want to do a screen share. I've provided about 20 diagnostics to support and they all contain the azure logs integration in the config file.

@renekalff
Copy link

renekalff commented Mar 3, 2022

We're having related issue's as well. Mostly on Linux and not only after restart but we also have endpoints not sending logs after an agent upgrade (about 50 stopped sending logs after an agent upgrade this week). Sometimes changing the policy or changing the logging level helps but that is not always the case. Most systems are not in our control but I do have one with this issue right now with debugging enabled and terminal access. If I can help in any way (setting up a call to speed things up is a possibility) let me know.

@renekalff
Copy link

What additional information:

  • Agent show's healthy
  • Agent version 7.16.3
  • Metrics are received but no logs
  • CentOS 7

Comparing with a working system:
./elastic-agent inspect | grep -v api_key | md5sum show's agent is running with the same configuration
curl -XGET --unix-socket ./filebeat.sock 'http:/state?pretty show's 0 input.count instead of 2 and no output.elasticsearch.cluster_uuid
./elastic-agent diagnostics shows no differences except for the id's
cat /proc/<filebeat pid>/cmdline | md5sum show's filebeat is started with same parameters

@renekalff
Copy link

Bit more digging in the filebeat logging. I changed to logging level (which causes a config reload) on a working and the defect system. On a working endpoint this results in "centralmgmt" events for filebeat.inputs, filebeat.outputs and filebeat.modules. On the defunct endpoint only the filebeat.outputs is present but it does show the correct url for that in the logs.

Maybe somebody else can scan their logs for "centralmgmt" events and verify if it's the same issue.
filebeat.json.txt

@ph
Copy link
Contributor

ph commented Mar 3, 2022

@renekalff Can you verify how many filebeat processes run on a problematic host? I was able to reproduce it 2 times.

@renekalff
Copy link

There are two processes running one for the agent logs the other for the system logs.

@renekalff Can you verify how many filebeat processes run on a problematic host? I was able to reproduce it 2 times.

@ph
Copy link
Contributor

ph commented Mar 3, 2022

{"log.level":"info","@timestamp":"2022-03-03T11:20:41.207+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.732+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://:443","service.name":"filebeat","ecs.version":"1.6.0"}

This is really good, because, this look like the registry point for the inputs and modules aren't there.
If I remember correctly the "Applying (inputs, modules, output)" is based what exist in the registry of the reloadable part.

@ph
Copy link
Contributor

ph commented Mar 3, 2022

@renekalff Thanks for your help, I know where the problem is and I will work on a fix, but I cannot reproduce it reliably because it's a timing/visibility issue. The problem is in some circumstances, the manager that takes care of the reloadable part of beats is not completely initialized when the manager is ready to accept the configuration.

@renekalff
Copy link

@renekalff Thanks for your help, I know where the problem is and I will work on a fix, but I cannot reproduce it reliably because it's a timing/visibility issue. The problem is in some circumstances, the manager that takes care of the reloadable part of beats is not completely initialized when the manager is ready to accept the configuration.

Would be awesome if that fixes the issue. We've tried to reproduce this last week and also weren't able to.

@ph
Copy link
Contributor

ph commented Mar 3, 2022

I will work on a fix and get that released asap.

@ph
Copy link
Contributor

ph commented Mar 4, 2022

I have the PR open #30694, will get that reviewed and tested asap.

@jlind23 jlind23 linked a pull request Mar 7, 2022 that will close this issue
3 tasks
@ph ph closed this as completed in #30694 Mar 14, 2022
@amolnater-qasource
Copy link

Hi Team
We have revalidated this issue on 8.2 Snapshot Kibana cloud environment and found it fixed now.
Host: Windows 10

  • We restarted the Agent and found filebeat-metricbeat logs generating after agent restart successfully.

Build details:

BUILD: 51276
COMMIT: 32d08793f02171391ab4b0d3044970c3f14c2d94
Artifact Link: https://snapshots.elastic.co/8.2.0-5cc993c1/downloads/beats/elastic-agent/elastic-agent-8.2.0-SNAPSHOT-windows-x86_64.zip

Screenshot:
8

Logs:
logs.zip

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify blocker bug QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team v8.2.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants