# [Ingest Manager] elastic-agent process is not properly terminated after restart #127
Pinging @elastic/ingest-management (Team:Ingest Management)
I'm going to close this issue, as I was not able to reproduce it in a full-blown CentOS VM, which includes the whole systemd service manager, so it seems related to the limitations of systemd in Docker.

Steps to reproduce it (one-shot script), once you have SSH'ed into the VM:

```sh
sudo su -
cd /
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
ps aux | grep elastic
```

Restart the service:

```sh
systemctl restart elastic-agent
ps aux | grep elastic
```

There is NO elastic-agent in the Zombie state.
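As an aside, a quick way to double-check for lingering zombies besides reading `ps` output is to scan `/proc` directly. The following is a hypothetical helper in Go (the agent's implementation language), not part of elastic-agent, that lists every process currently in the Z state:

```go
// zombiecheck: list Linux processes in the Z (zombie) state by scanning /proc.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		pid := e.Name()
		if pid[0] < '0' || pid[0] > '9' {
			continue // not a process directory (e.g. /proc/cpuinfo)
		}
		data, err := os.ReadFile("/proc/" + pid + "/stat")
		if err != nil {
			continue // the process exited between ReadDir and ReadFile
		}
		// /proc/<pid>/stat looks like "pid (comm) state ..."; the state
		// field follows the last ')', because comm itself may contain
		// spaces or parentheses.
		s := string(data)
		end := strings.LastIndexByte(s, ')')
		fields := strings.Fields(s[end+1:])
		if len(fields) > 0 && fields[0] == "Z" {
			comm := s[strings.IndexByte(s, '(')+1 : end]
			fmt.Printf("zombie: pid=%s comm=%s\n", pid, comm)
		}
	}
}
```

Running it right after `systemctl restart elastic-agent` should print nothing on a healthy host.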
@mdelapenya
Was this fixed in 8.0?
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@zez3 Reopening for further investigation. Can you share how you reproduced it?
@ruflin
During that time (16 Nov -> 18 November) I can see these errors in other agents that have crashed at different times; it's not something specific to an exact day. I have to check my other agents. This is on macOS.
Perhaps I have a hint: I restarted my Docker service (one by one) on all my ECE hosts. On 4 of the hosts where the agent is running I managed to get the zombies; some were not affected.
Let me know if you need some logs or if we should set up a live session. There is also a support ticket open for this.
As you mention ECE, I assume all the Elastic Agents that you listed above are running inside a Docker container? These are the hosted Elastic Agents? How did you restart the Docker service? Does it stop and then start the container? I'm asking because maybe the container restart is causing it.

To also understand the priority of this issue a bit: the defunct processes are there after a restart, but the system still works as expected?
Nope, my agents are not inside containers; they are on bare-metal machines. I restarted the Docker service on the ECE hosts where Fleet, Kibana and all the other Elastic nodes in my deployments are running. That basically cuts the agent off from Fleet and ES. Yes, the parent process is still running and operating properly by spawning (restarting) a new child.
In your scenario, you can create the defunct processes if you restart the Elastic Agent with the fleet-server that the other Elastic Agents connect to. This is the bit I missed before, as I assumed it happened when you restart the Elastic Agents on the edge. What happens in this scenario is that the Elastic Agents on the edge temporarily lose their connection to the fleet-server, which indicates to me that this is where we should investigate further; it is likely not related in any way to Docker or ECE.
@ruflin what you are saying is that when the Agent loses its connection to the fleet-server, then somehow defunct processes are created, right?
The agent was not restarted. Only Fleet + Kibana + ES and all the other ECE containers.

> most likely is not related in any way to Docker or ECE

Perhaps indirectly, because the Fleet Server resides there. Another hint is that only the agents (the beats/children underneath the parent process) with a high (~2000 eps) load on them produced the defunct processes.
Very interesting detail; it will help to investigate this further. We should put load on the Elastic Agent subprocesses for testing.
I don't know the code paths here well (at all), but …
I think @andrewkroh is right here; we need to audit the stop path of the process. It's currently being changed in elastic/beats#29650.
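For readers following along, here is a minimal Go sketch of the failure mode a stop-path audit would look for. This is illustrative only, not the agent's actual code: a supervisor kills its child on restart but never collects the exit status, so the kernel keeps the dead child as a zombie for as long as the parent lives:

```go
// Minimal reproduction of a supervisor leaking a zombie child.
package main

import (
	"os/exec"
	"time"
)

func main() {
	// Stand-in for a Beat subprocess managed by the agent.
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	time.Sleep(time.Second)
	_ = cmd.Process.Kill() // terminate the child, e.g. as part of a restart

	// BUG: without this call the child stays in the Z state until the
	// parent exits, because its exit status is never reaped.
	// _ = cmd.Wait()

	// Parent keeps running; `ps aux` now shows "[sleep] <defunct>".
	time.Sleep(10 * time.Minute)
}
```

Uncommenting the `cmd.Wait()` line reaps the child immediately, which is exactly the kind of guarantee the stop path needs on every exit route (normal stop, restart, crash of the child).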
Closing it as won't fix. It will be part of the V2 architecture.
@jlind23 Can I track this future V2 architecture somewhere?
The zombie issue still persists on 8.11.1. (For background on zombie processes, see https://www.howtogeek.com/119815/htg-explains-what-is-a-zombie-process-on-linux/.)
## Environment

## Steps to Reproduce

```sh
docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
top
```

There should be only one process for the elastic-agent. Then restart it:

```sh
systemctl restart elastic-agent
top
```
## Behaviours

### Expected behaviour

After the initial restart, the elastic-agent appears once, not in the Z state.

### Current behaviour

After the initial restart, the elastic-agent appears twice: one process in the Z state and the other in the S state (as shown in the attachment).

### Other observations

This behaviour persists across multiple restarts: the elastic-agent process gets into the zombie state each time it is restarted (note that I restarted it three times, so there are 3 zombie processes).
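One plausible explanation for why the full CentOS VM did not reproduce this while the container did (an assumption, not a confirmed diagnosis): inside the container, systemd is replaced by the `systemctl.py` script, and nothing may be reaping children the way a real init does. A minimal Go sketch of the reaping loop that a PID 1 process such as systemd or tini runs, so that terminated services never linger as zombies:

```go
// Sketch of a PID 1 reaper loop (what a real init provides and a
// minimal container entrypoint may lack).
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD) // the kernel notifies us when a child exits

	for range sigs {
		// Reap every pending child, not just one: SIGCHLD signals coalesce.
		for {
			var status syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}
```

With a reaper like this as PID 1 (for example via `docker run --init`, which injects tini), orphaned zombies get collected; without one, they accumulate exactly as shown in the repro. Note this only covers children re-parented to PID 1; a zombie held by a live elastic-agent parent still requires the agent itself to call wait, as discussed in the comments above.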
## One-shot script

```sh
docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
```

Inside the container:

```sh
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
systemctl restart elastic-agent
top
```