
[Ingest Manager] elastic-agent process is not properly terminated after restart #127

Closed
mdelapenya opened this issue Aug 11, 2020 · 20 comments
Labels: bug, good first issue, Team:Elastic-Agent-Control-Plane

Comments

mdelapenya (Contributor) commented Aug 11, 2020

Environment

Steps to Reproduce

  1. Start a centos:7 Docker container: docker run -d --name centos centos:7 tail -f /dev/null
  2. Enter the container: docker exec -ti centos bash
  3. Download the agent RPM package: curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
  4. Install systemctl replacement for Docker: curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
  5. Install the RPM package with yum: yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
  6. Enable service: systemctl enable elastic-agent
  7. Start service: systemctl start elastic-agent
  8. Check processes: top. There should be only one process for the elastic-agent
  9. Restart service: systemctl restart elastic-agent
  10. Check processes: top

Behaviours:

Expected behaviour

After the initial restart, the elastic-agent appears once, not in the Z state.
Screenshot 2020-08-11 at 17 16 34

Current behaviour

After the initial restart, the elastic-agent appears twice: once in the Z state and once in the S state (as shown in the attachment)
Screenshot 2020-08-11 at 17 15 38

Other observations

This behaviour persists across multiple restarts: the elastic-agent process gets into the zombie state each time it is restarted (note that I restarted it three times, so there are 3 zombie processes):
Screenshot 2020-08-11 at 17 18 22

One-shot script

docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash

Inside the container

curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl 
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
systemctl restart elastic-agent
top
mdelapenya added the bug label on Aug 11, 2020
elasticmachine (Contributor) commented:

Pinging @elastic/ingest-management (Team:Ingest Management)

mdelapenya changed the title from "[Ingest Manager] elastic-agent process is not properly terminated after" to "[Ingest Manager] elastic-agent process is not properly terminated after restart" on Aug 11, 2020
mdelapenya (Contributor, Author) commented Aug 12, 2020

I'm going to close this issue, as I was not able to reproduce it in a full-blown CentOS VM, which includes the whole systemd service manager, so it seems related to the limitations of running systemd in Docker.

Steps to reproduce it

  1. Create a VM with CentOS on Google Cloud
  2. Download the RPM package for the elastic-agent
  3. Install it with yum localinstall
  4. Enable the service
  5. Start the service
  6. Check processes are started
  7. Restart the service using systemctl
  8. Check processes are restarted

One-shot script

Once you have SSH'ed into the VM:

sudo su -
cd /
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
ps aux | grep elastic

Restart the service:

systemctl restart elastic-agent
ps aux | grep elastic

There is NO elastic-agent process in the zombie state.

zez3 commented Nov 26, 2021

@mdelapenya
I can reproduce this on Ubuntu: for every policy change, a new zombie process is spawned.
Running Elastic Agent 7.15.2.

zez3 commented Nov 26, 2021

Was this fixed in 8.0?

ruflin added the Team:Elastic-Agent-Control-Plane label on Nov 29, 2021
elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ruflin (Member) commented Nov 29, 2021

@zez3 Reopening for further investigation. Can you share how you reproduced it?

ruflin reopened this on Nov 29, 2021
zez3 commented Nov 29, 2021

@ruflin
Hmm, not sure exactly how, but it happens:

 ps aux | grep 'Z' | grep defunct
root        4746  0.0  0.0      0     0 ?        Zs   Nov09   0:00 [elastic-agent] <defunct>
root        6166  0.0  0.0      0     0 ?        Zs   Nov09   0:00 [elastic-agent] <defunct>
root       74439  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>
root       74468  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>

During that time (16 Nov -> 18 November) I can see these errors in the logs:


Showing entries from Nov 17, 10:37:48
10:37:48.831
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18315/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
Nov 18, 2021
10:31:25.199
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
10:31:25.199
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
10:31:25.199
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
Showing entries until Nov 18, 10:31:25

Other agents have crashed at different times:

ps aux | grep 'Z' | grep defunct
root     2263706  0.0  0.0      0     0 ?        Zs   Nov20   0:00 [elastic-agent] <defunct>
root     2263972  0.0  0.0      0     0 ?        Zs   Nov20   0:00 [elastic-agent] <defunct>

It's not something specific to an exact day:

 ps aux | grep 'Z' | grep defunct
root        3002  0.0  0.0      0     0 ?        Zs   Nov10   0:00 [elastic-agent] <defunct>
root     3257393  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>
root     3259342  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>

I have to check my other agents.

This is on macOS:

mac:~ $ ps aux | grep 'Z'
USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
root               431   0.0  0.0        0      0   ??  Z    Tue04AM   0:00.00 (elastic-agent)

zez3 commented Nov 29, 2021

Perhaps I have a hint. I restarted my Docker service (one by one) on all my ECE hosts.

On 4 of my hosts where the agent is running I managed to get the zombies. Some were not affected.

 ps aux | grep 'Z' | grep defunct
root      294556  0.0  0.0      0     0 ?        Zs   16:26   0:00 [elastic-agent] <defunct>

ps aux | grep 'Z' | grep defunct
root     2231066  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

ps aux | grep 'Z' | grep defunct
root      4499  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

 ps aux | grep 'Z' | grep defunct
root     3244101  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

Let me know if you need some logs or if we should do a live session. There is also a support ticket open for this.

ruflin (Member) commented Nov 30, 2021

As you mention ECE, I assume all the Elastic Agents you list above are running inside a Docker container? Are these the hosted Elastic Agents?

How did you restart the docker service? Does it stop and then start the container? I'm asking because maybe the container restart is causing it.

My general understanding of the <defunct> process is that there should be some parent still around.

To also understand the priority of this issue a bit: the defunct processes are there after a restart, but the system still works as expected?
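
For context on the mechanics: a <defunct> entry means the child has already exited but its parent is still alive and has not yet reaped it with wait, so only the process-table entry remains. A minimal Go sketch of how that state arises (illustrative only, not elastic-agent code):

```go
package main

import (
	"os/exec"
	"time"
)

func main() {
	// Start a child that exits almost immediately.
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// While the parent is alive and has not called Wait, the exited child
	// shows up in `ps` with state Z / <defunct>.
	time.Sleep(30 * time.Second)

	// Reaping the child removes the zombie entry from the process table.
	_ = cmd.Wait()
}
```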

zez3 commented Nov 30, 2021

Nope, my agents are not inside containers; they are on bare-metal machines. I restarted the Docker service on the ECE hosts where Fleet, Kibana and all the other Elastic nodes in my deployments are running. It basically cuts the agents off from Fleet and ES.

Yes, the parent process is still running and operating properly by spawning (restarting) a new child.
If this happens 2-3 times a month, then in one year it would eat a bit of RAM, but nothing critical.

ruflin (Member) commented Dec 1, 2021

In your scenario you can create the defunct processes if you restart the Elastic Agent with the fleet-server that the other Elastic Agents connect to. This is the bit I missed before, as I assumed it happened when you restart the Elastic Agents on the edge.

What happens in this scenario is that the Elastic Agents on the edge temporarily lose their connection to the fleet-server, which indicates to me that this is where we should investigate further, and that it is likely not related in any way to Docker or ECE.

jlind23 (Contributor) commented Dec 1, 2021

@ruflin what you are saying is that when the Agent loses its connection to the fleet-server, then somehow defunct processes are created, right?

zez3 commented Dec 2, 2021

> I assumed it happened when you restart the Elastic Agents on the edge.

The agent was not restarted. Only Fleet + Kibana + ES and all the other ECE containers.

> likely not related in any way to Docker or ECE

Perhaps indirectly, because the Fleet server resides there.

Another hint is that only the agents (beats/children underneath the parent process) with a high (~2000 eps) load on them caused the defunct processes.

ruflin (Member) commented Dec 2, 2021

> a high (~2000 eps) load

Very interesting detail; it will help to investigate this further. We should put load on the Elastic Agent (subprocesses) for testing.

andrewkroh (Member) commented Dec 3, 2021

I don't know the code paths here well (at all), but Stop() looks problematic if it is used without StopWait(). The method does not call exec.Cmd.Wait(), which invokes wait on Linux and is required to release the resources associated with the child process (it also closes some internal channels and goroutines).

Wait() may be called elsewhere but it's hard to verify/ensure that all code paths lead to it.

https://github.com/elastic/beats/blob/a91bba523d2075272d0aad0bd5e7f006d29cdc84/x-pack/elastic-agent/pkg/core/process/process.go#L69-L72
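
A rough sketch of the pattern being described, with hypothetical stop/stopWait helpers (not the actual code from the linked process package): signaling the child without ever reaching exec.Cmd.Wait leaves the exited child unreaped, which is exactly the <defunct> state seen above.

```go
package main

import (
	"os/exec"
	"syscall"
	"time"
)

// proc is a hypothetical wrapper, only to illustrate the two code paths.
type proc struct{ cmd *exec.Cmd }

// stop signals the child but never reaps it; unless some other code path
// calls exec.Cmd.Wait, the exited child lingers as a zombie.
func (p *proc) stop() error {
	return p.cmd.Process.Signal(syscall.SIGTERM)
}

// stopWait signals the child and then reaps it, releasing the kernel's
// process-table entry so no <defunct> process is left behind.
func (p *proc) stopWait() error {
	if err := p.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	_ = p.cmd.Wait() // the "signal: terminated" exit error is ignored in this sketch
	return nil
}

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	p := &proc{cmd: cmd}

	time.Sleep(time.Second)
	// With p.stop() alone, `ps` would keep showing the sleep child as
	// <defunct> for as long as this parent stays alive.
	_ = p.stopWait()
}
```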

jlind23 added the good first issue and v8.1.0 labels and removed the 8.1-candidate label on Dec 6, 2021
ph (Contributor) commented Feb 1, 2022

I think @andrewkroh is right here; we need to audit the stop path of the process. It's currently being changed in elastic/beats#29650.

jlind23 transferred this issue from elastic/beats on Mar 7, 2022
jlind23 removed the v8.1.0 label on Mar 9, 2022
jlind23 (Contributor) commented Mar 9, 2022

Closing this as won't fix; it will be addressed as part of the V2 architecture.

jlind23 closed this as completed on Mar 9, 2022
zez3 commented Mar 9, 2022

@jlind23 Can I track this future V2 architecture somewhere?

jlind23 (Contributor) commented Mar 10, 2022

@zez3 I've started a new issue for it here: #189

zez3 commented Nov 20, 2023

@jlind23

The zombie issue still persists on 8.11.1.

https://www.howtogeek.com/119815/htg-explains-what-is-a-zombie-process-on-linux/
