[BUG] agent leaves defunct processes with version 7.38.0 #12997
Comments
Hello @noizwaves, thank you very much for your detailed bug report! We confirm the regression you described. We've understood what is happening and we are currently actively working on a fix (#13019). Before a fix is released, here are some mitigations that can be used to work around the bug.

Running the agent in the host PID namespace will make the zombie processes adopted by the host PID 1 (systemd), which will properly reap them. So, the first workaround consists in adding this to the Helm values:

datadog:
  dogstatsd:
    useHostPID: true

The possibility to make the agent run in the host PID namespace was initially proposed for the dogstatsd origin detection feature, but it can also be used as a workaround for this issue.

The second possible workaround consists in explicitly resetting the LD_PRELOAD environment variable:

datadog:
  env:
    - name: LD_PRELOAD
      value: ""
Many thanks for the quick turn around on this @L3n41c!

It looks like we have the same issue. We deployed Datadog from the Helm chart datadog-2.37.0. It uses Datadog 7.38.1. However, I checked our

I tried the second solution. The

Hi @wd,

Hi @wd,

@sgnn7 thank you!
After upgrading to datadog-agent version 7.38.0 (and also 7.38.1) for our Kubernetes monitoring, the agent process is producing an excessive number of defunct processes. These processes accumulate, and once the system limit (i.e. /proc/sys/kernel/pid_max) is reached, new processes cannot spawn, resulting in general system instability.

The instability presents itself as errors in other Kubernetes internal components and also in workloads running on the cluster, including (but not limited to):
mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
read init-p: connection reset by peer: unknown
fork/exec /proc/self/exe: resource temporarily unavailable: unknown
failed to create new OS thread (have 4 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
For our workloads, this happened after ~4.5 days of node uptime.
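For context, a quick way to see how close a node is getting to that limit is to compare the system-wide PID limit with the current process and zombie counts. A sketch using standard procps/Linux tooling, run on the node itself (these commands are generic and not taken from the report):

# system-wide limit on process IDs
cat /proc/sys/kernel/pid_max
# total processes currently known to the kernel
ps -e --no-headers | wc -l
# defunct (zombie) processes only
ps -eo stat= | grep -c '^Z'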
Agent Environment
Describe what happened:
Running ps -e | grep defunct | wc -l on a Kubernetes node running 7.38.0 reveals an increasing number of defunct processes on the system.

Investigating the processes reveals that the problematic processes share parent PID 26607, and that parent is agent run.
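A sketch of the kind of inspection described above, again with standard procps tools (the parent PID 26607 comes from this report; it will differ on other nodes):

# list zombies together with their parent PID
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# confirm what the common parent process actually is
ps -o pid,args -p 26607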
For 7.37.1, ps -e | grep defunct | wc -l always returns 0.

Describe what you expected:
The number of defunct processes should always be 0.
Steps to reproduce the issue:
while true; do ps -e | grep defunct | wc -l; sleep 60; done
Additional environment details (Operating System, Cloud provider, etc):
uname -a of:
Linux eks-cloud-dev-us-west-2a-staging-i-04816b5f5ba502ba7 5.4.196-108.356.amzn2.x86_64 #1 SMP Thu May 26 12:49:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

DataDog agent deployed via Helm (chart version datadog-2.32.3) with values:

and datadog-custom-envs of:

Edit: Add extra debugging information about defunct processes