
[BUG] agent leaves defunct processes with version 7.38.0 #12997

Closed
noizwaves opened this issue Aug 3, 2022 · 7 comments · Fixed by #13019

noizwaves commented Aug 3, 2022

After upgrading to datadog-agent version 7.38.0 (and also 7.38.1) for our Kubernetes monitoring, the agent process is producing an excessive number of defunct processes. These processes accumulate, and once the system limit (i.e. /proc/sys/kernel/pid_max) is reached, new processes cannot spawn, resulting in general system instability.

The instability presents itself as errors in other Kubernetes internal components and also workloads running on the cluster, including (but not limited to):

  • mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
  • read init-p: connection reset by peer: unknown
  • fork/exec /proc/self/exe: resource temporarily unavailable: unknown
  • failed to create new OS thread (have 4 already; errno=11) runtime: may need to increase max user processes (ulimit -u) fatal error: newosproc

For our workloads, this happened after ~4.5 days of node uptime.
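
For context, the kernel's PID ceiling and the current number of zombie processes can be checked on a node with standard tooling (illustrative commands; exact output varies by host):

$ cat /proc/sys/kernel/pid_max        # maximum number of PIDs the kernel will hand out
$ ps -eo stat | grep -c '^Z'          # count of processes currently in the zombie (Z) state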

Agent Environment

Describe what happened:
Running ps -e | grep defunct | wc -l on a Kubernetes node running 7.38.0 reveals an increasing number of defunct processes on the system:

$ while true; do ps -e | grep defunct | wc -l; sleep 60; done
29
33
40
44
48
52
56
60
64
68
...

Investigating the processes reveals that the problematic processes all share parent PID 26607:

$ ps -efj | grep defunct
...
root     31959 26607 26607 26607  0 23:18 ?        00:00:00 [blkid] <defunct>
root     31987 26607 26607 26607  0 22:30 ?        00:00:00 [blkid] <defunct>
root     32093 26607 26607 26607  0 23:02 ?        00:00:00 [blkid] <defunct>
root     32210 26607 26607 26607  0 22:15 ?        00:00:00 [blkid] <defunct>
root     32325 26607 26607 26607  0 22:46 ?        00:00:00 [blkid] <defunct>
root     32384 26607 26607 26607  0 23:18 ?        00:00:00 [blkid] <defunct>
root     32438 26607 26607 26607  0 22:31 ?        00:00:00 [blkid] <defunct>
root     32605 26607 26607 26607  0 23:02 ?        00:00:00 [blkid] <defunct>
root     32715 26607 26607 26607  0 22:15 ?        00:00:00 [blkid] <defunct>

And the parent process is agent run:

$ ps -efj | grep -v defunct | grep 26607
ec2-user 23316 13238 23314 13238  0 23:29 pts/0    00:00:00 grep --color=auto 26607
root     26607 26577 26607 26607  3 22:12 ?        00:02:53 agent run

For 7.37.1, ps -e | grep defunct | wc -l always returns 0.

Describe what you expected:
The number of defunct processes should always be 0.

Steps to reproduce the issue:

  1. Install the Helm chart to the cluster (an example command is shown after this list)
  2. SSH into a node on the cluster
  3. Run while true; do ps -e | grep defunct | wc -l; sleep 60; done
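
For reference, the install in step 1 is roughly the following (the release name, namespace, and values file are placeholders; the chart version matches the environment details below):

$ helm repo add datadog https://helm.datadoghq.com
$ helm repo update
$ helm install datadog datadog/datadog --namespace monitoring --version 2.32.3 -f values.yaml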

Additional environment details (Operating System, Cloud provider, etc):

  • Amazon EKS 1.20
  • EKS nodes running 1.20.15, image built from Amazon Linux 2
    • uname -a of Linux eks-cloud-dev-us-west-2a-staging-i-04816b5f5ba502ba7 5.4.196-108.356.amzn2.x86_64 #1 SMP Thu May 26 12:49:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Datadog agent deployed via Helm (chart version datadog-2.32.3) with values:

agents:
  containers:
    agent:
      resources:
        limits:
          memory: 512Mi
        requests:
          memory: 512Mi
  image:
    repository: public.ecr.aws/datadog/agent
    tag: 7.38.0
  tolerations:
  - operator: Exists
clusterAgent:
  confd: {}
  enabled: false
datadog:
  apiKey: "**REDACTED**"
  apm:
    portEnabled: true
  appKey: "**REDACTED**"
  checksd: {}
  containerExclude: kube_namespace:monitoring
  dogstatsd:
    useHostPort: true
  env:
  - name: DD_LOGS_CONFIG_LOGS_DD_URL
    value: tcp-encrypted-intake.logs.datadoghq.com:12345
  - name: DD_APM_FILTER_TAGS_REJECT
    value: http.path_group:/health-check,http.url:/health-check
  - name: DD_APM_NON_LOCAL_TRAFFIC
    value: "true"
  - name: DD_APM_IGNORE_RESOURCES
    value: OPTIONS \(/\^\\/\$/\)|POST \(/\^\\/\$/\)|GET /\.well-known/apollo/server-health
  - name: DD_AC_EXCLUDE
    value: "true"
  envFrom:
  - configMapRef:
      name: datadog-custom-envs
  logs:
    containerCollectAll: true
    enabled: true
  securityAgent:
    runtime:
      enabled: false
  tags: vpc:vpc-01234567890

and datadog-custom-envs of:

data:
  DD_CONTAINER_EXCLUDE_LOGS: kube_namespace:.*
  DD_CONTAINER_INCLUDE_LOGS: kube_namespace:custom-metrics kube_namespace:external-secrets
    kube_namespace:kube-system

Edit: Add extra debugging information about defunct processes

L3n41c (Member) commented Aug 5, 2022

Hello @noizwaves,

Thank you very much for your detailed bug report!

We can confirm the regression you described. We understand what is happening and are actively working on a fix (#13019).
The bug was introduced in the Docker agent images 7.38.0 and 7.38.1 and affects only containerised agents running without hostPID: true.

Before a fix is released, here are some mitigations that can be used to work around the bug:

Running the agent in the host PID namespace causes the zombie processes to be adopted by the host's PID 1 (systemd), which properly reaps them.

The first workaround therefore consists of adding this to the Helm values.yaml:

datadog:
  dogstatsd:
    useHostPID: true

The option to run the agent in the host PID namespace was initially introduced for the dogstatsd origin detection feature, but it can also be used as a workaround for this issue.
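
Once the release is upgraded with this value, its effect can be verified on any agent pod, for example (the pod name is a placeholder):

$ kubectl get pod datadog-xxxxx -o jsonpath='{.spec.hostPID}'   # should print: true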

The second possible workaround consists of explicitly resetting LD_PRELOAD to the empty string by adding this to the Helm values.yaml:

datadog:
  env:
    - name: LD_PRELOAD
      value: ""    

The LD_PRELOAD library is only needed to support old Docker versions on the host, but on such old hosts the agent does not leak zombie processes; so the hosts that actually need LD_PRELOAD can keep it, while the hosts affected by this bug can safely run with it unset.
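
A quick way to confirm the override is picked up inside a running agent container (the pod name is a placeholder; the container name assumes the chart's default agent container):

$ kubectl exec datadog-xxxxx -c agent -- sh -c 'echo "LD_PRELOAD=[$LD_PRELOAD]"'
LD_PRELOAD=[]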

noizwaves (Author) commented

Many thanks for the quick turnaround on this @L3n41c!


wd commented Aug 10, 2022

It looks like we have the same issue. We deployed Datadog from the Helm chart datadog-2.37.0, which uses Datadog 7.38.1.
We can see some defunct processes:

root     2510928 90.3  1.0 4638480 1570216 ?     Ssl  05:06   1:52 agent run
root     2510967  0.0  0.0      0     0 ?        Z    05:06   0:00 [agent] <defunct>
root     2511051  0.2  0.0 2479864 60336 ?       Ssl  05:06   0:00 trace-agent -config=/etc/datadog-agent/datadog.yaml
root     2511092  0.0  0.0      0     0 ?        Z    05:06   0:00 [trace-agent] <defunct>
root     2511164  5.5  0.1 5248300 159256 ?      Ssl  05:06   0:06 process-agent --cfgpath=/etc/datadog-agent/datadog.yaml
root     2511184  0.0  0.0      0     0 ?        Z    05:06   0:00 [process-agent] <defunct>

However, I checked our values.yaml and found that we already set useHostPID to true. Am I missing something?


wd commented Aug 10, 2022

I tried the second workaround. The system-probe crashed:

$ k logs -f datadog-k8f9n -c system-probe
runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f1f4c7b4a7c m=0 sigcode=18446744073709551610
....

sgnn7 (Contributor) commented Aug 10, 2022

Hi @wd,
The linked fix was backported to the 7.38.x branch and will be released as part of version 7.38.2. While we cannot provide an exact availability date for that version, we are in the final stages of the QA/release process for it. Barring anything unexpected, you shouldn't have to wait long.

sgnn7 (Contributor) commented Aug 10, 2022

Hi @wd,
v7.38.2 (with the relevant fix) should be available now.


wd commented Aug 11, 2022

@sgnn7 thank you!
