
[BUG] agent leaves defunct processes with version 7.38.0 #12997

Closed
noizwaves opened this issue Aug 3, 2022 · 7 comments · Fixed by #13019

noizwaves commented Aug 3, 2022

After upgrading to datadog-agent version 7.38.0 (and also 7.38.1) for our Kubernetes monitoring, the agent process is producing an excessive number of defunct processes. These processes accumulate, and once the system limit (i.e. /proc/sys/kernel/pid_max) is reached, new processes cannot spawn, resulting in general system instability.

The instability presents itself as errors in other Kubernetes internal components and also workloads running on the cluster, including (but not limited to):

  • mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
  • read init-p: connection reset by peer: unknown
  • fork/exec /proc/self/exe: resource temporarily unavailable: unknown
  • failed to create new OS thread (have 4 already; errno=11) runtime: may need to increase max user processes (ulimit -u) fatal error: newosproc

For our workloads, this happened after ~4.5 days of node uptime.
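
For context, the kernel's PID ceiling and the current number of zombie processes can be checked on a node with standard tooling (illustrative commands; exact output varies by host):

$ cat /proc/sys/kernel/pid_max        # maximum number of PIDs the kernel will hand out
$ ps -eo stat | grep -c '^Z'          # count of processes currently in the zombie (Z) state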

Agent Environment

Describe what happened:
Running ps -e | grep defunct | wc -l on a Kubernetes node running 7.38.0 reveals an increasing number of defunct processes on the system:

$ while true; do ps -e | grep defunct | wc -l; sleep 60; done
29
33
40
44
48
52
56
60
64
68
...

Investigating the processes reveals that the problematic processes all share parent PID 26607:

$ ps -efj | grep defunct
...
root     31959 26607 26607 26607  0 23:18 ?        00:00:00 [blkid] <defunct>
root     31987 26607 26607 26607  0 22:30 ?        00:00:00 [blkid] <defunct>
root     32093 26607 26607 26607  0 23:02 ?        00:00:00 [blkid] <defunct>
root     32210 26607 26607 26607  0 22:15 ?        00:00:00 [blkid] <defunct>
root     32325 26607 26607 26607  0 22:46 ?        00:00:00 [blkid] <defunct>
root     32384 26607 26607 26607  0 23:18 ?        00:00:00 [blkid] <defunct>
root     32438 26607 26607 26607  0 22:31 ?        00:00:00 [blkid] <defunct>
root     32605 26607 26607 26607  0 23:02 ?        00:00:00 [blkid] <defunct>
root     32715 26607 26607 26607  0 22:15 ?        00:00:00 [blkid] <defunct>

And the parent process is agent run:

$ ps -efj | grep -v defunct | grep 26607
ec2-user 23316 13238 23314 13238  0 23:29 pts/0    00:00:00 grep --color=auto 26607
root     26607 26577 26607 26607  3 22:12 ?        00:02:53 agent run

For 7.37.1, ps -e | grep defunct | wc -l always returns 0.

Describe what you expected:
The number of defunct processes should always be 0.

Steps to reproduce the issue:

  1. Install the Helm chart to the cluster (an example command is shown after this list)
  2. SSH into a node on the cluster
  3. Run while true; do ps -e | grep defunct | wc -l; sleep 60; done
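
For reference, the install in step 1 is roughly the following (the release name, namespace, and values file are placeholders; the chart version matches the environment details below):

$ helm repo add datadog https://helm.datadoghq.com
$ helm repo update
$ helm install datadog datadog/datadog --namespace monitoring --version 2.32.3 -f values.yaml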

Additional environment details (Operating System, Cloud provider, etc):

  • Amazon EKS 1.20
  • EKS nodes running 1.20.15, image built from Amazon Linux 2
    • uname -a of Linux eks-cloud-dev-us-west-2a-staging-i-04816b5f5ba502ba7 5.4.196-108.356.amzn2.x86_64 #1 SMP Thu May 26 12:49:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Datadog agent deployed via Helm (chart version datadog-2.32.3) with values:

agents:
  containers:
    agent:
      resources:
        limits:
          memory: 512Mi
        requests:
          memory: 512Mi
  image:
    repository: public.ecr.aws/datadog/agent
    tag: 7.38.0
  tolerations:
  - operator: Exists
clusterAgent:
  confd: {}
  enabled: false
datadog:
  apiKey: "**REDACTED**"
  apm:
    portEnabled: true
  appKey: "**REDACTED**"
  checksd: {}
  containerExclude: kube_namespace:monitoring
  dogstatsd:
    useHostPort: true
  env:
  - name: DD_LOGS_CONFIG_LOGS_DD_URL
    value: tcp-encrypted-intake.logs.datadoghq.com:12345
  - name: DD_APM_FILTER_TAGS_REJECT
    value: http.path_group:/health-check,http.url:/health-check
  - name: DD_APM_NON_LOCAL_TRAFFIC
    value: "true"
  - name: DD_APM_IGNORE_RESOURCES
    value: OPTIONS \(/\^\\/\$/\)|POST \(/\^\\/\$/\)|GET /\.well-known/apollo/server-health
  - name: DD_AC_EXCLUDE
    value: "true"
  envFrom:
  - configMapRef:
      name: datadog-custom-envs
  logs:
    containerCollectAll: true
    enabled: true
  securityAgent:
    runtime:
      enabled: false
  tags: vpc:vpc-01234567890

and datadog-custom-envs of:

data:
  DD_CONTAINER_EXCLUDE_LOGS: kube_namespace:.*
  DD_CONTAINER_INCLUDE_LOGS: kube_namespace:custom-metrics kube_namespace:external-secrets
    kube_namespace:kube-system

Edit: Add extra debugging information about defunct processes

L3n41c (Member) commented Aug 5, 2022

Hello @noizwaves,

Thank you very much for your detailed bug report!

We can confirm the regression you described. We understand what is happening and are actively working on a fix (#13019).
The bug was introduced in the Docker agent images 7.38.0 and 7.38.1 and affects only containerised agents running without hostPID: true.

Before a fix is released, here are some mitigations that can be used to work around the bug:

Running the agent in the host PID namespace causes the zombie processes to be adopted by the host's PID 1 (systemd), which properly reaps them.

The first workaround therefore consists of adding this to the Helm values.yaml:

datadog:
  dogstatsd:
    useHostPID: true

The option to run the agent in the host PID namespace was initially introduced for the dogstatsd origin detection feature, but it can also be used as a workaround for this issue.
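
Once the release is upgraded with this value, its effect can be verified on any agent pod, for example (the pod name is a placeholder):

$ kubectl get pod datadog-xxxxx -o jsonpath='{.spec.hostPID}'   # should print: true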

The second possible workaround consists of explicitly resetting LD_PRELOAD to the empty string by adding this to the Helm values.yaml:

datadog:
  env:
    - name: LD_PRELOAD
      value: ""    

The LD_PRELOAD library is only needed to support old Docker versions on the host, but on such old hosts the agent does not leak zombie processes; so the hosts that actually need LD_PRELOAD can keep it, while the hosts affected by this bug can safely run with it unset.
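
A quick way to confirm the override is picked up inside a running agent container (the pod name is a placeholder; the container name assumes the chart's default agent container):

$ kubectl exec datadog-xxxxx -c agent -- sh -c 'echo "LD_PRELOAD=[$LD_PRELOAD]"'
LD_PRELOAD=[]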

noizwaves (Author) commented

Many thanks for the quick turnaround on this @L3n41c!


wd commented Aug 10, 2022

It looks like we have the same issue. We deployed Datadog from the Helm chart datadog-2.37.0, which uses Datadog 7.38.1.
We can see some defunct processes:

root     2510928 90.3  1.0 4638480 1570216 ?     Ssl  05:06   1:52 agent run
root     2510967  0.0  0.0      0     0 ?        Z    05:06   0:00 [agent] <defunct>
root     2511051  0.2  0.0 2479864 60336 ?       Ssl  05:06   0:00 trace-agent -config=/etc/datadog-agent/datadog.yaml
root     2511092  0.0  0.0      0     0 ?        Z    05:06   0:00 [trace-agent] <defunct>
root     2511164  5.5  0.1 5248300 159256 ?      Ssl  05:06   0:06 process-agent --cfgpath=/etc/datadog-agent/datadog.yaml
root     2511184  0.0  0.0      0     0 ?        Z    05:06   0:00 [process-agent] <defunct>

However, I checked our values.yaml and found that we already set useHostPID to true. Am I missing something?


wd commented Aug 10, 2022

I tried the second workaround. The system-probe crashed:

$ k logs -f datadog-k8f9n -c system-probe
runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f1f4c7b4a7c m=0 sigcode=18446744073709551610
....

sgnn7 (Contributor) commented Aug 10, 2022

Hi @wd,
The linked fix was backported to the 7.38.x branch and will be released as part of version 7.38.2. While we cannot provide an exact availability date for that version, we are in the final stages of the QA/release process for it. Barring anything unexpected, you shouldn't have to wait long.

sgnn7 (Contributor) commented Aug 10, 2022

Hi @wd,
v7.38.2 (with the relevant fix) should be available now.


wd commented Aug 11, 2022

@sgnn7 thank you!
