Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rcm): fix stuck child processes when a gevent application creates forks #5088

Closed
wants to merge 61 commits into from

Conversation

avara1986
Copy link
Member

@avara1986 avara1986 commented Feb 9, 2023

Description

Workaround (II) for remote config to work properly with gevent. We found that Remote Config create stuck/zombie child processes if an application with gevent uses os.fork.

What we do?

  • Remove RemoteConfig.worker, because worker has PeriodicService as parent and PeriodicService use a worker too.
  • Remove duplicated locks.
  • Add methods like _stop_service, stop... using the same structure as TelemetryWriter and SpanWriter
  • Remove enable() behavior. One of the previous fix (chore(iast): remote config thread workaround for gevent #4749) ensures that the Remote config worker thread is launched at the very end, not interrupting the gevent patching mechanism. With forksafe.register(cls._restart) we were calling twice to remoteconfig.enable(): one at fork time (and it raises the same error of the previous fix) and once again at register the products in the child process (this call is ok)

Caveat: this fix requires with env variable DD_GEVENT_PATCH_ALL=true but that's documented to be required on the gevent integration docs (we also added a reference to possible stuck processes if not set to ease the googleability of it).

How to reproduce the error

  • Create a flask application:
import os
import time

from flask import Flask

app = Flask(__name__)

def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            print("child pid = ", pid1)
            x -= 1
    else:
        os.waitpid(pid1, 0)

@app.route('/')
def index():
    start()
    return "OK"
  • start an agent
  • add this environment variables:
DD_GEVENT_PATCH_ALL=true
DD_REMOTE_CONFIGURATION_ENABLED=true
  • run the application with
 ddtrace-run gunicorn -w 3 -k gevent app:app
  • Add traffic. e.g:
#!/bin/bash
while :
do
    curl -s -k 'GET'  http://127.0.0.1:8000/
done
  • Check your processes, you should have 4-5 gunicorn workers, if you see more than this number, you have dead processes. The could remain even after stopping gunicorn.
ps -aux | grep gunicorn
alberto+  132031  1.4  0.1 115840 63280 pts/0    S+   16:33   0:00 /py-39/bin/gunicorn -c gunicorn_conf.py app:app
alberto+  132070  0.4  0.2 201700 65172 pts/0    Sl+  16:33   0:00 /py-39/bin/gunicorn -c gunicorn_conf.py app:app
alberto+  132071  0.3  0.2 201716 65192 pts/0    Sl+  16:33   0:00 /py-39/bin/gunicorn -c gunicorn_conf.py app:app
alberto+  132072  0.3  0.2 201712 65184 pts/0    Sl+  16:33   0:00 /py-39/bin/gunicorn -c gunicorn_conf.py app:app

Another way to check is run apachebenchmark (ab) like:

ab  -n 1000000 -c 100 http://127.0.0.1:8001/

And running the script in another console. Eventually, the Flask app will stop responding (usually with 7 processes running on my machine).

You could check the gevent infinite lock with:

gdb -p 132072

or

py-spy record --native  --rate 300 -o my-flamegraph-result.svg --pid 132072

Checklist

  • Change(s) are motivated and described in the PR description.
  • Testing strategy is described if automated tests are not included in the PR.
  • Risk is outlined (performance impact, potential for breakage, maintainability, etc).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Library release note guidelines are followed.
  • Documentation is included (in-code, generated user docs, public corp docs).
  • Author is aware of the performance implications of this PR as reported in the benchmarks PR comment.

Reviewer Checklist

  • Title is accurate.
  • No unnecessary changes are introduced.
  • Description motivates each change.
  • Avoids breaking API changes unless absolutely necessary.
  • Testing strategy adequately addresses listed risk(s).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Release note makes sense to a user of the library.
  • Reviewer is aware of, and discussed the performance implications of this PR as reported in the benchmarks PR comment.

@juanjux juanjux changed the title fix(rcm): stuck child processes when application creates forks fix(rcm): fix stuck child processes when a gevent application creates forks Feb 9, 2023
Signed-off-by: Juanjo Alvarez <juanjo.alvarezmartinez@datadoghq.com>
@avara1986 avara1986 added the ASM Application Security Monitoring label Feb 9, 2023
Signed-off-by: Juanjo Alvarez <juanjo.alvarezmartinez@datadoghq.com>
ddtrace/internal/remoteconfig/__init__.py Outdated Show resolved Hide resolved
ddtrace/internal/remoteconfig/__init__.py Outdated Show resolved Hide resolved
ddtrace/internal/remoteconfig/client.py Outdated Show resolved Hide resolved
ddtrace/internal/agent.py Outdated Show resolved Hide resolved
@pr-commenter
Copy link

pr-commenter bot commented Feb 9, 2023

Benchmarks

Comparing candidate commit f72ae72 in PR branch avara1986/APPSEC-8131-rcm-forks-error with baseline commit 4372d46 in branch 1.x.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 56 cases.

@emmettbutler
Copy link
Collaborator

emmettbutler commented Feb 14, 2023

I pulled out the regression test from this change and ran it against 1.x (you can see the diff in #5130), and I found that the test hangs in CI. That tells us that the test replicates a real issue.

@juanjux juanjux requested a review from a team as a code owner February 15, 2023 08:04
juanjux
juanjux previously approved these changes Feb 15, 2023
gnufede
gnufede previously approved these changes Feb 15, 2023
emmettbutler
emmettbutler previously approved these changes Feb 17, 2023
@emmettbutler emmettbutler enabled auto-merge (squash) February 17, 2023 21:54
@avara1986 avara1986 dismissed stale reviews from emmettbutler, gnufede, and juanjux via f73fc22 February 28, 2023 09:04
@avara1986 avara1986 closed this Mar 9, 2023
auto-merge was automatically disabled March 9, 2023 08:28

Pull request was closed

@emmettbutler emmettbutler deleted the avara1986/APPSEC-8131-rcm-forks-error branch June 20, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ASM Application Security Monitoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants