fix(internal): remove nogevent compatibility layer #5105
Conversation
module cloning makes it obsolete
these changes were moved to DataDog#5105
Left some questions; feel free to resolve them if they don't require follow-up.
Co-authored-by: Munir Abdinur <munir.abdinur@datadoghq.com>
LGTM, if we can fix the typing suggestion then I'm good to 👍
@Yun-Kim that suggestion causes the linting to fail: #5105 (comment)
fix(internal): remove nogevent compatibility layer (#5105)

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
fix(internal): remove nogevent compatibility layer (#5105) (#5275)

- [x] Backport #5105 to 1.9

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
This pull request removes the `nogevent` compatibility layer from the library and tests. It also changes the behavior of some feature flags related to gevent and adjusts a large number of tests to work with these changes. This change was manually tested on staging and shown to fix the issue we were investigating there. See notebook 4782453 in the ddstaging datadoghq account for a detailed look at the metrics collected during that test.

What Changed?

- `sitecustomize.py`: refactored the module cloning logic; changed the default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto`, meaning it runs when gevent is installed; unloaded a few additional modules, like `time` and `attrs`, that were causing tests and manual checks to fail; deprecated the `DD_GEVENT_PATCH_ALL` flag (see the sketch after this list)
- Removed various hacks and layers that were intended to fix gevent compatibility, including in `forksafe.py`, `periodic.py`, `ddtrace_gevent_check.py`, and `nogevent.py`
- `profiling/`: adjusted all uses of the removed `nogevent` module to work with `threading`
- Adjusted tests to work with these removals
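To make the new default concrete, here is a minimal, hypothetical sketch of the module-unloading idea behind `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=auto`. It is not the actual `sitecustomize.py` implementation, and the accepted values other than `auto` are assumptions made for illustration.

```
# Hypothetical sketch: unload selected modules when gevent is installed.
# Not the actual ddtrace sitecustomize.py; values other than "auto" are assumed.
import importlib.util
import os
import sys


def _gevent_installed():
    # Detect gevent without importing it.
    return importlib.util.find_spec("gevent") is not None


def maybe_unload_modules():
    setting = os.environ.get("DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE", "auto").lower()
    if setting == "auto":
        unload = _gevent_installed()
    else:
        unload = setting in ("1", "true")
    if not unload:
        return
    # Drop already-imported modules (e.g. time, attrs) so the application
    # re-imports them after gevent has monkey-patched the standard library.
    for name in list(sys.modules):
        if name in ("time", "attrs") or name.startswith("attrs."):
            del sys.modules[name]


maybe_unload_modules()
```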
We tried to separate some of these changes into other pull requests, which you can see linked in the discussion below. Because of how difficult it is to replicate the issue we're chasing outside of the staging environment, we decided to minimize the number of variables under test and combine the various changes into a single pull request. This makes it a bit harder to review, which we've tried to mitigate with the checklist above.
Risk
The main risk here is the change to the default behavior of module cloning. We've mitigated this risk with the automated test suite as well as the manual testing described in the notebook above.
Why doesn't this change put all new behavior behind a feature flag and leave the default behavior untouched?
The main reason for this decision is pragmatic: it's really hard to test for the issue this solves, with a turnaround time of about an hour to get feedback on a change. The second reason is that the `nogevent` layer is highly coupled to the rest of the library's code, and putting it behind a feature flag would be a significant and nontrivial effort. The third reason is that fully supporting every configuration and combination of other tools that gevent can be used with is a goal we could spend effectively infinite time on. Given this, we need to intentionally set a goal that solves the current and likely near-future issues as completely as possible, make it the default behavior, and call this effort "done". @brettlangdon, @P403n1x87, @Yun-Kim, and I agree that the evidence in notebook 4782453 in the ddstaging datadoghq account is enough to justify this change to the default behavior.

Performance Testing
Performance testing with a sample flask application (notebook 4442578) shows no immediately noticeable impact from tracing.
Dynamic instrumentation seems to cause slow-downs, and the reason has been tracked down to joining service threads on shutdown. Avoiding the joins cures the problem, but further testing is required to ensure that DI still behaves as intended. Note that the tracer is already implemented this way, and this is probably why we don't see it impacting process shutdown.
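As a rough illustration of why joining worker threads at shutdown can hurt, here is a minimal, self-contained sketch. This is not ddtrace's actual service code: a worker whose final flush simulates a slow agent round-trip makes `join()` block for the duration of that flush, while skipping the join lets a daemon thread be abandoned at interpreter exit.

```
# Minimal sketch, not ddtrace's actual service code: joining a worker whose
# final flush is slow blocks shutdown; skipping the join does not.
import threading
import time


class PeriodicWorker:
    def __init__(self, interval=1.0):
        self._stop = threading.Event()
        self._interval = interval
        # Daemon threads do not keep the interpreter alive at exit.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.wait(self._interval):
            pass  # periodic work would go here
        self._flush()

    def _flush(self):
        time.sleep(2.0)  # stand-in for waiting on the agent's response

    def stop(self, join):
        self._stop.set()
        if join:
            self._thread.join()  # blocks until _flush() finishes


worker = PeriodicWorker(interval=0.5)
worker.start()

t0 = time.time()
worker.stop(join=False)  # returns immediately; joining instead would add ~2s
print(f"shutdown took {time.time() - t0:.2f}s")
```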
Profiling, when enabled, also shows a slight slow-down in short-lived processes, but this is a pre-existing issue. It seems to be due to retrieving the response from the agent after the payload has been uploaded. A potential solution might be offered by asynchronous processing with libdatadog.
The following are the details of the scenario used to measure the performance under different configurations.
The application is the simple Flask app of the issue reproducer:
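```
import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)


def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)


@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```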
We can control which products to start with the following `run.sh` script:
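```
source .venv/bin/activate

export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false

export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent

ddtrace-run gunicorn -w 3 -k gevent app:app

deactivate
```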
To run the app we create a virtual environment with
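```
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```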
and then invoke the script, adjusting the exported variables as required
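```
chmod +x run.sh
./run.sh
```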
In another terminal, we can check the average response time by sending requests to the application while it is running, using the following simple k6 script:
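```
import http from 'k6/http';

export default function () {
  http.get('http://localhost:8000');
}
```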
We invoke k6 with
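```
k6 run -d 30s script.js
```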
and look for this line in the output
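```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```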
Checklist
Reviewer Checklist