Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(internal): remove nogevent compatibility layer #5105

Merged
merged 99 commits into from
Mar 3, 2023

Conversation

emmettbutler
Copy link
Collaborator

@emmettbutler emmettbutler commented Feb 10, 2023

This pull request removes the nogevent compatibility layer from the library and tests. It also changes the behavior of some feature flags related to gevent and adjusts a whole lot of tests to work with these changes.

This change was manually tested on staging and shown to fix the issue we were investigating there. See noteboook 4782453 in the ddstaging datadoghq account for a detailed look at the metrics collected during that test.

What Changed?

  • sitecustomize.py: refactored module cloning logic; changed the default of DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE to auto meaning it runs when gevent is installed; unloaded a few additional modules like time and attrs that were causing tests and manual checks to fail; deprecated DD_GEVENT_PATCH_ALL flag
  • Removed various hacks and layers that had intended to fix gevent compatibility, including in forksafe.py, periodic.py, ddtrace_gevent_check.py, and nogevent.py
  • profiling/: adjusted all uses of the removed module nogevent to work with threading
  • Adjusted tests to work with these removals

We tried to separate some of these changes into other pull requests, which you can see linked in the discussion below. Because of how difficult it is to replicate the issue we're chasing outside of the staging environment, we decided to minimize the number of variables under test and combine the various changes into a single pull request. This makes it a bit harder to review, which we've tried to mitigate with the checklist above.

Risk

The main risk in this change is the change to the default behavior of module cloning. We've mitigated this risk with the automated test suite as well as manual testing described in the notebook above.

Why doesn't this change put all new behavior behind a feature flag and leave the default behavior untouched?

The main reason for this decision is pragmatic: it's really hard to test for the issue this solves, requiring a turnaround time of about an hour to get feedback from changes. The secondary reason is that the nogevent layer is highly coupled to the rest of the library's code, and putting it behind a feature flag is a significant and nontrivial effort. The third reason is that full support of all of the various configurations and combinations with other tools that gevent can be used in is a goal that we could probably spend infinite time on if we chose to. Given this, we need to intentionally set a goal that solves the current and likely near-future issues as completely as possible, make it the default behavior, and call this effort "done". @brettlangdon @P403n1x87 @Yun-Kim and I are in agreement that the evidence in noteboook 4782453 in the ddstaging datadoghq account is enough to justify this change to the default behavior.

Performance Testing

Performance testing with a sample flask application (notebook 4442578) shows no immediately noticeable impact from tracing.

Dynamic instrumentation seems to cause slow-downs, and the reason has been tracked down to joining service threads on shutdown. Avoiding the joins cures the problem, but further testing is required to ensure that DI still behaves as intended. Note that the tracer is already implemented this way, and this is probably why we don't see it impacting process shutdown.

Profiling also shows a slight slow-down in short-lived processes, when enabled, but this is an already existing issue. This seems to be due to retrieving the response from the agent after the payload has been uploaded. A potential solution to this might be offered by asynchronous processing with libdatadog.

The following are the details of the scenario used to measure the performance under different configurations.
The application is the simple Flask app of the issue reproducer:

# app.py

import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)


def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)


@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"

We can control what products to start with the following run.sh script

#!/bin/bash

source .venv/bin/activate

export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false

export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent

ddtrace-run gunicorn -w 3 -k gevent app:app

deactivate

To run the app we create a virtual environment with

python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate

and then invoke the script, adjusting the exported variables as required

chmod +x run.sh
./run.sh

In another terminal we can check the average response time by sending requests to the application while running. With the following simple k6 script

import http from 'k6/http';

export default function () {
  http.get('http://localhost:8000');
}

We invoke k6 with

k6 run -d 30s script.js

and look for this line in the output

http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)

Checklist

  • Change(s) are motivated and described in the PR description.
  • Testing strategy is described if automated tests are not included in the PR.
  • Risk is outlined (performance impact, potential for breakage, maintainability, etc).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Library release note guidelines are followed.
  • Documentation is included (in-code, generated user docs, public corp docs).
  • Author is aware of the performance implications of this PR as reported in the benchmarks PR comment.

Reviewer Checklist

  • Title is accurate.
  • No unnecessary changes are introduced.
  • Description motivates each change.
  • Avoids breaking API changes unless absolutely necessary.
  • Testing strategy adequately addresses listed risk(s).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Release note makes sense to a user of the library.
  • Reviewer is aware of, and discussed the performance implications of this PR as reported in the benchmarks PR comment.

@emmettbutler emmettbutler requested review from a team as code owners February 10, 2023 15:52
@emmettbutler emmettbutler changed the title Emmett.butler/cleanups before removing nogevent 3 chore(internal): remove nogevent compatibility layer Feb 10, 2023
@emmettbutler emmettbutler marked this pull request as draft February 10, 2023 15:54
emmettbutler added a commit to P403n1x87/dd-trace-py that referenced this pull request Feb 10, 2023
these changes were moved to DataDog#5105
emmettbutler added a commit to P403n1x87/dd-trace-py that referenced this pull request Feb 10, 2023
@pr-commenter
Copy link

pr-commenter bot commented Feb 10, 2023

Benchmarks

Comparing candidate commit e345d0f in PR branch emmett.butler/cleanups-before-removing-nogevent-3 with baseline commit de60d7e in branch 1.x.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 56 cases.

emmettbutler added a commit to P403n1x87/dd-trace-py that referenced this pull request Feb 10, 2023
@P403n1x87 P403n1x87 added the changelog/no-changelog A changelog entry is not required for this PR. label Feb 15, 2023
@P403n1x87 P403n1x87 requested a review from brettlangdon March 2, 2023 15:41
Copy link
Contributor

@mabdinur mabdinur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions, feel free to resolve if the questions do not require follow up.

ddtrace/bootstrap/sitecustomize.py Show resolved Hide resolved
ddtrace/bootstrap/sitecustomize.py Show resolved Hide resolved
ddtrace/bootstrap/sitecustomize.py Show resolved Hide resolved
ddtrace/contrib/gunicorn/__init__.py Show resolved Hide resolved
ddtrace/profiling/_threading.pyx Show resolved Hide resolved
ddtrace/profiling/_threading.pyx Show resolved Hide resolved
ddtrace/profiling/collector/_lock.py Show resolved Hide resolved
ddtrace/profiling/collector/stack.pyx Outdated Show resolved Hide resolved
releasenotes/notes/nogevent-6f2892cb412f987f.yaml Outdated Show resolved Hide resolved
releasenotes/notes/nogevent-6f2892cb412f987f.yaml Outdated Show resolved Hide resolved
Co-authored-by: Munir Abdinur <munir.abdinur@datadoghq.com>
@emmettbutler emmettbutler requested a review from P403n1x87 March 2, 2023 17:59
Copy link
Contributor

@Yun-Kim Yun-Kim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, if we can fix the typing suggestion then I'm good to 👍

@emmettbutler
Copy link
Collaborator Author

@Yun-Kim that suggestion causes the linting to fail #5105 (comment)
maybe a fix for however typing is broken in that file is out of scope?

@emmettbutler emmettbutler requested a review from Yun-Kim March 2, 2023 21:22
@emmettbutler emmettbutler merged commit a665ffe into 1.x Mar 3, 2023
@emmettbutler emmettbutler deleted the emmett.butler/cleanups-before-removing-nogevent-3 branch March 3, 2023 10:08
@github-actions github-actions bot added this to the v1.11.0 milestone Mar 3, 2023
emmettbutler added a commit that referenced this pull request Mar 6, 2023
This pull request removes the `nogevent` compatibility layer from the
library and tests. It also changes the behavior of some feature flags
related to gevent and adjusts a whole lot of tests to work with these
changes.

This change was manually tested on delancie-dummy staging and shown to
fix the issue we were investigating there. See noteboook 4782453 in the
ddstaging datadoghq account for a detailed look at the metrics collected
during that test.

* `sitecustomize.py`: refactored module cloning logic; changed the
default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto` meaning it
runs when gevent is installed; unloaded a few additional modules like
`time` and `attrs` that were causing tests and manual checks to fail;
deprecated `DD_GEVENT_PATCH_ALL` flag
* Removed various hacks and layers that had intended to fix gevent
compatibility, including in `forksafe.py`, `periodic.py`,
`ddtrace_gevent_check.py`, and `nogevent.py`
* `profiling/`: adjusted all uses of the removed module `nogevent` to
work with `threading`
* Adjusted tests to work with these removals

We tried to separate some of these changes into other pull requests,
which you can see linked in the discussion below. Because of how
difficult it is to replicate the issue we're chasing outside of the
staging environment, we decided to minimize the number of variables
under test and combine the various changes into a single pull request.
This makes it a bit harder to review, which we've tried to mitigate with
the checklist above.

The main risk in this change is the change to the default behavior of
module cloning. We've mitigated this risk with the automated test suite
as well as manual testing described in the notebook above.

**Why doesn't this change put all new behavior behind a feature flag and
leave the default behavior untouched?**

The main reason for this decision is pragmatic: it's really hard to test
for the issue this solves, requiring a turnaround time of about an hour
to get feedback from changes. The secondary reason is that the
`nogevent` layer is highly coupled to the rest of the library's code,
and putting it behind a feature flag is a significant and nontrivial
effort. The third reason is that full support of all of the various
configurations and combinations with other tools that gevent can be used
in is a goal that we could probably spend infinite time on if we chose
to. Given this, we need to intentionally set a goal that solves the
current and likely near-future issues as completely as possible, make it
the default behavior, and call this effort "done". @brettlangdon
@P403n1x87 @Yun-Kim and I are in agreement that the evidence in
noteboook 4782453 in the ddstaging datadoghq account is enough to
justify this change to the default behavior.

Performance testing with a sample flask application (notebook 4442578)
shows no immediately noticeable impact of tracing.

Dynamic instrumentation seems to cause slow-downs, and the reason has
been tracked down to joining service threads on shutdown. Avoiding the
joins cures the problem, but further testing is required to ensure that
DI still behaves as intended.

Profiling also shows a slow-down in the application response when
enabled. This seems to be due to retrieving the response from the agent
after the payload has been uploaded. A potential solution to this might
be offered by libdatadog.

The following are the details of the scenario used to measure the
performance under different configurations.
The application is the simple Flask app of the issue reproducer:

```

import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)

def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)

@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```

We can control what products to start with the following run.sh script

```

source .venv/bin/activate

export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false

export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent

ddtrace-run gunicorn -w 3 -k gevent app:app

deactivate
```

To run the app we create a virtual environment with

```
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```

and then invoke the script, adjusting the exported variables as required

```
chmod +x run.sh
./run.sh
```

In another terminal we can check the average response time by sending
requests to the application while running. With the following simple k6
script

```
import http from 'k6/http';

export default function () {
  http.get('http://localhost:8000');
}
```

We invoke k6 with

```
k6 run -d 30s script.js
```

and look for this line in the output

```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
emmettbutler added a commit that referenced this pull request Mar 6, 2023
…5105) (#5275)

- [x] Backport #5105 to 1.9

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants