Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

Merged
merged 1 commit into from
Mar 6, 2023

Conversation

emmettbutler
Copy link
Collaborator

@emmettbutler emmettbutler commented Mar 6, 2023

This pull request removes the `nogevent` compatibility layer from the
library and tests. It also changes the behavior of some feature flags
related to gevent and adjusts a whole lot of tests to work with these
changes.

This change was manually tested on delancie-dummy staging and shown to
fix the issue we were investigating there. See noteboook 4782453 in the
ddstaging datadoghq account for a detailed look at the metrics collected
during that test.

* `sitecustomize.py`: refactored module cloning logic; changed the
default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto` meaning it
runs when gevent is installed; unloaded a few additional modules like
`time` and `attrs` that were causing tests and manual checks to fail;
deprecated `DD_GEVENT_PATCH_ALL` flag
* Removed various hacks and layers that had intended to fix gevent
compatibility, including in `forksafe.py`, `periodic.py`,
`ddtrace_gevent_check.py`, and `nogevent.py`
* `profiling/`: adjusted all uses of the removed module `nogevent` to
work with `threading`
* Adjusted tests to work with these removals

We tried to separate some of these changes into other pull requests,
which you can see linked in the discussion below. Because of how
difficult it is to replicate the issue we're chasing outside of the
staging environment, we decided to minimize the number of variables
under test and combine the various changes into a single pull request.
This makes it a bit harder to review, which we've tried to mitigate with
the checklist above.

The main risk in this change is the change to the default behavior of
module cloning. We've mitigated this risk with the automated test suite
as well as manual testing described in the notebook above.

**Why doesn't this change put all new behavior behind a feature flag and
leave the default behavior untouched?**

The main reason for this decision is pragmatic: it's really hard to test
for the issue this solves, requiring a turnaround time of about an hour
to get feedback from changes. The secondary reason is that the
`nogevent` layer is highly coupled to the rest of the library's code,
and putting it behind a feature flag is a significant and nontrivial
effort. The third reason is that full support of all of the various
configurations and combinations with other tools that gevent can be used
in is a goal that we could probably spend infinite time on if we chose
to. Given this, we need to intentionally set a goal that solves the
current and likely near-future issues as completely as possible, make it
the default behavior, and call this effort "done". @brettlangdon
@P403n1x87 @Yun-Kim and I are in agreement that the evidence in
noteboook 4782453 in the ddstaging datadoghq account is enough to
justify this change to the default behavior.

Performance testing with a sample flask application (notebook 4442578)
shows no immediately noticeable impact of tracing.

Dynamic instrumentation seems to cause slow-downs, and the reason has
been tracked down to joining service threads on shutdown. Avoiding the
joins cures the problem, but further testing is required to ensure that
DI still behaves as intended.

Profiling also shows a slow-down in the application response when
enabled. This seems to be due to retrieving the response from the agent
after the payload has been uploaded. A potential solution to this might
be offered by libdatadog.

The following are the details of the scenario used to measure the
performance under different configurations.
The application is the simple Flask app of the issue reproducer:

```

import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)

def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)

@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```

We can control what products to start with the following run.sh script

```

source .venv/bin/activate

export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false

export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent

ddtrace-run gunicorn -w 3 -k gevent app:app

deactivate
```

To run the app we create a virtual environment with

```
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```

and then invoke the script, adjusting the exported variables as required

```
chmod +x run.sh
./run.sh
```

In another terminal we can check the average response time by sending
requests to the application while running. With the following simple k6
script

```
import http from 'k6/http';

export default function () {
  http.get('http://localhost:8000');
}
```

We invoke k6 with

```
k6 run -d 30s script.js
```

and look for this line in the output

```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
@emmettbutler emmettbutler requested review from a team as code owners March 6, 2023 17:22
@emmettbutler emmettbutler changed the title fix(internal): remove nogevent compatibility layer (#5105) fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) Mar 6, 2023
@codecov-commenter
Copy link

Codecov Report

Merging #5275 (9426fa4) into 1.9 (90c89a5) will increase coverage by 0.25%.
The diff coverage is 27.93%.

@@            Coverage Diff             @@
##              1.9    #5275      +/-   ##
==========================================
+ Coverage   76.51%   76.76%   +0.25%     
==========================================
  Files         863      862       -1     
  Lines       66398    66228     -170     
==========================================
+ Hits        50804    50840      +36     
+ Misses      15594    15388     -206     
Impacted Files Coverage Δ
ddtrace/auto.py 0.00% <0.00%> (ø)
ddtrace/bootstrap/sitecustomize.py 0.00% <0.00%> (ø)
ddtrace/contrib/aws_lambda/patch.py 0.00% <0.00%> (ø)
ddtrace/contrib/gevent/__init__.py 100.00% <ø> (ø)
ddtrace/contrib/gunicorn/__init__.py 0.00% <ø> (ø)
ddtrace/internal/forksafe.py 32.75% <ø> (+5.22%) ⬆️
ddtrace/profiling/collector/_lock.py 0.00% <0.00%> (ø)
ddtrace/profiling/recorder.py 0.00% <0.00%> (ø)
tests/commands/test_runner.py 0.00% <0.00%> (ø)
tests/contrib/aws_lambda/handlers.py 0.00% <0.00%> (ø)
... and 30 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@emmettbutler emmettbutler merged commit b2830b3 into 1.9 Mar 6, 2023
@emmettbutler emmettbutler deleted the emmett.butler/backport-5105 branch March 6, 2023 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants