fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

emmettbutler · 2023-03-06T17:22:16Z

Backport fix(internal): remove nogevent compatibility layer #5105 to 1.9

@brettlangdon

This pull request removes the `nogevent` compatibility layer from the library and tests. It also changes the behavior of some feature flags related to gevent and adjusts a whole lot of tests to work with these changes. This change was manually tested on delancie-dummy staging and shown to fix the issue we were investigating there. See noteboook 4782453 in the ddstaging datadoghq account for a detailed look at the metrics collected during that test. * `sitecustomize.py`: refactored module cloning logic; changed the default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto` meaning it runs when gevent is installed; unloaded a few additional modules like `time` and `attrs` that were causing tests and manual checks to fail; deprecated `DD_GEVENT_PATCH_ALL` flag * Removed various hacks and layers that had intended to fix gevent compatibility, including in `forksafe.py`, `periodic.py`, `ddtrace_gevent_check.py`, and `nogevent.py` * `profiling/`: adjusted all uses of the removed module `nogevent` to work with `threading` * Adjusted tests to work with these removals We tried to separate some of these changes into other pull requests, which you can see linked in the discussion below. Because of how difficult it is to replicate the issue we're chasing outside of the staging environment, we decided to minimize the number of variables under test and combine the various changes into a single pull request. This makes it a bit harder to review, which we've tried to mitigate with the checklist above. The main risk in this change is the change to the default behavior of module cloning. We've mitigated this risk with the automated test suite as well as manual testing described in the notebook above. **Why doesn't this change put all new behavior behind a feature flag and leave the default behavior untouched?** The main reason for this decision is pragmatic: it's really hard to test for the issue this solves, requiring a turnaround time of about an hour to get feedback from changes. The secondary reason is that the `nogevent` layer is highly coupled to the rest of the library's code, and putting it behind a feature flag is a significant and nontrivial effort. The third reason is that full support of all of the various configurations and combinations with other tools that gevent can be used in is a goal that we could probably spend infinite time on if we chose to. Given this, we need to intentionally set a goal that solves the current and likely near-future issues as completely as possible, make it the default behavior, and call this effort "done". @brettlangdon @P403n1x87 @Yun-Kim and I are in agreement that the evidence in noteboook 4782453 in the ddstaging datadoghq account is enough to justify this change to the default behavior. Performance testing with a sample flask application (notebook 4442578) shows no immediately noticeable impact of tracing. Dynamic instrumentation seems to cause slow-downs, and the reason has been tracked down to joining service threads on shutdown. Avoiding the joins cures the problem, but further testing is required to ensure that DI still behaves as intended. Profiling also shows a slow-down in the application response when enabled. This seems to be due to retrieving the response from the agent after the payload has been uploaded. A potential solution to this might be offered by libdatadog. The following are the details of the scenario used to measure the performance under different configurations. The application is the simple Flask app of the issue reproducer: ``` import os import time from ddtrace.internal.remoteconfig import RemoteConfig from flask import Flask app = Flask(__name__) def start(): pid1 = os.fork() if pid1 == 0: os.setsid() x = 2 while x > 0: time.sleep(0.2) x -= 1 else: os.waitpid(pid1, 0) @app.route("/") def index(): start() return "OK" if RemoteConfig._worker is not None else "NOK" ``` We can control what products to start with the following run.sh script ``` source .venv/bin/activate export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true export DD_PROFILING_ENABLED=false export DD_TRACE_ENABLED=false export DD_ENV=gab-testing export DD_SERVICE=flask-gevent ddtrace-run gunicorn -w 3 -k gevent app:app deactivate ``` To run the app we create a virtual environment with ``` python3.9 -m venv .venv source .venv/bin/activate pip install flask gevent gunicorn pip install -e path/to/dd-trace-py deactivate ``` and then invoke the script, adjusting the exported variables as required ``` chmod +x run.sh ./run.sh ``` In another terminal we can check the average response time by sending requests to the application while running. With the following simple k6 script ``` import http from 'k6/http'; export default function () { http.get('http://localhost:8000'); } ``` We invoke k6 with ``` k6 run -d 30s script.js ``` and look for this line in the output ``` http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90) ``` Co-authored-by: Yun Kim <yun.kim@datadoghq.com> Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com> Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com> Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com> Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com> Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>

codecov-commenter · 2023-03-06T17:44:56Z

Codecov Report

Merging #5275 (9426fa4) into 1.9 (90c89a5) will increase coverage by 0.25%.
The diff coverage is 27.93%.

@@            Coverage Diff             @@
##              1.9    #5275      +/-   ##
==========================================
+ Coverage   76.51%   76.76%   +0.25%     
==========================================
  Files         863      862       -1     
  Lines       66398    66228     -170     
==========================================
+ Hits        50804    50840      +36     
+ Misses      15594    15388     -206

Impacted Files	Coverage Δ
ddtrace/auto.py	`0.00% <0.00%> (ø)`
ddtrace/bootstrap/sitecustomize.py	`0.00% <0.00%> (ø)`
ddtrace/contrib/aws_lambda/patch.py	`0.00% <0.00%> (ø)`
ddtrace/contrib/gevent/__init__.py	`100.00% <ø> (ø)`
ddtrace/contrib/gunicorn/__init__.py	`0.00% <ø> (ø)`
ddtrace/internal/forksafe.py	`32.75% <ø> (+5.22%)`	⬆️
ddtrace/profiling/collector/_lock.py	`0.00% <0.00%> (ø)`
ddtrace/profiling/recorder.py	`0.00% <0.00%> (ø)`
tests/commands/test_runner.py	`0.00% <0.00%> (ø)`
tests/contrib/aws_lambda/handlers.py	`0.00% <0.00%> (ø)`
... and 30 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

emmettbutler requested review from a team as code owners March 6, 2023 17:22

emmettbutler requested review from majorgreys and ZStriker19 March 6, 2023 17:22

emmettbutler changed the title ~~fix(internal): remove nogevent compatibility layer (#5105)~~ fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) Mar 6, 2023

majorgreys approved these changes Mar 6, 2023

View reviewed changes

ZStriker19 approved these changes Mar 6, 2023

View reviewed changes

emmettbutler merged commit b2830b3 into 1.9 Mar 6, 2023

emmettbutler deleted the emmett.butler/backport-5105 branch March 6, 2023 18:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

emmettbutler commented Mar 6, 2023 •

edited

Loading

codecov-commenter commented Mar 6, 2023

fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275

Conversation

emmettbutler commented Mar 6, 2023 • edited Loading

codecov-commenter commented Mar 6, 2023

Codecov Report

emmettbutler commented Mar 6, 2023 •

edited

Loading