fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105) #5275
Merged
Conversation
Collaborator
emmettbutler commented Mar 6, 2023 (edited)
- Backport fix(internal): remove nogevent compatibility layer #5105 to 1.9
This pull request removes the `nogevent` compatibility layer from the library and tests. It also changes the behavior of some feature flags related to gevent and adjusts many tests to work with these changes. This change was manually tested on delancie-dummy staging and shown to fix the issue we were investigating there. See notebook 4782453 in the ddstaging datadoghq account for a detailed look at the metrics collected during that test.

* `sitecustomize.py`: refactored the module cloning logic; changed the default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto`, meaning it runs when gevent is installed; unloaded a few additional modules, such as `time` and `attrs`, that were causing tests and manual checks to fail; deprecated the `DD_GEVENT_PATCH_ALL` flag
* Removed various hacks and layers that were intended to fix gevent compatibility, including in `forksafe.py`, `periodic.py`, `ddtrace_gevent_check.py`, and `nogevent.py`
* `profiling/`: adjusted all uses of the removed module `nogevent` to work with `threading`
* Adjusted tests to work with these removals

We tried to separate some of these changes into other pull requests, which you can see linked in the discussion below. Because of how difficult it is to replicate the issue we're chasing outside of the staging environment, we decided to minimize the number of variables under test and combine the various changes into a single pull request. This makes it a bit harder to review, which we've tried to mitigate with the checklist above.

The main risk in this change is the change to the default module cloning behavior. We've mitigated this risk with the automated test suite as well as the manual testing described in the notebook above.

**Why doesn't this change put all new behavior behind a feature flag and leave the default behavior untouched?**

The main reason for this decision is pragmatic: it is very hard to test for the issue this solves, requiring a turnaround time of about an hour to get feedback on changes.
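The `auto` default described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: the function names `should_unload_modules` and `unload` and the detection heuristic are assumptions for the sketch.

```python
import importlib.util
import os
import sys


def gevent_installed():
    # find_spec reports whether a module could be imported,
    # without actually importing it.
    return importlib.util.find_spec("gevent") is not None


def should_unload_modules():
    # "auto" (the new default) enables module unloading only
    # when gevent is present in the environment.
    setting = os.environ.get("DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE", "auto").lower()
    if setting == "auto":
        return gevent_installed()
    return setting in ("1", "true")


def unload(module_name):
    # Dropping a module from sys.modules forces the application
    # to re-import a fresh copy on its next import statement.
    if module_name in sys.modules:
        del sys.modules[module_name]
```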
The secondary reason is that the `nogevent` layer is tightly coupled to the rest of the library's code, and putting it behind a feature flag would be a significant and nontrivial effort. The third reason is that full support for every configuration and combination of other tools that gevent can be used with is a goal we could spend unlimited time on if we chose to. Given this, we need to intentionally set a goal that solves the current and likely near-future issues as completely as possible, make it the default behavior, and call this effort "done". @brettlangdon, @P403n1x87, @Yun-Kim, and I agree that the evidence in notebook 4782453 in the ddstaging datadoghq account is enough to justify this change to the default behavior.

Performance testing with a sample Flask application (notebook 4442578) shows no immediately noticeable impact from tracing. Dynamic instrumentation seems to cause slow-downs, and the cause has been tracked down to joining service threads on shutdown. Avoiding the joins cures the problem, but further testing is required to ensure that DI still behaves as intended. Profiling also shows a slow-down in the application response when enabled. This seems to be due to retrieving the response from the agent after the payload has been uploaded. A potential solution might be offered by libdatadog.

The following are the details of the scenario used to measure performance under the different configurations.
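The join-on-shutdown behavior blamed for the DI slow-down can be illustrated with a small sketch. The `ServiceThread` class and its `shutdown` signature below are hypothetical, not dd-trace-py code; the point is that joining blocks the caller until the worker exits, while skipping or bounding the join trades that latency for the thread possibly outliving shutdown.

```python
import threading


class ServiceThread(threading.Thread):
    """A periodic background worker that can be stopped cooperatively."""

    def __init__(self, interval=0.05):
        super().__init__(daemon=True)
        self._stop_event = threading.Event()
        self._interval = interval

    def run(self):
        # wait() returns True as soon as the stop event is set,
        # so the loop exits promptly on shutdown.
        while not self._stop_event.wait(self._interval):
            pass  # periodic work would happen here

    def shutdown(self, join=True, timeout=1.0):
        # Joining here blocks the caller until the thread finishes.
        # Passing join=False (or a small timeout) avoids the stall.
        self._stop_event.set()
        if join:
            self.join(timeout)
```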
The application is the simple Flask app from the issue reproducer:

```python
import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)


def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)


@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```

We can control which products to start with the following `run.sh` script:

```sh
source .venv/bin/activate
export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false
export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent
ddtrace-run gunicorn -w 3 -k gevent app:app
deactivate
```

To run the app we create a virtual environment with

```sh
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```

and then invoke the script, adjusting the exported variables as required:

```sh
chmod +x run.sh
./run.sh
```

In another terminal we can check the average response time by sending requests to the application while it is running, with the following simple k6 script:

```js
import http from 'k6/http';

export default function () {
    http.get('http://localhost:8000');
}
```

We invoke k6 with

```sh
k6 run -d 30s script.js
```

and look for this line in the output:

```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
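When comparing configurations, the `avg` field of the `http_req_duration` summary line is the number of interest. A small sketch of extracting it programmatically (the helper `parse_avg_ms` is ours, not part of k6):

```python
import re


def parse_avg_ms(line):
    """Extract the average request duration in ms from a k6 summary line."""
    match = re.search(r"avg=([\d.]+)ms", line)
    if match is None:
        raise ValueError("no avg=...ms field found in line")
    return float(match.group(1))


summary = (
    "http_req_duration..............: "
    "avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)"
)
print(parse_avg_ms(summary))  # 335.68
```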
emmettbutler
changed the title
fix(internal): remove nogevent compatibility layer (#5105)
fix(internal): [backport to 1.9] remove nogevent compatibility layer (#5105)
Mar 6, 2023
majorgreys
approved these changes
Mar 6, 2023
Codecov Report
```diff
@@           Coverage Diff            @@
##              1.9    #5275      +/-  ##
==========================================
+ Coverage   76.51%   76.76%   +0.25%
==========================================
  Files         863      862       -1
  Lines       66398    66228     -170
==========================================
+ Hits        50804    50840      +36
+ Misses      15594    15388     -206
```
ZStriker19
approved these changes
Mar 6, 2023