fix(internal): remove nogevent compatibility layer #5105
Conversation
module cloning makes it obsolete
these changes were moved to DataDog#5105
Left some questions; feel free to resolve them if they don't require follow-up.
Co-authored-by: Munir Abdinur <munir.abdinur@datadoghq.com>
LGTM, if we can fix the typing suggestion then I'm good to 👍
@Yun-Kim that suggestion causes the linting to fail: #5105 (comment)
fix(internal): remove nogevent compatibility layer (#5105)

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
fix(internal): remove nogevent compatibility layer (#5105) (#5275)

- [x] Backport #5105 to 1.9

Co-authored-by: Yun Kim <yun.kim@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <P403n1x87@users.noreply.github.com>
Co-authored-by: Juanjo Alvarez Martinez <juanjo.alvarezmartinez@datadoghq.com>
Co-authored-by: Gabriele N. Tornetta <gabriele.tornetta@datadoghq.com>
Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
This pull request removes the `nogevent` compatibility layer from the library and tests. It also changes the behavior of some feature flags related to gevent and adjusts a large number of tests to work with these changes. This change was manually tested on staging and shown to fix the issue we were investigating there. See notebook 4782453 in the ddstaging datadoghq account for a detailed look at the metrics collected during that test.

What Changed?

- `sitecustomize.py`: refactored the module cloning logic; changed the default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto`, meaning it runs when gevent is installed; unloaded a few additional modules, like `time` and `attrs`, that were causing tests and manual checks to fail; deprecated the `DD_GEVENT_PATCH_ALL` flag (see the sketch after this list)
- Removed various hacks and layers that were intended to fix gevent compatibility, including in `forksafe.py`, `periodic.py`, `ddtrace_gevent_check.py`, and `nogevent.py`
- `profiling/`: adjusted all uses of the removed `nogevent` module to work with `threading`
- Adjusted tests to work with these removals
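To make the new default concrete, here is a minimal, hypothetical sketch of the module-unloading idea behind `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=auto`. It is not the actual `sitecustomize.py` implementation, and the accepted values other than `auto` are assumptions made for illustration.

```
# Hypothetical sketch: unload selected modules when gevent is installed.
# Not the actual ddtrace sitecustomize.py; values other than "auto" are assumed.
import importlib.util
import os
import sys


def _gevent_installed():
    # Detect gevent without importing it.
    return importlib.util.find_spec("gevent") is not None


def maybe_unload_modules():
    setting = os.environ.get("DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE", "auto").lower()
    if setting == "auto":
        unload = _gevent_installed()
    else:
        unload = setting in ("1", "true")
    if not unload:
        return
    # Drop already-imported modules (e.g. time, attrs) so the application
    # re-imports them after gevent has monkey-patched the standard library.
    for name in list(sys.modules):
        if name in ("time", "attrs") or name.startswith("attrs."):
            del sys.modules[name]


maybe_unload_modules()
```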
We tried to separate some of these changes into other pull requests, which you can see linked in the discussion below. Because of how difficult it is to replicate the issue we're chasing outside of the staging environment, we decided to minimize the number of variables under test and combine the various changes into a single pull request. This makes it a bit harder to review, which we've tried to mitigate with the checklist above.
Risk
The main risk here is the change to the default behavior of module cloning. We've mitigated this risk with the automated test suite as well as the manual testing described in the notebook above.
Why doesn't this change put all new behavior behind a feature flag and leave the default behavior untouched?
The main reason for this decision is pragmatic: it's really hard to test for the issue this solves, with a turnaround time of about an hour to get feedback on a change. The second reason is that the `nogevent` layer is highly coupled to the rest of the library's code, and putting it behind a feature flag would be a significant and nontrivial effort. The third reason is that fully supporting every configuration and combination of other tools that gevent can be used with is a goal we could spend effectively infinite time on. Given this, we need to intentionally set a goal that solves the current and likely near-future issues as completely as possible, make it the default behavior, and call this effort "done". @brettlangdon, @P403n1x87, @Yun-Kim, and I agree that the evidence in notebook 4782453 in the ddstaging datadoghq account is enough to justify this change to the default behavior.

Performance Testing
Performance testing with a sample flask application (notebook 4442578) shows no immediately noticeable impact from tracing.
Dynamic instrumentation seems to cause slow-downs, and the reason has been tracked down to joining service threads on shutdown. Avoiding the joins cures the problem, but further testing is required to ensure that DI still behaves as intended. Note that the tracer is already implemented this way, and this is probably why we don't see it impacting process shutdown.
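As a rough illustration of why joining worker threads at shutdown can hurt, here is a minimal, self-contained sketch. This is not ddtrace's actual service code: a worker whose final flush simulates a slow agent round-trip makes `join()` block for the duration of that flush, while skipping the join lets a daemon thread be abandoned at interpreter exit.

```
# Minimal sketch, not ddtrace's actual service code: joining a worker whose
# final flush is slow blocks shutdown; skipping the join does not.
import threading
import time


class PeriodicWorker:
    def __init__(self, interval=1.0):
        self._stop = threading.Event()
        self._interval = interval
        # Daemon threads do not keep the interpreter alive at exit.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.wait(self._interval):
            pass  # periodic work would go here
        self._flush()

    def _flush(self):
        time.sleep(2.0)  # stand-in for waiting on the agent's response

    def stop(self, join):
        self._stop.set()
        if join:
            self._thread.join()  # blocks until _flush() finishes


worker = PeriodicWorker(interval=0.5)
worker.start()

t0 = time.time()
worker.stop(join=False)  # returns immediately; joining instead would add ~2s
print(f"shutdown took {time.time() - t0:.2f}s")
```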
Profiling, when enabled, also shows a slight slow-down in short-lived processes, but this is a pre-existing issue. It seems to be due to retrieving the response from the agent after the payload has been uploaded. A potential solution might be offered by asynchronous processing with libdatadog.
The following are the details of the scenario used to measure the performance under different configurations.
The application is the simple Flask app of the issue reproducer:
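```
import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)


def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)


@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```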
We can control which products to start with the following `run.sh` script:
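```
source .venv/bin/activate

export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false

export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent

ddtrace-run gunicorn -w 3 -k gevent app:app

deactivate
```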
To run the app we create a virtual environment with
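```
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```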
and then invoke the script, adjusting the exported variables as required
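```
chmod +x run.sh
./run.sh
```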
In another terminal, we can check the average response time by sending requests to the application while it is running, using the following simple k6 script:
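```
import http from 'k6/http';

export default function () {
  http.get('http://localhost:8000');
}
```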
We invoke k6 with
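```
k6 run -d 30s script.js
```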
and look for this line in the output
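```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```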
Checklist
Reviewer Checklist