-
Notifications
You must be signed in to change notification settings - Fork 467
feat(profiling): remove stack v1 #15185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 4.0-breaking-changes
Are you sure you want to change the base?
Conversation
|
|
Bootstrap import analysisComparison of import times between this PR and base. SummaryThe average import time from this PR is: 231 ± 2 ms. The average import time from base is: 233 ± 1 ms. The import time difference between this PR and base is: -1.89 ± 0.06 ms. Import time breakdownThe following import paths have shrunk:
|
Performance SLOsComparing candidate taegyunkim/remove-stack-v1 (06b2266) with baseline 4.0-breaking-changes (1c95d17) ❌ Test Failures (1 suite)❌ telemetryaddmetric - 28/30✅ 1-count-metric-1-timesTime: ✅ 3.170µs (SLO: <20.000µs 📉 -84.1%) vs baseline: +1.9% Memory: ✅ 31.595MB (SLO: <34.000MB -7.1%) vs baseline: +5.1% ✅ 1-count-metrics-100-timesTime: ✅ 211.468µs (SLO: <220.000µs -3.9%) vs baseline: +0.9% Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +4.9% ✅ 1-distribution-metric-1-timesTime: ✅ 3.529µs (SLO: <20.000µs 📉 -82.4%) vs baseline: +2.1% Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.9% ❌ 1-distribution-metrics-100-timesTime: ❌ 222.076µs (SLO: <220.000µs +0.9%) vs baseline: -0.2% Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.9% ✅ 1-gauge-metric-1-timesTime: ✅ 2.276µs (SLO: <20.000µs 📉 -88.6%) vs baseline: ~same Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +4.7% ✅ 1-gauge-metrics-100-timesTime: ✅ 138.127µs (SLO: <150.000µs -7.9%) vs baseline: +0.4% Memory: ✅ 31.516MB (SLO: <34.000MB -7.3%) vs baseline: +4.9% ✅ 1-rate-metric-1-timesTime: ✅ 3.297µs (SLO: <20.000µs 📉 -83.5%) vs baseline: +2.0% Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.8% ✅ 1-rate-metrics-100-timesTime: ✅ 224.309µs (SLO: <250.000µs 📉 -10.3%) vs baseline: +0.7% Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +5.2% ✅ 100-count-metrics-100-timesTime: ✅ 21.015ms (SLO: <22.000ms -4.5%) vs baseline: +2.3% Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.8% ❌ 100-distribution-metrics-100-timesTime: ❌ 2.318ms (SLO: <2.300ms +0.8%) vs baseline: +1.0% Memory: ✅ 31.556MB (SLO: <34.000MB -7.2%) vs baseline: +5.0% ✅ 100-gauge-metrics-100-timesTime: ✅ 1.427ms (SLO: <1.550ms -8.0%) vs baseline: ~same Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.9% ✅ 100-rate-metrics-100-timesTime: ✅ 2.273ms (SLO: <2.550ms 📉 -10.8%) vs baseline: +0.6% Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +4.9% ✅ flush-1-metricTime: ✅ 4.624µs (SLO: <20.000µs 📉 -76.9%) vs baseline: +0.9% Memory: ✅ 31.556MB (SLO: <34.000MB -7.2%) vs baseline: +5.1% ✅ flush-100-metricsTime: ✅ 174.400µs (SLO: <250.000µs 📉 -30.2%) vs baseline: -0.7% Memory: ✅ 31.850MB (SLO: <34.000MB -6.3%) vs baseline: +4.7% ✅ flush-1000-metricsTime: ✅ 2.137ms (SLO: <2.500ms 📉 -14.5%) vs baseline: -0.1% Memory: ✅ 32.696MB (SLO: <34.500MB -5.2%) vs baseline: +4.7% 🟡 Near SLO Breach (4 suites)🟡 djangosimple - 30/30✅ appsecTime: ✅ 20.352ms (SLO: <22.300ms -8.7%) vs baseline: -0.4% Memory: ✅ 66.034MB (SLO: <67.000MB 🟡 -1.4%) vs baseline: +5.0% ✅ exception-replay-enabledTime: ✅ 1.339ms (SLO: <1.450ms -7.7%) vs baseline: ~same Memory: ✅ 64.064MB (SLO: <67.000MB -4.4%) vs baseline: +4.9% ✅ iastTime: ✅ 20.392ms (SLO: <22.250ms -8.3%) vs baseline: -0.5% Memory: ✅ 65.959MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.8% ✅ profilerTime: ✅ 15.451ms (SLO: <16.550ms -6.6%) vs baseline: ~same Memory: ✅ 53.517MB (SLO: <54.500MB 🟡 -1.8%) vs baseline: +4.1% ✅ resource-renamingTime: ✅ 20.537ms (SLO: <21.750ms -5.6%) vs baseline: -0.2% Memory: ✅ 65.996MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8% ✅ span-code-originTime: ✅ 25.285ms (SLO: <28.200ms 📉 -10.3%) vs baseline: -0.6% Memory: ✅ 67.179MB (SLO: <69.500MB -3.3%) vs baseline: +4.8% ✅ tracerTime: ✅ 20.382ms (SLO: <21.750ms -6.3%) vs baseline: -0.3% Memory: ✅ 65.949MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.9% ✅ tracer-and-profilerTime: ✅ 22.814ms (SLO: <23.500ms -2.9%) vs baseline: +1.0% Memory: ✅ 67.459MB (SLO: <68.000MB 🟡 -0.8%) vs baseline: +4.4% ✅ tracer-dont-create-db-spansTime: ✅ 19.303ms (SLO: <21.500ms 📉 -10.2%) vs baseline: ~same Memory: ✅ 65.921MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.7% ✅ tracer-minimalTime: ✅ 16.632ms (SLO: <17.500ms -5.0%) vs baseline: -0.1% Memory: ✅ 65.962MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8% ✅ tracer-nativeTime: ✅ 20.426ms (SLO: <21.750ms -6.1%) vs baseline: +0.1% Memory: ✅ 67.555MB (SLO: <72.500MB -6.8%) vs baseline: +4.7% ✅ tracer-no-cachesTime: ✅ 18.436ms (SLO: <19.650ms -6.2%) vs baseline: ~same Memory: ✅ 65.961MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.7% ✅ tracer-no-databasesTime: ✅ 18.666ms (SLO: <20.100ms -7.1%) vs baseline: -0.4% Memory: ✅ 65.864MB (SLO: <67.000MB 🟡 -1.7%) vs baseline: +4.8% ✅ tracer-no-middlewareTime: ✅ 20.034ms (SLO: <21.500ms -6.8%) vs baseline: -0.6% Memory: ✅ 65.901MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.7% ✅ tracer-no-templatesTime: ✅ 20.315ms (SLO: <22.000ms -7.7%) vs baseline: +0.6% Memory: ✅ 65.827MB (SLO: <67.000MB 🟡 -1.8%) vs baseline: +4.6% 🟡 errortrackingdjangosimple - 6/6✅ errortracking-enabled-allTime: ✅ 18.033ms (SLO: <19.850ms -9.2%) vs baseline: +0.3% Memory: ✅ 65.824MB (SLO: <66.500MB 🟡 -1.0%) vs baseline: +4.7% ✅ errortracking-enabled-userTime: ✅ 18.042ms (SLO: <19.400ms -7.0%) vs baseline: -0.1% Memory: ✅ 65.795MB (SLO: <66.500MB 🟡 -1.1%) vs baseline: +4.7% ✅ tracer-enabledTime: ✅ 17.961ms (SLO: <19.450ms -7.7%) vs baseline: -0.5% Memory: ✅ 65.915MB (SLO: <66.500MB 🟡 -0.9%) vs baseline: +5.1% 🟡 errortrackingflasksqli - 6/6✅ errortracking-enabled-allTime: ✅ 2.075ms (SLO: <2.300ms -9.8%) vs baseline: +0.2% Memory: ✅ 52.687MB (SLO: <53.500MB 🟡 -1.5%) vs baseline: +5.0% ✅ errortracking-enabled-userTime: ✅ 2.073ms (SLO: <2.250ms -7.9%) vs baseline: +0.1% Memory: ✅ 52.585MB (SLO: <53.500MB 🟡 -1.7%) vs baseline: +4.8% ✅ tracer-enabledTime: ✅ 2.070ms (SLO: <2.300ms 📉 -10.0%) vs baseline: -0.2% Memory: ✅ 52.551MB (SLO: <53.500MB 🟡 -1.8%) vs baseline: +4.8% 🟡 flasksimple - 18/18✅ appsec-getTime: ✅ 4.581ms (SLO: <4.750ms -3.5%) vs baseline: -0.6% Memory: ✅ 62.069MB (SLO: <65.000MB -4.5%) vs baseline: +4.9% ✅ appsec-postTime: ✅ 6.627ms (SLO: <6.750ms 🟡 -1.8%) vs baseline: ~same Memory: ✅ 62.285MB (SLO: <65.000MB -4.2%) vs baseline: +5.2% ✅ appsec-telemetryTime: ✅ 4.579ms (SLO: <4.750ms -3.6%) vs baseline: -0.3% Memory: ✅ 62.046MB (SLO: <65.000MB -4.5%) vs baseline: +4.9% ✅ debuggerTime: ✅ 1.857ms (SLO: <2.000ms -7.2%) vs baseline: ~same Memory: ✅ 45.247MB (SLO: <47.000MB -3.7%) vs baseline: +5.0% ✅ iast-getTime: ✅ 1.858ms (SLO: <2.000ms -7.1%) vs baseline: +0.1% Memory: ✅ 42.177MB (SLO: <49.000MB 📉 -13.9%) vs baseline: +4.9% ✅ profilerTime: ✅ 1.907ms (SLO: <2.100ms -9.2%) vs baseline: -0.2% Memory: ✅ 46.206MB (SLO: <47.000MB 🟡 -1.7%) vs baseline: +4.3% ✅ resource-renamingTime: ✅ 3.381ms (SLO: <3.650ms -7.4%) vs baseline: +0.2% Memory: ✅ 52.701MB (SLO: <53.500MB 🟡 -1.5%) vs baseline: +5.5% ✅ tracerTime: ✅ 3.361ms (SLO: <3.650ms -7.9%) vs baseline: -0.2% Memory: ✅ 52.562MB (SLO: <53.500MB 🟡 -1.8%) vs baseline: +5.4% ✅ tracer-nativeTime: ✅ 3.365ms (SLO: <3.650ms -7.8%) vs baseline: +0.3% Memory: ✅ 53.998MB (SLO: <60.000MB 📉 -10.0%) vs baseline: +4.9%
|
| ], | ||
| ), | ||
| Venv( | ||
| name="profile-diff", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@P403n1x87 don't think this venv is used anywhere. Deleting.
|
Is there a plan to rename the remaining |
|
I'm concerned with the size of this diff: 2.6K lines added, 6K lines removed. I think we should break this up into several PRs. Example suggestion: |
If this is not required for the stack v1 removal, I would make this a separate PR. Sounds like a major improvement that will benefit from extra visibility. |
| stack_v2_failure_msg, stack_v2_is_available = _check_for_stack_v2_available() | ||
| if config.stack.v2_enabled and not stack_v2_is_available: | ||
| if config.stack.enabled and not stack_v2_is_available: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this stack_v2_is_available check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it tells whether _stack_v2.so library is available or not.
| configured_features.append("stack_v2") | ||
| else: | ||
| configured_features.append("stack") | ||
| configured_features.append("stack_v2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be just stack, since there's no versioning anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use that config str in at least two places
- Searching/filtering for profiles
- Searching/filtering crash logs
If we have it as stack, that would require us to have a separate logic to check for the version of the library, when we'd like to aggregate profiles/crash logs depending on the configuration.
| # Also patch threading.Thread so echion can track thread lifetimes | ||
| def init_stack_v2() -> None: | ||
| if config.stack.v2_enabled and stack_v2.is_available: | ||
| if config.stack.enabled and stack_v2.is_available: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question for stack_v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It tells us that whether _stack_v2.so file is available. I'd like to keep it.
| self._collectors.append( | ||
| stack.StackCollector( | ||
| tracer=self.tracer, | ||
| endpoint_collection_enabled=self.endpoint_collection_enabled, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why endpoint_collection_enabled was removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's enabled by default and stack_v2 uses a different mechanism to support it.
https://docs.datadoghq.com/profiler/connect_traces_and_profiles/?tab=python#endpoint-profiling
this is what endpoint profiling is and these are relevant lines of code
dd-trace-py/ddtrace/internal/datadog/profiling/ddup/_ddup.pyx
Lines 390 to 394 in c412a40
| processor = tracer._endpoint_call_counter_span_processor | |
| endpoint_counts, endpoint_to_span_ids = processor.reset() | |
| call_ddup_profile_set_endpoints(endpoint_to_span_ids) | |
| call_ddup_profile_add_endpoint_counts(endpoint_counts) |
The removal of stack v1 is done mostly in a single PR as the deadline to make changes to 4.0-breaking-changes is today, and @brettlangdon requested me to remove the code as part of 4.0 release. I initially thought of removing them after 4.0 was released as now it can't be enabled unless someone modifies the code to do so. |
Description
Remove Python Profiler v1 implementation, and clean up test suites.
One of the notable improvements from this PR is that
StackCollectorno longer spawns an extra background thread by inheriting fromCollectorinstead ofPeriodicService. Also, ddtrace/profiling/collector/stack.pyx has been de-cythonized as it now is a simple wrapper aroundstack_v2native extension module.Testing
Risks
Additional Notes