[PROF-8864] Dynamic allocation sampling #3395
Conversation
This is soopah exciting!
Intuitively the approach seems to make sense. I'm curious to look at the benchmarking numbers.
I also wonder how the bias compares to the other options we've used for the other profilers. I seem to remember both Felix and Christophe working at some point on simulators to compare alternatives; it may be worth seeing if we can plug this approach into the simulator and do a comparison?
(Several review threads, now resolved, on ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c, collectors_discrete_dynamic_sampler.h, collectors_cpu_and_wall_time_worker.c, and spec/datadog/profiling/collectors/discrete_dynamic_sampler_spec.rb.)
Did another pass! I think this is getting into pretty good shape -- I guess we need to get together and decide on the next steps towards declaring allocation (and possibly heap) profiling as beta?
(Further resolved review threads on collectors_cpu_and_wall_time_worker.c, collectors_discrete_dynamic_sampler.c, and spec/datadog/profiling/collectors/discrete_dynamic_sampler_spec.rb.)
# Question is: how long do we have to wait for probing samples? Intuitively, we need to build enough budget over
# time for us to be able to take that probing hit assuming things remain the same. Each adjustment window
# with no sampling activity earns us 0.02 seconds of budget. Therefore we need 100 of these to go by before
# we see the next probing sample.
stats = simulate_load(duration_seconds: 99, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(0)
stats = simulate_load(duration_seconds: 1, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(1)
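For reference, the arithmetic the comment walks through can be reproduced directly; the 0.02s-per-window budget, the 2-second per-sample cost, and the 1-second adjustment window are the test's assumptions (implied by the 99s + 1s simulation above), not the sampler's production constants:

```ruby
# Rough sketch of the budget math described in the test comment above.
budget_earned_per_idle_window = 0.02 # seconds of sampling "allowance" earned per adjustment window
cost_of_one_probing_sample    = 2.0  # seconds, the deliberately extreme simulated sampling cost

windows_needed = (cost_of_one_probing_sample / budget_earned_per_idle_window).ceil
# => 100 adjustment windows, i.e. ~100 seconds at one window per second,
#    before the sampler has saved up enough budget for the next probing sample
```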
This seems... a bit too much? E.g. even on an extremely busy application, should we aim to have at least 1 sample per second or something like that?
Hmm, yeah, I guess I went with a rather extreme situation to force the sampler's hand (sampling a single event takes 2 seconds, which is already 2 adjustment windows), and that is what is leading to the extreme probing interval. I'll write a less extreme version of this.
In the latest version, the test thresholds are adjusted, but I actually had something a bit different in mind.
It's more: should we tweak the sampler so that there's some minimum rate it will never fall below, at least during a 60-second profiling period? E.g. it seems very extreme to decide not to take a single sample during a profiled period.
I don't think this is blocking for this PR, especially since we are planning to tweak the thresholds for the new sampler anyway :)
It's kind of a meta question of how hard or soft our target should be: any kind of forced minimum risks making the overhead target a mere suggestion. Bugs aside, the only way for us to arrive at a situation where we never sample in a 60-second period is if we consistently get single-event sampling times that are individually above our overhead target (say the default 0.01 seconds). If that happens and we enforce a minimum of 1 sample/second, then we'd be going 60x over the defined threshold.
An alternative is to simply shut down the allocation sampler if we detect that, due to adjustments, we are never reaching a minimum target (i.e. do not force a minimum, but disable if a minimum cannot be met, since data quality will be lousy).
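A minimal sketch of that alternative, assuming a 1% overhead target and a made-up floor (neither the method nor the constant below exists in the codebase): instead of forcing a minimum rate, check whether the rate the budget can afford has fallen below the floor and, if so, disable the allocation sampler rather than blow past the target.

```ruby
# Hypothetical viability check, not real dd-trace-rb code.
MIN_ACCEPTABLE_SAMPLES_PER_MINUTE = 60 # illustrative floor

def allocation_sampler_viable?(average_seconds_per_sample, overhead_target_percentage: 1.0)
  # Seconds of sampling work the overhead target allows per minute...
  budget_per_minute = 60 * (overhead_target_percentage / 100.0)
  # ...translates into this many samples per minute at the observed per-sample cost.
  affordable_samples_per_minute = budget_per_minute / average_seconds_per_sample
  affordable_samples_per_minute >= MIN_ACCEPTABLE_SAMPLES_PER_MINUTE
end

allocation_sampler_viable?(0.005) # => true:  a 1% budget affords ~120 samples/minute
allocation_sampler_viable?(0.025) # => false: only ~24 samples/minute fit, so disable instead
```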
I'm thinking this effect can happen due to two main causes:
1. There was a hiccup somewhere: the machine is very busy, the kernel descheduled the profiler while it was sampling, the machine was suspended/live-migrated in a VM/snapshotted. That kind of thing.
2. We truly took a huge time to sample, and every sample we take will be as costly as this one.
I think the challenge is -- how do we take 1) in stride, and still be conservative enough to do the right thing in 2)?
Furthermore, I kinda suspect (without data) that 1) may be more common than 2), so that's why I think it's worth trying to take it into consideration, if possible.
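One possible direction, sketched below purely as an illustration (this is not what the PR implements, and the class, constants, and numbers are made up): clamp each observed sampling time to a small multiple of the current smoothed estimate before folding it into the average. A single multi-second stall (case 1) then barely moves the estimate, while a sustained run of genuinely slow samples (case 2) still drags it up within a few observations.

```ruby
# Hypothetical outlier-tolerant estimate of per-sample cost; illustrative only.
class ClampedSamplingTimeEstimate
  SMOOTHING   = 0.1 # weight given to each new observation (simple EWMA)
  OUTLIER_CAP = 5.0 # ignore the part of an observation beyond 5x the current estimate

  def initialize(initial_estimate_seconds)
    @estimate = initial_estimate_seconds
  end

  attr_reader :estimate

  def record(observed_seconds)
    clamped = [observed_seconds, @estimate * OUTLIER_CAP].min
    @estimate = (1 - SMOOTHING) * @estimate + SMOOTHING * clamped
  end
end

estimate = ClampedSamplingTimeEstimate.new(0.001)
estimate.record(2.0)                # one 2s hiccup is clamped to 5ms; the estimate only moves to ~1.4ms
10.times { estimate.record(0.010) } # sustained 10ms samples do steadily pull the estimate up
```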
Yes, you raise a very good point. I have some ideas on it that I'm planning to research separately: https://datadoghq.atlassian.net/browse/PROF-9044
👍 This looks great to me!
I left one last reply ( #3395 (comment) ) regarding the sampler possibly being a bit too strict (?) in extreme cases, but I don't think this is blocking at all, and this is so much better than what we had before anyway :)
Codecov Report

@@            Coverage Diff             @@
##           master    #3395      +/-   ##
==========================================
+ Coverage   98.25%   98.26%   +0.01%
==========================================
  Files        1262     1266       +4
  Lines       73606    74298     +692
  Branches     3445     3476      +31
==========================================
+ Hits        72320    73011     +691
- Misses       1286     1287       +1
**What does this PR do?** This PR removes the unused profiler PID controller implementation.
**Motivation:** In #3190 we imported the PID controller and planned to use it to control the allocation profiler. In the end we went with a different solution for that (#3395), so let's remove the PID controller for now. We can always revert this commit if we need it again.
**Additional Notes:** N/A
**How to test the change?** Validate that CI is still green :)
What does this PR do?
Motivation:
Hardcoded/fixed sample rate configurations cannot be a good fit for all applications and their usage patterns.
Take the previous default of sampling 1 allocation out of every 50: whether that rate is too costly or not aggressive enough depends entirely on how many allocations the application actually does.
By using a dynamic sampler that attempts to adhere to a specific CPU overhead target, it becomes much easier to reason about the impact of the allocation profiler, and the sampling automatically adapts to usage patterns.
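To make the "adhere to a CPU overhead target" idea a bit more concrete, here is a heavily simplified sketch. The actual implementation is the discrete dynamic sampler added in ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c and handles much more (probing samples, budget carry-over between windows, etc.); all names and defaults below are made up for illustration.

```ruby
# Made-up Ruby approximation of an overhead-targeting sampler; the real thing is in C.
class DynamicSamplerSketch
  def initialize(target_overhead_percentage: 1.0, window_seconds: 1.0)
    @window_budget = window_seconds * (target_overhead_percentage / 100.0) # sampling seconds allowed per window
    @probability   = 1.0 # start by sampling everything, then adapt
    @events_seen   = 0
    @samples_taken = 0
    @time_sampling = 0.0
  end

  attr_reader :probability

  # Called on every allocation; decides whether this event gets sampled.
  def sample?
    @events_seen += 1
    sampled = rand < @probability
    @samples_taken += 1 if sampled
    sampled
  end

  # Called with how long a sample actually took to collect.
  def record_sampling_time(seconds)
    @time_sampling += seconds
  end

  # Called once per adjustment window (e.g. every second).
  def adjust!
    if @samples_taken.positive?
      average_cost       = @time_sampling / @samples_taken
      affordable_samples = @window_budget / average_cost                  # samples that fit in next window's budget
      @probability       = [affordable_samples / @events_seen, 1.0].min   # assuming a similar event rate
    end
    @events_seen = @samples_taken = 0
    @time_sampling = 0.0
  end
end
```

The key property is that the per-event sampling probability is re-derived every window from the observed sampling cost and event rate, instead of being a fixed 1-in-50 constant.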
Additional Notes:
The heap profiler still uses a fixed "track every x allocation samples" logic, but the plan is to add dynamic sampling there as well in a follow-up PR.
Nevertheless, this PR already has a positive effect on heap tracking overhead, since part of that work currently occurs in the context of allocation sampling.
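For reference, the fixed heap-tracking behavior mentioned above amounts to something like the following; the constant name and value are made up for illustration:

```ruby
# Illustrative only: keep every Nth allocation sample for heap tracking.
HEAP_SAMPLE_EVERY = 10 # hypothetical value, not the real default
allocation_samples_seen = 0

track_for_heap = lambda do
  allocation_samples_seen += 1
  (allocation_samples_seen % HEAP_SAMPLE_EVERY).zero?
end
```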
CPU usage, P50, and P95 behaviour are improved in all allocation+heap flavours. However, P100 for allocation-sampling-only becomes slightly noisier, as the dynamic nature of sampling can lead to a certain amount of oversampling around spikes (NOTE: in the benchmark screenshots, the left side is this PR and the right side is current master).
We can study extra strategies to reduce the P100 noise as a follow-up.
How to test the change?
For Datadog employees: if this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance. Unsure? Have a question? Request a review!