[PROF-8864] Dynamic allocation sampling #3395
Conversation
This is soopah exciting!
Intuitively the approach seems to make sense. I'm curious to look at the benchmarking numbers.
I also wonder how the bias compares to the other options we've used for the other profilers. I seem to remember both Felix and Christophe working at some point on simulators to compare alternatives; it may be worth seeing if we can plug this approach into the simulator and do a comparison?
(Several review threads, now resolved, on ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c, collectors_discrete_dynamic_sampler.h, collectors_cpu_and_wall_time_worker.c, and spec/datadog/profiling/collectors/discrete_dynamic_sampler_spec.rb.)
Did another pass! I think this is getting into pretty good shape -- I guess we need to get together and decide on the next steps towards declaring allocation (and possibly heap) profiling as beta?
(Further resolved review threads on collectors_cpu_and_wall_time_worker.c, collectors_discrete_dynamic_sampler.c, and spec/datadog/profiling/collectors/discrete_dynamic_sampler_spec.rb.)
# Question is: how long do we have to wait for probing samples? Intuitively, we need to build enough budget over
# time for us to be able to take that probing hit assuming things remain the same. Each adjustment window
# with no sampling activity earns us 0.02 seconds of budget. Therefore we need 100 of these to go by before
# we see the next probing sample.
stats = simulate_load(duration_seconds: 99, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(0)
stats = simulate_load(duration_seconds: 1, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(1)
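For reference, the arithmetic the comment walks through can be reproduced directly; the 0.02s-per-window budget, the 2-second per-sample cost, and the 1-second adjustment window are the test's assumptions (implied by the 99s + 1s simulation above), not the sampler's production constants:

```ruby
# Rough sketch of the budget math described in the test comment above.
budget_earned_per_idle_window = 0.02 # seconds of sampling "allowance" earned per adjustment window
cost_of_one_probing_sample    = 2.0  # seconds, the deliberately extreme simulated sampling cost

windows_needed = (cost_of_one_probing_sample / budget_earned_per_idle_window).ceil
# => 100 adjustment windows, i.e. ~100 seconds at one window per second,
#    before the sampler has saved up enough budget for the next probing sample
```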
This seems... a bit too much? E.g. even on an extremely busy application, should we aim to have at least 1 sample per second or something like that?
Hmm, yeah, I guess I went with a rather extreme situation to force the sampler's hand (sampling a single event takes 2 seconds, which is already 2 adjustment windows), and that is what is leading to the extreme probing interval. I'll write a less extreme version of this.
In the latest version, the test thresholds are adjusted, but I actually had something a bit different in mind.
It's more: should we tweak the sampler so that there's some minimum rate it will never fall below, at least during a 60-second profiling period? E.g. it seems very extreme to decide not to take a single sample during a profiled period.
I don't think this is blocking for this PR, especially since we are planning to tweak the thresholds for the new sampler anyway :)
It's kind of a meta question of how hard or soft our target should be: any kind of forced minimum risks making the overhead target a mere suggestion. Bugs aside, the only way for us to arrive at a situation where we never sample in a 60-second period is if we consistently get single-event sampling times that are individually above our overhead target (say the default 0.01 seconds). If that happens and we enforce a minimum of 1 sample/second, then we'd be going 60x over the defined threshold.
An alternative is to simply shut down the allocation sampler if we detect that, due to adjustments, we are never reaching a minimum target (i.e. do not force a minimum, but disable if a minimum cannot be met, since data quality will be lousy).
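A minimal sketch of that alternative, assuming a 1% overhead target and a made-up floor (neither the method nor the constant below exists in the codebase): instead of forcing a minimum rate, check whether the rate the budget can afford has fallen below the floor and, if so, disable the allocation sampler rather than blow past the target.

```ruby
# Hypothetical viability check, not real dd-trace-rb code.
MIN_ACCEPTABLE_SAMPLES_PER_MINUTE = 60 # illustrative floor

def allocation_sampler_viable?(average_seconds_per_sample, overhead_target_percentage: 1.0)
  # Seconds of sampling work the overhead target allows per minute...
  budget_per_minute = 60 * (overhead_target_percentage / 100.0)
  # ...translates into this many samples per minute at the observed per-sample cost.
  affordable_samples_per_minute = budget_per_minute / average_seconds_per_sample
  affordable_samples_per_minute >= MIN_ACCEPTABLE_SAMPLES_PER_MINUTE
end

allocation_sampler_viable?(0.005) # => true:  a 1% budget affords ~120 samples/minute
allocation_sampler_viable?(0.025) # => false: only ~24 samples/minute fit, so disable instead
```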
I'm thinking this effect can happen due to two main causes:
1. There was a hiccup somewhere: the machine is very busy, the kernel descheduled the profiler while it was sampling, the machine was suspended/live-migrated in a VM/snapshotted. That kind of thing.
2. We truly took a huge time to sample, and every sample we take will be as costly as this one.
I think the challenge is -- how do we take 1) in stride, and still be conservative enough to do the right thing in 2)?
Furthermore, I kinda suspect (without data) that 1) may be more common than 2), so that's why I think it's worth trying to take it into consideration, if possible.
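One possible direction, sketched below purely as an illustration (this is not what the PR implements, and the class, constants, and numbers are made up): clamp each observed sampling time to a small multiple of the current smoothed estimate before folding it into the average. A single multi-second stall (case 1) then barely moves the estimate, while a sustained run of genuinely slow samples (case 2) still drags it up within a few observations.

```ruby
# Hypothetical outlier-tolerant estimate of per-sample cost; illustrative only.
class ClampedSamplingTimeEstimate
  SMOOTHING   = 0.1 # weight given to each new observation (simple EWMA)
  OUTLIER_CAP = 5.0 # ignore the part of an observation beyond 5x the current estimate

  def initialize(initial_estimate_seconds)
    @estimate = initial_estimate_seconds
  end

  attr_reader :estimate

  def record(observed_seconds)
    clamped = [observed_seconds, @estimate * OUTLIER_CAP].min
    @estimate = (1 - SMOOTHING) * @estimate + SMOOTHING * clamped
  end
end

estimate = ClampedSamplingTimeEstimate.new(0.001)
estimate.record(2.0)                # one 2s hiccup is clamped to 5ms; the estimate only moves to ~1.4ms
10.times { estimate.record(0.010) } # sustained 10ms samples do steadily pull the estimate up
```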
Yes, you raise a very good point. I have some ideas on it that I'm planning to research separately: https://datadoghq.atlassian.net/browse/PROF-9044
👍 This looks great to me!
I left one last reply ( #3395 (comment) ) regarding the sampler possibly being a bit too strict (?) in extreme cases, but I don't think this is blocking at all, and this is so much better than what we had before anyway :)
Codecov Report

@@            Coverage Diff             @@
##           master    #3395      +/-   ##
==========================================
+ Coverage   98.25%   98.26%   +0.01%
==========================================
  Files        1262     1266       +4
  Lines       73606    74298     +692
  Branches     3445     3476      +31
==========================================
+ Hits        72320    73011     +691
- Misses       1286     1287       +1
**What does this PR do?** This PR removes the unused profiler PID controller implementation.
**Motivation:** In #3190 we imported the PID controller and planned to use it to control the allocation profiler. In the end we went with a different solution for that (#3395), so let's remove the PID controller for now. We can always revert this commit if we need it again.
**Additional Notes:** N/A
**How to test the change?** Validate that CI is still green :)
What does this PR do?
Motivation:
Hardcoded/fixed sample rate configurations cannot be a good fit for all applications and their usage patterns.
Take the previous default of sampling 1 allocation out of every 50: whether that rate is too costly or not aggressive enough depends entirely on how many allocations the application actually does.
By using a dynamic sampler that attempts to adhere to a specific CPU overhead target, it becomes much easier to reason about the impact of the allocation profiler, and the sampling automatically adapts to usage patterns.
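To make the "adhere to a CPU overhead target" idea a bit more concrete, here is a heavily simplified sketch. The actual implementation is the discrete dynamic sampler added in ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c and handles much more (probing samples, budget carry-over between windows, etc.); all names and defaults below are made up for illustration.

```ruby
# Made-up Ruby approximation of an overhead-targeting sampler; the real thing is in C.
class DynamicSamplerSketch
  def initialize(target_overhead_percentage: 1.0, window_seconds: 1.0)
    @window_budget = window_seconds * (target_overhead_percentage / 100.0) # sampling seconds allowed per window
    @probability   = 1.0 # start by sampling everything, then adapt
    @events_seen   = 0
    @samples_taken = 0
    @time_sampling = 0.0
  end

  attr_reader :probability

  # Called on every allocation; decides whether this event gets sampled.
  def sample?
    @events_seen += 1
    sampled = rand < @probability
    @samples_taken += 1 if sampled
    sampled
  end

  # Called with how long a sample actually took to collect.
  def record_sampling_time(seconds)
    @time_sampling += seconds
  end

  # Called once per adjustment window (e.g. every second).
  def adjust!
    if @samples_taken.positive?
      average_cost       = @time_sampling / @samples_taken
      affordable_samples = @window_budget / average_cost                  # samples that fit in next window's budget
      @probability       = [affordable_samples / @events_seen, 1.0].min   # assuming a similar event rate
    end
    @events_seen = @samples_taken = 0
    @time_sampling = 0.0
  end
end
```

The key property is that the per-event sampling probability is re-derived every window from the observed sampling cost and event rate, instead of being a fixed 1-in-50 constant.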
Additional Notes:
The heap profiler still uses a fixed "track every x allocation samples" logic, but the plan is to add dynamic sampling there as well in a follow-up PR.
Nevertheless, this PR already has a positive effect on heap tracking overhead, since part of that work currently occurs in the context of allocation sampling.
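For reference, the fixed heap-tracking behavior mentioned above amounts to something like the following; the constant name and value are made up for illustration:

```ruby
# Illustrative only: keep every Nth allocation sample for heap tracking.
HEAP_SAMPLE_EVERY = 10 # hypothetical value, not the real default
allocation_samples_seen = 0

track_for_heap = lambda do
  allocation_samples_seen += 1
  (allocation_samples_seen % HEAP_SAMPLE_EVERY).zero?
end
```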
CPU usage, P50, and P95 behaviour are improved in all allocation+heap flavours. However, P100 for allocation-sampling-only becomes slightly noisier, as the dynamic nature of sampling can lead to a certain amount of oversampling around spikes (NOTE: in the benchmark screenshots, the left side is this PR and the right side is current master).
We can study extra strategies to reduce the P100 noise as a follow-up.
How to test the change?
For Datadog employees: if this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance. Unsure? Have a question? Request a review!