
[PROF-8864] Use the dynamic sampler for allocations #3382

Closed
wants to merge 10 commits from the alexjf/prof-8864-dynamic-allocation-sampling branch

Conversation


@AlexJF AlexJF commented Jan 12, 2024

2.0 Upgrade Guide notes

What does this PR do?

Motivation:

Additional Notes:

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jan 12, 2024
Member

@ivoanjo ivoanjo left a comment


Left a few notes. Overall I'm curious to see if this approach will pan out/be enough 👀

Comment on lines -376 to -380
option :experimental_allocation_sample_rate do |o|
o.type :int
o.env 'DD_PROFILING_EXPERIMENTAL_ALLOCATION_SAMPLE_RATE'
o.default 50
end
Member


It's a very small chance, but perhaps someone out there actually made use of this, so I would suggest doing what we did for allocation_counting_enabled above and keeping the setting, replacing it with a warning message.

Just to make sure users have a smooth upgrade experience, and we're still on time to drop this for 2.0.

Comment on lines +32 to +39
worker:,
code_provenance_collector:,
internal_metadata:,
minimum_duration_seconds: PROFILE_DURATION_THRESHOLD_SECONDS,
time_provider: Time
)
@pprof_recorder = pprof_recorder
@worker = worker
Member


Minor: As far as the Exporter cares, it wants something that provides stats, regardless of what that thing is (e.g. the CpuAndWallTimeWorker). So maybe we can call it profiler_stats or something like that?

@internal_metadata_json = JSON.fast_generate(internal_metadata.map { |k, v| [k, v.to_s] }.to_h)
@internal_metadata_json = JSON.fast_generate(internal_metadata)
Member


Minor: I guess we're OK with using non-string values as well for this in the backend?

Contributor Author


We are; the backend should expect all of internal_metadata to be valid JSON.

VALUE pretty_cpu_sampling_time_ns_total = state->stats.cpu_sampling_time_ns_total == 0 ? Qnil : ULL2NUM(state->stats.cpu_sampling_time_ns_total);
VALUE pretty_cpu_sampling_time_ns_avg =
state->stats.cpu_sampled == 0 ? Qnil : DBL2NUM(((double) state->stats.cpu_sampling_time_ns_total) / state->stats.cpu_sampled);
VALUE pretty_cpu_sleeping_time_ns_avg = state->stats.cpu_sampling_time_ns_max == 0 ? Qnil: DBL2NUM(((double) state->stats.cpu_sampling_sleep_time_ns_total) / state->stats.cpu_sampled);
Member


I'm curious, why state->stats.cpu_sampling_time_ns_max == 0 and not state->stats.cpu_sampled == 0 ?

Contributor Author

@AlexJF AlexJF Jan 15, 2024


Because I'm dumb that's why 😅
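
As a side note, a minimal sketch of what the corrected guard might look like, assuming the intent is to average the sleep time over the number of cpu samples taken (an illustration only, not necessarily the exact follow-up change):

// Guard on cpu_sampled, like the other averages: with no samples taken there is
// nothing to average over, so report nil instead of dividing by zero.
VALUE pretty_cpu_sleeping_time_ns_avg =
  state->stats.cpu_sampled == 0 ? Qnil : DBL2NUM(((double) state->stats.cpu_sampling_sleep_time_ns_total) / state->stats.cpu_sampled);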

Comment on lines 35 to 42
// We currently have 2 flavours of these functions:
// * `dynamic_sampling_rate_after_sample_continuous()` - This function operates under the assumption that, if desired
// we could be continuously sampling. In other words, we own the decision of when to sample and thus the overhead
// is a direct result of how much a single sample takes and how often we choose to do this.
// * `dynamic_sampling_rate_after_sample_discrete()` - This function operates under the assumption that sampling
// cannot be done at will and has to align with discrete and distinct sampling opportunities (e.g. allocation
// events). Thus overhead calculations have to take into account the approximate interval between these opportunities
// which we do by keeping an exponential moving average of the times between consecutive `dynamic_sampling_rate_should_sample`
Member


Minor: May be worth clarifying which one we're using in each situation (right now we only mention allocation for discrete)

Comment on lines 10 to 12
atomic_long next_sample_after_monotonic_wall_time_ns;
long last_check_time_ns;
unsigned long tick_time_ns;
Member


Something that was not really documented before is why there's an atomic_long for next_sample_after_monotonic_wall_time_ns -- because the value gets set on the thread that samples (e.g. the one that has the GVL) but also read by the CpuAndWallTimeWorker thread.

For the new bits, we don't actually need the synchronization, as all the calls will be done by the thread holding the GVL, but... we should probably document all these assumptions, as this code is in a way less generic than it looks with regard to thread safety.
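
To make those assumptions concrete, here is a hypothetical sketch (the helper functions are invented for illustration; only the field name and the described access pattern come from the discussion above) of why the atomic_long is needed:

#include <stdatomic.h>

// Written by the sampling thread (the one holding the GVL)...
static void set_next_sample_after(atomic_long *next_sample_after_monotonic_wall_time_ns, long value_ns) {
  atomic_store(next_sample_after_monotonic_wall_time_ns, value_ns);
}

// ...but also read from the CpuAndWallTimeWorker thread, which is why a plain long would not be enough.
static long read_next_sample_after(atomic_long *next_sample_after_monotonic_wall_time_ns) {
  return atomic_load(next_sample_after_monotonic_wall_time_ns);
}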

// * between_time = 0
//
// Then sleeping_time would yield (100 * 1ms) / 2 - 1 = 49ms
uint64_t time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
Member


I think this can underflow? E.g. if sampling time is small and tick time is large?
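
One possible guard, sketched here as an assumption (the clamping approach is illustrative, not necessarily what the PR ended up doing), is to compute the value as a double and clamp it at zero before converting:

// Compute in floating point first; if the sampling time is small relative to the
// tick time, the raw value goes negative, and we clamp the sleep to 0 instead of
// letting the conversion to uint64_t produce a huge bogus value.
double raw_time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
uint64_t time_to_sleep_ns = raw_time_to_sleep_ns > 0 ? (uint64_t) raw_time_to_sleep_ns : 0;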

Comment on lines +108 to +127
// * sleeping_time -> How long we want to delay sampling for to keep to overhead_target
// * tick_time -> Time between sampling opportunities (0 for continuous operation, time between sampling decisions in discrete ones)
// Thus, total_time can be understood to be sampling_time + sleeping_time + tick_time and we want to solve for sleep_time in the
// following relation:
//
// sampling_time ----- overhead_target
// total_time ------ 100%
//
// Which yields:
//
// total_time = 100 * sampling_time / overhead_target <=>
// <=> sleeping_time + sampling_time + tick_time = 100 * sampling_time / overhead_target <=>
// <=> sleeping_time = 100 * sampling_time / overhead_target - sampling_time - tick_time
//
// For a concrete example of continuous sampling where:
// * overhead_target = 2%
// * sampling_time = 1ms
// * between_time = 0
//
// Then sleeping_time would yield (100 * 1ms) / 2 - 1 = 49ms
Member


Minor: I suggest using time_to_sleep here or switching the equation to use sleeping_time, it's weird to use two names for the same thing :)
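
For reference, a quick numeric check of the worked example in that comment, using the time_to_sleep_ns name from the code (the concrete values are the ones stated in the comment: 2% overhead target, 1ms sampling time, 0 tick time):

uint64_t sampling_time_ns = 1000000; // 1ms
uint64_t tick_time_ns = 0;           // continuous operation
double overhead_target = 2.0;        // percent
double time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
// 100 * 1ms / 2 - 1ms - 0 = 50ms - 1ms = 49ms, matching the comment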

@AlexJF AlexJF force-pushed the alexjf/prof-8864-dynamic-allocation-sampling branch from df69a6f to 70e6c70 on January 15, 2024 16:05
Contributor Author

AlexJF commented Jan 23, 2024

Replaced with #3395

@AlexJF AlexJF closed this Jan 23, 2024