
[PROF-8864] Use the dynamic sampler for allocations #3382

Closed
wants to merge 10 commits from the alexjf/prof-8864-dynamic-allocation-sampling branch

Conversation


@AlexJF AlexJF commented Jan 12, 2024

2.0 Upgrade Guide notes

What does this PR do?

Motivation:

Additional Notes:

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jan 12, 2024
Member

@ivoanjo ivoanjo left a comment


Left a few notes. Overall I'm curious to see if this approach will pan out/be enough 👀

Comment on lines -376 to -380
option :experimental_allocation_sample_rate do |o|
o.type :int
o.env 'DD_PROFILING_EXPERIMENTAL_ALLOCATION_SAMPLE_RATE'
o.default 50
end
Member


It's a very small chance, but perhaps someone out there actually made use of this, so I would suggest doing what we did for allocation_counting_enabled above and keeping the setting, replacing it with a warning message.

Just to make sure users have a smooth upgrade experience, and we're still on time to drop this for 2.0.

Comment on lines +32 to +39
worker:,
code_provenance_collector:,
internal_metadata:,
minimum_duration_seconds: PROFILE_DURATION_THRESHOLD_SECONDS,
time_provider: Time
)
@pprof_recorder = pprof_recorder
@worker = worker
Member


Minor: As far as the Exporter cares, it wants something that provides stats, regardless of what that thing is (e.g. the CpuAndWallTimeWorker). So maybe we can call it profiler_stats or something like that?

@internal_metadata_json = JSON.fast_generate(internal_metadata.map { |k, v| [k, v.to_s] }.to_h)
@internal_metadata_json = JSON.fast_generate(internal_metadata)
Member


Minor: I guess we're OK with using non-string values as well for this in the backend?

Contributor Author


We are; the backend should expect all of internal_metadata to be valid JSON.

VALUE pretty_cpu_sampling_time_ns_total = state->stats.cpu_sampling_time_ns_total == 0 ? Qnil : ULL2NUM(state->stats.cpu_sampling_time_ns_total);
VALUE pretty_cpu_sampling_time_ns_avg =
state->stats.cpu_sampled == 0 ? Qnil : DBL2NUM(((double) state->stats.cpu_sampling_time_ns_total) / state->stats.cpu_sampled);
VALUE pretty_cpu_sleeping_time_ns_avg = state->stats.cpu_sampling_time_ns_max == 0 ? Qnil: DBL2NUM(((double) state->stats.cpu_sampling_sleep_time_ns_total) / state->stats.cpu_sampled);
Member


I'm curious, why state->stats.cpu_sampling_time_ns_max == 0 and not state->stats.cpu_sampled == 0 ?

Contributor Author

@AlexJF AlexJF Jan 15, 2024


Because I'm dumb that's why 😅
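
As a side note, a minimal sketch of what the corrected guard might look like, assuming the intent is to average the sleep time over the number of cpu samples taken (an illustration only, not necessarily the exact follow-up change):

// Guard on cpu_sampled, like the other averages: with no samples taken there is
// nothing to average over, so report nil instead of dividing by zero.
VALUE pretty_cpu_sleeping_time_ns_avg =
  state->stats.cpu_sampled == 0 ? Qnil : DBL2NUM(((double) state->stats.cpu_sampling_sleep_time_ns_total) / state->stats.cpu_sampled);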

Comment on lines 35 to 42
// We currently have 2 flavours of these functions:
// * `dynamic_sampling_rate_after_sample_continuous()` - This function operates under the assumption that, if desired
// we could be continuously sampling. In other words, we own the decision of when to sample and thus the overhead
// is a direct result of how much a single sample takes and how often we choose to do this.
// * `dynamic_sampling_rate_after_sample_discrete()` - This function operates under the assumption that sampling
// cannot be done at will and has to align with discrete and distinct sampling opportunities (e.g. allocation
// events). Thus overhead calculations have to take into account the approximate interval between these opportunities
// which we do by keeping an exponential moving average of the times between consecutive `dynamic_sampling_rate_should_sample`
Member


Minor: May be worth clarifying which one we're using in each situation (right now we only mention allocation for discrete)

Comment on lines 10 to 12
atomic_long next_sample_after_monotonic_wall_time_ns;
long last_check_time_ns;
unsigned long tick_time_ns;
Member


Something that was not really documented before is why there's an atomic_long for next_sample_after_monotonic_wall_time_ns -- because the value gets set on the thread that samples (e.g. the one that has the GVL) but also read by the CpuAndWallTimeWorker thread.

For the new bits, we don't actually need the synchronization, as all the calls will be done by the thread holding the GVL, but... we should probably document all these assumptions, as this code is in a way less generic than it looks with regard to thread safety.
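
To make those assumptions concrete, here is a hypothetical sketch (the helper functions are invented for illustration; only the field name and the described access pattern come from the discussion above) of why the atomic_long is needed:

#include <stdatomic.h>

// Written by the sampling thread (the one holding the GVL)...
static void set_next_sample_after(atomic_long *next_sample_after_monotonic_wall_time_ns, long value_ns) {
  atomic_store(next_sample_after_monotonic_wall_time_ns, value_ns);
}

// ...but also read from the CpuAndWallTimeWorker thread, which is why a plain long would not be enough.
static long read_next_sample_after(atomic_long *next_sample_after_monotonic_wall_time_ns) {
  return atomic_load(next_sample_after_monotonic_wall_time_ns);
}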

// * between_time = 0
//
// Then sleeping_time would yield (100 * 1ms) / 2 - 1 = 49ms
uint64_t time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
Member


I think this can underflow? E.g. if sampling time is small and tick time is large?
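
One possible guard, sketched here as an assumption (the clamping approach is illustrative, not necessarily what the PR ended up doing), is to compute the value as a double and clamp it at zero before converting:

// Compute in floating point first; if the sampling time is small relative to the
// tick time, the raw value goes negative, and we clamp the sleep to 0 instead of
// letting the conversion to uint64_t produce a huge bogus value.
double raw_time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
uint64_t time_to_sleep_ns = raw_time_to_sleep_ns > 0 ? (uint64_t) raw_time_to_sleep_ns : 0;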

Comment on lines +108 to +127
// * sleeping_time -> How long we want to delay sampling for to keep to overhead_target
// * tick_time -> Time between sampling opportunities (0 for continuous operation, time between sampling decisions in discrete ones)
// Thus, total_time can be understood to be sampling_time + sleeping_time + tick_time and we want to solve for sleep_time in the
// following relation:
//
// sampling_time ----- overhead_target
// total_time ------ 100%
//
// Which yields:
//
// total_time = 100 * sampling_time / overhead_target <=>
// <=> sleeping_time + sampling_time + tick_time = 100 * sampling_time / overhead_target <=>
// <=> sleeping_time = 100 * sampling_time / overhead_target - sampling_time - tick_time
//
// For a concrete example of continuous sampling where:
// * overhead_target = 2%
// * sampling_time = 1ms
// * between_time = 0
//
// Then sleeping_time would yield (100 * 1ms) / 2 - 1 = 49ms
Member


Minor: I suggest using time_to_sleep here or switching the equation to use sleeping_time, it's weird to use two names for the same thing :)
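
For reference, a quick numeric check of the worked example in that comment, using the time_to_sleep_ns name from the code (the concrete values are the ones stated in the comment: 2% overhead target, 1ms sampling time, 0 tick time):

uint64_t sampling_time_ns = 1000000; // 1ms
uint64_t tick_time_ns = 0;           // continuous operation
double overhead_target = 2.0;        // percent
double time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns;
// 100 * 1ms / 2 - 1ms - 0 = 50ms - 1ms = 49ms, matching the comment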

@AlexJF AlexJF force-pushed the alexjf/prof-8864-dynamic-allocation-sampling branch from df69a6f to 70e6c70 on January 15, 2024 16:05
Contributor Author

AlexJF commented Jan 23, 2024

Replaced with #3395

@AlexJF AlexJF closed this Jan 23, 2024