-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-8864] Use the dynamic sampler for allocations #3382
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few notes. Overall I'm curious to see if this approach will pan out/be enough 👀
option :experimental_allocation_sample_rate do |o| | ||
o.type :int | ||
o.env 'DD_PROFILING_EXPERIMENTAL_ALLOCATION_SAMPLE_RATE' | ||
o.default 50 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a very small chance, but perhaps someone out there actually made use of this, so I would suggest doing like we did for allocation_counting_enabled
above and keep the setting, replacing it warning message.
Just to make sure users have a smooth upgrade experience, and we're still on time to drop this for 2.0.
worker:, | ||
code_provenance_collector:, | ||
internal_metadata:, | ||
minimum_duration_seconds: PROFILE_DURATION_THRESHOLD_SECONDS, | ||
time_provider: Time | ||
) | ||
@pprof_recorder = pprof_recorder | ||
@worker = worker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: As far as the Exporter
cares, it wants something that provides stats, regardless of what that thing is (e.g. the CpuAndWallTimeWorker). So maybe we can call it profiler_stats
or something like that?
@internal_metadata_json = JSON.fast_generate(internal_metadata.map { |k, v| [k, v.to_s] }.to_h) | ||
@internal_metadata_json = JSON.fast_generate(internal_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I guess we're OK with using non-string values as well for this in the backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are, backend should expect all of internal_metadata to be a valid JSON.
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
VALUE pretty_cpu_sampling_time_ns_total = state->stats.cpu_sampling_time_ns_total == 0 ? Qnil : ULL2NUM(state->stats.cpu_sampling_time_ns_total); | ||
VALUE pretty_cpu_sampling_time_ns_avg = | ||
state->stats.cpu_sampled == 0 ? Qnil : DBL2NUM(((double) state->stats.cpu_sampling_time_ns_total) / state->stats.cpu_sampled); | ||
VALUE pretty_cpu_sleeping_time_ns_avg = state->stats.cpu_sampling_time_ns_max == 0 ? Qnil: DBL2NUM(((double) state->stats.cpu_sampling_sleep_time_ns_total) / state->stats.cpu_sampled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, why state->stats.cpu_sampling_time_ns_max == 0
and not state->stats.cpu_sampled == 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm dumb that's why 😅
ext/ddtrace_profiling_native_extension/collectors_dynamic_sampling_rate.c
Outdated
Show resolved
Hide resolved
// We currently have 2 flavours of these functions: | ||
// * `dynamic_sampling_rate_after_sample_continuous()` - This function operates under the assumption that, if desired | ||
// we could be continuously sampling. In other words, we own the decision of when to sample and thus the overhead | ||
// is a direct result of how much a single sample takes and how often we choose to do this. | ||
// * `dynamic_sampling_rate_after_sample_discrete()` - This function operates under the assumption that sampling | ||
// cannot be done at will and has to align with discrete and distinct sampling opportunities (e.g. allocation | ||
// events). Thus overhead calculations have to take into account the approximate interval between these opportunities | ||
// which we do by keeping an exponential moving average of the times between consutive `dynamic_sampling_rate_should_sample` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: May be worth clarifying which one we're using in each situation (right now we only mention allocation for discrete)
atomic_long next_sample_after_monotonic_wall_time_ns; | ||
long last_check_time_ns; | ||
unsigned long tick_time_ns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that was not really documented before is why there's an atomic_long
for next_sample_after_monotonic_wall_time_ns
-- because the value gets set on the thread that samples (e.g. the one that has the GVL) but also read by the CpuAndWallTimeWorker
thread.
For the new bits, we don't actually need the synchronization as all the calls will be done by the thread holding the GVL but.... we should probably document all these assumptions, as this code in a way is less generic than it looks in regards to thread safety.
// * between_time = 0 | ||
// | ||
// Then sleeping_time would wield (100 * 1ms) / 2 - 1 = 49ms | ||
uint64_t time_to_sleep_ns = 100.0 * sampling_time_ns / overhead_target - sampling_time_ns - tick_time_ns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can underflow? E.g. if sampling time is small and tick time is large?
// * sleeping_time -> How long we want to delay sampling for to keep to overhead_target | ||
// * tick_time -> Time between sampling opportunities (0 for continuous operation, time between sampling decisions in discrete ones) | ||
// Thus, total_time can be understood to be sampling_time + sleeping_time + tick_time and we want to solve for sleep_time in the | ||
// following relation: | ||
// | ||
// sampling_time ----- overhead_target | ||
// total_time ------ 100% | ||
// | ||
// Which wields: | ||
// | ||
// total_time = 100 * sampling_time / overhead_target <=> | ||
// <=> sleeping_time + sampling_time + tick_time = 100 * sampling_time / overhead_target <=> | ||
// <=> sleeping_time = 100 * sampling_time / overhead_target - sampling_time - tick_time | ||
// | ||
// For a concrete example of continuous sampling where: | ||
// * overhead_target = 2% | ||
// * sampling_time = 1ms | ||
// * between_time = 0 | ||
// | ||
// Then sleeping_time would wield (100 * 1ms) / 2 - 1 = 49ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I suggest using time_to_sleep
here or switching the equation to use sleeping_time
, it's weird to use two names for the same thing :)
…_time_worker.c Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
…_time_worker.c Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
…ling_rate.c Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
df69a6f
to
70e6c70
Compare
Replaced with #3395 |
2.0 Upgrade Guide notes
What does this PR do?
Motivation:
Additional Notes:
How to test the change?
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.Unsure? Have a question? Request a review!