[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313
Conversation
ext/ddtrace_profiling_native_extension/collectors_thread_context.c
.wall_time_ns          = state->gc_tracking.accumulated_wall_time_ns,
.timeline_wall_time_ns = state->gc_tracking.accumulated_wall_time_ns,
I don't get the need for both? Is this a backend limitation (i.e., why doesn't it just parse `wall_time_ns` values directly)? Why does it (seem to) work fine for non-GC frames?
This is a great question. I've pushed a note in 391db52 to address this:
// This event gets both a regular cpu/wall-time duration, as a normal cpu/wall-time sample would, as well as a
// timeline duration.
// This is done to enable two use-cases:
// * regular cpu/wall-time makes this event show up as a regular stack in the flamegraph
// * the timeline duration is used when the event shows up in the timeline
In practice, we could change the backend to take the regular cpu/wall-time duration and use it for the timeline event, but doing it this way keeps us consistent with what other runtimes do for timeline (e.g. dotnet), and thus makes it easy to share code across them.
Codecov Report — Attention:

Additional details and impacted files (Coverage Diff, base `alexjf/prof-8667-heap-profiling-part1` vs #3313):

| | base | #3313 | +/- |
|---|---:|---:|---:|
| Coverage | 98.22% | 98.23% | |
| Files | 1253 | 1253 | |
| Lines | 72794 | 72774 | -20 |
| Branches | 3419 | 3414 | -5 |
| Hits | 71504 | 71486 | -18 |
| Misses | 1290 | 1288 | -2 |

☔ View full report in Codecov by Sentry.
Still LGTM, 👍 Heads-up on incoming conflicts 😄
Awesome work, @ivoanjo!
When this feature was added, we weren't yet working on allocation profiling, so having stacks would hint at which allocation triggered the GC. This... is basically a very crappy allocation profiler, and as an anti-bonus it can also get quite expensive. As part of making this feature shape up for us to have it on by default, let's no longer collect stacks. As an alternative, you can get a similar context either from allocation profiling OR by looking at the thread timeline for what the thread was doing around the time the GC happened.
These abstractions added some complexity to the stack collector, and were only needed to allow us to put extra placeholder frames on top of a true collected stack. Since this was only being used by the GC placeholder feature, and we've changed that to not collect a stack anymore, we can clean up this mechanism and simplify the code.
This function is rather trivial, BUT we have a really nice big comment to explain it, so it's useful to extract it and keep the comment separate, instead of moving the big comment inside the already very-complex `sample_thread_internal`.
While still accurate, I don't think this comment is relevant anymore -- we no longer have the skid issue, and the rest is just historical context that I don't think is that interesting to keep around.
This event type will be used to report GC information for use in the profiling timeline.
This helper takes care of mapping `rb_gc_latest_gc_info` into metadata that gets added to GC samples. This helper will get tested via the thread context specs (changes upcoming).
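As a rough illustration only (this is not the actual helper from this PR), `rb_gc_latest_gc_info` can be queried from C with Symbol keys such as `:major_by` and `:gc_by`; the `add_label_fn` callback and the label names below are hypothetical.

```c
#include <ruby.h>

// Hypothetical callback type for attaching a key/value label to a GC sample.
typedef void (*add_label_fn)(const char *key, VALUE value);

// Sketch: turn the latest GC info exposed by Ruby into sample metadata.
static void add_gc_metadata(add_label_fn add_label) {
  // :major_by is non-nil when the last GC was a major GC, and indicates what caused it.
  VALUE major_by = rb_gc_latest_gc_info(ID2SYM(rb_intern("major_by")));
  if (!NIL_P(major_by)) add_label("major gc reason", major_by);

  // :gc_by indicates what triggered the GC cycle (e.g. :newobj).
  VALUE gc_by = rb_gc_latest_gc_info(ID2SYM(rb_intern("gc_by")));
  if (!NIL_P(gc_by)) add_label("gc cause", gc_by);
}
```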
Instead, we expose `record_placeholder_stack` directly.
No longer needed!
…cer-2.7 docker image ships. This allows us to keep using `-Werror` in CI without issue, which is pretty valuable.
Because GC samples can be coalesced, we can't assert on 5 GCs resulting in exactly 5 GC events.
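For context, here's a rough sketch of the coalescing decision described in this PR: accumulated GC time only gets flushed into a sample every ~10ms or at the end of a full GC cycle, whichever comes first. The names and exact structure below are assumptions for illustration, not the real implementation.

```c
#include <stdbool.h>

// Hypothetical names -- not the actual dd-trace-rb structs/functions.
#define GC_FLUSH_INTERVAL_NS (10L * 1000 * 1000) // ~10ms, per the PR description

typedef struct {
  long accumulated_cpu_time_ns;   // summed across coalesced on_gc_start/on_gc_finish pairs
  long accumulated_wall_time_ns;
  long wall_time_at_last_flush_ns;
} gc_tracking;

// Decides whether the accumulated GC data should be turned into a sample now.
static bool should_flush_gc_sample(gc_tracking *state, long current_wall_time_ns, bool full_gc_cycle_finished) {
  return full_gc_cycle_finished ||
    (current_wall_time_ns - state->wall_time_at_last_flush_ns) >= GC_FLUSH_INTERVAL_NS;
}
```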
… timeline duration
Force-pushed from 807a519 to bfdd1b7.
Update: This PR was originally opened on top of #3281 but there were a number of changes to that branch, so I chose to rebase on top of master and force-push. The rebase was clean other than a manual fix in
It seems that even with `GC.start()` we can observe different reasons, so I've relaxed the spec.
Because GC samples are pushed with `thread id: GC`, this was breaking `object_id_from` in tests that happened to have some GC samples. There is no `object_id` for the "GC thread" because there is no such thread -- it's just a way of representing GC information.
Thanks y'all for the help and reviews. Merging away!
What does this PR do?
This PR redesigns the garbage collection profiling feature to add support for timeline and reduce its overhead.
Here's how it looks in timeline mode:
Here's how it looks in flamegraph mode:
A TL;DR description of how it works now is below:
1. Wiring it up
* Ruby exposes native-level tracepoint events for when it starts/stops doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
* Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens for these events and calls the `on_gc_start` and `on_gc_finish` events on the `ThreadContext` collector (see the sketch just after this section).
* The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT` tracepoint events are emitted while Ruby is doing GC, which means we can't allocate anything. The code we use is thus built to not allocate anything.
* We can't do all the work we want in `on_gc_finish` because we can't allocate anything, so that function returns a flag that tells the `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC finishes.

2. Tracking
* When a thread starts doing GC (`on_gc_start`), we record its current cpu-time and wall-time.
* When a thread stops doing GC (`on_gc_finish`), we record how much cpu-time and wall-time elapsed.
* When `sample_after_gc` gets called, we turn the data recorded above + some metadata from Ruby into a sample event to represent GC as a fake Garbage Collection thread.

3. Overhead
* To avoid impacting applications a lot, the Ruby GC works incrementally. That is, rather than doing a full GC (marking, sweeping) in a single step, Ruby breaks this into smaller steps.
* We get told about every smaller step: we get a pair of `on_gc_start`/`on_gc_finish` calls for each step.
* In the past, we always called `sample_after_gc` after every `on_gc_finish` to record the GC. This was a big source of overhead: in a sample application that allocates as much as possible in a loop I was able to observe more than 50k calls to `on_gc_start`/`on_gc_finish` per second, which obviously would represent a lot of overhead if we created one GC sample for each of these.
* To avoid this overhead, we coalesce/accumulate the cpu/wall-time from multiple calls to `on_gc_start`/`on_gc_finish` and only create a sample to represent this data from time to time (default is every 10ms or every full GC cycle, whichever happens first).

Compared to the old implementation, here's what changed:
* Garbage collection events get attributed to a placeholder "Garbage Collection" thread.
* We no longer record "Garbage Collection" samples on the thread that triggered GC (and thus no longer include the stack of that thread when recording GC information). This enabled a nice cleanup of the code in the stack collector.
* GCs across several threads can be included in the same sample.
* Timeline support is included.
* Extra metadata about the GC can be seen in the timeline.
* Overhead is lowered, even during heavy periods of GC action.
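To make the "Wiring it up" steps above concrete, here's a minimal, hedged sketch of how such a tracepoint can be created with Ruby's C API. This is not the actual dd-trace-rb code: the `on_gc_start`/`on_gc_finish` names come from the description above, but their signatures and the surrounding structure are illustrative assumptions.

```c
#include <ruby.h>
#include <ruby/debug.h>

// Illustrative stand-ins for the ThreadContext collector callbacks mentioned above;
// the real dd-trace-rb functions have different signatures and live elsewhere.
static void on_gc_start(void)  { /* record current cpu-time and wall-time */ }
static void on_gc_finish(void) { /* accumulate elapsed cpu-time and wall-time */ }

// Called by the Ruby VM while it is doing GC, so it must not allocate any Ruby objects.
static void on_gc_tracepoint(VALUE tracepoint_data, void *unused) {
  (void) unused;
  rb_trace_arg_t *trace_arg = rb_tracearg_from_tracepoint(tracepoint_data);

  if (rb_tracearg_event_flag(trace_arg) == RUBY_INTERNAL_EVENT_GC_ENTER) {
    on_gc_start();
  } else { // RUBY_INTERNAL_EVENT_GC_EXIT
    on_gc_finish();
  }
}

// Creates (but does not enable) a tracepoint covering the GC enter/exit internal events.
static VALUE create_gc_tracepoint(void) {
  return rb_tracepoint_new(
    Qnil, // listen on all threads
    RUBY_INTERNAL_EVENT_GC_ENTER | RUBY_INTERNAL_EVENT_GC_EXIT,
    on_gc_tracepoint,
    NULL
  );
}
```

In this sketch the tracepoint would then be turned on with `rb_tracepoint_enable` when profiling starts and off with `rb_tracepoint_disable` when it stops.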
Motivation:
I worked on this as part of the Profiler R&D week. I've been wanting to go back and fix the problems with the gc profiling implementation for a long time :)
Additional Notes:
I'm opening this PR on top of @AlexJF's #3281 because otherwise both were going to conflict, given the changes to the stack recorder. The two PRs are otherwise completely independent.
I'll wait until #3281 is merged to master to merge this PR as well.
How to test the change?
Test coverage is included.
During development, this tiny benchmark was useful:
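```ruby
require 'benchmark/ips'
require 'ddtrace'

puts RUBY_DESCRIPTION

Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.allocation_counting_enabled = false
  c.telemetry.enabled = false
end

class Potato
  def initialize(val)
    @potato = val
  end
end

Benchmark.ips do |x|
  x.config(time: 10, warmup: 2)

  x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")

  x.compare!
end
```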
For Datadog employees:
… credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.