-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-10128] Refactor and speed up profiler stack sampling #3837
[PROF-10128] Refactor and speed up profiler stack sampling #3837
Conversation
They were out-of-sync >_>
This will be more useful later once freeing a per_thread_context gets more complex.
Recording a placeholder stack does not involve any sampling, so it doesn't actually need a sampling_buffer.
This will be used by the thread context collector to validate max frames during initialization, separately from creating new sampling buffers.
This doesn't yet change anything; this refactor will later allow us to reuse data from the sampling_buffers from sample to sample. This does slightly increase memory usage, but not by as much as may be expected. Ignoring the locations (these will get shared in a later commit), a sampling buffer has max_frames * (VALUE, int, bool) which is less than 10KiB per thread.
The locations, unlike the rest of the sampling buffer, do not (yet?) need to be per-thread, so let's save some memory.
Not really to save memory, but semantically it's closer to what we want.
Instead of storing the `last_ruby_frame` just to later need to re-retrieve the filename, let's directly store the `last_ruby_frame_filename`.
As explained in the added comment, for our use this is expected to be equivalent (and our integration tests that compare our output with the Ruby stack trace APIs still pass). This gets us a bit less of work during sampling (but not a lot), but more importantly it will make it easier to add caching in a later commit.
Rather than having separate arrays for each kind of information we want to get from a stack (VALUE *stack_buffer + int *lines_buffer + bool *is_ruby_frame), introduce a new `frame_info` struct to contain this information. This makes it easy to then add or change members in this struct without having to think about allocating yet more arrays, and managing their lifetimes and passing them in as arguments. (Spoiler: We'll be using this in a commit very soon to introduce caching) As a side bonus of this change, I noticed that we could just expose the method name for native methods as an `ID`, rather than handing over the `rb_callable_method_entry_t` ("cme") as we did before. This has a few advantages: * It's effectively a way to inline the work done by the previous call to `ddtrace_rb_profile_frame_method_name`. Because we had to copy/paste some code from the Ruby sources to support `ddtrace_rb_profile_frame_method_name` on legacy Rubies, we no longer need that code and can delete it! * An `ID` is interned/immutable, so it's much easier to later cache if we want to. (Although no plans yet...)
The intuition here is that in many apps, the same stack (or at least the bottom of the stack) won't change very often -- e.g. during an HTTP request we may take a bunch of samples, and all of them will have a common stack that's webserver + rails routing + endpoint doing work. While right now this mostly avoids us repeating calls to `calc_lineno`, (calculating the line number for Ruby frames) one of the most interesting changes in this PR is the new `same_frame` flag. This `same_frame` flag allows the caller to know when the stack has changed, and I'm hoping to later connect this to libdatadog improvements, e.g. if we store the ids for each stack entry in libdatadog, we can know if the stack changed or not, and thus if we can just re-feed the same ids to libdatadog or need to record new ones.
Looking at the implementation of `StringValuePtr`, it does a bunch of checks that we don't really need because we already have the `ENFORCE_TYPE` making sure we do have a string. So we can just use the string without further ado.
I don't often use this setting so it had bitrotten a bit 😅 * `sampler_readjust` no longer has the `readjustment_reason`. We can recompute for debugging if needed, but for now I just removed the reference * When building at `-O0` it looks like my version of gcc is refusing to inline `monotonic_wall_time_now_ns` which is a problem because there's no non-inline version of that function, and thus it breaks running the tests with > symbol lookup error: datadog_profiling_native_extension.3.1.4_x86_64-linux.so: undefined symbol: monotonic_wall_time_now_ns `-O1` is probably fine for development as well so let's go with it for now?
…ile_ stuff Since we now always return an iseq from ddtrace_rb_profile_frames (it's in the name of the struct element even!), we can directly use the `rb_iseq_` functions. Looking at the `rb_profile_` functions what they do is check if they have an iseq or something else, and when they have an iseq, they call these functions. So this allows us to skip the extra tests because we know we have iseqs. (FYI an "iseq" is an "instruction sequence" -- it's the object that represents the YARV VM bytecode)
This was a really weird one. I don't like handwaving it away, but debugging any more of this would need me to go really deep into Ruby's GC to understand where exactly Ruby is deciding to keep this object alive and I'm not sure such a time investment is worth it for this bug, so I've put in a workaround for now.
BenchmarksBenchmark execution time: 2024-08-13 11:08:39 Comparing candidate commit 607d5d5 in PR branch Found 3 performance improvements and 0 performance regressions! Performance is the same for 20 metrics, 2 unstable metrics. scenario:sample timeline=false
scenario:sample+serialize retain_every=10 heap_samples=false heap_size=false heap_sample_every=1 skip_end_gc=false
scenario:stack collector
|
buffer->is_ruby_frame = ruby_xcalloc(max_frames, sizeof(bool)); | ||
buffer->locations = ruby_xcalloc(max_frames, sizeof(ddog_prof_Location)); | ||
buffer->locations = locations; | ||
buffer->stack_buffer = ruby_xcalloc(max_frames, sizeof(frame_info)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice consolidation here!
// | ||
// cme = rb_vm_frame_method_entry(cfp); | ||
|
||
// if (cme && cme->def->type == VM_METHOD_TYPE_ISEQ && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here explicitly explaining what this commented out code section means (so we know when to restore it if needed, or remove it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, left note in e6c9122
@@ -67,7 +67,7 @@ def skip_building_extension!(reason) | |||
|
|||
if ENV['DDTRACE_DEBUG'] == 'true' | |||
$defs << '-DDD_DEBUG' | |||
CONFIG['optflags'] = '-O0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is O0
not necessary to get the debug information we need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've used "-O1" with gdb and didn't see any issue/missing info, so I think it's fine, and it's easy to tweak locally -- I just wanted to avoid a default config that was broken (at least on my Ubuntu/gcc version?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall!
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3837 +/- ##
==========================================
- Coverage 97.85% 97.84% -0.01%
==========================================
Files 1264 1264
Lines 75731 75732 +1
Branches 3719 3719
==========================================
- Hits 74103 74099 -4
- Misses 1628 1633 +5 ☔ View full report in Codecov by Sentry. |
struct { | ||
ID method_id; | ||
} native_frame; | ||
} as; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such fluent! Much API! Wow!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something something great artists copy.
@@ -479,6 +480,16 @@ int ddtrace_rb_profile_frames(VALUE thread, int start, int limit, frame_info *st | |||
continue; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the past, the same cfp would "eat" all the start
whereas now it'll skip just one? Not sure if this was ever something we relied on but just highlighting this slight change in behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really great catch!
I completely glossed over it because a) we don't use start
and b) I remembered some discussion somewhere mentioning it was actually broken.
It's actually fixed nowadays https://bugs.ruby-lang.org/issues/14607 // ruby/ruby#8280 . I had spotted that upstream had moved the cfp = ...
to the for header as well, but I hadn't spotted it was to fix start
.
I did spot that upstream also added the start skip logic to the if (cme && cme->def->type == VM_METHOD_TYPE_CFUNC)
branch below. I'll push a commit to add this one as well, so we keep closer to upstream in case we ever want to use start
.
# I've bisected this and undoing 3d4b7fcf30b529b191ca737ae13629eb27b8ab63 also makes the flakiness | ||
# go away, but again, that change doesn't seem to have anything to do with GC. | ||
# As the weird behavior is transitory, e.g. it provably goes away on the next GC, I'll go with this | ||
# workaround for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I seem to remember Ruby taking into account register values in its heuristic to detect potential usages of objects. May just be that with the extra caching of things like iseq and what not we make it slightly more likely to hit these edge cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is my suspicion -- Ruby's GC is conservative, in that if anything "looks" like a VALUE
, it assumes it is. Actually debugging this would mean going into the GC marking code and understand why exactly it's making this decision which is somewhat of a time investment which is why I took the slightly-easier way out 😅.
We haven't used the `start` parameter so far, but let's keep closer to upstream here. (We had already half-applied the change from upstream for another reason, so now we have the full fix).
eb5af8c
to
607d5d5
Compare
What does this PR do?
This PR refactors the way the profiler samples the Ruby stack, and then builds on that refactoring to speed up sampling.
It also opens the door to future speedups once libdatadog allows us to skip re-interning strings, frames and stacks that we know have been sampled before.
Specifically this PR does the following:
ddtrace_rb_profile_frames
returns from sampling the Ruby stack so that the stack collector can use cheaper APIs to get the same information as before.Here's the results of running benchmarks/profiler_sample_loop_v2.rb with these changes on my local machine:
Motivation:
This PR further reduces the overhead of profiling, especially more complex/deep stacks. I started looking into this with the goal of improving allocation profiling, but stack sampling is shared with cpu/wall-time profiling as well, so those will also be able to take advantage of the improvement.
Additional Notes:
There's a lot of small parts moving around in this PR. I recommend reviewing commit-by-commit, as I tried to break every change into a small standalone step that can be reviewed in a much nicer way.
How to test the change?
Our existing tests cover these changes.