Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10128] Refactor and speed up profiler stack sampling #3837

Merged
merged 17 commits into from
Aug 13, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Aug 9, 2024

What does this PR do?

This PR refactors the way the profiler samples the Ruby stack, and then builds on that refactoring to speed up sampling.

It also opens the door to future speedups once libdatadog allows us to skip re-interning strings, frames and stacks that we know have been sampled before.

Specifically this PR does the following:

  1. It tweaks the data that ddtrace_rb_profile_frames returns from sampling the Ruby stack so that the stack collector can use cheaper APIs to get the same information as before.
  2. It changes the sampling buffer we put stack traces in to be per-thread. This allows us to reuse some of the work we do from sample to sample, in cases where frames on the stack are the same. Thus sampling the same stack over and over will be slightly cheaper.
  3. It adds a few more micro-optimizations that I spotted from profiling the profiler :)
  4. When we sample stacks, we expose a flag that can be used to determine if a frame was unchanged from the last sample. This is what opens the door for future optimizations using libdatadog.

Here's the results of running benchmarks/profiler_sample_loop_v2.rb with these changes on my local machine:

Warming up --------------------------------------
stack collector improved-sampling
                       783.000 i/100ms
Calculating -------------------------------------
stack collector improved-sampling
                          7.828k (± 3.3%) i/s -    234.900k in  30.042969s

Warming up --------------------------------------
stack collector master
                       746.000 i/100ms
Calculating -------------------------------------
stack collector master
                          7.057k (± 3.3%) i/s -    211.864k in  30.058848s

Comparison:
stack collector improved-sampling:     7827.8 i/s
stack collector master:     7056.8 i/s - 1.11x  slower

Motivation:

This PR further reduces the overhead of profiling, especially more complex/deep stacks. I started looking into this with the goal of improving allocation profiling, but stack sampling is shared with cpu/wall-time profiling as well, so those will also be able to take advantage of the improvement.

Additional Notes:

There's a lot of small parts moving around in this PR. I recommend reviewing commit-by-commit, as I tried to break every change into a small standalone step that can be reviewed in a much nicer way.

How to test the change?

Our existing tests cover these changes.

ivoanjo added 15 commits August 8, 2024 11:56
This will be more useful later once freeing a per_thread_context gets
more complex.
Recording a placeholder stack does not involve any sampling, so it
doesn't actually need a sampling_buffer.
This will be used by the thread context collector to validate max frames
during initialization, separately from creating new sampling buffers.
This doesn't yet change anything; this refactor will later allow us to
reuse data from the sampling_buffers from sample to sample.

This does slightly increase memory usage, but not by as much as may
be expected.

Ignoring the locations (these will get shared in a later commit), a
sampling buffer has max_frames * (VALUE, int, bool) which is less than
10KiB per thread.
The locations, unlike the rest of the sampling buffer, do not (yet?)
need to be per-thread, so let's save some memory.
Not really to save memory, but semantically it's closer to what we want.
Instead of storing the `last_ruby_frame` just to later need to
re-retrieve the filename, let's directly store the
`last_ruby_frame_filename`.
As explained in the added comment, for our use this is expected to be
equivalent (and our integration tests that compare our output with the
Ruby stack trace APIs still pass).

This gets us a bit less of work during sampling (but not a lot), but
more importantly it will make it easier to add caching in a later
commit.
Rather than having separate arrays for each kind of information we want
to get from a stack (VALUE *stack_buffer + int *lines_buffer + bool
*is_ruby_frame), introduce a new `frame_info` struct to contain this
information.

This makes it easy to then add or change members in this struct without
having to think about allocating yet more arrays, and managing their
lifetimes and passing them in as arguments. (Spoiler: We'll be using
this in a commit very soon to introduce caching)

As a side bonus of this change, I noticed that we could just expose the
method name for native methods as an `ID`, rather than handing over
the `rb_callable_method_entry_t` ("cme") as we did before.

This has a few advantages:
* It's effectively a way to inline the work done by the previous call
  to `ddtrace_rb_profile_frame_method_name`. Because we had to
  copy/paste some code from the Ruby sources to support
  `ddtrace_rb_profile_frame_method_name` on legacy Rubies, we no
   longer need that code and can delete it!
* An `ID` is interned/immutable, so it's much easier to later cache
  if we want to. (Although no plans yet...)
The intuition here is that in many apps, the same stack (or at least the
bottom of the stack) won't change very often -- e.g. during an HTTP
request we may take a bunch of samples, and all of them will have a
common stack that's webserver + rails routing + endpoint doing work.

While right now this mostly avoids us repeating calls to `calc_lineno`,
(calculating the line number for Ruby frames) one of the most
interesting changes in this PR is the new `same_frame` flag.

This `same_frame` flag allows the caller to know when the stack has
changed, and I'm hoping to later connect this to libdatadog
improvements, e.g. if we store the ids for each stack entry in
libdatadog, we can know if the stack changed or not, and thus
if we can just re-feed the same ids to libdatadog or need to record new
ones.
Looking at the implementation of `StringValuePtr`, it does a bunch of
checks that we don't really need because we already have the
`ENFORCE_TYPE` making sure we do have a string.

So we can just use the string without further ado.
I don't often use this setting so it had bitrotten a bit 😅

* `sampler_readjust` no longer has the `readjustment_reason`. We can
  recompute for debugging if needed, but for now I just removed the
  reference
* When building at `-O0` it looks like my version of gcc is refusing
  to inline `monotonic_wall_time_now_ns` which is a problem because
  there's no non-inline version of that function, and thus it breaks
  running the tests with

  > symbol lookup error: datadog_profiling_native_extension.3.1.4_x86_64-linux.so: undefined symbol: monotonic_wall_time_now_ns

  `-O1` is probably fine for development as well so let's go with it
  for now?
…ile_ stuff

Since we now always return an iseq from ddtrace_rb_profile_frames
(it's in the name of the struct element even!), we can directly use the
`rb_iseq_` functions.

Looking at the `rb_profile_` functions what they do is check if they
have an iseq or something else, and when they have an iseq, they call
these functions.

So this allows us to skip the extra tests because we know we have iseqs.

(FYI an "iseq" is an "instruction sequence" -- it's the object that
represents the YARV VM bytecode)
This was a really weird one. I don't like handwaving it away, but
debugging any more of this would need me to go really deep into Ruby's
GC to understand where exactly Ruby is deciding to keep this object
alive and I'm not sure such a time investment is worth it for this bug,
so I've put in a workaround for now.
@ivoanjo ivoanjo requested a review from a team as a code owner August 9, 2024 11:39
@github-actions github-actions bot added the profiling Involves Datadog profiling label Aug 9, 2024
@ivoanjo ivoanjo requested a review from AlexJF August 9, 2024 11:40
@pr-commenter
Copy link

pr-commenter bot commented Aug 9, 2024

Benchmarks

Benchmark execution time: 2024-08-13 11:08:39

Comparing candidate commit 607d5d5 in PR branch ivoanjo/prof-10128-stack-sampling-improvements with baseline commit 850462b in branch master.

Found 3 performance improvements and 0 performance regressions! Performance is the same for 20 metrics, 2 unstable metrics.

scenario:sample timeline=false

  • 🟩 throughput [+0.464op/s; +0.498op/s] or [+6.796%; +7.293%]

scenario:sample+serialize retain_every=10 heap_samples=false heap_size=false heap_sample_every=1 skip_end_gc=false

  • 🟩 throughput [+0.174op/s; +0.185op/s] or [+6.692%; +7.117%]

scenario:stack collector

  • 🟩 throughput [+207.015op/s; +208.782op/s] or [+7.565%; +7.630%]

buffer->is_ruby_frame = ruby_xcalloc(max_frames, sizeof(bool));
buffer->locations = ruby_xcalloc(max_frames, sizeof(ddog_prof_Location));
buffer->locations = locations;
buffer->stack_buffer = ruby_xcalloc(max_frames, sizeof(frame_info));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice consolidation here!

//
// cme = rb_vm_frame_method_entry(cfp);

// if (cme && cme->def->type == VM_METHOD_TYPE_ISEQ &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here explicitly explaining what this commented out code section means (so we know when to restore it if needed, or remove it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, left note in e6c9122

@@ -67,7 +67,7 @@ def skip_building_extension!(reason)

if ENV['DDTRACE_DEBUG'] == 'true'
$defs << '-DDD_DEBUG'
CONFIG['optflags'] = '-O0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is O0 not necessary to get the debug information we need?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used "-O1" with gdb and didn't see any issue/missing info, so I think it's fine, and it's easy to tweak locally -- I just wanted to avoid a default config that was broken (at least on my Ubuntu/gcc version?).

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall!

@codecov-commenter
Copy link

codecov-commenter commented Aug 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.84%. Comparing base (850462b) to head (607d5d5).
Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3837      +/-   ##
==========================================
- Coverage   97.85%   97.84%   -0.01%     
==========================================
  Files        1264     1264              
  Lines       75731    75732       +1     
  Branches     3719     3719              
==========================================
- Hits        74103    74099       -4     
- Misses       1628     1633       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

struct {
ID method_id;
} native_frame;
} as;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such fluent! Much API! Wow!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something something great artists copy.

@@ -479,6 +480,16 @@ int ddtrace_rb_profile_frames(VALUE thread, int start, int limit, frame_info *st
continue;
}
Copy link
Contributor

@AlexJF AlexJF Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the past, the same cfp would "eat" all the start whereas now it'll skip just one? Not sure if this was ever something we relied on but just highlighting this slight change in behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great catch!

I completely glossed over it because a) we don't use start and b) I remembered some discussion somewhere mentioning it was actually broken.

It's actually fixed nowadays https://bugs.ruby-lang.org/issues/14607 // ruby/ruby#8280 . I had spotted that upstream had moved the cfp = ... to the for header as well, but I hadn't spotted it was to fix start.

I did spot that upstream also added the start skip logic to the if (cme && cme->def->type == VM_METHOD_TYPE_CFUNC) branch below. I'll push a commit to add this one as well, so we keep closer to upstream in case we ever want to use start.

Comment on lines +1035 to +1038
# I've bisected this and undoing 3d4b7fcf30b529b191ca737ae13629eb27b8ab63 also makes the flakiness
# go away, but again, that change doesn't seem to have anything to do with GC.
# As the weird behavior is transitory, e.g. it provably goes away on the next GC, I'll go with this
# workaround for now.
Copy link
Contributor

@AlexJF AlexJF Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I seem to remember Ruby taking into account register values in its heuristic to detect potential usages of objects. May just be that with the extra caching of things like iseq and what not we make it slightly more likely to hit these edge cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is my suspicion -- Ruby's GC is conservative, in that if anything "looks" like a VALUE, it assumes it is. Actually debugging this would mean going into the GC marking code and understand why exactly it's making this decision which is somewhat of a time investment which is why I took the slightly-easier way out 😅.

We haven't used the `start` parameter so far, but let's keep closer
to upstream here.

(We had already half-applied the change from upstream for another
reason, so now we have the full fix).
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-10128-stack-sampling-improvements branch from eb5af8c to 607d5d5 Compare August 13, 2024 10:31
@ivoanjo ivoanjo merged commit 5296456 into master Aug 13, 2024
186 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10128-stack-sampling-improvements branch August 13, 2024 11:09
@github-actions github-actions bot added this to the 2.3.0 milestone Aug 13, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Aug 22, 2024
@ivoanjo ivoanjo added the performance Involves performance (e.g. CPU, memory, etc) label Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Involves performance (e.g. CPU, memory, etc) profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants