[PROF-9342] Introduce profiler workaround for Ruby Dir interruption bug #3720

ivoanjo · 2024-06-14T14:33:32Z

What does this PR do?

This PR introduces the "dir interruption" workaround to the profiler. This workaround targets an issue in the Ruby Dir class that was originally reported by a customer in #3450 .

In short, 13 of the Ruby Dir class APIs can be, in rare situations, interrupted by the profiler's sampling mechanism. This interruption makes Ruby incorrectly think that it has reached the end of a directory it is reading, so this could mean for the affected APIs:

Ruby would return that a directory is empty when it isn't
Ruby would return only some, but not all of the files in the directory

The "dir interruption" workaround prevents this issue from happening by monkey patching the affected APIs to wrap the critical regions with calls to CpuAndWallTimeWorker._native_hold_interruptions, CpuAndWallTimeWorker._native_resume_interruptions, preventing the profiler from interrupting those APIs at the wrong time.

Because all stable Ruby versions are known to be affected by this issue, the workaround is applied by default. This PR adds a new setting that can be used to turn it off:

DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false
Datadog.configure { |c| c.profiling.advanced.dir_interruption_workaround_enabled = false }

Motivation:

Avoid triggering bug out-of-the box.

Additional Notes:

~~I also plan to report the issue upstream, and hopefully work with the Ruby core team to fix it.~~

Update: reported upstream as https://bugs.ruby-lang.org/issues/20586 and https://bugs.ruby-lang.org/issues/20587 .

The "no signals" workaround documented in our FAQ also prevents this issue from happening, but at the cost of lower quality data gathering, and since this affects every Ruby version, we didn't want to have to toggle the "no signals" workaround enabled by default.

How to test the change?

This PR includes a lot of test coverage.

I've also tested the monkey patch with the Ruby standard library specs ( https://github.com/ruby/spec/tree/master/core/dir ) on my local machine. I'll follow up this PR with a PR on the https://github.com/datadog/prof-correctness/ to automatically add this to CI there, as otherwise it was a bit awkward to add to the dd-trace-rb repository.

For datadog folks, I've setup a simple GCS bucket than can be used to trigger this issue:

$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt  test.html  test.txt
# Trigger issue without the workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
# Trigger issue with workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=true bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
["hello.txt", "test.html", "test.txt"]
# The "no signals" workaround can be used instead of the "dir interruption" workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false DD_PROFILING_NO_SIGNALS_WORKAROUND_ENABLED=true bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
W, [2024-06-14T15:30:09.108975 #948820]  WARN -- datadog: [datadog] Profiling "no signals" workaround enabled via configuration. Profiling data will have lower quality.
["hello.txt", "test.html", "test.txt"]

These will be needed to implement the dir interruption workaround.

Not yet wired up.

I've separately tested these with ruby-spec, but haven't integrated it yet for automated testing. Will do it a follow-up commit.

For future reference, here's what I get on my local machine: ``` ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux] Warming up -------------------------------------- hold / resume 391.856k i/100ms Calculating ------------------------------------- hold / resume 3.829M (± 1.9%) i/s - 38.402M in 10.033530s ```

...that mentions #3450 .

Looks like the `sort:` was added later, and it wasn't particularly critical to what we were testing so I chose to omit it.

…d by default This broke Ruby 2.5, which enables it by default.

codecov-commenter · 2024-06-14T15:02:33Z

Codecov Report

Attention: Patch coverage is 58.54271% with 165 lines in your changes missing coverage. Please review.

Project coverage is 97.89%. Comparing base (4cd328d) to head (259e134).
Report is 7 commits behind head on master.

Files	Patch %	Lines
lib/datadog/profiling/ext/dir_monkey_patches.rb	21.96%	135 Missing ⚠️
...c/datadog/profiling/ext/dir_monkey_patches_spec.rb	82.14%	30 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3720      +/-   ##
==========================================
- Coverage   98.11%   97.89%   -0.22%     
==========================================
  Files        1235     1237       +2     
  Lines       73749    74146     +397     
  Branches     3560     3594      +34     
==========================================
+ Hits        72357    72588     +231     
- Misses       1392     1558     +166

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

lib/datadog/profiling/ext/dir_monkey_patches.rb

lloeki · 2024-06-18T08:51:09Z

lib/datadog/profiling/ext/dir_monkey_patches.rb

+                # <-- We're safe now while running customer code
+                yield entry_name
+                # <-- We'll go back to the Dir internals, critical region again
+                Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_hold_interruptions


Shouldn't it be in an ensure as well?

🤔 I don't think so...? It's ok to deliver signals during exception handling, so if an exception gets raised in the client code, it should be fine for the profiler to continue operating from them on, since no more directory calls will be made by Ruby.

Or do you mean making sure that _native_hold_interruptions never raises or something similar?

I really don't know, hence why I ask ;)

I'm basing myself on the following outline:

begin # [A+] enter our code: hold super do # [B-] exit our code back to not-our-code: resume yield # [B+] enter our code: hold end ensure # [A-] exit our code: resume end

On the super happy path the execution order is [A+] [B-] [B+] [A-].

But if yield raises it becomes [A+] [B-] [A-], which in my mind triggers some "consistency" alarms and makes me wonder about internal details:

what happens when _native_resume_signals is called twice in a row? is it a NOOP? is it doing work twice (e.g invoking a syscall) but idempotent?

there seems to be work to be protected after yield (because of that [B+]) and before method exit ([A-]) so I presume that work is in super, but if yield raises there may be work to protect too in a hypothetical ensure block inside that super too? Or is [B+] entirely superfluous even in the happy yield case?

Conversely, imagine an abstraction like so:

module SignalHolder def hold # hold yield ensure # resume end def ignore # resume yield ensure # hold end end

And the code becoming:

SignalHolder.hold do super { SignalHolder.ignore { yield } } end

The above is not even a suggestion it's merely to materialise and compare the kind of structural "consistency".

It could simply instead just need a comment to prevent the mind tripwire or help someone who would refactor this code in the future for some reason.

But really, I'd say it's fine to merge if you're confident and my concerns are unwarranted.

Got it, thanks for explaining!

So, in parts:

Yes, it's OK if the yield raises. Specifically:

We can call _native_resume_signals many times in a row.

As you suspected, it leads to pthread_sigmask which I believe on Linux will lead to the rt_sigprocmask system call being called.
I've included a benchmark in the PR of what the cost of this is, and here's a run I did:

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux] Warming up -------------------------------------- hold / resume 391.856k i/100ms Calculating ------------------------------------- hold / resume 3.829M (± 1.9%) i/s - 38.402M in 10.033530s

...so I think we're fine if we end up calling it a few more times than strictly needed.

The root of the issue in the Ruby VM happens because in the affected APIs Ruby silently interprets an interruption as "end of folder/empty folder". When an exception is being raised, the iteration will stop anyway, so there's no longer a concern of the interruption causing a semantic mismatch (at least that I can see).

I've pushed 259e134 to add this note to the code as in hindsight I agree it was not very clear before (and hopefully this clarifies it)

ext/datadog_profiling_native_extension/collectors_cpu_and_wall_time_worker.c

lib/datadog/profiling/ext/dir_monkey_patches.rb

lloeki · 2024-06-18T09:30:44Z

lib/datadog/profiling.rb

@@ -136,6 +136,7 @@ def self.allocation_count # rubocop:disable Lint/NestedMethodDefinition (On purp
      return false unless supported?

      require_relative 'profiling/ext/forking'
+      require_relative 'profiling/ext/dir_monkey_patches'


Should this be more functionally named? ("what" instead of "how")

WDYT of dir_signal_masking?

I went with this name because I know it will show up in the profiling flamegraph for customers.

My thinking is that it's clearer if you see "oh, why is there a datadog thing here? ah, it's monkey patching my uses of dir". Signal masking makes me think a bit more along hte lines of "what is this signal masking thing? what is masking anyway?".

Ha, interesting! I was thinking that if I saw that in a stack trace or something I would wonder "why is Datadog monkey patching dir stuff at all?". In a way "monkey patching" is how but not what/why. In a way it's also redundant with ext, the sort of conventional way to say "this folder contains things that extend stuff". dir_signal_fix maybe?

But at this stage that's becoming bikeshedding and certainly not a blocker. Feel free to move forward.

Adding links to upstream issues in the comment at the top of the file would help too I think (I reread it and with only the Datadog issue link it makes it like this is a Datadog issue when it's really an upstream issue)

I think the joke about naming things being hard applies here >_>. I suspect/fear just relying on ext/ may be too subtle (e.g. when you're looking a flamegraph), but am open to changing if folks get confused about the current name.

+1 I've added a link to the upstream ticket in 259e134.

lib/datadog/profiling/ext/dir_monkey_patches.rb

The interruptions part is just a consequence of signal delivery.

Loic suggested this may provide better compatibility, and it seems to make sense given we're monkey patching a quite core Ruby class. This diff is way easier to review without whitespace. What I did was copy/paste the module, remove the `ruby2_keywords`, add `**kwargs` to the Ruby 3 variants and clean up any things that don't make sense for Ruby 3 (e.g. testing for 2.5 for some methods).

…arks ...which I totally was almost forgetting.

…patch **What does this PR do?** This PR adds a new test scenario, the `ruby_dir_interruption_patch`. **Motivation:** This scenario tests DataDog/dd-trace-rb#3720 in a very unique way: by loading the profiler, and then running the upstream Ruby test suite for the `Dir` class with our monkey patches applied. (For context, our "monkey patches" are a couple of Ruby modules which wrap the behavior of the standard library `Dir` class, and thus we need to be quite careful to make sure not to break any existing functionality.) **Additional Notes:** This test does not actually generate or assert on any profiles, it only checks that the profiler is running and the monkey patches were correctly applied prior to running the Ruby test suite. The `mspec_config.rb` file is also quite important, since it ensures that both the profiler as well as the monkey patches were correctly applied (e.g. to avoid the test passing trivially because e.g. the profiler actually didn't start). Because at time of writing DataDog/dd-trace-rb#3720 is not yet merged, this test is expected to fail with: ``` $ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios === RUN TestScenarios correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile === RUN TestScenarios/scenarios/ruby_dir_interruption_patch correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile correctness_test.go:248: Built test app with: test-app correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497 correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]' correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - /app/mspec_config.rb:3:in `<top (required)>': uninitialized constant Datadog::Profiling::Ext::DirInstanceMonkeyPatches (NameError) if Dir.ancestors.first == Datadog::Profiling::Ext::DirInstanceMonkeyPatches && ^^^^^^^^^^^^^^^^^^^^^^^^^^ from /app/mspec-master/lib/mspec/utils/script.rb:91:in `load' from /app/mspec-master/lib/mspec/utils/script.rb:91:in `block (2 levels) in try_load' from /app/mspec-master/lib/mspec/utils/script.rb:86:in `each' from /app/mspec-master/lib/mspec/utils/script.rb:86:in `block in try_load' from /app/mspec-master/lib/mspec/utils/script.rb:85:in `each' from /app/mspec-master/lib/mspec/utils/script.rb:85:in `try_load' from /app/mspec-master/lib/mspec/utils/script.rb:102:in `load' from /app/mspec-master/lib/mspec/commands/mspec-run.rb:34:in `block in options' from /app/mspec-master/lib/mspec/utils/options.rb:111:in `process' from /app/mspec-master/lib/mspec/utils/options.rb:143:in `parse' from /app/mspec-master/lib/mspec/commands/mspec-run.rb:76:in `options' from /app/mspec-master/lib/mspec/utils/script.rb:286:in `main' from mspec-master/bin/mspec-run:7:in `<main>' I, [2024-06-19T08:58:37.927905 #7] INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T08:58:37Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true} --- FAIL: TestScenarios (19.35s) ``` ...e.g. the monkey patches are missing. For local testing, you can tweak the `gems.rb` to pick up the new branch: ```ruby gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/prof-9342-dir-interruption-workaround' ``` and you'll see the test passing. Furthermore, I've created a branch where I introduce a bug in `Dir.home`: ```ruby gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/prof-9342-dir-interruption-workaround-deleteme' ``` and here's the Ruby test suite flagging this behavior issue: ``` $ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios === RUN TestScenarios correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile === RUN TestScenarios/scenarios/ruby_dir_interruption_patch correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile correctness_test.go:248: Built test app with: test-app correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346 correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]' correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - I, [2024-06-19T09:04:55.521052 #7] INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T09:04:55Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true} Dir interruption patch is present! ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [x86_64-linux] ...................................................................................................................................................................................................................................................................FFEEFFEF............................................................... 1) Dir.home when called without arguments returns the current user's home directory, reading $HOME first FAILED Expected nil == "/rubyspec_home" to be truthy but was false /app/spec-master/core/dir/home_spec.rb:16:in `block (3 levels) in <top (required)>' /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>' 2) Dir.home when called without arguments returns a non-frozen string FAILED Expected nil.frozen? to be falsy but was true /app/spec-master/core/dir/home_spec.rb:20:in `block (3 levels) in <top (required)>' /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>' (...etc...) 32 files, 330 examples, 440 expectations, 5 failures, 3 errors, 0 tagged --- FAIL: TestScenarios (19.42s) ```

ivoanjo · 2024-06-19T10:37:09Z

I've updated the PR description with links to the upstream reports for this issue https://bugs.ruby-lang.org/issues/20586 / https://bugs.ruby-lang.org/issues/20587 .

**What does this PR do?** This PR disables the Dir interruption bug workaround that was added in #3720 for Ruby versions 3.4 and above. **Motivation:** The bug has been fixed upstream (see https://bugs.ruby-lang.org/issues/20586 ) and thus is no longer needed! **Additional Notes:** N/A **How to test the change?** This change includes test coverage.

ivoanjo added 14 commits June 13, 2024 10:41

Expose methods for holding/resuming interruptions

b03696e

These will be needed to implement the dir interruption workaround.

Bootstrap setting for controlling dir interruption workaround

aeb2db6

Not yet wired up.

Add monkey patches for affected dir class APIs

88f1164

I've separately tested these with ruby-spec, but haven't integrated it yet for automated testing. Will do it a follow-up commit.

Add inline suggestion to function

3c51234

Add helper to load monkey patches (similar to our Kernel monkey patches)

42c40b2

Add type signatures for dir monkey patches

25a96a1

Load new file when loading profiler

57e64bf

Wire up dir interruption workaround to setting

0e80c77

Add test coverage for DirMonkeyPatches

577fdb0

Add comments to make Rubocop happy

43a3e5e

Add nice description to dir interruption workaround setting

9152470

...that mentions #3450 .

Tweak spec for compatibility with Ruby 2.7

b30ca79

Looks like the `sort:` was added later, and it wasn't particularly critical to what we were testing so I chose to omit it.

Load DirMonkeyPatches before referencing it to avoid failing on JRuby

d6eebe9

ivoanjo requested review from a team as code owners June 14, 2024 14:33

github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jun 14, 2024

ivoanjo added 2 commits June 14, 2024 15:37

Tweak spec to remove assumption that no signals workaround is disable…

30601a2

…d by default This broke Ruby 2.5, which enables it by default.

Make rubocop happy

9ad82fe

p-datadog approved these changes Jun 17, 2024

View reviewed changes