-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9342] Introduce profiler workaround for Ruby Dir interruption bug #3720
[PROF-9342] Introduce profiler workaround for Ruby Dir interruption bug #3720
Conversation
These will be needed to implement the dir interruption workaround.
Not yet wired up.
I've separately tested these with ruby-spec, but haven't integrated it yet for automated testing. Will do it a follow-up commit.
For future reference, here's what I get on my local machine: ``` ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux] Warming up -------------------------------------- hold / resume 391.856k i/100ms Calculating ------------------------------------- hold / resume 3.829M (± 1.9%) i/s - 38.402M in 10.033530s ```
...that mentions #3450 .
Looks like the `sort:` was added later, and it wasn't particularly critical to what we were testing so I chose to omit it.
…d by default This broke Ruby 2.5, which enables it by default.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3720 +/- ##
==========================================
- Coverage 98.11% 97.89% -0.22%
==========================================
Files 1235 1237 +2
Lines 73749 74146 +397
Branches 3560 3594 +34
==========================================
+ Hits 72357 72588 +231
- Misses 1392 1558 +166 ☔ View full report in Codecov by Sentry. |
# <-- We're safe now while running customer code | ||
yield entry_name | ||
# <-- We'll go back to the Dir internals, critical region again | ||
Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_hold_interruptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be in an ensure
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I don't think so...? It's ok to deliver signals during exception handling, so if an exception gets raised in the client code, it should be fine for the profiler to continue operating from them on, since no more directory calls will be made by Ruby.
Or do you mean making sure that _native_hold_interruptions
never raises or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't know, hence why I ask ;)
I'm basing myself on the following outline:
begin
# [A+] enter our code: hold
super do
# [B-] exit our code back to not-our-code: resume
yield
# [B+] enter our code: hold
end
ensure
# [A-] exit our code: resume
end
On the super happy path the execution order is [A+] [B-] [B+] [A-]
.
But if yield
raises it becomes [A+] [B-] [A-]
, which in my mind triggers some "consistency" alarms and makes me wonder about internal details:
- what happens when
_native_resume_signals
is called twice in a row? is it a NOOP? is it doing work twice (e.g invoking a syscall) but idempotent? - there seems to be work to be protected after
yield
(because of that[B+]
) and before method exit ([A-]
) so I presume that work is insuper
, but ifyield
raises there may be work to protect too in a hypotheticalensure
block inside thatsuper
too? Or is[B+]
entirely superfluous even in the happyyield
case?
Conversely, imagine an abstraction like so:
module SignalHolder
def hold
# hold
yield
ensure
# resume
end
def ignore
# resume
yield
ensure
# hold
end
end
And the code becoming:
SignalHolder.hold do
super { SignalHolder.ignore { yield } }
end
The above is not even a suggestion it's merely to materialise and compare the kind of structural "consistency".
It could simply instead just need a comment to prevent the mind tripwire or help someone who would refactor this code in the future for some reason.
But really, I'd say it's fine to merge if you're confident and my concerns are unwarranted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for explaining!
So, in parts:
- Yes, it's OK if the
yield
raises. Specifically:-
We can call
_native_resume_signals
many times in a row.As you suspected, it leads to
pthread_sigmask
which I believe on Linux will lead to thert_sigprocmask
system call being called.
I've included a benchmark in the PR of what the cost of this is, and here's a run I did:ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux] Warming up -------------------------------------- hold / resume 391.856k i/100ms Calculating ------------------------------------- hold / resume 3.829M (± 1.9%) i/s - 38.402M in 10.033530s
...so I think we're fine if we end up calling it a few more times than strictly needed.
-
The root of the issue in the Ruby VM happens because in the affected APIs Ruby silently interprets an interruption as "end of folder/empty folder". When an exception is being raised, the iteration will stop anyway, so there's no longer a concern of the interruption causing a semantic mismatch (at least that I can see).
-
I've pushed 259e134 to add this note to the code as in hindsight I agree it was not very clear before (and hopefully this clarifies it)
ext/datadog_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
@@ -136,6 +136,7 @@ def self.allocation_count # rubocop:disable Lint/NestedMethodDefinition (On purp | |||
return false unless supported? | |||
|
|||
require_relative 'profiling/ext/forking' | |||
require_relative 'profiling/ext/dir_monkey_patches' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be more functionally named? ("what" instead of "how")
WDYT of dir_signal_masking
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with this name because I know it will show up in the profiling flamegraph for customers.
My thinking is that it's clearer if you see "oh, why is there a datadog thing here? ah, it's monkey patching my uses of dir". Signal masking makes me think a bit more along hte lines of "what is this signal masking thing? what is masking anyway?".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, interesting! I was thinking that if I saw that in a stack trace or something I would wonder "why is Datadog monkey patching dir stuff at all?". In a way "monkey patching" is how but not what/why. In a way it's also redundant with ext
, the sort of conventional way to say "this folder contains things that extend stuff". dir_signal_fix
maybe?
But at this stage that's becoming bikeshedding and certainly not a blocker. Feel free to move forward.
Adding links to upstream issues in the comment at the top of the file would help too I think (I reread it and with only the Datadog issue link it makes it like this is a Datadog issue when it's really an upstream issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the joke about naming things being hard applies here >_>. I suspect/fear just relying on ext/
may be too subtle (e.g. when you're looking a flamegraph), but am open to changing if folks get confused about the current name.
+1 I've added a link to the upstream ticket in 259e134.
The interruptions part is just a consequence of signal delivery.
Loic suggested this may provide better compatibility, and it seems to make sense given we're monkey patching a quite core Ruby class. This diff is way easier to review without whitespace. What I did was copy/paste the module, remove the `ruby2_keywords`, add `**kwargs` to the Ruby 3 variants and clean up any things that don't make sense for Ruby 3 (e.g. testing for 2.5 for some methods).
…arks ...which I totally was almost forgetting.
…patch **What does this PR do?** This PR adds a new test scenario, the `ruby_dir_interruption_patch`. **Motivation:** This scenario tests DataDog/dd-trace-rb#3720 in a very unique way: by loading the profiler, and then running the upstream Ruby test suite for the `Dir` class with our monkey patches applied. (For context, our "monkey patches" are a couple of Ruby modules which wrap the behavior of the standard library `Dir` class, and thus we need to be quite careful to make sure not to break any existing functionality.) **Additional Notes:** This test does not actually generate or assert on any profiles, it only checks that the profiler is running and the monkey patches were correctly applied prior to running the Ruby test suite. The `mspec_config.rb` file is also quite important, since it ensures that both the profiler as well as the monkey patches were correctly applied (e.g. to avoid the test passing trivially because e.g. the profiler actually didn't start). Because at time of writing DataDog/dd-trace-rb#3720 is not yet merged, this test is expected to fail with: ``` $ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios === RUN TestScenarios correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile === RUN TestScenarios/scenarios/ruby_dir_interruption_patch correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile correctness_test.go:248: Built test app with: test-app correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497 correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]' correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - /app/mspec_config.rb:3:in `<top (required)>': uninitialized constant Datadog::Profiling::Ext::DirInstanceMonkeyPatches (NameError) if Dir.ancestors.first == Datadog::Profiling::Ext::DirInstanceMonkeyPatches && ^^^^^^^^^^^^^^^^^^^^^^^^^^ from /app/mspec-master/lib/mspec/utils/script.rb:91:in `load' from /app/mspec-master/lib/mspec/utils/script.rb:91:in `block (2 levels) in try_load' from /app/mspec-master/lib/mspec/utils/script.rb:86:in `each' from /app/mspec-master/lib/mspec/utils/script.rb:86:in `block in try_load' from /app/mspec-master/lib/mspec/utils/script.rb:85:in `each' from /app/mspec-master/lib/mspec/utils/script.rb:85:in `try_load' from /app/mspec-master/lib/mspec/utils/script.rb:102:in `load' from /app/mspec-master/lib/mspec/commands/mspec-run.rb:34:in `block in options' from /app/mspec-master/lib/mspec/utils/options.rb:111:in `process' from /app/mspec-master/lib/mspec/utils/options.rb:143:in `parse' from /app/mspec-master/lib/mspec/commands/mspec-run.rb:76:in `options' from /app/mspec-master/lib/mspec/utils/script.rb:286:in `main' from mspec-master/bin/mspec-run:7:in `<main>' I, [2024-06-19T08:58:37.927905 #7] INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T08:58:37Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true} --- FAIL: TestScenarios (19.35s) ``` ...e.g. the monkey patches are missing. For local testing, you can tweak the `gems.rb` to pick up the new branch: ```ruby gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/prof-9342-dir-interruption-workaround' ``` and you'll see the test passing. Furthermore, I've created a branch where I introduce a bug in `Dir.home`: ```ruby gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/prof-9342-dir-interruption-workaround-deleteme' ``` and here's the Ruby test suite flagging this behavior issue: ``` $ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios === RUN TestScenarios correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile === RUN TestScenarios/scenarios/ruby_dir_interruption_patch correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile correctness_test.go:248: Built test app with: test-app correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346 correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]' correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - I, [2024-06-19T09:04:55.521052 #7] INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T09:04:55Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true} Dir interruption patch is present! ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [x86_64-linux] ...................................................................................................................................................................................................................................................................FFEEFFEF............................................................... 1) Dir.home when called without arguments returns the current user's home directory, reading $HOME first FAILED Expected nil == "/rubyspec_home" to be truthy but was false /app/spec-master/core/dir/home_spec.rb:16:in `block (3 levels) in <top (required)>' /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>' 2) Dir.home when called without arguments returns a non-frozen string FAILED Expected nil.frozen? to be falsy but was true /app/spec-master/core/dir/home_spec.rb:20:in `block (3 levels) in <top (required)>' /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>' (...etc...) 32 files, 330 examples, 440 expectations, 5 failures, 3 errors, 0 tagged --- FAIL: TestScenarios (19.42s) ```
I've updated the PR description with links to the upstream reports for this issue https://bugs.ruby-lang.org/issues/20586 / https://bugs.ruby-lang.org/issues/20587 . |
**What does this PR do?** This PR disables the Dir interruption bug workaround that was added in #3720 for Ruby versions 3.4 and above. **Motivation:** The bug has been fixed upstream (see https://bugs.ruby-lang.org/issues/20586 ) and thus is no longer needed! **Additional Notes:** N/A **How to test the change?** This change includes test coverage.
**What does this PR do?** This PR disables the Dir interruption bug workaround that was added in #3720 for Ruby versions 3.4 and above. **Motivation:** The bug has been fixed upstream (see https://bugs.ruby-lang.org/issues/20586 ) and thus is no longer needed! **Additional Notes:** N/A **How to test the change?** This change includes test coverage.
What does this PR do?
This PR introduces the "dir interruption" workaround to the profiler. This workaround targets an issue in the Ruby
Dir
class that was originally reported by a customer in #3450 .In short, 13 of the Ruby
Dir
class APIs can be, in rare situations, interrupted by the profiler's sampling mechanism. This interruption makes Ruby incorrectly think that it has reached the end of a directory it is reading, so this could mean for the affected APIs:The "dir interruption" workaround prevents this issue from happening by monkey patching the affected APIs to wrap the critical regions with calls to
CpuAndWallTimeWorker._native_hold_interruptions
,CpuAndWallTimeWorker._native_resume_interruptions
, preventing the profiler from interrupting those APIs at the wrong time.Because all stable Ruby versions are known to be affected by this issue, the workaround is applied by default. This PR adds a new setting that can be used to turn it off:
DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false
Datadog.configure { |c| c.profiling.advanced.dir_interruption_workaround_enabled = false }
Motivation:
Avoid triggering bug out-of-the box.
Additional Notes:
I also plan to report the issue upstream, and hopefully work with the Ruby core team to fix it.Update: reported upstream as https://bugs.ruby-lang.org/issues/20586 and https://bugs.ruby-lang.org/issues/20587 .
The "no signals" workaround documented in our FAQ also prevents this issue from happening, but at the cost of lower quality data gathering, and since this affects every Ruby version, we didn't want to have to toggle the "no signals" workaround enabled by default.
How to test the change?
This PR includes a lot of test coverage.
I've also tested the monkey patch with the Ruby standard library specs ( https://github.com/ruby/spec/tree/master/core/dir ) on my local machine. I'll follow up this PR with a PR on the https://github.com/datadog/prof-correctness/ to automatically add this to CI there, as otherwise it was a bit awkward to add to the dd-trace-rb repository.
For datadog folks, I've setup a simple GCS bucket than can be used to trigger this issue: