-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-8917] Add support for the libdatadog crash tracker #3384
Conversation
…bdatadog helpers These are going to be needed also for the crash tracker code. While doing this extraction, I've gone ahead and made the failure logging on `convert_tags` use the ddtrace logger directly, rather than having to rely on an extra method to do this conversion. Since these methods are covered by the tests in `http_transport_spec.rb`, I've chosen not to expose them in other ways for testing.
This is also going to be needed by the crash tracking feature.
**What does this PR do?** This PR adds support for the libdatadog crash tracker feature (off by default). The crash tracker works by detecting when the Ruby VM reaches a segmentation fault, reporting the crash information as a last profile before the VM dies. All of the interesting work is in <DataDog/libdatadog#282>, this PR basically just wires things up. **Motivation:** This will be a useful tool when debugging VM crashes. **Additional Notes:** I'm opening this PR as a draft as the libdatadog support has not yet landed/been released. Also, there's a few open questions on: * fork handling * when to shut down **How to test the change?** (You'll need to build <<DataDog/libdatadog#282> until the crash tracker gets included in a libdatadog release) To test the crash tracker with an actual crash, try running the following on Ruby 2.6: ```bash $ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e 'Process.detach(fork { exit! }).instance_variable_get(:@foo)' ``` This should also work in the future but right now it doesn't work correctly; still looking into why: ```bash $ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e 'Process.kill("SEGV", Process.pid)' ```
`EventGroup` has long since been removed from the codebase
5efb033
to
18758de
Compare
When the profiler starts up, a number of other log messages are also printed, so let's get rid of a few. (Also, when it's disabled, it shows up on the environment logger, so that's already covered too.)
**What does this PR do?** This PR upgrades dd-trace-rb to use libdatadog 7. There was a small breaking API change -- the rename of `ddog_Endpoint` APIs to `ddog_prof_Endpoint`. **Motivation:** Make sure Ruby is up-to-date on libdatadog. **Additional Notes:** As far as Ruby is impacted, the main changes in libdatadog 7 are a number of fixes to the crash tracker (getting us closer to merging #3384) as well as some potential memory improvements from (DataDog/libdatadog#303). I'm opening this as a draft PR as libdatadog 7 is not yet available on rubygems.org, and I'll come back to re-trigger CI and mark this as non-draft once it is. **How to test the change?** Our existing test coverage includes libdatadog testing, so a green CI is good here :)
**What does this PR do?** This PR upgrades dd-trace-rb to use libdatadog 7. There was a small breaking API change -- the rename of `ddog_Endpoint` APIs to `ddog_prof_Endpoint`. **Motivation:** Make sure Ruby is up-to-date on libdatadog. **Additional Notes:** As far as Ruby is impacted, the main changes in libdatadog 7 are a number of fixes to the crash tracker (getting us closer to merging #3384) as well as some potential memory improvements from (DataDog/libdatadog#303). I'm opening this as a draft PR as libdatadog 7 is not yet available on rubygems.org, and I'll come back to re-trigger CI and mark this as non-draft once it is. **How to test the change?** Our existing test coverage includes libdatadog testing, so a green CI is good here :)
The in-process option has known though issues to be solved, so let's go with the much safer option of having the receiver do the hard work.
This reverts commit 32f02b4.
There's still two FIXMEs in here, which I'll fix in follow-up commits.
This will be integrated into the libdatadog v9 release in DataDog/libdatadog#423 .
Update: I've changed this PR to be on top of #3627 ( libdatadog v9 upgrade) + updated it for libdatadog v9 compatibility. I think once libdatadog 9 support lands in master, this is finally 100% good to go :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A few comments/questions
.additional_files = {}, | ||
// The Ruby VM already uses an alt stack to detect stack overflows so the crash handler must not overwrite it. | ||
// | ||
// @ivoanjo: Specifically, with `create_alt_stack = true` I saw a segfault, such as Ruby 2.6's bug with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, this is why this is an option here :)
|
||
ddog_prof_CrashtrackerReceiverConfig receiver_config = { | ||
.args = {}, | ||
.env = {.ptr = &ld_library_path_env, .len = 1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might override the env rather than appending. Which would you expect to happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- I don't particularly have a preference.
On one side, inheriting the full env would be useful if the crash tracker wanted to log some extra info from there. On the other hand, we can always add anything else we want extra later.
Dealer's choice? Do you think it'd be useful for me to pass along the env that Ruby has + with the addition of the ld_library_path?
@@ -60,3 +62,87 @@ size_t read_ddogerr_string_and_drop(ddog_Error *error, char *string, size_t capa | |||
ddog_Error_drop(error); | |||
return error_msg_size; | |||
} | |||
|
|||
__attribute__((warn_unused_result)) | |||
ddog_prof_Endpoint endpoint_from(VALUE exporter_configuration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you want a way to send to file endpoints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine some uses for it in some weird debugging cases... 🤔
From my side, I'm fine with not having that ability, so I don't particularly think it's worth the effort.
# otherwise `false` | ||
option :experimental_crash_tracking_enabled do |o| | ||
o.type :bool | ||
o.env 'DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we document this publicly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good idea. Did you have something in mind? I'm thinking I could open a PR to add to https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/ .
|
||
unless transport.respond_to?(:exporter_configuration) | ||
Datadog.logger.warn( | ||
'Cannot enable profiling crash tracking as a custom settings.profiling.exporter.transport is configured' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, what is this and why does it prevent crashtracking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, really good question -- I've added a big comment to the code to explain this in 09976cc :)
describe '#experimental_crash_tracking_enabled=' do | ||
it 'updates the #experimental_crash_tracking_enabled setting' do | ||
expect { settings.profiling.advanced.experimental_crash_tracking_enabled = true } | ||
.to change { settings.profiling.advanced.experimental_crash_tracking_enabled } | ||
.from(false) | ||
.to(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ruby is kinda a fun language
expect_in_fork do | ||
crashtracker.reset_after_fork | ||
|
||
expect(`pgrep -f libdatadog-crashtracking-receiver`.lines.size).to be 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have other processes running that could be captured here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, although we never run our tests in parallel inside the same container in CI, so only if you're running the test suite locally would this potentially happen.
I've added an extra check at the beginning of these tests to check that no existing crash tracker was already running: 5431d2b .
expect_in_fork(fork_expectations: fork_expectations) do | ||
crashtracker.start | ||
|
||
Process.kill('SEGV', Process.pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it also make sense to test an actual crash (e.g. null ptr deref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That I know of, there's no Ruby-level way to cause a segfault. I even thought to check the built-in ffi fiddle gem, which actually is neat enough to handle its own potential segfaults cleanly and turn them into Ruby exceptions:
spec/datadog/profiling/crashtracker_spec.rb:188:in `[]': NULL pointer dereference (Fiddle::DLError)
So to actually do this, I would need to add a cause_crash
method on the native side of the Profiler, which.... seems like a sharp edge that is not great to leave lying around >_> ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You didn't try hard enough? :D
$ ruby -r fiddle -e 'Fiddle.free(42)'
-e:1: [BUG] Segmentation fault at 0x0000000000000022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting, I didn't think of that :D . Thanks @lloeki for the suggestion, I'll make a note to shoot a small PR with your suggestion :)
|
||
crash_report = JSON.parse(request.body, symbolize_names: true)[:payload].first | ||
|
||
expect(crash_report[:stack_trace]).to_not be_empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also check that the correct signal type was reported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a2ae730
context 'when a crash tracker instance is provided' do | ||
let(:optional_crashtracker) { instance_double(Datadog::Profiling::Crashtracker) } | ||
|
||
it 'signals the crash tracker to start before other components' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
This makes the test fail early if there's an unexpected crash tracker instance running, rather than failing later in a more confusing way.
3c27b38
to
3a6b242
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3384 +/- ##
==========================================
- Coverage 98.13% 98.11% -0.02%
==========================================
Files 1223 1225 +2
Lines 72139 72379 +240
Branches 3421 3433 +12
==========================================
+ Hits 70795 71018 +223
- Misses 1344 1361 +17 ☔ View full report in Codecov by Sentry. |
CI is now green on this! Took a while but we made it :D |
**What does this PR do?** This PR adds an additional test case to the profiling crashtracker as discussed in #3384 (comment) (thanks @lloeki for the suggestion). **Motivation:** Improve test coverage for the feature. **Additional Notes:** The diff looks more noiser than it is because of whitespace changes, I recommend reviewing this PR without them. **How to test the change?** Check that the new test passes!
What does this PR do?
This PR adds support for the libdatadog crash tracker feature (off by default).
The crash tracker works by detecting when the Ruby VM reaches a segmentation fault, reporting the crash information as a last profile before the VM dies.
All of the interesting work is in DataDog/libdatadog#282, this PR basically just wires things up.
Motivation:
This will be a useful tool when debugging VM crashes.
Additional Notes:
I'm opening this PR as a draft as the libdatadog support has not yet landed/been released. Also, there's a few open questions on:fork handlingwhen to shut downHow to test the change?
(You'll need to build <DataDog/libdatadog#282 until the crash tracker gets included in a libdatadog release)
To test the crash tracker with an actual crash, try running the following on Ruby 2.6:
This should also work in the future but right now it doesn't work correctly; still looking into why:This also works on every Ruby:
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.