DEBUG-2334 Probe Notifier Worker component #4028
Conversation
This is a background thread that notification payloads (probe status and probe snapshots) can be submitted to. The payloads will be batched into groups if possible, and sent to the local agent asynchronously.
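The batching pattern described above can be sketched roughly as follows. This is an illustrative sketch only; the class and method names are hypothetical and not the actual dd-trace-rb implementation.

```ruby
require 'thread'

# Hypothetical sketch: payloads are queued from any thread and flushed
# in batches by a single background thread.
class BatchingWorker
  def initialize(&sender)
    @queue = []
    @lock = Mutex.new
    @wake = Queue.new # used purely as a wake-up / stop signal
    @sender = sender
    @thread = Thread.new { run }
  end

  def add(payload)
    @lock.synchronize { @queue << payload }
    @wake << :work # wake the background thread
  end

  def stop
    @wake << :stop
    @thread.join
  end

  private

  def run
    loop do
      signal = @wake.pop # block until work arrives or stop is requested
      batch = @lock.synchronize do
        drained = @queue
        @queue = []
        drained
      end
      @sender.call(batch) unless batch.empty?
      break if signal == :stop
    end
  end
end

sent = []
worker = BatchingWorker.new { |batch| sent.concat(batch) }
worker.add(:status_payload)
worker.add(:snapshot_payload)
worker.stop # flushes anything still queued before returning
```

Depending on thread timing, the two payloads may arrive in one batch or two, which mirrors the "batched into groups if possible" behavior described above.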
I would like to suggest some small code adjustments to reduce the size of the methods. Great job 👏🏼
```ruby
# Minimum interval between submissions.
# TODO make this into an internal setting and increase default to 2 or 3.
MIN_SEND_INTERVAL = 1
```
WDYT of adding the scale here? Is it seconds, milliseconds, or ...? Maybe we can name it `MIN_SEND_INTERVAL_SEC`?
Given that #4012 has not yet been looked at, and I have another 2000+ lines of code pending locally, I would like to only make changes in this and other open PRs that address clear problems. I am happy to discuss adding units to times and if there is team consensus on how the units should be indicated, add them in a subsequent PR.
```ruby
def initialize(settings, agent_settings, transport)
  @settings = settings
  @status_queue = []
  @snapshot_queue = []
  @transport = transport
  @lock = Mutex.new
  @wake = Core::Semaphore.new
  @io_in_progress = false
  @sleep_remaining = nil
  @wake_scheduled = false
end
```
I do not see `agent_settings` being used; is that correct?
They are now consumed by the transport, I removed agent settings from probe notifier worker.
```ruby
  begin
    more = maybe_send
  rescue => exc
    raise if settings.dynamic_instrumentation.propagate_all_exceptions

    warn "Error in probe notifier worker: #{exc.class}: #{exc} (at #{exc.backtrace.first})"
  end
  @lock.synchronize do
    @wake_scheduled = more
  end
  wake.wait(more ? MIN_SEND_INTERVAL : nil)
end
```
Suggested change:

```ruby
begin
  more = maybe_send
rescue => exc
  raise if settings.dynamic_instrumentation.propagate_all_exceptions
  warn "Error in probe notifier worker: #{exc.class}: #{exc} (at #{exc.backtrace.first})"
end
@lock.synchronize { @wake_scheduled = more }
wake.wait(more ? MIN_SEND_INTERVAL : nil)
end
```
```ruby
unless thread&.join(timeout)
  thread.kill
end
```
Suggested change:

```ruby
thread.kill unless thread&.join(timeout)
```
```ruby
[
  [:status, 'probe status'],
  [:snapshot, 'snapshot'],
].each do |(event_type, event_name)|
```
Suggested change:

```ruby
{
  status: 'probe status',
  snapshot: 'snapshot'
}.each do |event_type, event_name|
```
```ruby
unless thread&.join(timeout)
  thread.kill
end
```
👀 This will fail if `thread` is `nil`:

```
[3] pry(main)> thread = nil
=> nil
[4] pry(main)> unless thread&.join(123)
[4] pry(main)*   thread.kill
[4] pry(main)* end
NoMethodError: undefined method `kill' for nil:NilClass
from (pry):7:in `__pry__
```
Thank you, repaired.
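For reference, a nil-safe version of the shutdown logic can be written with `&.` on both calls. This is a sketch of the fix being discussed, not necessarily the exact code that landed: `Thread#join` returns the thread on success and `nil` on timeout, and when `thread` itself is `nil` both calls become safe no-ops.

```ruby
# Hypothetical helper illustrating the nil-safe pattern.
def stop_thread(thread, timeout)
  thread&.kill unless thread&.join(timeout)
end

stop_thread(nil, 0.1)    # no NoMethodError when the thread was never started

t = Thread.new { sleep } # a thread that never finishes on its own
stop_thread(t, 0.1)      # join times out, so the thread is killed
t.join                   # returns promptly once the kill takes effect
```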
```ruby
if io_in_progress
  # If we just call Thread.pass we could be in a busy loop -
  # add a sleep.
  sleep 0.25
  next
elsif queues_empty
  break
else
  sleep 0.25
  next
```
It's possible to avoid the sleeping by using a condition variable to flag when the queue is empty
I added a note to investigate this.
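For illustration, the condition-variable approach suggested above could look roughly like this. All names are hypothetical; this is a sketch of the idea, not the PR's code: the flushing thread signals a `ConditionVariable` when the queue drains, so the waiter blocks on the condition instead of polling with `sleep`.

```ruby
lock = Mutex.new
queue_empty = ConditionVariable.new
queue = [1, 2, 3]

worker = Thread.new do
  lock.synchronize do
    queue.clear # stand-in for the real "send everything" loop
    queue_empty.signal
  end
end

# Wait for the queue to drain; the timeout guards against a hung worker,
# and the until loop guards against spurious wakeups.
lock.synchronize do
  queue_empty.wait(lock, 5) until queue.empty?
end
worker.join
```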
```ruby
context 'when three snapshots are added in quick succession' do
  it 'sends two batches' do
    expect(worker.send(:snapshot_queue)).to be_empty

    expect(transport).to receive(:send_snapshot).once.with([snapshot])

    worker.add_snapshot(snapshot)
    sleep 0.1
    worker.add_snapshot(snapshot)
    sleep 0.1
    worker.add_snapshot(snapshot)

    # Since sending is asynchronous, we need to relinquish execution
    # for the sending thread to run.
    sleep(0.1)

    # At this point the first snapshot should have been sent,
    # with the remaining two in the queue
    expect(worker.send(:snapshot_queue)).to eq([snapshot, snapshot])

    sleep 0.4
    # Still within the cooldown period
    expect(worker.send(:snapshot_queue)).to eq([snapshot, snapshot])

    expect(transport).to receive(:send_snapshot).once.with([snapshot, snapshot])

    sleep 0.5
    expect(worker.send(:snapshot_queue)).to eq([])
  end
```
If possible, avoid using `sleep`s in tests -- they make the test suite both slower and flakier >_>
I made a note to investigate this. The tests are currently not flaky and speeding them up is a lower priority than shipping DI to customers, but if they start being flaky I will revisit this sooner.
Note that slow specs kinda affect all CI runs so yeah, try not to add too much to that fire >_>
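One common way to remove fixed sleeps, sketched here with generic names rather than anything from this PR: have the asynchronous code push to a `Queue` when it finishes, and have the test block on that push (with a timeout) instead of sleeping for a guessed duration.

```ruby
require 'timeout'

done = Queue.new

worker = Thread.new do
  result = [1, 2].sum # stand-in for the asynchronous work under test
  done << result      # signal completion, carrying the result
end

# Blocks exactly as long as the work takes, and no longer; the timeout
# turns a hang into a quick test failure instead of a stuck CI job.
result = Timeout.timeout(5) { done.pop }
worker.join
```

This makes the test both faster (no padding delay) and less flaky (no guessed delay that can be too short under CI load).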
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #4028      +/-   ##
==========================================
- Coverage   97.86%   97.84%   -0.03%
==========================================
  Files        1321     1324       +3
  Lines       79326    79509     +183
  Branches     3934     3959      +25
==========================================
+ Hits        77631    77794     +163
- Misses       1695     1715      +20
```

☔ View full report in Codecov by Sentry.
I put in a few cents here, but I think it's OK.
```ruby
batch = instance_variable_get("@#{event_type}_queue")
instance_variable_set("@#{event_type}_queue", [])
```
Minor: it may be worth adding an `attr_accessor` and then using `send(...)` to access these variables.
The reading side is done with an attribute. Are you concerned that there may be a spelling mistake here and the wrong variable will be written to?
I was mostly thinking that `instance_variable_set` and `instance_variable_get` are very sharp weapons, so having the `attr_accessor` seems a bit easier for avoiding bugs. On the other hand, it is true that we can misspell the creation of the `attr_accessor` as well, so it's not like there's no potential for bugs there either.
Let me leave it this way for now but if the matter comes up again I can redo as attr_accessor.
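For reference, the `attr_accessor` alternative discussed above could look roughly like this (hypothetical class, not the PR's code). The benefit is that a typo in the attribute name raises `NoMethodError` immediately instead of silently reading or creating a misspelled instance variable.

```ruby
class Worker
  attr_accessor :status_queue, :snapshot_queue

  def initialize
    @status_queue = []
    @snapshot_queue = []
  end

  # Drain one queue: a misspelled event_type now fails loudly in send,
  # rather than instance_variable_set quietly creating a new ivar.
  def drain(event_type)
    batch = send("#{event_type}_queue")
    send("#{event_type}_queue=", [])
    batch
  end
end

worker = Worker.new
worker.status_queue << :a << :b
batch = worker.drain(:status) # => [:a, :b]; status_queue is now empty
```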
* master:
  * DEBUG-2334 Probe Notifier Worker component (DataDog#4028)
  * DEBUG-2334 dynamic instrumentation probe notification builder (DataDog#4011)
  * Handle low-level libddwaf exception in Context
  * [NO-TICKET] Minor: Fix typos in safe_dup_spec.rb
  * Remove libdatadog musl
  * Remove ffi's after installation
  * Remove cached gems
What does this PR do?
This is a background thread that notification payloads (probe
status and probe snapshots) can be submitted to.
The payloads will be batched into groups if possible, and
sent to the local agent asynchronously.
Motivation:
Initial DI implementation.
Change log entry
None
Additional Notes:
How to test the change?
Unit tests in this PR