Fix metrics reporting in applications using forks #205
Conversation
Isn't there too much happening in here? Wouldn't it be easier to just expose something like a `reset` method that we could call in `after_fork`?
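For context, a minimal sketch of the alternative API this suggests: instead of auto-detecting forks, expose a reset method that the host application calls from its `after_fork` hook. The class and method names (`ForkAwareSender`, `#reset`) here are illustrative assumptions, not the actual dogstatsd-ruby API.

```ruby
# Hypothetical sketch: a sender with an explicit reset for after_fork hooks.
class ForkAwareSender
  def initialize
    @pid = Process.pid
    @queue = Queue.new
  end

  # Drops state inherited from the parent process so the child starts
  # with an empty queue and a fresh pid marker.
  def reset
    @queue.clear
    @pid = Process.pid
  end

  def enqueue(message)
    @queue << message
  end

  def queued
    @queue.size
  end
end

sender = ForkAwareSender.new
sender.enqueue("page.views:1|c")
# In e.g. a Puma/Unicorn after_fork hook, the app would call:
sender.reset
sender.queued # => 0
```

The trade-off debated in this thread: this puts the burden on users to wire up the hook, whereas the PR's approach detects the fork automatically.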
Left a few comments, but I think this is in great shape!
(And thanks for the patience with my slow review turnaround)
Hey @driv3r, thanks for the feedback and the comment. I definitely agree that a lot is happening here, but the idea is to avoid people having to go through a lot of documentation (or missing it) and opening issues because their framework uses forks. Plus,
A few more non-trivial notes 😰 😅
lib/datadog/statsd/sender.rb (Outdated)
```ruby
def add(message)
  raise ArgumentError, 'Start sender first' unless message_queue

  # if the thread does not exist, we assume we are running in a forked process,
  # empty the message queue and message buffers (these messages belong to
  # the parent process) and spawn a new companion thread.
  if !sender_thread.alive?
    @mx.synchronize {
      # a call from another thread has already re-created
      # the companion thread before this one acquired the lock
      break if sender_thread.alive?
      @logger.debug { "Statsd: companion thread is dead, re-creating one" } if @logger

      message_queue.close if CLOSEABLE_QUEUES
      @message_queue = nil
      message_buffer.reset
      message_buffer.reset_telemetry
      start
    }
  end

  message_queue << message
end
```
🤔 Hmm I still see a potential issue here, similar to the ones in `#flush` and `#rendez_vous`: with some poor timing of `stop`, by the time we get to the `message_queue << message` line, `message_queue` may be `nil`.
(And its cousin issue, `@sender_thread` being `nil`.)
I think part of the issue is that we have the background thread setting these two things to `nil` without ever synchronizing with any other threads, which can get surprised by this at many points in their execution. While we could expand the synchronization even more, that seems to me to be a bit heavy-handed, especially since we have a thread-safe construct (`Queue`) that we're building around in this class.
Here's my suggestion:

- Construct the `Queue` in `#initialize`
- Never set it to `nil` or close it (but we may `#clear` it when restarting the background thread or after a stop). This enables us to always know that we can safely use it and call methods on it.
- Only synchronize when mutating `@sender_thread` -> starting it, changing it (when it dies), or setting it to `nil` (when it finishes due to `#stop`). Reading `@sender_thread` for checks is OK to do without locks.
This is just a suggestion, so feel free to ignore it and do something else. But I think this class, as it is, hides a lot of complexity introduced by the shared mutable state; truly, the more I look, the more potential issues I see.
Let me know if you'd like to pair on this; perhaps that way we can get this across the finish line without so much async back-and-forth and rework.
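To make the suggested design concrete, here is a simplified sketch of a sender built along those lines. It is an assumption-laden illustration, not the PR's actual code: the class name `SenderSketch` and the `sink` block (standing in for the real socket write) are invented for the example. The `Queue` is built once in `#initialize` and never set to `nil` or closed, and the mutex guards only mutations of `@sender_thread`.

```ruby
# Sketch of the suggested design: a never-nil queue plus a mutex that
# only protects mutations of the companion-thread reference.
class SenderSketch
  def initialize(&sink)
    @message_queue = Queue.new        # constructed once, never nil, never closed
    @mutex = Mutex.new
    @sender_thread = nil
    @sink = sink || proc { |_msg| }   # stand-in for the real transport write
  end

  def start
    @mutex.synchronize do
      # re-check under the lock: another caller may have already restarted it
      next if @sender_thread && @sender_thread.alive?
      @sender_thread = Thread.new do
        # nil acts as a shutdown sentinel; real messages are never nil
        while (msg = @message_queue.pop)
          @sink.call(msg)
        end
      end
    end
  end

  def stop
    @mutex.synchronize do
      next unless @sender_thread
      @message_queue << nil           # ask the companion thread to drain and exit
      @sender_thread.join
      @sender_thread = nil
    end
  end

  def add(message)
    # reading @sender_thread without the lock is a cheap liveness check;
    # only mutations happen under @mutex (inside #start / #stop)
    start if @sender_thread.nil? || !@sender_thread.alive?
    @message_queue << message         # the queue always exists, so no nil race here
  end
end
```

Because `add` only ever reads `@sender_thread` and pushes onto a queue that always exists, the `message_queue` / `@sender_thread` nil races described above cannot occur in this shape.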
That is why at first I used the mutex to synchronize the close as well: it would avoid the timing issue with the `#stop` call, since `#stop` would only be able to run if it held the lock. I'll see what your suggestion entails, and I agree that this class ships a lot of complexity now...
Caught one final possible issue, but otherwise it LGTM.
As we discussed via chat, some of the rarer thread-safety issues are going to be tackled in follow-up PRs.
Overview
The library detects whether a fork happened and automatically recreates resources accordingly, which should fix metrics reporting for applications or frameworks using `fork`s. This PR also fixes the telemetry when forks happen.
Technically

- `Sender` checks whether its companion thread is running; `SingleThreadSender` uses `Process.pid`
- `MessageBuffer` is cleaned so that the new process does not start reporting metrics already handled by the parent process
- While this makes `Sender`/`SingleThreadSender` methods thread-safe, it doesn't mean that the library is completely thread-safe yet; further work is needed for the `Forwarder` to accomplish this.
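The pid-based detection mentioned for `SingleThreadSender` can be sketched as follows. This is an illustrative reduction, not the library's code: the class name `PidAwareBuffer` is invented, and the real implementation buffers serialized datagrams rather than plain strings. The idea is simply to remember the creating process's pid and treat a changed pid as evidence of a fork, discarding state inherited from the parent.

```ruby
# Sketch of pid-based fork detection: drop metrics inherited from the
# parent process before buffering anything in the child, so the child
# never re-reports data the parent already handled.
class PidAwareBuffer
  def initialize
    @pid = Process.pid   # remember which process created this buffer
    @buffer = []
  end

  def forked?
    @pid != Process.pid
  end

  def add(metric)
    if forked?
      @buffer.clear      # these metrics belong to the parent process
      @pid = Process.pid
    end
    @buffer << metric
    @buffer.size
  end
end
```

This check is cheap enough to run on every `add`, which is why it suits the single-threaded sender, while the threaded `Sender` instead uses the liveness of its companion thread (threads do not survive `fork`) as the fork signal.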