[sender] reset buffers on forks and reset the companion thread if dead or nil #203
Conversation
Left a few notes! One general note is, since our customers do read through these PRs, I suggest renaming the PR to something clearer, such as "Fix metrics reporting in applications using fork" or something similar.
I also recommend clarifying in what situation `UDPSocket.new.tap` was causing the VM to crash, since this seems quite relevant to users on older versions that do use that code (should they upgrade ASAP to avoid this crash? Does it only get triggered along with other changes in this PR?).
One extra thing that occurred to me is that some of the changes to the […] Since the intention seems to be that one […]
Yes, entirely true, that's exactly what I'm doing right now by adding a mutex, mainly around the message queue. 👍
force-pushed from 6f92771 to 9614c8c
force-pushed from 9614c8c to 72105bf
…es simultaneously
force-pushed from 4c08e46 to 7460c87
Sorry for the extra "oh btw concurrency and lots of changes" notes. Feel free to separate changes out to a separate PR if you'd like to get the fork checking in first and separately tackle the concurrency.
lib/datadog/statsd/sender.rb (Outdated)
```ruby
@mx.synchronize {
  message_queue.close if CLOSEABLE_QUEUES
  @message_queue = nil
  message_buffer.reset
  start
```
There's a race hiding here. If a process just forked and started two new threads, and both threads try to report data, the following timeline can occur:
- Thread A checks line 52, sees that no sender thread is alive. Grabs the lock. (Scheduler switches threads.)
- Thread B checks line 52, sees that no sender thread is alive. Tries to grab the lock, but it is already taken, so it blocks waiting for the lock. (Scheduler switches threads.)
- Thread A closes the previous queue and resets it. Creates a new thread. Releases the lock. Adds its message to the queue. (Scheduler switches threads.)
- Thread B wakes up, grabs the lock, closes the queue that Thread A created and resets it, so the data from Thread A gets lost. Creates another sender thread. Etc.

TL;DR: after grabbing the lock, we need to check that the world state is as we expected it to be when we went to grab the lock, as it might have moved in the meantime :)
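The re-check-under-the-lock pattern described here is classic double-checked locking. A minimal sketch of it follows; `Sender`, `restart`, and the `restarts` counter are simplified stand-ins for illustration, not the library's actual code:

```ruby
require "thread"

# Simplified stand-in for the sender: the worker thread may die (e.g. in a
# fork child), and callers noticing that must restart it exactly once.
class Sender
  attr_reader :restarts # counter for illustration only

  def initialize
    @mx = Mutex.new
    @message_queue = Queue.new
    @sender_thread = Thread.new {} # worker body elided; exits immediately
    @restarts = 0
  end

  def add(message)
    unless sender_thread_alive?
      @mx.synchronize do
        # Double-check after acquiring the lock: another thread may have
        # already restarted the worker while we were blocked on the mutex.
        restart unless sender_thread_alive?
      end
    end
    @message_queue << message
  end

  private

  def sender_thread_alive?
    @sender_thread && @sender_thread.alive?
  end

  def restart
    @restarts += 1
    @message_queue = Queue.new            # drop messages from before the restart
    @sender_thread = Thread.new { sleep } # worker body elided; stays alive
  end
end
```

Without the second `sender_thread_alive?` check inside `synchronize`, every thread that observed a dead worker would reset the queue in turn, losing whatever earlier threads had enqueued.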
> If a process just forked, and started two new threads
I don't expect the first process to have to create a thread, but I agree it could happen if, for some unknown reason, the first process sees its thread dying at the same moment. I'll look into this.
```ruby
if CLOSEABLE_QUEUES
  def stop(join_worker: true)
    @mx.synchronize {
      message_queue = @message_queue
      message_queue.close if message_queue

      sender_thread = @sender_thread
      sender_thread.join if sender_thread && join_worker
    }
  end
else
  def stop(join_worker: true)
    @mx.synchronize {
      message_queue = @message_queue
      message_queue << :close if message_queue

      sender_thread = @sender_thread
      sender_thread.join if sender_thread && join_worker
    }
  end
end
```
Minor: These two methods are very similar -- would it be worth unifying them and only doing the `if CLOSEABLE_QUEUES` in the one line that changes between them?
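One way to realize that suggestion, as a sketch using the names from the diff (the surrounding class setup here is invented for self-containment, and `CLOSEABLE_QUEUES` is assumed to be a boolean constant), is to keep a single `#stop` and branch only on how the queue is told to shut down:

```ruby
require "thread"

# Queue#close exists since Ruby 2.3; older versions get a :close sentinel.
CLOSEABLE_QUEUES = Queue.instance_methods.include?(:close)

class Sender
  def initialize(message_queue)
    @mx = Mutex.new
    @message_queue = message_queue
    @sender_thread = nil # worker creation elided in this sketch
  end

  def stop(join_worker: true)
    @mx.synchronize {
      message_queue = @message_queue
      if message_queue
        # The only line that differed between the two original methods:
        CLOSEABLE_QUEUES ? message_queue.close : message_queue << :close
      end

      sender_thread = @sender_thread
      sender_thread.join if sender_thread && join_worker
    }
  end
end
```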
```ruby
  @mx.synchronize {
    message_queue.push(:flush)
    rendez_vous if sync
  }
end

def rendez_vous
```
Since `#rendez_vous` is public, should it also perform similar checks to `#flush` and use the lock as well?
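For context, a rendez-vous of this kind is typically a ticket enqueued behind all pending work, which the caller waits on until the worker thread signals it. The following is a generic sketch under that assumption, not the library's actual `#rendez_vous` implementation:

```ruby
require "thread"

# Generic worker with a rendez-vous: the caller blocks until the worker
# thread has drained every item enqueued before the rendez-vous ticket.
class Worker
  def initialize
    @queue = Queue.new
    @thread = Thread.new do
      while (item = @queue.pop)
        break if item == :stop
        item.call # every queued item is a callable in this sketch
      end
    end
  end

  def enqueue(&work)
    @queue << work
  end

  def rendez_vous
    mx = Mutex.new
    cv = ConditionVariable.new
    done = false
    mx.synchronize do
      # The ticket sits behind all previously enqueued work (FIFO queue).
      @queue << proc { mx.synchronize { done = true; cv.signal } }
      cv.wait(mx) until done
    end
  end

  def stop
    @queue << :stop
    @thread.join
  end
end
```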
```ruby
def add(message)
  # we have just forked, meaning we have messages in the buffer that we should
  # not send, they belong to the parent process, let's clear the buffer.
  if forked?
    @message_buffer.reset
    update_fork_pid
  end
```
A similar race to the one I described for the `Sender` can happen here. Two threads call this method at the same time, both of them get `forked?` => `true`, and the `@message_buffer` gets reset twice.

Also, the interaction with the buffer itself may have issues with concurrency.
Addressed by #205
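The pid-caching scheme that `forked?` / `update_fork_pid` imply, combined with a re-check under a lock so the reset happens once per fork, can be sketched like this (class name and the `resets` counter are illustrative, not the library's code):

```ruby
require "thread"

# Fork detection by caching the pid: after a fork, Process.pid changes in
# the child, so forked? flips to true until the cached pid is updated.
class ForkAwareBuffer
  attr_reader :resets # counter for illustration only

  def initialize
    @mx = Mutex.new
    @fork_pid = Process.pid
    @resets = 0
  end

  def forked?
    Process.pid != @fork_pid
  end

  def add(message)
    if forked?
      @mx.synchronize do
        # Re-check under the lock so concurrent callers in a fresh fork
        # child reset the buffer once, not once per thread.
        if forked?
          @resets += 1            # stands in for @message_buffer.reset
          @fork_pid = Process.pid # stands in for update_fork_pid
        end
      end
    end
    # ... buffer the message ...
  end
end
```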
We want to reset the `MessageBuffer` if we appear to be in a recent fork, since the contained messages belong to the parent process. On top of that, when using the original multi-threaded `Sender`, this PR re-spawns the companion thread if it seems to be dead for whatever reason. It should automatically help when someone is using the multi-threaded mode in an app or framework using forks.
For some reason, I had to remove the use of `UDPSocket.new.tap`, since it was causing the interpreter to crash as soon as a thread was re-created in fork children.

This is a follow-up to, and replaces, #199.