Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix metrics reporting in applications using forks #205

Merged
merged 8 commits into from
Sep 28, 2021
Merged

Conversation

remeh
Copy link
Contributor

@remeh remeh commented Sep 10, 2021

Overview

The library is detecting while if a fork happened and it is automatically recreating resources accordingly, which should result in fixing metrics report for applications or frameworks using forks.

This PR is also fixing the telemetry if forks happened.

Technically

  • The two senders implementation use mechanism to detect if a fork happened, Sender is checking if its companion thread is running, SingleThreadSender is using Process.pid
  • When a fork happened, the MessageBuffer is cleaned in order for the new process to not start reporting metrics already handled by the parent process
  • Added locks make the calls to Sender/SingleThreadSender methods thread-safe, it doens't mean that the library is completely thread-safe yet, further work is needed for the Forwarder to accomplish this.

Copy link

@driv3r driv3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there too much happening in here? wouldn't it be easier to just expose something like reset method that we could call in after_fork ?

lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but I think this is in great shape!

lib/datadog/statsd/forwarder.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/message_buffer.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
lib/datadog/statsd/sender.rb Outdated Show resolved Hide resolved
@ivoanjo
Copy link
Member

ivoanjo commented Sep 15, 2021

(And thanks for the patience with my slow review turnaround)

@remeh
Copy link
Contributor Author

remeh commented Sep 20, 2021

Hey @driv3r, thanks for the feedback and the comment. I definitely agree that a lot is happening here, but the idea would be to not have people going through a lot of documentation (or missing it) and opening issues because their framework is using forks. Plus, dogstatsd-ruby seems to be used a lot with these kind of frameworks, that's why I would like this to work without user interaction.

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more non-trivial notes 😰 😅

lib/datadog/statsd/sender.rb Show resolved Hide resolved
Comment on lines 54 to 76
def add(message)
raise ArgumentError, 'Start sender first' unless message_queue

# if the thread does not exist, we assume we are running in a forked process,
# empty the message queue and message buffers (these messages belong to
# the parent process) and spawn a new companion thread.
if !sender_thread.alive?
@mx.synchronize {
# a call from another thread has already re-created
# the companion thread before this one acquired the lock
break if sender_thread.alive?
@logger.debug { "Statsd: companion thread is dead, re-creating one" } if @logger

message_queue.close if CLOSEABLE_QUEUES
@message_queue = nil
message_buffer.reset
message_buffer.reset_telemetry
start
}
end

message_queue << message
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Hmm I still see a potential issue here, similar to the ones in #flush and #rendez_vous: with some poor timing of stop, by the time we get to the message_queue << message line, message_queue may be nil.

(And its cousin issue, @sender_thread being nil)

I think part of the issue is that we have the background thread setting these two things to nil without ever synchronizing with any other threads, which can get surprised by this at many points in their execution. While we could expand the synchronization even more, that seems to me to be a bit heavy-handed, especially since we have a thread-safe construct (Queue) that we're building around in this class.

Here's my suggestion:

  1. Construct the Queue in #initialize
  2. Never set it to nil or close it (but we may #clear it when restarting the background thread or after a stop). This enables us to always know that we can safely use it and call methods on it.
  3. Only synchronize when mutating @sender_thread -> starting it, changing it (when it dies), or setting it to nil (when it finishes due to #stop). Reading @sender_thread for checks is OK to do without locks.

This is just a suggestion, so feel free to ignore and do something else. But I think this class as it is, is hiding a lot of complexity introduced by the shared mutable state, truly the more I look the more I see potential issues.

Let me know if you'd like to pair on this; perhaps that way we can get this across the finish line without so much async back-and-forth and rework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is why at first I've used the mutex to synchronize the close as well, it would avoid timing with the #stop call since the #stop would be able to run only if it has the lock. I'll see what change represents your suggestion, and I agree that this class ships a lot of complexity now...

lib/datadog/statsd/message_buffer.rb Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught one final possible issue, but otherwise it LGTM.

As we discussed via chat, some of the rarer thread-safety issues are going to be tackled in follow-up PRs.

lib/datadog/statsd/single_thread_sender.rb Outdated Show resolved Hide resolved
@remeh remeh merged commit 14c76e3 into master Sep 28, 2021
@remeh remeh deleted the remeh/fork-detect-v3 branch September 28, 2021 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants