Fix metrics reporting in applications using forks #205
Conversation
Isn't there too much happening in here? Wouldn't it be easier to just expose something like a `reset` method that we could call in `after_fork`?
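For context, a minimal sketch of the alternative API this suggests: instead of auto-detecting forks, expose a reset method that the host application calls from its `after_fork` hook. The class and method names (`ForkAwareSender`, `#reset`) here are illustrative assumptions, not the actual dogstatsd-ruby API.

```ruby
# Hypothetical sketch: a sender with an explicit reset for after_fork hooks.
class ForkAwareSender
  def initialize
    @pid = Process.pid
    @queue = Queue.new
  end

  # Drops state inherited from the parent process so the child starts
  # with an empty queue and a fresh pid marker.
  def reset
    @queue.clear
    @pid = Process.pid
  end

  def enqueue(message)
    @queue << message
  end

  def queued
    @queue.size
  end
end

sender = ForkAwareSender.new
sender.enqueue("page.views:1|c")
# In e.g. a Puma/Unicorn after_fork hook, the app would call:
sender.reset
sender.queued # => 0
```

The trade-off debated in this thread: this puts the burden on users to wire up the hook, whereas the PR's approach detects the fork automatically.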
Left a few comments, but I think this is in great shape!
(And thanks for the patience with my slow review turnaround)
Hey @driv3r, thanks for the feedback and the comment. I definitely agree that a lot is happening here, but the idea is to avoid people having to go through a lot of documentation (or missing it) and opening issues because their framework uses forks. Plus,
A few more non-trivial notes 😰 😅
lib/datadog/statsd/sender.rb (Outdated)
```ruby
def add(message)
  raise ArgumentError, 'Start sender first' unless message_queue

  # if the thread does not exist, we assume we are running in a forked process,
  # empty the message queue and message buffers (these messages belong to
  # the parent process) and spawn a new companion thread.
  if !sender_thread.alive?
    @mx.synchronize {
      # a call from another thread has already re-created
      # the companion thread before this one acquired the lock
      break if sender_thread.alive?
      @logger.debug { "Statsd: companion thread is dead, re-creating one" } if @logger

      message_queue.close if CLOSEABLE_QUEUES
      @message_queue = nil
      message_buffer.reset
      message_buffer.reset_telemetry
      start
    }
  end

  message_queue << message
end
```
🤔 Hmm I still see a potential issue here, similar to the ones in `#flush` and `#rendez_vous`: with some poor timing of `stop`, by the time we get to the `message_queue << message` line, `message_queue` may be `nil`.
(And its cousin issue, `@sender_thread` being `nil`.)
I think part of the issue is that we have the background thread setting these two things to `nil` without ever synchronizing with any other threads, which can get surprised by this at many points in their execution. While we could expand the synchronization even more, that seems to me to be a bit heavy-handed, especially since we have a thread-safe construct (`Queue`) that we're building around in this class.
Here's my suggestion:

- Construct the `Queue` in `#initialize`
- Never set it to `nil` or close it (but we may `#clear` it when restarting the background thread or after a stop). This enables us to always know that we can safely use it and call methods on it.
- Only synchronize when mutating `@sender_thread` -> starting it, changing it (when it dies), or setting it to `nil` (when it finishes due to `#stop`). Reading `@sender_thread` for checks is OK to do without locks.
This is just a suggestion, so feel free to ignore it and do something else. But I think this class, as it is, hides a lot of complexity introduced by the shared mutable state; truly, the more I look, the more potential issues I see.
Let me know if you'd like to pair on this; perhaps that way we can get this across the finish line without so much async back-and-forth and rework.
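To make the suggested design concrete, here is a simplified sketch of a sender built along those lines. It is an assumption-laden illustration, not the PR's actual code: the class name `SenderSketch` and the `sink` block (standing in for the real socket write) are invented for the example. The `Queue` is built once in `#initialize` and never set to `nil` or closed, and the mutex guards only mutations of `@sender_thread`.

```ruby
# Sketch of the suggested design: a never-nil queue plus a mutex that
# only protects mutations of the companion-thread reference.
class SenderSketch
  def initialize(&sink)
    @message_queue = Queue.new        # constructed once, never nil, never closed
    @mutex = Mutex.new
    @sender_thread = nil
    @sink = sink || proc { |_msg| }   # stand-in for the real transport write
  end

  def start
    @mutex.synchronize do
      # re-check under the lock: another caller may have already restarted it
      next if @sender_thread && @sender_thread.alive?
      @sender_thread = Thread.new do
        # nil acts as a shutdown sentinel; real messages are never nil
        while (msg = @message_queue.pop)
          @sink.call(msg)
        end
      end
    end
  end

  def stop
    @mutex.synchronize do
      next unless @sender_thread
      @message_queue << nil           # ask the companion thread to drain and exit
      @sender_thread.join
      @sender_thread = nil
    end
  end

  def add(message)
    # reading @sender_thread without the lock is a cheap liveness check;
    # only mutations happen under @mutex (inside #start / #stop)
    start if @sender_thread.nil? || !@sender_thread.alive?
    @message_queue << message         # the queue always exists, so no nil race here
  end
end
```

Because `add` only ever reads `@sender_thread` and pushes onto a queue that always exists, the `message_queue` / `@sender_thread` nil races described above cannot occur in this shape.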
That is why at first I used the mutex to synchronize the close as well: it would avoid the timing issue with the `#stop` call, since `#stop` would only be able to run if it held the lock. I'll see what your suggestion entails, and I agree that this class ships a lot of complexity now...
Caught one final possible issue, but otherwise it LGTM.
As we discussed via chat, some of the rarer thread-safety issues are going to be tackled in follow-up PRs.
Overview
The library detects whether a fork happened and automatically recreates resources accordingly, which should fix metrics reporting for applications or frameworks using `fork`s. This PR also fixes the telemetry when forks happen.
Technically

- `Sender` checks whether its companion thread is running; `SingleThreadSender` uses `Process.pid`
- `MessageBuffer` is cleaned so that the new process does not start reporting metrics already handled by the parent process
- While this makes `Sender`/`SingleThreadSender` methods thread-safe, it doesn't mean that the library is completely thread-safe yet; further work is needed for the `Forwarder` to accomplish this.
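The pid-based detection mentioned for `SingleThreadSender` can be sketched as follows. This is an illustrative reduction, not the library's code: the class name `PidAwareBuffer` is invented, and the real implementation buffers serialized datagrams rather than plain strings. The idea is simply to remember the creating process's pid and treat a changed pid as evidence of a fork, discarding state inherited from the parent.

```ruby
# Sketch of pid-based fork detection: drop metrics inherited from the
# parent process before buffering anything in the child, so the child
# never re-reports data the parent already handled.
class PidAwareBuffer
  def initialize
    @pid = Process.pid   # remember which process created this buffer
    @buffer = []
  end

  def forked?
    @pid != Process.pid
  end

  def add(metric)
    if forked?
      @buffer.clear      # these metrics belong to the parent process
      @pid = Process.pid
    end
    @buffer << metric
    @buffer.size
  end
end
```

This check is cheap enough to run on every `add`, which is why it suits the single-threaded sender, while the threaded `Sender` instead uses the liveness of its companion thread (threads do not survive `fork`) as the fork signal.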