Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: eliminate deadlocks in the barrier node when delete is used #2181

Merged
merged 1 commit into from
Mar 28, 2019

Conversation

jsternberg
Copy link
Contributor

Modified two different parts of the barrier node to prevent any
deadlocks. There were two possible methods to deadlock with one of these
found to be hit by people using these nodes.

The first was when a point was received, Point would be called and a
message would be sent to reset the timer. If the message queue was
backed up, then it was possible for emitBarrier to be called and for
it to try and forward a message to itself. But, the goroutine that was
reading the points from the edge was attempting to reset the timer so
the delete group message was never able to be consumed and the timer was
never able to be reset.

This first one was fixed by adding a buffer of size one to that channel
and making the reset non-blocking when the channel was full. This
guarantees that the message will be sent and prevents it from blocking
if the reset message is sent twice in quick succession.

The second is if the delete group message was received. This would call
Stop. If emitBarrier was called in between having the delete group
message, then it could deadlock in the same way as the above because the
delete group message needs emit to be called and emit is blocked on
waiting for the idle handler to exit.

The stop method was also not thread safe and it was easy enough to make
it thread safe since it could potentially be called from a different
thread.

This is fixed by changing the stop channel to have a buffer of 1 and
making it so stopping was sending a goroutine to avoid a potential
double close of the goroutine. When the stop message is sent, it will be
done non-blocking to prevent a deadlock since only one message should be
received anyway. The call to Stop() in the idleHandler is changed to
a non-blocking version and does not wait for the goroutine to exit while
the other stop will wait for the goroutine to exit.

Fixes #2144.

@jsternberg jsternberg requested a review from nathanielc March 27, 2019 17:32
Copy link
Contributor

@nathanielc nathanielc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Were you able to create the deadlock locally and see that this fixed it?

@jsternberg
Copy link
Contributor Author

I did not try. Let me give it a try. It should be pretty simple to replicate.

Modified two different parts of the barrier node to prevent any
deadlocks. There were two possible methods to deadlock with one of these
found to be hit by people using these nodes.

The first was when a point was received, `Point` would be called and a
message would be sent to reset the timer. If the message queue was
backed up, then it was possible for `emitBarrier` to be called and for
it to try and forward a message to itself. But, the goroutine that was
reading the points from the edge was attempting to reset the timer so
the delete group message was never able to be consumed and the timer was
never able to be reset.

This first one was fixed by adding a buffer of size one to that channel
and making the reset non-blocking when the channel was full. This
guarantees that the message will be sent and prevents it from blocking
if the reset message is sent twice in quick succession.

The second is if the delete group message was received. This would call
`Stop`. If `emitBarrier` was called in between having the delete group
message, then it could deadlock in the same way as the above because the
delete group message needs emit to be called and emit is blocked on
waiting for the idle handler to exit.

The stop method was also not thread safe and it was easy enough to make
it thread safe since it could potentially be called from a different
thread.

This is fixed by changing the stop channel to have a buffer of 1 and
making it so stopping was sending a goroutine to avoid a potential
double close of the goroutine. When the stop message is sent, it will be
done non-blocking to prevent a deadlock since only one message should be
received anyway. The call to `Stop()` in the `idleHandler` is changed to
a non-blocking version and does not wait for the goroutine to exit while
the other stop will wait for the goroutine to exit.
@jsternberg jsternberg force-pushed the fix/barrier-delete-deadlock branch from f0623ae to 947b7c7 Compare March 27, 2019 22:24
@jsternberg jsternberg merged commit 70cb863 into master Mar 28, 2019
@jsternberg jsternberg deleted the fix/barrier-delete-deadlock branch March 28, 2019 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kapacitor leaks sockets when using barrier nodes with delete
2 participants