fix: eliminate deadlocks in the barrier node when delete is used #2181

jsternberg · 2019-03-27T17:32:39Z

Modified two different parts of the barrier node to prevent any
deadlocks. There were two possible methods to deadlock with one of these
found to be hit by people using these nodes.

The first was when a point was received, Point would be called and a
message would be sent to reset the timer. If the message queue was
backed up, then it was possible for emitBarrier to be called and for
it to try and forward a message to itself. But, the goroutine that was
reading the points from the edge was attempting to reset the timer so
the delete group message was never able to be consumed and the timer was
never able to be reset.

This first one was fixed by adding a buffer of size one to that channel
and making the reset non-blocking when the channel was full. This
guarantees that the message will be sent and prevents it from blocking
if the reset message is sent twice in quick succession.

The second is if the delete group message was received. This would call
Stop. If emitBarrier was called in between having the delete group
message, then it could deadlock in the same way as the above because the
delete group message needs emit to be called and emit is blocked on
waiting for the idle handler to exit.

The stop method was also not thread safe and it was easy enough to make
it thread safe since it could potentially be called from a different
thread.

This is fixed by changing the stop channel to have a buffer of 1 and
making it so stopping was sending a goroutine to avoid a potential
double close of the goroutine. When the stop message is sent, it will be
done non-blocking to prevent a deadlock since only one message should be
received anyway. The call to Stop() in the idleHandler is changed to
a non-blocking version and does not wait for the goroutine to exit while
the other stop will wait for the goroutine to exit.

Fixes #2144.

nathanielc

This looks great! Were you able to create the deadlock locally and see that this fixed it?

jsternberg · 2019-03-27T20:22:23Z

I did not try. Let me give it a try. It should be pretty simple to replicate.

Modified two different parts of the barrier node to prevent any deadlocks. There were two possible methods to deadlock with one of these found to be hit by people using these nodes. The first was when a point was received, `Point` would be called and a message would be sent to reset the timer. If the message queue was backed up, then it was possible for `emitBarrier` to be called and for it to try and forward a message to itself. But, the goroutine that was reading the points from the edge was attempting to reset the timer so the delete group message was never able to be consumed and the timer was never able to be reset. This first one was fixed by adding a buffer of size one to that channel and making the reset non-blocking when the channel was full. This guarantees that the message will be sent and prevents it from blocking if the reset message is sent twice in quick succession. The second is if the delete group message was received. This would call `Stop`. If `emitBarrier` was called in between having the delete group message, then it could deadlock in the same way as the above because the delete group message needs emit to be called and emit is blocked on waiting for the idle handler to exit. The stop method was also not thread safe and it was easy enough to make it thread safe since it could potentially be called from a different thread. This is fixed by changing the stop channel to have a buffer of 1 and making it so stopping was sending a goroutine to avoid a potential double close of the goroutine. When the stop message is sent, it will be done non-blocking to prevent a deadlock since only one message should be received anyway. The call to `Stop()` in the `idleHandler` is changed to a non-blocking version and does not wait for the goroutine to exit while the other stop will wait for the goroutine to exit.

jsternberg requested a review from nathanielc March 27, 2019 17:32

nathanielc reviewed Mar 27, 2019

View reviewed changes

jsternberg force-pushed the fix/barrier-delete-deadlock branch from f0623ae to 947b7c7 Compare March 27, 2019 22:24

nathanielc approved these changes Mar 28, 2019

View reviewed changes

jsternberg merged commit 70cb863 into master Mar 28, 2019

jsternberg deleted the fix/barrier-delete-deadlock branch March 28, 2019 15:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: eliminate deadlocks in the barrier node when delete is used #2181

fix: eliminate deadlocks in the barrier node when delete is used #2181

jsternberg commented Mar 27, 2019

nathanielc left a comment

jsternberg commented Mar 27, 2019

fix: eliminate deadlocks in the barrier node when delete is used #2181

fix: eliminate deadlocks in the barrier node when delete is used #2181

Conversation

jsternberg commented Mar 27, 2019

nathanielc left a comment

Choose a reason for hiding this comment

jsternberg commented Mar 27, 2019