Fix deadlock when writing to pipe blocks #168
Conversation
Force-pushed from f06ddee to f0e2206
LGTM
The old behavior does indeed look fishy, as locks generally need to be released in the reverse of the order they were taken.
I guess the main thing here is that messages may now be sent out of order. I don't think that should matter.
CI appears to have hung, however.
@dmcgowan could you take a look? This deadlock seems to be happening quite a bit of late for some customers.
Where is the other sendLock hung at? The stream creation ordering must still be preserved, and unlocking streamLock before taking sendLock opens up the possibility of a protocol error from sending stream creates out of order. While that is unlikely, it must be protected against, and would probably be even more difficult to reproduce.
Goroutine 1 is holding sendLock, blocked writing to the pipe. I put more info here too: #72 (comment). What is the reason that stream IDs need to be sent over the wire in order? Could we relax that constraint?
It seems in that case goroutine 3 should be able to proceed if we managed the locking better. Goroutines 1 and 2 are correctly blocking each other. I think putting the back pressure on the client would be good, but we definitely shouldn't have any contention between requests and responses in the client. I think if we broke streamLock up from an RWMutex into two mutexes it would be cleaner: the first mutex (or it could just be any synced map) around access to s.streams, and the second mutex only around allocating the next stream ID.
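A rough sketch of that split, with invented names (this is not the actual ttrpc code, just an illustration of the proposal):

```go
import (
	"net"
	"sync"
)

type stream struct {
	id uint32
}

// Hypothetical lock split: each mutex guards exactly one thing.
type client struct {
	// streamsMu guards only the streams map; it is never held across a
	// write to the wire, so the receiver can always look up streams.
	streamsMu sync.Mutex
	streams   map[uint32]*stream

	// sendMu serializes stream ID allocation and writes to the wire.
	sendMu       sync.Mutex
	nextStreamID uint32

	conn net.Conn
}
```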
This design we borrowed from spdy/http2, which has pretty solid reasoning behind it. It keeps the stream IDs unique without worrying about clashing or tracking a large map of active or previously seen identifiers. This makes the implementation easy and fast: you only need a single incrementing integer.
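For illustration, the spdy/http2-style allocator amounts to nothing more than this (building on the sketch above; the odd/even parity convention is an assumption here, not quoted from ttrpc):

```go
// nextID hands out strictly increasing stream IDs. Incrementing by 2
// preserves parity, so client-initiated IDs (odd, by the assumed
// convention) can never clash with server-initiated ones (even).
// The caller must hold sendMu.
func (c *client) nextID() uint32 {
	id := c.nextStreamID
	c.nextStreamID += 2
	return id
}
```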
I think it works to just take sendLock around the whole stream allocation and send. I'm still reviewing it to make sure it doesn't seem like it will break anything else, but this appears good as is. We could also swap from RWMutex to a plain Mutex plus a map, or a sync.Map.
Even better! We can leave the stream lock alone in this PR, but switching to sync.Mutex + map or sync.Map is probably better than using RWMutex.
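For reference, the sync.Map variant mentioned above would look roughly like this (a sketch reusing the `stream` type from the earlier snippet, not the actual change):

```go
import "sync"

// streams replaces the RWMutex-guarded map; sync.Map does its own
// synchronization, so response routing never waits behind a held write lock.
var streams sync.Map // effectively map[uint32]*stream

func registerStream(id uint32, s *stream) { streams.Store(id, s) }

func deleteStream(id uint32) { streams.Delete(id) }

func lookupStream(id uint32) (*stream, bool) {
	v, ok := streams.Load(id)
	if !ok {
		return nil, false
	}
	return v.(*stream), true
}
```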
Sounds great! I will fix up the commit description to be accurate, and look it over a bit more for any concerns. Otherwise I think this should be good.
Use sendLock to guard the entire stream allocation + write to wire operation, and streamLock to only guard access to the underlying stream map. This ensures the following:

- We uphold the constraint that new stream IDs on the wire are always increasing, because whoever holds sendLock is ensured to get the next stream ID and to be the next to write to the wire.
- Locks are always released in LIFO order. This prevents deadlocks.

Taking sendLock before releasing streamLock means that if a goroutine blocks writing to the pipe, it can make another goroutine get stuck trying to take sendLock, and therefore streamLock will be kept locked as well. This can lead to the receiver goroutine no longer being able to read responses from the pipe, since it needs to take streamLock when processing a response. This ultimately leads to a complete deadlock of the client.

It is reasonable for a server to block writes to the pipe if the client is not reading responses fast enough. So we can't expect writes to never block.

I have repro'd the hang with a simple ttrpc client and server. The client spins up 100 goroutines that spam the server with requests constantly. After a few seconds of running I can see it hang. I have set the buffer size for the pipe to 0 to more easily repro, but it would still be possible to hit with a larger buffer size (it may just take a higher volume of requests or larger payloads).

I also validated that I no longer see the hang with this fix, by leaving the test client/server running for a few minutes. Obviously not 100% conclusive, but before the fix I could get a hang within several seconds of running.

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
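Putting that design into code, a minimal sketch of the fixed flow, continuing the hypothetical names from the earlier snippets (sendMu standing in for sendLock, streamsMu for streamLock):

```go
// createStream allocates an ID and writes the stream-create frame while
// holding only sendMu, so new IDs hit the wire in increasing order.
// streamsMu is taken briefly and released *before* the write, so the
// receiver goroutine is never starved of it by a blocked pipe.
func (c *client) createStream(frame []byte) (*stream, error) {
	c.sendMu.Lock()
	defer c.sendMu.Unlock() // locks are released in LIFO order

	s := &stream{id: c.nextID()}

	c.streamsMu.Lock()
	c.streams[s.id] = s
	c.streamsMu.Unlock() // released before the potentially blocking write

	// This write may block if the pipe is full; only sendMu is held, so
	// the receiver can still take streamsMu and drain responses.
	if _, err := c.conn.Write(frame); err != nil {
		return nil, err
	}
	return s, nil
}
```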
It's possible adding another mutex (to have separate mutexes for "write to the wire" and "allocate stream + write to the wire") would reduce contention slightly, since ordinary data sends wouldn't have to wait behind stream allocation.
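Roughly like this (a hypothetical sketch of that three-lock variant; all names invented):

```go
import (
	"net"
	"sync"
)

// Creates hold allocMu across the whole allocate+send; ordinary sends
// contend only on writeMu.
type client2 struct {
	allocMu      sync.Mutex // guards "allocate stream + write to the wire"
	writeMu      sync.Mutex // guards "write to the wire"
	nextStreamID uint32
	conn         net.Conn
}

// createStream keeps allocMu held across allocation and the send, so
// stream IDs still reach the wire in strictly increasing order.
func (c *client2) createStream(frame []byte) (uint32, error) {
	c.allocMu.Lock()
	defer c.allocMu.Unlock()
	id := c.nextStreamID
	c.nextStreamID += 2
	return id, c.send(frame)
}

// send is the path ordinary data frames take; it never touches allocMu,
// which is where the contention win would come from.
func (c *client2) send(frame []byte) error {
	c.writeMu.Lock()
	defer c.writeMu.Unlock()
	_, err := c.conn.Write(frame)
	return err
}
```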
Thanks for the quick review!
Swap to calling send, which handles taking sendLock for us rather than doing it directly in createStream. This means that streamLock is now released before taking sendLock.
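To make the failure mode concrete, here is a self-contained toy in the spirit of the repro described in the commit message: it reproduces the deadlock mechanism over an unbuffered net.Pipe. None of this is ttrpc code; every type and name is invented for the demo.

```go
package main

import (
	"io"
	"log"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

type demoClient struct {
	streamLock sync.RWMutex // stands in for the lock guarding stream state
	sendLock   sync.Mutex   // stands in for the lock serializing wire writes
	nextID     uint32
	conn       net.Conn
}

// buggySend mimics the old createStream ordering: sendLock is acquired
// while streamLock is still held, followed by a blocking write.
func (c *demoClient) buggySend() {
	c.streamLock.Lock()
	c.nextID += 2
	c.sendLock.Lock()             // may park behind a writer stuck on the pipe...
	c.streamLock.Unlock()         // ...leaving streamLock held in the meantime
	c.conn.Write(make([]byte, 8)) // net.Pipe writes block until the peer reads
	c.sendLock.Unlock()
}

// receive mimics the response-reader goroutine: it must read-lock
// streamLock to route each response to its stream.
func (c *demoClient) receive(calls *uint64) {
	buf := make([]byte, 8)
	for {
		if _, err := io.ReadFull(c.conn, buf); err != nil {
			return
		}
		c.streamLock.RLock() // wedges once a sender holds streamLock
		c.streamLock.RUnlock()
		atomic.AddUint64(calls, 1)
	}
}

func main() {
	cc, sc := net.Pipe() // unbuffered, like the "buffer size 0" repro
	c := &demoClient{conn: cc}

	// Toy server: read a request, synchronously write a response back.
	// If the client stops draining responses, it stops reading requests.
	go func() {
		buf := make([]byte, 8)
		for {
			if _, err := io.ReadFull(sc, buf); err != nil {
				return
			}
			if _, err := sc.Write(buf); err != nil {
				return
			}
		}
	}()

	var calls uint64
	go c.receive(&calls)

	for i := 0; i < 100; i++ { // 100 goroutines spamming requests
		go func() {
			for {
				c.buggySend()
			}
		}()
	}

	// The counter stops advancing within seconds once the cycle closes:
	// sender blocked in Write -> sendLock held -> a second sender holds
	// streamLock while waiting for sendLock -> receiver can't RLock ->
	// server can't write a response -> the first Write never returns.
	for range time.Tick(time.Second) {
		log.Printf("responses so far: %d", atomic.LoadUint64(&calls))
	}
}
```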