Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when writing to pipe blocks #168

Merged
merged 1 commit into from
May 13, 2024
Merged

Commits on May 13, 2024

  1. client: Fix deadlock when writing to pipe blocks

    Use sendLock to guard the entire stream allocation + write to wire
    operation, and streamLock to only guard access to the underlying stream
    map. This ensures the following:
    - We uphold the constraint that new stream IDs on the wire are always
      increasing, because whoever holds sendLock will be ensured to get the
      next stream ID and be the next to write to the wire.
    - Locks are always released in LIFO order. This prevents deadlocks.
    
    Taking sendLock before releasing streamLock means that if a goroutine
    blocks writing to the pipe, it can make another goroutine get stuck
    trying to take sendLock, and therefore streamLock will be kept locked as
    well. This can lead to the receiver goroutine no longer being able to
    read responses from the pipe, since it needs to take streamLock when
    processing a response. This ultimately leads to a complete deadlock of
    the client.
    
    It is reasonable for a server to block writes to the pipe if the client
    is not reading responses fast enough. So we can't expect writes to never
    block.
    
    I have repro'd the hang with a simple ttrpc client and server. The
    client spins up 100 goroutines that spam the server with requests
    constantly. After a few seconds of running I can see it hang. I have set
    the buffer size for the pipe to 0 to more easily repro, but it would
    still be possible to hit with a larger buffer size (just may take a
    higher volume of requests or larger payloads).
    
    I also validated that I no longer see the hang with this fix, by leaving
    the test client/server running for a few minutes. Obviously not 100%
    conclusive, but before I could get a hang within several seconds of
    running.
    
    Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
    kevpar committed May 13, 2024
    Configuration menu
    Copy the full SHA
    1b4f6f8 View commit details
    Browse the repository at this point in the history