
Fix race conditions due to zero-copy semantics of pipe transport. #261

Merged · 4 commits · Jul 15, 2022

Conversation

@lthibault (Collaborator) commented Jul 12, 2022

While investigating the race condition triggered by #260, it was determined that the zero-copy semantics of transport.NewPipe were causing data races. In the observed case, both sides of the transport connection were concurrently calling Message.Release(). In a hitherto unobserved case, data races could also emerge if one side of the transport connection mutates a message segment.

This PR addresses the above design flaw by removing transport.NewPipe and creating a new internal package called bufpipe. Rather than implement the Transport interface, bufpipe.New returns a pair of io.ReadWriteClosers, which can be consumed by rpc.NewStreamTransport. Similarly, tests requiring unbuffered transport may use the standard-library net.Pipe.

An important conclusion of this investigation is that an in-process, zero-copy transport is not feasible under the current design.
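To illustrate the shape of the fix, here is a minimal sketch of a buffered in-memory pipe along the lines described above: a constructor returning two connected io.ReadWriteClosers, where every Write copies the caller's bytes so neither side ever aliases the other's message segments. This is a hypothetical illustration, not the actual bufpipe implementation; all names below are invented.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// halfPipe is one direction of the pipe. Write copies the caller's bytes
// into an internal buffer, so no message memory is shared between ends.
type halfPipe struct {
	mu     sync.Mutex
	cond   *sync.Cond
	buf    bytes.Buffer
	closed bool
}

func newHalfPipe() *halfPipe {
	h := &halfPipe{}
	h.cond = sync.NewCond(&h.mu)
	return h
}

func (h *halfPipe) Write(p []byte) (int, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.closed {
		return 0, io.ErrClosedPipe
	}
	n, err := h.buf.Write(p) // copies p; never blocks (unbounded buffer)
	h.cond.Broadcast()
	return n, err
}

func (h *halfPipe) Read(p []byte) (int, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for h.buf.Len() == 0 && !h.closed {
		h.cond.Wait() // block until data arrives or the pipe closes
	}
	if h.buf.Len() == 0 {
		return 0, io.EOF
	}
	return h.buf.Read(p)
}

func (h *halfPipe) Close() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.closed = true
	h.cond.Broadcast()
	return nil
}

// end joins a read half and a write half into one io.ReadWriteCloser.
type end struct{ r, w *halfPipe }

func (e end) Read(p []byte) (int, error)  { return e.r.Read(p) }
func (e end) Write(p []byte) (int, error) { return e.w.Write(p) }
func (e end) Close() error                { e.w.Close(); return e.r.Close() }

// New returns two connected, unboundedly buffered io.ReadWriteClosers,
// suitable for wrapping with rpc.NewStreamTransport on each side.
func New() (io.ReadWriteCloser, io.ReadWriteCloser) {
	a2b, b2a := newHalfPipe(), newHalfPipe()
	return end{r: b2a, w: a2b}, end{r: a2b, w: b2a}
}

func main() {
	c1, c2 := New()
	c1.Write([]byte("hello")) // never blocks: buffer is unbounded
	buf := make([]byte, 5)
	io.ReadFull(c2, buf)
	fmt.Println(string(buf)) // prints "hello"
}
```

Because the data crosses the pipe as serialized bytes, both sides own independent copies and may call Message.Release() or mutate segments freely.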

@lthibault lthibault added the bug label Jul 12, 2022
@lthibault lthibault self-assigned this Jul 12, 2022
@lthibault lthibault force-pushed the bugfix/racy-pipe branch 5 times, most recently from c133ed2 to 21f2649 on July 13, 2022 11:09
@lthibault lthibault requested a review from zenhack July 13, 2022 11:09
@lthibault lthibault marked this pull request as ready for review July 13, 2022 11:10
@lthibault (Collaborator, Author)

Tests are timing out, but passing on my machine. @zenhack Can you confirm you're getting the same thing? It's timing out in TestRecvCancel, but I kinda suspect the fuzz tests (as evidenced by -short succeeding).

Thoughts on how to proceed?

@zenhack (Contributor) commented Jul 13, 2022

I have only skimmed this, but the design seems a bit fragile to me, at least in terms of preserving the desired semantics -- we really want the buffer size to specify a number of messages, not a number of bytes.

What if instead of returning a pair of rwcs, we returned a pair of Codecs, but instead of just sending the messages themselves down a channel, we call Marshal/Unmarshal before sending/receiving?
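The shape being proposed might look roughly like the sketch below: a pair of codec-like values connected by channels with a buffer counted in messages, marshalling on send and unmarshalling on receive so no segment memory crosses the boundary. This is a hypothetical illustration with a toy message type standing in for *capnp.Message; all names are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// message stands in for *capnp.Message; Marshal and unmarshal stand in for
// the real marshalling calls discussed above.
type message struct{ data string }

func (m *message) Marshal() []byte { return []byte(m.data) }

func unmarshal(b []byte) *message { return &message{data: string(b)} }

// chanCodec sends each message through a channel as marshalled bytes, so
// the two ends never share segment memory.
type chanCodec struct {
	send chan<- []byte
	recv <-chan []byte
}

func (c chanCodec) Encode(m *message) error {
	c.send <- m.Marshal() // the serialized copy crosses, not the message
	return nil
}

func (c chanCodec) Decode() (*message, error) {
	b, ok := <-c.recv
	if !ok {
		return nil, errors.New("codec: closed")
	}
	return unmarshal(b), nil
}

// newCodecPair returns two connected codecs; n is a buffer size counted
// in messages rather than bytes.
func newCodecPair(n int) (chanCodec, chanCodec) {
	ab := make(chan []byte, n)
	ba := make(chan []byte, n)
	return chanCodec{send: ab, recv: ba}, chanCodec{send: ba, recv: ab}
}

func main() {
	c1, c2 := newCodecPair(1)
	c1.Encode(&message{data: "ping"})
	m, _ := c2.Decode()
	fmt.Println(m.data) // prints "ping"
}
```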

@lthibault (Collaborator, Author)

> I have only skimmed this, but the design seems a bit fragile to me, at least in terms of preserving the desired semantics -- we really want the buffer size to specify a number of messages, not a number of bytes.

I'm not sure I follow. Where might this break, in practice?

The semantics of Codec and Transport shouldn't care about buffering. If you're referring to the semantics of the unit tests, the only issue AFAICT is that we need to be able to buffer an arbitrary number of messages before closing one end of the pipe. In the old version, this is done by specifying a maximum number of slots a priori. In the present version, the buffer is unbounded so there's no chance of a full buffer causing a deadlock. So if anything, this seems less fragile, no?

> What if instead of returning a pair of rwcs, we returned a pair of Codecs, but instead of just sending the messages themselves down a channel, we call Marshal/Unmarshal before sending/receiving?

That ought to work, but it's not clear to me what we gain from the added complexity. I understand that it gives us (bounded?) buffering of messages, but what problem is that solving?

@lthibault (Collaborator, Author) commented Jul 13, 2022

P.S.: I'm also of a mind that if buffering causes higher-level code to misbehave, the bug is in the higher-level code, and the buffered pipe is a valuable canary. As a matter of design, Codec and Transport shouldn't make assumptions about buffering.

Of course, it's possible that I'm overlooking something, so I'm interested to hear your thoughts 🙂

@zenhack (Contributor) commented Jul 15, 2022

> The semantics of Codec and Transport shouldn't care about buffering. If you're referring to the semantics of the unit tests, the only issue AFAICT is that we need to be able to buffer an arbitrary number of messages before closing one end of the pipe. In the old version, this is done by specifying a maximum number of slots a priori. In the present version, the buffer is unbounded so there's no chance of a full buffer causing a deadlock. So if anything, this seems less fragile, no?

Ah, that makes sense. Yeah, if nothing is depending on not buffering then that's different. That said, the lack of buffering in net.Pipe() has caught a couple deadlocks, so I wonder if we should keep it unbuffered where we can?

In that case I won't die on this hill, but I think the thing I described is actually simpler (especially as a delta from the current implementation): it's just two channels and a call to .Marshal()/.Unmarshal() on either side.

@lthibault (Collaborator, Author)

> That said, the lack of buffering in net.Pipe() has caught a couple deadlocks, so I wonder if we should keep it unbuffered where we can?

There was exactly one test that was using a synchronous pipe. I've replaced it with net.Pipe, so we ought to be able to catch this in the future.

> In that case I won't die on this hill, but I think the thing I described is actually simpler (especially as a delta from the current implementation): it's just two channels and a call to .Marshal()/.Unmarshal() on either side.

Ah, I had something more complicated in mind. I'll give it a whirl.

@zenhack zenhack merged commit f9388ba into main Jul 15, 2022
@lthibault lthibault deleted the bugfix/racy-pipe branch July 15, 2022 21:38