-
-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Framer's sequential reads of frame header then payload can leave underlying async stream's @read_buffer in a corrupted state #14
Framer's sequential reads of frame header then payload can leave underlying async stream's @read_buffer in a corrupted state #14
Comments
this script minimally reproduces the bug |
Wow, nice find, I'll sort this out right away! Thanks! |
Do you mind explaining in what situation you are retrying? I would assume that if the operation failed, you'd give up completely. |
If the operation fails we would have to throw away the connection since the connection is left in a corrupted state (there's now a H/2 payload on the wire with no header, which is garbage for all intents and purposes.). Currently, Request 1 timing out on reading a payload off the wire means any future reads off the wire are wrecked. We want to use the same connection for as long as possible. Having to re-establish a connection every time a read times out is quite toilsome. Shopify has also seen this precise bug occur while using async-http (cc @dwdking) We have had the patches in the PRs I've made deployed at Stripe for a couple of months now. Before the patch we were experiencing an incredibly high number of errors from this issue every day, the patch brought it down to ~0. |
That makes sense and I understand the value of the related PRs. However, if timeout is a problem, why not increase the timeout too? It sounds like you are having a timeout while reading the frame, then retrying if timeout occurs. Maybe it would be more logical to increase the timeout, e.g. in your case 2x or 3x? At least my intention with the timeout is as a last ditch effort and retrying the operation would not make sense after a timeout occurred (the connection could be in a funky state as you've correctly outlined). |
Increasing the timeout is not always feasible, since the timeout is determined by set of constraints the system must meet. There are a couple of use-cases for wanting the connection to remain in a healthy state after timeout:
The current behavior basically ends up necessitating throwing away the connection upon timeout. However, this is currently (1) not explicit in documentation, (2) not clear from the API, and (3) not something the library protects against. What we end up with is having every callsite to begin
stream.read_frame
rescue Async::TimeoutError
# throw away connection
end which is (a) very toilsome and ugly, (b) non-trivial to get right. In summary, the problems faced were the following:
The proposed fixes provide a strong correctness invariant, while allowing connections to persist. |
Thanks for the clear explanation, it makes sense. I agree, the invariant makes sense. |
Okay, I'm planning to work on this here: #19 |
@maruth-stripe are you using |
(I work with maruth) we built our own wrapper around protocol-http2 that uses async for a gRPC client that has some fairly specific constraints because the Stripe codebase is 30 million lines of Ruby with about 2000 active ruby developers. |
Thanks @fables-tales for the clarification. What I'm trying to understand is what scenario you are re-entering the In Can you help answer a few questions for me?
|
@maruth-stripe keep me honest here but I don't think we are re-entering read frame, we specifically added code to check against that. the core behind our code looks something like:
|
Okay, so for my understanding, you aren't multiplexing requests on a single connection and instead depending on sequential processing of frames for each stream until the stream is done? |
multiplexing is possible:
is a pattern we support |
What do you do if you receive a frame for a different stream? Are you multiplexing using a queue for each stream or something like that? |
yes, if |
Consider a call flow like:
@read_buffer
in our underlying async/io/stream contains exactly 9 bytes.read_frame
(takes 9 bytes off the underlying@read_buffer
in consume_read_buffer) and this completely drains @read_buffer@read_buffer
is still empty, we do not parse the frame, and exit the call flow.read_frame
with a higher timeoutread_header
again, which will callfill_read_buffer
(fills buffer with ~thousands of bytes)@read_buffer
in the underlying stream now contains the payload of the previous frame, instead of a valid frame header, and we get a protocol error.I think in this case the "right" thing to do is put the 9 bytes back in the read buffer, or hold the frame header and retry reading the payload, instead of trying to read the header out of what is certainly payload.
The text was updated successfully, but these errors were encountered: