Refactor `connection.rs` to make `fail_with` errors impossible #1721
Conversation
This looks really good!
I just have a few questions about the oneshot sender checks and handling.
Force-pushed from dc19f56 to a6453fc
Force-pushed from a6453fc to 57e5f2f
This looks good.
If you want to chat about how we handle pending requests on error, just let me know.
- Add an `ExitClient` transition, used when the internal client channel is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error (the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by `ExitClient` (which is an internal error, not a peer error)
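A minimal sketch of how these exit transitions might be modeled so that a failed connection cannot keep running; all names here are simplified, hypothetical stand-ins, not the actual Zebra types:

```rust
use std::convert::TryFrom;

// Hypothetical, simplified stand-ins for the transitions described above.
#[derive(Debug)]
enum Transition {
    AwaitRequest,
    // Internal: the client channel closed and no requests are pending.
    ExitClient,
    // The peer caused an error, so the connection must be torn down.
    Exit(String),
}

#[derive(Debug, PartialEq)]
enum State {
    AwaitingRequest,
}

// Exit transitions cannot construct a next State, so a "failed but
// still running" connection is unrepresentable at compile time.
impl TryFrom<Transition> for State {
    // None: clean internal shutdown; Some(e): a peer error to report.
    type Error = Option<String>;

    fn try_from(t: Transition) -> Result<Self, Self::Error> {
        match t {
            Transition::AwaitRequest => Ok(State::AwaitingRequest),
            Transition::ExitClient => Err(None),
            Transition::Exit(e) => Err(Some(e)),
        }
    }
}

fn main() {
    assert_eq!(
        State::try_from(Transition::AwaitRequest),
        Ok(State::AwaitingRequest)
    );
    // ExitClient is an internal exit, not a peer error.
    assert_eq!(State::try_from(Transition::ExitClient), Err(None));
    println!("exit transitions never produce a next state");
}
```

The key design point is that the event loop's return type forces every code path to pick a transition, and the exit variants simply have no corresponding next state.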
I think we're ready for a review here, but we might want to do some doc updates before we merge this PR.
I think we're pretty good here now, but I'd like to make sure:
- @yaahc checks the commits I added
- we've revised the module docs and other comments as needed
Looks good, there are some comments we can tweak, but we can make those changes later.
Co-authored-by: teor <teor@riseup.net>
The comment I suggested was applied.
All the failures here seem to be on the large checkpoint sync for testnet, which is unreliable, IIRC. cc @teor2345, but I think this is ready for merging.
Yeah, we just merged a PR which disabled that test, and it was the only one that failed.
```rust
Poll::Ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
```
This change replaces `error_slot` with `PeerError::ConnectionClosed`.
What happened to the error value that used to be here?
Are we hiding more specific errors as part of this change?
Possibly, but I think the errors we used to grab here were misleading and shouldn't be propagated back to this source. The error slot used to store the error that brought down a `Connection`, but that error is actually associated with one of the client requests created by `call`. We would propagate the error back to that caller over the channel and also copy it to be shared with all future callers. My instinct is that this isn't particularly useful and that we will still correctly report all these errors; they just won't end up being reported many extra times. IMO it's fine for subsequent attempts to use a `Client` with a dead `Connection` to just say "sorry, this is already closed".
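The behavior described here can be sketched roughly as follows; `Client`, `call`, and the `closed` flag are simplified, hypothetical stand-ins, not the real Zebra types:

```rust
#[derive(Debug, PartialEq)]
enum PeerError {
    ConnectionClosed,
}

struct Client {
    // Stand-in for "the background Connection is gone".
    closed: bool,
}

impl Client {
    fn call(&self) -> Result<(), PeerError> {
        if self.closed {
            // Later callers just learn that the connection is gone;
            // the original error was already delivered to the request
            // that caused it, so it isn't re-reported here.
            return Err(PeerError::ConnectionClosed);
        }
        Ok(())
    }
}

fn main() {
    let client = Client { closed: true };
    assert_eq!(client.call(), Err(PeerError::ConnectionClosed));
    println!("dead connection reports ConnectionClosed, not the old error");
}
```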
```rust
let ClientRequest { tx, .. } = e.into_inner();
let _ = tx.send(Err(PeerError::ConnectionClosed.into()));
future::ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
.boxed()
```
What happens to the `ClientRequest.tx` we used to send on here?
What happens to the specific error in the `error_slot`?
It's never sent to the `Connection` in the background, so we never poll the `rx`, so there's no reason to send an error through the `tx` before dropping it. I think this originally got added when we were figuring out the `MustUseSender`, to avoid panicking because the sender hadn't been used, but we don't even convert to a `MustUseSender` yet here, so it should be fine.

The error slot here is gone by the same logic as above: the error from a previous request doesn't need to be propagated back to subsequent failed attempts at new requests.
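The "safe to drop an unsent sender" reasoning can be illustrated with a standard-library channel standing in for the oneshot pair (a sketch only; the real code uses an async oneshot channel):

```rust
use std::sync::mpsc;

fn main() {
    // Stand-in for the oneshot pair carried in a ClientRequest.
    let (tx, rx) = mpsc::channel::<Result<&'static str, String>>();

    // The request was never handed to the background Connection, so
    // nothing will ever poll rx; dropping tx without sending is safe
    // because it was never converted into a "must use" sender.
    drop(tx);

    // Any later receive attempt sees a clean disconnect, not a panic.
    assert!(rx.recv().is_err());
    println!("dropped unused sender: receiver observes disconnect");
}
```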
```rust
span,
mut tx,
mut handler,
request_timer,
```
When does the timer start and finish, compared with the old timer?
The timer should be constructed at the end of a state transition, same as before. It used to happen in the body of `handle_client_request`, once we'd finished processing the request and manually updated the various bits of state to approximate a state transition. Now it is handled in the `TryFrom` impl where we take the `Transition` and use it to construct the subsequent `State`; as far as I can tell, this should result in it starting and finishing at the same time.
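In other words, the timer is created exactly when the next state is built. A small sketch of that timing, with hypothetical, simplified names and an assumed timeout value:

```rust
use std::time::{Duration, Instant};

// Hypothetical timeout value, for illustration only.
const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);

// Simplified stand-ins for the transition and the state it builds.
enum Transition {
    AwaitResponse,
}

struct AwaitingResponse {
    // Created while constructing the next state, i.e. at the end of
    // the transition, matching the old handle_client_request timing.
    request_deadline: Instant,
}

fn next_state(t: Transition) -> AwaitingResponse {
    match t {
        Transition::AwaitResponse => AwaitingResponse {
            request_deadline: Instant::now() + REQUEST_TIMEOUT,
        },
    }
}

fn main() {
    let state = next_state(Transition::AwaitResponse);
    // The timer starts when the state is built, so the deadline is
    // still in the future immediately afterwards.
    assert!(state.request_deadline > Instant::now());
    println!("request timer starts at state construction");
}
```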
```rust
match conn.handle_message_as_request(msg).await {
    Ok(()) => {
        Transition::AwaitResponse { tx, handler, span }
        // Transition::AwaitRequest
```
Should we be awaiting a request here?
I don't think so. I think I initially set it to `AwaitRequest` during the refactor, but that was incorrect and caused panics, so I commented this out when I was fixing it. I just need to go back and delete this commented-out code.
```rust
Either::Right((Either::Left(_), _peer_fut)) => {
    trace!(parent: &span, "client request timed out");
    let e = PeerError::ClientRequestTimeout;
    match handler {
```
// Special case: ping timeouts fail the connection.
(Let's add these comments back.)
```rust
trace!(parent: &span, "client request timed out");
let e = PeerError::ClientRequestTimeout;
match handler {
    Handler::Ping(_) => Transition::CloseResponse { e: e.into(), tx },
```
// Other request timeouts fail the request.
Motivation

Prior to this PR, we've been experiencing intermittent panics caused by runtime invariants within `Connection` being invalidated. We've resolved most of these bugs, but the ones related to `fail_with` have been stubborn enough that we decided it would be best to refactor the associated error handling logic to hopefully prevent such errors at compile time.

Solution

This PR attempts to solve the `fail_with` panics by removing the API entirely and replacing it with error propagation through return values. We refactored the primary event loop in `run` by extracting the body into a free function that must produce a `Transition` struct describing what state transition the `Connection` should take next. This mechanism is now used instead of `fail_with` to propagate errors back to our callers and to cleanly tear down state when a `Connection` encounters an error. When we wish to close a connection, we use a `Transition::Close*` variant, which cannot be used to construct a subsequent state, instead of using `fail_with` to mutate some shared state that propagates the error back.
Review
@teor2345
Related Issues
Closes #1599
Closes #1610
Follow Up Work