
Refactor connection.rs to make fail_with errors impossible #1721

Merged: 13 commits merged from jane/fail-with into main on Feb 19, 2021
Conversation

@yaahc (Contributor) commented Feb 10, 2021

Motivation

Prior to this PR, we had been experiencing intermittent panics caused by runtime invariants within Connection being invalidated. We have resolved most of these bugs, but the ones related to fail_with have been stubborn enough that we decided it would be best to refactor the associated error-handling logic, hopefully preventing such errors at compile time.

Solution

This PR attempts to eliminate the fail_with panics by removing the API entirely and replacing it with error propagation through return values. We refactored the primary event loop in run by extracting its body into a free function that must produce a Transition value describing the state transition the Connection should take next. This mechanism, rather than fail_with, now propagates errors back to our callers and cleanly tears down state when a Connection encounters an error. When we wish to close a connection, we use the Transition::Close* variants, which cannot be used to construct a subsequent state, instead of using fail_with to mutate shared state that propagates the error back.
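For illustration, here is a minimal sketch of the return-value approach. The `Transition` and `PeerError` enums and the `step` function below are assumptions made for this sketch, not the actual zebra-network API:

```rust
// Minimal sketch: the loop body returns a Transition instead of calling a
// fail_with-style method that mutates shared error state.
#[derive(Debug)]
enum PeerError {
    ConnectionClosed,
}

#[derive(Debug)]
enum Transition {
    /// Stay in the event loop and wait for the next client request.
    AwaitRequest,
    /// Tear the connection down; no subsequent state can be built from this.
    Close(PeerError),
}

/// One iteration of the extracted event-loop body.
fn step(peer_is_alive: bool) -> Transition {
    if peer_is_alive {
        Transition::AwaitRequest
    } else {
        Transition::Close(PeerError::ConnectionClosed)
    }
}

fn main() {
    // The driving loop owns the state machine; a `Close` transition ends it.
    let mut alive = true;
    loop {
        match step(alive) {
            Transition::AwaitRequest => alive = false, // ...handle a request...
            Transition::Close(e) => {
                eprintln!("closing connection: {e:?}");
                break;
            }
        }
    }
}
```

The key property is that a `Close` transition consumes the loop: there is no way to keep driving the connection after an error, which is what made the old fail_with invariants easy to violate.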

The code in this pull request has:

  • Documentation Comments
  • Manual Integration Testing

Review

@teor2345

Related Issues

Closes #1599
Closes #1610

Follow Up Work

@zfnd-bot assigned yaahc on Feb 10, 2021
@teor2345 (Contributor) left a comment

This looks really good!

I just have a few questions about the oneshot sender checks and handling.

@teor2345 (Contributor) left a comment

This looks good.

If you want to chat about how we handle pending requests on error, just let me know.

@teor2345 added the A-rust (Area: Updates to Rust code), C-bug (Category: This is a bug), C-cleanup (Category: This is a cleanup), I-panic (Zebra panics with an internal error message), and P-High labels on Feb 17, 2021
- Add an `ExitClient` transition, used when the internal client channel
  is closed or dropped and there are no more pending requests
  (see the sketch after this list)
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
  (the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
  `ExitClient` (which is an internal error, not a peer error)
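A rough sketch of the pending-request handling these commits describe; the `ExitReason` enum and `drain_pending` function are made-up stand-ins for the real transitions and cleanup code:

```rust
// Sketch only: how pending requests are treated at exit time.
enum ExitReason {
    /// Internal shutdown: the client channel closed with nothing left to serve.
    ExitClient,
    /// The peer caused an error (the `Exit` / `ExitRequest` cases).
    PeerError(String),
}

/// Decide what happens to requests that were still pending at exit time.
fn drain_pending(reason: &ExitReason, pending: Vec<&str>) {
    match reason {
        // Ignore pending requests: there is no peer error to report.
        ExitReason::ExitClient => drop(pending),
        // Reject each pending request with the peer's error.
        ExitReason::PeerError(e) => {
            for request in pending {
                eprintln!("rejecting pending request {request}: {e}");
            }
        }
    }
}

fn main() {
    drain_pending(&ExitReason::ExitClient, vec![]);
    drain_pending(
        &ExitReason::PeerError("connection reset".into()),
        vec!["GetPeers"],
    );
}
```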
@teor2345 self-assigned this on Feb 17, 2021
@teor2345 requested a review from a team on February 17, 2021 04:03
@teor2345 (Contributor):

I think we're ready for a review here, but we might want to do some doc updates before we merge this PR.

@teor2345 marked this pull request as ready for review on February 17, 2021 04:04
@teor2345 (Contributor) left a comment

I think we're pretty good here now, but I'd like to make sure:

  • @yaahc checks the commits I added
  • we've revised the module docs and other comments as needed

@yaahc requested a review from teor2345 on February 18, 2021 23:55
@teor2345 previously approved these changes on Feb 19, 2021
@teor2345 (Contributor) left a comment

Looks good. There are some comments we can tweak, but we can make those changes later.

Co-authored-by: teor <teor@riseup.net>
@yaahc dismissed stale reviews from teor2345 via 5a64d98 on February 19, 2021 21:28
@teor2345 (Contributor) left a comment

The comment I suggested was applied.

@yaahc (Contributor, Author) commented Feb 19, 2021

All the failures here seem to be on large checkpoint sync for testnet, which is unreliable iirc. cc @teor2345 but I think this is ready for merging.

@teor2345 (Contributor):

Yeah we just merged a PR which disabled that test, and it was the only one that failed.

@yaahc merged commit 736092a into main on Feb 19, 2021
@yaahc deleted the jane/fail-with branch on February 19, 2021 22:11
@teor2345 (Contributor) left a comment

@yaahc here are my questions about this PR and the hang/slowness in #1801.

Comment on lines -202 to -205
Poll::Ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
@teor2345 (Contributor):

This change replaces error_slot with PeerError::ConnectionClosed.

What happened to the error value that used to be here?
Are we hiding more specific errors as part of this change?

@yaahc (Contributor, Author):

Possibly, but I think the errors we used to grab here were misleading and shouldn't be propagated back to this source. The error slot used to store the error that brought down a Connection, but that error is actually associated with one of the client requests created by call. We would propagate the error back to that caller over the channel and also copy it to be shared with all future callers. My instinct is that this isn't particularly useful: we will still correctly report all of these errors, they just won't end up being reported many extra times. IMO it's fine for subsequent attempts to use a Client with a dead Connection to just say "sorry, this is already closed".
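To illustrate the propagation path described here, a small std-only sketch in which an `mpsc` channel stands in for the per-request oneshot channel; the `PeerError` type and the channel setup are assumptions, not the zebra-network types:

```rust
// Sketch: the error that brings down a Connection is delivered to the caller
// whose request caused it, over that request's own channel, rather than
// being copied into a shared error slot.
use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
struct PeerError(&'static str);

fn main() {
    // One channel per client request, playing the role of the oneshot tx/rx.
    let (tx, rx) = mpsc::channel::<Result<(), PeerError>>();

    // The connection task reports the failure only to the request that hit it.
    let connection = thread::spawn(move || {
        let _ = tx.send(Err(PeerError("connection failed while serving this request")));
    });

    // That caller sees the specific error; later calls on the dead connection
    // just get a generic "already closed" response instead of a stale copy.
    match rx.recv() {
        Ok(Err(e)) => eprintln!("request failed: {e:?}"),
        Ok(Ok(())) => println!("request succeeded"),
        Err(_) => eprintln!("connection dropped the request"),
    }

    connection.join().unwrap();
}
```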

Comment on lines -224 to -230
let ClientRequest { tx, .. } = e.into_inner();
let _ = tx.send(Err(PeerError::ConnectionClosed.into()));
future::ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
.boxed()
@teor2345 (Contributor):

What happens to the ClientRequest.tx we used to send on here?
What happens to the specific error in the error_slot?

@yaahc (Contributor, Author):

The request is never sent to the Connection in the background, so we never poll the rx, and there's no reason to send an error through the tx before dropping it. I think this was originally added when we were figuring out MustUseSender, to avoid panicking because the sender hadn't been used, but we don't even convert to a MustUseSender at this point, so it should be fine.

The error slot here is gone by the same logic as above: the error from a previous request doesn't need to be propagated back to subsequent failed attempts at new requests.

span,
mut tx,
mut handler,
request_timer,
@teor2345 (Contributor):

When does the timer start and finish, compared with the old timer?

@yaahc (Contributor, Author):

The timer should be constructed at the end of a state transition, same as before. It used to happen in the body of handle_client_request, once we had finished processing the request and manually updated the various bits of state to approximate a state transition. Now it is handled in the TryFrom impl, where we take the Transition and use it to construct the subsequent State. As far as I can tell, this should result in the timer starting and finishing at the same points as before.
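A sketch of the kind of `TryFrom` conversion being described, with assumed `Transition` and `State` types and an `Instant` standing in for the real request timer:

```rust
// Sketch only: the timer is (re)started in the Transition -> State conversion.
use std::convert::TryFrom;
use std::time::Instant;

enum Transition {
    AwaitRequest,
    AwaitResponse,
    Close,
}

enum State {
    AwaitingRequest,
    AwaitingResponse { request_timer: Instant },
}

impl TryFrom<Transition> for State {
    // A `Close` transition has no next state.
    type Error = ();

    fn try_from(t: Transition) -> Result<Self, Self::Error> {
        match t {
            Transition::AwaitRequest => Ok(State::AwaitingRequest),
            // The timer starts exactly when the state transition happens,
            // matching the old end-of-handle_client_request behaviour.
            Transition::AwaitResponse => Ok(State::AwaitingResponse {
                request_timer: Instant::now(),
            }),
            Transition::Close => Err(()),
        }
    }
}

fn main() {
    let next = State::try_from(Transition::AwaitResponse);
    assert!(matches!(next, Ok(State::AwaitingResponse { .. })));
}
```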

match conn.handle_message_as_request(msg).await {
    Ok(()) => {
        Transition::AwaitResponse { tx, handler, span }
        // Transition::AwaitRequest
@teor2345 (Contributor):

Should we be awaiting a request here?

@yaahc (Contributor, Author):

I don't think so. I initially set it to AwaitRequest during the refactor, but that was incorrect and caused panics, so I commented this out when I was fixing it. I just need to go back and delete this commented-out code.

Either::Right((Either::Left(_), _peer_fut)) => {
    trace!(parent: &span, "client request timed out");
    let e = PeerError::ClientRequestTimeout;
    match handler {
@teor2345 (Contributor):

// Special case: ping timeouts fail the connection.

@teor2345 (Contributor) commented Feb 23, 2021:

(Let's add these comments back.)

trace!(parent: &span, "client request timed out");
let e = PeerError::ClientRequestTimeout;
match handler {
    Handler::Ping(_) => Transition::CloseResponse { e: e.into(), tx },
@teor2345 (Contributor):

// Other request timeouts fail the request.
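A schematic sketch of the timeout handling being discussed, with simplified `Handler` and `Transition` stand-ins (the real code also reports the timeout error to the request's sender; these names are illustrative, not the exact connection.rs definitions):

```rust
// Sketch: ping timeouts close the connection, other timeouts fail the request.
enum Handler {
    Ping,
    Other,
}

enum Transition {
    /// Ping timeouts fail the whole connection.
    CloseResponse { error: &'static str },
    /// Other request timeouts fail only that request; the connection
    /// goes back to awaiting the next request.
    AwaitRequest,
}

fn on_client_request_timeout(handler: Handler) -> Transition {
    match handler {
        // Special case: ping timeouts fail the connection.
        Handler::Ping => Transition::CloseResponse {
            error: "client request timed out",
        },
        // Other request timeouts fail the request.
        Handler::Other => Transition::AwaitRequest,
    }
}

fn main() {
    assert!(matches!(
        on_client_request_timeout(Handler::Ping),
        Transition::CloseResponse { .. }
    ));
    assert!(matches!(
        on_client_request_timeout(Handler::Other),
        Transition::AwaitRequest
    ));
}
```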
