
Refactor connection.rs to make fail_with errors impossible #1721

Merged: 13 commits merged from jane/fail-with into main on Feb 19, 2021
Conversation

@yaahc (Contributor) commented Feb 10, 2021

Motivation

Prior to this PR, we had been experiencing intermittent panics caused by runtime invariants within Connection being invalidated. We have resolved most of these bugs, but the ones related to fail_with have been stubborn enough that we decided it would be best to refactor the associated error-handling logic, hopefully preventing such errors at compile time.

Solution

This PR attempts to eliminate the fail_with panics by removing the API entirely and replacing it with error propagation through return values. We refactored the primary event loop in run by extracting its body into a free function that must produce a Transition value describing the state transition the Connection should take next. This mechanism, rather than fail_with, now propagates errors back to our callers and cleanly tears down state when a Connection encounters an error. When we wish to close a connection, we use the Transition::Close* variants, which cannot be used to construct a subsequent state, instead of using fail_with to mutate shared state that propagates the error back.
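For illustration, here is a minimal sketch of the return-value approach. The `Transition` and `PeerError` enums and the `step` function below are assumptions made for this sketch, not the actual zebra-network API:

```rust
// Minimal sketch: the loop body returns a Transition instead of calling a
// fail_with-style method that mutates shared error state.
#[derive(Debug)]
enum PeerError {
    ConnectionClosed,
}

#[derive(Debug)]
enum Transition {
    /// Stay in the event loop and wait for the next client request.
    AwaitRequest,
    /// Tear the connection down; no subsequent state can be built from this.
    Close(PeerError),
}

/// One iteration of the extracted event-loop body.
fn step(peer_is_alive: bool) -> Transition {
    if peer_is_alive {
        Transition::AwaitRequest
    } else {
        Transition::Close(PeerError::ConnectionClosed)
    }
}

fn main() {
    // The driving loop owns the state machine; a `Close` transition ends it.
    let mut alive = true;
    loop {
        match step(alive) {
            Transition::AwaitRequest => alive = false, // ...handle a request...
            Transition::Close(e) => {
                eprintln!("closing connection: {e:?}");
                break;
            }
        }
    }
}
```

The key property is that a `Close` transition consumes the loop: there is no way to keep driving the connection after an error, which is what made the old fail_with invariants easy to violate.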

The code in this pull request has:

  • Documentation Comments
  • Manual Integration Testing

Review

@teor2345

Related Issues

Closes #1599
Closes #1610

Follow Up Work

@zfnd-bot assigned yaahc on Feb 10, 2021
@teor2345 (Contributor) left a comment

This looks really good!

I just have a few questions about the oneshot sender checks and handling.

@teor2345 (Contributor) left a comment

This looks good.

If you want to chat about how we handle pending requests on error, just let me know.

@teor2345 added the A-rust (Area: Updates to Rust code), C-bug (Category: This is a bug), C-cleanup (Category: This is a cleanup), I-panic (Zebra panics with an internal error message), and P-High labels on Feb 17, 2021
- Add an `ExitClient` transition, used when the internal client channel
  is closed or dropped and there are no more pending requests
  (see the sketch after this list)
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
  (the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
  `ExitClient` (which is an internal error, not a peer error)
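A rough sketch of the pending-request handling these commits describe; the `ExitReason` enum and `drain_pending` function are made-up stand-ins for the real transitions and cleanup code:

```rust
// Sketch only: how pending requests are treated at exit time.
enum ExitReason {
    /// Internal shutdown: the client channel closed with nothing left to serve.
    ExitClient,
    /// The peer caused an error (the `Exit` / `ExitRequest` cases).
    PeerError(String),
}

/// Decide what happens to requests that were still pending at exit time.
fn drain_pending(reason: &ExitReason, pending: Vec<&str>) {
    match reason {
        // Ignore pending requests: there is no peer error to report.
        ExitReason::ExitClient => drop(pending),
        // Reject each pending request with the peer's error.
        ExitReason::PeerError(e) => {
            for request in pending {
                eprintln!("rejecting pending request {request}: {e}");
            }
        }
    }
}

fn main() {
    drain_pending(&ExitReason::ExitClient, vec![]);
    drain_pending(
        &ExitReason::PeerError("connection reset".into()),
        vec!["GetPeers"],
    );
}
```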
@teor2345 self-assigned this on Feb 17, 2021
@teor2345 requested a review from a team on February 17, 2021 04:03
@teor2345 (Contributor):

I think we're ready for a review here, but we might want to do some doc updates before we merge this PR.

@teor2345 marked this pull request as ready for review on February 17, 2021 04:04
@teor2345 (Contributor) left a comment

I think we're pretty good here now, but I'd like to make sure:

  • @yaahc checks the commits I added
  • we've revised the module docs and other comments as needed

@yaahc requested a review from teor2345 on February 18, 2021 23:55
@teor2345 previously approved these changes on Feb 19, 2021
@teor2345 (Contributor) left a comment

Looks good. There are some comments we can tweak, but we can make those changes later.

Co-authored-by: teor <teor@riseup.net>
@yaahc dismissed stale reviews from teor2345 via 5a64d98 on February 19, 2021 21:28
@teor2345 (Contributor) left a comment

The comment I suggested was applied.

@yaahc (Contributor, Author) commented Feb 19, 2021

All the failures here seem to be on large checkpoint sync for testnet, which is unreliable iirc. cc @teor2345 but I think this is ready for merging.

@teor2345 (Contributor):

Yeah we just merged a PR which disabled that test, and it was the only one that failed.

@yaahc merged commit 736092a into main on Feb 19, 2021
@yaahc deleted the jane/fail-with branch on February 19, 2021 22:11
@teor2345 (Contributor) left a comment

@yaahc here are my questions about this PR and the hang/slowness in #1801.

Comment on lines -202 to -205
Poll::Ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
@teor2345 (Contributor):

This change replaces error_slot with PeerError::ConnectionClosed.

What happened to the error value that used to be here?
Are we hiding more specific errors as part of this change?

@yaahc (Contributor, Author):

Possibly, but I think the errors we used to grab here were misleading and shouldn't be propagated back to this source. The error slot used to store the error that brought down a Connection, but that error is actually associated with one of the client requests created by call. We would propagate the error back to that caller over the channel and also copy it to be shared with all future callers. My instinct is that this isn't particularly useful: we will still correctly report all of these errors, they just won't end up being reported many extra times. IMO it's fine for subsequent attempts to use a Client with a dead Connection to just say "sorry, this is already closed".
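To illustrate the propagation path described here, a small std-only sketch in which an `mpsc` channel stands in for the per-request oneshot channel; the `PeerError` type and the channel setup are assumptions, not the zebra-network types:

```rust
// Sketch: the error that brings down a Connection is delivered to the caller
// whose request caused it, over that request's own channel, rather than
// being copied into a shared error slot.
use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
struct PeerError(&'static str);

fn main() {
    // One channel per client request, playing the role of the oneshot tx/rx.
    let (tx, rx) = mpsc::channel::<Result<(), PeerError>>();

    // The connection task reports the failure only to the request that hit it.
    let connection = thread::spawn(move || {
        let _ = tx.send(Err(PeerError("connection failed while serving this request")));
    });

    // That caller sees the specific error; later calls on the dead connection
    // just get a generic "already closed" response instead of a stale copy.
    match rx.recv() {
        Ok(Err(e)) => eprintln!("request failed: {e:?}"),
        Ok(Ok(())) => println!("request succeeded"),
        Err(_) => eprintln!("connection dropped the request"),
    }

    connection.join().unwrap();
}
```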

Comment on lines -224 to -230
let ClientRequest { tx, .. } = e.into_inner();
let _ = tx.send(Err(PeerError::ConnectionClosed.into()));
future::ready(Err(self
    .error_slot
    .try_get_error()
    .expect("failed servers must set their error slot")))
.boxed()
@teor2345 (Contributor):

What happens to the ClientRequest.tx we used to send on here?
What happens to the specific error in the error_slot?

@yaahc (Contributor, Author):

The request is never sent to the Connection in the background, so we never poll the rx, and there's no reason to send an error through the tx before dropping it. I think this was originally added when we were figuring out MustUseSender, to avoid panicking because the sender hadn't been used, but we don't even convert to a MustUseSender at this point, so it should be fine.

The error slot here is gone by the same logic as above: the error from a previous request doesn't need to be propagated back to subsequent failed attempts at new requests.

span,
mut tx,
mut handler,
request_timer,
@teor2345 (Contributor):

When does the timer start and finish, compared with the old timer?

@yaahc (Contributor, Author):

The timer should be constructed at the end of a state transition, same as before. It used to happen in the body of handle_client_request, once we had finished processing the request and manually updated the various bits of state to approximate a state transition. Now it is handled in the TryFrom impl, where we take the Transition and use it to construct the subsequent State. As far as I can tell, this should result in the timer starting and finishing at the same points as before.
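A sketch of the kind of `TryFrom` conversion being described, with assumed `Transition` and `State` types and an `Instant` standing in for the real request timer:

```rust
// Sketch only: the timer is (re)started in the Transition -> State conversion.
use std::convert::TryFrom;
use std::time::Instant;

enum Transition {
    AwaitRequest,
    AwaitResponse,
    Close,
}

enum State {
    AwaitingRequest,
    AwaitingResponse { request_timer: Instant },
}

impl TryFrom<Transition> for State {
    // A `Close` transition has no next state.
    type Error = ();

    fn try_from(t: Transition) -> Result<Self, Self::Error> {
        match t {
            Transition::AwaitRequest => Ok(State::AwaitingRequest),
            // The timer starts exactly when the state transition happens,
            // matching the old end-of-handle_client_request behaviour.
            Transition::AwaitResponse => Ok(State::AwaitingResponse {
                request_timer: Instant::now(),
            }),
            Transition::Close => Err(()),
        }
    }
}

fn main() {
    let next = State::try_from(Transition::AwaitResponse);
    assert!(matches!(next, Ok(State::AwaitingResponse { .. })));
}
```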

match conn.handle_message_as_request(msg).await {
    Ok(()) => {
        Transition::AwaitResponse { tx, handler, span }
        // Transition::AwaitRequest
@teor2345 (Contributor):

Should we be awaiting a request here?

@yaahc (Contributor, Author):

I don't think so. I initially set it to AwaitRequest during the refactor, but that was incorrect and caused panics, so I commented this out when I was fixing it. I just need to go back and delete this commented-out code.

Either::Right((Either::Left(_), _peer_fut)) => {
    trace!(parent: &span, "client request timed out");
    let e = PeerError::ClientRequestTimeout;
    match handler {
@teor2345 (Contributor):

// Special case: ping timeouts fail the connection.

@teor2345 (Contributor) commented Feb 23, 2021:

(Let's add these comments back.)

trace!(parent: &span, "client request timed out");
let e = PeerError::ClientRequestTimeout;
match handler {
    Handler::Ping(_) => Transition::CloseResponse { e: e.into(), tx },
@teor2345 (Contributor):

// Other request timeouts fail the request.
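A schematic sketch of the timeout handling being discussed, with simplified `Handler` and `Transition` stand-ins (the real code also reports the timeout error to the request's sender; these names are illustrative, not the exact connection.rs definitions):

```rust
// Sketch: ping timeouts close the connection, other timeouts fail the request.
enum Handler {
    Ping,
    Other,
}

enum Transition {
    /// Ping timeouts fail the whole connection.
    CloseResponse { error: &'static str },
    /// Other request timeouts fail only that request; the connection
    /// goes back to awaiting the next request.
    AwaitRequest,
}

fn on_client_request_timeout(handler: Handler) -> Transition {
    match handler {
        // Special case: ping timeouts fail the connection.
        Handler::Ping => Transition::CloseResponse {
            error: "client request timed out",
        },
        // Other request timeouts fail the request.
        Handler::Other => Transition::AwaitRequest,
    }
}

fn main() {
    assert!(matches!(
        on_client_request_timeout(Handler::Ping),
        Transition::CloseResponse { .. }
    ));
    assert!(matches!(
        on_client_request_timeout(Handler::Other),
        Transition::AwaitRequest
    ));
}
```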
