Report RPC Errors to the application on peer disconnections #5680

Merged
merged 6 commits into from
May 6, 2024

dapplion
Collaborator

@dapplion dapplion commented May 1, 2024

Issue Addressed

Extends

As we are overhauling some internal RPC infrastructure, a desired feature is to report peer disconnects on RPC requests.

This PR should report an RPCError::Disconnected if a connection is terminated while an RPC request is underway.

Proposed Changes

  • Emit RPCError::Disconnected to any outbound streams with a disconnecting peer
  • Remove the sync code that fails download attempts of disconnected peers, and expect inject_error to handle it instead
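
A minimal sketch of the first bullet, assuming a simplified tracker of outbound requests. All names here (`OutboundTracker`, `RpcFailed`, the integer `PeerId` and `RequestId` aliases) are hypothetical stand-ins for illustration and are not the types used in this PR. The idea is that when a peer disconnects, every request still in flight to that peer is turned into an explicit error event, so sync no longer has to infer the failure from the disconnect itself.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the real code uses libp2p peer ids and its own request ids.
pub type PeerId = u32;
pub type RequestId = u64;

#[derive(Debug, Clone, PartialEq)]
pub enum RPCError {
    Disconnected,
}

/// Event handed to the application for a request that can no longer complete.
#[derive(Debug, PartialEq)]
pub struct RpcFailed {
    pub peer_id: PeerId,
    pub id: RequestId,
    pub error: RPCError,
}

/// Tracks outbound requests that have not yet received a full response.
#[derive(Default)]
pub struct OutboundTracker {
    active: HashMap<PeerId, Vec<RequestId>>,
}

impl OutboundTracker {
    pub fn on_request_sent(&mut self, peer_id: PeerId, id: RequestId) {
        self.active.entry(peer_id).or_default().push(id);
    }

    pub fn on_response_completed(&mut self, peer_id: PeerId, id: RequestId) {
        let now_empty = match self.active.get_mut(&peer_id) {
            Some(ids) => {
                ids.retain(|r| *r != id);
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.active.remove(&peer_id);
        }
    }

    /// Returns one `Disconnected` failure per request still in flight to `peer_id`.
    pub fn on_peer_disconnected(&mut self, peer_id: PeerId) -> Vec<RpcFailed> {
        self.active
            .remove(&peer_id)
            .unwrap_or_default()
            .into_iter()
            .map(|id| RpcFailed { peer_id, id, error: RPCError::Disconnected })
            .collect()
    }
}
```

Requests that already completed produce no event, which matches the intent that the error only fires for requests that were still underway.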

@dapplion dapplion mentioned this pull request May 1, 2024
jimmygchen added a commit that referenced this pull request May 1, 2024
Squashed commit of the following:

commit f5dc1a3
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 17:14:50 2024 +0900

    Expect RPCError::Disconnect to fail ongoing requests

commit 888f129
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 14:14:22 2024 +0900

    Report RPC Errors to the application on peer disconnections

    Co-authored-by: Age Manning <Age@AgeManning.com>
@dapplion dapplion requested a review from AgeManning May 1, 2024 12:24
@realbigsean realbigsean added ready-for-review The code is ready for review v5.2.0 Q2 2024 labels May 1, 2024
Member

@pawanjay176 pawanjay176 left a comment

This is a great catch. I just have one question.

if matches!(
    self.state(),
    BackFillState::Failed | BackFillState::NotRequired
) {
    return Ok(());
}

if let Some(batch_ids) = self.active_requests.remove(peer_id) {
Member

This is a great simplification. The repeated logic was a source of many seen/unseen bugs earlier. Kudos for having the big picture and spotting that this can be removed 🙌

My only concern is that in inject_error we call batch.download_failed(true) rather than batch.download_failed(false), which we arguably shouldn't be doing for disconnections. Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of stuff.

Member

Good point. It should also be noted that these disconnects are ungraceful.

i.e. when Lighthouse disconnects from a peer, it waits and tries to fulfill all of its requests. It won't just drop the connection. In fact, in a graceful disconnect a stream timeout will occur before the disconnection.

The peer-disconnect error should only happen when a peer drops the connection without fulfilling a request (which Lighthouse doesn't do unless there is a network error).

Collaborator Author

Right, current stable assumes that peer disconnection == failed download. But this RPCError::Disconnected error will only fire if there is an active outgoing request that gets terminated ungracefully.

Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of stuff.

Should only apply if:

  • we initiate request to peer A
  • peer A disconnects ungracefully before completing request
  • we initiate a retry request to peer B
  • peer B disconnects ungracefully before completing request

This should not happen frequently
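
To make the concern concrete, here is a toy model of a batch that counts failed attempts. It is only a sketch under assumptions: `Batch`, `Outcome`, and the `max_attempts` limit are hypothetical, the limit of 2 used below is arbitrary, and the real `download_failed` in sync has more states than this. The assumption being illustrated is that passing `true` records the attempt against the batch, while `false` does not.

```rust
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The batch can be retried with another peer.
    Continue,
    /// Too many recorded failures; the caller gives up on the batch.
    Failed,
}

/// Hypothetical, simplified batch that only tracks failed download attempts.
pub struct Batch {
    failed_attempts: usize,
    max_attempts: usize,
}

impl Batch {
    pub fn new(max_attempts: usize) -> Self {
        Batch { failed_attempts: 0, max_attempts }
    }

    /// `mark_failed == true` counts this attempt against the batch (assumed semantics).
    pub fn download_failed(&mut self, mark_failed: bool) -> Outcome {
        if mark_failed {
            self.failed_attempts += 1;
        }
        if self.failed_attempts >= self.max_attempts {
            Outcome::Failed
        } else {
            Outcome::Continue
        }
    }
}
```

With an illustrative limit of 2, the peer A then peer B sequence above corresponds to two `download_failed(true)` calls: the first returns `Continue` and the second returns `Failed`. Calling it with `false` never reaches `Failed` in this model.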

Member

fair enough

for id in batch_ids {
    if let Some(batch) = self.batches.get_mut(&id) {
        if let BatchOperationOutcome::Failed { blacklist } =
            batch.download_failed(true)?
Member

Similar comment to the one above, regarding marking the batch as failed/not failed.

In forward sync, this might mean not retrying a valid chain because the peers on the good chain are disconnecting.

        sender.send_request(peer_id, 42, rpc_request.clone());
    }
    NetworkEvent::RPCFailed { error, id: 42, .. } => match error {
        RPCError::Disconnected => return,
Member

The test should make sure we only reach this branch.

Collaborator Author

Since this branch is the only way to break out of the loop, not hitting it will make the test time out. I considered adding something explicit, but it feels redundant.
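
For readers unfamiliar with this test pattern, a sketch of the "only exit is the expected event, otherwise time out" shape, using a plain `std::sync::mpsc` channel instead of the real network event stream. The `Event` enum and `wait_for_disconnect_error` are hypothetical names for illustration; the actual test is async and uses the project's own harness.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical, simplified event type.
#[derive(Debug, PartialEq)]
pub enum Event {
    PeerConnected,
    RpcFailedDisconnected { id: u64 },
    Other,
}

/// Loops over events until the expected failure arrives or the deadline passes.
/// The matching arm is the only successful exit, so any other sequence of
/// events ends in a timeout error rather than a silent pass.
pub fn wait_for_disconnect_error(
    rx: &Receiver<Event>,
    expected_id: u64,
    timeout: Duration,
) -> Result<(), &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(Event::RpcFailedDisconnected { id }) if id == expected_id => return Ok(()),
            // Unrelated events are ignored and the loop keeps waiting.
            Ok(_) => continue,
            Err(RecvTimeoutError::Timeout) => return Err("timed out"),
            Err(RecvTimeoutError::Disconnected) => return Err("event source closed"),
        }
    }
}
```

An explicit assertion inside the loop would add nothing here, because a run that never sees the expected event already fails through the timeout path.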

Member

right.

@pawanjay176 pawanjay176 added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels May 1, 2024
Member

@AgeManning AgeManning left a comment

I took a quick look. This looks good to me. I like the simplification.

The errors that we now see come from ungraceful disconnects, which should probably be punished and treated like an RPC error, imo.

michaelsproul pushed a commit that referenced this pull request May 2, 2024
Collaborator Author

dapplion commented May 3, 2024

@realbigsean noted that a lookup can get stuck if it has no available peers and is awaiting a download. This case should never happen with the current code, as a lookup is never left in the AwaitingDownload state. However, for completeness I have added a check to drop lookups in that case in 02f1b2d.
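
A rough sketch of that check, under simplified assumptions. `Lookup`, `LookupState`, and `on_peer_disconnected` below are hypothetical and only model the rule described above: after removing the disconnected peer, a lookup with no peers left that is still awaiting a download will never make progress, so it is dropped. Lookups that are mid-download are kept, since they are expected to receive an error event for their request.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for illustration only.
pub type PeerId = u32;
pub type LookupId = u32;

#[derive(Debug, PartialEq)]
pub enum LookupState {
    AwaitingDownload,
    Downloading,
    Processing,
}

pub struct Lookup {
    pub peers: HashSet<PeerId>,
    pub state: LookupState,
}

/// Removes `peer` from every lookup and drops lookups that can no longer progress.
/// Returns the ids of the dropped lookups in ascending order.
pub fn on_peer_disconnected(
    lookups: &mut HashMap<LookupId, Lookup>,
    peer: PeerId,
) -> Vec<LookupId> {
    let mut dropped = Vec::new();
    lookups.retain(|id, lookup| {
        lookup.peers.remove(&peer);
        let stuck = lookup.peers.is_empty() && lookup.state == LookupState::AwaitingDownload;
        if stuck {
            dropped.push(*id);
        }
        !stuck
    });
    dropped.sort_unstable();
    dropped
}
```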

@pawanjay176 pawanjay176 removed the waiting-on-author The reviewer has suggested changes and awaits their implementation. label May 3, 2024
michaelsproul added a commit that referenced this pull request May 3, 2024
Squashed commit of the following:

commit 02f1b2d
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Fri May 3 10:17:42 2024 +0900

    Drop lookups after peer disconnect and not awaiting events

commit f5dc1a3
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 17:14:50 2024 +0900

    Expect RPCError::Disconnect to fail ongoing requests

commit 888f129
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 14:14:22 2024 +0900

    Report RPC Errors to the application on peer disconnections

    Co-authored-by: Age Manning <Age@AgeManning.com>
Collaborator Author

dapplion commented May 3, 2024

The RPCError events are never received by sync because of this condition:

if !self.peer_manager().is_connected(&peer_id) {
    debug!(
        self.log,
        "Ignoring rpc message of disconnecting peer";
        event
    );
    return None;
}

The tests in network/sync/block_lookups and lighthouse_network each pass, as they don't test the full integration.

@AgeManning there's a lot of code in this function that is currently not expecting events for disconnected peers. Would it be best to just allow events for disconnected peers if the event type is RPCError::Disconnected?
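
One way the proposed exception could look, as a sketch only. `RpcEvent` and `should_forward` are hypothetical names, and the real handler deals with many more event kinds. The point is just that events for a peer already considered disconnected are still dropped, except for the one error that reports the disconnect on an in-flight request.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum RPCError {
    Disconnected,
    StreamTimeout,
}

/// Hypothetical, simplified RPC event.
#[derive(Debug, Clone, PartialEq)]
pub enum RpcEvent {
    Response { id: u64 },
    Failed { id: u64, error: RPCError },
}

/// Decides whether an event should be passed on to the application.
/// Events from connected peers always pass. For a disconnected peer, only a
/// `Disconnected` failure passes, so sync still learns that its request failed.
pub fn should_forward(peer_is_connected: bool, event: &RpcEvent) -> bool {
    if peer_is_connected {
        return true;
    }
    matches!(event, RpcEvent::Failed { error: RPCError::Disconnected, .. })
}
```

Keeping the exception this narrow leaves the rest of the function free to keep assuming it never sees other events for disconnected peers.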

Member

@AgeManning AgeManning left a comment

The latest changes, allowing RpcError::Disconnected to propagate, look good to me.

To be clear, this edge case happens when:

  1. We make a request
  2. The peer disconnects ungracefully
  3. There is a race between receiving the RpcError::Disconnected from the RPC handler and the Swarm peer-disconnected message. Potentially we always lose this race, and the peer manager considers the peer disconnected before we can read the final handler message.

The only issue I see here is that it's going to break an intuitive construct we previously had, which was that the last log/message we ever see from a peer is "Peer Disconnected".

After this change, we can see logs and messages after the peer has disconnected.

i.e.
"Peer Disconnected"
"RPC Error::Disconnected"

I don't immediately see a solution to this, because the ordering is coming from the swarm which we don't have much control over here.

@realbigsean
Member

@mergify queue

mergify bot commented May 6, 2024

queue

The pull request has been merged automatically at b87c36a

@mergify mergify bot merged commit b87c36a into sigp:unstable May 6, 2024
27 checks passed
@dapplion dapplion deleted the rpc-error-on-disconnect branch May 7, 2024 02:05
Labels
v5.2.0 Q2 2024

4 participants