-
Notifications
You must be signed in to change notification settings - Fork 743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report RPC Errors to the application on peer disconnections #5680
Conversation
Co-authored-by: Age Manning <Age@AgeManning.com>
Squashed commit of the following: commit f5dc1a3 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 17:14:50 2024 +0900 Expect RPCError::Disconnect to fail ongoing requests commit 888f129 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 14:14:22 2024 +0900 Report RPC Errors to the application on peer disconnections Co-authored-by: Age Manning <Age@AgeManning.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch, I just have a question
if matches!( | ||
self.state(), | ||
BackFillState::Failed | BackFillState::NotRequired | ||
) { | ||
return Ok(()); | ||
} | ||
|
||
if let Some(batch_ids) = self.active_requests.remove(peer_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great simplification. The repeated logic was a source of many seen/unseen bugs earlier. Kudos for having the big picture and spotting that this can be removed 🙌
My only concern with this is that in inject_error
, we call batch.download_failed(true)
instead of false
which we shouldn't be doing for disconnections maybe? Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Should also be noted that these disconnects are ungraceful.
i.e When lighthouse disconnects from a peer, it will wait to try and fulfill all its requests. It wont just drop the connection. In fact, a stream timeout will occur before a disconnection in a graceful disconnect.
The error peer disconnect should only happen when a peer drops the connection without fulfilling a request (which lighthouse doesn't do unless there is a network error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, current stable assumes that peer disconnection == failed download. But this RPCError::Disconnect
error will only fire if there is an active outgoing request that gets terminated ungracefully.
Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of stuff.
Should only apply if:
- we initiate request to peer A
- peer A disconnects ungracefully before completing request
- we initiate a retry request to peer B
- peer B disconnects ungracefully before completing request
This should not happen frequently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough
for id in batch_ids { | ||
if let Some(batch) = self.batches.get_mut(&id) { | ||
if let BatchOperationOutcome::Failed { blacklist } = | ||
batch.download_failed(true)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment as above regarding marking the batch as failed/not failed.
In forward sync, this might mean potentially not retrying a valid chain because the peers on the good chain are disconnecting.
sender.send_request(peer_id, 42, rpc_request.clone()); | ||
} | ||
NetworkEvent::RPCFailed { error, id: 42, .. } => match error { | ||
RPCError::Disconnected => return, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test should make sure we only get to this branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this branch is the only way to break out of the loop, not hitting this branch will timeout the test. I considered adding something explicit but it feels redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick look. This looks good to me. I like the simplification.
The errors that we now see are from ungraceful disconnects, which probably should be punished and treated like an RPC error, imo
if matches!( | ||
self.state(), | ||
BackFillState::Failed | BackFillState::NotRequired | ||
) { | ||
return Ok(()); | ||
} | ||
|
||
if let Some(batch_ids) = self.active_requests.remove(peer_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Should also be noted that these disconnects are ungraceful.
i.e When lighthouse disconnects from a peer, it will wait to try and fulfill all its requests. It wont just drop the connection. In fact, a stream timeout will occur before a disconnection in a graceful disconnect.
The error peer disconnect should only happen when a peer drops the connection without fulfilling a request (which lighthouse doesn't do unless there is a network error).
Squashed commit of the following: commit f5dc1a3 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 17:14:50 2024 +0900 Expect RPCError::Disconnect to fail ongoing requests commit 888f129 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 14:14:22 2024 +0900 Report RPC Errors to the application on peer disconnections Co-authored-by: Age Manning <Age@AgeManning.com>
@realbigsean noted that a lookup can get stuck if it has no available peers and is awaiting a download. This case should never happen with current code as a lookup is never left in AwaitingDownload state. However, for completeness I have added a check to drop lookups in that case in 02f1b2d |
Squashed commit of the following: commit 02f1b2d Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Fri May 3 10:17:42 2024 +0900 Drop lookups after peer disconnect and not awaiting events commit f5dc1a3 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 17:14:50 2024 +0900 Expect RPCError::Disconnect to fail ongoing requests commit 888f129 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 14:14:22 2024 +0900 Report RPC Errors to the application on peer disconnections Co-authored-by: Age Manning <Age@AgeManning.com>
The RPCError events are never received by sync due to this condition here lighthouse/beacon_node/lighthouse_network/src/service/mod.rs Lines 1370 to 1377 in 3058b96
The tests on network/sync/block_lookups and lighthouse_network pass respectively as they don't test the full integration. @AgeManning there's a lot of code in this function that is currently not expecting events for disconnected peers. Would be best to just allow events for disconnected peers if the event type if RPCError::Disconnect? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest changes allowing the RpcError::Disconnected to propagate, looks good to me.
To be clear, this edge case happens where:
- We make a request
- The peer disconnects ungracefully
- There is a race between receiving the RpcError::Disconnected from the rpc handler and the Swarm peer-discconected message. Potentially we always lose this race, and the peer manager considers the peer disconnected before we can read the final handler message.
The only issue I see here, is that it's going to break a previously intuitive construct we previously had, which was that the last log/message we ever see from a peer is "Peer Disconnected".
After this change, we can see logs and messages after the peer has disconnected.
i.e
"Peer Disconnected"
"RPC Error::Disconnected"
I don't immediately see a solution to this, because the ordering is coming from the swarm which we don't have much control over here.
Co-authored-by: Age Manning <Age@AgeManning.com>
@mergify queue |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at b87c36a |
Issue Addressed
Extends
As we are overhaling some internal RPC infrastructure, a desired feature is to report peer disconnects on RPC requests.
This PR should report an RPCError(Disconnected) if a connection is terminated whilst an RPC request is underway.
Proposed Changes
RPCError::Disconnect
to any outbound streams with a disconnecting peerinject_error
to handle it