-
Notifications
You must be signed in to change notification settings - Fork 743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report RPC Errors to the application on peer disconnections #5680
Changes from all commits
888f129
f5dc1a3
02f1b2d
a8d21e1
b0fe7bc
568db57
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -307,49 +307,15 @@ impl<T: BeaconChainTypes> BackFillSync<T> { | |
/// A peer has disconnected. | ||
/// If the peer has active batches, those are considered failed and re-requested. | ||
#[must_use = "A failure here indicates the backfill sync has failed and the global sync state should be updated"] | ||
pub fn peer_disconnected( | ||
&mut self, | ||
peer_id: &PeerId, | ||
network: &mut SyncNetworkContext<T>, | ||
) -> Result<(), BackFillError> { | ||
pub fn peer_disconnected(&mut self, peer_id: &PeerId) -> Result<(), BackFillError> { | ||
if matches!( | ||
self.state(), | ||
BackFillState::Failed | BackFillState::NotRequired | ||
) { | ||
return Ok(()); | ||
} | ||
|
||
if let Some(batch_ids) = self.active_requests.remove(peer_id) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a great simplification. The repeated logic was a source of many seen/unseen bugs earlier. Kudos for having the big picture and spotting that this can be removed 🙌 My only concern with this is that in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. Should also be noted that these disconnects are ungraceful. i.e When lighthouse disconnects from a peer, it will wait to try and fulfill all its requests. It wont just drop the connection. In fact, a stream timeout will occur before a disconnection in a graceful disconnect. The error peer disconnect should only happen when a peer drops the connection without fulfilling a request (which lighthouse doesn't do unless there is a network error). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, current stable assumes that peer disconnection == failed download. But this
Should only apply if:
This should not happen frequently There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fair enough |
||
// fail the batches | ||
for id in batch_ids { | ||
if let Some(batch) = self.batches.get_mut(&id) { | ||
match batch.download_failed(false) { | ||
Ok(BatchOperationOutcome::Failed { blacklist: _ }) => { | ||
self.fail_sync(BackFillError::BatchDownloadFailed(id))?; | ||
} | ||
Ok(BatchOperationOutcome::Continue) => {} | ||
Err(e) => { | ||
self.fail_sync(BackFillError::BatchInvalidState(id, e.0))?; | ||
} | ||
} | ||
// If we have run out of peers in which to retry this batch, the backfill state | ||
// transitions to a paused state. | ||
// We still need to reset the state for all the affected batches, so we should not | ||
// short circuit early | ||
if self.retry_batch_download(network, id).is_err() { | ||
debug!( | ||
self.log, | ||
"Batch could not be retried"; | ||
"batch_id" => id, | ||
"error" => "no synced peers" | ||
); | ||
} | ||
} else { | ||
debug!(self.log, "Batch not found while removing peer"; | ||
"peer" => %peer_id, "batch" => id) | ||
} | ||
} | ||
} | ||
self.active_requests.remove(peer_id); | ||
|
||
// Remove the peer from the participation list | ||
self.participating_peers.remove(peer_id); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -174,30 +174,8 @@ impl<T: BeaconChainTypes> SyncingChain<T> { | |
|
||
/// Removes a peer from the chain. | ||
/// If the peer has active batches, those are considered failed and re-requested. | ||
pub fn remove_peer( | ||
&mut self, | ||
peer_id: &PeerId, | ||
network: &mut SyncNetworkContext<T>, | ||
) -> ProcessingResult { | ||
if let Some(batch_ids) = self.peers.remove(peer_id) { | ||
// fail the batches | ||
for id in batch_ids { | ||
if let Some(batch) = self.batches.get_mut(&id) { | ||
if let BatchOperationOutcome::Failed { blacklist } = | ||
batch.download_failed(true)? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. similar comment as above regarding marking the batch as failed/not failed. In forward sync, this might mean potentially not retrying a valid chain because the peers on the good chain are disconnecting. |
||
{ | ||
return Err(RemoveChain::ChainFailed { | ||
blacklist, | ||
failing_batch: id, | ||
}); | ||
} | ||
self.retry_batch_download(network, id)?; | ||
} else { | ||
debug!(self.log, "Batch not found while removing peer"; | ||
"peer" => %peer_id, "batch" => id) | ||
} | ||
} | ||
} | ||
pub fn remove_peer(&mut self, peer_id: &PeerId) -> ProcessingResult { | ||
self.peers.remove(peer_id); | ||
|
||
if self.peers.is_empty() { | ||
Err(RemoveChain::EmptyPeerPool) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test should make sure we only get to this branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this branch is the only way to break out of the loop, not hitting this branch will timeout the test. I considered adding something explicit but it feels redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right.