-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update syncing logic to fix duplicate block requests #3410
base: staging
Are you sure you want to change the base?
Update syncing logic to fix duplicate block requests #3410
Conversation
9f546d8
to
88d5574
Compare
node/sync/src/block_sync.rs
Outdated
} | ||
}, | ||
Err(err) => { | ||
warn!("{err}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: this is a change that might surprise users who relied on the format of the old log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good find! I've added more expressive warnings similar to the previous code here: d6e4f73
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with some nits
Co-authored-by: ljedrz <3750347+ljedrz@users.noreply.github.com>
} | ||
|
||
/// Removes the block request for the given peer IP, if it exists. | ||
#[allow(dead_code)] | ||
fn remove_block_request_to_peer(&self, peer_ip: &SocketAddr, height: u32) { | ||
let mut can_revoke = self.responses.read().get(&height).is_none(); | ||
let mut can_revoke = self.process_next_block(height).is_none(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is remove_block_request_to_peer
now calling process_next_block
instead of just removing it from the responses
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I answered my own question. You are now nesting the response removal inside process_next_block
and adding some additional checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while I also believe it to be fine, an adjustment to the function name or extending the doc comment could be beneficial here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think I'll feel a bit more comfortable with the change after we run our gambit of sync tests and malice tests, but otherwise logic looks good
Motivation
As issue #3404 described, validators send out duplicate block requests when syncing. This PR updates the syncing logic to make sure we only remove a block response after it was added to the ledger or if we encountered an error.
For background, currently, syncing works as follows:
BlockSync
requests the blocks, and stores it in the internalresponses
map.Sync
callsblock_sync.process_next_block(current_height)
to get the next block to process. This removes it fromBlockSync
’sresponses
and returns it toSync
.Sync
performs the checks, and if successful, adds the block to theledger
. This also updates BlockSync’scanon
.The issue is that
BlockSync
‘s view of thecanon
is only updated afterSync
is done processing blocks (and adding to the ledger). There’s a window whereBlockSync
is unaware of blocks that are pending validation inSync
. Thus, it re-requests them.This PR adjusts the block processing interface to the
Sync
module as follows:process_next_block
READS a block from a particular heightremove_block_response
REMOVES a block from a particular height, which MUST be called after advancing completed or failedTest Plan
Ran it locally and verified it removes the duplicate block request.
Also ran it with light load in a cloud devnet, and verified syncing works for validators and clients.
The following figure shows a syncing process before the fix, with a longer pause until a duplicate request is generated:
The following figure shows the syncing process with the fix (note there is no double request and no delay, as such, the syncing is much faster):
Closes #3404