More conservative selection of non-finalized peers #7086

Merged
merged 6 commits into master from init-sync-non-finalized-loop on Aug 24, 2020

Conversation

@farazdagi farazdagi (Contributor) commented Aug 23, 2020

What type of PR is this?

Bug fix

What does this PR do? Why is it needed?

  • Users have reported that sometimes, towards the end of the 2nd phase of init-sync (when we sync to a non-finalized head), they get stuck processing what seems to be the very same set of blocks (a loop).
  • It is hard to reproduce, so this PR is my best attempt at resolving the issue: instead of requiring only min-sync-peers peers beyond the node's head slot, we require double that number. This is done in exactly two places (all other callers of BestNonFinalized() still use the absolute minimum number of peers to proceed); see the sketch after this list:
    • When deciding whether we need to resync: we want to be sure there are enough peers (even if some blink) to actually resync, so a resync is forced only if double the minimum number of sync peers report a higher epoch (and the delta between the head and those peers is >= 1 epoch).
    • When the queue decides whether peers have moved even further ahead during sync and highestExpectedSlot needs to be updated (this update is what causes the loop towards the end of init-sync). We now require double the minimum number of sync peers to agree on some future slot before we update highestExpectedSlot (and thereby force the queue to keep operating).
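
To make the shape of the change concrete, here is a minimal, self-contained Go sketch of both checks. bestNonFinalized, slotOfEpoch, and the canned return values are illustrative stand-ins rather than Prysm's actual code; only the doubled MinimumSyncPeers threshold mirrors this PR:

package main

import "fmt"

const minimumSyncPeers = 3 // default value, per the discussion below

// bestNonFinalized is a stub for p2p.Peers().BestNonFinalized: the real
// method returns the highest epoch that at least minPeers connected peers
// report being at, or 0 if no epoch reaches that quorum.
func bestNonFinalized(minPeers int, ourEpoch uint64) uint64 {
	return ourEpoch + 2 // canned value so the sketch runs
}

// slotOfEpoch converts an epoch to its first slot (32 slots per epoch on mainnet).
func slotOfEpoch(epoch uint64) uint64 { return epoch * 32 }

func main() {
	syncedEpoch := uint64(100)

	// 1) Resync trigger: require 2x the minimum sync peers to agree on a
	// higher epoch, and require being more than one epoch behind.
	if highestEpoch := bestNonFinalized(minimumSyncPeers*2, syncedEpoch); highestEpoch > syncedEpoch+1 {
		fmt.Println("forcing resync towards epoch", highestEpoch)
	}

	// 2) Queue: only extend highestExpectedSlot when the same doubled
	// quorum agrees that peers have moved further ahead during sync.
	highestExpectedSlot := uint64(3200)
	if s := slotOfEpoch(bestNonFinalized(minimumSyncPeers*2, syncedEpoch)); s > highestExpectedSlot {
		highestExpectedSlot = s
		fmt.Println("extending highestExpectedSlot to", highestExpectedSlot)
	}
}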

Which issues(s) does this PR fix?

N/A

Other notes for review

  • Some looping will still remain (by design): if, at the point where highestExpectedSlot is about to be checked, 2*min-sync-peers agree on some future slot, highestExpectedSlot is updated; but if the queue then requests some range and doesn't find enough peers, it will retry noRequiredPeersErrMaxRetries (1000) times at noRequiredPeersErrRefreshInterval (15 sec) intervals (roughly 4 hours, or until highestExpectedSlot is reached, whichever comes first; sketched below). This is how we make sure that in times of long non-finality (who remembers the last Medalla incident?) the queue is robust to an unhealthy network state and makes enough attempts before giving up.
  • In a healthy network, highestExpectedSlot, even if updated, is expected to be caught up with quickly (enough nodes have those blocks!), and the queue exits normally.
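
For illustration, a minimal Go sketch of the bounded retry behavior described above. The constant names and values are taken from the PR notes; fetchBatch and the error value are hypothetical stand-ins for the queue's actual range requests:

package main

import (
	"errors"
	"fmt"
	"time"
)

// Retry bounds as described in the PR notes.
const (
	noRequiredPeersErrMaxRetries      = 1000
	noRequiredPeersErrRefreshInterval = 15 * time.Second
)

var errNoRequiredPeers = errors.New("no required peers")

// fetchBatch is a hypothetical stand-in for the queue asking peers for a
// range of blocks; it fails when fewer than min-sync-peers can serve it.
func fetchBatch() error {
	return errNoRequiredPeers
}

func main() {
	// 1000 retries * 15s is roughly 4.2 hours of sustained peer shortage
	// before the queue gives up (or until highestExpectedSlot is reached).
	for retries := 0; retries < noRequiredPeersErrMaxRetries; retries++ {
		err := fetchBatch()
		if err == nil {
			fmt.Println("batch fetched, queue continues")
			return
		}
		if errors.Is(err, errNoRequiredPeers) {
			time.Sleep(noRequiredPeersErrRefreshInterval)
			continue
		}
		fmt.Println("fatal:", err) // unexpected error: stop retrying
		return
	}
	fmt.Println("giving up after max retries")
}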

@farazdagi farazdagi self-assigned this Aug 23, 2020
@farazdagi farazdagi added the Sync label (Sync (regular, initial, checkpoint) related issues) Aug 23, 2020
@farazdagi farazdagi force-pushed the init-sync-non-finalized-loop branch from 7eca454 to c56a392 on August 23, 2020 10:32
@farazdagi farazdagi marked this pull request as ready for review August 23, 2020 10:35
@farazdagi farazdagi requested a review from a team as a code owner August 23, 2020 10:35
@codecov codecov bot commented Aug 23, 2020

Codecov Report

Merging #7086 into master will increase coverage by 2.23%.
The diff coverage is 72.77%.

@@            Coverage Diff             @@
##           master    #7086      +/-   ##
==========================================
+ Coverage   60.07%   62.30%   +2.23%     
==========================================
  Files         323      406      +83     
  Lines       27422    31596    +4174     
==========================================
+ Hits        16473    19686    +3213     
- Misses       8733     9122     +389     
- Partials     2216     2788     +572     

@nisdas nisdas (Member) left a comment

Not sure if this will make it better; it could potentially make it worse. BestNonFinalized requires a minimum number of peers at some target epoch, and if that isn't the case it returns zero. This PR now doubles that requirement, which makes it 30 for all Prysm nodes. Most Prysm nodes have a max peer count of only 30, so this requires all of your connected peers to be on the same target epoch to work.
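
For context, a minimal sketch of the quorum behavior being debated here, assuming BestNonFinalized tallies the head epochs reported by connected peers and returns 0 when no single epoch reaches the requested quorum (the tally details are illustrative, not Prysm's actual implementation):

package main

import "fmt"

// bestNonFinalized is an illustrative stand-in: it returns the highest
// epoch reported by at least minPeers peers beyond ourEpoch, or 0 if no
// epoch reaches that quorum.
func bestNonFinalized(peerEpochs []uint64, minPeers int, ourEpoch uint64) uint64 {
	votes := make(map[uint64]int)
	for _, e := range peerEpochs {
		if e > ourEpoch {
			votes[e]++
		}
	}
	var best uint64
	for epoch, n := range votes {
		if n >= minPeers && epoch > best {
			best = epoch
		}
	}
	return best
}

func main() {
	ourEpoch := uint64(10)
	peerEpochs := []uint64{12, 12, 12, 12, 12, 11}

	// With the default MinimumSyncPeers (3), doubling requires 6 peers on
	// the same target epoch; with only 5 votes for epoch 12 this returns 0.
	fmt.Println(bestNonFinalized(peerEpochs, 3*2, ourEpoch)) // 0
	fmt.Println(bestNonFinalized(peerEpochs, 3, ourEpoch))   // 12
}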

// actual resyncing).
highestEpoch, _ := s.p2p.Peers().BestNonFinalized(flags.Get().MinimumSyncPeers*2, syncedEpoch)
// Check if the current node is more than 1 epoch behind.
if (highestEpoch - 1) > syncedEpoch {
A Member commented on this diff:

This will overflow if highestEpoch is 0: epochs are unsigned, so highestEpoch - 1 wraps around to the maximum uint64 value and the check always passes.

@farazdagi farazdagi (Contributor, Author) replied:

  1. Will fix the overflow now.
  2. From Discord:

Nishant, regarding your comment (not the overflow one, but the one above): how come BestNonFinalized requires 30 peers now? Before, BestNonFinalized took just the minimum number of peers to sync (3 by default), but when peers are blinking that posed a problem, so doubling (to 6) allows some leeway.
I am not sure I understand the concerns in your comment.
The problem we had is that it was "too easy" to trigger a resync or an update to the queue's highestExpectedSlot; now, with double the minimum sync peers (3*2 = 6 by default), the requirement is a bit harder to meet, which should make for a more robust system.
Everywhere except those two places we are still quite OK with the minimum number of peers (3 by default), and sync progresses. So the only affected parts are: 1) the "resync trigger" and 2) the "force stay in queue" check, and it now takes a few more peers to convince the system that it is the right move.
Am I missing something here?
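
For reference, a minimal sketch of the unsigned wrap and one overflow-safe rewrite, assuming epochs are uint64 (whether the final commit used exactly this form is not confirmed here):

package main

import "fmt"

func main() {
	var highestEpoch, syncedEpoch uint64 = 0, 5

	// Buggy form: highestEpoch is uint64, so when it is 0 the subtraction
	// wraps to math.MaxUint64 and the branch is always taken.
	if (highestEpoch - 1) > syncedEpoch {
		fmt.Println("buggy check fires even though no peers reported progress")
	}

	// Overflow-safe form: move the subtraction to the other side of the
	// comparison, keeping the same meaning for highestEpoch >= 1.
	if highestEpoch > syncedEpoch+1 {
		fmt.Println("safe check: node is more than 1 epoch behind")
	}
}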

@farazdagi farazdagi requested a review from nisdas August 24, 2020 10:06
@prylabs-bulldozer prylabs-bulldozer bot merged commit 5c9830f into master Aug 24, 2020
@farazdagi farazdagi deleted the init-sync-non-finalized-loop branch August 25, 2020 15:47