More conservative selection of non-finalized peers #7086

Merged
merged 6 commits into master from init-sync-non-finalized-loop on Aug 24, 2020

Conversation

@farazdagi farazdagi (Contributor) commented Aug 23, 2020

What type of PR is this?

Bug fix

What does this PR do? Why is it needed?

  • Users have reported that sometimes, towards the end of the 2nd phase of init-sync (when we sync to a non-finalized head), they get stuck processing what seems to be the very same set of blocks (a loop).
  • It is hard to reproduce, so this PR is my best attempt at resolving the issue: instead of requiring only min-sync-peers peers beyond the node's head slot, we require double that number. This is done in exactly two places (all other callers of BestNonFinalized() still use the absolute minimum number of peers to proceed); see the sketch after this list:
    • When deciding whether we need to resync: we want to be sure there are enough peers (even if some blink) to actually resync, so a resync is forced only if double the minimum number of sync peers report a higher epoch (and the delta between the head and those peers is >= 1 epoch).
    • When the queue decides whether peers have moved even further ahead during sync and highestExpectedSlot needs to be updated (this update is what causes the loop towards the end of init-sync). We now require double the minimum number of sync peers to agree on some future slot before we update highestExpectedSlot (and thereby force the queue to keep operating).
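
To make the shape of the change concrete, here is a minimal, self-contained Go sketch of both checks. bestNonFinalized, slotOfEpoch, and the canned return values are illustrative stand-ins rather than Prysm's actual code; only the doubled MinimumSyncPeers threshold mirrors this PR:

package main

import "fmt"

const minimumSyncPeers = 3 // default value, per the discussion below

// bestNonFinalized is a stub for p2p.Peers().BestNonFinalized: the real
// method returns the highest epoch that at least minPeers connected peers
// report being at, or 0 if no epoch reaches that quorum.
func bestNonFinalized(minPeers int, ourEpoch uint64) uint64 {
	return ourEpoch + 2 // canned value so the sketch runs
}

// slotOfEpoch converts an epoch to its first slot (32 slots per epoch on mainnet).
func slotOfEpoch(epoch uint64) uint64 { return epoch * 32 }

func main() {
	syncedEpoch := uint64(100)

	// 1) Resync trigger: require 2x the minimum sync peers to agree on a
	// higher epoch, and require being more than one epoch behind.
	if highestEpoch := bestNonFinalized(minimumSyncPeers*2, syncedEpoch); highestEpoch > syncedEpoch+1 {
		fmt.Println("forcing resync towards epoch", highestEpoch)
	}

	// 2) Queue: only extend highestExpectedSlot when the same doubled
	// quorum agrees that peers have moved further ahead during sync.
	highestExpectedSlot := uint64(3200)
	if s := slotOfEpoch(bestNonFinalized(minimumSyncPeers*2, syncedEpoch)); s > highestExpectedSlot {
		highestExpectedSlot = s
		fmt.Println("extending highestExpectedSlot to", highestExpectedSlot)
	}
}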

Which issues(s) does this PR fix?

N/A

Other notes for review

  • Some looping will still remain (by design): if, at the point where highestExpectedSlot is about to be checked, 2*min-sync-peers agree on some future slot, highestExpectedSlot is updated; but if the queue then requests some range and doesn't find enough peers, it will retry noRequiredPeersErrMaxRetries (1000) times at noRequiredPeersErrRefreshInterval (15 sec) intervals (roughly 4 hours, or until highestExpectedSlot is reached, whichever comes first; sketched below). This is how we make sure that in times of long non-finality (who remembers the last Medalla incident?) the queue is robust to an unhealthy network state and makes enough attempts before giving up.
  • In a healthy network, highestExpectedSlot, even if updated, is expected to be caught up with quickly (enough nodes have those blocks!), and the queue exits normally.
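
For illustration, a minimal Go sketch of the bounded retry behavior described above. The constant names and values are taken from the PR notes; fetchBatch and the error value are hypothetical stand-ins for the queue's actual range requests:

package main

import (
	"errors"
	"fmt"
	"time"
)

// Retry bounds as described in the PR notes.
const (
	noRequiredPeersErrMaxRetries      = 1000
	noRequiredPeersErrRefreshInterval = 15 * time.Second
)

var errNoRequiredPeers = errors.New("no required peers")

// fetchBatch is a hypothetical stand-in for the queue asking peers for a
// range of blocks; it fails when fewer than min-sync-peers can serve it.
func fetchBatch() error {
	return errNoRequiredPeers
}

func main() {
	// 1000 retries * 15s is roughly 4.2 hours of sustained peer shortage
	// before the queue gives up (or until highestExpectedSlot is reached).
	for retries := 0; retries < noRequiredPeersErrMaxRetries; retries++ {
		err := fetchBatch()
		if err == nil {
			fmt.Println("batch fetched, queue continues")
			return
		}
		if errors.Is(err, errNoRequiredPeers) {
			time.Sleep(noRequiredPeersErrRefreshInterval)
			continue
		}
		fmt.Println("fatal:", err) // unexpected error: stop retrying
		return
	}
	fmt.Println("giving up after max retries")
}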

@farazdagi farazdagi self-assigned this Aug 23, 2020
@farazdagi farazdagi added the Sync label (Sync (regular, initial, checkpoint) related issues) Aug 23, 2020
@farazdagi farazdagi force-pushed the init-sync-non-finalized-loop branch from 7eca454 to c56a392 on August 23, 2020 10:32
@farazdagi farazdagi marked this pull request as ready for review August 23, 2020 10:35
@farazdagi farazdagi requested a review from a team as a code owner August 23, 2020 10:35
@codecov codecov bot commented Aug 23, 2020

Codecov Report

Merging #7086 into master will increase coverage by 2.23%.
The diff coverage is 72.77%.

@@            Coverage Diff             @@
##           master    #7086      +/-   ##
==========================================
+ Coverage   60.07%   62.30%   +2.23%     
==========================================
  Files         323      406      +83     
  Lines       27422    31596    +4174     
==========================================
+ Hits        16473    19686    +3213     
- Misses       8733     9122     +389     
- Partials     2216     2788     +572     

@nisdas nisdas (Member) left a comment

Not sure if this will make it better; it could potentially make it worse. BestNonFinalized requires a minimum number of peers at some target epoch, and if that isn't the case it returns zero. This PR now doubles that requirement, which makes it 30 for all Prysm nodes. Most Prysm nodes have a max peer count of only 30, so this requires all of your connected peers to be on the same target epoch to work.
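
For context, a minimal sketch of the quorum behavior being debated here, assuming BestNonFinalized tallies the head epochs reported by connected peers and returns 0 when no single epoch reaches the requested quorum (the tally details are illustrative, not Prysm's actual implementation):

package main

import "fmt"

// bestNonFinalized is an illustrative stand-in: it returns the highest
// epoch reported by at least minPeers peers beyond ourEpoch, or 0 if no
// epoch reaches that quorum.
func bestNonFinalized(peerEpochs []uint64, minPeers int, ourEpoch uint64) uint64 {
	votes := make(map[uint64]int)
	for _, e := range peerEpochs {
		if e > ourEpoch {
			votes[e]++
		}
	}
	var best uint64
	for epoch, n := range votes {
		if n >= minPeers && epoch > best {
			best = epoch
		}
	}
	return best
}

func main() {
	ourEpoch := uint64(10)
	peerEpochs := []uint64{12, 12, 12, 12, 12, 11}

	// With the default MinimumSyncPeers (3), doubling requires 6 peers on
	// the same target epoch; with only 5 votes for epoch 12 this returns 0.
	fmt.Println(bestNonFinalized(peerEpochs, 3*2, ourEpoch)) // 0
	fmt.Println(bestNonFinalized(peerEpochs, 3, ourEpoch))   // 12
}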

// actual resyncing).
highestEpoch, _ := s.p2p.Peers().BestNonFinalized(flags.Get().MinimumSyncPeers*2, syncedEpoch)
// Check if the current node is more than 1 epoch behind.
if (highestEpoch - 1) > syncedEpoch {
A Member commented on this diff:

This will overflow if highestEpoch is 0: epochs are unsigned, so highestEpoch - 1 wraps around to the maximum uint64 value and the check always passes.

@farazdagi farazdagi (Contributor, Author) replied:

  1. Will fix the overflow now.
  2. From Discord:

Nishant, regarding your comment (not the overflow one, but the one above): how come BestNonFinalized requires 30 peers now? Before, BestNonFinalized took just the minimum number of peers to sync (3 by default), but when peers are blinking that posed a problem, so doubling (to 6) allows some leeway.
I am not sure I understand the concerns in your comment.
The problem we had is that it was "too easy" to trigger a resync or an update to the queue's highestExpectedSlot; now, with double the minimum sync peers (3*2 = 6 by default), the requirement is a bit harder to meet, which should make for a more robust system.
Everywhere except those two places we are still quite OK with the minimum number of peers (3 by default), and sync progresses. So the only affected parts are: 1) the "resync trigger" and 2) the "force stay in queue" check, and it now takes a few more peers to convince the system that it is the right move.
Am I missing something here?
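
For reference, a minimal sketch of the unsigned wrap and one overflow-safe rewrite, assuming epochs are uint64 (whether the final commit used exactly this form is not confirmed here):

package main

import "fmt"

func main() {
	var highestEpoch, syncedEpoch uint64 = 0, 5

	// Buggy form: highestEpoch is uint64, so when it is 0 the subtraction
	// wraps to math.MaxUint64 and the branch is always taken.
	if (highestEpoch - 1) > syncedEpoch {
		fmt.Println("buggy check fires even though no peers reported progress")
	}

	// Overflow-safe form: move the subtraction to the other side of the
	// comparison, keeping the same meaning for highestEpoch >= 1.
	if highestEpoch > syncedEpoch+1 {
		fmt.Println("safe check: node is more than 1 epoch behind")
	}
}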

@farazdagi farazdagi requested a review from nisdas August 24, 2020 10:06
@prylabs-bulldozer prylabs-bulldozer bot merged commit 5c9830f into master Aug 24, 2020
@farazdagi farazdagi deleted the init-sync-non-finalized-loop branch August 25, 2020 15:47