More conservative selection of non-finalized peers #7086
Conversation
7eca454 to c56a392
Codecov Report
@@ Coverage Diff @@
## master #7086 +/- ##
==========================================
+ Coverage 60.07% 62.30% +2.23%
==========================================
Files 323 406 +83
Lines 27422 31596 +4174
==========================================
+ Hits 16473 19686 +3213
- Misses 8733 9122 +389
- Partials 2216 2788 +572
Not sure if this will make it better; it could potentially make it worse. BestNonFinalized requires a minimum number of peers in some target epoch, and if that isn't the case it returns zero. This PR now doubles that requirement, which makes it 30 for all prysm nodes. Most prysm nodes only have a max peer size of 30, so this requires all your connected peers to be on the same target epoch to work.
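For illustration, here is a minimal, self-contained sketch of the behavior described in the comment above: peers must agree on some target epoch, otherwise zero is returned. The function name, signature, and vote-counting logic are simplified assumptions made up for this sketch, not prysm's actual BestNonFinalized implementation.

package main

import "fmt"

// bestNonFinalized is a hypothetical stand-in for the behavior described above:
// peers vote for a target epoch, and unless at least minPeers agree on a single
// epoch beyond ourEpoch, no usable target is found and zero is returned.
// This is NOT the actual prysm implementation, only an illustration of the concern.
func bestNonFinalized(peerEpochs []uint64, minPeers int, ourEpoch uint64) uint64 {
	votes := make(map[uint64]int)
	for _, e := range peerEpochs {
		if e > ourEpoch {
			votes[e]++
		}
	}
	var best uint64
	bestCount := 0
	for epoch, count := range votes {
		if count > bestCount || (count == bestCount && epoch > best) {
			best, bestCount = epoch, count
		}
	}
	if bestCount < minPeers {
		return 0 // not enough peers agree on any single target epoch
	}
	return best
}

func main() {
	peerEpochs := []uint64{101, 101, 101, 102, 100, 101}
	fmt.Println(bestNonFinalized(peerEpochs, 3, 99)) // 101: four peers agree, threshold met
	fmt.Println(bestNonFinalized(peerEpochs, 6, 99)) // 0: doubling the threshold can zero it out
}

With a doubled threshold, a temporary spread of peer target epochs is enough to make the helper report zero, which is the scenario the comment above is worried about.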
beacon-chain/sync/rpc_status.go
Outdated
// actual resyncing).
highestEpoch, _ := s.p2p.Peers().BestNonFinalized(flags.Get().MinimumSyncPeers*2, syncedEpoch)
// Check if the current node is more than 1 epoch behind.
if (highestEpoch - 1) > syncedEpoch {
This will overflow if highestEpoch is 0.
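To make the overflow concrete, a minimal sketch, assuming epochs are unsigned integers (as they are in prysm). The alternative comparison at the end only illustrates one underflow-free way to express the intent; it is not necessarily the exact fix adopted in this PR.

package main

import "fmt"

func main() {
	// When BestNonFinalized finds too few agreeing peers it returns 0, and
	// "highestEpoch - 1" then wraps around to a huge value, so the
	// "more than 1 epoch behind" check becomes trivially true.
	var highestEpoch, syncedEpoch uint64 = 0, 10

	fmt.Println(highestEpoch-1 > syncedEpoch) // true: 0 - 1 wraps to 18446744073709551615

	// One underflow-free way to express the same intent (illustration only):
	fmt.Println(highestEpoch > syncedEpoch+1) // false when highestEpoch == 0, as intended
}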
- Will fix the overflow now.
- From Discord:
Nishant, regarding your comment (not the overflow one, but the one above), how come BestNonFinalized requires 30 peers now? BestNonFinalized previously took just the minimum number of peers to sync (3 by default), but when peers are blinking that posed a problem, so doubling it (to 6) allows some leeway.
I am not sure I understand the concerns in your comment.
The problem we had was that it was "too easy" to trigger a resync or an update to the queue's highest finalized slot; now, with double the min sync peers (3*2=6 by default), the requirement is a bit harder to meet, which should make for a more robust system.
Everywhere except for those two places we are still quite OK with the min. number of peers (3 by default), and sync progresses. So the only affected parts are 1) the "resync trigger" and 2) "force stay in queue", and it takes a few more peers now to convince the system that it is the right move.
Am I missing something here?
…bs/prysm into init-sync-non-finalized-loop
What type of PR is this?
Bug fix
What does this PR do? Why is it needed?
The two places where BestNonFinalized() is used with the absolute min. number of peers to proceed are made more conservative: 1) the resync trigger, and 2) the check for whether an update to highestExpectedSlot needs to be done (the latter causes that loop towards the end of init-sync). We now require double the minimum sync peers to agree on some future slot before we update the highestExpectedSlot (and therefore force the queue to continue operating).
Which issue(s) does this PR fix?
N/A
Other notes for review
- The loop mentioned above can still happen when (right before highestExpectedSlot is about to be checked) we have 2*min.sync.peers agree on some future slot, and highestExpectedSlot is updated; but then, when the queue requests some range and doesn't find enough peers, it will loop noRequiredPeersErrMaxRetries (1000) times, with noRequiredPeersErrRefreshInterval (15 secs) intervals (~4 hours, or up to the point the highestExpectedSlot is reached, whichever comes first; see the quick check of this arithmetic below). This is how we make sure that in times of long non-finality (who remembers the last Medalla incident?) the queue is robust to a non-healthy network state and gives enough effort before giving up.
- In normal circumstances, highestExpectedSlot, even if updated, is expected to be caught up quickly (enough nodes with those blocks!), and the queue exits normally.
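A back-of-the-envelope check of the retry budget quoted above. The constant names mirror the PR notes, and the values (1000 retries, 15-second refresh interval) are taken from this description rather than from the current code.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Retry budget as described in the notes: 1000 retries, 15 seconds apart.
	const noRequiredPeersErrMaxRetries = 1000
	const noRequiredPeersErrRefreshInterval = 15 * time.Second

	total := noRequiredPeersErrMaxRetries * noRequiredPeersErrRefreshInterval
	fmt.Println(total) // 4h10m0s: the "~4 hours" the queue keeps retrying before giving up
}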