followLatest catchup algorithm does not work for small peer sets #6054
Comments
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635669): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.328379Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.328493Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635668): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.370168Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.370267Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).sync","level":"info","line":732,"msg":"Catchup Service: finished catching up, now at round 41635667 (previously 41635667). Total time catching up 1.936764984s.","name":"","time":"2024-07-06T14:32:16.370325Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":660,"msg":"It's been too long since our ledger advanced; resyncing","name":"","time":"2024-07-06T14:32:33.371646Z"}
Example logs from a testnet 3.25 follower. |
Also - IMHO there should be no scenario that causes the number of peers to go to zero. |
The 3.25 follower just hiccups on small peer sets. I am running smoothly with this dynamic backoff:
if s.followLatest {
    bo := max(followLatestBackoff, time.Duration(2000/pc)*(time.Millisecond))
    time.Sleep(bo)
} |
For some extra context: this looks very similar to what was observed in P2P catchup testing (there is no archival DNS for P2P) in a scenario with 4 nodes in DNS: 2 relays with a limited history and 2 relays with a full block history. I have not debugged it, but it looked like the peerSelector was punishing nodes that returned 404, advanced to the next class, did not find any suitable nodes there, and eventually aborted catchup. |
For people that encounter this in the future, the configuration |
Status
When a 3.25 follower node runs with a small peer set (e.g. testnet or a colocated relay), the algorithm often aborts sync without triggering an immediate resync, causing a 15-second pause in tip following for Conduit.
Expected
Sync should not abort when following the tip with small peer sets.
Solution
Maybe peers should not be downranked when they return 404 in followLatest mode?
3.21 does not seem to experience this issue while 3.25 does, but that is just an observation without a proper test.