followLatest catchup algorithm does not work for small peer sets #6054

Closed
urtho opened this issue Jul 6, 2024 · 6 comments

@urtho
Contributor

urtho commented Jul 6, 2024

Status

When a 3.25 follower node runs with a small peer set (e.g. testnet or a colocated relay), the catchup algorithm often aborts sync without immediately triggering a resync, causing a 15-second pause in tip following for Conduit.

Expected

Sync should not abort when following the tip with small peer sets.

Solution

Maybe peers should not be downranked when they return 404 in followLatest mode?
3.21 does not seem to experience this issue but 3.25 does, although that is just an observation without a proper test.

@urtho
Contributor Author

urtho commented Jul 6, 2024

{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635669): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.328379Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.328493Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635668): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.370168Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.370267Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).sync","level":"info","line":732,"msg":"Catchup Service: finished catching up, now at round 41635667 (previously 41635667). Total time catching up 1.936764984s.","name":"","time":"2024-07-06T14:32:16.370325Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":660,"msg":"It's been too long since our ledger advanced; resyncing","name":"","time":"2024-07-06T14:32:33.371646Z"}

Example logs from a testnet 3.25 follower.

@urtho
Contributor Author

urtho commented Jul 6, 2024

Also, IMHO no scenario should cause the number of peers to drop to zero.

@gmalouf
Contributor

gmalouf commented Jul 17, 2024

@urtho before we take any action on this, some of this behavior was introduced in #5836

@urtho
Contributor Author

urtho commented Jul 18, 2024

3.25 follower just hiccups on small peer sets.

I am running smoothly with this dynamic backoff, where pc is the peer count across all enabled peerSelectors.

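    if s.followLatest {
        // Back off for 2000ms divided by the peer count, but never for less
        // than followLatestBackoff. Assumes pc > 0 (integer division).
        bo := max(followLatestBackoff, time.Duration(2000/pc)*time.Millisecond)
        time.Sleep(bo)
    }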
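
For anyone who wants to try the numbers, here is a minimal standalone sketch of the same backoff calculation. It is not a patch against go-algorand: the function name dynamicBackoff, the 50ms value for followLatestBackoff, and the fallback for a zero peer count are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// followLatestBackoff is an ASSUMED minimum pause for this sketch. The real
// constant lives in go-algorand's catchup package and its value is not quoted
// in this thread.
const followLatestBackoff = 50 * time.Millisecond

// dynamicBackoff (hypothetical helper) returns 2000ms divided by the peer
// count, never less than followLatestBackoff. A non-positive peer count falls
// back to the full 2000ms so the integer division cannot panic.
func dynamicBackoff(peerCount int) time.Duration {
	if peerCount <= 0 {
		return 2000 * time.Millisecond
	}
	bo := time.Duration(2000/peerCount) * time.Millisecond
	if bo < followLatestBackoff {
		return followLatestBackoff
	}
	return bo
}

func main() {
	for _, pc := range []int{1, 4, 8, 64} {
		fmt.Printf("peers=%d backoff=%v\n", pc, dynamicBackoff(pc))
	}
}
```

With few peers the pause grows toward 2 seconds per retry, while large peer sets stay at the minimum backoff.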

@algorandskiy
Contributor

For some extra context: this looks very similar to what was observed in P2P catchup testing (there is no archival DNS for P2P) in a scenario with 4 nodes in DNS: 2 relays with a limited history and 2 relays with a full block history. I have not debugged it, but it looked like the peerSelector punished the nodes returning 404, advanced to the next class, did not find any suitable nodes there, and eventually aborted catchup.
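
The suspected failure mode can be illustrated with a toy model. This is only a sketch and not go-algorand's actual peerSelector: the type toySelector, the rank threshold, and the round-robin selection are hypothetical simplifications. It shows how penalizing every 404 while asking for a block that no peer has yet quickly leaves a small peer set with nothing to select from.

```go
package main

import "fmt"

// toySelector is a HYPOTHETICAL, heavily simplified ranked peer selector.
// A peer whose rank reaches maxRank is no longer offered.
type toySelector struct {
	ranks       map[string]int
	maxRank     int
	penalize404 bool
}

func newToySelector(peers []string, maxRank int, penalize404 bool) *toySelector {
	s := &toySelector{ranks: make(map[string]int), maxRank: maxRank, penalize404: penalize404}
	for _, p := range peers {
		s.ranks[p] = 0
	}
	return s
}

// report404 records a "no block available" response from peer.
// The rank only worsens when penalize404 is set.
func (s *toySelector) report404(peer string) {
	if s.penalize404 {
		s.ranks[peer]++
	}
}

// available counts the peers that can still be selected.
func (s *toySelector) available() int {
	n := 0
	for _, r := range s.ranks {
		if r < s.maxRank {
			n++
		}
	}
	return n
}

// attemptsUntilEmpty keeps requesting a block that no peer has yet, so every
// request ends in a 404. It returns how many requests were made before no
// peer was left, or -1 if the limit was reached first.
func attemptsUntilEmpty(s *toySelector, peers []string, limit int) int {
	for i := 0; i < limit; i++ {
		if s.available() == 0 {
			return i
		}
		// Pick the next still-eligible peer in round-robin order.
		for j := 0; j < len(peers); j++ {
			p := peers[(i+j)%len(peers)]
			if s.ranks[p] < s.maxRank {
				s.report404(p)
				break
			}
		}
	}
	return -1
}

func main() {
	peers := []string{"relay-a", "relay-b"}
	fmt.Println("penalizing 404s:", attemptsUntilEmpty(newToySelector(peers, 3, true), peers, 100))
	fmt.Println("ignoring 404s:  ", attemptsUntilEmpty(newToySelector(peers, 3, false), peers, 100))
}
```

In this model, the number of requests a peer set survives scales with its size, which would be consistent with the problem showing up only on small peer sets.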

@gmalouf
Contributor

gmalouf commented Oct 4, 2024

For people who encounter this in the future: the CatchupParallelBlocks = 64 default for follower mode in the configuration profile works fine on networks with more than ~8 peers, but is too high on smaller networks. I believe a ratio of roughly 8 parallel requests per peer, or ideally 4 requests per peer, will work better. Dynamically calculating the maximum number of catchup parallel blocks involves quite a bit of complexity. I am open to lowering the profile's recommended value, though 64 works well for mainnet/testnet today.
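
As a rough aid for picking a value by hand, the ratio above can be sketched as follows. The helper parallelBlocksFor is hypothetical and deliberately ignores the complexity of calculating this dynamically inside the node. It only multiplies a known peer count by a requests-per-peer ratio and caps the result at the profile default of 64.

```go
package main

import "fmt"

// defaultParallelBlocks mirrors the follower-profile value of
// CatchupParallelBlocks quoted in this thread.
const defaultParallelBlocks = 64

// parallelBlocksFor (hypothetical helper) suggests a CatchupParallelBlocks
// value of requestsPerPeer per known peer, never above
// defaultParallelBlocks and never below 1.
func parallelBlocksFor(peerCount, requestsPerPeer int) int {
	n := peerCount * requestsPerPeer
	if n < 1 {
		return 1
	}
	if n > defaultParallelBlocks {
		return defaultParallelBlocks
	}
	return n
}

func main() {
	for _, peers := range []int{1, 2, 4, 8, 16} {
		fmt.Printf("peers=%d  8/peer=%d  4/peer=%d\n",
			peers, parallelBlocksFor(peers, 8), parallelBlocksFor(peers, 4))
	}
}
```

For example, a follower pointed at 2 relays would end up with 16 at 8 requests per peer, or 8 at the more conservative 4 requests per peer. The chosen number would then be set as CatchupParallelBlocks in the node's configuration.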
