Zebra gradually loses peer connections, until it has none left #1905
Comments
It looks like Zebra gets into a state where it stops crawling and dialing, even though there are many This might be happening because That makes this ticket a potential blocker for a bunch of network security fixes - because the fixes could make this problem much worse.
|
It looks like the
|
From another instance - it's definitely And this issue might be more frequent than we thought - I've had it happen to 2 testnet instances in 24 hours.
|
Here is a metrics dump from a peer in this state. I waited 5 minutes, and dumped the metrics again. There weren't any changes in any metric.
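The "dump, wait, dump again" check described above can be automated. Here is a minimal sketch, assuming the dumps are in a Prometheus-style text format with one `name{labels} value` series per line and no trailing timestamps. The function names and the metric names in the usage note are hypothetical and are not taken from Zebra.

```rust
use std::collections::BTreeMap;

/// Parse a Prometheus-style text dump into a map from series name
/// (including any labels) to its numeric value.
/// Assumption: lines look like `name{labels} value`, with no timestamp field.
fn parse_metrics(dump: &str) -> BTreeMap<String, f64> {
    let mut metrics = BTreeMap::new();
    for line in dump.lines() {
        let line = line.trim();
        // Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The value is the last whitespace-separated field;
        // everything before it is the series name and labels.
        if let Some((name, value)) = line.rsplit_once(char::is_whitespace) {
            if let Ok(value) = value.parse::<f64>() {
                metrics.insert(name.trim().to_string(), value);
            }
        }
    }
    metrics
}

/// Return the series whose value differs between the two dumps,
/// followed by the series that only appear in the second dump.
fn changed_metrics(before: &str, after: &str) -> Vec<String> {
    let before = parse_metrics(before);
    let after = parse_metrics(after);
    let mut changed = Vec::new();
    for (name, old) in &before {
        match after.get(name) {
            Some(new) if new == old => {}
            _ => changed.push(name.clone()),
        }
    }
    for name in after.keys() {
        if !before.contains_key(name) {
            changed.push(name.clone());
        }
    }
    changed
}
```

If `changed_metrics(first_dump, second_dump)` returns an empty list after several minutes, no counter or gauge has moved, which matches the stalled state described in this comment.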
|
Things we could try to debug this issue:
Here's the substrate task polling monitor code: |
We should also look for places where |
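To illustrate the general idea of a task polling monitor, here is a minimal sketch. It is not the Substrate code referenced above and not Zebra's implementation: `PollMonitor` and its methods are hypothetical names, and the caller is assumed to pass in the current time so that the check is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog that remembers when a task last made progress.
struct PollMonitor {
    /// Label used when reporting a stalled task.
    name: &'static str,
    /// Time of the most recent recorded poll.
    last_poll: Instant,
}

impl PollMonitor {
    /// Start monitoring a task, treating `now` as its first poll.
    fn new(name: &'static str, now: Instant) -> Self {
        PollMonitor { name, last_poll: now }
    }

    /// Call this each time the monitored task is polled
    /// or finishes one iteration of its loop.
    fn record_poll(&mut self, now: Instant) {
        self.last_poll = now;
    }

    /// Returns true if the task has gone longer than `timeout`
    /// without a recorded poll.
    fn is_stalled(&self, now: Instant, timeout: Duration) -> bool {
        now.saturating_duration_since(self.last_poll) > timeout
    }
}
```

A crawler or dialer loop could call `record_poll` on every iteration, while a separate timer task periodically calls `is_stalled` and logs the task's `name` when it returns true. That would show which task stopped making progress when the peer set drains.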
Version
zebrad 1.0.0-alpha.3
commit 6473ed99
Platform
Linux ... 5.4.96 #1-NixOS SMP Sun Feb 7 14:35:50 UTC 2021 x86_64 GNU/Linux
Description
Sometimes, on testnet, Zebra gradually loses peers, until it has none left.
Sometimes the peers are lost after Zebra disconnects them due to block download or verify errors. Other times, the peers just seem to disconnect themselves.
Zebra should try to reconnect to these peers, but it doesn't seem to get any peers back.
Maybe Zebra isn't trying to connect. Or maybe the connections keep failing. Zebra might be triggering zcashd's peer blocklist, due to bugs like #1848. Or there could be a peer address or connection state handling issue in Zebra's network stack.
This issue doesn't seem to happen that often (I've only seen it once in weeks of local testing). Our current reliability standard is 2 weeks of continuous runtime: Zebra easily achieves that standard on mainnet, and mostly achieves it on testnet.
Therefore, this issue is a low priority, until it affects CI or other node deployments.
Related Tickets
Fixing #1904 will get Zebra better peers from the DNS seeders on testnet
Fixing #1848 and other network security issues will make Zebra's network stack more reliable
This ticket might block #1791, if it happens often enough in our CI
Commands
zebrad start, configured to use testnet with the DNS seeders and a few local peers.
Logs
The relevant logs are: