Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forest slow sync on Mainnet network #5032

Closed
hamidmuslih opened this issue Dec 3, 2024 · 3 comments · Fixed by #5071
Closed

Forest slow sync on Mainnet network #5032

hamidmuslih opened this issue Dec 3, 2024 · 3 comments · Fixed by #5071
Assignees
Labels
Type: Bug Something isn't working

Comments

@hamidmuslih
Copy link

hamidmuslih commented Dec 3, 2024

Describe the bug

We have two forest nodes on the mainnet network, one running as a bootnode for community and the other as an RPC node. They are regularly lagging behind the network head, and there seem to be inconsistencies when we check their status with different commands.

To reproduce

Below is the forest mainnet rpc config:

[client]
#Controls whether the keystore is encrypted
encrypt_keystore = false
data_dir = "/home/forest/forest_data"
enable_rpc = true
enable_metrics_endpoint = true
metrics_address = "127.0.0.1:6116"
rpc_address = "0.0.0.0:2345"

[network]
listening_multiaddrs = ["/ip4/0.0.0.0/tcp/34000", "/ip6/::/tcp/34000"]
#Kademlia option enforce p2p connections.
kademlia = true
#Amount of peers the node should maintain a connection with
#Helps in syncing ( moderate value to not be exceeded)
target_peer_count = 100

1- when we check the node sync status it shows it is almost synced
command: forest-cli info show


Network: mainnet
Uptime: 6days 4h 42m 34s (Started at: 2024-11-27 18:00:41.731653103 +00:00)
Chain: [sync: Behind! (3h 26m 16s behind)] [basefee: 53569722 attoFIL] [epoch: 4498234]
Chain health: 96.60%

2- also another command, that shows Stage: complete
command: forest-cli sync status


sync status:
Base:	[bafy2bzaceamfqfikvtcngw4iwdtasfbnd6paptcjop2bkqzrjpcsx4wrzj3wc, bafy2bzaceacsdozmi47gr6vt4oc44txwu5npwpnefuw6rb5nfdedxeyxs5wti, bafy2bzacedakwpuhvmh62ujvsw6gzpem35z3ygx536xi6dtie6v65zkyu6nnu, bafy2bzacebhwpenxzyxmbtypb2ibbfk5x2i6bix6ddm2dkqinps6rfuvfivtm, bafy2bzacedhnyg2zndpgpdbmy3cqlhwumbc2o6kgsas2lqar53twzu2gcjmoq, bafy2bzacecfcbojfe3mualppa2mojqd3bxn2qp5kxrvra52hzsqd2uzlrjzrm, bafy2bzacebnatltccotwjkqnrtlkpozqr2pturleoq4kyjio6gmlnwizh365g]
Target:	[bafy2bzaceacddelmtdlhqityjzvkhnahkxcvtdpg2gumi737p4o36z4ijfpx2] (4498234)
Height diff:	0
Stage:	complete
Height:	4498234
Elapsed time:	19s

3- same using RPC call stage complete

curl -s -X POST "http://127.0.0.1:2345/rpc/v0" -H 'Content-Type: application/json' --data '{"jsonrpc":"2.0","id":1,"method":"Filecoin.SyncState","params":[]}' | jq '.result.ActiveSyncs[0].Stage'`

"complete"

4- ⚠ but in reality, it is 3:26 mins behind the network head and at the time of writing the network head is at 4498643
command: forest-cli info show

Network: mainnet
Uptime: 6days 4h 42m 40s (Started at: 2024-11-27 18:00:41.731653103 +00:00)
Chain: [sync: Behind! (3h 26m 22s behind)] [basefee: 53569722 attoFIL] [epoch: 4498234]
Chain health: 96.60%

Log output

Log Output

2024-12-03 23:46:07.415 | 2024-12-03T22:46:07.414980Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWK3sQX6FPGKgV87jnHcf4ENJEeq8VhgXhjr4nfmQNCEbZ |  
-- | -- | --
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736345Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736346Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736339Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736342Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736338Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736336Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736333Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736330Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736326Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736312Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.736 | 2024-12-03T22:46:00.736307Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.732 | 2024-12-03T22:46:00.732524Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675408Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675383Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675379Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675377Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675271Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675269Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675268Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675261Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.675019Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:46:00.675 | 2024-12-03T22:46:00.674978Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:45:45.036 | 2024-12-03T22:45:45.035891Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWFThK9fi3zf2Vi2HW5PrpLwr7BmMMMhx6BzSB6jUp2C4Z |  
  |   | 2024-12-03 23:45:44.692 | 2024-12-03T22:45:44.692833Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWMEy8BVNAs3r4ifLPVXRHcjwRnzZkvvxt1i13dieZQpnx |  
  |   | 2024-12-03 23:45:44.408 | 2024-12-03T22:45:44.408533Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWHBvDzPF6t67rEq6G28k1GG9euGEYvJsiK85NBVgYmcp5 |  
  |   | 2024-12-03 23:45:42.635 | 2024-12-03T22:45:42.635684Z  INFO forest_filecoin::libp2p::service: Re-dialing to bootstrap peer at /dns/node.glif.io/tcp/1235/p2p/12D3KooWBF8cpp65hp2u9LK5mh19x67ftAam84z9LsfaquTDSBpt |  
  |   | 2024-12-03 23:45:42.635 | 2024-12-03T22:45:42.635651Z  INFO forest_filecoin::libp2p::service: Re-dialing to bootstrap peer at /dns/bootstrap-mainnet-0.chainsafe-fil.io/tcp/34000/p2p/12D3KooWKKkCZbcigsWTEu1cgNetNbZJqeNtysRtFpq7DTqw3eqH |  
  |   | 2024-12-03 23:45:42.586 | 2024-12-03T22:45:42.586314Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWLu9RSbCvPffzfj2TbDZQoHT2Qk8yuHN61WnXCVRc9TLN |  
  |   | 2024-12-03 23:45:32.670 | 2024-12-03T22:45:32.670288Z  WARN forest_filecoin::libp2p::hello::behaviour: Disconnecting peer for not receiving hello in 30s peer=12D3KooWJpE7tt1MPyfhzMFWxLqNPVquHXcDxYmBjbcftU7gwfmd |  
  |   | 2024-12-03 23:45:30.675 | 2024-12-03T22:45:30.675027Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:45:30.675 | 2024-12-03T22:45:30.675026Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:45:30.675 | 2024-12-03T22:45:30.675024Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel |  
  |   | 2024-12-03 23:45:30.675 | 2024-12-03T22:45:30.675019Z  WARN forest_filecoin::libp2p_bitswap::request_manager: sending on a closed channel


Expected behaviour

node should see the real network head, and sync normal.

Screenshots

Server performance doesn't seem scary
Image

Environment (please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • Branch/commit: ghcr.io/chainsafe/forest:v0.22.0-fat
  • Hardware
    • 32GB RAM
    • CPU 6/12thread
    • Disk locally attached NVME

Other information and links

we have some internal discussion on this as well, slack link here

@hamidmuslih hamidmuslih added the Type: Bug Something isn't working label Dec 3, 2024
@LesnyRumcajs LesnyRumcajs moved this from New to In progress in Forest Backlog 🌲 Dec 5, 2024
@LesnyRumcajs
Copy link
Member

Investigated with the infra team by @lemmih, to be verified if it's Forest issue or not.

@lemmih
Copy link
Contributor

lemmih commented Dec 6, 2024

Replicated on the staging server. I will try to replicate it locally.

@lemmih
Copy link
Contributor

lemmih commented Dec 10, 2024

I managed to replicate it locally, but it can take hours to manifest.

When the problem occurs, Forest starts processing tipsets at a reduced rate (roughly one every 2 minutes instead of 2 per minute). Processing doesn't come to a halt, and no errors are generated.

We use two state machines: ChainMuxer and TipsetProcessor. The ChainMuxer talks to the p2p swarm to receive news about the latest tipsets. It then forwards them to the TipsetProcessor, which is responsible for fetching dependencies (eg. parent tipsets that haven't already been downloaded) and resolving forks.

I initially thought there was a problem with the TipsetProcessor. A bit of debugging shows this not to be the case. Instead, the stream of tipsets we get from the p2p swarm slows down. This shouldn't be possible, and I'm currently investigating how this can happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants