Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forest is getting stuck during sync #3990

Closed
LesnyRumcajs opened this issue Feb 26, 2024 · 1 comment · Fixed by #3996
Closed

Forest is getting stuck during sync #3990

LesnyRumcajs opened this issue Feb 26, 2024 · 1 comment · Fixed by #3996
Assignees
Labels
Network Libp2p and PubSub stuff Priority: 2 - High Very important and should be addressed ASAP Ready Issue is ready for work and anyone can freely assign it to themselves

Comments

@LesnyRumcajs
Copy link
Member

Issue summary

Over the last few days, Forest was getting consistently stuck in the snapshot service. It happens during this part of the script, namely while the script is waiting for the sync. During the timeout window (six hours), Forest either manages to recover alone or is killed.

forest --config config.toml --chain "$CHAIN_NAME" --no-gc --save-token=token.txt --target-peer-count 500 --detach
timeout "$SYNC_TIMEOUT" forest-cli sync wait

While the steps are always the same and the service is always created from scratch, it heavily depends on the network conditions.

Take, for example, this forest.out.txt from 2024-02-26. It repeatedly manages to bootstrap, then stalls for around 10 minutes on evaluating the network head and bootstraps again. My understanding is that, once bootstrapped, should quickly switch to FOLLOW mode.

2024-02-26T08:04:11.394028Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: ...
2024-02-26T08:04:22.173405Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687607, N blocks = 3
2024-02-26T08:04:23.770512Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:04:23.770546Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:14:30.175532Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687607 -> NETWORK_HEAD = 3687628
2024-02-26T08:14:32.451592Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687607, N blocks = 3
...
2024-02-26T08:15:21.416066Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687628, N blocks = 6
2024-02-26T08:15:23.442363Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:15:23.442412Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:24:42.045186Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687628 -> NETWORK_HEAD = 3687649
2024-02-26T08:24:43.834966Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687628, N blocks = 6
...
2024-02-26T08:25:40.036960Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687649, N blocks = 5
2024-02-26T08:25:42.113208Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:25:42.113256Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:33:30.277333Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687649 -> NETWORK_HEAD = 3687666
2024-02-26T08:33:32.657598Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687649, N blocks = 5

It may be relevant or not, but the bootstrapping failed a few times during this execution, e.g,.

2024-02-26T06:41:30.263759Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzacebqs5q2rpjckgvsujjb3ujvloo5jdnneodaahvhv5ovhuwwrgrc6c), tail: [Cid(bafy2bzaceasnyvtuyydg6rzbihwnfulqoixxdtjvuhxjbqi4fdk3i4cvqisqa), Cid(bafy2bzacedo7nqqcxqzcpn7vv3uik332noezq4remr2etwz6yaxzygfyqaamg), Cid(bafy2bzacealtmb75qnq27htdodpxfd6tnjxztvl4mvpgt5lmehmxi5rwrvees), Cid(bafy2bzacedtunn4ziqpictr3s4qww3b4hn4ab2qvsz7tokfpl7iiwzxmj7dds), Cid(bafy2bzacedah76zlbnrbggcocyuinlsomyzstay62rpx6nrkiw5jojqau6xvc), Cid(bafy2bzacebjelq6j4n23qeca7uizmbbzgl4nhyutrgvie34u7lkd7uhpym3du), Cid(bafy2bzacecchpxwmazmpzlfgdebk4i2uau3c5qbwxthotiuqfp4axqlgr7x6k)] }, request_len: 8, options: 2 }"))

Overall, even if we end up with a few bad peers, the entire mainnet should have around 3000 peers the last time I checked. Do we end up with 500 bad ones?

Other information and links
This is happening on forest v0.16.6

@LesnyRumcajs LesnyRumcajs added Priority: 2 - High Very important and should be addressed ASAP Network Libp2p and PubSub stuff Ready Issue is ready for work and anyone can freely assign it to themselves labels Feb 26, 2024
@LesnyRumcajs
Copy link
Member Author

Same happened recently - on two distinct servers. This may indicate a general condition in the network that Forest is just not able to handle.

2024-02-27T03:57:11.680500Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688934, N blocks = 3
2024-02-27T03:57:13.869985Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688935, N blocks = 3
2024-02-27T03:57:15.872318Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688936, N blocks = 8
2024-02-27T03:57:18.128510Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688937, N blocks = 8
2024-02-27T03:57:20.267705Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688938, N blocks = 12
2024-02-27T03:57:22.408323Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688939, N blocks = 3
2024-02-27T03:57:24.212859Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688940, N blocks = 9
2024-02-27T03:57:26.386801Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688941, N blocks = 6
2024-02-27T03:57:29.012571Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688942, N blocks = 7
2024-02-27T03:57:30.733516Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688943, N blocks = 2
2024-02-27T03:57:32.642782Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688944, N blocks = 7
2024-02-27T03:57:34.405816Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688945, N blocks = 4
2024-02-27T03:57:36.875247Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688946, N blocks = 7
2024-02-27T03:57:39.051070Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T03:57:39.051196Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceawdj4rcmlop2ywthtjoku6uh6pwtvkdfmdultmf5ecyasi5zoilw), tail: [Cid(bafy2bzacednhzl3kbhattskndzi7fl4ekvvsv3zhyf43hpo3kllx3tddkzigq), Cid(bafy2bzacecvalk4waodo5my2dnwsagnnfet3vaouri3yjcpz64lwm77vy5kp4)] }, request_len: 8, options: 2 }"))
2024-02-27T03:57:39.051245Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-27T04:04:59.873071Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3688946 -> NETWORK_HEAD = 3690009
2024-02-27T04:05:06.261098Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T04:05:06.261256Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceam64a6njudjfh7enuwxro6enxgwjjswqwn5ezlsc75vdbsiwhjyw), tail: [Cid(bafy2bzaceb4hlohv4fghgpi2k5rnkvkodumlsfjxbt7qgzrqiuo4rdjoakbxi), Cid(bafy2bzacecohntypminocrc4rqur2wiw7bjtpuvne3wg32qe7rnv73z5lapbm), Cid(bafy2bzacebtpwvfdj4gr2sth7lc25th7hqbauigrozpqxvkncqwwe7lcxa2z6), Cid(bafy2bzaceatjctmlhgrz7k4ngc4uxxpn7pqmlqlupesajohlekyujepj5dhi4)] }, request_len: 8, options: 2 }"))
2024-02-27T04:05:06.261287Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-27T04:16:59.758525Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3688946 -> NETWORK_HEAD = 3690033
2024-02-27T04:17:06.060036Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T04:17:06.060148Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceam64a6njudjfh7enuwxro6enxgwjjswqwn5ezlsc75vdbsiwhjyw), tail: [Cid(bafy2bzaceb4hlohv4fghgpi2k5rnkvkodumlsfjxbt7qgzrqiuo4rdjoakbxi), Cid(bafy2bzacecohntypminocrc4rqur2wiw7bjtpuvne3wg32qe7rnv73z5lapbm), Cid(bafy2bzacebtpwvfdj4gr2sth7lc25th7hqbauigrozpqxvkncqwwe7lcxa2z6), Cid(bafy2bzaceatjctmlhgrz7k4ngc4uxxpn7pqmlqlupesajohlekyujepj5dhi4)] }, request_len: 8, options: 2 }"))
2024-02-27T04:17:06.060185Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...

forest.out.txt

Added the metrics to DO bucket forest-issues/ forest#3990 /mainnet_2024-02-27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Network Libp2p and PubSub stuff Priority: 2 - High Very important and should be addressed ASAP Ready Issue is ready for work and anyone can freely assign it to themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant