Forest is getting stuck during sync #3990

LesnyRumcajs · 2024-02-26T14:02:30Z

Issue summary

Over the last few days, Forest was getting consistently stuck in the snapshot service. It happens during this part of the script, namely while the script is waiting for the sync. During the timeout window (six hours), Forest either manages to recover alone or is killed.

forest --config config.toml --chain "$CHAIN_NAME" --no-gc --save-token=token.txt --target-peer-count 500 --detach
timeout "$SYNC_TIMEOUT" forest-cli sync wait

While the steps are always the same and the service is always created from scratch, it heavily depends on the network conditions.

Take, for example, this forest.out.txt from 2024-02-26. It repeatedly manages to bootstrap, then stalls for around 10 minutes on evaluating the network head and bootstraps again. My understanding is that, once bootstrapped, should quickly switch to FOLLOW mode.

2024-02-26T08:04:11.394028Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: ...
2024-02-26T08:04:22.173405Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687607, N blocks = 3
2024-02-26T08:04:23.770512Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:04:23.770546Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:14:30.175532Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687607 -> NETWORK_HEAD = 3687628
2024-02-26T08:14:32.451592Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687607, N blocks = 3
...
2024-02-26T08:15:21.416066Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687628, N blocks = 6
2024-02-26T08:15:23.442363Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:15:23.442412Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:24:42.045186Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687628 -> NETWORK_HEAD = 3687649
2024-02-26T08:24:43.834966Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687628, N blocks = 6
...
2024-02-26T08:25:40.036960Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687649, N blocks = 5
2024-02-26T08:25:42.113208Z  INFO forest_filecoin::chain_sync::chain_muxer: Bootstrap successfully completed, now evaluating the network head to ensure the node is in sync
2024-02-26T08:25:42.113256Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-26T08:33:30.277333Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3687649 -> NETWORK_HEAD = 3687666
2024-02-26T08:33:32.657598Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3687649, N blocks = 5

It may be relevant or not, but the bootstrapping failed a few times during this execution, e.g,.

2024-02-26T06:41:30.263759Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzacebqs5q2rpjckgvsujjb3ujvloo5jdnneodaahvhv5ovhuwwrgrc6c), tail: [Cid(bafy2bzaceasnyvtuyydg6rzbihwnfulqoixxdtjvuhxjbqi4fdk3i4cvqisqa), Cid(bafy2bzacedo7nqqcxqzcpn7vv3uik332noezq4remr2etwz6yaxzygfyqaamg), Cid(bafy2bzacealtmb75qnq27htdodpxfd6tnjxztvl4mvpgt5lmehmxi5rwrvees), Cid(bafy2bzacedtunn4ziqpictr3s4qww3b4hn4ab2qvsz7tokfpl7iiwzxmj7dds), Cid(bafy2bzacedah76zlbnrbggcocyuinlsomyzstay62rpx6nrkiw5jojqau6xvc), Cid(bafy2bzacebjelq6j4n23qeca7uizmbbzgl4nhyutrgvie34u7lkd7uhpym3du), Cid(bafy2bzacecchpxwmazmpzlfgdebk4i2uau3c5qbwxthotiuqfp4axqlgr7x6k)] }, request_len: 8, options: 2 }"))

Overall, even if we end up with a few bad peers, the entire mainnet should have around 3000 peers the last time I checked. Do we end up with 500 bad ones?

Other information and links
This is happening on forest v0.16.6

The text was updated successfully, but these errors were encountered:

LesnyRumcajs · 2024-02-27T08:11:48Z

Same happened recently - on two distinct servers. This may indicate a general condition in the network that Forest is just not able to handle.

2024-02-27T03:57:11.680500Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688934, N blocks = 3
2024-02-27T03:57:13.869985Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688935, N blocks = 3
2024-02-27T03:57:15.872318Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688936, N blocks = 8
2024-02-27T03:57:18.128510Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688937, N blocks = 8
2024-02-27T03:57:20.267705Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688938, N blocks = 12
2024-02-27T03:57:22.408323Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688939, N blocks = 3
2024-02-27T03:57:24.212859Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688940, N blocks = 9
2024-02-27T03:57:26.386801Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688941, N blocks = 6
2024-02-27T03:57:29.012571Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688942, N blocks = 7
2024-02-27T03:57:30.733516Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688943, N blocks = 2
2024-02-27T03:57:32.642782Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688944, N blocks = 7
2024-02-27T03:57:34.405816Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688945, N blocks = 4
2024-02-27T03:57:36.875247Z  INFO forest_filecoin::chain_sync::tipset_syncer: Validating tipset: EPOCH = 3688946, N blocks = 7
2024-02-27T03:57:39.051070Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T03:57:39.051196Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceawdj4rcmlop2ywthtjoku6uh6pwtvkdfmdultmf5ecyasi5zoilw), tail: [Cid(bafy2bzacednhzl3kbhattskndzi7fl4ekvvsv3zhyf43hpo3kllx3tddkzigq), Cid(bafy2bzacecvalk4waodo5my2dnwsagnnfet3vaouri3yjcpz64lwm77vy5kp4)] }, request_len: 8, options: 2 }"))
2024-02-27T03:57:39.051245Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-27T04:04:59.873071Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3688946 -> NETWORK_HEAD = 3690009
2024-02-27T04:05:06.261098Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T04:05:06.261256Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceam64a6njudjfh7enuwxro6enxgwjjswqwn5ezlsc75vdbsiwhjyw), tail: [Cid(bafy2bzaceb4hlohv4fghgpi2k5rnkvkodumlsfjxbt7qgzrqiuo4rdjoakbxi), Cid(bafy2bzacecohntypminocrc4rqur2wiw7bjtpuvne3wg32qe7rnv73z5lapbm), Cid(bafy2bzacebtpwvfdj4gr2sth7lc25th7hqbauigrozpqxvkncqwwe7lcxa2z6), Cid(bafy2bzaceatjctmlhgrz7k4ngc4uxxpn7pqmlqlupesajohlekyujepj5dhi4)] }, request_len: 8, options: 2 }"))
2024-02-27T04:05:06.261287Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...
2024-02-27T04:16:59.758525Z  INFO forest_filecoin::chain_sync::chain_muxer: Local node is behind the network, starting BOOTSTRAP from LOCAL_HEAD = 3688946 -> NETWORK_HEAD = 3690033
2024-02-27T04:17:06.060036Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-02-27T04:17:06.060148Z ERROR forest_filecoin::chain_sync::chain_muxer: Bootstrapping failed, re-evaluating the network head to retry the bootstrap. Error = TipsetRangeSyncer(NetworkMessageQueryFailed("ChainExchange request failed for all top peers. 1 network failures, 0 lookup failures, request:\nChainExchangeRequest { start: NonEmpty { head: Cid(bafy2bzaceam64a6njudjfh7enuwxro6enxgwjjswqwn5ezlsc75vdbsiwhjyw), tail: [Cid(bafy2bzaceb4hlohv4fghgpi2k5rnkvkodumlsfjxbt7qgzrqiuo4rdjoakbxi), Cid(bafy2bzacecohntypminocrc4rqur2wiw7bjtpuvne3wg32qe7rnv73z5lapbm), Cid(bafy2bzacebtpwvfdj4gr2sth7lc25th7hqbauigrozpqxvkncqwwe7lcxa2z6), Cid(bafy2bzaceatjctmlhgrz7k4ngc4uxxpn7pqmlqlupesajohlekyujepj5dhi4)] }, request_len: 8, options: 2 }"))
2024-02-27T04:17:06.060185Z  INFO forest_filecoin::chain_sync::chain_muxer: Evaluating network head...

forest.out.txt

Added the metrics to DO bucket forest-issues/ forest#3990 /mainnet_2024-02-27

LesnyRumcajs added Priority: 2 - High Very important and should be addressed ASAP Network Libp2p and PubSub stuff Ready Issue is ready for work and anyone can freely assign it to themselves labels Feb 26, 2024

hanabi1224 mentioned this issue Feb 27, 2024

feat(daemon): add --p2p-listen-address arg to forest #3995

Merged

4 tasks

LesnyRumcajs self-assigned this Feb 27, 2024

LesnyRumcajs mentioned this issue Feb 27, 2024

enable rsa feature for libp2p #3996

Merged

4 tasks

LesnyRumcajs closed this as completed in #3996 Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Forest is getting stuck during sync #3990

Forest is getting stuck during sync #3990

LesnyRumcajs commented Feb 26, 2024

LesnyRumcajs commented Feb 27, 2024

Forest is getting stuck during sync #3990

Forest is getting stuck during sync #3990

Comments

LesnyRumcajs commented Feb 26, 2024

LesnyRumcajs commented Feb 27, 2024