Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky zombienet test zombienet-substrate-0002-validators-warp-sync #5974

Closed
Tracked by #5852
pepoviola opened this issue Oct 8, 2024 · 7 comments · Fixed by #6154
Closed
Tracked by #5852

Fix flaky zombienet test zombienet-substrate-0002-validators-warp-sync #5974

pepoviola opened this issue Oct 8, 2024 · 7 comments · Fixed by #6154
Assignees

Comments

github-merge-queue bot pushed a commit that referenced this issue Oct 9, 2024
@michalkucharczyk
Copy link
Contributor

michalkucharczyk commented Oct 9, 2024

@pepoviola

I see this test internally is using substrate_beefy_best_block - why is it a beefy block and not just a best block?
Test is just using best block:

alice: reports block height is at least {{DB_BLOCK_HEIGHT}} within 60 seconds
bob: reports block height is at least {{DB_BLOCK_HEIGHT}} within 60 seconds
charlie: reports block height is at least {{DB_BLOCK_HEIGHT}} within 60 seconds

And it seems that this number is not being synced fast enough:

2024-10-07T16:20:55.002Z zombie::network-node using comparator isAtLeast for 10797, 56687
2024-10-07T16:20:55.002Z zombie::network-node [alice] Current value: 10797 for metric substrate_beefy_best_block, keep trying...
2024-10-07T16:20:56.002Z zombie::network-node [alice] Fetching metrics - q: 174  time:  Mon Oct 07 2024 16:20:56 GMT+0000 (Coordinated Universal Time)
2024-10-07T16:20:56.026Z zombie::network-node returning for: substrate_beefy_best_block from ns: _raw
2024-10-07T16:20:56.026Z zombie::network-node returning: 10997
2024-10-07T16:20:56.026Z zombie::network-node using comparator isAtLeast for 10997, 56687
2024-10-07T16:20:56.026Z zombie::network-node [alice] Current value: 10997 for metric substrate_beefy_best_block, keep trying...
2024-10-07T16:20:57.027Z zombie::network-node [alice] Fetching metrics - q: 175  time:  Mon Oct 07 2024 16:20:57 GMT+0000 (Coordinated Universal Time)
2024-10-07T16:20:57.048Z zombie::network-node returning for: substrate_beefy_best_block from ns: _raw
2024-10-07T16:20:57.048Z zombie::network-node returning: 11197
2024-10-07T16:20:57.048Z zombie::network-node using comparator isAtLeast for 11197, 56687
2024-10-07T16:20:57.048Z zombie::network-node [alice] Current value: 11197 for metric substrate_beefy_best_block, keep trying...
	 Error:  
		 [alice] Timeout(180), "getting desired metric value 56687 within 180 secs".

From a quick glance it looks like beefy sync issue?

@pepoviola
Copy link
Contributor Author

I see this test internally is using substrate_beefy_best_block - why is it a beefy block and not just a best block?
Test is just using best block:

Thanks for pointing this @michalkucharczyk, I think this is a bug. I will fix it and re-check this test.

Thanks for the help!

@pepoviola
Copy link
Contributor Author

Hi @michalkucharczyk, I double check and the lines you mention are from other test (0001). This issue is for 0002 and is set to check that metric

alice: reports substrate_beefy_best_block is at least {{DB_BLOCK_HEIGHT}} within 180 seconds
bob: reports substrate_beefy_best_block is at least {{DB_BLOCK_HEIGHT}} within 180 seconds
alice: reports substrate_beefy_best_block is greater than {{DB_BLOCK_HEIGHT}} within 180 seconds
bob: reports substrate_beefy_best_block is greater than {{DB_BLOCK_HEIGHT}} within 180 seconds

Thanks!

@michalkucharczyk
Copy link
Contributor

Oh, sorry.

So it looks like 180s is not enough for beefy to sync the best block.
@serban300 what are your thoughts on this? Do you think it can be some regression?

@serban300
Copy link
Contributor

I don't think it's a regression, but I would have to look more closely. I'll look on it these days.

mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this issue Oct 9, 2024
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this issue Oct 9, 2024
@serban300 serban300 self-assigned this Oct 15, 2024
@serban300
Copy link
Contributor

In the 0002-validators-warp-sync scenario, all the validators are started from scratch, so they don't preserve the BEEFY state. They use warp sync, but for BEEFY they have to start voting from the beginning, for all the mandatory blocks.

In the worst case scenario, after voting, the validator doesn't immediately become aware of the BEEFY round finalization. It has to wait until the next block finality notification. This means 6s for each BEEFY mandatory round to end.

For example:
Alice:

2024-10-21 13:19:04.214 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 1, end 1, validator set id 0    
2024-10-21 13:19:04.214  INFO tokio-runtime-worker beefy: 🥩 run BEEFY worker, best grandpa: #56685.    
2024-10-21 13:19:04.214 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #1    
2024-10-21 13:19:04.214 DEBUG tokio-runtime-worker beefy: 🥩 Try voting on 1    
2024-10-21 13:19:04.214 DEBUG tokio-runtime-worker beefy: 🥩 Local authority id: Public(020a1091341fe5664bfa1782d5e04779689068c916b04cb365ec3153755684d9a1 (KW39r9CJ...))    
2024-10-21 13:19:04.215 DEBUG tokio-runtime-worker beefy: 🥩 Sent vote message: VoteMessage { commitment: Commitment { payload: Payload([([109, 104], [85, 60, 123, 231, 125, 122, 128, 232, 176, 78, 116, 97, 117, 166, 146, 230, 221, 196, 158, 17, 220, 247, 48, 102, 66, 46, 25, 250, 78, 1, 151, 97])]), block_number: 1, validator_set_id: 0 }, id: Public(020a1091341fe5664bfa1782d5e04779689068c916b04cb365ec3153755684d9a1 (KW39r9CJ...)), signature: Signature(7a1958b718ca60c1b46125588df75dbb43a1bf5813e06fb44dbf2761a989afbb22b9f56d9f6fd7ac21ef7c9db987531e3c09a3834ce73678a8bf9da673c976a401) }    
2024-10-21 13:19:04.215 DEBUG tokio-runtime-worker beefy::sync: 🥩 no good peers to request justif #1 from    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 Transforming grandpa notification. #56686(0x44a7fdf0826622bb89515fcf2f381507e24c8d922c6d10bee3cc73801ba58668)    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 Finality notification: header(number 56686, hash 0x44a7fdf0826622bb89515fcf2f381507e24c8d922c6d10bee3cc73801ba58668) tree_route []    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 1, end 1, validator set id 0    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #1    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy::sync: 🥩 requesting justif #1 from peer PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby")    
2024-10-21 13:19:07.940 DEBUG tokio-runtime-worker beefy::sync: 🥩 received valid on-demand justif #1 from PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby")    
2024-10-21 13:19:07.940 DEBUG tokio-runtime-worker beefy: 🥩 Process justification for round: 1.    
2024-10-21 13:19:07.940  INFO tokio-runtime-worker beefy: 🥩 Concluded mandatory round #1    
2024-10-21 13:19:08.125 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 201, end 201, validator set id 1    
2024-10-21 13:19:08.125 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #201    
2024-10-21 13:19:08.125 DEBUG tokio-runtime-worker beefy: 🥩 Try voting on 201    
2024-10-21 13:19:08.125 DEBUG tokio-runtime-worker beefy: 🥩 Local authority id: Public(020a1091341fe5664bfa1782d5e04779689068c916b04cb365ec3153755684d9a1 (KW39r9CJ...))    
2024-10-21 13:19:08.126 DEBUG tokio-runtime-worker beefy: 🥩 Sent vote message: VoteMessage { commitment: Commitment { payload: Payload([([109, 104], [167, 131, 107, 171, 89, 116, 180, 131, 233, 88, 171, 186, 85, 246, 176, 240, 220, 185, 183, 15, 121, 176, 249, 55, 83, 147, 57, 166, 116, 50, 46, 53])]), block_number: 201, validator_set_id: 1 }, id: Public(020a1091341fe5664bfa1782d5e04779689068c916b04cb365ec3153755684d9a1 (KW39r9CJ...)), signature: Signature(321c80d9f48932ec91260f783be413ab19f07baf5bdfd0309937432d35716c342e2107f22cef2e73a92c07d3315fff47d363c8d2683fce25b38e2f805bc3166901) }    
2024-10-21 13:19:08.126 DEBUG tokio-runtime-worker beefy::sync: 🥩 no good peers to request justif #201 from    
2024-10-21 13:19:08.713  INFO tokio-runtime-worker substrate: 💤 Idle (4 peers), best: #56688 (0x4cae…f1bc), finalized #56686 (0x44a7…8668), ⬇ 0.9kiB/s ⬆ 2.6kiB/s    
2024-10-21 13:19:13.262  WARN tokio-runtime-worker babe: 👶 Epoch(s) skipped: from 284 to 40758    
2024-10-21 13:19:13.262  INFO tokio-runtime-worker babe: 👶 New epoch 40758 launching at block 0xcfed…eb2b (block slot 576501982 >= start slot 576501940).    
2024-10-21 13:19:13.262  INFO tokio-runtime-worker babe: 👶 Next epoch starts at slot 576502140    
2024-10-21 13:19:13.264  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #56688,0x4cae…f1bc to #56688,0xcfed…eb2b, common ancestor #56687,0xc1de…73e3    
2024-10-21 13:19:13.264  INFO tokio-runtime-worker substrate: 🏆 Imported #56688 (0xc1de…73e3 → 0xcfed…eb2b)    
2024-10-21 13:19:13.265  INFO tokio-runtime-worker substrate: 🏆 Imported #56689 (0xcfed…eb2b → 0x7402…c458)    
2024-10-21 13:19:13.266  INFO tokio-runtime-worker substrate: 🏆 Imported #56690 (0x7402…c458 → 0x2969…1445)    
2024-10-21 13:19:13.713  INFO tokio-runtime-worker substrate: 💤 Idle (4 peers), best: #56690 (0x2969…1445), finalized #56686 (0x44a7…8668), ⬇ 2.5kiB/s ⬆ 2.7kiB/s    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 Transforming grandpa notification. #56688(0xcfeda77a654bc5df3d33d81739a7686d382b903fa1d2ecadd0a2396bee65eb2b)    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 Finality notification: header(number 56688, hash 0xcfeda77a654bc5df3d33d81739a7686d382b903fa1d2ecadd0a2396bee65eb2b) tree_route [0xc1de2059d65d3b9edadef3d4e6ce0f116d3f1d20d5846c898a96d697a6c173e3]    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 New active validator set: ValidatorSet { validators: [Public(020a1091341fe5664bfa1782d5e04779689068c916b04cb365ec3153755684d9a1 (KW39r9CJ...)), Public(0390084fdbf27d2b79d26a4f13f0ccd982cb755a661969143c37cbc49ef5b91f27 (KWByAN7W...))], id: 284 }    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 New session 284 while active session 1 is still lagging.    
2024-10-21 13:19:14.606  INFO tokio-runtime-worker beefy: 🥩 New Rounds for validator set id: 284 with session_start 56688    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 201, end 201, validator set id 1    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #201    
2024-10-21 13:19:14.606 DEBUG tokio-runtime-worker beefy::sync: 🥩 requesting justif #201 from peer PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby")    
2024-10-21 13:19:14.607 DEBUG tokio-runtime-worker beefy::sync: 🥩 received valid on-demand justif #201 from PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby")    
2024-10-21 13:19:14.607 DEBUG tokio-runtime-worker beefy: 🥩 Process justification for round: 201.    
2024-10-21 13:19:14.607  INFO tokio-runtime-worker beefy: 🥩 Concluded mandatory round #201    
2024-10-21 13:19:14.792 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 401, end 401, validator set id 2  

Bob:

2024-10-21 13:19:03.629 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 1, end 1, validator set id 0    
2024-10-21 13:19:03.629  INFO tokio-runtime-worker beefy: 🥩 run BEEFY worker, best grandpa: #56685.    
2024-10-21 13:19:03.629 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #1    
2024-10-21 13:19:03.629 DEBUG tokio-runtime-worker beefy: 🥩 Try voting on 1    
2024-10-21 13:19:03.629 DEBUG tokio-runtime-worker beefy: 🥩 Local authority id: Public(0390084fdbf27d2b79d26a4f13f0ccd982cb755a661969143c37cbc49ef5b91f27 (KWByAN7W...))    
2024-10-21 13:19:03.630 DEBUG tokio-runtime-worker beefy: 🥩 Sent vote message: VoteMessage { commitment: Commitment { payload: Payload([([109, 104], [85, 60, 123, 231, 125, 122, 128, 232, 176, 78, 116, 97, 117, 166, 146, 230, 221, 196, 158, 17, 220, 247, 48, 102, 66, 46, 25, 250, 78, 1, 151, 97])]), block_number: 1, validator_set_id: 0 }, id: Public(0390084fdbf27d2b79d26a4f13f0ccd982cb755a661969143c37cbc49ef5b91f27 (KWByAN7W...)), signature: Signature(864aed8f160f3deb43ef6e5483c00b78787fc217c8bd51356b1fd2cbeab161381fe2f0171587afcc628227c556464f8385ab66c86b68fa0d0bc7519d68f4bfa701) }    
2024-10-21 13:19:03.630 DEBUG tokio-runtime-worker beefy::sync: 🥩 no good peers to request justif #1 from    
2024-10-21 13:19:04.215 DEBUG tokio-runtime-worker beefy: 🥩 Round #1 concluded, finality_proof: V1(SignedCommitment { commitment: Commitment { payload: Payload([([109, 104], [85, 60, 123, 231, 125, 122, 128, 232, 176, 78, 116, 97, 117, 166, 146, 230, 221, 196, 158, 17, 220, 247, 48, 102, 66, 46, 25, 250, 78, 1, 151, 97])]), block_number: 1, validator_set_id: 0 }, signatures: [Some(Signature(7a1958b718ca60c1b46125588df75dbb43a1bf5813e06fb44dbf2761a989afbb22b9f56d9f6fd7ac21ef7c9db987531e3c09a3834ce73678a8bf9da673c976a401)), Some(Signature(864aed8f160f3deb43ef6e5483c00b78787fc217c8bd51356b1fd2cbeab161381fe2f0171587afcc628227c556464f8385ab66c86b68fa0d0bc7519d68f4bfa701))] }).    
2024-10-21 13:19:04.215  INFO tokio-runtime-worker beefy: 🥩 Concluded mandatory round #1    
2024-10-21 13:19:04.230  INFO tokio-runtime-worker substrate: 💤 Idle (4 peers), best: #56687 (0xc1de…73e3), finalized #56685 (0xb9e1…4c09), ⬇ 4.4MiB/s ⬆ 23.7kiB/s    
2024-10-21 13:19:04.416 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 201, end 201, validator set id 1    
2024-10-21 13:19:04.416 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #201    
2024-10-21 13:19:04.416 DEBUG tokio-runtime-worker beefy: 🥩 Try voting on 201    
2024-10-21 13:19:04.416 DEBUG tokio-runtime-worker beefy: 🥩 Local authority id: Public(0390084fdbf27d2b79d26a4f13f0ccd982cb755a661969143c37cbc49ef5b91f27 (KWByAN7W...))    
2024-10-21 13:19:04.417 DEBUG tokio-runtime-worker beefy: 🥩 Sent vote message: VoteMessage { commitment: Commitment { payload: Payload([([109, 104], [167, 131, 107, 171, 89, 116, 180, 131, 233, 88, 171, 186, 85, 246, 176, 240, 220, 185, 183, 15, 121, 176, 249, 55, 83, 147, 57, 166, 116, 50, 46, 53])]), block_number: 201, validator_set_id: 1 }, id: Public(0390084fdbf27d2b79d26a4f13f0ccd982cb755a661969143c37cbc49ef5b91f27 (KWByAN7W...)), signature: Signature(a25de1fd2ba3f1fdfd64e2254048a9968cae49c77de639510bdebd1d92162bbc51d80310b005694dc79a8a13094bc985fd169e38705055503fcb5c6ee4057efe01) }    
2024-10-21 13:19:04.417 DEBUG tokio-runtime-worker beefy::sync: 🥩 no good peers to request justif #201 from    
2024-10-21 13:19:06.001  INFO tokio-runtime-worker sc_basic_authorship::basic_authorship: 🙌 Starting consensus session on top of parent 0xc1de2059d65d3b9edadef3d4e6ce0f116d3f1d20d5846c898a96d697a6c173e3 (#56687)    
2024-10-21 13:19:06.003  INFO tokio-runtime-worker sc_basic_authorship::basic_authorship: 🎁 Prepared block for proposing at 56688 (1 ms) [hash: 0x9707c1444bc7b608778bd6c0ce60b08c6d03633d887c00359817cd6abe9120d4; parent_hash: 0xc1de…73e3; extrinsics_count: 1    
2024-10-21 13:19:06.004  INFO tokio-runtime-worker babe: 🔖 Pre-sealed block for proposal at 56688. Hash now 0xcfeda77a654bc5df3d33d81739a7686d382b903fa1d2ecadd0a2396bee65eb2b, previously 0x9707c1444bc7b608778bd6c0ce60b08c6d03633d887c00359817cd6abe9120d4.    
2024-10-21 13:19:06.004  WARN tokio-runtime-worker babe: 👶 Epoch(s) skipped: from 284 to 40758    
2024-10-21 13:19:06.004  INFO tokio-runtime-worker babe: 👶 New epoch 40758 launching at block 0xcfed…eb2b (block slot 576501982 >= start slot 576501940).    
2024-10-21 13:19:06.004  INFO tokio-runtime-worker babe: 👶 Next epoch starts at slot 576502140    
2024-10-21 13:19:06.004  INFO tokio-runtime-worker substrate: 🏆 Imported #56688 (0xc1de…73e3 → 0xcfed…eb2b)    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 Transforming grandpa notification. #56686(0x44a7fdf0826622bb89515fcf2f381507e24c8d922c6d10bee3cc73801ba58668)    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 Finality notification: header(number 56686, hash 0x44a7fdf0826622bb89515fcf2f381507e24c8d922c6d10bee3cc73801ba58668) tree_route []    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 201, end 201, validator set id 1    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy: 🥩 vote target - mandatory block: #201    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy::sync: 🥩 no good peers to request justif #201 from    
2024-10-21 13:19:07.939 DEBUG tokio-runtime-worker beefy::sync: 🥩 Handled BEEFY justification request from PeerId("12D3KooWQCkBm1BYtkHpocxCwMgR8yjitEeHGx8spzcDLGt2gkBm").    
2024-10-21 13:19:08.126 DEBUG tokio-runtime-worker beefy: 🥩 Round #201 concluded, finality_proof: V1(SignedCommitment { commitment: Commitment { payload: Payload([([109, 104], [167, 131, 107, 171, 89, 116, 180, 131, 233, 88, 171, 186, 85, 246, 176, 240, 220, 185, 183, 15, 121, 176, 249, 55, 83, 147, 57, 166, 116, 50, 46, 53])]), block_number: 201, validator_set_id: 1 }, signatures: [Some(Signature(321c80d9f48932ec91260f783be413ab19f07baf5bdfd0309937432d35716c342e2107f22cef2e73a92c07d3315fff47d363c8d2683fce25b38e2f805bc3166901)), Some(Signature(a25de1fd2ba3f1fdfd64e2254048a9968cae49c77de639510bdebd1d92162bbc51d80310b005694dc79a8a13094bc985fd169e38705055503fcb5c6ee4057efe01))] }).    
2024-10-21 13:19:08.126  INFO tokio-runtime-worker beefy: 🥩 Concluded mandatory round #201    
2024-10-21 13:19:08.315 DEBUG tokio-runtime-worker beefy: 🥩 New gossip filter: start 401, end 401, validator set id 2    

So sometimes, Alice and Bob don't reach substrate_beefy_best_block 56687 in 180s. I think we need to adjust that check, accounting for the worst case scenario.

@serban300
Copy link
Contributor

Opened #6154 for fixing the issue

@acatangiu @michalkucharczyk @pepoviola PTAL !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants