This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Finalization issue #2304

Closed
xlc opened this issue Apr 17, 2019 · 11 comments
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.

Comments

@xlc
Contributor

xlc commented Apr 17, 2019

Based on 7c64746
[Screenshot: Screen Shot 2019-04-17 at 3 22 47 PM]

All nodes stopped finalizing at block 14046 or 14071.
New nodes fail to finalize at all.

Restarting doesn't help.

Is there anything we can do to diagnose the issue and resume the finalization process?

@bkchr
Member

bkchr commented Apr 17, 2019

Could you try to collect some logs with -lafg?

CC @andresilva @rphmeier

@bkchr bkchr added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Apr 17, 2019
@gguoss
Contributor

gguoss commented Apr 17, 2019

There are 4 validators (3 at 14046, 1 at 14071). Maybe the afg votes of the 3 validators do not exceed 2/3 of the weight.

Maybe restarting all 4 validators would resume GRANDPA.
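
The supermajority arithmetic behind this guess can be sketched as follows. This is a minimal sketch, assuming all four validators carry equal weight and that the threshold follows the usual BFT bound n - floor((n - 1) / 3). The function names are illustrative and are not Substrate APIs.

```python
def grandpa_threshold(total_weight: int) -> int:
    """Smallest vote weight that counts as a supermajority.

    Assumption: at most f = (n - 1) // 3 of the weight is faulty,
    so a supermajority needs n - f.
    """
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    faulty = (total_weight - 1) // 3
    return total_weight - faulty


def can_finalize(agreeing_weight: int, total_weight: int) -> bool:
    """True if the agreeing weight reaches the supermajority threshold."""
    return agreeing_weight >= grandpa_threshold(total_weight)


if __name__ == "__main__":
    # Four equal-weight validators (equal weights are an assumption).
    print(grandpa_threshold(4))  # 3
    print(can_finalize(3, 4))    # True
    print(can_finalize(2, 4))    # False
```

Under these assumptions, three of four validators sit exactly at the threshold, so there is no slack: a single vote that is missing or cast in a different round would be enough to stall finalization.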

@andresilva
Contributor

@gguoss I'm assuming the authorities are stuck on different rounds. If you collect logs for the afg target, you should see messages stating which GRANDPA round each node is in. Can you also check whether the validator that finalized 14071 is connected to the other validators? I think that validator is on a later GRANDPA round than the others.

@xlc
Contributor Author

xlc commented Apr 18, 2019

Logs from validators and some other nodes https://gist.github.com/xlc/82e9c35d95f9e400134de047d6dfea67

@andresilva
Contributor

logs-from-cennznet-validators-validator-0-in-cennznet-validators-validator-0-0.txt:2019-04-18 01:17:25.549 main DEBUG afg  Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.
logs-from-cennznet-validators-validator-1-in-cennznet-validators-validator-1-0:2019-04-18 01:17:32.213 main DEBUG afg  Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-2-in-cennznet-validators-validator-2-0.txt:2019-04-18 01:17:16.724 main DEBUG afg  Voter VALIDATOR_2 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-3-in-cennznet-validators-validator-3-0:2019-04-18 01:17:25.558 main DEBUG afg  Voter VALIDATOR_3 noting beginning of round (Round(550), SetId(0)) to network.
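
The round numbers in log lines like these can be pulled out with a small script rather than read by eye. The following is a rough sketch that assumes the message format shown above and that lines are fed in chronological order; `latest_rounds` and `group_by_round` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Matches the "noting beginning of round" messages quoted in this thread.
ROUND_LINE = re.compile(
    r"Voter (?P<voter>\S+) noting beginning of round "
    r"\(Round\((?P<round>\d+)\), SetId\((?P<set_id>\d+)\)\)"
)


def latest_rounds(lines):
    """Return {voter: (set_id, round)} from the last matching line per voter."""
    latest = {}
    for line in lines:
        m = ROUND_LINE.search(line)
        if m:
            latest[m["voter"]] = (int(m["set_id"]), int(m["round"]))
    return latest


def group_by_round(latest):
    """Return {(set_id, round): sorted voters} so stragglers stand out."""
    groups = defaultdict(list)
    for voter, key in latest.items():
        groups[key].append(voter)
    return {key: sorted(voters) for key, voters in groups.items()}


# The four messages from the logs above, without the file-name prefixes.
SAMPLE_LINES = [
    "2019-04-18 01:17:25.549 main DEBUG afg  Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.",
    "2019-04-18 01:17:32.213 main DEBUG afg  Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.",
    "2019-04-18 01:17:16.724 main DEBUG afg  Voter VALIDATOR_2 noting beginning of round (Round(550), SetId(0)) to network.",
    "2019-04-18 01:17:25.558 main DEBUG afg  Voter VALIDATOR_3 noting beginning of round (Round(550), SetId(0)) to network.",
]

if __name__ == "__main__":
    for key, voters in sorted(group_by_round(latest_rounds(SAMPLE_LINES)).items()):
        print(key, voters)
```

Any validator that ends up in a group of its own is a candidate for having run ahead of, or fallen behind, the rest of the set.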

So it seems that one of the validators progressed to the next round (maybe because the other authorities didn't see its vote), while the other authorities are stuck in round 550 and probably don't have the threshold stake to finalize. To get finality started again, stop all validators and copy the database from validator 0 to the other validators' nodes. That way, when you restart the nodes, they will all be at round 551.
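
The copy step could be scripted roughly as below. This is only a sketch: the directory layout in the usage note is hypothetical (the real database location depends on how each node was started), the function name `replace_db` is made up for illustration, and the script assumes every node has already been stopped.

```python
import shutil
from pathlib import Path
from typing import Optional


def replace_db(source_db: Path, target_db: Path) -> Optional[Path]:
    """Replace target_db with a copy of source_db, keeping a backup.

    Both nodes must already be stopped; this function does not check that.
    Returns the backup path, or None if target_db did not exist.
    """
    if not source_db.is_dir():
        raise FileNotFoundError(f"source database not found: {source_db}")
    backup = None
    if target_db.exists():
        # Keep the old database next to the new one instead of deleting it.
        backup = target_db.with_name(target_db.name + ".bak")
        if backup.exists():
            raise FileExistsError(f"backup already exists: {backup}")
        shutil.move(str(target_db), str(backup))
    # copytree creates target_db (and missing parents) itself.
    shutil.copytree(source_db, target_db)
    return backup
```

With a hypothetical layout of `/data/validator-N/db`, the call for each stuck validator would look like `replace_db(Path("/data/validator-0/db"), Path("/data/validator-1/db"))`. Keeping the `.bak` copy makes it possible to roll back if the restarted nodes misbehave.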

We are working on improvements to fix these situations where finality can get stuck with a small number of validators (9631622 was recently merged, which should help as well).

@xlc
Contributor Author

xlc commented Apr 18, 2019

Thanks. I will upgrade the substrate version and do the fix next week and report the results here.

@xlc
Contributor Author

xlc commented Apr 29, 2019

I tried copying the DB of validator 0 to the other validators and resetting all other nodes, and it somehow breaks the connection. Maybe related to #2335.

[Screenshot: Screen Shot 2019-04-30 at 10 22 15 AM]

I am going to pull the latest Substrate and reset the testnet to see if this happens again.

@xlc
Contributor Author

xlc commented May 8, 2019

Not happening anymore.

@xlc xlc closed this as completed May 8, 2019
@xlc
Contributor Author

xlc commented Jun 11, 2019

It is happening again. Please let me know if there is anything you need to diagnose this issue. The testnet is public now. Our telemetry server is not public, but we are going to migrate to the Polkadot one soon.

Our web UI: https://cennznet.js.org/cennznet-ui/
Our repo: https://github.com/cennznet/cennznet
Use --chain=rimu to join the Rimu testnet. It is also the default network, so not specifying a chain will join it as well.

Let me know if you need anything, like logs from our validators, or a validator seat.

[Screenshot: Screen Shot 2019-06-11 at 12 42 03 PM]

@xlc xlc reopened this Jun 11, 2019
@xlc
Contributor Author

xlc commented Jul 24, 2019

Most likely fixed in new version.

@badkk
Contributor

badkk commented Aug 25, 2020

> I tried copying the DB of validator 0 to the other validators and resetting all other nodes, and it somehow breaks the connection. Maybe related to #2335.
>
> [Screenshot: Screen Shot 2019-04-30 at 10 22 15 AM]
>
> I am going to pull the latest Substrate and reset the testnet to see if this happens again.

Would this work? I am using rc5 and it is happening again.
