Restarted BFT validators can fail to agree on new blocks under certain conditions #8307

Open
matthew1001 opened this issue Feb 14, 2025 · 2 comments · May be fixed by #8308
Labels: bug, P3 Medium, QBFT

matthew1001 commented Feb 14, 2025

I believe this issue only occurs when enough BFT validators are restarted with a new data directory, so that they must resync with the existing node(s) before BFT block production can continue. Specific examples would be:

  1. A 2 node BFT chain, where a node is restarted with a clean data dir
  2. A 4 node BFT chain, where 2 nodes are restarted with clean data dirs
  3. ...any other case where the chain no longer has a quorum of validators with current data dirs.

Another scenario is a single validator node that has been producing blocks on its own and then votes in a new validator that has a fresh data dir. This is another way of reaching case 1 above.

Voting in a new validator that has already been following the chain should not be impacted by the issue.
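
To make the quorum condition in the examples above concrete, here is a minimal sketch. It assumes the usual IBFT 2.0/QBFT rule that a quorum is ceil(2n/3) of n validators and that f = floor((n - 1) / 3) faulty validators are tolerated; the function names are made up for illustration and are not Besu APIs.

```python
def fault_tolerance(n: int) -> int:
    """Number of faulty validators tolerated (assumed rule: floor((n - 1) / 3))."""
    return (n - 1) // 3


def quorum(n: int) -> int:
    """Validators needed to agree on a block (assumed rule: ceil(2n / 3))."""
    return -(-2 * n // 3)  # integer ceiling division


def has_quorum_with_current_data(n: int, wiped: int) -> bool:
    """True if the validators that kept their data dirs still form a quorum."""
    return n - wiped >= quorum(n)


if __name__ == "__main__":
    # (validators, validators restarted with a clean data dir)
    for n, wiped in [(2, 1), (4, 2), (4, 1)]:
        print(
            f"n={n} wiped={wiped} f={fault_tolerance(n)} quorum={quorum(n)} "
            f"quorum_with_current_data={has_quorum_with_current_data(n, wiped)}"
        )
```

Under these assumptions, cases 1 and 2 leave fewer validators with current data than the quorum requires, while a 4-node chain with a single wiped validator still has a quorum.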

Steps to Reproduce

  • Create a 2-validator QBFT chain (IBFT could also be used)
  • Let it produce a few hundred blocks
  • Stop node 2, delete its data dir, then restart it

Expected behavior:

Node 2 syncs with node 1, then continues to agree on new rounds and proposes new blocks when it is its turn

Actual behavior:

Node 2 starts up but then fails to advance its QBFT round timer, so no new blocks are produced
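
One way to confirm the stall from outside the node is to poll the chain height twice and check that it has not moved. The sketch below is only an illustration: it uses the standard `eth_blockNumber` JSON-RPC method, and the helper names, the RPC URL, and the wait time are assumptions rather than anything taken from Besu.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_block_number(rpc_url: str) -> int:
    """Ask a node for its current block height via eth_blockNumber."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        # The result is a hex string such as "0x1a4".
        return int(json.load(response)["result"], 16)


def is_stalled(
    get_height: Callable[[], int],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the height did not increase over wait_seconds."""
    before = get_height()
    sleep(wait_seconds)
    return get_height() <= before
```

For example, `is_stalled(lambda: fetch_block_number("http://localhost:8545"), wait_seconds=30)` would check a node with HTTP JSON-RPC enabled on the assumed default port; the wait should be several times the configured block period.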

Frequency:

It's unclear whether timing plays a part, but on 2-node chains it seems to fail reliably.

There are cases where it is not an issue, which perhaps explains why it hasn't been seen by many users:

  1. A single node (e.g. in a dev environment) doesn't suffer from this because it doesn't go through a sync process if it's the only validator
  2. A chain with enough validators that f >= 1 isn't affected if only 1 validator has its data dir deleted and resyncs, because the other validators propose a new block without it. This "un-sticks" the restarted node and everything proceeds as expected
  3. If a node is restarted but its data dir isn't cleared, it doesn't hit the issue

Versions

  • Software version: v25.1.0

matthew1001 self-assigned this issue and added the bug, P3 Medium, and QBFT labels on Feb 14, 2025.

matthew1001 commented

@jimthematrix tagging you in as you were interested in this issue and the potential fix.

matthew1001 commented Feb 14, 2025

I have a fix in a draft PR (#8308). I need to finish it off and refactor a little before putting it up for review.
