This repository has been archived by the owner on Dec 4, 2024. It is now read-only.

Solve syncer issue within BestPeer() #208

Merged 6 commits on Nov 15, 2021

Conversation


@dbrajovic commented Nov 9, 2021

Description

This fix provides the correct bestTd values from peers when a node is choosing the best candidate.

Changes include

  • Bugfix (non-breaking change that solves an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)

Checklist

  • I have assigned this PR to myself
  • I have added at least 1 reviewer
  • I have tested this code
  • I have updated the README and other relevant documents (guides...)
  • I have added sufficient documentation both in code, as well as in the READMEs

Additional comments

Explanation:

Cause

When building a new block, validator nodes always send the block number as Difficulty (instead of TotalDifficulty). If this info is relayed to a non-validator node before it resolves its best peer, the non-validator ends up looping inside RunAcceptState() for quite some time, listening for new blocks but unable to write them.
As a consequence, reproducing the issue is somewhat random, as the non-validator will only sometimes be able to re-sync with the chain.
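To make the gap between the two values concrete, here is a minimal sketch of the arithmetic. It assumes the IBFT convention mentioned in this PR (a block's difficulty equals its number) and, purely for illustration, that the genesis block contributes 0. The function names `blockDifficulty` and `totalDifficulty` are hypothetical and do not come from the codebase.

```go
package main

import "fmt"

// blockDifficulty models the convention described in this PR: in IBFT a
// block's difficulty equals its number. Hypothetical helper for illustration.
func blockDifficulty(number uint64) uint64 {
	return number
}

// totalDifficulty sums the per-block difficulty from block 1 up to head.
// Assumption: the genesis block contributes 0.
func totalDifficulty(head uint64) uint64 {
	var total uint64
	for n := uint64(1); n <= head; n++ {
		total += blockDifficulty(n)
	}
	return total
}

func main() {
	head := uint64(100)
	fmt.Println("block difficulty:", blockDifficulty(head)) // 100
	fmt.Println("total difficulty:", totalDifficulty(head)) // 5050
}
```

Under these assumptions the two values diverge quickly, so a peer that advertises the block difficulty looks far behind a node that compares against its own total difficulty.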

In-depth

When a node's Syncer first starts, it subscribes to the blockchain's stream of events (in the background) and handles incoming peer connections and disconnections. While the event stream is mostly continuous, peer connections and disconnections are rather sporadic. Nonetheless, both flows access the data structures crucial for the Syncer to work as intended.
In our particular scenario (step 4.), when a node first connects to the network it starts to receive initial handshakes from other peers, each containing the difficulty of the chain (totalDifficulty) that peer is seeing. Example (with custom prints):

[Screenshot: handshake]

In the example above, the 4 validator nodes are up and running, while the last 2 non-validators got stuck previously. The node being restarted is a non-validator which also got stuck on diff: 3742. Let's see what it does next:

[Screenshot: best_peer]

What happened?

Right after the node connected to its peers, there was a window of opportunity for a race. If a new block was broadcast from the validator set before the node reached the call to bestPeer(), that broadcast overwrote the difficulty the node had previously registered for that peer (during the initial handshake). The "lower" value comes from the fact that the validator, instead of sending the total difficulty of the chain, actually sent the last block's difficulty (which in IBFT is, at the time of writing, equal to the block's number).

At the time of discovery, the validator nodes were the only ones broadcasting blocks to the network, which is why their difficulty grows in the next call to bestPeer() (as opposed to the stuck non-validators).
Regardless, the node rejects these peers as candidates (believing they have yet to sync up with the chain) and repeats the process.
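The sequence described here can be sketched as follows. This is a simplified model, not the Syncer's actual code: `peerTable`, `update`, and the strict "greater than local" comparison in `bestPeer` are assumptions made for illustration, and every number except the local difficulty 3742 is made up.

```go
package main

import (
	"fmt"
	"sync"
)

// peerTable is a hypothetical stand-in for the Syncer's per-peer state.
type peerTable struct {
	mu    sync.Mutex
	diffs map[string]uint64 // peer ID -> last difficulty reported by that peer
}

func newPeerTable() *peerTable {
	return &peerTable{diffs: make(map[string]uint64)}
}

// update is called by both flows: the initial handshake and block broadcasts.
func (t *peerTable) update(id string, difficulty uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.diffs[id] = difficulty
}

// bestPeer returns the peer with the highest recorded difficulty, but only
// if it is strictly above the local one (assumed selection rule).
func (t *peerTable) bestPeer(local uint64) (string, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()

	best, bestDiff, found := "", local, false
	for id, d := range t.diffs {
		if d > bestDiff {
			best, bestDiff, found = id, d, true
		}
	}
	return best, found
}

func main() {
	const local = 3742 // difficulty the restarted node is stuck on
	table := newPeerTable()

	// 1. Handshake: the validator reports its total difficulty.
	table.update("validator-1", 5000)

	// 2. A block broadcast arrives before bestPeer() runs and overwrites
	//    the entry with the block difficulty (equal to the block number).
	table.update("validator-1", 100)

	// 3. The validator now looks behind the local node and is rejected.
	_, ok := table.bestPeer(local)
	fmt.Println("candidate found:", ok) // false
}
```

Note that the mutex only keeps the map access safe. The problem in this sketch is one of ordering and of mixing two different quantities in the same field, which locking alone does not solve.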

Fortunately, all it takes to solve this bug is to store the chain's difficulty (total difficulty) in the Difficulty field of Status instead of the last block's difficulty.
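A rough sketch of the before/after behaviour, under stated assumptions: the `Status` and `header` types below are simplified stand-ins (plain `uint64` fields for brevity), the helper names `statusBeforeFix` and `statusAfterFix` are hypothetical, and the numbers other than 3742 are invented. It is not the diff merged in this PR.

```go
package main

import "fmt"

// header is a simplified, hypothetical block header.
type header struct {
	Number     uint64
	Difficulty uint64 // per-block difficulty (equals Number in IBFT)
}

// Status is a simplified stand-in for the status a node advertises to peers.
type Status struct {
	Number     uint64
	Difficulty uint64
}

// statusBeforeFix mirrors the bug: the last block's difficulty is advertised.
func statusBeforeFix(h header) Status {
	return Status{Number: h.Number, Difficulty: h.Difficulty}
}

// statusAfterFix mirrors the fix: the chain's total difficulty is advertised.
// totalDifficulty is assumed to be looked up by the caller.
func statusAfterFix(h header, totalDifficulty uint64) Status {
	return Status{Number: h.Number, Difficulty: totalDifficulty}
}

func main() {
	const local = 3742 // total difficulty of the stuck non-validator
	h := header{Number: 100, Difficulty: 100}

	before := statusBeforeFix(h)
	after := statusAfterFix(h, 5050)

	fmt.Println("before fix, peer ahead of local:", before.Difficulty > local) // false
	fmt.Println("after fix, peer ahead of local:", after.Difficulty > local)   // true
}
```

With the total difficulty in the Difficulty field, the value a peer sends in a block broadcast is comparable to the value it sent in the handshake, so a later update can no longer make a synced validator look stale.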


@lazartravica left a comment


LGTM, please add a test checking the difficulty broadcasted.

@lazartravica mentioned this pull request Nov 9, 2021
protocol/testing.go (outdated review thread, resolved)

@zivkovicmilos left a comment


Thank you for figuring all of this out, it was definitely tricky 🙏

Looks great to me, please make sure to solve any failing tests 💯


@lazartravica left a comment


Thank you for the changes to the tests. LGTM

@dbrajovic merged commit 8dd9fa6 into develop Nov 15, 2021
@dbrajovic deleted the bug/best-peer-issue branch November 15, 2021 12:48
@dbrajovic mentioned this pull request Nov 18, 2021
4 participants