Skip to content
This repository has been archived by the owner on Dec 4, 2024. It is now read-only.

Block writing still stops in non-sealing nodes due to nil BestPeer #179

Closed
dankostiuk opened this issue Oct 7, 2021 · 1 comment · Fixed by #208
Closed

Block writing still stops in non-sealing nodes due to nil BestPeer #179

dankostiuk opened this issue Oct 7, 2021 · 1 comment · Fixed by #208
Assignees

Comments

@dankostiuk
Copy link
Contributor

Block writing stops in non-sealing nodes due to nil BestPeer

Description

It looks like we are still observing the same issue as originally described in #167

After investigating, this seems to be related to some inconsistencies in how we determine the best peer to connect to within BestPeer(). We are seeing that at times, curDiff is always greater than bestTd, resulting in this function always returning nil.

We've adding some extra logging in BestPeer to help investigate:

func (s *Syncer) BestPeer() *syncPeer {
	var bestPeer *syncPeer
	var bestTd *big.Int

	for _, p := range s.peers {
		status := p.status
		if bestPeer == nil || status.Difficulty.Cmp(bestTd) > 0 {
			bestPeer, bestTd = p, status.Difficulty
		}
	}
	if bestPeer == nil {
		return nil
	}
	curDiff := s.blockchain.CurrentTD()
	fmt.Println("BestPeer() curDiff = ", curDiff)
	fmt.Println("BestPeer() bestTd = ", bestTd)
	if bestTd.Cmp(curDiff) <= 0 {
		return nil
	}
	return bestPeer
}

Although the change 5170883 was put in (to broadcast the header's difficulty instead of the block number), it appears that we are actually initializing the header's Difficulty to be the same value as the header's Number, resulting in a broadcast of the same value:

https://github.com/0xPolygon/polygon-sdk/blob/bba205e6ed4aad9522450558ed3dbad67664e723/consensus/ibft/ibft.go#L349-L353

Upon attempting to correct L353 above, we are still seeing the same issue as can be seen in our logs. Below the bestTd is always 464670654277. Furthermore, the difficulty outputted in the enqueue block debug statement has the same value as the block number:

2021-10-07T14:11:43.046Z [DEBUG] polygon.consensus.syncer: enqueue block: peer=16Uiu2HAmBv5UkctkZK1ui2tMip6XqpykvMnrzDuLGNjN38YDztmW number=982519 hash=0xfbda9f200b380eeab2eeb9fb2e80bb25a925929b55c1f2a35d38dbc085f1c831
2021-10-07T14:11:43.046Z [DEBUG] polygon.consensus.syncer: update peer status: peer=16Uiu2HAmBv5UkctkZK1ui2tMip6XqpykvMnrzDuLGNjN38YDztmW latest block number=982519 latest block hash=0xfbda9f200b380eeab2eeb9fb2e80bb25a925929b55c1f2a35d38dbc085f1c831 difficulty=982519
2
BestPeer() curDiff =  482664423817
BestPeer() bestTd =  464670654277

I've put together the difficulties of the 5 peers currently connected:

peer1.status.Difficulty = 982668
peer2.status.Difficulty = 982668
peer3.status.Difficulty = 982668
peer4.status.Difficulty = 982668
peer5.status.Difficulty = 464670654277

In the above scenario, peer5 will always be our 'bestPeer', however its difficulty is always less than the current difficulty of 482664423817 so we will always return nil.

I would assume that if the Difficulty value for peer1-peer4 were sent correctly, we would connect to a better beer whose Difficulty would be greater than our current difficulty.

Your environment

  • OS and version Ubuntu 20
  • version of the Polygon SDK bba205e
  • branch that causes this issue develop

Steps to reproduce

  • start some non-sealing nodes connected to a cluster of 5 validators
  • observe that we can never connect to a peer who's difficulty is greater than the current block difficulty
  • restarting sometimes fixes the isssue

Expected behaviour

Gossip'd blocks should continue to be written to state and processed normally after bulk sync.

Actual behaviour

Although we still enqueue blocks, we are no longer writing them to state when we can't properly determine the best difficulty of a connected peer.

@dbrajovic dbrajovic mentioned this issue Oct 26, 2021
8 tasks
@zivkovicmilos zivkovicmilos linked a pull request Oct 27, 2021 that will close this issue
8 tasks
@dbrajovic dbrajovic linked a pull request Nov 9, 2021 that will close this issue
8 tasks
@dbrajovic dbrajovic removed a link to a pull request Nov 9, 2021
8 tasks
@dbrajovic
Copy link
Contributor

dbrajovic commented Nov 9, 2021

Hey @dankostiuk ,

the (recurring) issue is addressed in PR#208. This should solve any BestPeer() issues a Syncer might have with other peers.

dbrajovic added a commit that referenced this issue Nov 15, 2021
* broadcast chain difficulty instead of block difficulty

* change Syncer unit tests to compare chain difficulty instead of block difficulty

* resolves issues #167, #179
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants