Node stop mining due to incorrect channel handling #323

Closed
DarianShawn opened this issue Mar 14, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@DarianShawn
Collaborator

DarianShawn commented Mar 14, 2023

[ibft]

Description

All validator nodes in devnet stopped mining between 2023-03-13T15:50:00 and 2023-03-14T00:11:00. They did not recover until they were restarted.

The validator creates a new channel every time it is notified by the blockchain subscription, and does not close it until the sequence is canceled.
In fact, the context is a different one, so goroutines might leak here and never exit.

func (i *Ibft) runSequence(height uint64) <-chan struct{} {

isValidator = i.isValidSnapshot()

// validator must not be in syncing mode to start a new block
if isValidator && !i.syncer.IsSyncing() {
	sequenceCh = i.runSequence(pending)
}

select {
case <-syncerBlockCh:
	if isValidator {
		i.stopSequence()
		i.logger.Info("canceled sequence", "sequence", pending)
	}

	i.logger.Info("sequence canceled due to new block", "sequence", pending)
case <-sequenceCh:
case <-i.closeCh:
	if isValidator {
		i.stopSequence()
		i.logger.Info("ibft close", "sequence", pending)
	}

	return
}

Your environment

  • OS and version
    • CentOS 8
  • version of the Dogechain
  • branch that causes this issue
    • feat-state-snapshot

Steps to reproduce

  • Replace 4 validators and 1 RPC node with the specific version.
  • Send 300-500 multicall contract method calls to the RPC endpoint.
  • After all transactions are sealed, the validators stop producing blocks within 1-2 hours.

Logs

See the attached log: not-sealing.log
The log line `failed to update submodules in consensus: height=7251395 err="failed to get update lock"` repeats. After that, no more synced blocks are triggered and no more blocks are sealed.

Proposed solution

  • Recombine the IBFT and syncing modules so that they run smoothly together again.

DarianShawn added the bug (Something isn't working) label Mar 14, 2023

@DarianShawn
Collaborator Author

@0xcb9ff9 This is a Top-A issue and we need to fix it before any new releases.

@0xcb9ff9

The state here is very complicated. Should we create a temporary patch first?

@DarianShawn
Collaborator Author

The solution is mainly to preserve the sealed state and prevent re-entering the consensus loop.
Also, we will ignore stale block height notifications since we are already participating in consensus with a higher block height.

It passed system tests and local manual tests on 1-validator and 4-validator networks.
It will be tested on DevNet for 1-2 days to cover some corner cases.
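
The idea of ignoring stale block height notifications can be sketched roughly as follows. This is an illustration only, not the actual patch: `isStale` and `nextFresh` are hypothetical helpers, and the sketch assumes that `pending` is the height the node is currently trying to seal and that notifications arrive as plain heights on a channel.

```go
package main

// isStale reports whether a notification for a block at height notified can
// be ignored because the node is already working on a higher height.
// Hypothetical helper, not part of the Dogechain code base.
func isStale(notified, pending uint64) bool {
	return notified < pending
}

// nextFresh reads heights from notifications and returns the first one that
// is not stale with respect to pending. It returns false if the channel is
// closed before a fresh height arrives.
func nextFresh(notifications <-chan uint64, pending uint64) (uint64, bool) {
	for h := range notifications {
		if isStale(h, pending) {
			continue // an older block; keep the running sequence alive
		}

		return h, true
	}

	return 0, false
}
```

With a filter like this, only a notification at or above the pending height would interrupt the running sequence.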

@DarianShawn
Collaborator Author

Commits 4440033 and 766006a attempt to revert to an older available version.

@DarianShawn
Collaborator Author

The issue will be closed as it has not recurred in two weeks.
Please reopen if we find out more. @0xcb9ff9

0xcb9ff9 added a commit that referenced this issue Apr 11, 2023
# Description

PR #323 

Two problems

1. When node A is getting blocks from node B and node B is forced to
shut down (or the network connection drops), `GetBlocks` never times
out, so node A blocks while syncing. Although difficult to reproduce,
the problem does exist.

2. `PeerConnInfo.protocolClient` saves the gRPC client, but no timeout
can be set on the libp2p stream, so the stream leaks if the peer
disconnects.

This PR stops saving the gRPC client, passes a timeout context, and
enforces timeouts on libp2p streams.