Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oonoonba sync - Invalid state root #2631

Closed
twoeths opened this issue Jun 4, 2021 · 8 comments · Fixed by #2661
Closed

Oonoonba sync - Invalid state root #2631

twoeths opened this issue Jun 4, 2021 · 8 comments · Fixed by #2661
Assignees

Comments

@twoeths
Copy link
Contributor

twoeths commented Jun 4, 2021

Describe the bug

Got this after we sync all the way to head

Jun-04 02:45:20.080 [CHAIN]           error: Block error slot=14122 code=BLOCK_ERROR_PARENT_UNKNOWN, parentRoot=0x7b0e44370bc961b4a38d337ac0296c0395950fd4ff335e1225c1251f5c657f55
Error: BLOCK_ERROR_PARENT_UNKNOWN
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:38:11)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:11)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:48)
    at Timeout._onTimeout (/usr/app/packages/lodestar/src/util/queue/index.ts:106:32)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7)
Jun-04 02:45:21.778 [CHAIN]           error: Block error slot=14122 code=BLOCK_ERROR_BEACON_CHAIN_ERROR, error={"message":"Invalid state root"}
Error: Invalid state root
    at Object.stateTransition (/usr/app/packages/beacon-state-transition/src/allForks/stateTransition.ts:68:13)
    at runStateTransition (/usr/app/packages/lodestar/src/chain/blocks/stateTransition.ts:86:30)
    at processBlocksInEpoch (/usr/app/packages/lodestar/src/chain/blocks/process.ts:110:24)
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:44:3)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:5)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:42)
    at JobQueue.runJob (/usr/app/packages/lodestar/src/util/queue/index.ts:106:22)
Error: BLOCK_ERROR_BEACON_CHAIN_ERROR
    at processBlocksInEpoch (/usr/app/packages/lodestar/src/chain/blocks/process.ts:151:11)
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:44:3)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:5)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:42)
    at JobQueue.runJob (/usr/app/packages/lodestar/src/util/queue/index.ts:106:22)
Jun-04 02:45:26.004 []                 info: Synced - finalized: 439 0x3fe0…ab21 - head: 14121 0x7b0e…7f55 - clockSlot: 14122 - peers: 2

Expected behavior

No error

Steps to Reproduce

./bin/lodestar beacon --network oonoonba

@twoeths
Copy link
Contributor Author

twoeths commented Jun 4, 2021

seems that this happens with docker sync only

Update: ignore this

@twoeths
Copy link
Contributor Author

twoeths commented Jun 4, 2021

after a while it could process the same blocks and stay synced
beacon.log.bak.zip

@dapplion
Copy link
Contributor

dapplion commented Jun 4, 2021

Notable parts of the logs:

  • Our unknown parent root sync seems to stall if the first block is already known. So we are unable to catch up and need a full RangeSync to reach the head again
  • There are many "invalid state root" errors both getting the blocks from gossip and then getting the same from RangeSync. However eventually we can move forward.
Jun-04 03:09:56.000 [CHAIN]         �[36mverbose�[39m: Clock slot slot=14245
Jun-04 03:09:56.000 [CHAIN]           �[34mdebug�[39m: Block pools:  pendingBlocks=0, currentSlot=14245
Jun-04 03:09:56.078 [NETWORK]       �[36mverbose�[39m: Received gossip block slot=14245
Jun-04 03:09:56.082 [CHAIN]           �[31merror�[39m: Block error slot=14245 code=BLOCK_ERROR_PARENT_UNKNOWN, parentRoot=0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e
Error: BLOCK_ERROR_PARENT_UNKNOWN
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:38:11)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:11)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:48)
    at Timeout._onTimeout (/usr/app/packages/lodestar/src/util/queue/index.ts:106:32)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7)
Jun-04 03:09:56.084 [CHAIN]           �[34mdebug�[39m: Add block to pool blockKey=0x643d4ea46a9968a72e890a33eccd807c57e2b05f03bdab0f87680daa29812ae8
Jun-04 03:09:56.086 [SYNC]          �[36mverbose�[39m: Finding block for unknown ancestor root parentRootHex=0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e
Jun-04 03:09:56.086 [NETWORK]         �[34mdebug�[39m: Req  dialing peer method=beacon_blocks_by_root, encoding=ssz_snappy, agentVersion=teku/teku/v21.5.0+59-ga3cc317/linux-x86_64/-redhatinc-openjdk64bitservervm-java-16, peer=16...EAGj7C, requestId=470
Jun-04 03:09:56.313 [NETWORK]         �[34mdebug�[39m: Req  sending request method=beacon_blocks_by_root, encoding=ssz_snappy, agentVersion=teku/teku/v21.5.0+59-ga3cc317/linux-x86_64/-redhatinc-openjdk64bitservervm-java-16, peer=16...EAGj7C, requestId=470, requestBody=["0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e"]
Jun-04 03:09:56.316 [NETWORK]         �[34mdebug�[39m: Req  request sent method=beacon_blocks_by_root, encoding=ssz_snappy, agentVersion=teku/teku/v21.5.0+59-ga3cc317/linux-x86_64/-redhatinc-openjdk64bitservervm-java-16, peer=16...EAGj7C, requestId=470
Jun-04 03:09:56.592 [NETWORK]       �[36mverbose�[39m: Req  done method=beacon_blocks_by_root, encoding=ssz_snappy, agentVersion=teku/teku/v21.5.0+59-ga3cc317/linux-x86_64/-redhatinc-openjdk64bitservervm-java-16, peer=16...EAGj7C, requestId=470
Jun-04 03:09:56.592 [SYNC]          �[36mverbose�[39m: Found UnknownBlockRoot parentRootHex=0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e
Jun-04 03:09:56.621 [CHAIN]         �[36mverbose�[39m: Block processed slot=14244, root=0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e
Jun-04 03:09:56.622 [CHAIN]         �[36mverbose�[39m: New chain head headSlot=14244, headRoot=0xd1882df259a549fbfd5e4bb71f95eff9ae85339729f2e0a4d49523934b93c72e
Jun-04 03:09:56.630 [CHAIN]           �[34mdebug�[39m: Attestation processed slot=14243, index=0, targetRoot=0x401533d5acc60a6051f4102b64d50b67a555424c408ed73439e92a9f6b50eafe, aggregationBits=0xffffffffffffffffff0f
Jun-04 03:09:56.670 [CHAIN]           �[31merror�[39m: Block error slot=14245 code=BLOCK_ERROR_BEACON_CHAIN_ERROR, error={"message":"Invalid state root"}
Error: Invalid state root
    at Object.stateTransition (/usr/app/packages/beacon-state-transition/src/allForks/stateTransition.ts:68:13)
    at runStateTransition (/usr/app/packages/lodestar/src/chain/blocks/stateTransition.ts:86:30)
    at processBlocksInEpoch (/usr/app/packages/lodestar/src/chain/blocks/process.ts:110:24)
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:44:3)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:5)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:42)
    at JobQueue.runJob (/usr/app/packages/lodestar/src/util/queue/index.ts:106:22)
Error: BLOCK_ERROR_BEACON_CHAIN_ERROR
    at processBlocksInEpoch (/usr/app/packages/lodestar/src/chain/blocks/process.ts:151:11)
    at processBlock (/usr/app/packages/lodestar/src/chain/blocks/process.ts:44:3)
    at processBlockJob (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:59:5)
    at Object.job (/usr/app/packages/lodestar/src/chain/blocks/processor.ts:38:42)
    at JobQueue.runJob (/usr/app/packages/lodestar/src/util/queue/index.ts:106:22)

@twoeths
Copy link
Contributor Author

twoeths commented Jun 7, 2021

These 2 errors actually have the same root cause: we cannot process a block after fetching it from pendingBlock db, and the pending blocks stay in pendingBlock db forever.

In the above log, 14255 came missing parent root, it' saved to pendingBlocks db, node fetched and processed the parent - 14244 successfully, then it fetched back 14255 from pendingBlocks db and it failed to process with Invalid state root error and it stayed in pendingBlocks db. From now all gossip blocks after 14255 comes to pendingBlocks db with UNKNOWN_PARENT error and couldn't be processed, it tried to get missing ancestor root which is 14244 and got BLOCK_ERROR_BLOCK_IS_ALREADY_KNOWN error.

If a gossip block come without being persisted to pendingBlocks db, it's fine to process and the node could be syned due to that.

@twoeths
Copy link
Contributor Author

twoeths commented Jun 8, 2021

Update: this Invalid state root issue also happen with processChainSegment

Jun-08 02:53:25.747 [SYNC]          verbose: Batch process error id=Finalized, startEpoch=1023, status=Processing code=BLOCK_ERROR_BEACON_CHAIN_ERROR, error={"message":"Invalid state root"}
Error: Invalid state root
    at Object.stateTransition (/root/workspace/node/lodestar/packages/beacon-state-transition/src/allForks/stateTransition.ts:66:13)
    at runStateTransition (/root/workspace/node/lodestar/packages/lodestar/src/chain/blocks/stateTransition.ts:83:30)
    at processBlocksInEpoch (/root/workspace/node/lodestar/packages/lodestar/src/chain/blocks/process.ts:110:24)
    at processChainSegment (/root/workspace/node/lodestar/packages/lodestar/src/chain/blocks/process.ts:79:7)
    at processChainSegmentJob (/root/workspace/node/lodestar/packages/lodestar/src/chain/blocks/processor.ts:144:3)
    at Object.job (/root/workspace/node/lodestar/packages/lodestar/src/chain/blocks/processor.ts:42:42)
    at Timeout._onTimeout (/root/workspace/node/lodestar/packages/lodestar/src/util/queue/index.ts:106:22)

@twoeths twoeths self-assigned this Jun 8, 2021
@twoeths
Copy link
Contributor Author

twoeths commented Jun 8, 2021

when we refetch a pending block from db, it's actually a struct and we have an issue with hashTreeRoot regarding BitVectorType, see #2648

@twoeths
Copy link
Contributor Author

twoeths commented Jun 8, 2021

We always have Invalid state root issue with block 32768 of oonoonba network so I downloaded it
signed_block_32768.zip
pre_state_32767.zip
post_state_32768.zip

  • post_state_32768 has the invalid root
  • if I reprocess the signed block 32768 with pre state 32767, I don't see the issue
  • comparing the resulting state and post_state_32768, the only difference is balances and it happens with most of validator

@twoeths
Copy link
Contributor Author

twoeths commented Jun 9, 2021

the root cause is we compute the next sync committee indices at wrong place - when rotating epoch. At that time the epoch in the state is increased already and we have different active validator indices

epochCtx.nextSyncCommitteeIndexes = getNextSyncCommitteeIndices(state);

at that time we should simply get sync committee indices from the available next sync committee pubkeys calculated at previous epoch transition (at sync committee period boundary)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants