
Do not crash on block gap in displaced_leaves_after_finalizing #4997

Merged: 7 commits merged into master from skunert/fix-warp-displaced-branches, Jul 17, 2024

Conversation

@skunert (Contributor) commented Jul 10, 2024

After the merge of #4922 we saw failing zombienet tests with the following error:

```
2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000
2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000
```

[Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262)

The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In displaced_leaves_after_finalizing we then iterate from the finalized block backwards until we hit an unknown block, crashing the node.

This PR makes the detection of displaced branches resilient against unknown blocks in the finalized chain.
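A minimal sketch of the idea, assuming a simplified chain store (all names and types here are illustrative, not the actual patch): the backward walk treats a missing header as the end of the ancestry, stopping at either genesis or the block gap, instead of surfacing `UnknownBlock` as a fatal error.

```rust
use std::collections::HashMap;

/// Hypothetical minimal view of a chain store that may contain gaps in
/// block ancestry after a warp sync.
struct Chain {
    /// Maps a block hash to its parent hash; headers inside the gap (and
    /// the genesis block's parent) are simply absent.
    parents: HashMap<[u8; 32], [u8; 32]>,
}

impl Chain {
    /// Walk from `start` towards genesis, collecting known ancestors and
    /// stopping gracefully at the first block whose header is missing,
    /// rather than returning an error that crashes the caller.
    fn known_ancestors(&self, start: [u8; 32]) -> Vec<[u8; 32]> {
        let mut out = Vec::new();
        let mut current = start;
        while let Some(parent) = self.parents.get(&current) {
            out.push(current);
            current = *parent;
        }
        // `current`'s header was not found: either we reached genesis or we
        // hit the block gap. Stop the walk here instead of propagating an
        // `UnknownBlock`-style error up to the GRANDPA voter.
        out
    }
}
```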

cc @nazar-pc (github won't let me request a review from you)

@skunert skunert added the T0-node This PR/Issue is related to the topic “node”. label Jul 10, 2024
@skunert skunert requested review from bkchr, lexnv and a team July 10, 2024 17:44
@nazar-pc (Contributor) left a comment:

Thanks for tagging!

Three resolved review threads on substrate/primitives/blockchain/src/backend.rs (outdated).
Comment on lines 3193 to 3198:

```rust
{
    let displaced =
        blockchain.displaced_leaves_after_finalizing(a3_hash, a3_number).unwrap();
    assert_eq!(displaced.displaced_leaves, vec![(a1_number, a1_hash)]);
    assert_eq!(displaced.displaced_blocks, vec![]);
}
```

Contributor:

This assumes a specific narrow case of warp sync: we download future blocks and then sync the gap. In Subspace's Snap sync we actually have a similar situation, but without gap sync afterwards. We have to clear the gap manually due to Substrate's hardcoded notion of warp sync, see #4607 and the issue linked from there. Also, currently we are only able to do a single jump from genesis to a future block and no further jumps after that, which is another annoying limitation in Substrate that we'd like to lift if possible.

Another case that seems to be missing here is this:

```
// g -> a1 -> a2 -> <unimported> -> a3 -> a4
//  \-> b1 -> b2
```

If I understand the proposed change correctly, it will remove both the a2 and b2 leaves, while the blocks themselves will remain in the database. Later, when gap sync catches up, it'll connect to either a2 or b2 (or maybe even some c3, I am not familiar with warp sync logic), which means some blocks will essentially "leak" and never be removed from the database, since they are not part of any active leaf and thus are never discovered as displaced blocks by displaced_leaves_after_finalizing again.

The correct solution would be to simply delay the decision, with two cases:

  1. for warp sync it is sufficient to simply delay the decision until gap sync fills the unimported gap, and then decide what to prune with the information of the full finalized chain
  2. for other sync strategies, like the one used in Subspace, we still need to delay the decision, but there will never be a decision; we can only wait for leaves that are 100% outside of the pruning depth anyway, at which point the whole thing all the way down to genesis needs to be pruned unconditionally. I'm also wondering whether this applies to warp sync when it is used on non-archival nodes (again, I'm not familiar with the implementation, so let me know if warp sync is not applicable to non-archival nodes anyway).

For 2. above I think displaced_leaves_after_finalizing should return these leaves with an "unknown" state (see the sketch below) such that they can be acted upon by the pruning logic. And the pruning logic will need to be extended to support 2., of course.
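For completeness, a hedged sketch of what such an "unknown" state could look like; `LeafVerdict` and its variants are hypothetical names used purely for illustration, not an existing Substrate API.

```rust
/// Hypothetical per-leaf classification at finalization time, along the
/// lines proposed above.
enum LeafVerdict {
    /// Provably forks off below the finalized block; safe to prune now.
    Displaced { blocks: Vec<[u8; 32]> },
    /// Ancestor or descendant of the finalized block; keep it.
    Retained,
    /// Separated from the finalized chain by an unimported gap; defer the
    /// decision until the gap is filled, or until the leaf falls entirely
    /// outside the pruning depth and can be pruned unconditionally.
    Unknown,
}
```

Pruning logic would then act on `Unknown` leaves only once they are provably outside the pruning window, matching case 2.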

Contributor (Author):
Deferring the decision is fair. It is also what earlier implementations were doing. I changed the logic to not mark any leaves as displaced that have no current connection to the to-be-finalized block.

Regarding 2., I am not fully able to follow your explanation. I don't think it is in scope for this PR, which is only focused on preventing the node from crashing in case of a finalization gap. This should restore feature parity with earlier implementations while maintaining the speed boosts from your improvements. We can then iterate further on it.

Contributor:

Maybe not 100% in scope, but definitely closely related. Simply imagine how you would prune blocks if the gap is never synced after creation (because the chain doesn't need it).

Contributor (Author):

I am not saying we should not do it, I just want to have it in a follow-up PR. I will not be able to work in the coming days and would like to have this merged sooner rather than later because of the crashes. Once I am back I can take another look at it.

Resolved review thread on substrate/primitives/blockchain/src/lib.rs (outdated).
```rust
if leaf_number == Zero::zero() {
    debug!(target: crate::LOG_TARGET, %leaf_hash, "Leaf is genesis hash, removing.");
    if finalized_block_number > leaf_number {
        result.displaced_leaves.push((leaf_number, leaf_hash));
```
Contributor:

Shouldn't it also be added to displaced_blocks?

Contributor (Author):

Hmm, so my idea here was that displaced_leaves is for removing entries from the leaf set, which is what we want. It is, however, not really a displaced branch where the genesis block itself needs to be pruned.
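To make the distinction concrete, here is a simplified sketch of the result's two roles; the field names follow the test snippets quoted in this thread, while the number and hash types are stand-ins for the real generic ones.

```rust
/// Simplified sketch of the result of `displaced_leaves_after_finalizing`.
struct DisplacedLeavesAfterFinalization {
    /// Entries to drop from the leaf set, as (number, hash) pairs. The
    /// genesis leaf can land here purely as leaf-set bookkeeping.
    displaced_leaves: Vec<(u32, [u8; 32])>,
    /// Blocks on displaced branches that are safe to prune from the
    /// database; the genesis block itself is deliberately not listed here.
    displaced_blocks: Vec<[u8; 32]>,
}
```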

Contributor:

Then this needs to be delayed just like everything else until the gap is closed, right?

Contributor:

This is again going back to my other topic: this shouldn't assume the gap will necessarily be closed at any point in the future. The genesis block may end up being pruned instead, but we need to be able to know that it is a detached leaf in order to do that.

Contributor (Author):

You are right, the correct generic way of handling this would be to delay the genesis block too. It comes, however, at the cost of this method's performance. If we don't remove the genesis leaf here, we will always have more than 2 leaves until the block gap is closed (if ever, as you say). So while the block gap is not closed, we will always iterate from the block-to-be-finalized to the first block after the gap. By removing this genesis leaf here, we can avoid that iteration at the cost of leaking at most the genesis block, if the gap is never closed.

However, if we had the pruning handling of before-gap blocks outside the pruning window that you described in the other comment, we would not have this problem. Currently I am leaning in favor of removing the genesis-leaf handling "hack" here and implementing the pruning improvements rather soon.

Contributor:

Well, it is not up to me to decide, but for completeness after each individual PR I'd probably still not mark the genesis block as displaced. Either way, I think we're on the same page here, and if the change happens shortly afterwards then it is fine with me as well; I just want to make sure it does happen.

```rust
let displaced =
    blockchain.displaced_leaves_after_finalizing(a3_hash, a3_number).unwrap();
assert_eq!(blockchain.leaves().unwrap(), vec![a4_hash, genesis_hash]);
assert_eq!(displaced.displaced_leaves, vec![(genesis_number, genesis_hash)]);
assert_eq!(displaced.displaced_blocks, vec![]);
```
Contributor:

I don't think displaced_blocks should be empty if there are displaced leaves. There should be at least the same block as the displaced leaf, and potentially more blocks depending on branch length. See the displaced_leaves_after_finalizing_works test, where every time we have a displaced leaf we also have the corresponding displaced block(s).
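A small self-contained helper expressing the invariant suggested here, with simplified number and hash types (hypothetical, not part of the codebase):

```rust
/// Every displaced leaf's hash should also appear among the displaced
/// blocks; branches longer than one block contribute additional entries
/// to `displaced_blocks`.
fn displaced_leaves_are_also_displaced_blocks(
    displaced_leaves: &[(u32, [u8; 32])],
    displaced_blocks: &[[u8; 32]],
) -> bool {
    displaced_leaves
        .iter()
        .all(|(_, hash)| displaced_blocks.contains(hash))
}
```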

@nazar-pc (Contributor) left a comment:

The changes make sense to me and there is a clear path on where to go from here 👍

Comment on lines +3246 to +3250:

```rust
// Even though there is a disconnect, displaced-branch detection should
// still find branches above the block gap.
// /-> c4
// g -> a1 -> <unimported> -> a3 -> a4(f)
// \-> b1 -> b2 ----------> b3 -> b4 -> b5
```

@nazar-pc (Contributor) commented Jul 14, 2024:
This is a good example; it made me think we could probably also detect that b5/b4/b3 are clearly not going to make it and remove them, changing the leaf from b5 to b2, though that might be trickier to implement than it's really worth.

These tests are really a nice way to understand the behavior.

Resolved review threads on substrate/client/db/src/lib.rs and substrate/primitives/blockchain/src/backend.rs (outdated).
```rust
    .push_front(MinimalBlockMetadata::from(&self.header_metadata(finalized_block_hash)?));
let current_finalized = match self.header_metadata(finalized_block_hash) {
    Ok(metadata) => metadata,
    Err(Error::UnknownBlock(_)) => {
```
Member:

This should never happen, as we import the warp sync target as the finalized block. However, it should also not hurt.

Contributor (Author):

Afaict this can happen exactly when we import the finalized block after warp sync. It is imported and finalized at the same time: it is not yet in the backend, but we are directly finalizing it, so this method is still called.

Member:

Hmm, yeah okay :P

Resolved review thread on substrate/primitives/blockchain/src/backend.rs (outdated).
@bkchr bkchr enabled auto-merge July 17, 2024 14:42
Comment on lines +3260 to +3261:

```rust
}
}
```

@lexnv (Contributor) commented Jul 17, 2024:
dq: If we finalize b5 here, instead of a4(f), do we report a4 as a displaced leaf?

@lexnv (Contributor) commented Jul 17, 2024

LGTM! 👍

@bkchr (Member) commented Jul 17, 2024

bot fmt

@command-bot (bot) commented Jul 17, 2024

@bkchr https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6724423 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 5-e7bc6328-eeda-47d3-bf85-b3569c178b3d to cancel this command or bot cancel to cancel all commands in this pull request.

We are migrating the command bot to be a GitHub Action. Please see the documentation on how to use it.

@command-bot (bot) commented Jul 17, 2024

@bkchr Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6724423 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6724423/artifacts/download.

@bkchr bkchr added this pull request to the merge queue Jul 17, 2024
Merged via the queue into master with commit 1b6292b Jul 17, 2024
158 of 161 checks passed
@bkchr bkchr deleted the skunert/fix-warp-displaced-branches branch July 17, 2024 16:10
jpserrat pushed a commit to jpserrat/polkadot-sdk that referenced this pull request Jul 18, 2024
Do not crash on block gap in displaced_leaves_after_finalizing (paritytech#4997)

After the merge of paritytech#4922 we saw failing zombienet tests with the
following error:
```
2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000    
2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000    
```

[Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262)

The crashing situation is warp-sync related. After warp syncing, it can
happen that there are gaps in block ancestry where we don't have the
header. At the same time, the genesis hash is in the set of leaves. In
`displaced_leaves_after_finalizing` we then iterate from the finalized
block backwards until we hit an unknown block, crashing the node.

This PR makes the detection of displaced branches resilient against
unknown blocks in the finalized chain.

cc @nazar-pc (github won't let me request a review from you)

---------

Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: command-bot <>
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
Labels: T0-node This PR/Issue is related to the topic “node”.
Projects: Status: done
4 participants