Optimize finalization performance #4922

nazar-pc · 2024-07-02T05:31:37Z

This PR largely fixes #4903 by addressing it from a few different directions.

The high-level observation is that complexity of finalization was unfortunately roughly O(n^3). Not only displaced_leaves_after_finalizing was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again.

The first commit refactores code adjacent to displaced_leaves_after_finalizing to optimize memory allocations. For example things like BTreeMap<_, Vec<_>> were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with Vec<(_, _)>. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors.

I checked invariants and didn't find anything that was violated after refactoring.

Second commit completely replaces displaced_leaves_after_finalizing implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish client.apply_finality() now.

The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. lowest_common_ancestor_multiblock was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in displaced_leaves_after_finalizing.

After these changes I noticed finalization is still horribly slow, turned out that even though displaced_leaves_after_finalizing was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦

The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to displaced_leaves_after_finalizing). I'm really not happy with the fact that displaced_leaves_after_finalizing is called twice, but much heavier refactoring would be necessary to get rid of second call.

Next steps:

assuming the changes are acceptable I'll write prdoc
Allow concurrent header metadata reads #4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in displaced_leaves_after_finalizing, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided
someone should look into removal of the second displaced_leaves_after_finalizing call
further cleanups are possible if undo_finalization can be removed

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

…efficient

paritytech-cicd-pr · 2024-07-02T06:39:50Z

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6599982

skunert

Double checked the logic, looks really good. Thank you!

For sure we should do further cleanups and optimizations, this PR should remove most of the pain however. That we were calling this for potentially millions of times was the most expensive part.

substrate/primitives/blockchain/src/backend.rs

substrate/client/db/src/lib.rs

skunert · 2024-07-03T08:30:02Z

substrate/primitives/blockchain/src/backend.rs

+			// If points back to the finalized header then nothing left to do, this leaf will be
+			// checked again later


Suggested change

// If points back to the finalized header then nothing left to do, this leaf will be

// checked again later

// If points back to the finalized header then nothing left to do

Nit: Don't think it will be checked again later, its just finished.

I meant that it might still end up being a leaf that is not finalized in the end and will be dropped. Maybe that detail was more confusing than helpful though.

skunert · 2024-07-03T08:55:34Z

substrate/primitives/blockchain/src/backend.rs

+
+		// Local cache is a performance optimization in case of finalized block deep below the
+		// tip of the chain with a lot of leaves above finalized block
+		let mut local_cache = HashMap::<Block::Hash, CachedHeaderMetadata<Block>>::new();


One thing I thought about:

This cache will become quite memory intensive when finalizing large gaps. For parachains which sync from scratch but use and external relay chain RPC, it is likely that they finalize from genesis to tip. If there happen to be multiple branches at finalization time, the leaves.len() == 1 optimization above will not be taken.

So for each year that goes by, these chains need ~800MB more memory to keep this data in the cache here. This problem can be avoided by using warp sync, but still. We will have chains with 3 and 2 seconds block time which will grow even faster.

So what we could do:

Only cache what we need. Since we only need the hash, number and parent hash we can save around 40% here.

Use schnellru instead of the HashMap and set a byte limit (only peek to not use LRU capability), can be generous like 2GB or something. This would still be a lot faster than relying on the relatively small HeaderMetadataCache.

Abandon caching here totally. Initial finalization will take a while, but will be only called once instead of for every block. Operation at chain tip will still benefit from HeaderMetadataCache.

Accept that this requires an ever growing amount of memory.

I think 2 is good and easy to integrate, with 1 as optional optimization.

I didn't want to introduce a custom data structure here or use an awkward tuple, but we can store a smaller data set here (only need hash, number and parent hash, while CachedHeaderMetadata has a few more data structures).

I do not think removing it is good performance-wise because this cache enabled much faster processing of blocks way above what we're finalizing. For example at Subspace without #1570 we are only "finalizing" blocks at many tens of thousands of blocks deep. Without this cache we'll be traversing branches from leaf all the way until the block being finalized while hitting on-disk database over and over again, it'll likely take a very long time.

LRU cache will not help us here since it will either have to be as large as this cache (which will be less efficient than HashMap) or data will not fit into it, so we'll be overriding it in a loop like it was done before this PR with metadata cache.

So I am really quite convinced that we want this cache here.

Yeah so the points from my answer were just some thoughts. I am also not convinced to remove the cache completely.

So point 1 seems like something we align on, just store what we need to reduce the memory footprint.

Regarding the LRU, as mentioned I was thinking of just using schnellru as a drop-in replacement for the hashmap to easily define a fixed size limit. And then only use peek to not bump the entries. But not insisting on this one.

Implemented in latest commit with MinimalBlockMetadata data structure

lexnv

LGTM!

…tion-performance

skunert · 2024-07-05T18:11:44Z

@bkchr Can you pls add a tip here?

bkchr · 2024-07-05T18:15:36Z

/tip medium

substrate-tip-bot · 2024-07-05T18:16:21Z

@bkchr A referendum for a medium (80 DOT) tip was successfully submitted for @nazar-pc (1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd on polkadot).

Referendum number: 955.

This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <skunert49@gmail.com>

@nazar-pc

After the merge of #4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <git@kchr.de> Co-authored-by: command-bot <>

@nazar-pc

…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <git@kchr.de> Co-authored-by: command-bot <>

This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <skunert49@gmail.com>

@nazar-pc

…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <git@kchr.de> Co-authored-by: command-bot <>

This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <skunert49@gmail.com>

@nazar-pc

…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <git@kchr.de> Co-authored-by: command-bot <>

nazar-pc added 2 commits July 2, 2024 08:34

Optimize allocations during finalization

6b0e7a6

Replace displaced_leaves_after_finalizing implementation with more …

227fb4e

…efficient

nazar-pc force-pushed the optimize-finalization-performance branch from 4a2e709 to 87310fc Compare July 2, 2024 05:34

This was referenced Jul 2, 2024

Finalization hangs in 1.13 #4903

Closed

Optimize block finalization autonomys/polkadot-sdk#20

Merged

nazar-pc force-pushed the optimize-finalization-performance branch from 87310fc to 19d0edf Compare July 2, 2024 06:14

Optimize pruning during batch finalization of blocks

c452058

nazar-pc force-pushed the optimize-finalization-performance branch from 19d0edf to c452058 Compare July 2, 2024 15:06

skunert reviewed Jul 2, 2024

View reviewed changes

substrate/primitives/blockchain/src/backend.rs Outdated Show resolved Hide resolved

substrate/primitives/blockchain/src/backend.rs Outdated Show resolved Hide resolved

substrate/client/db/src/lib.rs Show resolved Hide resolved

Tiny simplifications

c2184dd

nazar-pc requested a review from skunert July 2, 2024 20:54

skunert reviewed Jul 3, 2024

View reviewed changes

skunert requested a review from bkchr July 3, 2024 08:56

skunert added the T0-node This PR/Issue is related to the topic “node”. label Jul 3, 2024

Minimize metadata stored in memory

289ceaf

skunert approved these changes Jul 3, 2024

View reviewed changes

skunert requested a review from a team July 4, 2024 07:12

lexnv approved these changes Jul 4, 2024

View reviewed changes

nazar-pc and others added 4 commits July 4, 2024 20:15

Add prdoc

223d8e0

Merge remote-tracking branch 'upstream/master' into optimize-finaliza…

10c5343

…tion-performance

Update prdoc

6af7f9d

Merge branch 'master' into optimize-finalization-performance

7bef23d

skunert enabled auto-merge July 5, 2024 18:10

skunert added this pull request to the merge queue Jul 5, 2024

Merged via the queue into paritytech:master with commit 221eddc Jul 5, 2024
154 of 160 checks passed

nazar-pc deleted the optimize-finalization-performance branch July 5, 2024 23:42

skunert mentioned this pull request Jul 10, 2024

Do not crash on block gap in displaced_leaves_after_finalizing #4997

Merged

This was referenced Aug 21, 2024

Update polkadot-sdk from v1.11.0 to stable2407 moondance-labs/tanssi#659

Closed

Update polkadot-sdk from v1.11.0 to stable2407 moonbeam-foundation/moonbeam#2912

Closed

TDemeco mentioned this pull request Oct 7, 2024

feat: ⏫ upgrade to Polkadot SDK stable2407 Moonsong-Labs/storage-hub#222

Merged

ToufeeqP mentioned this pull request Dec 10, 2024

upgrade to v1.16.2 availproject/polkadot-sdk#21

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize finalization performance #4922

Optimize finalization performance #4922

nazar-pc commented Jul 2, 2024 •

edited

Loading

paritytech-cicd-pr commented Jul 2, 2024

skunert left a comment

skunert Jul 3, 2024

nazar-pc Jul 3, 2024

skunert Jul 3, 2024

nazar-pc Jul 3, 2024

skunert Jul 3, 2024

nazar-pc Jul 3, 2024

lexnv left a comment

skunert commented Jul 5, 2024

bkchr commented Jul 5, 2024

substrate-tip-bot bot commented Jul 5, 2024

		// If points back to the finalized header then nothing left to do, this leaf will be
		// checked again later

Optimize finalization performance #4922

Optimize finalization performance #4922

Conversation

nazar-pc commented Jul 2, 2024 • edited Loading

paritytech-cicd-pr commented Jul 2, 2024

skunert left a comment

Choose a reason for hiding this comment

skunert Jul 3, 2024

Choose a reason for hiding this comment

nazar-pc Jul 3, 2024

Choose a reason for hiding this comment

skunert Jul 3, 2024

Choose a reason for hiding this comment

nazar-pc Jul 3, 2024

Choose a reason for hiding this comment

skunert Jul 3, 2024

Choose a reason for hiding this comment

nazar-pc Jul 3, 2024

Choose a reason for hiding this comment

lexnv left a comment

Choose a reason for hiding this comment

skunert commented Jul 5, 2024

bkchr commented Jul 5, 2024

substrate-tip-bot bot commented Jul 5, 2024

nazar-pc commented Jul 2, 2024 •

edited

Loading