Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size #1944

Merged

fxamacker merged 41 commits into master from fxamacker/optimize-checkpoint on Mar 7, 2022

Conversation

@fxamacker (Member) commented Feb 3, 2022

Description

Optimize checkpoint creation (including loading):

  • 47x speedup (11.7 hours to 15 mins), avoid 431 GB alloc/op, avoid 7.6 billion allocs/op
    • 171x speedup (11.4 hours to 4 mins) in MTrie traversal+flattening+writing phase
  • Reduce long-held data in RAM by 116+ GB (lowers hardware requirements)
  • Reduce checkpoint file size by 6.9+ GB (another 4.4+ GB reduction planned in separate PR) without using compression

Most of the optimizations were proposed in comments on issue #1750. I'm moving all remaining optimizations (concurrency, compression, etc.) to separate PRs, so this is ready for review as-is.

Increased interim and leaf node counts were causing checkpoint creation to take hours. This PR trades some readability and simplicity for speed, memory efficiency, and storage efficiency.

I limited the scope of this PR to optimizations that don't require performance tradeoffs or overhead (such as adding processes or IPC).

Big thanks to @ramtinms for opening #1750 to point out that trie flattening can have big optimizations. 👍

Closes #1750
Closes #1884
Updates #1744, #1746, https://github.com/dapperlabs/flow-go/issues/6114

Impact on Execution Nodes

⚠️ Unoptimized checkpoint creation reaches 248+ GB RAM within the first 30 minutes and can run for about 15-17+ hours on EN3. Under heavy load, this duration was long enough on EN3 to accumulate enough WAL files to trigger another checkpoint immediately after the current one finished, so the 248 GB of RAM was held again, with another 590 GB alloc/op and 9.8 billion allocs/op.

EN Startup Time

This PR speeds up EN startup time in several ways:

  • Checkpoint loading and WAL replaying will be optimized for speed (see benchmarks).
  • Checkpoint creation will be fast enough to run multiple times per day, which will reduce WAL segments that need to be replayed during startup.
  • Checkpoint creation finishing in minutes rather than 15+ hours reduces the risk of being interrupted by a shutdown, etc., which can cause extra WAL segments to be replayed on the next EN startup.

See issue #1884 for more info about extra WAL segments causing EN startup delays.

EN Memory Use and System Requirements

This PR can reduce long-held data in RAM (15+ hours on EN3) by up to 116 GB. Additionally, eliminating 431 GB alloc/op and 7.6 billion allocs/op will reduce load on the Go garbage collector.

Benchmark Comparisons

Unoptimized Checkpoint Creation

After the first 30 minutes, and for the next 11+ hours (15-17+ hours on EN3):
[screenshot: memory and allocation stats]

Optimized Checkpoint Creation

Finishes in 15+ minutes and peaks in the last minute at:
[screenshot: memory and allocation stats]

Preliminary Results (WIP) Without Adding Concurrency Yet

MTrie Checkpoint Load+Create v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
NewCheckpoint-48            42052s ± 0%        886s ± 0%       -97.89%

name                        old alloc/op      new alloc/op      delta
NewCheckpoint-48             590GB ± 0%       159GB ± 0%       -73.04%

name                        old allocs/op     new allocs/op     delta
NewCheckpoint-48             9.80G ± 0%       2.19G ± 0%       -77.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1 due to duration, 
file system cache can affect results.

UPDATE: On March 1, optimized checkpoint creation time (v4 -> v4 with 41 WALs) varied by 63 seconds between the first two runs (all three runs used the same input files to create the same output):

  • 926 secs (first run right after OS booted, maybe didn't wait long enough)
  • 863 secs (second run without rebooting OS, maybe file system cache helped)
  • 879 secs (third run after other activities without rebooting OS)

Load Checkpoint File + replay 41 WALs v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
LoadCheckpointAndWALs-48     989s ± 0%         676s ± 0%       -31.64%

name                        old alloc/op      new alloc/op      delta
LoadCheckpointAndWALs-48    297GB ± 0%        136GB ± 0%       -54.35%

name                        old allocs/op     new allocs/op     delta
LoadCheckpointAndWALs-48    5.98G ± 0%        2.17G ± 0%       -63.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1, 
file system cache affects speed so delta can be -28% to -32%.

Changes include:

  • Create checkpoint file format v4 to replace v3, while retaining the ability to load older versions (v4 is not yet finalized). First, reduce checkpoint file size by 5.8+ GB. Next, reduce it by another 1.1+ GB by removing the encoded hash size and path size. A further 4.4+ GB reduction is planned, for a 10.2+ GB combined reduction compared to v3. These file size reductions don't use compression.
  • Use stream encoding and writing for checkpoint file creation (see the second sketch after this list). This reduces RAM use by avoiding both a ~400 million element slice containing all nodes and the creation of 400 million objects. Savings will be about 43.2+ GB, plus more from other changes in this PR.
  • Add NewUniqueNodeIterator() to skip shared nodes (see the first sketch after this list). NewUniqueNodeIterator() can be used to optimize node iteration over a forest: it skips shared sub-tries that were already visited and iterates only unique nodes.
  • Optimize reading the checkpoint file by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint reading (a sketch of this pattern appears with the commit notes further down). Since checkpoint creation requires reading a checkpoint, this optimization benefits both.
  • Optimize creating the checkpoint file by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint writing.
  • Skip StorableNode/StorableTrie when creating a checkpoint:
    • Merge FlattenForest() with StoreCheckpoint() to iterate and serialize nodes without creating intermediate StorableNode/StorableTrie objects.
    • Stream-encode nodes to avoid creating a 400+ million element slice holding 400 million StorableNode objects.
    • Change the checkpoint file format (v4) to store the node count and trie count in the footer (instead of the header), which is required for stream encoding.
    • Support previous checkpoint formats (v1, v3).
  • Skip StorableNode/StorableTrie when reading a checkpoint:
    • Merge RebuildTries() with LoadCheckpoint() to deserialize data into nodes without creating intermediate StorableNode/StorableTrie objects.
    • Avoid creating a 400+ million element slice holding all StorableNodes read from the checkpoint file.
    • DiskWal.Replay*() APIs are changed: checkpointFn receives []*trie.MTrie instead of FlattenedForest.
  • Add flattening encoding tests, add checkpoint v3 decoding tests, add more validation, add comments, and refactor code for readability.
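To make the iterator bullet above concrete, here is a minimal sketch of the idea behind NewUniqueNodeIterator(): remember which sub-trie roots have already been visited so shared sub-tries are emitted only once. The real flow-go iterator is non-recursive and tracks visited nodes in a map[*node.Node]uint64 (see the review excerpt further down); the Node type and recursive walk below are simplified for illustration.

```go
package main

import "fmt"

// Node is a simplified stand-in for flow-go's node.Node.
type Node struct {
	Left, Right *Node
	Hash        string
}

// visitUnique walks a trie post-order, skipping any sub-trie whose root
// pointer is already in visited (i.e. shared with a previously seen trie).
func visitUnique(n *Node, visited map[*Node]struct{}, visit func(*Node)) {
	if n == nil {
		return
	}
	if _, ok := visited[n]; ok {
		return // shared sub-trie: already iterated, skip entirely
	}
	visited[n] = struct{}{}
	visitUnique(n.Left, visited, visit)
	visitUnique(n.Right, visited, visit)
	visit(n)
}

func main() {
	shared := &Node{Hash: "shared"}
	trieA := &Node{Left: shared, Hash: "rootA"}
	trieB := &Node{Left: shared, Hash: "rootB"} // trieB shares a sub-trie with trieA

	visited := make(map[*Node]struct{})
	for _, root := range []*Node{trieA, trieB} {
		visitUnique(root, visited, func(n *Node) { fmt.Println(n.Hash) })
	}
	// Prints: shared, rootA, rootB — the shared sub-trie is emitted once.
}
```

Post-order matters here because, in the checkpoint format, an interim node references its children by index, so children must be serialized before their parent.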
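And a sketch of the stream-encoding idea: because nodes are written one at a time, the totals are only known at the end, which is why v4 stores the counts in a footer. The footer layout below (two uint64s) is illustrative only; the actual v4 field widths and checksum are not shown and may differ.

```go
package checkpoint

import (
	"bufio"
	"encoding/binary"
	"io"
)

// writeStream writes pre-encoded nodes and tries as they arrive, never
// holding a slice of all ~400M nodes in RAM, then appends the counts as a
// footer. A v3-style header would require buffering everything (or seeking
// back) just to write the counts first.
func writeStream(w io.Writer, encodedNodes, encodedTries <-chan []byte) error {
	bw := bufio.NewWriter(w)

	var nodeCount, trieCount uint64
	for n := range encodedNodes {
		if _, err := bw.Write(n); err != nil {
			return err
		}
		nodeCount++
	}
	for t := range encodedTries {
		if _, err := bw.Write(t); err != nil {
			return err
		}
		trieCount++
	}

	// Footer: counts written last, read back from the file tail when loading.
	var footer [16]byte
	binary.BigEndian.PutUint64(footer[0:8], nodeCount)
	binary.BigEndian.PutUint64(footer[8:16], trieCount)
	if _, err := bw.Write(footer[:]); err != nil {
		return err
	}
	return bw.Flush()
}
```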

TODO

  • update benchmark comparisons using latest results

Additional TODOs that will probably be wrapped up in a separate PR

  • maybe add zeroCopy flag and more tests for these functions: DecodeKey(), DecodeKeyPart(), and DecodePayload(). Not high priority because these functions appear to be unused.
  • further reduce data written to checkpoint part 2 (e.g. encoded payload value size can be uint32 instead of uint64 but changing this affects code outside checkpoint creation)
  • micro optimizations (depends on speedup vs readability tradeoff)
  • add concurrency
  • maybe add file compression or payload compression (only if concurrency is added)
  • maybe replace CRC32 with BLAKE3 or BLAKE2 since checkpoint file is >60GB
  • maybe encode integers using variable length to reduce space (possibly not needed if/when we use file compression)
  • maybe split checkpoint file into 3 files (metadata, nodes, and payload file). I synced with Ramtin and his preference is to keep the checkpoint as one file for this PR.


- Remove files containing StorableNode/StorableTrie/FlattenedForest etc.:
  * mtrie/flattener/forest.go
  * mtrie/flattener/forest_test.go
  * mtrie/flattener/storables.go
  * mtrie/flattener/trie.go
  * mtrie/flattener/trie_test.go
A further reduction of 4.4+GB is planned, for a total
reduction of 10.2+GB (see TODOs at bottom). In checkpoint file format v4:

Leaf node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- hash (2 bytes + 32 bytes)
- path (2 bytes + 32 bytes)
- payload (4 bytes + n bytes)

Encoded payload size also reduced by removing prefix (version 2 bytes +
type 1 byte).

Interim node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- lchild index (8 bytes)
- rchild index (8 bytes)
- hash (2 bytes + 32 bytes)

Trie is encoded as:
- root node index (8 bytes)
- hash (2 bytes + 32 bytes)

Removed v3 leaf node fields:
- version (2 bytes)
- left child index (8 bytes)
- right child index (8 bytes)
- payload version and type (3 bytes)

Removed v3 interim node fields:
- version (2 bytes)
- path (2 bytes)
- payload (4 bytes)

Removed v3 trie field:
- version (2 bytes)

Leaf node data is reduced by 20 bytes (2+8+8+3-1).
Interim node data is reduced by 7 bytes (2+2+4-1).
Trie is reduced by 2 bytes.

TODO: remove max depth and reg count fields from both leaf node and
interim node types.
TODO: reduce hash length from 2 bytes to 1 byte for both leaf node and
interim node types.
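For illustration, here is a hedged sketch of serializing one v4 leaf node into a reusable scratch buffer following the layout above. The node-type constant and the function signature are hypothetical; the real encoder lives in ledger/complete/mtrie/flattener/encoding.go.

```go
package flattener

import "encoding/binary"

// encodeLeafNode sketches the v4 leaf-node layout: node type, height,
// max depth, reg count, length-prefixed hash and path, length-prefixed
// payload. The scratch buffer is reused across calls to avoid per-node
// allocations; the encoded payload is appended after the fixed fields.
func encodeLeafNode(scratch []byte, height, maxDepth uint16, regCount uint64,
	hash, path [32]byte, encPayload []byte) []byte {

	const leafNodeType = 0 // illustrative value, not the real type tag

	var u16 [2]byte
	var u32 [4]byte
	var u64 [8]byte

	buf := scratch[:0]
	buf = append(buf, leafNodeType) // node type (1 byte)

	binary.BigEndian.PutUint16(u16[:], height)
	buf = append(buf, u16[:]...) // height (2 bytes)

	binary.BigEndian.PutUint16(u16[:], maxDepth)
	buf = append(buf, u16[:]...) // max depth (2 bytes)

	binary.BigEndian.PutUint64(u64[:], regCount)
	buf = append(buf, u64[:]...) // reg count (8 bytes)

	binary.BigEndian.PutUint16(u16[:], uint16(len(hash)))
	buf = append(buf, u16[:]...)  // hash length (2 bytes)
	buf = append(buf, hash[:]...) // hash (32 bytes)

	binary.BigEndian.PutUint16(u16[:], uint16(len(path)))
	buf = append(buf, u16[:]...)  // path length (2 bytes)
	buf = append(buf, path[:]...) // path (32 bytes)

	binary.BigEndian.PutUint32(u32[:], uint32(len(encPayload)))
	buf = append(buf, u32[:]...)     // payload length (4 bytes)
	buf = append(buf, encPayload...) // payload (n bytes)

	return buf
}
```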
@fxamacker fxamacker self-assigned this Feb 3, 2022
@fxamacker fxamacker marked this pull request as draft February 3, 2022 16:47
@bluesign (Contributor) commented Feb 4, 2022

Hey @fxamacker, first of all great work.

Do these changes allow using the checkpoint file without loading it entirely into memory? If not, is it possible to add an index at the end? It would be great to have the ability to query the checkpoint file with a CLI tool for lookups. But with the recent changes using atree for storage, I have to import the whole checkpoint file into Badger (or a similar DB) or load it fully into memory.

@codecov-commenter commented Feb 4, 2022

Codecov Report

Merging #1944 (2c587e6) into master (4d5c22c) will increase coverage by 0.11%.
The diff coverage is 71.57%.


@@            Coverage Diff             @@
##           master    #1944      +/-   ##
==========================================
+ Coverage   57.18%   57.29%   +0.11%     
==========================================
  Files         634      633       -1     
  Lines       36679    36833     +154     
==========================================
+ Hits        20976    21105     +129     
- Misses      13068    13077       +9     
- Partials     2635     2651      +16     
Flag        Coverage Δ
unittests   57.29% <71.57%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                   Coverage Δ
ledger/complete/mtrie/forest.go                  61.14% <0.00%> (ø)
ledger/complete/mtrie/flattener/encoding_v3.go   43.47% <43.47%> (ø)
ledger/complete/ledger.go                        59.39% <60.00%> (-0.41%) ⬇️
ledger/complete/wal/checkpointer.go              61.09% <63.57%> (-4.40%) ⬇️
ledger/complete/mtrie/flattener/encoding.go      80.83% <85.25%> (+30.83%) ⬆️
ledger/common/encoding/encoding.go               59.72% <96.36%> (+2.66%) ⬆️
ledger/complete/mtrie/flattener/iterator.go      100.00% <100.00%> (ø)
ledger/complete/wal/wal.go                       55.26% <100.00%> (+0.42%) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d5c22c...2c587e6.

Reduce allocs by using a 4096-byte scratch buffer,
eliminating another 400+ million allocs during
checkpoint reading, and the same again during
checkpoint writing.
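A minimal sketch of the scratch-buffer pattern these commits describe, with illustrative length-prefixed framing (not the exact v4 record layout):

```go
package flattener

import (
	"encoding/binary"
	"io"
)

// readRecord reads one length-prefixed record into a caller-owned scratch
// buffer, so decoding ~400 million records doesn't allocate ~400 million
// temporary buffers. Only records larger than the scratch buffer fall back
// to a fresh allocation.
func readRecord(r io.Reader, scratch []byte) ([]byte, error) {
	// Read the 4-byte length prefix into the scratch buffer.
	if _, err := io.ReadFull(r, scratch[:4]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(scratch[:4])

	buf := scratch
	if int(n) > len(scratch) {
		buf = make([]byte, n) // rare fallback for oversized records
	}
	if _, err := io.ReadFull(r, buf[:n]); err != nil {
		return nil, err
	}
	return buf[:n], nil
}
```

The caller allocates scratch := make([]byte, 4096) once and reuses it for every record; since the returned slice may alias the scratch buffer, the decoded data must be consumed or copied out before the next call.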
@fxamacker changed the title from "WIP: optimize MTrie checkpoint (creating and loading) for speed, memory, and file size" to "WIP: optimize MTrie checkpoint for speed, memory, and file size" on Feb 6, 2022
@fxamacker (Member, Author) commented:

Hello @bluesign,

Do these changes allow using the checkpoint file without loading it entirely into memory?

The entire checkpoint file will be stream-decoded to create the in-memory tries. Currently, the in-memory tries hold all the data from the checkpoint file. Issue #1746 will be handled by a separate PR after this one is merged.

It would be great to have the ability to query the checkpoint file with a CLI tool for lookups. But with the recent changes using atree for storage, I have to import the whole checkpoint file into Badger (or a similar DB) or load it fully into memory.

I understand.

If not, is it possible to add an index at the end?

Maybe (depends on details, tradeoffs, etc.), but adding an index is outside the scope of this PR.

There's a chance this PR might store the payload in a separate file, as a tiny intermediate step towards #1746. Not sure, but maybe this step could be useful to projects like https://github.com/bluesign/checkpointState

Please consider opening an issue for adding an index to the checkpoint file, so discussions can continue there if needed. This PR is focused on optimizing the checkpoint (creation and loading) for speed, memory, and file size.

Decode payloads from the input buffer and copy only the shared data,
to avoid extra allocs for each payload's Key.KeyParts ([]KeyPart).

Loading checkpoint v4 and replaying WALs used 3,076,382,277 fewer
allocs/op compared to v3.

Prior to this change:
When decoding a payload during checkpoint loading, the payload object
was created with data shared with the read buffer and was then deep
copied. Since a payload holds its key parts as []KeyPart, a new slice
was created during the deep copy.
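A sketch of the difference, using simplified stand-ins for the ledger types (the real Payload/Key/KeyPart live in flow-go's ledger package, and the framing of the inputs here is illustrative):

```go
package ledger

// Hypothetical, simplified shapes of flow-go's ledger types.
type KeyPart struct {
	Type  uint16
	Value []byte
}
type Key struct{ KeyParts []KeyPart }
type Payload struct {
	Key   Key
	Value []byte
}

// decodePayloadCopyingData decodes fields straight out of the shared read
// buffer and copies only the byte data, allocating the []KeyPart slice
// exactly once. The previous approach decoded a buffer-backed payload first
// and then deep-copied it, allocating a second []KeyPart slice (and second
// byte slices) per payload.
func decodePayloadCopyingData(partTypes []uint16, partData [][]byte, value []byte) *Payload {
	parts := make([]KeyPart, len(partData)) // single slice allocation
	for i, d := range partData {
		v := make([]byte, len(d))
		copy(v, d) // copy bytes so the payload doesn't alias the read buffer
		parts[i] = KeyPart{Type: partTypes[i], Value: v}
	}
	val := make([]byte, len(value))
	copy(val, value)
	return &Payload{Key: Key{KeyParts: parts}, Value: val}
}
```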
EncodeAndAppendPayloadWithoutPrefix() appends the encoded payload to the
input buffer. If the payload is nil, the unmodified buffer is returned.

This edge case is uncommon and didn't get triggered when creating
checkpoint.3485 from checkpoint.3443 with 41 WAL segments
(both v3->v4 and v4->v4).

Add tests.
NewForest() doesn't have a logger. The error is returned by the
onTreeEvicted() callback function passed to NewForest() by NewLedger().

NewLedger() has a logger and defines onTreeEvicted(), so we can
probably log the error from inside onTreeEvicted() instead. However,
that change is outside the scope of PR #1944.
Add benchmarks for checkpoint creation and
checkpoint loading.

Checkpoint creation time can vary. For example:
926 seconds (first run after OS boot)
863 seconds (2nd run of benchmark without OS reboot in between)

Also fix off-by-one error in the checkpoint filename
created by benchmarks:
was checkpoint.00003485 (number should've been 3484)
now checkpoint.00003484 (same file size and hash, only name changed)

For BenchmarkNewCheckpoint and BenchmarkLoadCheckpointAndWALs:
if the current folder doesn't contain the checkpoint and WAL
files, the -dir option should be used. These two benchmarks
are meant to benchmark real data. For example, use
checkpoint.00003443 and 41 WAL files (00003444 - 00003484)
to create checkpoint.00003484.

BenchmarkNewCheckpointRandom* generates random WAL segments and doesn't
require the -dir option or any files. It can be used for regression tests.
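A hedged sketch of how such a -dir option can be wired into a benchmark (the actual flag registration and benchmark bodies in flow-go may differ; the flag name matches the text above but the wiring is hypothetical):

```go
package wal_test

import (
	"flag"
	"testing"
)

// dir points at real checkpoint and WAL files; go test parses custom
// package-level flags before running benchmarks.
var dir = flag.String("dir", ".", "directory containing checkpoint and WAL files")

func BenchmarkNewCheckpoint(b *testing.B) {
	b.Logf("reading checkpoint + WAL segments from %s", *dir)
	// ... load checkpoint.00003443, replay WAL segments 00003444-00003484,
	// and create checkpoint.00003484, reporting time/op, alloc/op, allocs/op ...
}
```

Invoked as, e.g., go test -bench=BenchmarkNewCheckpoint -dir=/data/checkpoints ./ledger/complete/wal (path assumed).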
@AlexHentschel (Member) left a comment

I just took a high-level look at the PR. I mainly focused my review on changes to the core logic of the trie. I didn't go deep into the checkpointing logic itself, as I am not familiar with that part of the code base.

// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
visitedNodes map[*node.Node]uint64
@AlexHentschel: nice

// and updated MTrie after register writes).
// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
@AlexHentschel: could you add this line also to the existing NewNodeIterator constructor, please?

func NewNodeIterator(mTrie *trie.MTrie) *NodeIterator {

@fxamacker (Member, Author):

I'm not sure we should add the same concurrency warning here, because visitedNodes will always be nil in that constructor, which appears to make it safe for concurrent use (unless I'm mistaken).

I added this comment to NewNodeIterator:

// NodeIterator created by NewNodeIterator is safe for concurrent use
// because visitedNodes is always nil in this case.

Please let me know if I need to update the comment. 🙏

ledger/complete/mtrie/forest.go (outdated review thread, resolved)
- Remove return value (error) from onTreeEvicted()
- Log error in onTreeEvicted() in NewLedger()
- Replace all empty onTreeEvicted callbacks with nil
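A minimal sketch of the resulting callback shape, with hypothetical stand-in types and a hypothetical processEvictedTrie helper (the real signatures live in flow-go's ledger and mtrie packages; flow-go uses zerolog for logging):

```go
package ledger

import "github.com/rs/zerolog"

// MTrie is a stand-in for flow-go's trie.MTrie, for illustration only.
type MTrie struct{}

// processEvictedTrie is a hypothetical placeholder for whatever work the
// eviction callback performs that can fail.
func processEvictedTrie(t *MTrie) error { return nil }

// newOnTreeEvicted shows the post-change shape: the callback returns
// nothing, and the error is logged where a logger exists (NewLedger)
// instead of being returned to NewForest, which has no logger.
func newOnTreeEvicted(logger zerolog.Logger) func(*MTrie) {
	return func(t *MTrie) {
		if err := processEvictedTrie(t); err != nil {
			logger.Error().Err(err).Msg("failed to process evicted trie")
		}
	}
}
```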
NodeIterator created by NewNodeIterator is safe for concurrent use
because visitedNodes is always nil in this case.
@fxamacker fxamacker merged commit 9a1c704 into master Mar 7, 2022
@fxamacker fxamacker deleted the fxamacker/optimize-checkpoint branch March 7, 2022 21:57
@fxamacker changed the title from "Optimize MTrie checkpoint for speed, memory, and file size" to "Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size" on Mar 7, 2022
@fxamacker added the Execution (Cadence Execution Team) label on Jul 14, 2022