Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size #1944

Merged

fxamacker merged 41 commits into master from fxamacker/optimize-checkpoint on Mar 7, 2022

Conversation

@fxamacker (Member) commented Feb 3, 2022

Description

Optimize checkpoint creation (including loading):

  • 47x speedup (11.7 hours to 15 mins), avoid 431 GB alloc/op, avoid 7.6 billion allocs/op
    • 171x speedup (11.4 hours to 4 mins) in MTrie traversal+flattening+writing phase
  • Reduce long-held data in RAM by 116+ GB (lowers hardware requirements)
  • Reduce checkpoint file size by 6.9+ GB (another 4.4+ GB reduction planned in separate PR) without using compression

Most of the optimizations were proposed in comments on issue #1750. I'm moving all remaining optimizations (concurrency, compression, etc.) to separate PRs, so this is ready for review as-is.

Increased interim and leaf node counts were causing checkpoint creation to take hours. This PR trades some readability and simplicity for speed, memory efficiency, and storage efficiency.

I limited the scope of this PR to optimizations that don't require performance tradeoffs or overhead (such as adding processes or IPC).

Big thanks to @ramtinms for opening #1750 to point out that trie flattening can have big optimizations. 👍

Closes #1750
Closes #1884
Updates #1744, #1746, https://github.com/dapperlabs/flow-go/issues/6114

Impact on Execution Nodes

⚠️ Unoptimized checkpoint creation reaches 248+ GB RAM within the first 30 minutes and can run for about 15-17+ hours on EN3. Under heavy load, this duration was long enough on EN3 to accumulate enough WAL files to trigger another checkpoint immediately after the current one finished, so the 248 GB of RAM was held again, with another 590 GB alloc/op and 9.8 billion allocs/op.

EN Startup Time

This PR speeds up EN startup time in several ways:

  • Checkpoint loading and WAL replaying will be optimized for speed (see benchmarks).
  • Checkpoint creation will be fast enough to run multiple times per day, which will reduce WAL segments that need to be replayed during startup.
  • Checkpoint creation finishing in minutes rather than 15+ hours reduces the risk of being interrupted by a shutdown, etc., which can cause extra WAL segments to be replayed on the next EN startup.

See issue #1884 for more info about extra WAL segments causing EN startup delays.

EN Memory Use and System Requirements

This PR can reduce long-held data in RAM (15+ hours on EN3) by up to 116 GB. Additionally, eliminating 431 GB alloc/op and 7.6 billion allocs/op will reduce load on the Go garbage collector.

Benchmark Comparisons

Unoptimized Checkpoint Creation

After the first 30 minutes, and for the next 11+ hours (15-17+ hours on EN3):
[screenshot: memory and allocation stats]

Optimized Checkpoint Creation

Finishes in 15+ minutes and peaks in the last minute at:
[screenshot: memory and allocation stats]

Preliminary Results (WIP) Without Adding Concurrency Yet

MTrie Checkpoint Load+Create v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
NewCheckpoint-48            42052s ± 0%        886s ± 0%       -97.89%

name                        old alloc/op      new alloc/op      delta
NewCheckpoint-48             590GB ± 0%       159GB ± 0%       -73.04%

name                        old allocs/op     new allocs/op     delta
NewCheckpoint-48             9.80G ± 0%       2.19G ± 0%       -77.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1 due to duration, 
file system cache can affect results.

UPDATE: On March 1, optimized checkpoint creation time (v4 -> v4 with 41 WALs) varied by 63 seconds between the first two runs (all three runs used the same input files to create the same output):

  • 926 secs (first run right after OS booted, maybe didn't wait long enough)
  • 863 secs (second run without rebooting OS, maybe file system cache helped)
  • 879 secs (third run after other activities without rebooting OS)

Load Checkpoint File + replay 41 WALs v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
LoadCheckpointAndWALs-48     989s ± 0%         676s ± 0%       -31.64%

name                        old alloc/op      new alloc/op      delta
LoadCheckpointAndWALs-48    297GB ± 0%        136GB ± 0%       -54.35%

name                        old allocs/op     new allocs/op     delta
LoadCheckpointAndWALs-48    5.98G ± 0%        2.17G ± 0%       -63.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1, 
file system cache affects speed so delta can be -28% to -32%.

Changes include:

  • Create checkpoint file format v4 to replace v3, while retaining the ability to load older versions (v4 is not yet finalized). First, reduce checkpoint file size by 5.8+ GB. Next, reduce it by another 1.1+ GB by removing the encoded hash size and path size. A further 4.4+ GB reduction is planned, for a 10.2+ GB combined reduction compared to v3. These file size reductions don't use compression.
  • Use stream encoding and writing for checkpoint file creation (see the second sketch after this list). This reduces RAM use by avoiding both a ~400 million element slice containing all nodes and the creation of 400 million objects. Savings will be about 43.2+ GB, plus more from other changes in this PR.
  • Add NewUniqueNodeIterator() to skip shared nodes (see the first sketch after this list). NewUniqueNodeIterator() can be used to optimize node iteration over a forest: it skips shared sub-tries that were already visited and iterates only unique nodes.
  • Optimize reading the checkpoint file by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint reading (a sketch of this pattern appears with the commit notes further down). Since checkpoint creation requires reading a checkpoint, this optimization benefits both.
  • Optimize creating the checkpoint file by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint writing.
  • Skip StorableNode/StorableTrie when creating a checkpoint:
    • Merge FlattenForest() with StoreCheckpoint() to iterate and serialize nodes without creating intermediate StorableNode/StorableTrie objects.
    • Stream-encode nodes to avoid creating a 400+ million element slice holding 400 million StorableNode objects.
    • Change the checkpoint file format (v4) to store the node count and trie count in the footer (instead of the header), which is required for stream encoding.
    • Support previous checkpoint formats (v1, v3).
  • Skip StorableNode/StorableTrie when reading a checkpoint:
    • Merge RebuildTries() with LoadCheckpoint() to deserialize data into nodes without creating intermediate StorableNode/StorableTrie objects.
    • Avoid creating a 400+ million element slice holding all StorableNodes read from the checkpoint file.
    • DiskWal.Replay*() APIs are changed: checkpointFn receives []*trie.MTrie instead of FlattenedForest.
  • Add flattening encoding tests, add checkpoint v3 decoding tests, add more validation, add comments, and refactor code for readability.
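To make the iterator bullet above concrete, here is a minimal sketch of the idea behind NewUniqueNodeIterator(): remember which sub-trie roots have already been visited so shared sub-tries are emitted only once. The real flow-go iterator is non-recursive and tracks visited nodes in a map[*node.Node]uint64 (see the review excerpt further down); the Node type and recursive walk below are simplified for illustration.

```go
package main

import "fmt"

// Node is a simplified stand-in for flow-go's node.Node.
type Node struct {
	Left, Right *Node
	Hash        string
}

// visitUnique walks a trie post-order, skipping any sub-trie whose root
// pointer is already in visited (i.e. shared with a previously seen trie).
func visitUnique(n *Node, visited map[*Node]struct{}, visit func(*Node)) {
	if n == nil {
		return
	}
	if _, ok := visited[n]; ok {
		return // shared sub-trie: already iterated, skip entirely
	}
	visited[n] = struct{}{}
	visitUnique(n.Left, visited, visit)
	visitUnique(n.Right, visited, visit)
	visit(n)
}

func main() {
	shared := &Node{Hash: "shared"}
	trieA := &Node{Left: shared, Hash: "rootA"}
	trieB := &Node{Left: shared, Hash: "rootB"} // trieB shares a sub-trie with trieA

	visited := make(map[*Node]struct{})
	for _, root := range []*Node{trieA, trieB} {
		visitUnique(root, visited, func(n *Node) { fmt.Println(n.Hash) })
	}
	// Prints: shared, rootA, rootB — the shared sub-trie is emitted once.
}
```

Post-order matters here because, in the checkpoint format, an interim node references its children by index, so children must be serialized before their parent.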
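And a sketch of the stream-encoding idea: because nodes are written one at a time, the totals are only known at the end, which is why v4 stores the counts in a footer. The footer layout below (two uint64s) is illustrative only; the actual v4 field widths and checksum are not shown and may differ.

```go
package checkpoint

import (
	"bufio"
	"encoding/binary"
	"io"
)

// writeStream writes pre-encoded nodes and tries as they arrive, never
// holding a slice of all ~400M nodes in RAM, then appends the counts as a
// footer. A v3-style header would require buffering everything (or seeking
// back) just to write the counts first.
func writeStream(w io.Writer, encodedNodes, encodedTries <-chan []byte) error {
	bw := bufio.NewWriter(w)

	var nodeCount, trieCount uint64
	for n := range encodedNodes {
		if _, err := bw.Write(n); err != nil {
			return err
		}
		nodeCount++
	}
	for t := range encodedTries {
		if _, err := bw.Write(t); err != nil {
			return err
		}
		trieCount++
	}

	// Footer: counts written last, read back from the file tail when loading.
	var footer [16]byte
	binary.BigEndian.PutUint64(footer[0:8], nodeCount)
	binary.BigEndian.PutUint64(footer[8:16], trieCount)
	if _, err := bw.Write(footer[:]); err != nil {
		return err
	}
	return bw.Flush()
}
```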

TODO

  • update benchmark comparisons using latest results

Additional TODOs that will probably be wrapped up in a separate PR

  • maybe add zeroCopy flag and more tests for these functions: DecodeKey(), DecodeKeyPart(), and DecodePayload(). Not high priority because these functions appear to be unused.
  • further reduce data written to checkpoint part 2 (e.g. encoded payload value size can be uint32 instead of uint64 but changing this affects code outside checkpoint creation)
  • micro optimizations (depends on speedup vs readability tradeoff)
  • add concurrency
  • maybe add file compression or payload compression (only if concurrency is added)
  • maybe replace CRC32 with BLAKE3 or BLAKE2 since checkpoint file is >60GB
  • maybe encode integers using variable length to reduce space (possibly not needed if/when we use file compression)
  • maybe split checkpoint file into 3 files (metadata, nodes, and payload file). I synced with Ramtin and his preference is to keep the checkpoint as one file for this PR.


- Remove files containing StorableNode/StorableTrie/FlattenedForest etc.:
  * mtrie/flattener/forest.go
  * mtrie/flattener/forest_test.go
  * mtrie/flattener/storables.go
  * mtrie/flattener/trie.go
  * mtrie/flattener/trie_test.go
A further reduction of 4.4+GB is planned, for a total
reduction of 10.2+GB (see TODOs at bottom). In checkpoint file format v4:

Leaf node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- hash (2 bytes + 32 bytes)
- path (2 bytes + 32 bytes)
- payload (4 bytes + n bytes)

Encoded payload size also reduced by removing prefix (version 2 bytes +
type 1 byte).

Interim node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- lchild index (8 bytes)
- rchild index (8 bytes)
- hash (2 bytes + 32 bytes)

Trie is encoded as:
- root node index (8 bytes)
- hash (2 bytes + 32 bytes)

Removed v3 leaf node fields:
- version (2 bytes)
- left child index (8 bytes)
- right child index (8 bytes)
- payload version and type (3 bytes)

Removed v3 interim node fields:
- version (2 bytes)
- path (2 bytes)
- payload (4 bytes)

Removed v3 trie field:
- version (2 bytes)

Leaf node data is reduced by 20 bytes (2+8+8+3-1).
Interim node data is reduced by 7 bytes (2+2+4-1).
Trie is reduced by 2 bytes.

TODO: remove max depth and reg count fields from both leaf node and
interim node types.
TODO: reduce hash length from 2 bytes to 1 byte for both leaf node and
interim node types.
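For illustration, here is a hedged sketch of serializing one v4 leaf node into a reusable scratch buffer following the layout above. The node-type constant and the function signature are hypothetical; the real encoder lives in ledger/complete/mtrie/flattener/encoding.go.

```go
package flattener

import "encoding/binary"

// encodeLeafNode sketches the v4 leaf-node layout: node type, height,
// max depth, reg count, length-prefixed hash and path, length-prefixed
// payload. The scratch buffer is reused across calls to avoid per-node
// allocations; the encoded payload is appended after the fixed fields.
func encodeLeafNode(scratch []byte, height, maxDepth uint16, regCount uint64,
	hash, path [32]byte, encPayload []byte) []byte {

	const leafNodeType = 0 // illustrative value, not the real type tag

	var u16 [2]byte
	var u32 [4]byte
	var u64 [8]byte

	buf := scratch[:0]
	buf = append(buf, leafNodeType) // node type (1 byte)

	binary.BigEndian.PutUint16(u16[:], height)
	buf = append(buf, u16[:]...) // height (2 bytes)

	binary.BigEndian.PutUint16(u16[:], maxDepth)
	buf = append(buf, u16[:]...) // max depth (2 bytes)

	binary.BigEndian.PutUint64(u64[:], regCount)
	buf = append(buf, u64[:]...) // reg count (8 bytes)

	binary.BigEndian.PutUint16(u16[:], uint16(len(hash)))
	buf = append(buf, u16[:]...)  // hash length (2 bytes)
	buf = append(buf, hash[:]...) // hash (32 bytes)

	binary.BigEndian.PutUint16(u16[:], uint16(len(path)))
	buf = append(buf, u16[:]...)  // path length (2 bytes)
	buf = append(buf, path[:]...) // path (32 bytes)

	binary.BigEndian.PutUint32(u32[:], uint32(len(encPayload)))
	buf = append(buf, u32[:]...)     // payload length (4 bytes)
	buf = append(buf, encPayload...) // payload (n bytes)

	return buf
}
```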
@fxamacker fxamacker self-assigned this Feb 3, 2022
@fxamacker fxamacker marked this pull request as draft February 3, 2022 16:47
@bluesign (Contributor) commented Feb 4, 2022

Hey @fxamacker, first of all great work.

Do these changes allow using the checkpoint file without loading it entirely into memory? If not, is it possible to add an index at the end? It would be great to have the ability to query the checkpoint file with a CLI tool for lookups. But with the recent changes using atree for storage, I have to import the whole checkpoint file into Badger (or a similar DB) or load it fully into memory.

@codecov-commenter commented Feb 4, 2022

Codecov Report

Merging #1944 (2c587e6) into master (4d5c22c) will increase coverage by 0.11%.
The diff coverage is 71.57%.


@@            Coverage Diff             @@
##           master    #1944      +/-   ##
==========================================
+ Coverage   57.18%   57.29%   +0.11%     
==========================================
  Files         634      633       -1     
  Lines       36679    36833     +154     
==========================================
+ Hits        20976    21105     +129     
- Misses      13068    13077       +9     
- Partials     2635     2651      +16     
Flag        Coverage Δ
unittests   57.29% <71.57%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                   Coverage Δ
ledger/complete/mtrie/forest.go                  61.14% <0.00%> (ø)
ledger/complete/mtrie/flattener/encoding_v3.go   43.47% <43.47%> (ø)
ledger/complete/ledger.go                        59.39% <60.00%> (-0.41%) ⬇️
ledger/complete/wal/checkpointer.go              61.09% <63.57%> (-4.40%) ⬇️
ledger/complete/mtrie/flattener/encoding.go      80.83% <85.25%> (+30.83%) ⬆️
ledger/common/encoding/encoding.go               59.72% <96.36%> (+2.66%) ⬆️
ledger/complete/mtrie/flattener/iterator.go      100.00% <100.00%> (ø)
ledger/complete/wal/wal.go                       55.26% <100.00%> (+0.42%) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d5c22c...2c587e6.

Reduce allocs by using a 4096-byte scratch buffer,
eliminating another 400+ million allocs during
checkpoint reading, and the same again during
checkpoint writing.
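A minimal sketch of the scratch-buffer pattern these commits describe, with illustrative length-prefixed framing (not the exact v4 record layout):

```go
package flattener

import (
	"encoding/binary"
	"io"
)

// readRecord reads one length-prefixed record into a caller-owned scratch
// buffer, so decoding ~400 million records doesn't allocate ~400 million
// temporary buffers. Only records larger than the scratch buffer fall back
// to a fresh allocation.
func readRecord(r io.Reader, scratch []byte) ([]byte, error) {
	// Read the 4-byte length prefix into the scratch buffer.
	if _, err := io.ReadFull(r, scratch[:4]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(scratch[:4])

	buf := scratch
	if int(n) > len(scratch) {
		buf = make([]byte, n) // rare fallback for oversized records
	}
	if _, err := io.ReadFull(r, buf[:n]); err != nil {
		return nil, err
	}
	return buf[:n], nil
}
```

The caller allocates scratch := make([]byte, 4096) once and reuses it for every record; since the returned slice may alias the scratch buffer, the decoded data must be consumed or copied out before the next call.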
@fxamacker changed the title from "WIP: optimize MTrie checkpoint (creating and loading) for speed, memory, and file size" to "WIP: optimize MTrie checkpoint for speed, memory, and file size" on Feb 6, 2022
@fxamacker (Member, Author) commented:

Hello @bluesign,

Do these changes allow using the checkpoint file without loading it entirely into memory?

The entire checkpoint file will be stream-decoded to create the in-memory tries. Currently, the in-memory tries hold all the data from the checkpoint file. Issue #1746 will be handled by a separate PR after this one is merged.

It would be great to have the ability to query the checkpoint file with a CLI tool for lookups. But with the recent changes using atree for storage, I have to import the whole checkpoint file into Badger (or a similar DB) or load it fully into memory.

I understand.

If not, is it possible to add an index at the end?

Maybe (depends on details, tradeoffs, etc.), but adding an index is outside the scope of this PR.

There's a chance this PR might store the payload in a separate file, as a tiny intermediate step towards #1746. Not sure, but maybe this step could be useful to projects like https://github.com/bluesign/checkpointState

Please consider opening an issue for adding an index to the checkpoint file, so discussions can continue there if needed. This PR is focused on optimizing the checkpoint (creation and loading) for speed, memory, and file size.

Decode payloads from the input buffer and copy only the shared data,
to avoid extra allocs for each payload's Key.KeyParts ([]KeyPart).

Loading checkpoint v4 and replaying WALs used 3,076,382,277 fewer
allocs/op compared to v3.

Prior to this change:
When decoding a payload during checkpoint loading, the payload object
was created with data shared with the read buffer and was then deep
copied. Since a payload holds its key parts as []KeyPart, a new slice
was created during the deep copy.
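A sketch of the difference, using simplified stand-ins for the ledger types (the real Payload/Key/KeyPart live in flow-go's ledger package, and the framing of the inputs here is illustrative):

```go
package ledger

// Hypothetical, simplified shapes of flow-go's ledger types.
type KeyPart struct {
	Type  uint16
	Value []byte
}
type Key struct{ KeyParts []KeyPart }
type Payload struct {
	Key   Key
	Value []byte
}

// decodePayloadCopyingData decodes fields straight out of the shared read
// buffer and copies only the byte data, allocating the []KeyPart slice
// exactly once. The previous approach decoded a buffer-backed payload first
// and then deep-copied it, allocating a second []KeyPart slice (and second
// byte slices) per payload.
func decodePayloadCopyingData(partTypes []uint16, partData [][]byte, value []byte) *Payload {
	parts := make([]KeyPart, len(partData)) // single slice allocation
	for i, d := range partData {
		v := make([]byte, len(d))
		copy(v, d) // copy bytes so the payload doesn't alias the read buffer
		parts[i] = KeyPart{Type: partTypes[i], Value: v}
	}
	val := make([]byte, len(value))
	copy(val, value)
	return &Payload{Key: Key{KeyParts: parts}, Value: val}
}
```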
EncodeAndAppendPayloadWithoutPrefix() appends the encoded payload to the
input buffer. If the payload is nil, the unmodified buffer is returned.

This edge case is uncommon and didn't get triggered when creating
checkpoint.3485 from checkpoint.3443 with 41 WAL segments
(both v3->v4 and v4->v4).

Add tests.
NewForest() doesn't have a logger. The error is returned by the
onTreeEvicted() callback function passed to NewForest() by NewLedger().

NewLedger() has a logger and defines onTreeEvicted(), so we can
probably log the error from inside onTreeEvicted() instead. However,
that change is outside the scope of PR #1944.
Add benchmarks for checkpoint creation and
checkpoint loading.

Checkpoint creation time can vary. For example:
926 seconds (first run after OS boot)
863 seconds (2nd run of benchmark without OS reboot in between)

Also fix off-by-one error in the checkpoint filename
created by benchmarks:
was checkpoint.00003485 (number should've been 3484)
now checkpoint.00003484 (same file size and hash, only name changed)

For BenchmarkNewCheckpoint and BenchmarkLoadCheckpointAndWALs:
if the current folder doesn't contain the checkpoint and WAL
files, the -dir option should be used. These two benchmarks
are meant to benchmark real data. For example, use
checkpoint.00003443 and 41 WAL files (00003444 - 00003484)
to create checkpoint.00003484.

BenchmarkNewCheckpointRandom* generates random WAL segments and doesn't
require the -dir option or any files. It can be used for regression tests.
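A hedged sketch of how such a -dir option can be wired into a benchmark (the actual flag registration and benchmark bodies in flow-go may differ; the flag name matches the text above but the wiring is hypothetical):

```go
package wal_test

import (
	"flag"
	"testing"
)

// dir points at real checkpoint and WAL files; go test parses custom
// package-level flags before running benchmarks.
var dir = flag.String("dir", ".", "directory containing checkpoint and WAL files")

func BenchmarkNewCheckpoint(b *testing.B) {
	b.Logf("reading checkpoint + WAL segments from %s", *dir)
	// ... load checkpoint.00003443, replay WAL segments 00003444-00003484,
	// and create checkpoint.00003484, reporting time/op, alloc/op, allocs/op ...
}
```

Invoked as, e.g., go test -bench=BenchmarkNewCheckpoint -dir=/data/checkpoints ./ledger/complete/wal (path assumed).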
@AlexHentschel (Member) left a comment

I just took a high-level look at the PR. I mainly focused my review on changes to the core logic of the trie. I didn't go deep into the checkpointing logic itself, as I am not familiar with that part of the code base.

// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
visitedNodes map[*node.Node]uint64
@AlexHentschel: nice

// and updated MTrie after register writes).
// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
@AlexHentschel: could you add this line also to the existing NewNodeIterator constructor, please?

func NewNodeIterator(mTrie *trie.MTrie) *NodeIterator {

@fxamacker (Member, Author):

I'm not sure we should add the same concurrency warning here, because visitedNodes will always be nil in that constructor, which appears to make it safe for concurrent use (unless I'm mistaken).

I added this comment to NewNodeIterator:

// NodeIterator created by NewNodeIterator is safe for concurrent use
// because visitedNodes is always nil in this case.

Please let me know if I need to update the comment. 🙏

ledger/complete/mtrie/forest.go (outdated review thread, resolved)
- Remove return value (error) from onTreeEvicted()
- Log error in onTreeEvicted() in NewLedger()
- Replace all empty onTreeEvicted callbacks with nil
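A minimal sketch of the resulting callback shape, with hypothetical stand-in types and a hypothetical processEvictedTrie helper (the real signatures live in flow-go's ledger and mtrie packages; flow-go uses zerolog for logging):

```go
package ledger

import "github.com/rs/zerolog"

// MTrie is a stand-in for flow-go's trie.MTrie, for illustration only.
type MTrie struct{}

// processEvictedTrie is a hypothetical placeholder for whatever work the
// eviction callback performs that can fail.
func processEvictedTrie(t *MTrie) error { return nil }

// newOnTreeEvicted shows the post-change shape: the callback returns
// nothing, and the error is logged where a logger exists (NewLedger)
// instead of being returned to NewForest, which has no logger.
func newOnTreeEvicted(logger zerolog.Logger) func(*MTrie) {
	return func(t *MTrie) {
		if err := processEvictedTrie(t); err != nil {
			logger.Error().Err(err).Msg("failed to process evicted trie")
		}
	}
}
```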
NodeIterator created by NewNodeIterator is safe for concurrent use
because visitedNodes is always nil in this case.
@fxamacker fxamacker merged commit 9a1c704 into master Mar 7, 2022
@fxamacker fxamacker deleted the fxamacker/optimize-checkpoint branch March 7, 2022 21:57
@fxamacker changed the title from "Optimize MTrie checkpoint for speed, memory, and file size" to "Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size" on Mar 7, 2022
@fxamacker added the Execution (Cadence Execution Team) label on Jul 14, 2022