This repository has been archived by the owner on Mar 5, 2024. It is now read-only.
Streams with parallel processing, lazy filtering and random sampling #1389
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TL;DR
This is a prerequisite to #1334 which also adds optimal random sampling of RocksDB and other reusable stream operations.
Summary
data_stream:t/1
, initially used in streaming blocks from ledger snapshot;foreach
fold
sample
: implementing an optimal reservoir sampling algorithm - can be used in all cases where we need to pick a random record from RocksDB;pmap_to_bag
: parallel maps a stream into a list with arbitrary re-order - this was used to parallelizebuild_hash_chain
for a ~6x speedup in Parallelizebuild_hash_chain
#1334map
filter
(can be used in restrict iterator_move to returning valid pubkeys erlang-libp2p#437, see:blockchain_rocks_SUITE:t_sample_filtered/1
)append
(allowing chaining N streams into one)blockchain_rocks
), offering all of the above operations.Pitch
Our data access patterns have the same broad shape - stream processing; and the same specific patterns, including, but not limited to:
While we've accumulated adhoc solutions, they:
This PR canonicalizes a stream abstraction adaptable to any of those needs, implements the common stream processing patterns and adapts them to RocksDB, testing at every step.
Reservations
data_
prefix;core
, so it can be used by libs deeper in the dependency tree, likeerlang-libp2p
.Something like
erlang-hel
(standing for Helium Library) can be a good home.