This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Streams with parallel processing, lazy filtering and random sampling #1389

Open

wants to merge 25 commits into master
Conversation

@xandkar (Contributor) commented Jun 8, 2022

TL;DR

This is a prerequisite to #1334; it also adds optimal random sampling of RocksDB and other reusable stream operations.

Summary

  1. generalizes the stream type as data_stream:t/1, initially used in streaming blocks from a ledger snapshot;
  2. implements general sequence operations on streams (iteration, folding, filtering, random sampling, parallel processing);
  3. exposes RocksDB access as a stream (blockchain_rocks), offering all of the above operations.
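To make the shape of the abstraction concrete, here is a minimal sketch of a lazy stream type in the spirit of data_stream:t/1. The module name, function names, and signatures below are illustrative assumptions, not the PR's actual API:

```erlang
%% stream_sketch: a lazy stream as a thunk returning either the end of the
%% stream or the next element paired with the rest of the stream.
%% Illustrative only; the PR's data_stream module may differ.
-module(stream_sketch).
-export([from_list/1, next/1, map/2, filter/2, to_list/1]).

-type t(A) :: fun(() -> none | {some, {A, t(A)}}).

-spec from_list([A]) -> t(A).
from_list([]) -> fun () -> none end;
from_list([X | Xs]) -> fun () -> {some, {X, from_list(Xs)}} end.

%% Force one step of the stream.
-spec next(t(A)) -> none | {some, {A, t(A)}}.
next(S) -> S().

%% Lazily apply F to every element; nothing is computed until forced.
-spec map(fun((A) -> B), t(A)) -> t(B).
map(F, S) ->
    fun () ->
        case S() of
            none -> none;
            {some, {X, Rest}} -> {some, {F(X), map(F, Rest)}}
        end
    end.

%% Lazily drop elements failing the predicate P.
-spec filter(fun((A) -> boolean()), t(A)) -> t(A).
filter(P, S) ->
    fun () ->
        case S() of
            none -> none;
            {some, {X, Rest}} ->
                case P(X) of
                    true  -> {some, {X, filter(P, Rest)}};
                    false -> (filter(P, Rest))()
                end
        end
    end.

%% Force the whole stream into a list (only safe on finite streams).
-spec to_list(t(A)) -> [A].
to_list(S) ->
    case S() of
        none -> [];
        {some, {X, Rest}} -> [X | to_list(Rest)]
    end.
```

Because each operation returns another thunk, filtering and mapping compose without materializing intermediate collections, which is what makes the same type usable for both snapshot block streaming and RocksDB iteration.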

Pitch

Our data access patterns share the same broad shape (stream processing) and the same specific patterns, including, but not limited to:

  • side-effecting iteration
  • folding
  • filtering
  • random sampling
  • parallel processing
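Of these, random sampling is the least obvious to do in a single pass. A common single-pass approach is reservoir sampling (Algorithm R); the sketch below operates on a plain list standing in for a stream, and its names are illustrative, not the PR's API:

```erlang
%% reservoir_sketch: pick K elements uniformly at random in one pass
%% (Algorithm R). Illustrative only; the PR's sampling over
%% data_stream:t/1 may differ.
-module(reservoir_sketch).
-export([sample/2]).

-spec sample(pos_integer(), [A]) -> [A].
sample(K, Xs) ->
    {Reservoir, _} =
        lists:foldl(
            fun (X, {Res, I}) ->
                case I =< K of
                    %% Fill the reservoir with the first K elements.
                    true -> {[X | Res], I + 1};
                    false ->
                        %% Keep the I-th element with probability K/I,
                        %% replacing a uniformly chosen reservoir slot.
                        J = rand:uniform(I),
                        case J =< K of
                            true  -> {replace_nth(J, X, Res), I + 1};
                            false -> {Res, I + 1}
                        end
                end
            end,
            {[], 1},
            Xs),
    Reservoir.

%% Replace the N-th element (1-based) of a list.
-spec replace_nth(pos_integer(), A, [A]) -> [A].
replace_nth(N, X, L) ->
    {Pre, [_ | Post]} = lists:split(N - 1, L),
    Pre ++ [X | Post].
```

The key property is that every element ends up in the sample with probability K/N without knowing N (the total count) in advance, which is exactly what streaming over RocksDB requires.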

While we've accumulated ad hoc solutions to these, none is general enough to reuse.

This PR canonicalizes a stream abstraction adaptable to any of those needs, implements the common stream-processing patterns, and adapts them to RocksDB, with tests at every step.
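The parallel-processing pattern mentioned above is often expressed in Erlang as a parallel map: spawn one process per element, then collect results in order. A minimal sketch (not the PR's implementation):

```erlang
%% pmap_sketch: apply F to each element of Xs in a separate process and
%% collect the results in input order. Illustrative only; a production
%% version would bound concurrency and handle worker crashes.
-module(pmap_sketch).
-export([pmap/2]).

-spec pmap(fun((A) -> B), [A]) -> [B].
pmap(F, Xs) ->
    Parent = self(),
    %% Tag each worker with a unique reference so replies can be
    %% matched back to their input position.
    Refs =
        [begin
             Ref = make_ref(),
             spawn(fun () -> Parent ! {Ref, F(X)} end),
             Ref
         end || X <- Xs],
    %% Selective receive on each reference preserves input order
    %% regardless of completion order.
    [receive {Ref, Y} -> Y end || Ref <- Refs].
```

Selective receive on the per-element reference is what lets results arrive in any order while the output list stays aligned with the input.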

Reservations

  • I'm not crazy about the data_ prefix;
  • it should ideally live outside of core, so it can be used by libs deeper in the dependency tree, like erlang-libp2p.

Something like erlang-hel (standing for Helium Library) could be a good home.

@xandkar xandkar changed the title Generalize the stream abstraction used in the snap and parallel hash chain Streams with parallel processing, lazy filtering and random sampling Jun 19, 2022