Documentation / Installation / Repository / PyPI
Beavers is a Python library for stream processing, optimized for analytics.
It is used at Tradewell Technologies to calculate analytics and serve model predictions, for both real-time and batch jobs.
- Works in real time (e.g. reading from Kafka) and in replay mode (e.g. reading from Parquet files).
- Optimized for analytics, using micro-batches (instead of processing records one by one).
- Similar to Incremental, it updates nodes in a DAG incrementally.
- Taking inspiration from Kafka Streams, the DAG has two types of nodes (see the sketch after this list):
    - Stream: ephemeral micro-batches of events (cleared after every cycle).
    - State: durable state derived from streams.
- Clear separation between business logic and IO, so the same DAG can be used in real-time mode, in replay mode, or in tests.
- Functional interface: no inheritance or decorator required.
- Support for complicated joins, not just "linear" data flow.
- No concurrency support. To speed up calculations, use libraries like pandas, pyarrow, or polars.
- No async code. To speed up IO, use the Kafka driver's native threads or a Parquet IO thread pool.
- No support for persistent state. Instead of saving state, replay historical data from Kafka to prime stateful nodes.
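To make the stream/state distinction and the IO-free testing story concrete, here is a minimal sketch. It assumes an API along the lines of `Dag`, `source_stream`, `state(...).map(...)`, `set_stream`, and `execute`; treat the exact names as illustrative rather than authoritative, and refer to the documentation for the real interface.

```python
# Minimal sketch of a Beavers DAG; the exact method names are assumptions.
from beavers import Dag


class RunningTotal:
    """Stateful callable: keeps a durable total across cycles."""

    def __init__(self) -> None:
        self._total = 0

    def __call__(self, values: list[int]) -> int:
        self._total += sum(values)
        return self._total


dag = Dag()

# Stream node: holds the current micro-batch of events, cleared after each cycle.
numbers = dag.source_stream()

# State node: durable value derived from the stream.
total = dag.state(RunningTotal()).map(numbers)

# Because IO stays outside the DAG, the same graph can be driven by a test,
# a Kafka consumer, or a Parquet replay. Here, micro-batches are fed by hand:
numbers.set_stream([1, 2, 3])
dag.execute()
assert total.get_value() == 6

numbers.set_stream([4])
dag.execute()
assert total.get_value() == 10  # state persists; the stream was cleared
```

In production, the same source node would be fed from Kafka (real-time mode) or from rows read back out of Parquet files (replay mode), without changing the DAG itself.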