-
Notifications
You must be signed in to change notification settings - Fork 3k
Closed as not planned
Labels
Description
Flink is famous for its streaming computation.
- Iceberg has this capability as a message bus for stream computing. Even in near real time, it can meet many requirements.
- Compared with Kafka, iceberg can save all the historical data, while Kafka can only save the data of recent days. The ad-hoc can query all the historical data. Moreover, iceberg has efficient query performance and storage efficiency.
After #1346 , it is easy to build Flink streaming reading based on it.
Unlike Spark, Flink streaming continuous monitor new Files of table, and directly send the splits to downstream tasks. The source don't need take care of micro-batch size, because the downstream tasks stores incoming splits into state, and consume one by one.
Monitor ----(Splits)-----> ReaderOperator
Monitor (Single task):
- Monitoring snapshots of the Iceberg table.
- Creating the splits corresponding to the incremental files using
FlinkSplitGenerator. (Actually usingTableScan.appendsBetween). - Assigning them to downstream tasks for further processing.
ReaderOperator (multiple tasks):
- Put received splits into state (A splits queue).
- Read splits using
FlinkInputFormatin a checkpoint cycle. - If a checkpoint barrier coming, let the main thread complete the snapshot for checkpoint.
- After that, the task continues to consume the remaining splits in the state.
Reactions are currently unavailable