
Flink: Support Flink streaming reading #1383

@JingsongLi

Description


Flink is famous for its streaming computation.

  • Iceberg can act as a message bus for stream computing: even with only near-real-time latency, it can meet many requirements.
  • Compared with Kafka, Iceberg can retain all historical data, while Kafka only keeps the data of recent days, so ad-hoc queries can cover the full history. Moreover, Iceberg offers efficient query performance and storage efficiency.

After #1346 , it is easy to build Flink streaming reading on top of it.
Unlike Spark, the Flink streaming source continuously monitors the table for new files and directly sends the resulting splits to downstream tasks. The source does not need to care about micro-batch size, because the downstream tasks store incoming splits in state and consume them one by one. A rough sketch of the job wiring follows the diagram below.

Monitor ----(Splits)-----> ReaderOperator
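A minimal wiring sketch of this two-stage pipeline, assuming the hypothetical MonitorFunction and ReaderFunction sketched further below; the table path, checkpoint interval, polling interval, and parallelism are illustrative, not part of the proposal:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.flink.TableLoader;

public class IcebergStreamingReadJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000L);

    // Illustrative table location; any TableLoader works here.
    TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs:///warehouse/db/table");

    // The monitor must be a single task so that splits are planned exactly once.
    DataStream<CombinedScanTask> splits = env
        .addSource(new MonitorFunction(tableLoader, 10_000L), "IcebergMonitor")
        .setParallelism(1);

    // Splits are distributed round-robin to the parallel reader tasks.
    DataStream<RowData> rows = splits
        .rebalance()
        .process(new ReaderFunction())
        .name("IcebergReader");

    rows.print();
    env.execute("Iceberg streaming read");
  }
}
```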

Monitor (Single task):

  • Monitoring snapshots of the Iceberg table.
  • Creating the splits corresponding to the incremental files using FlinkSplitGenerator (internally using TableScan.appendsBetween).
  • Assigning them to downstream tasks for further processing (see the sketch after this list).
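A rough illustration of the monitor task follows. MonitorFunction, the polling interval, and emitting CombinedScanTask directly are assumptions for the example; in the proposal FlinkSplitGenerator would produce the actual splits, which is approximated here by calling TableScan.appendsBetween on the table scan directly.

```java
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.io.CloseableIterable;

// Illustrative single-parallelism monitor: polls the table for new snapshots and
// emits one CombinedScanTask per incremental split to the downstream readers.
public class MonitorFunction extends RichSourceFunction<CombinedScanTask> {

  private final TableLoader tableLoader;
  private final long pollIntervalMs;

  private transient Table table;
  private volatile boolean running = true;
  private Long lastSnapshotId;   // should also be kept in operator state for recovery

  public MonitorFunction(TableLoader tableLoader, long pollIntervalMs) {
    this.tableLoader = tableLoader;
    this.pollIntervalMs = pollIntervalMs;
  }

  @Override
  public void run(SourceContext<CombinedScanTask> ctx) throws Exception {
    tableLoader.open();
    table = tableLoader.loadTable();

    while (running) {
      table.refresh();
      Snapshot current = table.currentSnapshot();

      if (current != null && (lastSnapshotId == null || current.snapshotId() != lastSnapshotId)) {
        TableScan scan = table.newScan();
        if (lastSnapshotId != null) {
          // Only plan the files appended since the last consumed snapshot.
          scan = scan.appendsBetween(lastSnapshotId, current.snapshotId());
        }
        try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
          synchronized (ctx.getCheckpointLock()) {
            for (CombinedScanTask task : tasks) {
              ctx.collect(task);
            }
            lastSnapshotId = current.snapshotId();
          }
        }
      }
      Thread.sleep(pollIntervalMs);
    }
  }

  @Override
  public void cancel() {
    running = false;
  }
}
```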

ReaderOperator (multiple tasks):

  • Put received splits into state (a queue of splits).
  • Read splits using FlinkInputFormat within a checkpoint cycle.
  • When a checkpoint barrier arrives, let the main thread complete the snapshot for the checkpoint.
  • After that, the task continues to consume the remaining splits from state (see the sketch after this list).
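A minimal sketch of the reader side, using a ProcessFunction with operator state. ReaderFunction and the readSplit helper are hypothetical; the real operator would read through FlinkInputFormat from #1346 and yield to the checkpoint barrier between splits rather than draining the whole queue in one call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Queue;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.table.data.RowData;
import org.apache.flink.util.Collector;
import org.apache.iceberg.CombinedScanTask;

// Illustrative reader: buffers incoming splits, reads them between checkpoints,
// and snapshots the not-yet-consumed splits so a restart resumes where it left off.
public class ReaderFunction extends ProcessFunction<CombinedScanTask, RowData>
    implements CheckpointedFunction {

  private transient ListState<CombinedScanTask> pendingState;
  private transient Queue<CombinedScanTask> pendingSplits;

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    pendingState = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("pending-splits", CombinedScanTask.class));
    pendingSplits = new ArrayDeque<>();
    if (context.isRestored()) {
      for (CombinedScanTask task : pendingState.get()) {
        pendingSplits.add(task);
      }
    }
  }

  @Override
  public void processElement(CombinedScanTask task, Context ctx, Collector<RowData> out)
      throws Exception {
    pendingSplits.add(task);
    // Simplification: drain the queue here; the real operator reads via
    // FlinkInputFormat and lets the checkpoint barrier in between splits.
    while (!pendingSplits.isEmpty()) {
      readSplit(pendingSplits.poll(), out);   // hypothetical helper around FlinkInputFormat
    }
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    // Persist the splits that have not been fully consumed yet.
    pendingState.update(new ArrayList<>(pendingSplits));
  }

  private void readSplit(CombinedScanTask task, Collector<RowData> out) {
    // Placeholder: iterate the split's files and emit RowData records.
  }
}
```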
