
Flink: Support Flink streaming reading #1383

@JingsongLi

Description


Flink is famous for its streaming computation.

  • Iceberg can act as a message bus for stream computing: even with only near-real-time latency, it can meet many requirements.
  • Compared with Kafka, Iceberg can retain all historical data, while Kafka only keeps the data of recent days, so ad-hoc queries can cover the full history. Moreover, Iceberg offers efficient query performance and storage efficiency.

After #1346 , it is easy to build Flink streaming reading on top of it.
Unlike Spark, the Flink streaming source continuously monitors the table for new files and directly sends the resulting splits to downstream tasks. The source does not need to care about micro-batch size, because the downstream tasks store incoming splits in state and consume them one by one. A rough sketch of the job wiring follows the diagram below.

Monitor ----(Splits)-----> ReaderOperator
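A minimal wiring sketch of this two-stage pipeline, assuming the hypothetical MonitorFunction and ReaderFunction sketched further below; the table path, checkpoint interval, polling interval, and parallelism are illustrative, not part of the proposal:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.flink.TableLoader;

public class IcebergStreamingReadJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000L);

    // Illustrative table location; any TableLoader works here.
    TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs:///warehouse/db/table");

    // The monitor must be a single task so that splits are planned exactly once.
    DataStream<CombinedScanTask> splits = env
        .addSource(new MonitorFunction(tableLoader, 10_000L), "IcebergMonitor")
        .setParallelism(1);

    // Splits are distributed round-robin to the parallel reader tasks.
    DataStream<RowData> rows = splits
        .rebalance()
        .process(new ReaderFunction())
        .name("IcebergReader");

    rows.print();
    env.execute("Iceberg streaming read");
  }
}
```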

Monitor (Single task):

  • Monitoring snapshots of the Iceberg table.
  • Creating the splits corresponding to the incremental files using FlinkSplitGenerator (internally using TableScan.appendsBetween).
  • Assigning them to downstream tasks for further processing (see the sketch after this list).
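A rough illustration of the monitor task follows. MonitorFunction, the polling interval, and emitting CombinedScanTask directly are assumptions for the example; in the proposal FlinkSplitGenerator would produce the actual splits, which is approximated here by calling TableScan.appendsBetween on the table scan directly.

```java
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.io.CloseableIterable;

// Illustrative single-parallelism monitor: polls the table for new snapshots and
// emits one CombinedScanTask per incremental split to the downstream readers.
public class MonitorFunction extends RichSourceFunction<CombinedScanTask> {

  private final TableLoader tableLoader;
  private final long pollIntervalMs;

  private transient Table table;
  private volatile boolean running = true;
  private Long lastSnapshotId;   // should also be kept in operator state for recovery

  public MonitorFunction(TableLoader tableLoader, long pollIntervalMs) {
    this.tableLoader = tableLoader;
    this.pollIntervalMs = pollIntervalMs;
  }

  @Override
  public void run(SourceContext<CombinedScanTask> ctx) throws Exception {
    tableLoader.open();
    table = tableLoader.loadTable();

    while (running) {
      table.refresh();
      Snapshot current = table.currentSnapshot();

      if (current != null && (lastSnapshotId == null || current.snapshotId() != lastSnapshotId)) {
        TableScan scan = table.newScan();
        if (lastSnapshotId != null) {
          // Only plan the files appended since the last consumed snapshot.
          scan = scan.appendsBetween(lastSnapshotId, current.snapshotId());
        }
        try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
          synchronized (ctx.getCheckpointLock()) {
            for (CombinedScanTask task : tasks) {
              ctx.collect(task);
            }
            lastSnapshotId = current.snapshotId();
          }
        }
      }
      Thread.sleep(pollIntervalMs);
    }
  }

  @Override
  public void cancel() {
    running = false;
  }
}
```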

ReaderOperator (multiple tasks):

  • Put received splits into state (a queue of splits).
  • Read splits using FlinkInputFormat within a checkpoint cycle.
  • When a checkpoint barrier arrives, let the main thread complete the snapshot for the checkpoint.
  • After that, the task continues to consume the remaining splits from state (see the sketch after this list).
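A minimal sketch of the reader side, using a ProcessFunction with operator state. ReaderFunction and the readSplit helper are hypothetical; the real operator would read through FlinkInputFormat from #1346 and yield to the checkpoint barrier between splits rather than draining the whole queue in one call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Queue;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.table.data.RowData;
import org.apache.flink.util.Collector;
import org.apache.iceberg.CombinedScanTask;

// Illustrative reader: buffers incoming splits, reads them between checkpoints,
// and snapshots the not-yet-consumed splits so a restart resumes where it left off.
public class ReaderFunction extends ProcessFunction<CombinedScanTask, RowData>
    implements CheckpointedFunction {

  private transient ListState<CombinedScanTask> pendingState;
  private transient Queue<CombinedScanTask> pendingSplits;

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    pendingState = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("pending-splits", CombinedScanTask.class));
    pendingSplits = new ArrayDeque<>();
    if (context.isRestored()) {
      for (CombinedScanTask task : pendingState.get()) {
        pendingSplits.add(task);
      }
    }
  }

  @Override
  public void processElement(CombinedScanTask task, Context ctx, Collector<RowData> out)
      throws Exception {
    pendingSplits.add(task);
    // Simplification: drain the queue here; the real operator reads via
    // FlinkInputFormat and lets the checkpoint barrier in between splits.
    while (!pendingSplits.isEmpty()) {
      readSplit(pendingSplits.poll(), out);   // hypothetical helper around FlinkInputFormat
    }
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    // Persist the splits that have not been fully consumed yet.
    pendingState.update(new ArrayList<>(pendingSplits));
  }

  private void readSplit(CombinedScanTask task, Collector<RowData> out) {
    // Placeholder: iterate the split's files and emit RowData records.
  }
}
```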
