Flink: store watermark as iceberg table's property #2109
Conversation
@dixingxing0, in our implementation, we store watermarks in snapshot summary metadata. I think that's a more appropriate place for it because it is metadata about the snapshot that is produced. We also use a watermark per writer because we write in 3 different AWS regions. So I think it would make sense to be able to name each watermark, possibly with a default if you choose not to name it.
FYI @stevenzwu
@dixingxing0 can you describe the motivation for checkpointing the watermarks in Flink state? Ryan described our use of watermarks in snapshot metadata. They are used to indicate data completeness on the ingestion path so that downstream batch consumer jobs can be triggered when data is complete for a window (like hourly).
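For context, here is a minimal sketch of what writing a named watermark into the snapshot summary could look like with Iceberg's `SnapshotUpdate#set` API. The property key, the `us-east-1` suffix, and the `table`, `dataFiles`, and `currentWatermarkMillis` variables are illustrative assumptions, not code from this PR:

```java
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;

// Hypothetical sketch: record a per-writer watermark in the snapshot summary.
// AppendFiles extends SnapshotUpdate, so summary properties can be set before commit.
AppendFiles append = table.newAppend();
for (DataFile dataFile : dataFiles) {
  append.appendFile(dataFile);
}
// "flink.watermark-for-us-east-1" is an illustrative key; the suffix names the writer.
append.set("flink.watermark-for-us-east-1", Long.toString(currentWatermarkMillis));
append.commit();
```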
```java
if (context.isRestored()) {
  watermarkPerCheckpoint.putAll(watermarkState.get().iterator().next());
```
Will this be backwards compatible for ongoing streaming jobs that don't have any `watermarkState` when they restore? For example, for ongoing streaming jobs that are upgraded to a version of Iceberg that includes this patch?

Looking at the Flink `AppendingState` interface, it says that calling `.get()` should return `null` if the state is empty. Also, you can see that the value of `restoredFlinkJobId` below, obtained from calling `jobIdState.get().iterator().next()`, is checked for null or empty.
Actually, on further inspection of the `ListState` interface, it says that passing `null` to `putAll` is a no-op. So I don't think there should be backwards compatibility issues, but should we possibly be (1) logging something if no watermark state is restored and/or (2) doing our own `null` check rather than relying on the documented behavior of `ListState#putAll` staying consistent over time when inserting `null`?

I don't have a strong opinion about either point 1 or point 2, but I thought it might be worth bringing up for discussion.
Thanks @kbendick, I neglected the backwards compatibility issue; the current code will raise a `java.util.NoSuchElementException` when `watermarkState` is empty, so I will fix it. I will also log whether the watermark state was restored, since restore is not a high-frequency event.

BTW, `watermarkPerCheckpoint` is an instance of `HashMap`; I think the variable name misled you here 😄.
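A minimal sketch of what such a defensive restore might look like, assuming an SLF4J `LOG` field and `java.util.Iterator`/`java.util.Map` in scope; the messages and structure are illustrative, not the actual patch:

```java
// Hypothetical null-safe restore with logging; not the actual fix in this PR.
if (context.isRestored()) {
  Iterator<Map<Long, Long>> restoredWatermarks = watermarkState.get().iterator();
  if (restoredWatermarks.hasNext()) {
    watermarkPerCheckpoint.putAll(restoredWatermarks.next());
    LOG.info("Restored {} watermark entries from state.", watermarkPerCheckpoint.size());
  } else {
    // Jobs upgraded from an older Iceberg version have no watermark state yet.
    LOG.info("No watermark state found on restore; starting with an empty map.");
  }
}
```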
Thanks @rdblue, agree with you. About naming the watermark, I think we can introduce a new configuration:

```
// user specified configuration
flink.store-watermark=false // as default
flink.watermark-name=default // as default

// written by flink file committer
flink.watermark-for-default=the-watermark // use flink.watermark-name as suffix
```

@rdblue what do you think?
Thanks @stevenzwu, about the watermark state, I am just following the current restore behavior:

```java
NavigableMap<Long, byte[]> uncommittedDataFiles = Maps
    .newTreeMap(checkpointsState.get().iterator().next())
    .tailMap(maxCommittedCheckpointId, false);
if (!uncommittedDataFiles.isEmpty()) {
  // Committed all uncommitted data files from the old flink job to iceberg table.
  long maxUncommittedCheckpointId = uncommittedDataFiles.lastKey();
  commitUpToCheckpoint(uncommittedDataFiles, restoredFlinkJobId, maxUncommittedCheckpointId);
}
```

Since Flink will commit the last uncommitted checkpoint, I think we should also store the right watermark for that checkpoint. Our use case is exactly the same as the one you and @rdblue described, except we don't have multiple writers 😄.
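To illustrate that point, a hedged sketch of how the restored watermark map could feed the same restore path; the names are reused from the snippet above, and the lookup itself is an illustration rather than the PR's code:

```java
// Hypothetical: after restoring uncommitted data files, pick up the watermark
// recorded for the newest uncommitted checkpoint and commit it alongside them.
long maxUncommittedCheckpointId = uncommittedDataFiles.lastKey();
Long restoredWatermark = watermarkPerCheckpoint.get(maxUncommittedCheckpointId);
if (restoredWatermark != null) {
  // the committer would write this value into the snapshot summary / table property
  LOG.info("Restoring watermark {} for checkpoint {}.", restoredWatermark, maxUncommittedCheckpointId);
}
commitUpToCheckpoint(uncommittedDataFiles, restoredFlinkJobId, maxUncommittedCheckpointId);
```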
```java
@@ -106,6 +114,9 @@
  // All pending checkpoints states for this function.
  private static final ListStateDescriptor<SortedMap<Long, byte[]>> STATE_DESCRIPTOR = buildStateDescriptor();
  private transient ListState<SortedMap<Long, byte[]>> checkpointsState;
  private static final ListStateDescriptor<Map<Long, Long>> WATERMARK_DESCRIPTOR = new ListStateDescriptor<>(
      "iceberg-flink-watermark", new MapTypeInfo<>(BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.LONG_TYPE_INFO));
  private transient ListState<Map<Long, Long>> watermarkState;
```
Should we define a `MetaData` class to hold all checkpointed metadata fields so that we don't have to define a new state for each case?

Ideally, I would prefer the metadata and the manifest file bundled in a single class (per checkpoint). That would require the complexity of handling state schema evolution, which I am not sure is worth the effort.
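A rough sketch of what such a bundled, per-checkpoint class could look like; the class and field names are hypothetical, and the state-schema-evolution cost mentioned above is exactly the open question:

```java
import java.io.Serializable;

// Hypothetical container for everything checkpointed per checkpoint id.
// Bundling avoids adding a new Flink state handle for each metadata field,
// at the cost of managing state schema evolution for this class.
public class CheckpointMetadata implements Serializable {
  private final byte[] serializedManifest;  // today's SortedMap<Long, byte[]> value
  private final long watermark;             // today's Map<Long, Long> value

  public CheckpointMetadata(byte[] serializedManifest, long watermark) {
    this.serializedManifest = serializedManifest;
    this.watermark = watermark;
  }

  public byte[] serializedManifest() {
    return serializedManifest;
  }

  public long watermark() {
    return watermark;
  }
}
```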
```java
@@ -296,6 +318,13 @@ private void commitOperation(SnapshotUpdate<?> operation, int numDataFiles, int

  long start = System.currentTimeMillis();
  operation.commit(); // abort is automatically called if this fails.

  Long watermarkForCheckpoint = watermarkPerCheckpoint.get(checkpointId);
```
We need to use a table transaction here so that `operation.commit()` and `table.updateProperties()...commit()` are atomic. This may require a bigger refactoring of the code though.
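For reference, a minimal sketch of the atomic variant using Iceberg's `Transaction` API; the property key is a placeholder, and `table`, `dataFiles`, and `watermarkForCheckpoint` are assumed to come from the surrounding committer code:

```java
// Hypothetical: commit the appended files and the watermark property atomically.
Transaction txn = table.newTransaction();

AppendFiles append = txn.newAppend();
for (DataFile dataFile : dataFiles) {
  append.appendFile(dataFile);
}
append.commit();  // staged in the transaction, not yet visible

txn.updateProperties()
    .set("flink.watermark-for-default", Long.toString(watermarkForCheckpoint))
    .commit();    // also staged

txn.commitTransaction();  // both changes become visible together
```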
If we store the watermark in snapshot summary metadata as @rdblue said, we can omit the `table.updateProperties()` transaction.
We actually set it as table properties too.

I think table properties are easier for the workflow scheduler (in the batch system) to query. Otherwise, they have to iterate the snapshots and find the latest watermarks for all 3 regions. cc @rdblue
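To illustrate the difference for a downstream scheduler, a hedged sketch; the catalog, table identifier, and property key are placeholders:

```java
// Reading the watermark from table properties: a single lookup per region.
Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
long usEastWatermark = Long.parseLong(
    table.properties().getOrDefault("flink.watermark-for-us-east-1", "0"));

// The alternative with snapshot summaries: walk snapshots to find the latest
// value per region, since each snapshot only carries its own summary.
long latest = 0L;
for (Snapshot snapshot : table.snapshots()) {
  String value = snapshot.summary().get("flink.watermark-for-us-east-1");
  if (value != null) {
    latest = Math.max(latest, Long.parseLong(value));
  }
}
```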
```java
@Override
public void processWatermark(Watermark mark) throws Exception {
  super.processWatermark(mark);
  if (mark.getTimestamp() != Watermark.MAX_WATERMARK.getTimestamp()) {
```
Why do we need to ignore the `MAX_WATERMARK`? It signals the end of input.
As you said before, we use the watermark to indicate data completeness on the ingestion path, so I think we do not need to store `MAX_WATERMARK` when the Flink job runs in streaming mode.

If the Flink job runs in batch mode, even if we store one `MAX_WATERMARK`, we still can't know which partition is complete; I think in batch mode we can simply rely on the scheduling system. I'm not sure how to use the `MAX_WATERMARK`, so I just ignore it 😁.
```java
@@ -106,6 +114,9 @@
  // All pending checkpoints states for this function.
  private static final ListStateDescriptor<SortedMap<Long, byte[]>> STATE_DESCRIPTOR = buildStateDescriptor();
  private transient ListState<SortedMap<Long, byte[]>> checkpointsState;
  private static final ListStateDescriptor<Map<Long, Long>> WATERMARK_DESCRIPTOR = new ListStateDescriptor<>(
```
We probably should use `SortedMap` here.
@dixingxing0 thx a lot for the additional context, that is very helpful. I left a few comments.

Regarding the scenario of multiple writer jobs and a single table, I am afraid that the additional config won't help because we are talking about one table here. Somehow, we need to allow a provider to supply the suffix for the watermark property key. For us, the suffix is the AWS region. I am not sure what the cleanest way to achieve that is. We could define a provider class config and use reflection to instantiate it, but I am hesitant about reflection as it is impossible to pass dependencies to a reflectively instantiated class.
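A rough sketch of the provider idea being discussed; the interface and implementation names are hypothetical, and the dependency-injection limitation noted above still applies:

```java
import java.io.Serializable;

// Hypothetical provider that supplies the per-writer watermark key suffix.
public interface WatermarkSuffixProvider extends Serializable {
  String watermarkSuffix();
}

// A writer deployed in a given AWS region could ship its own implementation,
// e.g. returning "us-east-1", so the committer writes
// "flink.watermark-for-us-east-1" without any table-level config.
class RegionSuffixProvider implements WatermarkSuffixProvider {
  @Override
  public String watermarkSuffix() {
    return System.getenv().getOrDefault("AWS_REGION", "default");
  }
}
```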
@stevenzwu thanks for the review and comments! As you described, we cannot configure the watermark name suffix as a table property for multiple writers 😁. How about we introduce new fields in `org.apache.iceberg.flink.sink.FlinkSink.Builder`?

```
// introduce new fields in org.apache.iceberg.flink.sink.FlinkSink.Builder
private boolean storeWatermarkEnabled; // default false
private String watermarkNameSuffix; // default "default"

// iceberg `table property` or `snapshot summary` written by flink file committer
flink.watermark-for-default=the-watermark // use watermarkNameSuffix config as suffix
```
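A hedged sketch of how such builder options might be used from a Flink job; `storeWatermark` and `watermarkNameSuffix` are the hypothetical additions proposed above, while the rest follows the existing `FlinkSink` builder API:

```java
// Hypothetical usage if the proposed builder fields were exposed as options.
FlinkSink.forRowData(rowDataStream)
    .tableLoader(tableLoader)
    .table(table)
    .storeWatermark(true)             // proposed: enable writing the watermark
    .watermarkNameSuffix("us-east-1") // proposed: per-writer suffix for the property key
    .append();
```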
@dixingxing0 yeah, extending the `FlinkSink.Builder` sounds good. Small suggestion on the naming:
Thanks, I will address it.
We actually implemented this in a slightly different way of calculating the watermark. Instead of using the Flink watermark, we add some additional metadata (min, max, sum, count) per DataFile for the timestamp column. In the committer, we use the min of the mins to decide the watermark value. We never regress the watermark value. That metadata also helps us calculate metrics for ingestion latency (commit time - event/Kafka time): min, max, avg.

Just to share; by no means am I suggesting changing the approach in this PR. It is perfectly good.
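As an illustration of that alternative, a hedged sketch of deriving a non-regressing watermark from per-file column statistics in the committer. The `event_ts` column, the long timestamp type, and the `table`, `dataFiles`, and `currentWatermark` variables are assumptions; it also relies on lower bounds having been collected for that column:

```java
// Hypothetical: derive the watermark as the minimum timestamp lower bound
// across the files in this commit, never letting it move backwards.
int timestampFieldId = table.schema().findField("event_ts").fieldId();

long minOfMins = Long.MAX_VALUE;
for (DataFile dataFile : dataFiles) {
  ByteBuffer lowerBound = dataFile.lowerBounds().get(timestampFieldId);
  if (lowerBound != null) {
    Long fileMin = Conversions.fromByteBuffer(Types.LongType.get(), lowerBound);
    minOfMins = Math.min(minOfMins, fileMin);
  }
}

if (minOfMins != Long.MAX_VALUE) {
  // Only advance; never regress a previously committed watermark.
  currentWatermark = Math.max(currentWatermark, minOfMins);
}
```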
Through this PR, I have an idea: could we add a callback?
A callback sounds complicated and seems to tie too much of the back end together. I wouldn't want something plugged into the Iceberg component talking to Kafka directly. That sounds like we're trying to work around a framework limitation.
thx @stevenzwu @rdblue, that sounds great! We also need to embed the Iceberg table, which is regarded as a real-time table, into our workflow. Is there any doc or patch for your implementation?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Fixes #2108.