[FLINK-25920] Ignore duplicate EOI in SinkWriter #25292

AHeise · 2024-09-05T14:18:28Z

What is the purpose of the change

In case of a failure after the final checkpoint, end of input (EOI) is called twice. SinkWriter should ignore the second call to avoid emitting duplicate committables. This commit uses a union state to remember that EOI happened and suppress additional handling.

Brief change log

Improve sink test assertions
Straighten EOI handling in CommittableCollector
Ignore duplicate EOI in SinkWriter <- main fix
Improve logging in committable handling of the sink

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

Added integration tests that cover the fix

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2024-09-05T14:44:39Z

CI report:

6ebfe41 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

fapaul

Thanks for working on this issue.

The refactoring of the checkpoint id is very nice 👍

Regarding the actual fix I am not fully convinced yet. Even if persisting the information on EOI works it feels slightly off.

WDYT about not sending the committable summary on EOI if the summary is empty? It seems like the much easier approach.

fapaul · 2024-09-07T12:44:10Z

...time/src/main/java/org/apache/flink/streaming/runtime/operators/sink/SinkWriterOperator.java

+
+        // use union state to ensure that rescaling works correctly
+        this.endOfInputState =
+                context.getOperatorStateStore().getUnionListState(END_OF_INPUT_STATE_DESC);


I need help understanding the use of the state here.

I would expect that after receiving EOI we can also not persist anything anymore.

We can and need to. Think about the committables inside the CommitterOperator. We absolutely need to track them. EOI just means processElement isn't called again and you can't call collector afterwards.

rkhachatryan

Thanks for the fix @AHeise

I've left some comments, PTAL.

Besides that, I think CompactorOperator.java needs to be adjusted:

flink/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/sink/compactor/operator/CompactorOperator.java

Line 149 in 277706d

emitCompacted(null);

- this code path still emits null
flink/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/sink/compactor/operator/CompactorOperator.java

Line 252 in 277706d

checkpointId,

possible NPE (or remove @Nullable)
flink/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/sink/compactor/operator/CompactorOperator.java

Line 261 in 277706d

checkpointId,

possible NPE (or remove @Nullable)

Also, could you clarify what is the exact path when a duplicate summary for the same checkpoint was emitted? IIUC, the 2nd code path is the "normal" endInput by SinkWriterOperator for the bounded input. But what is the 1st path?

AHeise · 2024-09-09T08:13:29Z

Besides that, I think CompactorOperator.java needs to be adjusted:
Thanks for the pointers. I'll address it later today.

Also, could you clarify what is the exact path when a duplicate summary for the same checkpoint was emitted? IIUC, the 2nd code path is the "normal" endInput by SinkWriterOperator for the bounded input. But what is the 1st path?

Let's first recap why we need to emit in EOI at all: For streaming jobs, emitting committables on barrier is the correct behavior. For batch jobs, emitting on EOI is correct.

However, there is an additional case around streaming with bounded input and final checkpoints. The operator first receives an EOI and then the final barrier. After EOI, the operator is not allowed to emit another record. So just for this case, streaming also needs to emit on EOI and suppress on barrier.

Now the question is what happens on failure after the final checkpoint. Logically speaking, the EOI before the checkpoint should influence the operator state in the checkpoint in such a way that after recovery we are still not allowed to emit records. For sinks that means that all committables have already been transferred to the committer operator for the snapshot. For technical reasons, we still receive a second EOI after recovery. Logically, the first EOI should have sufficed. I'm assuming Flink does the second EOI mostly for channel management but maybe the proper fix is actually not calling EOI twice. @pnowojski could you PTAL?
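
The cases described above can be condensed into a toy model. This is a minimal sketch, not the SinkWriterOperator code: `ToyWriter`, its `restoredEndOfInput` constructor flag, and the string "summaries" are hypothetical names used only for illustration. It assumes the operator emits on the barrier while input is still open, emits once on EOI, and treats an EOI that arrives after recovery from a final checkpoint as a no-op.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of EOI handling; all names here are hypothetical, not Flink API. */
final class ToyWriter {
    private final List<String> emitted = new ArrayList<>();
    private boolean endOfInput;

    /** @param restoredEndOfInput true if the restored state says EOI was already handled. */
    ToyWriter(boolean restoredEndOfInput) {
        this.endOfInput = restoredEndOfInput;
    }

    /** Streaming path: emit on the barrier, unless EOI already flushed everything. */
    void prepareSnapshotPreBarrier(long checkpointId) {
        if (!endOfInput) {
            emitted.add("summary@" + checkpointId);
        }
    }

    /** Batch and final-checkpoint path: emit exactly once, ignore duplicates. */
    void endInput() {
        if (endOfInput) {
            return; // duplicate EOI, e.g. after recovering from the final checkpoint
        }
        endOfInput = true;
        emitted.add("summary@EOI");
    }

    /** The flag that would be written to state during a snapshot. */
    boolean snapshotEndOfInput() {
        return endOfInput;
    }

    List<String> emitted() {
        return emitted;
    }
}
```

In this model, a writer that has seen EOI stays silent on the final barrier, and a writer restored with the flag set stays silent on the second EOI.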

Roman, you raise good questions concerning rescaling and more. The current implementation assumes that EOI will be received by all subtasks at the same logical time (e.g. on final checkpoint). Can we have instances where some subtasks shut down earlier?

AHeise · 2024-09-12T09:04:16Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

-        OptionalLong checkpointId = element.getValue().getCheckpointId();
-        if (checkpointId.isPresent() && checkpointId.getAsLong() <= lastCompletedCheckpointId) {
+        long checkpointId = element.getValue().getCheckpointIdOrEOI();
+        if (checkpointId <= lastCompletedCheckpointId) {


This is kind of a change in semantics at first glance. However, we should not receive any elements after EOI, so this code path is actually never triggered and now it's simpler.

I do not fully understand the comment.

Why is it a change in semantics? The condition looks the same; the only case that is removed is a committable without a checkpoint. Which scenario was this before the change?

I thought that for the final checkpoint, we would first receive EOI and then do a final checkpoint. This would mean the committer receives data after EOI.

Yes, we mean the same thing. For EOI committables, this check will always yield false. Even without the special case on EOI.
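
The two shapes of the check can be compared side by side. This is a sketch under assumptions: `CheckpointIdCheck`, `isStaleOld`, and `isStaleNew` are hypothetical names, and `EOI` is assumed to be `Long.MAX_VALUE` as described elsewhere in this PR. It shows that both versions agree for regular checkpoint ids and that an EOI committable is not treated as stale as long as `lastCompletedCheckpointId` is a regular id below `Long.MAX_VALUE`.

```java
/** Sketch of the "already committed" check; names are hypothetical, not Flink API. */
final class CheckpointIdCheck {
    /** Assumed sentinel for end of input, following the EOI=MAX convention in this PR. */
    static final long EOI = Long.MAX_VALUE;

    /** Old shape: a nullable id where null stood for EOI. */
    static boolean isStaleOld(Long checkpointId, long lastCompletedCheckpointId) {
        return checkpointId != null && checkpointId <= lastCompletedCheckpointId;
    }

    /** New shape: a primitive id where EOI is the largest possible value. */
    static boolean isStaleNew(long checkpointIdOrEoi, long lastCompletedCheckpointId) {
        return checkpointIdOrEoi <= lastCompletedCheckpointId;
    }
}
```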

AHeise · 2024-09-12T09:17:57Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

-                boolean fullyReceived =
-                        !endInput && manager.getCheckpointId() == lastCompletedCheckpointId;
-                commitAndEmit(manager, fullyReceived);
+                    committableCollector.getCheckpointCommittablesUpTo(completedCheckpointId)) {


@fapaul please double-check why we did it in this complicated manner originally.

At first glance this change doesn't look correct.

By removing fullyReceived, can we now commit committables that are from the "current" checkpoint upon receiving a delayed notifyCheckpointComplete from a previous checkpoint?

Just in case you haven't seen it: fullyReceived is now implicitly always true.

As discussed offline, fullyReceived should always be true because we always want to have complete batches. For earlier checkpoints, partially committed batches are still considered fullyReceived as long as all committables arrived at some point (fullyReceived === (#pending + #completed + #failed == #expected))
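
The counting rule in the last sentence can be written out as a small sketch. `SubtaskProgress` and its method names are hypothetical and only illustrate the formula; they are not the counters used in the actual committable managers.

```java
/** Sketch of the completeness rule quoted above; names are hypothetical, not Flink API. */
final class SubtaskProgress {
    private final int expected;
    private int pending;
    private int completed;
    private int failed;

    /** @param expected number of committables announced for this subtask. */
    SubtaskProgress(int expected) {
        this.expected = expected;
    }

    /** A committable arrived and waits to be committed. */
    void onReceived() {
        pending++;
    }

    /** A pending committable was committed successfully. */
    void onCommitted() {
        requirePending();
        pending--;
        completed++;
    }

    /** A pending committable failed permanently. */
    void onFailed() {
        requirePending();
        pending--;
        failed++;
    }

    /** True once every expected committable has arrived, whatever its commit status. */
    boolean fullyReceived() {
        return pending + completed + failed == expected;
    }

    private void requirePending() {
        if (pending == 0) {
            throw new IllegalStateException("no pending committable");
        }
    }
}
```

Committing or failing a committable moves it between counters without changing the sum, so a partially committed batch still counts as fully received once all committables have arrived.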

AHeise · 2024-09-12T09:27:55Z

I addressed your comments and took a different approach to state management (it looks similar but is conceptually rather different).

I restructured the PR and added 2 more commits to it. Unfortunately, the fixups then looked rather messy, so I decided to force push everything again. So I'm sorry, but you more or less have to review again ;).

Btw I figured that the EOI changes were a breaking change of the experimental CommittableMessage and made them non-breaking instead. So the commit is essentially the same but looks a bit different.

fapaul

Overall looks good but I left a few comments which we should answer.

fapaul · 2024-09-13T08:51:40Z

...src/main/java/org/apache/flink/connector/file/sink/compactor/operator/CompactorOperator.java

        assert checkpointRequests.isEmpty();

        getAllTasksFuture().join();
-        emitCompacted(null);


Is it safe to change this?

Afaik yes. We pass the param directly to both CommittableMessages. The serializer replaces null with EOI. So it effectively results in the same bytes.

fapaul · 2024-09-13T08:59:59Z

...c/main/java/org/apache/flink/streaming/api/connector/sink2/CommittableMessageSerializer.java

@@ -91,13 +90,13 @@ public CommittableMessage<CommT> deserialize(int version, byte[] serialized)
                return new CommittableWithLineage<>(
                        SimpleVersionedSerialization.readVersionAndDeSerialize(
                                committableSerializer, in),
-                        readCheckpointId(in),
+                        in.readLong(),


Do we need to consider migration cases from SinkV1 where afaik the checkpointId is always null?

Thanks for challenging that. It's one of the parts where most uncertainty still resides. But let's look again at what is happening and has happened:

On write: we replaced null with EOI, now we get EOI and write it; so it should be the same byte sequence.

On read: we have always used readLong which is not null-aware. We have replaced EOI with null.

So for serialization nothing has changed. However, around serialization we replace null with EOI in all instances of the Message.

That should be safe to change without migration, no?

For compatibility, I left getCheckpointId the same, so it will return an OptionalLong.empty() on EOI where it previously returned it on null. I have not found any usage of the method outside of Flink anyhow.
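
The byte-level argument can be checked with a small round trip. This is a sketch under assumptions, not the CommittableMessageSerializer code: `CheckpointIdCodec` and its methods are hypothetical, `EOI` is assumed to be `Long.MAX_VALUE`, and a big-endian `ByteBuffer` stands in for the serializer's output and input views.

```java
import java.nio.ByteBuffer;
import java.util.OptionalLong;

/** Sketch of the checkpoint id encoding; names are hypothetical, not Flink API. */
final class CheckpointIdCodec {
    /** Assumed sentinel for end of input. */
    static final long EOI = Long.MAX_VALUE;

    /** Old write path: null is mapped to EOI inside the serializer. */
    static byte[] writeNullable(Long checkpointId) {
        return writePrimitive(checkpointId == null ? EOI : checkpointId);
    }

    /** New write path: the caller already passes EOI, so the value is written as-is. */
    static byte[] writePrimitive(long checkpointIdOrEoi) {
        return ByteBuffer.allocate(Long.BYTES).putLong(checkpointIdOrEoi).array();
    }

    /** Read path: a plain long read that has no notion of null. */
    static long read(byte[] serialized) {
        return ByteBuffer.wrap(serialized).getLong();
    }

    /** Compatibility view: EOI is reported as "no checkpoint id". */
    static OptionalLong asOptional(long checkpointIdOrEoi) {
        return checkpointIdOrEoi == EOI
                ? OptionalLong.empty()
                : OptionalLong.of(checkpointIdOrEoi);
    }
}
```

Under these assumptions, a message written with a null id before the change and one written with EOI after the change produce the same bytes, which is why no state migration should be needed.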

fapaul · 2024-09-13T09:06:20Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

-        OptionalLong checkpointId = element.getValue().getCheckpointId();
-        if (checkpointId.isPresent() && checkpointId.getAsLong() <= lastCompletedCheckpointId) {
+        long checkpointId = element.getValue().getCheckpointIdOrEOI();
+        if (checkpointId <= lastCompletedCheckpointId) {


I do not fully understand the comment.

Why is it a change in semantics? The condition looks the same; the only case that is removed is a committable without a checkpoint. Which scenario was this before the change?

I thought that for the final checkpoint, we would first receive EOI and then do a final checkpoint. This would mean the committer receives data after EOI.

fapaul · 2024-09-13T09:12:26Z

...he/flink/streaming/runtime/operators/sink/committables/CheckpointCommittableManagerImpl.java

@@ -147,15 +146,16 @@ Collection<CommittableWithLineage<CommT>> drainFinished() {
    }

    CheckpointCommittableManagerImpl<CommT> merge(CheckpointCommittableManagerImpl<CommT> other) {
-        checkArgument(Objects.equals(other.checkpointId, checkpointId));
+        checkArgument(other.checkpointId == checkpointId);


Nit: This change should probably go into the commit, changing the type of the checkpoint fields

Good catch. I'll try to move it.

fapaul · 2024-09-13T09:25:01Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

-                boolean fullyReceived =
-                        !endInput && manager.getCheckpointId() == lastCompletedCheckpointId;
-                commitAndEmit(manager, fullyReceived);
+                    committableCollector.getCheckpointCommittablesUpTo(completedCheckpointId)) {


At first glance this change doesn't look correct.

By removing fullyReceived, can we now commit committables that are from the "current" checkpoint upon receiving a delayed notifyCheckpointComplete from a previous checkpoint?

fapaul · 2024-09-13T09:35:31Z

...time/src/main/java/org/apache/flink/streaming/runtime/operators/sink/SinkWriterOperator.java

+        this.endOfInput = !previousState.isEmpty() && !previousState.contains(false);
+        sinkWriter =
+                this.endOfInput
+                        ? new ClosedWriter<>()


As discussed offline, this probably leaves an unclean state from the SinkWriter when it previously crashed.

Fixed in a fixup commit. PTAL.
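
The restore condition in the hunk above can be isolated to see how it behaves for different restored states. This is a sketch: `EndOfInputRestore` is a hypothetical wrapper around the same boolean expression, and the interpretation of the list (one flag per contributing subtask) is an assumption.

```java
import java.util.List;

/** Sketch of the restore predicate from the hunk above; the class name is hypothetical. */
final class EndOfInputRestore {
    /**
     * EOI counts as already handled only if some state was restored
     * and every restored flag says so.
     */
    static boolean restoredEndOfInput(List<Boolean> previousState) {
        return !previousState.isEmpty() && !previousState.contains(false);
    }
}
```

An empty list (fresh start) and a mixed list (some subtask had not yet seen EOI) both yield false, so the writer stays open in those cases.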

fapaul · 2024-09-13T09:35:57Z

...time/src/main/java/org/apache/flink/streaming/runtime/operators/sink/SinkWriterOperator.java

@@ -178,7 +214,7 @@ public void processElement(StreamRecord<InputT> element) throws Exception {
    @Override
    public void prepareSnapshotPreBarrier(long checkpointId) throws Exception {
        super.prepareSnapshotPreBarrier(checkpointId);
-        if (!endOfInput) {
+        if (!this.endOfInput) {


Nit: Why use this. here when we do not use it inside the other methods?

Yes, I'll remove.

rkhachatryan

LGTM, the fix should work and I don't see any issues with it.

AHeise · 2024-09-13T13:49:18Z

I added the fixup commits inline. PTAL.

fapaul

Thanks for walking me through the changes offline and discussing the last open points 👍

In some parts of the sink, EOI is treated as checkpointId=null and in some as checkpointId=MAX. The code of CheckpointCommittableManagerImpl implies that null is valid; however, the serializer actually breaks in that case. In practice, checkpointId=MAX is used all the time by accident. This commit replaces the nullable checkpointIds with a primitive long EOI=MAX, so that we always use the special value instead of null. The serializer already used that value, so it actually simplifies many places and doesn't break any existing state.

Remove the side-effect and create a new (rather cheap) instance of the managers.
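
A side-effect-free merge could look roughly like the following sketch. `ToyManager` is a hypothetical stand-in that holds strings instead of committables; it is not the CheckpointCommittableManagerImpl code and only illustrates returning a new instance rather than mutating the receiver.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of a functional merge; names are hypothetical, not Flink API. */
final class ToyManager {
    private final long checkpointId;
    private final List<String> committables;

    ToyManager(long checkpointId, List<String> committables) {
        this.checkpointId = checkpointId;
        // Defensive copy so that later changes to the argument do not leak in.
        this.committables = Collections.unmodifiableList(new ArrayList<>(committables));
    }

    /** Returns a new manager holding both sets; neither input is modified. */
    ToyManager merge(ToyManager other) {
        if (other.checkpointId != checkpointId) {
            throw new IllegalArgumentException("checkpoint ids differ");
        }
        List<String> merged = new ArrayList<>(committables);
        merged.addAll(other.committables);
        return new ToyManager(checkpointId, merged);
    }

    List<String> committables() {
        return committables;
    }
}
```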

Use the proper ObjectAssert as the base for CommittableSummaryAssert and CommittableWithLinageAssert.

The committer is supposed to commit all committables at once for a given subtask (so that it can potentially optimize committables on the fly). With unaligned checkpoints (UCs), we could potentially see notifyCheckpointCompleted before receiving all committables. The CommittableSummary was built and is used to detect that. So far, we enforced completeness only for the most current committables belonging to the respective checkpoint being completed. However, we should also enforce it for all subsumed committables. In fact, we probably implicitly do it, but we have the extra code path which allows subsumed committables to be incomplete. This commit simplifies the code a bit by always enforcing completeness.

The stateful SinkWriterOperatorTestBase test cases used EOI to manipulate the state which was never clean. In particular, it also stored the input elements in state until EOI arrived and emitted them all at once. For state restoration tests, we emitted records after EOI arrived. This commit changed the writer state completely to just capture the record count, which is much more realistic than storing actual payload. The tests now directly assert on the state instead of output. This commit also introduces an adaptor for serializing basic types in the writer state and replaces the hard-to-maintain SinkAndSuppliers with an InspectableSink in the sink writer tests that require an abstraction on top of the different Sink flavors.

In case of a failure after the final checkpoint, EOI is called twice. SinkWriter should ignore the second call to avoid emitting more dummy committables (transactional objects containing no data, since no data can arrive when recovering from the final checkpoint). The commit uses a boolean list state to remember if EOI has been emitted. The cases are discussed in code. Since rescaling may still result in these dummy committables, the committer needs to merge them into the CommittableCollector, as these committables still need to be committed because systems like Kafka don't provide transactional isolation.

AbstractStreamingWriter sends partition info twice on EOI. This commit ensures that we are not resending partition information even after restarting from a final checkpoint.

fapaul

Reviewed commit for Fix AbstractStreamingWriter sending after EOI

fapaul · 2024-09-17T11:22:19Z

...iles/src/main/java/org/apache/flink/connector/file/table/stream/AbstractStreamingWriter.java

+        //     the writer and potentially emit duplicate summaries if we indeed recovered from a
+        //     final checkpoint.
+        endOfInputState = context.getOperatorStateStore().getListState(END_OF_INPUT_STATE_DESC);
+        List<Boolean> previousState = Lists.newArrayList(endOfInputState.get());


Nit: Can we use List.of here and avoid using the shaded guava dependency?

Sure if you drop Java8 support first ;).

fapaul · 2024-09-17T12:00:11Z

...iles/src/main/java/org/apache/flink/connector/file/table/stream/AbstractStreamingWriter.java

+            buckets.onProcessingTime(Long.MAX_VALUE);
+            helper.snapshotState(Long.MAX_VALUE);
+            output.emitWatermark(new Watermark(Long.MAX_VALUE));
+            commitUpToCheckpoint(Long.MAX_VALUE);


Can we use the EOI variable instead of Long.MAX_VALUE?

I thought about that but decided against it. This is not related to CommittableSummary directly, so it would feel weird to import it just for the EOI. And someone wise said that this class is deprecated anyways, so I didn't want to change too much.

AHeise · 2024-09-17T17:06:04Z

@flinkbot run azure

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from 61b53e2 to 0587c70 Compare September 5, 2024 14:30

flinkbot added component=API/DataStream component=Connectors/Common labels Sep 5, 2024

fapaul reviewed Sep 7, 2024

rkhachatryan reviewed Sep 8, 2024

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from 0587c70 to 3965a8f Compare September 12, 2024 08:55

AHeise commented Sep 12, 2024

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from 3965a8f to a5934b8 Compare September 12, 2024 09:15

AHeise commented Sep 12, 2024

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from a5934b8 to c3827c1 Compare September 12, 2024 09:22

fapaul reviewed Sep 13, 2024

rkhachatryan approved these changes Sep 13, 2024

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from c3827c1 to 6470f83 Compare September 13, 2024 13:48

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch 2 times, most recently from a8d3ff9 to aaac5b7 Compare September 13, 2024 15:35

fapaul approved these changes Sep 13, 2024

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch 4 times, most recently from 47edbfb to 2633be2 Compare September 17, 2024 09:07

AHeise added 5 commits September 17, 2024 13:02

[FLINK-25920] Turn CommittableManager#merge functional

01b31cc

Remove the side-effect and create a new (rather cheap) instance of the managers.

[FLINK-25920] Improve sink test assertions

b648058

Use the proper ObjectAssert as the base for CommittableSummaryAssert and CommittableWithLinageAssert.

AHeise added 3 commits September 17, 2024 13:04

[FLINK-25920] Improve logging in committable handling of the sink

840e9a8

[FLINK-25920] Fix AbstractStreamingWriter sending after EOI

6ebfe41

AbstractStreamingWriter sends partition info twice on EOI. This commit ensures that we are not resending partition information even after restarting from a final checkpoint.

AHeise force-pushed the flink-25920-fix-eoi-in-sink branch from 2633be2 to 6ebfe41 Compare September 17, 2024 11:10

fapaul approved these changes Sep 17, 2024

AHeise merged commit 6d60f41 into apache:master Sep 17, 2024

This was referenced Nov 7, 2024

[FLINK-25920] Ignore duplicate EOI in SinkWriter [1.20] #25619

Merged

[FLINK-25920] Ignore duplicate EOI in SinkWriter [1.19] #25627

Merged
