[FLINK-36455] Sinks retry synchronously #25547

AHeise · 2024-10-18T13:28:39Z

What is the purpose of the change

Sinks so far retried asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns.

Brief change log

This commit retries a fixed number of times and then fails in notifyCheckpointCompleted.
Simplifies parts of committable handling now that all committables of a subtask either succeed or fail
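
For illustration, the retry behavior described in the change log can be sketched as a blocking loop. This is a minimal sketch and not the code in this PR: the class `SyncRetrySketch`, the method `commitAll`, and the use of a `Predicate` as the commit attempt are hypothetical names chosen for the example. The loop only returns once every item is committed, and it throws once the retry budget is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names are Flink API. */
final class SyncRetrySketch {

    /**
     * Tries to commit every pending item and blocks until all are committed.
     *
     * @param commitAttempt returns true if the item was committed, false if it should be retried
     * @return the number of passes that were needed (1 means no retry)
     * @throws IllegalStateException if items remain uncommitted after maxRetries retries
     */
    static <T> int commitAll(List<T> pending, Predicate<T> commitAttempt, int maxRetries) {
        List<T> remaining = new ArrayList<>(pending);
        int attempts = 0;
        while (true) {
            attempts++;
            // Drop the items whose commit attempt succeeded; the rest are retried.
            remaining.removeIf(commitAttempt);
            if (remaining.isEmpty()) {
                return attempts;
            }
            // The first pass is not a retry, so allow maxRetries additional passes.
            if (attempts > maxRetries) {
                throw new IllegalStateException(
                        remaining.size() + " committable(s) still failing after "
                                + maxRetries + " retries");
            }
        }
    }
}
```

Because the caller only regains control after success or an exception, nothing is left to commit in the background once the notification has returned.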

Verifying this change

Already covered by many tests
Adjusted and changed tests in
api/connector/sink2
runtime/operators/sink/committables

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2024-10-18T13:35:23Z

CI report:

64a5257 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

fapaul

Thanks for making this change. I left a few comments, but I have two main concerns.

I wonder whether we should leave the failed committable tracker
The PR misses changes to the committer interface e.g. CommitRequest#signalFailedWithKnownReason

fapaul · 2024-10-28T08:38:11Z

...runtime/src/main/java/org/apache/flink/streaming/api/connector/sink2/CommittableSummary.java

    /** The number of committables that have not been successfully committed. */
    private final int numberOfPendingCommittables;
+
+    @Deprecated
    /** The number of committables that are not retried and have been failed. */
    private final int numberOfFailedCommittables;


Why do we deprecate failed committables? I still see some value to provide the possibility to discard unrecoverable committables. The change also looks unrelated to the retry mechanism.

Good point. I probably should motivate that change in the commit message first. The short answer is: it doesn't work. Let's go to the long one.

So far, we increase this number on known issues and don't emit the respective committable. That means that the global committer would need to wait for all committables except the failed ones and then commit. However, it always waited for all known committables, so it ended up in an infinite loop.

The change of this PR only creates the summary of successful commits. Unknown errors cause a restart loop and known errors cause the committables to be dropped from the statistics. So the global committer waits for all committables of the summary and works now.

The alternative would be to still update the statistics as you propose and ignore the failed ones in the global committer. However, I wonder what the value of that is. If you'd like to keep it just to have fewer disruptions, I can revert the change and fix it.

However, the custom topologies that I have seen so far, also run into the same issue as the global committer (and now I think that the compactor has the same issues). I think that emitting the stats on the ignored committables ultimately just increases complexity without giving downstream operators any good handle whatsoever.

WDYT?
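
To make the accounting problem in this thread concrete, here is a minimal sketch of the two completeness checks being discussed. It is a simplified model under stated assumptions: `SummaryAccountingSketch` and both methods are hypothetical, and the real summary and global committer track more state than three integers.

```java
/** Hypothetical model of the per-subtask accounting; not Flink's CommittableSummary. */
final class SummaryAccountingSketch {

    /**
     * Behavior described as broken: the downstream operator waits for every
     * announced committable, although failed ones are never emitted.
     */
    static boolean waitsForAll(int total, int failed, int received) {
        // received can reach at most total - failed, so this stays false when failed > 0.
        return received == total;
    }

    /**
     * Behavior described for this PR: only successful commits are announced,
     * so the downstream operator waits for exactly those.
     */
    static boolean waitsForSuccessfulOnly(int total, int failed, int received) {
        int announced = total - failed;
        return received == announced;
    }
}
```

With 5 committables, 1 known failure, and 4 received, the first check never becomes true while the second one does.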

fapaul · 2024-10-28T08:40:07Z

...me/src/main/java/org/apache/flink/streaming/api/connector/sink2/GlobalCommitterOperator.java

@@ -95,6 +94,7 @@
 public class GlobalCommitterOperator<CommT, GlobalCommT> extends AbstractStreamOperator<Void>
        implements OneInputStreamOperator<CommittableMessage<CommT>, Void> {

+    private static final int MAX_RETRIES = 10;


IMO we should make this a Flink config specific to the sink and allow users to opt-out with the failed committable mechanism.

Currently the number is duplicated in committer and global committer.

fapaul · 2024-10-28T08:42:32Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

+        int subtaskId = getRuntimeContext().getTaskInfo().getIndexOfThisSubtask();
+        int numberOfSubtasks = getRuntimeContext().getTaskInfo().getNumberOfParallelSubtasks();


Nit: I know this wasn't introduced by this PR but why do we need to fetch the subtask id and number of tasks on every emit call?

What's the concern here? Performance? The JVM should inline the call so that it pretty much results in a field access.

Mostly it just looked strange.

AHeise · 2024-10-29T13:47:18Z

Reverted the deprecation of numFailed and added a config option for the retries. PTAL @fapaul

fapaul

LGTM now. Can you also add a small note to the release notes that we introduce a new config option?

fapaul · 2024-10-30T11:35:29Z

flink-core/src/main/java/org/apache/flink/configuration/SinkOptions.java

+                    .intType()
+                    .defaultValue(10)
+                    .withDescription(
+                            "The number of retries on a committable (e.g., transaction) before Flink application fails and potentially restarts.");


Nit: Should we mention here that it applies to CommitRequest#signalFailedWithUnknownReason and not to generic exceptions that are thrown while calling commit on the committer?

A good point. I fear that giving too much details is confusing (which end user knows CommitRequest?).

I'd rephrase to
The number of retries a Flink application attempts for committable operations (such as transactions) on retriable errors, as specified by the sink connector, before Flink fails and potentially restarts.

Sounds much better 👍
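
The distinction raised in this thread (the retry budget covers errors the connector reports as retriable, not exceptions thrown out of commit) can be sketched as follows. This is an assumption-laden illustration: `RetriableOutcomeSketch`, the `Outcome` enum, and `commitOne` are hypothetical and do not mirror the `CommitRequest` API.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of outcome handling; the enum and method are not Flink API. */
final class RetriableOutcomeSketch {

    enum Outcome { COMMITTED, RETRIABLE_FAILURE, KNOWN_FAILURE }

    /**
     * Returns true if the committable was committed and false if it was dropped
     * because of a known failure. Exceptions thrown by the attempt itself are not
     * caught here, so they propagate immediately and bypass the retry budget.
     */
    static boolean commitOne(Supplier<Outcome> attempt, int maxRetries) {
        // One initial try plus up to maxRetries retries.
        for (int tries = 0; tries <= maxRetries; tries++) {
            switch (attempt.get()) {
                case COMMITTED:
                    return true;
                case KNOWN_FAILURE:
                    return false; // dropped without retrying
                case RETRIABLE_FAILURE:
                    break; // leave the switch and try again while budget remains
            }
        }
        throw new IllegalStateException("Still failing after " + maxRetries + " retries");
    }
}
```

In this model, `maxRetries` plays the role of the new config option, with the default of 10 shown in the diff above.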

fapaul · 2024-10-30T11:37:20Z

...apache/flink/streaming/runtime/operators/sink/committables/CheckpointCommittableManager.java

+    Collection<CommT> getSuccessfulCommittables();
+
+    int getNumFailed();


Please add doc strings to the methods.

Sinks so far retried asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns. This commit retries a fixed number of times and then fails in notifyCheckpointCompleted. Note that sync retries significantly simplify the committable handling. This commit starts a few simplifications; the next commit clears up more.

We can only set the gauge once.

Without the async parts of the committable summary, the number of pending committables will always be 0. The number of failed committables will also be 0: unexpected failures throw an error, and expected ones are silently ignored. The previous behavior, where these counts could be >0, actually led to infinite loops in the global committer.

flinkbot added the component=API/Core label Oct 18, 2024

AHeise mentioned this pull request Oct 23, 2024

[FLINK-24530][datastream] GlobalCommitter might not commit all records on drain #17536

Closed

AHeise assigned fapaul Oct 23, 2024

AHeise force-pushed the FLINK-36455-sync-retries branch 4 times, most recently from 1ec3860 to 4aa6d6d Compare October 24, 2024 11:51

fapaul reviewed Oct 28, 2024

AHeise force-pushed the FLINK-36455-sync-retries branch from 4aa6d6d to f8fd060 Compare October 29, 2024 13:44

fapaul approved these changes Oct 30, 2024

AHeise force-pushed the FLINK-36455-sync-retries branch from f8fd060 to 22899ad Compare November 4, 2024 09:29

AHeise added 3 commits November 4, 2024 14:37

[FLINK-36455] Fix PendingCommittable metric in sink

b1c4557

We can only set the gauge once.

AHeise force-pushed the FLINK-36455-sync-retries branch from 22899ad to 64a5257 Compare November 4, 2024 13:37

AHeise merged commit 13fe0e6 into apache:master Nov 6, 2024

AHeise deleted the FLINK-36455-sync-retries branch November 6, 2024 08:55

AHeise mentioned this pull request Nov 7, 2024

[FLINK-25920] Ignore duplicate EOI in SinkWriter [1.20] #25619

Merged

AHeise mentioned this pull request Nov 15, 2024

[FLINK-36455] Sinks retry synchronously [1.20] #25661

Merged
