[FLINK-36455] Sinks retry synchronously #25547

Merged
merged 3 commits into from
Nov 6, 2024

@AHeise AHeise commented Oct 18, 2024

What is the purpose of the change

Sinks so far retried asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns.

Brief change log

  • This commit retries a fixed number of times and then fails in notifyCheckpointCompleted.
  • Simplifies parts of the committable handling now that all committables of a subtask either succeed or fail.
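
The synchronous retry behavior described in the change log can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Flink implementation: the class `SyncRetrySketch`, the `Outcome` enum, the `Attempt` interface, and the per-committable retry loop are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Flink code): commit synchronously, retry a fixed number of times, then fail. */
final class SyncRetrySketch {

    /** Hypothetical outcome of a single commit attempt. */
    public enum Outcome { SUCCESS, RETRY_LATER, FAILED_KNOWN, FAILED_UNKNOWN }

    /** Hypothetical stand-in for a sink committer working on one committable. */
    public interface Attempt<T> {
        Outcome commit(T committable);
    }

    /**
     * Commits every committable before returning. Committables that fail for a
     * known reason are dropped; exhausted retries or an unknown failure throw,
     * so the caller (for example a checkpoint-complete notification) fails.
     */
    public static <T> List<T> commitAll(List<T> committables, Attempt<T> attempt, int maxRetries) {
        List<T> succeeded = new ArrayList<>();
        for (T committable : committables) {
            Outcome outcome = attempt.commit(committable);
            int retries = 0;
            // Retry in the calling thread; nothing is deferred to a later point in time.
            while (outcome == Outcome.RETRY_LATER && retries < maxRetries) {
                retries++;
                outcome = attempt.commit(committable);
            }
            switch (outcome) {
                case SUCCESS:
                    succeeded.add(committable);
                    break;
                case FAILED_KNOWN:
                    // Known, unrecoverable failure: drop the committable.
                    break;
                default:
                    // Retries exhausted or unknown failure: fail the whole call.
                    throw new IllegalStateException(
                            "Failed to commit " + committable + " after " + retries + " retries");
            }
        }
        return succeeded;
    }
}
```

The point of the sketch is that `commitAll` either returns with every committable resolved or throws; no commit work is left pending after the call returns.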

Verifying this change

  • Already covered by many tests
  • Adjusted and changed tests in
    api/connector/sink2
    runtime/operators/sink/committables

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@fapaul fapaul left a comment

Thanks for making this change. I left a few comments, but I have two main concerns.

  • I wonder whether we should leave the failed committable tracker in place.
  • The PR misses changes to the committer interface, e.g., CommitRequest#signalFailedWithKnownReason.

/** The number of committables that have not been successfully committed. */
private final int numberOfPendingCommittables;

@Deprecated
/** The number of committables that are not retried and have been failed. */
private final int numberOfFailedCommittables;

Contributor

Why do we deprecate failed committables? I still see some value to provide the possibility to discard unrecoverable committables. The change also looks unrelated to the retry mechanism.

Contributor Author

Good point. I probably should motivate that change in the commit message first. The short answer is: it doesn't work. Let's go to the long one.

So far, we have incremented this number on known issues and not emitted the respective committable. That means the global committer would need to wait for all committables except the failed ones and then commit. However, it has always waited for all known committables, which left it running in an infinite loop.

The change of this PR only creates the summary of successful commits. Unknown errors cause a restart loop and known errors cause the committables to be dropped from the statistics. So the global committer waits for all committables of the summary and works now.

The alternative would be to still update the statistics as you propose and ignore the failed committables in the global committer. However, I wonder what the value of that is. If you'd like to keep it just to have fewer disruptions, I can revert the change and fix it.

However, the custom topologies that I have seen so far also run into the same issue as the global committer (and now I think that the compactor has the same issue). I think that emitting the stats on the ignored committables ultimately just increases complexity without giving downstream operators any good handle whatsoever.

WDYT?
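
The waiting problem described in this comment can be illustrated with a small sketch. It is an assumption-laden simplification, not the actual GlobalCommitterOperator logic: `SummaryWaitSketch` and `readyToCommit` are hypothetical names, and the real operator tracks considerably more state.

```java
/** Hypothetical sketch (not Flink code) of why counting failed committables blocks a downstream waiter. */
final class SummaryWaitSketch {

    /**
     * Hypothetical wait condition of a downstream operator: it may proceed once it
     * has received as many committables as the summary announced.
     */
    public static boolean readyToCommit(int announcedInSummary, int received) {
        return received >= announcedInSummary;
    }

    public static void main(String[] args) {
        int attempted = 5;
        int failedWithKnownReason = 2;
        // Failed committables are never emitted downstream.
        int emitted = attempted - failedWithKnownReason;

        // Summary announces every attempted committable: the waiter never becomes ready.
        System.out.println(readyToCommit(attempted, emitted)); // prints "false"

        // Summary announces only the successful commits: the waiter becomes ready.
        System.out.println(readyToCommit(emitted, emitted)); // prints "true"
    }
}
```

Under these assumptions, a summary that covers only successful commits lets the downstream condition be satisfied, which matches the behavior the comment describes for this PR.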

@@ -95,6 +94,7 @@
public class GlobalCommitterOperator<CommT, GlobalCommT> extends AbstractStreamOperator<Void>
implements OneInputStreamOperator<CommittableMessage<CommT>, Void> {

private static final int MAX_RETRIES = 10;

Contributor

IMO we should make this a Flink config specific to the sink and allow users to opt-out with the failed committable mechanism.

Currently the number is duplicated in committer and global committer.

Comment on lines +186 to +189
int subtaskId = getRuntimeContext().getTaskInfo().getIndexOfThisSubtask();
int numberOfSubtasks = getRuntimeContext().getTaskInfo().getNumberOfParallelSubtasks();

Contributor

Nit: I know this wasn't introduced by this PR but why do we need to fetch the subtask id and number of tasks on every emit call?

Contributor Author

What's the concern here? Performance? The JVM should inline the call so that it pretty much results in a field access.

Contributor

Mostly it just looked strange.

@AHeise AHeise force-pushed the FLINK-36455-sync-retries branch from 4aa6d6d to f8fd060 Compare October 29, 2024 13:44
AHeise commented Oct 29, 2024

Reverted the deprecation of numFailed and added a config option for the retries. PTAL @fapaul

@fapaul fapaul left a comment

LGTM now. Can you also add a small note to the release notes that we introduce a new config?

.intType()
.defaultValue(10)
.withDescription(
"The number of retries on a committable (e.g., transaction) before Flink application fails and potentially restarts.");

Contributor

Nit: Should we mention here that it applies to CommitRequest#signalFailedWithUnknownReason and not to generic exceptions thrown while calling commit on the committer?

Contributor Author

A good point. I fear that giving too many details is confusing (which end user knows CommitRequest?).

I'd rephrase to:
The number of retries a Flink application attempts for committable operations (such as transactions) on retriable errors, as specified by the sink connector, before Flink fails and potentially restarts.

Contributor

Sounds much better 👍

Comment on lines 69 to 88
Collection<CommT> getSuccessfulCommittables();

int getNumFailed();

Contributor

Please add doc strings to the methods.
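
As an illustration of what such doc strings might look like, here is a hedged sketch. Only the two method signatures come from the diff above; the interface name `DocumentedCommitResult` is hypothetical, and the wording of the comments is an assumption based on the field descriptions quoted earlier in this review, not the text that was eventually merged.

```java
import java.util.Collection;

/**
 * Hypothetical interface name used for illustration only; the enclosing type
 * in the actual PR is not shown in the review snippet.
 */
interface DocumentedCommitResult<CommT> {

    /**
     * Returns the committables that have been committed successfully.
     *
     * @return the successfully committed committables; assumed to be empty if none succeeded
     */
    Collection<CommT> getSuccessfulCommittables();

    /**
     * Returns the number of committables that have failed and will not be retried.
     *
     * @return the number of failed committables
     */
    int getNumFailed();
}
```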

@AHeise AHeise force-pushed the FLINK-36455-sync-retries branch from f8fd060 to 22899ad Compare November 4, 2024 09:29
Sinks so far retried asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns.

This commit retries a fixed number of times and then fails in notifyCheckpointCompleted.

Note that sync retries significantly simplify the committable handling. This commit starts a few simplifications; the next commit clears up more.
Without the async parts of the committable summary, the number of pending committables will always be 0.

Failed committables will also be 0: an unexpected failure throws an error, and an expected one is silently ignored. The previous behavior, where the count could be >0, actually led to infinite loops in the global committer.
@AHeise AHeise force-pushed the FLINK-36455-sync-retries branch from 22899ad to 64a5257 Compare November 4, 2024 13:37
@AHeise AHeise merged commit 13fe0e6 into apache:master Nov 6, 2024
@AHeise AHeise deleted the FLINK-36455-sync-retries branch November 6, 2024 08:55