[FLINK-38112] Align default of yarn.application-attempt-failures-validity-interval with YARN #26809

Open. Wants to merge 3 commits into base: master

liyude-tw

What is the purpose of the change

This pull request aligns Flink’s default for the YARN configuration option yarn.application-attempt-failures-validity-interval with YARN itself.
The previous default (10 000 ms) could cause unexpected, endless AM restarts whenever the interval between two failures exceeded ten seconds, because failures older than the window no longer count against the attempt limit.

Why -1 instead of another fixed window?
Environments differ: some restart the AM in 30 seconds, others in 3 seconds. No single fixed window fits everyone.
Setting the default to -1 (global counting) removes the hidden assumption and lets users choose a window that matches their own infrastructure when needed.
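
To make the difference between the two defaults concrete, here is a minimal sketch of windowed versus global failure counting. It is a simplified model written for this discussion, not YARN's or Flink's actual implementation; the class and method names (`AttemptFailureWindowSketch`, `countedFailures`, `failureThatStopsApp`) are hypothetical, and it assumes the application gives up once the counted failures reach the attempt limit.

```java
import java.util.List;

/** Simplified model of attempt-failure counting; NOT YARN's real code. */
final class AttemptFailureWindowSketch {

    /** Number of failures that still count against the attempt limit at time nowMs. */
    static long countedFailures(List<Long> failureTimesMs, long nowMs, long validityIntervalMs) {
        if (validityIntervalMs <= 0) {
            // -1 (or any non-positive value): global counting, every failure counts.
            return failureTimesMs.size();
        }
        // Positive interval: only failures inside the sliding window count.
        return failureTimesMs.stream()
                .filter(t -> t > nowMs - validityIntervalMs)
                .count();
    }

    /** 1-based index of the failure at which the app gives up, or -1 if it never does. */
    static int failureThatStopsApp(
            List<Long> failureTimesMs, int maxAttempts, long validityIntervalMs) {
        for (int i = 0; i < failureTimesMs.size(); i++) {
            long nowMs = failureTimesMs.get(i);
            List<Long> soFar = failureTimesMs.subList(0, i + 1);
            if (countedFailures(soFar, nowMs, validityIntervalMs) >= maxAttempts) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed scenario: the AM fails once every 30 seconds, attempt limit is 2.
        List<Long> failures = List.of(0L, 30_000L, 60_000L, 90_000L);
        System.out.println(failureThatStopsApp(failures, 2, 10_000L)); // old default
        System.out.println(failureThatStopsApp(failures, 2, -1L));     // new default
    }
}
```

In this model, failures spaced 30 seconds apart never accumulate inside a 10 s window, so the attempt limit is never reached, whereas global counting stops at the second failure.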

JIRA: FLINK-38112


Brief change log

  • YarnConfigOptions
    • defaultValue changed from 10000L to -1L
    • description text updated accordingly, including a correct REST-API link
  • Docs
    • Regenerated configuration HTML/Markdown via generate-configdocs so the tables reflect the new default

Verifying this change

This change is a trivial configuration default update.
No new tests are required; existing unit and IT cases already cover option parsing, and the full Maven build (mvn -T1C clean verify) now passes on JDK 17.


Does this pull request potentially affect one of the following parts

Area | Impact
Dependencies | no
Public API (@Public/@PublicEvolving) | no
Serializers | no
Runtime per-record code paths | no
Deployment / Recovery components (JM, Checkpointing, K8s/YARN, ZooKeeper) | yes – YARN only (default value change)
S3 file-system connector | no

Documentation

Does this pull request introduce a new feature? no

The existing docs were regenerated; no manual doc text was added.

@flinkbot
Collaborator

flinkbot commented Jul 18, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@@ -110,16 +110,16 @@ public class YarnConfigOptions {
public static final ConfigOption<Long> APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL =
key("yarn.application-attempt-failures-validity-interval")
.longType()
-.defaultValue(10000L)
+.defaultValue(-1L)
Contributor

I am concerned about changing the default of an API, as this will change behaviour for the user and could be seen as a regression. Failures that were counted within a 10-second window are now counted globally.

Author

Below is the reasoning that led me to propose -1 and how I believe the change is safer and less surprising than the current default.

  1. Few users intentionally depend on the current 10 s window
    The 10 s sliding window was introduced in PR [FLINK-12472][yarn] Support setting attemptFailuresValidityInterval o… #8400 by re-using the then-default Akka timeout. It was not added to satisfy a concrete production need, so I think almost no one relies on it on purpose. Most users discover it only after being surprised by extra restarts.

  2. Hadoop YARN’s own default is -1 (global counting)
    Because Flink runs as a YARN ApplicationMaster, aligning with the upstream default reduces the cognitive overhead for operators who administer both systems.

  3. The documentation and common intuition both imply “global counting”
    The description of yarn.application-attempts naturally suggests a total attempt limit. A hidden time window can therefore be surprising.

Risk-mitigation proposal

  1. Upgrade guide
    Add the following note in the upgrade section for this release:

Starting with this release, yarn.application-attempt-failures-validity-interval defaults to -1 (global counting).
Clusters that benefit from the previous 10 s sliding window can retain the old behaviour by adding
yarn.application-attempt-failures-validity-interval: 10000

  2. Release notes
    Repeat the same notice and example so that operators can quickly restore the former setting if needed.
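
As a rough illustration of what the upgrade note means for operators, the sketch below resolves the effective interval from a user-supplied key/value map. It deliberately uses a plain `Map` rather than Flink's configuration classes; `ValidityIntervalDefaultSketch` and `effectiveIntervalMs` are hypothetical names used only for this example.

```java
import java.util.Map;

/** Hypothetical helper showing how the proposed default interacts with an explicit override. */
final class ValidityIntervalDefaultSketch {

    static final String KEY = "yarn.application-attempt-failures-validity-interval";

    /** Proposed default: -1 means global counting. */
    static final long NEW_DEFAULT_MS = -1L;

    /** Returns the user's value if the key is set, otherwise the proposed default. */
    static long effectiveIntervalMs(Map<String, String> userConfig) {
        String raw = userConfig.get(KEY);
        return raw == null ? NEW_DEFAULT_MS : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        // No explicit setting: the new default applies.
        System.out.println(effectiveIntervalMs(Map.of()));
        // Explicit setting from the upgrade note: the former 10 s window is retained.
        System.out.println(effectiveIntervalMs(Map.of(KEY, "10000")));
    }
}
```

Only clusters that do not set the key see a behaviour change; an explicit `10000` keeps the former window.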

Labels
community-reviewed PR has been reviewed by the community.