-
Notifications
You must be signed in to change notification settings - Fork 13.7k
[FLINK-38112] Align default of yarn.application-attempt-failures-validity-interval with YARN #26809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
docs/layouts/shortcodes/generated/yarn_config_configuration.html
Outdated
Show resolved
Hide resolved
@@ -110,16 +110,16 @@ public class YarnConfigOptions { | |||
public static final ConfigOption<Long> APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL = | |||
key("yarn.application-attempt-failures-validity-interval") | |||
.longType() | |||
.defaultValue(10000L) | |||
.defaultValue(-1L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about changing the default of an API, as this will result in change of behaviour for the user and could be seen as a regression. What was 10 seconds in now done globally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below is the reasoning that led me to propose -1 and how I believe the change is safer and less surprising than the current default.
-
Few users intentionally depend on the current 10 s window
The 10 s sliding window was introduced in PR [FLINK-12472][yarn] Support setting attemptFailuresValidityInterval o… #8400 by re-using the then-default Akka timeout. It wasn’t added to satisfy a concrete production need, so I think almost no one relies on it on purpose. We discover it after being surprised by extra restarts. -
Hadoop YARN’s own default is -1 (global counting)
Because Flink runs as a YARN ApplicationMaster, aligning with the upstream default reduces the cognitive overhead for operators who administer both systems. -
The documentation and common intuition both imply “global counting”
The description of yarn.application-attempts naturally suggests a total attempt limit. A hidden time window can therefore be surprising.
Risk-mitigation proposal
- Upgrade guide
Add the following note in the upgrade section for this release:
Starting with this release, yarn.application-attempt-failures-validity-interval defaults to -1 (global counting).
Clusters that benefit from the previous 10 s sliding window can retain the old behaviour by adding
yarn.application-attempt-failures-validity-interval: 10000
- Release notes
Repeat the same notice and example so that operators can quickly restore the former setting if needed.
What is the purpose of the change
This pull request aligns Flink’s default for the YARN configuration option
yarn.application-attempt-failures-validity-interval
with YARN itself.The previous default (10 000 ms) caused unexpected endless AM restarts once the interval between two failures exceeded ten seconds.
Why
-1
instead of another fixed window?Since every environment performs differently, some restart AM in 30 seconds, some in 3 seconds. There is no fixed time that fits everyone.
Setting the default to
-1
(global counting) removes the hidden assumption and lets users choose a window that matches their own infrastructure when needed.JIRA: FLINK-38112
Brief change log
YarnConfigOptions
defaultValue
changed from10000L
to-1L
generate-configdocs
so the tables reflect the new defaultVerifying this change
This change is a trivial configuration default update.
No new tests are required; existing unit and IT cases already cover option parsing, and the full Maven build (
mvn -T1C clean verify
) now passes on JDK 17.Does this pull request potentially affect one of the following parts
@Public
/@PublicEvolving
)Documentation
Does this pull request introduce a new feature? no
The existing docs were regenerated; no manual doc text was added.