Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow remote retry max delay to be user configurable #16058

Closed

Conversation

joeljeske
Copy link
Contributor

This introduces a new option --remote_retry_max_delay can be used to change the existing maximum exponential backoff interval used when retrying remote requests. Before this change, there was a hardcoded value controlling this maximum exponential backoff interval, set to 5s.

Rational
remote_retries is useful in masking over temporary disruptions to a remote cluster. If a cluster experiences temporary downtime, it is useful to allow bazel clients to wait for a period of time for the cluster to recover before bailing and giving up. If users cannot configure the maximum exponential backoff delay, one must set a large number for remote_retries, each retry eventually waiting for up to 5s. This allows the bazel client to wait for a reasonable amount of time for the cluster to recover.

The problem here is that under certain cluster failure modes, requests may not be handled and failed quickly, rather they may wait until remote_timeout before failing. A large remote_timeout combined with a large remote_retries could lead to waiting for a very long time before finally bailing on a given action.

If a user can bump the remote_retry_max_delay, they can control the retry waiting semantics to their own needs.

@joeljeske joeljeske requested a review from a team as a code owner August 6, 2022 19:05
@joeljeske joeljeske force-pushed the remote-retry-max-delay branch 2 times, most recently from 71a4cbe to b14af12 Compare August 6, 2022 21:42
@sgowroji sgowroji added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Aug 7, 2022
@joeljeske
Copy link
Contributor Author

@tjgq could I trouble you for a quick review of this configuration option? 🙇 🙏

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the delay. This PR seems reasonable to me, I have just a small comment.

@@ -154,6 +154,7 @@ public static class ExponentialBackoff implements Backoff {
*/
ExponentialBackoff(Duration initial, Duration max, double multiplier, double jitter,
int maxAttempts) {
Preconditions.checkArgument(max.compareTo(initial) > 0, "max must be > initial");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule, Bazel should not crash due to malformed input. To avoid additional validation logic, how about we use max(initial, remoteRetryMaxDelay) as the argument to ExponentialBackoff below?

@joeljeske
Copy link
Contributor Author

Thanks for checking this out, @tjgq! I made your suggested change, preventing Bazel from crashing on an invalid value here.

@joeljeske
Copy link
Contributor Author

Friendly ping, @tjgq

@tjgq
Copy link
Contributor

tjgq commented Apr 11, 2023

Sorry for the delay, importing it now.

@brentleyjones
Copy link
Contributor

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Apr 12, 2023
@keertk
Copy link
Member

keertk commented Apr 12, 2023

@bazel-io fork 6.2.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Apr 12, 2023
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Apr 12, 2023
This introduces a new option `--remote_retry_max_delay` can be used to change the existing maximum exponential backoff interval used when retrying remote requests. Before this change, there was a hardcoded value controlling this maximum exponential backoff interval, set to `5s`.

Rational
`remote_retries` is useful in masking over temporary disruptions to a remote cluster. If a cluster experiences temporary downtime, it is useful to allow bazel clients to wait for a period of time for the cluster to recover before bailing and giving up. If users cannot configure the maximum exponential backoff delay, one must set a large number for `remote_retries`, each retry eventually waiting for up to 5s. This allows the bazel client to wait for a reasonable amount of time for the cluster to recover.

The problem here is that under certain cluster failure modes, requests may not be handled and failed quickly, rather they may wait until `remote_timeout` before failing. A large `remote_timeout` combined with a large `remote_retries` could lead to waiting for a very long time before finally bailing on a given action.

If a user can bump the `remote_retry_max_delay`, they can control the retry waiting semantics to their own needs.

Closes bazelbuild#16058.

PiperOrigin-RevId: 523680725
Change-Id: I21daba78b91d3157362ca85bb7b1cbbef8a94bb3
keertk pushed a commit that referenced this pull request Apr 21, 2023
* Allow remote retry max delay to be user configurable

This introduces a new option `--remote_retry_max_delay` can be used to change the existing maximum exponential backoff interval used when retrying remote requests. Before this change, there was a hardcoded value controlling this maximum exponential backoff interval, set to `5s`.

Rational
`remote_retries` is useful in masking over temporary disruptions to a remote cluster. If a cluster experiences temporary downtime, it is useful to allow bazel clients to wait for a period of time for the cluster to recover before bailing and giving up. If users cannot configure the maximum exponential backoff delay, one must set a large number for `remote_retries`, each retry eventually waiting for up to 5s. This allows the bazel client to wait for a reasonable amount of time for the cluster to recover.

The problem here is that under certain cluster failure modes, requests may not be handled and failed quickly, rather they may wait until `remote_timeout` before failing. A large `remote_timeout` combined with a large `remote_retries` could lead to waiting for a very long time before finally bailing on a given action.

If a user can bump the `remote_retry_max_delay`, they can control the retry waiting semantics to their own needs.

Closes #16058.

PiperOrigin-RevId: 523680725
Change-Id: I21daba78b91d3157362ca85bb7b1cbbef8a94bb3

* Replace RemoteDurationConverter with RemoteTimeoutConverter

---------

Co-authored-by: Joel Jeske <joel.jeske@robinhood.com>
fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
This introduces a new option `--remote_retry_max_delay` can be used to change the existing maximum exponential backoff interval used when retrying remote requests. Before this change, there was a hardcoded value controlling this maximum exponential backoff interval, set to `5s`.

Rational
`remote_retries` is useful in masking over temporary disruptions to a remote cluster. If a cluster experiences temporary downtime, it is useful to allow bazel clients to wait for a period of time for the cluster to recover before bailing and giving up. If users cannot configure the maximum exponential backoff delay, one must set a large number for `remote_retries`, each retry eventually waiting for up to 5s. This allows the bazel client to wait for a reasonable amount of time for the cluster to recover.

The problem here is that under certain cluster failure modes, requests may not be handled and failed quickly, rather they may wait until `remote_timeout` before failing. A large `remote_timeout` combined with a large `remote_retries` could lead to waiting for a very long time before finally bailing on a given action.

If a user can bump the `remote_retry_max_delay`, they can control the retry waiting semantics to their own needs.

Closes bazelbuild#16058.

PiperOrigin-RevId: 523680725
Change-Id: I21daba78b91d3157362ca85bb7b1cbbef8a94bb3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants