Use failure_rate instead of failure count for circuit breaker #18539

Closed
wants to merge 6 commits

Conversation

amishra-u
Contributor

Continuation of #18359
I ran multiple experiments to find the optimal failure threshold and failure window interval with different `remote_timeout` values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described here, even with a healthy remote cache the circuit tripped about 5-10% of the time, so we were not getting the best results.

Issues related to the failure count:

  1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
  2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
  3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a configuration that worked well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios (see the sketch below).
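
For illustration, here is a minimal Java sketch of a failure-rate-based window of the kind described above. The class and method names, the `minCallCount` guard, and the thresholds are hypothetical, not Bazel's actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal sketch of a failure-rate-based circuit breaker window.
 * Names and defaults are illustrative only.
 */
class FailureRateWindow {
  private final AtomicInteger successCount = new AtomicInteger();
  private final AtomicInteger failureCount = new AtomicInteger();
  private final double failureRateThreshold; // e.g. 0.10 for 10%
  private final int minCallCount;            // don't judge the rate on too few samples

  FailureRateWindow(double failureRateThreshold, int minCallCount) {
    this.failureRateThreshold = failureRateThreshold;
    this.minCallCount = minCallCount;
  }

  void recordSuccess() { successCount.incrementAndGet(); }

  void recordFailure() { failureCount.incrementAndGet(); }

  /** Trip only when the failure *rate* exceeds the threshold, not a raw count. */
  boolean shouldTrip() {
    int failures = failureCount.get();
    int total = failures + successCount.get();
    if (total < minCallCount) {
      return false; // too few calls in this window to judge reliably
    }
    return ((double) failures / total) > failureRateThreshold;
  }

  /** Called when the sliding window interval elapses. */
  void reset() {
    successCount.set(0);
    failureCount.set(0);
  }
}
```

The key difference from a count-based breaker is in `shouldTrip`: a fixed failure count trips quickly on a fast, healthy cache simply because it makes many calls, whereas a rate compares failures against the total traffic in the same window.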

@amishra-u amishra-u marked this pull request as ready for review May 30, 2023 22:04
@amishra-u amishra-u requested a review from a team as a code owner May 30, 2023 22:04
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team labels May 30, 2023
@linzhp
Contributor

linzhp commented May 30, 2023

@coeuvre Can you review?

@amishra-u
Contributor Author

@coeuvre Incorporated the feedback, please review.

@coeuvre coeuvre added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels May 31, 2023
@copybara-service copybara-service bot closed this in 10fb5f6 Jun 7, 2023
@iancha1992 iancha1992 removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Jun 7, 2023
amishra-u added a commit to amishra-u/bazel that referenced this pull request Jun 7, 2023
Continuation of bazelbuild#18359
I ran multiple experiments to find the optimal failure threshold and failure window interval with different `remote_timeout` values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described [here](bazelbuild#18359 (comment)), even with a healthy remote cache the circuit tripped about 5-10% of the time, so we were not getting the best results.

Issues related to the failure count:
1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a configuration that worked well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
copybara-service bot pushed a commit that referenced this pull request Jun 13, 2023
When the digest size exceeds the maximum digest size configured by the remote cache, an `OUT_OF_RANGE` error is returned. These errors should not be counted as API failures by the circuit breaker logic, as they do not indicate any issue with the remote-cache service.
Similarly, there are other non-retriable errors, such as `ALREADY_EXISTS`, that should not be treated as server failures.

This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED`, `UNKNOWN`, etc. are logged as failures (see the sketch after this commit message).

Related PRs
#18359
#18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
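
As referenced above, here is a minimal sketch of how such a classification might look, using the `io.grpc.Status` codes named in the commit message. The class name and the exact code-to-bucket mapping are assumptions for illustration, not Bazel's actual error-handling code:

```java
import io.grpc.Status;

/**
 * Sketch: count only retriable errors as circuit breaker failures.
 * The classification below is illustrative; the real mapping lives in
 * Bazel's remote-execution error handling.
 */
final class CircuitBreakerErrorClassifier {
  private CircuitBreakerErrorClassifier() {}

  /** Non-retriable codes signal a client/user problem, not server health. */
  static boolean countsAsFailure(Status status) {
    switch (status.getCode()) {
      case OUT_OF_RANGE:      // e.g. digest exceeds the cache's configured max size
      case ALREADY_EXISTS:    // blob already present; not a server fault
      case INVALID_ARGUMENT:
        return false;         // logged as success for circuit breaker purposes
      case DEADLINE_EXCEEDED: // likely an overloaded or unhealthy server
      case UNKNOWN:
      case UNAVAILABLE:
        return true;          // logged as failure
      default:
        return true;          // conservative default (assumption)
    }
  }
}
```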
traversaro pushed a commit to traversaro/bazel that referenced this pull request Jun 24, 2023
traversaro pushed a commit to traversaro/bazel that referenced this pull request Jun 24, 2023