Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor Update: Add out_of_range to ignored failure list for circuit_breaker #18583

Closed
wants to merge 25 commits into from

Conversation

amishra-u
Copy link
Contributor

@amishra-u amishra-u commented Jun 5, 2023

When the digest size exceeds the max configured digest size by remote-cache, an "out_of_range" error is returned. These errors should not be considered as API failures for the circuit breaker logic, as they do not indicate any issues with the remote-cache service.

Related PRs
#18359
#18539

@amishra-u amishra-u changed the title Add out_of_range to ignored failure list Add out_of_range to ignored failure list for cirucit_breaker Jun 6, 2023
@amishra-u amishra-u changed the title Add out_of_range to ignored failure list for cirucit_breaker Add out_of_range to ignored failure list for circuit_breaker Jun 6, 2023
@amishra-u amishra-u marked this pull request as ready for review June 6, 2023 00:21
@amishra-u amishra-u requested a review from a team as a code owner June 6, 2023 00:21
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jun 6, 2023
@amishra-u
Copy link
Contributor Author

amishra-u commented Jun 6, 2023

@coeuvre Can you please review? Minor update in circuit_breaker logic.

@amishra-u amishra-u changed the title Add out_of_range to ignored failure list for circuit_breaker Minor Update: Add out_of_range to ignored failure list for circuit_breaker Jun 6, 2023

public static final ImmutableSet<Class<? extends Exception>> DEFAULT_IGNORED_ERRORS =
ImmutableSet.of(CacheNotFoundException.class);
public static final Predicate<? super Exception> DEFAULT_IGNORED_ERRORS =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fundamentally the same as before. Since we already have the logic to check whether Bazel should retry on certain errors in Retrier, we want to reuse that instead of repeating the logic here.

Currently, we always call circuitBreaker.recordFailure(e) if Bazel encountered an error. If we move the call after the shouldRetry check, we know errors are not ignored when they are arrived in recordFailure.

This makes CircuitBreaker no longer need to check whether it should ignore certain errors and makes it possible to use the CircuitBreaker for HTTP payload.

Sorry if I didn't make it clear before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.
You mean call recordFailure if error is not one of these CANCELLED, UNKNOWN, DEADLINE_EXCEEDED, ABORTED, INTERNAL, UNAVAILABLE, RESOURCE_EXHAUSTED.
If I take diff with available grpc errors then leftover grpc errors are INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS, PERMISSION_DENIED, UNAUTHENTICATED, FAILED_PRECONDITION OUT_OF_RANGE, and UNIMPLEMENTED

At highlevel it seems that left over errors can be treated as ignored error. But I believe being explicit could be better. And here I only want NOT_FOUND and OUT_OF_RANGE to marked as ignored errror.

Also ignored errors are marked as success call for circuit breaker logic. code. The diff is not yet merged so I had to create pull request against master:HEAD.

I had two options then

  1. Add boolean argument isErrorIgnored as argument for recordFailure.
  2. Or call recordSuccess on ignored errors.

What do you say?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the logic to ignore failure to Retrier.

tkoeppe and others added 14 commits June 7, 2023 15:46
…dvanced/performance/iteration-speed

PiperOrigin-RevId: 538270596
Change-Id: Id8c88674133417357efb4d2c34c661a017427744
…y need it.

`OutputGroupInfo` always uses `Location.BUILTIN`. This change saves 8 bytes per `OutputGroupInfo` instance.

There are likely more classes that don't need a `location` field (see existing `TODO` in `NativeInfo`), but I'm just focusing on `OutputGroupInfo` for now.

PiperOrigin-RevId: 538327560
Change-Id: I6b10e72ae70a621cfe5bebf31d12ae4b88564d62
PiperOrigin-RevId: 538400959
Change-Id: I11d46083e6a161b96f8bb8588191a3ff99418883
PiperOrigin-RevId: 538401036
Change-Id: I244bfb99bd21c21367ac88ca197a916c1d034088
Keeping only import paths directories in correct order. Those consist only of "binfiles"/"genfiles", each "_virtual_imports" directory (if the proto_library using it is needed) and ".".

Listing every .proto file on the command line had performance implications.

PiperOrigin-RevId: 538456587
Change-Id: Iefe48db0d4efd1e7f8c0266595fd12aef5f71a7c
For an exported rule `foo_library`, `repr(foo_library)` now evaluates to `<rule foo_library>`, not `<rule>`, matching the behavior of native rules more closely.

Fixes bazelbuild#17483

Closes bazelbuild#18392.

PiperOrigin-RevId: 538458291
Change-Id: I331955655756a3c235be0a93f1394716ddf9849d
Also update all references in builtins to the symbol exported in this file.

PiperOrigin-RevId: 538467997
Change-Id: I3eb93d8a8fe90691595207998322597836ec05c5
See: bazelbuild/examples#201

Closes bazelbuild#18581.

PiperOrigin-RevId: 538469631
Change-Id: Ib0cfc6de5d9f0d2cc4fc045b47c2d8ce5a6b358e
…ule actions.

The motivation behind this change is to eventually promote the warning to an error in the output case; this avoids surprising behavior such as described in bazelbuild#18579. I'm less confident that the input case (source directories) can be similarly promoted to an error, but I'm still extending the warning to all actions for symmetry.

PiperOrigin-RevId: 538470433
Change-Id: Ic6c8a4cb353dda5f6397d286fe2e6faee28f896c
Update all references in builtins to the symbol exported in this file.

PiperOrigin-RevId: 538479229
Change-Id: If4dc4eb4cb6943632c1b7e5908fe9ed61679bc09
… try)

Rolling forward bazelbuild@975866a (which was rollbacked at bazelbuild@d51c75f) with fixes.

- Introduce a `rules_java_builtin` repo in WORKSPACE prefix to avoid conflict with user defined rules_java.
    - `@bazel_tools//tools/jdk/*.bzl` loads from `rules_java_builtin` through repo-mappings.
    - `@local_jdk` was overridden in `jdk.WORKSPACE` to add repo_mapping for `rules_java`.
    - `jdk.WORKSPACE` explicitly loads from `rules_java_builtin` for JDK definitions and java toolchain definitions.
- Allow using `__SKIP_WORKSPACE_PREFIX__` and `__SKIP_WORKSPACE_SUFFIX__` in WORKSPACE comment.
- Fixed many tests by adjusting the WORKSPACE file content.
- Re-export more symbols from `@bazel_tools` to be backward-compatible.

Fixes bazelbuild#18551

Related:
- bazelbuild#18373
- bazelbuild/rules_java#110
- bazelbuild#18423

Closes bazelbuild#18558.

PiperOrigin-RevId: 538483417
Change-Id: I5223eec2c4b10131fc8c5b342237385ff2f56413
… case.

This flag can set a by-mnemonic list of actions to allow locally, `--strategy_regexp` doesn't match against mnemonic but against the message.

PiperOrigin-RevId: 538484992
Change-Id: Ib80302d38212515bb2e5dbd09ffa884fdaa12776
…eCommand.

PiperOrigin-RevId: 538498468
Change-Id: I88cb9114bb370f42ecdeaac5b592b3a4e9ab4f08
…ency is copied correctly.

Without this, toolchain dependencies do not get the same execution platform as the original target, and toolchain's exec dependencies are not necessarily executable in the original target's actions.

PiperOrigin-RevId: 538541526
Change-Id: Icdcaa75c84d3a0f3f952aa491026e72ef88be099
Continuation of bazelbuild#18359
I ran multiple experiment and tried to find optimal failure threshold and failure window interval with different remote_timeout, for healthy remote cache, semi-healthy (overloaded) remote cache and unhealthy remote cache.
As I described [here](bazelbuild#18359 (comment)) even with healthy remote cache there was 5-10% circuit trip and we were not getting the best result.

Issue related to the failure count:
1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, changed the  approach to use the failure rate, and easily found a configuration  that worked effectively in both scenarios.

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
@amishra-u amishra-u requested a review from coeuvre June 8, 2023 00:00
@amishra-u
Copy link
Contributor Author

Create a new PR #18613, as the related #18539 got merged to mainline. And merge from master was not clean and pulled multiple unrelated commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.