Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement failure circuit breaker #18120

Closed
wants to merge 0 commits into from

Conversation

amishra-u
Copy link
Contributor

@amishra-u amishra-u commented Apr 17, 2023

Issue

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

Solution

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

@google-cla
Copy link

google-cla bot commented Apr 17, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@amishra-u amishra-u marked this pull request as ready for review April 18, 2023 20:11
@amishra-u amishra-u requested a review from a team as a code owner April 18, 2023 20:11
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Apr 18, 2023
@@ -100,7 +100,7 @@ enum State {
State state();

/** Called after an execution failed. */
void recordFailure();
void recordFailure(Throwable t);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change so you need to find another way to record ignored errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to wit: Throwable may be overly aggressive here - this is only called with Exception that I can see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@werkt Will update it to Exception.

Copy link
Contributor Author

@amishra-u amishra-u Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shirchen for backward compatibility I can define an overloaded method.

@deprecated
default void recordFailure() {
   // As there was not default implementation earlier, all implementing classes must have implemented this. 
   throw IllegalState("Not implemented method")
}

default void recordFailure(Exception e) {
   // for backward compatibility
   recordFailure() 
}

Please let me know your thought.

@werkt
Copy link
Contributor

werkt commented Apr 19, 2023

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

@amishra-u
Copy link
Contributor Author

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

I tested locally, and 'not found' didn't trip the circuit. Also, I will start shadow experiment this week to verify any performance issue and will post my findings.

assertThat(credentials.getRequestMetadata(URI.create("https://bar.example.org"))).isEmpty();
}

private void assertCircuitBreakerInstance(Retrier.CircuitBreaker circuitBreaker, RemoteOptions remoteOptions) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Can you take a look again at this change to see if you can rework it to not move so many methods around? Getting a handle on the depth of the change is difficult with all the sequencing changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to organize the test with no functional change and then added my test on top of it. There was lot of duplicate code.

Let me know what you think about this.

  1. I create another pr where I just reorganize the test.
  2. Then I put this pr on top of that.

Copy link
Contributor

@werkt werkt Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the indent level and position of testNetrc_netrcWithoutRemoteCache could have to do with organization and duplicate code: the method appears to have been moved without any changes (sans indent).

If you think things will be made clearer with another PR, go for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that this package uses 2 space indentation, while I was using 4 space indentation. Fix the indentation issue and made sure that the original position of the method are maintained.

@amishra-u
Copy link
Contributor Author

amishra-u commented May 10, 2023

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

I verified there was no circuit trip for not_found error. The not_found error remained same for every experiment 10-15% of total request (quite high number) but with healthy remote-cache there was no or very less trips.
more details here

copybara-service bot pushed a commit that referenced this pull request May 30, 2023
Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen.

### Issue
We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

### Solution
To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
amishra-u added a commit to amishra-u/bazel that referenced this pull request May 30, 2023
Copy of bazelbuild#18120: I accidentally closed bazelbuild#18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes bazelbuild#18136

Closes bazelbuild#18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
iancha1992 pushed a commit that referenced this pull request Jun 1, 2023
* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake
iancha1992 added a commit that referenced this pull request Jun 13, 2023
…#18559)

* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake

* Use failure_rate instead of failure count for circuit breaker

---------

Co-authored-by: Ian (Hee) Cha <heec@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Implementation for Circuit breaker
3 participants