feat: Implement failure circuit breaker #18120

amishra-u · 2023-04-17T18:08:10Z

Issue

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

Solution

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

google-cla · 2023-04-17T18:08:15Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

shirchen · 2023-04-18T20:46:23Z

src/main/java/com/google/devtools/build/lib/remote/Retrier.java

@@ -100,7 +100,7 @@ enum State {
 State state();

 /** Called after an execution failed. */
- void recordFailure();
+ void recordFailure(Throwable t);


This is a breaking change so you need to find another way to record ignored errors.

to wit: Throwable may be overly aggressive here - this is only called with Exception that I can see.

@werkt Will update it to Exception.

@shirchen for backward compatibility I can define an overloaded method.

@deprecated default void recordFailure() { // As there was not default implementation earlier, all implementing classes must have implemented this. throw IllegalState("Not implemented method") } default void recordFailure(Exception e) { // for backward compatibility recordFailure() }

Please let me know your thought.

werkt · 2023-04-19T17:24:11Z

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

amishra-u · 2023-04-19T17:32:21Z

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

I tested locally, and 'not found' didn't trip the circuit. Also, I will start shadow experiment this week to verify any performance issue and will post my findings.

werkt · 2023-04-19T21:04:16Z

src/test/java/com/google/devtools/build/lib/remote/RemoteModuleTest.java

+ assertThat(credentials.getRequestMetadata(URI.create("https://bar.example.org"))).isEmpty();
+ }
+
+ private void assertCircuitBreakerInstance(Retrier.CircuitBreaker circuitBreaker, RemoteOptions remoteOptions) {


indent

Can you take a look again at this change to see if you can rework it to not move so many methods around? Getting a handle on the depth of the change is difficult with all the sequencing changes.

I tried to organize the test with no functional change and then added my test on top of it. There was lot of duplicate code.

Let me know what you think about this.

I create another pr where I just reorganize the test.

Then I put this pr on top of that.

I'm not sure what the indent level and position of testNetrc_netrcWithoutRemoteCache could have to do with organization and duplicate code: the method appears to have been moved without any changes (sans indent).

If you think things will be made clearer with another PR, go for it.

I didn't realize that this package uses 2 space indentation, while I was using 4 space indentation. Fix the indentation issue and made sure that the original position of the method are maintained.

amishra-u · 2023-05-10T19:57:42Z

The original issue with circuit breaker behavior was one represented during ActionCache misses: NOT_FOUND status is returned from this call for misses, which incorrectly signaled the circuit to open due to translation of the NOT_FOUND status to an exception. Can you identify the verification (i.e. test) that this expected failure status does not trip the circuit breaker open (as it is going to be extremely common), and that other failure statuses may trip it open?

I verified there was no circuit trip for not_found error. The not_found error remained same for every experiment 10-15% of total request (quite high number) but with healthy remote-cache there was no or very less trips.
more details here

Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen. ### Issue We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. ### Solution To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window. In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit. Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency. Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy. closes #18136 Closes #18359. PiperOrigin-RevId: 536349954 Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

Copy of bazelbuild#18120: I accidentally closed bazelbuild#18120 during rebase and doesn't have permission to reopen. We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window. In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit. Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency. Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy. closes bazelbuild#18136 Closes bazelbuild#18359. PiperOrigin-RevId: 536349954 Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* feat: Implement failure circuit breaker Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen. We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window. In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit. Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency. Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy. closes #18136 Closes #18359. PiperOrigin-RevId: 536349954 Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704 * remove target included in cherry-pick by mistake

…#18559) * feat: Implement failure circuit breaker Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen. We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window. In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit. Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency. Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy. closes #18136 Closes #18359. PiperOrigin-RevId: 536349954 Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704 * remove target included in cherry-pick by mistake * Use failure_rate instead of failure count for circuit breaker --------- Co-authored-by: Ian (Hee) Cha <heec@google.com>

amishra-u marked this pull request as ready for review April 18, 2023 20:11

amishra-u requested a review from a team as a code owner April 18, 2023 20:11

github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Apr 18, 2023

shirchen suggested changes Apr 18, 2023

View reviewed changes

werkt reviewed Apr 19, 2023

View reviewed changes

amishra-u closed this May 9, 2023

amishra-u force-pushed the master branch from b42d75d to fd33575 Compare May 9, 2023 23:22

amishra-u mentioned this pull request May 10, 2023

feat: Implement failure circuit breaker #18359

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement failure circuit breaker #18120

feat: Implement failure circuit breaker #18120

amishra-u commented Apr 17, 2023 •

edited

Loading

google-cla bot commented Apr 17, 2023

shirchen Apr 18, 2023

werkt Apr 19, 2023

amishra-u Apr 20, 2023

amishra-u Apr 20, 2023 •

edited

Loading

werkt commented Apr 19, 2023

amishra-u commented Apr 19, 2023

werkt Apr 19, 2023

amishra-u Apr 19, 2023

werkt Apr 20, 2023 •

edited

Loading

amishra-u Apr 20, 2023

amishra-u commented May 10, 2023 •

edited

Loading

feat: Implement failure circuit breaker #18120

feat: Implement failure circuit breaker #18120

Conversation

amishra-u commented Apr 17, 2023 • edited Loading

Issue

Solution

google-cla bot commented Apr 17, 2023

shirchen Apr 18, 2023

Choose a reason for hiding this comment

werkt Apr 19, 2023

Choose a reason for hiding this comment

amishra-u Apr 20, 2023

Choose a reason for hiding this comment

amishra-u Apr 20, 2023 • edited Loading

Choose a reason for hiding this comment

werkt commented Apr 19, 2023

amishra-u commented Apr 19, 2023

werkt Apr 19, 2023

Choose a reason for hiding this comment

amishra-u Apr 19, 2023

Choose a reason for hiding this comment

werkt Apr 20, 2023 • edited Loading

Choose a reason for hiding this comment

amishra-u Apr 20, 2023

Choose a reason for hiding this comment

amishra-u commented May 10, 2023 • edited Loading

amishra-u commented Apr 17, 2023 •

edited

Loading

amishra-u Apr 20, 2023 •

edited

Loading

werkt Apr 20, 2023 •

edited

Loading

amishra-u commented May 10, 2023 •

edited

Loading