
Commit 776b25f

fix: Decrease RPC failover threshold & cooldown duration
Currently, there must be 28 consecutive failed attempts to hit an RPC endpoint before it is perceived to be unavailable. At that point, if the endpoint is configured with a failover, the failover is activated and all requests are automatically diverted to it; otherwise, requests are paused for 30 minutes. There are two problems we are seeing now:

- If Infura is degraded enough that failing over to QuickNode is warranted, the failover may be activated too late.
- If the user is attempting to access a custom network and is experiencing issues (either a local connection issue or an issue with the endpoint itself), they are prevented from using that network for 30 minutes. This is far too long.

To fix these problems, this commit:

- Lowers the "max consecutive failures" (the number of successive attempts to obtain a successful response from an endpoint before requests are paused or the failover is triggered) from 28 to 8.
- Lowers the "circuit break duration" (the period during which requests to an unavailable endpoint are paused) from 30 minutes to 30 *seconds*.

In summary, if a network starts to become degraded or the user is experiencing connection issues, the network is more likely to be flagged as unavailable, but if the situation improves the user can get back to using it more quickly.

How quickly does the circuit break now? It depends on whether the user is on Chrome or Firefox and whether the errors encountered are retriable or non-retriable:

- Retriable errors (e.g. connection errors, 502/503/504) are, as the name implies, automatically retried. If these errors are continually produced, the circuit breaks very quickly (if the extension is restarted, it breaks immediately).
- Non-retriable errors (e.g. 4xx errors) are not automatically retried, so it takes longer for the circuit to break (if the extension is restarted, on average about 1 minute).
- Note that Chrome implements "anti-DDoS throttling" logic, which turns some non-retriable errors into retriable errors. In that situation the circuit breaks faster than it would on Firefox.
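To make the new numbers concrete, here is a rough TypeScript sketch of the arithmetic behind the thresholds. The constant names are illustrative only; the actual values are set in getRpcServiceOptions in the diff below.

// Sketch of the threshold arithmetic described above. Names are illustrative;
// the real configuration lives in getRpcServiceOptions.
const SECOND = 1_000;

// One "round" of requests is the initial attempt plus `maxRetries` automatic retries.
const maxRetries = 3;
const requestsPerRound = maxRetries + 1; // 4

// Default endpoints: the circuit breaks after 2 failed rounds, i.e. 8
// consecutive failed requests (down from 28).
const defaultMaxConsecutiveFailures = requestsPerRound * 2; // 8

// QuickNode failover endpoints: allow 10 failed rounds (40 failures) so the
// circuit does not break while the failover is still being activated.
const quicknodeMaxConsecutiveFailures = requestsPerRound * 10; // 40

// Once the circuit breaks, requests to the endpoint are paused for 30 seconds
// (down from 30 minutes).
const circuitBreakDuration = 30 * SECOND;

console.log({
  defaultMaxConsecutiveFailures, // 8
  quicknodeMaxConsecutiveFailures, // 40
  circuitBreakDuration, // 30000
});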

app/scripts/controller-init/network-controller-init.ts

Lines changed: 25 additions & 11 deletions
@@ -157,33 +157,47 @@ export const NetworkControllerInit: ControllerInitFunction<
   };
 
   const getRpcServiceOptions = (rpcEndpointUrl: string) => {
-    const maxRetries = 4;
+    // This is the default, but we define it here to be explicit.
+    // Note that the total number of attempts is 1 more than this.
+    const maxRetries = 3;
     const commonOptions = {
       fetch: globalThis.fetch.bind(globalThis),
       btoa: globalThis.btoa.bind(globalThis),
+      maxRetries,
+    };
+    const commonPolicyOptions = {
+      // Ensure that the "cooldown" period after breaking the circuit is short.
+      circuitBreakDuration: 30 * SECOND,
     };
 
     if (getIsQuicknodeEndpointUrl(rpcEndpointUrl)) {
       return {
         ...commonOptions,
         policyOptions: {
-          maxRetries,
-          // When we fail over to Quicknode, we expect it to be down at
-          // first while it is being automatically activated. If an endpoint
-          // is down, the failover logic enters a "cooldown period" of 30
-          // minutes. We'd really rather not enter that for Quicknode, so
-          // keep retrying longer.
-          maxConsecutiveFailures: (maxRetries + 1) * 14,
+          ...commonPolicyOptions,
+          // The number of rounds of retries that will break the circuit,
+          // triggering a "cooldown".
+          //
+          // When we fail over to QuickNode, we expect it to be down at first
+          // while it is being automatically activated, and we don't want to
+          // activate the "cooldown" accidentally.
+          maxConsecutiveFailures: (maxRetries + 1) * 10,
         },
       };
     }
 
     return {
       ...commonOptions,
       policyOptions: {
-        maxRetries,
-        // Ensure that the circuit does not break too quickly.
-        maxConsecutiveFailures: (maxRetries + 1) * 7,
+        ...commonPolicyOptions,
+        // Ensure that if the endpoint continually responds with errors, we
+        // break the circuit relatively fast (but not prematurely).
+        //
+        // Note that the circuit will break much faster if the errors are
+        // retriable (e.g. 503) than if not (e.g. 500), so we attempt to strike
+        // a balance here. (In testing, it takes about 1 minute to break the
+        // circuit with a continual non-retriable error.)
+        maxConsecutiveFailures: (maxRetries + 1) * 2,
       },
     };
   };
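For illustration, here is a minimal, self-contained model of the consecutive-failure behaviour that the options above configure. This is a simplified sketch, not the actual policy implementation used by the network controller; the class and method names are assumptions made for this example only.

// A simplified model of a consecutive-failure circuit breaker. The real
// behaviour comes from the RPC service policy, not from this class.
class ConsecutiveFailureBreaker {
  #failures = 0;
  #pausedUntil = 0;

  constructor(
    private readonly maxConsecutiveFailures: number,
    private readonly circuitBreakDuration: number,
  ) {}

  // Requests are allowed whenever the circuit is not in its cooldown period.
  canRequest(now: number): boolean {
    return now >= this.#pausedUntil;
  }

  // Each failed request counts toward the threshold; once it is reached,
  // the circuit breaks and requests are paused for the cooldown period.
  recordFailure(now: number): void {
    this.#failures += 1;
    if (this.#failures >= this.maxConsecutiveFailures) {
      this.#pausedUntil = now + this.circuitBreakDuration;
      this.#failures = 0;
    }
  }

  // Any successful response resets the consecutive-failure count.
  recordSuccess(): void {
    this.#failures = 0;
  }
}

// With the new values, 8 consecutive failures pause a default endpoint for
// 30 seconds; the QuickNode failover tolerates 40 failures before pausing.
const breaker = new ConsecutiveFailureBreaker(8, 30 * 1_000);
console.log(breaker.canRequest(Date.now())); // true until 8 failures are recorded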
