
Commit 776b25f

fix: Decrease RPC failover threshold & cooldown duration
Currently, there must be 28 consecutive failed attempts to hit an RPC endpoint before it is perceived to be unavailable. At that point, if the endpoint is configured with a failover, the failover is activated and all requests are automatically diverted to it; otherwise, requests are paused for 30 minutes. There are two problems we are seeing now:

- If Infura is degraded enough that failing over to QuickNode is warranted, the failover may be activated too late.
- If the user is attempting to access a custom network and is experiencing issues (either a local connection issue or an issue with the endpoint itself), they are prevented from using that network for 30 minutes. This is far too long.

To fix these problems, this commit:

- Lowers the "max consecutive failures" (the number of successive attempts to obtain a successful response from an endpoint before requests are paused or the failover is triggered) from 28 to 8.
- Lowers the "circuit break duration" (the period during which requests to an unavailable endpoint are paused) from 30 minutes to 30 *seconds*.

In summary, if a network starts to become degraded or the user is experiencing connection issues, the network is more likely to be flagged as unavailable, but if the situation improves the user can get back to using it more quickly.

How quickly does the circuit break now? It depends on whether the user is on Chrome or Firefox and whether the errors encountered are retriable or non-retriable:

- Retriable errors (e.g. connection errors, 502/503/504) are, as the name implies, automatically retried. If these errors are continually produced, the circuit breaks very quickly (if the extension is restarted, it breaks immediately).
- Non-retriable errors (e.g. 4xx errors) are not automatically retried, so it takes longer for the circuit to break (if the extension is restarted, on average about 1 minute).
- Note that Chrome implements "anti-DDoS throttling" logic, which turns some non-retriable errors into retriable errors. In that situation the circuit breaks faster than it would on Firefox.
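To make the new numbers concrete, here is a rough TypeScript sketch of the arithmetic behind the thresholds. The constant names are illustrative only; the actual values are set in getRpcServiceOptions in the diff below.

// Sketch of the threshold arithmetic described above. Names are illustrative;
// the real configuration lives in getRpcServiceOptions.
const SECOND = 1_000;

// One "round" of requests is the initial attempt plus `maxRetries` automatic retries.
const maxRetries = 3;
const requestsPerRound = maxRetries + 1; // 4

// Default endpoints: the circuit breaks after 2 failed rounds, i.e. 8
// consecutive failed requests (down from 28).
const defaultMaxConsecutiveFailures = requestsPerRound * 2; // 8

// QuickNode failover endpoints: allow 10 failed rounds (40 failures) so the
// circuit does not break while the failover is still being activated.
const quicknodeMaxConsecutiveFailures = requestsPerRound * 10; // 40

// Once the circuit breaks, requests to the endpoint are paused for 30 seconds
// (down from 30 minutes).
const circuitBreakDuration = 30 * SECOND;

console.log({
  defaultMaxConsecutiveFailures, // 8
  quicknodeMaxConsecutiveFailures, // 40
  circuitBreakDuration, // 30000
});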

app/scripts/controller-init/network-controller-init.ts

Lines changed: 25 additions & 11 deletions
@@ -157,33 +157,47 @@ export const NetworkControllerInit: ControllerInitFunction<
   };
 
   const getRpcServiceOptions = (rpcEndpointUrl: string) => {
-    const maxRetries = 4;
+    // This is the default, but we define it here to be explicit.
+    // Note that the total number of attempts is 1 more than this.
+    const maxRetries = 3;
     const commonOptions = {
       fetch: globalThis.fetch.bind(globalThis),
       btoa: globalThis.btoa.bind(globalThis),
+      maxRetries,
+    };
+    const commonPolicyOptions = {
+      // Ensure that the "cooldown" period after breaking the circuit is short.
+      circuitBreakDuration: 30 * SECOND,
     };
 
     if (getIsQuicknodeEndpointUrl(rpcEndpointUrl)) {
       return {
         ...commonOptions,
         policyOptions: {
-          maxRetries,
-          // When we fail over to Quicknode, we expect it to be down at
-          // first while it is being automatically activated. If an endpoint
-          // is down, the failover logic enters a "cooldown period" of 30
-          // minutes. We'd really rather not enter that for Quicknode, so
-          // keep retrying longer.
-          maxConsecutiveFailures: (maxRetries + 1) * 14,
+          ...commonPolicyOptions,
+          // The number of rounds of retries that will break the circuit,
+          // triggering a "cooldown".
+          //
+          // When we fail over to QuickNode, we expect it to be down at first
+          // while it is being automatically activated, and we don't want to
+          // activate the "cooldown" accidentally.
+          maxConsecutiveFailures: (maxRetries + 1) * 10,
         },
       };
     }
 
     return {
       ...commonOptions,
       policyOptions: {
-        maxRetries,
-        // Ensure that the circuit does not break too quickly.
-        maxConsecutiveFailures: (maxRetries + 1) * 7,
+        ...commonPolicyOptions,
+        // Ensure that if the endpoint continually responds with errors, we
+        // break the circuit relatively fast (but not prematurely).
+        //
+        // Note that the circuit will break much faster if the errors are
+        // retriable (e.g. 503) than if not (e.g. 500), so we attempt to strike
+        // a balance here. (In testing, it takes about 1 minute to break the
+        // circuit with a continual non-retriable error.)
+        maxConsecutiveFailures: (maxRetries + 1) * 2,
       },
     };
   };
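For illustration, here is a minimal, self-contained model of the consecutive-failure behaviour that the options above configure. This is a simplified sketch, not the actual policy implementation used by the network controller; the class and method names are assumptions made for this example only.

// A simplified model of a consecutive-failure circuit breaker. The real
// behaviour comes from the RPC service policy, not from this class.
class ConsecutiveFailureBreaker {
  #failures = 0;
  #pausedUntil = 0;

  constructor(
    private readonly maxConsecutiveFailures: number,
    private readonly circuitBreakDuration: number,
  ) {}

  // Requests are allowed whenever the circuit is not in its cooldown period.
  canRequest(now: number): boolean {
    return now >= this.#pausedUntil;
  }

  // Each failed request counts toward the threshold; once it is reached,
  // the circuit breaks and requests are paused for the cooldown period.
  recordFailure(now: number): void {
    this.#failures += 1;
    if (this.#failures >= this.maxConsecutiveFailures) {
      this.#pausedUntil = now + this.circuitBreakDuration;
      this.#failures = 0;
    }
  }

  // Any successful response resets the consecutive-failure count.
  recordSuccess(): void {
    this.#failures = 0;
  }
}

// With the new values, 8 consecutive failures pause a default endpoint for
// 30 seconds; the QuickNode failover tolerates 40 failures before pausing.
const breaker = new ConsecutiveFailureBreaker(8, 30 * 1_000);
console.log(breaker.canRequest(Date.now())); // true until 8 failures are recorded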
