Skip to content

Commit b4ce2c1

Browse files
authored
chore: Decrease threshold before RPC failover is activated (#37002)
<!-- Please submit this PR as a draft initially. Do not mark it as "Ready for review" until the template has been completely filled out, and PR status checks have passed at least once. --> ## **Description** <!-- Write a short description of the changes included in this pull request, also include relevant motivation and context. Have in mind the following questions: 1. What is the reason for the change? 2. What is the improvement/solution? --> The network request code within the client makes use of the circuit breaker pattern. Currently, there must be 28 consecutive failed attempts to hit an RPC endpoint before the circuit breaks and it is perceived to be unavailable. At this point, if the endpoint is configured with a failover, the failover will be activated and all requests will be automatically diverted to it; otherwise requests are paused for 30 minutes. There are two problems we are seeing now: - If Infura is degraded enough to where failing over to QuickNode is warranted, the failover may be activated too late. - If the user is attempting to access a custom network and they are experiencing issues (either due to a local connection issue or an issue with the endpoint itself), they will be prevented from using that network for 30 minutes. This is way too long. To fix these problems, this commit: - Lowers the "max consecutive failures" (the number of successive attempts to obtain a successful response from an endpoint before requests are paused or the failover is triggered) from 28 to 8. - Lowers the "circuit break duration" (the period during which requests to an unavailable endpoint will be paused) from 30 minutes to 30 *seconds*. In summary, if a network starts to become degraded or the user is experiencing connection issues, the network is more likely to be flagged as unavailable, but if the situation improves the user may be able to use the network more quickly. How quickly does the circuit break now? It depends on whether the user is using Chrome or Firefox and whether the errors encountered are retriable or non-retriable: - Retriable errors (e.g. connection errors, 502/503/504, etc.) will, as the name implies, be automatically retried. If these errors are continually produced, the circuit will break very quickly (if the extension is restarted, then it will break immediately). - Non-retriable errors (e.g. 4xx errors) do not get automatically retried, so it takes longer for the circuit to break (if the extension is restarted, on average it will take about 1 minute). - Note that Chrome implements "anti-DDoS throttling logic" which means that some non-retriable errors will turn into retriable errors. In this situation the circuit breaks faster than it would on Firefox. [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/MetaMask/metamask-extension/pull/37002?quickstart=1) ## **Changelog** <!-- If this PR is not End-User-Facing and should not show up in the CHANGELOG, you can choose to either: 1. Write `CHANGELOG entry: null` 2. Label with `no-changelog` If this PR is End-User-Facing, please write a short User-Facing description in the past tense like: `CHANGELOG entry: Added a new tab for users to see their NFTs` `CHANGELOG entry: Fixed a bug that was causing some NFTs to flicker` (This helps the Release Engineer do their job more quickly and accurately) --> CHANGELOG entry: Decrease time before activating QuickNode when Infura is degraded or unavailable; decrease time before allowing users to interact with a custom network following connection issues ## **Related issues** - https://consensyssoftware.atlassian.net/browse/WPC-103 ## **Manual testing steps** ### Prerequisites 1. Install [FoxyProxy](https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp). Pin it. 2. Open the options for FoxyProxy (click on its icon and go to Options). 3. Click on the Proxies tab, then click on Add. A new box should appear. 4. Under "Hostname", enter `localhost`; under "Port", enter `8080`; enter some name in "Title". 5. Next to "Proxy by Patterns", click the plus icon. You should see a new row appear. 6. In the `://example.com` field, enter `https://*.infura.io/*`; in the "title" field, enter some name, like "Localhost". (This will ensure that only requests to Infura go through the proxy.) 8. Install [`mitmproxy`](https://docs.mitmproxy.org/stable/overview/installation/). 10. Create a Python script somewhere on your computer with the following contents: https://gist.github.com/mcmire/1d43ce690d3a974217126cd584f79b7d. This script will cause all requests to Infura RPC endpoints to respond with 500, simulating an outage. 11. Run `mitmproxy -s <path to your script>` in an open terminal session. This will run the proxy server. 12. Go back to FoxyProxy and enable the proxy you created earlier by clicking on the icon, then choosing the name of the new proxy (e.g. Localhost). 13. Check out this branch. Run `yarn`, then `yarn start`. 14. Ensure that you've added the local version of MetaMask to your browser and no other versions are present. 15. Open MetaMask, and go through onboarding. 16. Open the DevTools for the extension (Chrome: go to Extensions, find MetaMask, click on "service worker"; in Firefox, go to `about:debugging`, find MetaMask, click on "Inspect"). ### Testing non-retriable errors 1. Ensure that in the `mitmproxy` script, all Infura endpoints return 500. 2. Restart the extension. 3. Switch to DevTools. You should see request errors like "RPC endpoint not found or unavailable" in the Console view. 5. Open MetaMask and switch to full-screen view so that it stays open. 6. Switch back to DevTools and wait. After about 20 seconds, you should see errors that say "RPC endpoint returned too many errors". ### Testing retriable errors 1. Stop `mitmproxy`. Modify the script so that all Infura endpoints return 503 instead of 500. Restart it. 2. Restart the extension. 3. Switch to DevTools. You should see a flurry of messages, and at the end you should see an error that says "RPC endpoint returned too many errors". ## **Screenshots/Recordings** <!-- If applicable, add screenshots and/or recordings to visualize the before and after of your change. --> ### **Before** <!-- [screenshots/recordings] --> ### **After** #### Chrome In this video, the endpoint continually responds with 500 (a non-retriable error): https://github.com/user-attachments/assets/fc8956d2-ec43-4ba6-a4c0-8e68b72b90c4 In this video, the endpoint continually responds with 503 (a retriable error): https://github.com/user-attachments/assets/d449e404-d755-4d81-be94-0bee44174354 #### Firefox In this video, the endpoint continually responds with 500 (a non-retriable error): https://github.com/user-attachments/assets/6500f3a2-b0e8-459f-86f6-20f15044ecda In this video, the endpoint continually responds with 503 (a retriable error): https://github.com/user-attachments/assets/84361615-e6a9-48a0-aab4-13ca016eec2e ## **Pre-merge author checklist** - [x] I've followed [MetaMask Contributor Docs](https://github.com/MetaMask/contributor-docs) and [MetaMask Extension Coding Standards](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/CODING_GUIDELINES.md). - [x] I've completed the PR template to the best of my ability - [x] I’ve included tests if applicable - [x] I’ve documented my code using [JSDoc](https://jsdoc.app/) format if applicable - [x] I’ve applied the right labels on the PR (see [labeling guidelines](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/LABELING_GUIDELINES.md)). Not required for external contributors. ## **Pre-merge reviewer checklist** - [ ] I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed). - [ ] I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Tightens RPC circuit-breaker settings by using DEFAULT_MAX_RETRIES, setting a 30s cooldown, and reducing maxConsecutiveFailures for QuickNode and other endpoints. > > - **Network Controller Init (`app/scripts/controller-init/network-controller-init.ts`)** > - Use `DEFAULT_MAX_RETRIES` for RPC retries. > - Introduce shared `policyOptions` with `circuitBreakDuration: 30 * SECOND`. > - Adjust `maxConsecutiveFailures`: > - QuickNode endpoints: `(maxRetries + 1) * 10`. > - Other endpoints: `(maxRetries + 1) * 3`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit bb71a1c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
1 parent 790cc12 commit b4ce2c1

File tree

1 file changed

+29
-12
lines changed

1 file changed

+29
-12
lines changed

app/scripts/controller-init/network-controller-init.ts

Lines changed: 29 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,11 @@ import {
44
NetworkController,
55
RpcEndpointType,
66
} from '@metamask/network-controller';
7-
import { BlockExplorerUrl, ChainId } from '@metamask/controller-utils';
7+
import {
8+
DEFAULT_MAX_RETRIES,
9+
BlockExplorerUrl,
10+
ChainId,
11+
} from '@metamask/controller-utils';
812
import { hasProperty } from '@metamask/utils';
913
import { SECOND } from '../../../shared/constants/time';
1014
import { getIsQuicknodeEndpointUrl } from '../../../shared/lib/network-utils';
@@ -173,33 +177,46 @@ export const NetworkControllerInit: ControllerInitFunction<
173177
};
174178

175179
const getRpcServiceOptions = (rpcEndpointUrl: string) => {
176-
const maxRetries = 4;
180+
// Note that the total number of attempts is 1 more than this
181+
// (which is why we add 1 below).
182+
const maxRetries = DEFAULT_MAX_RETRIES;
177183
const commonOptions = {
178184
fetch: globalThis.fetch.bind(globalThis),
179185
btoa: globalThis.btoa.bind(globalThis),
180186
};
187+
const commonPolicyOptions = {
188+
// Ensure that the "cooldown" period after breaking the circuit is short.
189+
circuitBreakDuration: 30 * SECOND,
190+
maxRetries,
191+
};
181192

182193
if (getIsQuicknodeEndpointUrl(rpcEndpointUrl)) {
183194
return {
184195
...commonOptions,
185196
policyOptions: {
186-
maxRetries,
187-
// When we fail over to Quicknode, we expect it to be down at
188-
// first while it is being automatically activated. If an endpoint
189-
// is down, the failover logic enters a "cooldown period" of 30
190-
// minutes. We'd really rather not enter that for Quicknode, so
191-
// keep retrying longer.
192-
maxConsecutiveFailures: (maxRetries + 1) * 14,
197+
...commonPolicyOptions,
198+
// The number of rounds of retries that will break the circuit,
199+
// triggering a "cooldown".
200+
//
201+
// When we fail over to QuickNode, we expect it to be down at first
202+
// while it is being automatically activated, and we don't want to
203+
// activate the "cooldown" accidentally.
204+
maxConsecutiveFailures: (maxRetries + 1) * 10,
193205
},
194206
};
195207
}
196208

197209
return {
198210
...commonOptions,
199211
policyOptions: {
200-
maxRetries,
201-
// Ensure that the circuit does not break too quickly.
202-
maxConsecutiveFailures: (maxRetries + 1) * 7,
212+
...commonPolicyOptions,
213+
// Ensure that if the endpoint continually responds with errors, we
214+
// break the circuit relatively fast (but not prematurely).
215+
//
216+
// Note that the circuit will break much faster if the errors are
217+
// retriable (e.g. 503) than if not (e.g. 500), so we attempt to strike
218+
// a balance here.
219+
maxConsecutiveFailures: (maxRetries + 1) * 3,
203220
},
204221
};
205222
};

0 commit comments

Comments
 (0)