Commit b4ce2c1
authored
chore: Decrease threshold before RPC failover is activated (#37002)
<!--
Please submit this PR as a draft initially.
Do not mark it as "Ready for review" until the template has been
completely filled out, and PR status checks have passed at least once.
-->
## **Description**
<!--
Write a short description of the changes included in this pull request,
also include relevant motivation and context. Have in mind the following
questions:
1. What is the reason for the change?
2. What is the improvement/solution?
-->
The network request code within the client makes use of the circuit
breaker pattern. Currently, there must be 28 consecutive failed attempts
to hit an RPC endpoint before the circuit breaks and it is perceived to
be unavailable. At this point, if the endpoint is configured with a
failover, the failover will be activated and all requests will be
automatically diverted to it; otherwise requests are paused for 30
minutes.
There are two problems we are seeing now:
- If Infura is degraded enough to where failing over to QuickNode is
warranted, the failover may be activated too late.
- If the user is attempting to access a custom network and they are
experiencing issues (either due to a local connection issue or an issue
with the endpoint itself), they will be prevented from using that
network for 30 minutes. This is way too long.
To fix these problems, this commit:
- Lowers the "max consecutive failures" (the number of successive
attempts to obtain a successful response from an endpoint before
requests are paused or the failover is triggered) from 28 to 8.
- Lowers the "circuit break duration" (the period during which requests
to an unavailable endpoint will be paused) from 30 minutes to 30
*seconds*.
In summary, if a network starts to become degraded or the user is
experiencing connection issues, the network is more likely to be flagged
as unavailable, but if the situation improves the user may be able to
use the network more quickly.
How quickly does the circuit break now? It depends on whether the user
is using Chrome or Firefox and whether the errors encountered are
retriable or non-retriable:
- Retriable errors (e.g. connection errors, 502/503/504, etc.) will, as
the name implies, be automatically retried. If these errors are
continually produced, the circuit will break very quickly (if the
extension is restarted, then it will break immediately).
- Non-retriable errors (e.g. 4xx errors) do not get automatically
retried, so it takes longer for the circuit to break (if the extension
is restarted, on average it will take about 1 minute).
- Note that Chrome implements "anti-DDoS throttling logic" which means
that some non-retriable errors will turn into retriable errors. In this
situation the circuit breaks faster than it would on Firefox.
[](https://codespaces.new/MetaMask/metamask-extension/pull/37002?quickstart=1)
## **Changelog**
<!--
If this PR is not End-User-Facing and should not show up in the
CHANGELOG, you can choose to either:
1. Write `CHANGELOG entry: null`
2. Label with `no-changelog`
If this PR is End-User-Facing, please write a short User-Facing
description in the past tense like:
`CHANGELOG entry: Added a new tab for users to see their NFTs`
`CHANGELOG entry: Fixed a bug that was causing some NFTs to flicker`
(This helps the Release Engineer do their job more quickly and
accurately)
-->
CHANGELOG entry: Decrease time before activating QuickNode when Infura
is degraded or unavailable; decrease time before allowing users to
interact with a custom network following connection issues
## **Related issues**
- https://consensyssoftware.atlassian.net/browse/WPC-103
## **Manual testing steps**
### Prerequisites
1. Install
[FoxyProxy](https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp).
Pin it.
2. Open the options for FoxyProxy (click on its icon and go to Options).
3. Click on the Proxies tab, then click on Add. A new box should appear.
4. Under "Hostname", enter `localhost`; under "Port", enter `8080`;
enter some name in "Title".
5. Next to "Proxy by Patterns", click the plus icon. You should see a
new row appear.
6. In the `://example.com` field, enter `https://*.infura.io/*`; in the
"title" field, enter some name, like "Localhost". (This will ensure that
only requests to Infura go through the proxy.)
8. Install
[`mitmproxy`](https://docs.mitmproxy.org/stable/overview/installation/).
10. Create a Python script somewhere on your computer with the following
contents:
https://gist.github.com/mcmire/1d43ce690d3a974217126cd584f79b7d. This
script will cause all requests to Infura RPC endpoints to respond with
500, simulating an outage.
11. Run `mitmproxy -s <path to your script>` in an open terminal
session. This will run the proxy server.
12. Go back to FoxyProxy and enable the proxy you created earlier by
clicking on the icon, then choosing the name of the new proxy (e.g.
Localhost).
13. Check out this branch. Run `yarn`, then `yarn start`.
14. Ensure that you've added the local version of MetaMask to your
browser and no other versions are present.
15. Open MetaMask, and go through onboarding.
16. Open the DevTools for the extension (Chrome: go to Extensions, find
MetaMask, click on "service worker"; in Firefox, go to
`about:debugging`, find MetaMask, click on "Inspect").
### Testing non-retriable errors
1. Ensure that in the `mitmproxy` script, all Infura endpoints return
500.
2. Restart the extension.
3. Switch to DevTools. You should see request errors like "RPC endpoint
not found or unavailable" in the Console view.
5. Open MetaMask and switch to full-screen view so that it stays open.
6. Switch back to DevTools and wait. After about 20 seconds, you should
see errors that say "RPC endpoint returned too many errors".
### Testing retriable errors
1. Stop `mitmproxy`. Modify the script so that all Infura endpoints
return 503 instead of 500. Restart it.
2. Restart the extension.
3. Switch to DevTools. You should see a flurry of messages, and at the
end you should see an error that says "RPC endpoint returned too many
errors".
## **Screenshots/Recordings**
<!-- If applicable, add screenshots and/or recordings to visualize the
before and after of your change. -->
### **Before**
<!-- [screenshots/recordings] -->
### **After**
#### Chrome
In this video, the endpoint continually responds with 500 (a
non-retriable error):
https://github.com/user-attachments/assets/fc8956d2-ec43-4ba6-a4c0-8e68b72b90c4
In this video, the endpoint continually responds with 503 (a retriable
error):
https://github.com/user-attachments/assets/d449e404-d755-4d81-be94-0bee44174354
#### Firefox
In this video, the endpoint continually responds with 500 (a
non-retriable error):
https://github.com/user-attachments/assets/6500f3a2-b0e8-459f-86f6-20f15044ecda
In this video, the endpoint continually responds with 503 (a retriable
error):
https://github.com/user-attachments/assets/84361615-e6a9-48a0-aab4-13ca016eec2e
## **Pre-merge author checklist**
- [x] I've followed [MetaMask Contributor
Docs](https://github.com/MetaMask/contributor-docs) and [MetaMask
Extension Coding
Standards](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/CODING_GUIDELINES.md).
- [x] I've completed the PR template to the best of my ability
- [x] I’ve included tests if applicable
- [x] I’ve documented my code using [JSDoc](https://jsdoc.app/) format
if applicable
- [x] I’ve applied the right labels on the PR (see [labeling
guidelines](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/LABELING_GUIDELINES.md)).
Not required for external contributors.
## **Pre-merge reviewer checklist**
- [ ] I've manually tested the PR (e.g. pull and build branch, run the
app, test code being changed).
- [ ] I confirm that this PR addresses all acceptance criteria described
in the ticket it closes and includes the necessary testing evidence such
as recordings and or screenshots.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Tightens RPC circuit-breaker settings by using DEFAULT_MAX_RETRIES,
setting a 30s cooldown, and reducing maxConsecutiveFailures for
QuickNode and other endpoints.
>
> - **Network Controller Init
(`app/scripts/controller-init/network-controller-init.ts`)**
> - Use `DEFAULT_MAX_RETRIES` for RPC retries.
> - Introduce shared `policyOptions` with `circuitBreakDuration: 30 *
SECOND`.
> - Adjust `maxConsecutiveFailures`:
> - QuickNode endpoints: `(maxRetries + 1) * 10`.
> - Other endpoints: `(maxRetries + 1) * 3`.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
bb71a1c. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->1 parent 790cc12 commit b4ce2c1
1 file changed
+29
-12
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
7 | | - | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
8 | 12 | | |
9 | 13 | | |
10 | 14 | | |
| |||
173 | 177 | | |
174 | 178 | | |
175 | 179 | | |
176 | | - | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
177 | 183 | | |
178 | 184 | | |
179 | 185 | | |
180 | 186 | | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
181 | 192 | | |
182 | 193 | | |
183 | 194 | | |
184 | 195 | | |
185 | 196 | | |
186 | | - | |
187 | | - | |
188 | | - | |
189 | | - | |
190 | | - | |
191 | | - | |
192 | | - | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
193 | 205 | | |
194 | 206 | | |
195 | 207 | | |
196 | 208 | | |
197 | 209 | | |
198 | 210 | | |
199 | 211 | | |
200 | | - | |
201 | | - | |
202 | | - | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
203 | 220 | | |
204 | 221 | | |
205 | 222 | | |
| |||
0 commit comments