fix: Decrease threshold before RPC failover is activated #37002
Conversation
cryptodev-2s
left a comment
LGTM!
Gudahtt
left a comment
LGTM!
MajorLift
left a comment
LGTM!
Description
The network request code within the client makes use of the circuit breaker pattern. Currently, there must be 28 consecutive failed attempts to hit an RPC endpoint before the circuit breaks and it is perceived to be unavailable. At this point, if the endpoint is configured with a failover, the failover will be activated and all requests will be automatically diverted to it; otherwise requests are paused for 30 minutes.
There are two problems we are seeing now:
- If Infura is degraded enough that failing over to QuickNode is warranted, the failover may be activated too late.
- If the user is attempting to access a custom network and is experiencing issues (either due to a local connection problem or an issue with the endpoint itself), they will be prevented from using that network for 30 minutes. This is far too long.
To fix these problems, this commit:
- Lowers the "max consecutive failures" (the number of successive failed attempts to obtain a response from an endpoint before requests are paused or the failover is triggered) from 28 to 8.
- Lowers the "circuit break duration" (the period during which requests to an unavailable endpoint are paused) from 30 minutes to 30 *seconds*.
In summary, if a network starts to become degraded or the user is experiencing connection issues, the network is more likely to be flagged as unavailable, but if the situation improves the user may be able to use the network more quickly.
How quickly does the circuit break now? It depends on whether the user is using Chrome or Firefox and whether the errors encountered are retriable or non-retriable:
- Retriable errors (e.g. connection errors, 502/503/504, etc.) will, as the name implies, be automatically retried. If these errors are continually produced, the circuit will break very quickly (if the extension is restarted, it will break immediately).
- Non-retriable errors (e.g. 4xx errors) are not automatically retried, so it takes longer for the circuit to break (if the extension is restarted, on average it takes about 1 minute).
- Note that Chrome implements "anti-DDoS throttling logic", which means that some non-retriable errors turn into retriable errors. In this situation the circuit breaks faster than it would on Firefox.
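For readers unfamiliar with the pattern, here is a minimal, illustrative sketch of retry plus consecutive-failure circuit breaking using the cockatiel library. It is not the exact policy wiring used by the client, and the endpoint URL and retry count are placeholders:

```ts
import {
  ConsecutiveBreaker,
  ExponentialBackoff,
  circuitBreaker,
  handleAll,
  retry,
  wrap,
} from 'cockatiel';

// Retry failed attempts with exponential backoff; because the breaker sits inside the
// retry policy, every failed attempt also counts toward the breaker's failure streak.
const retryPolicy = retry(handleAll, {
  maxAttempts: 4, // placeholder; the client derives its retry count from DEFAULT_MAX_RETRIES
  backoff: new ExponentialBackoff(),
});

// Open the circuit after 8 consecutive failures and keep it open for 30 seconds,
// mirroring the new "max consecutive failures" and "circuit break duration" values.
const breakerPolicy = circuitBreaker(handleAll, {
  halfOpenAfter: 30 * 1000,
  breaker: new ConsecutiveBreaker(8),
});

const policy = wrap(retryPolicy, breakerPolicy);

// While the circuit is open, execute() rejects immediately with BrokenCircuitError;
// that is the point at which a configured failover endpoint would be used instead.
async function requestRpc(rpcUrl: string, body: unknown): Promise<unknown> {
  return policy.execute(async () => {
    const response = await fetch(rpcUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    return response.json();
  });
}
```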
Changelog
CHANGELOG entry: Decrease time before activating QuickNode when Infura is degraded or unavailable; decrease time before allowing users to interact with a custom network following connection issues
Related issues
Manual testing steps
Prerequisites
localhost; under "Port", enter8080; enter some name in "Title".://example.comfield, enterhttps://*.infura.io/*; in the "title" field, enter some name, like "Localhost". (This will ensure that only requests to Infura go through the proxy.)mitmproxy.mitmproxy -s <path to your script>in an open terminal session. This will run the proxy server.yarn, thenyarn start.about:debugging, find MetaMask, click on "Inspect").Testing non-retriable errors
- Per the `mitmproxy` script, all Infura endpoints return 500.

Testing retriable errors
- Stop `mitmproxy`. Modify the script so that all Infura endpoints return 503 instead of 500. Restart it.

Screenshots/Recordings
Before
After
Chrome
In this video, the endpoint continually responds with 500 (a non-retriable error):
extension.-.reduced.circuit.break.duration.-.chrome.-.500.mov
In this video, the endpoint continually responds with 503 (a retriable error):
extension.-.reduced.circuit.break.duration.-.chrome.-.503.mov
Firefox
In this video, the endpoint continually responds with 500 (a non-retriable error):
extension.-.reduced.circuit.break.duration.-.firefox.-.500.mov
In this video, the endpoint continually responds with 503 (a retriable error):
extension.-.reduced.circuit.break.duration.-.firefox.-.503.mov
Pre-merge author checklist
Pre-merge reviewer checklist
Note
Tightens RPC circuit-breaker settings by using `DEFAULT_MAX_RETRIES`, setting a 30s cooldown, and reducing `maxConsecutiveFailures` for QuickNode and other endpoints.
- `app/scripts/controller-init/network-controller-init.ts`: uses `DEFAULT_MAX_RETRIES` for RPC retries; configures `policyOptions` with `circuitBreakDuration: 30 * SECOND`; sets `maxConsecutiveFailures` using `(maxRetries + 1) * 10` and `(maxRetries + 1) * 3` multipliers.

Written by Cursor Bugbot for commit bb71a1c.
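For illustration only, the options described above might be shaped roughly as in the sketch below; the `SECOND` constant, the concrete `maxRetries` value, and the assignment of the two multipliers to endpoint classes are assumptions, not the literal contents of `network-controller-init.ts`:

```ts
// Illustrative sketch; only the identifiers quoted in the summary above come from the PR.
const SECOND = 1_000; // milliseconds (assumption: defined elsewhere in the codebase)

// Assumption: DEFAULT_MAX_RETRIES is a small integer; 4 is used here purely for illustration.
const maxRetries = 4;

const policyOptions = {
  // Pause requests to a broken endpoint for 30 seconds (down from 30 minutes).
  circuitBreakDuration: 30 * SECOND,
};

// The summary lists two multipliers for maxConsecutiveFailures; which endpoint class
// (QuickNode failovers vs. other endpoints) receives which value is not stated there.
const maxConsecutiveFailuresHigh = (maxRetries + 1) * 10; // 50 with maxRetries = 4
const maxConsecutiveFailuresLow = (maxRetries + 1) * 3; // 15 with maxRetries = 4
```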