@mcmire mcmire commented Oct 20, 2025

Description

The network request code within the client makes use of the circuit breaker pattern. Currently, there must be 28 consecutive failed attempts to hit an RPC endpoint before the circuit breaks and it is perceived to be unavailable. At this point, if the endpoint is configured with a failover, the failover will be activated and all requests will be automatically diverted to it; otherwise requests are paused for 30 minutes.
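
To make the pattern concrete, here is a minimal TypeScript sketch of a consecutive-failure circuit breaker with an optional failover. It is an illustration under stated assumptions, not the client's actual code: `CircuitBreaker`, `chooseEndpoint`, and the example URLs are hypothetical, and a production breaker would typically also have a half-open state, which this sketch leaves out.

```typescript
// Illustrative only: all names here are hypothetical, not the client's API.
type BreakerOptions = {
  maxConsecutiveFailures: number; // failures in a row before the circuit breaks
  circuitBreakDuration: number; // how long requests stay paused, in milliseconds
};

class CircuitBreaker {
  private readonly options: BreakerOptions;
  private consecutiveFailures = 0;
  private brokenAt: number | null = null;

  constructor(options: BreakerOptions) {
    this.options = options;
  }

  // Returns true while the endpoint is treated as unavailable.
  isBroken(now: number): boolean {
    if (this.brokenAt === null) {
      return false;
    }
    if (now - this.brokenAt >= this.options.circuitBreakDuration) {
      // The pause has elapsed: reset and let requests through again.
      this.brokenAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.options.maxConsecutiveFailures) {
      this.brokenAt = now;
    }
  }
}

// Decides where the next request goes; null means requests are paused.
function chooseEndpoint(
  breaker: CircuitBreaker,
  primaryUrl: string,
  failoverUrl: string | undefined,
  now: number,
): string | null {
  if (!breaker.isBroken(now)) {
    return primaryUrl;
  }
  return failoverUrl ?? null;
}
```

With the pre-change settings (28 failures, 30 minutes), `chooseEndpoint` keeps returning the primary URL until the 28th consecutive failure. After that it returns the failover URL, or `null` (paused) when no failover is configured, until the break duration has elapsed.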

There are two problems we are seeing now:

  • If Infura is degraded enough that failing over to QuickNode is warranted, the failover may be activated too late.
  • If the user is attempting to access a custom network and they are experiencing issues (either due to a local connection issue or an issue with the endpoint itself), they will be prevented from using that network for 30 minutes. This is way too long.

To fix these problems, this commit:

  • Lowers the "max consecutive failures" (the number of successive attempts to obtain a successful response from an endpoint before requests are paused or the failover is triggered) from 28 to 8.
  • Lowers the "circuit break duration" (the period during which requests to an unavailable endpoint will be paused) from 30 minutes to 30 seconds.

In summary, if a network starts to become degraded or the user is experiencing connection issues, the network is more likely to be flagged as unavailable, but if the situation improves the user may be able to use the network more quickly.

How quickly does the circuit break now? It depends on whether the user is using Chrome or Firefox and whether the errors encountered are retriable or non-retriable:

  • Retriable errors (e.g. connection errors, 502/503/504, etc.) will, as the name implies, be automatically retried. If these errors are continually produced, the circuit will break very quickly (if the extension is restarted, then it will break immediately).
  • Non-retriable errors (e.g. 4xx errors, or 500) are not retried automatically, so the circuit takes longer to break (if the extension is restarted, it takes about 1 minute on average).
  • Note that Chrome implements "anti-DDoS throttling logic" which means that some non-retriable errors will turn into retriable errors. In this situation the circuit breaks faster than it would on Firefox.
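
The difference between the two error classes can be sketched as simple arithmetic. The snippet below is a rough model, not the client's code: it assumes every failed attempt, including each automatic retry, counts toward the threshold, and it uses a retry count of 3 purely as a placeholder, since the description does not state the real value. `requestsUntilBreak` and `ASSUMED_MAX_RETRIES` are hypothetical names.

```typescript
// Hypothetical helper: how many top-level requests must fail before the
// consecutive-failure threshold is reached.
function requestsUntilBreak(
  maxConsecutiveFailures: number,
  maxRetries: number, // assumed retry count; not stated in the description
  retriable: boolean,
): number {
  // A retriable error is attempted once plus up to `maxRetries` more times;
  // a non-retriable error counts as a single failed attempt.
  const failuresPerRequest = retriable ? maxRetries + 1 : 1;
  return Math.ceil(maxConsecutiveFailures / failuresPerRequest);
}

const ASSUMED_MAX_RETRIES = 3; // placeholder value for illustration

const retriableRequests = requestsUntilBreak(8, ASSUMED_MAX_RETRIES, true);
const nonRetriableRequests = requestsUntilBreak(8, ASSUMED_MAX_RETRIES, false);
console.log({ retriableRequests, nonRetriableRequests });
```

Under these assumptions, a threshold of 8 is reached after 2 requests that keep hitting retriable errors, but only after 8 separate requests that hit non-retriable errors. That is consistent with the circuit breaking almost immediately in the first case and taking noticeably longer in the second.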

Changelog

CHANGELOG entry: Decrease time before activating QuickNode when Infura is degraded or unavailable; decrease time before allowing users to interact with a custom network following connection issues

Related issues

Manual testing steps

Prerequisites

  1. Install FoxyProxy. Pin it.
  2. Open the options for FoxyProxy (click on its icon and go to Options).
  3. Click on the Proxies tab, then click on Add. A new box should appear.
  4. Under "Hostname", enter localhost; under "Port", enter 8080; enter some name in "Title".
  5. Next to "Proxy by Patterns", click the plus icon. You should see a new row appear.
  6. In the ://example.com field, enter https://*.infura.io/*; in the "title" field, enter some name, like "Localhost". (This will ensure that only requests to Infura go through the proxy.)
  7. Install mitmproxy.
  8. Create a Python script somewhere on your computer with the following contents: https://gist.github.com/mcmire/1d43ce690d3a974217126cd584f79b7d. This script will cause all requests to Infura RPC endpoints to respond with 500, simulating an outage.
  9. Run mitmproxy -s <path to your script> in an open terminal session. This will run the proxy server.
  10. Go back to FoxyProxy and enable the proxy you created earlier by clicking on the icon, then choosing the name of the new proxy (e.g. Localhost).
  11. Check out this branch. Run yarn, then yarn start.
  12. Ensure that you've added the local version of MetaMask to your browser and no other versions are present.
  13. Open MetaMask, and go through onboarding.
  14. Open the DevTools for the extension (in Chrome, go to Extensions, find MetaMask, and click on "service worker"; in Firefox, go to about:debugging, find MetaMask, and click on "Inspect").

Testing non-retriable errors

  1. Ensure that in the mitmproxy script, all Infura endpoints return 500.
  2. Restart the extension.
  3. Switch to DevTools. You should see request errors like "RPC endpoint not found or unavailable" in the Console view.
  4. Open MetaMask and switch to full-screen view so that it stays open.
  5. Switch back to DevTools and wait. After about 1 minute, you should see errors that say "RPC endpoint returned too many errors".

Testing retriable errors

  1. Stop mitmproxy. Modify the script so that all Infura endpoints return 503 instead of 500. Restart it.
  2. Restart the extension.
  3. Switch to DevTools. You should see a flurry of messages, and at the end you should see an error that says "RPC endpoint returned too many errors".

Screenshots/Recordings

Before

After

Chrome

In this video, the endpoint continually responds with 500 (a non-retriable error):

extension.-.reduced.circuit.break.duration.-.chrome.-.500.mov

In this video, the endpoint continually responds with 503 (a retriable error):

extension.-.reduced.circuit.break.duration.-.chrome.-.503.mov

Firefox

In this video, the endpoint continually responds with 500 (a non-retriable error):

extension.-.reduced.circuit.break.duration.-.firefox.-.500.mov

In this video, the endpoint continually responds with 503 (a retriable error):

extension.-.reduced.circuit.break.duration.-.firefox.-.503.mov

Pre-merge author checklist

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and/or screenshots.

Note

Tightens RPC circuit-breaker settings by using DEFAULT_MAX_RETRIES, setting a 30s cooldown, and reducing maxConsecutiveFailures for QuickNode and other endpoints.

  • Network Controller Init (app/scripts/controller-init/network-controller-init.ts)
    • Use DEFAULT_MAX_RETRIES for RPC retries.
    • Introduce shared policyOptions with circuitBreakDuration: 30 * SECOND.
    • Adjust maxConsecutiveFailures:
      • QuickNode endpoints: (maxRetries + 1) * 10.
      • Other endpoints: (maxRetries + 1) * 3.
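
The settings summarized above could look roughly like the following. This is a sketch based only on the summary, not a copy of `network-controller-init.ts`: the values of `SECOND` and `DEFAULT_MAX_RETRIES` are assumptions (the summary names them without stating them), and `buildPolicyOptions` with its `isQuickNode` flag is a hypothetical helper.

```typescript
// Assumed values: the summary names these constants but does not state them.
const SECOND = 1000; // milliseconds
const DEFAULT_MAX_RETRIES = 3; // placeholder for illustration only

type PolicyOptions = {
  maxRetries: number;
  maxConsecutiveFailures: number;
  circuitBreakDuration: number; // milliseconds
};

// Hypothetical helper; `isQuickNode` stands in for however the real code
// distinguishes QuickNode endpoints from the rest.
function buildPolicyOptions(isQuickNode: boolean): PolicyOptions {
  const maxRetries = DEFAULT_MAX_RETRIES;
  return {
    maxRetries,
    circuitBreakDuration: 30 * SECOND,
    maxConsecutiveFailures: (maxRetries + 1) * (isQuickNode ? 10 : 3),
  };
}
```

With the placeholder retry count of 3, this yields a threshold of 40 failed attempts for QuickNode endpoints and 12 for all others, each with a 30-second break duration. The real numbers depend on the actual value of `DEFAULT_MAX_RETRIES`.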

Written by Cursor Bugbot for commit bb71a1c.

@github-actions

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbot metamaskbot added the team-core-platform Core Platform team label Oct 20, 2025
@metamaskbot

📊 Page Load Benchmark Results

Current Commit: 72f9f8a | Date: 10/20/2025

📄 Localhost MetaMask Test Dapp

Samples: 100

Summary

  • pageLoadTime-> current mean value: 1.04s (±72ms) 🟡 | historical mean value: 1.04s ⬇️ (historical data)
  • domContentLoaded-> current mean value: 733ms (±70ms) 🟢 | historical mean value: 737ms ⬇️ (historical data)
  • firstContentfulPaint-> current mean value: 78ms (±14ms) 🟢 | historical mean value: 80ms ⬇️ (historical data)

📈 Detailed Results

| Metric | Mean | Std Dev | Min | Max | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- |
| pageLoadTime | 1.04s | 72ms | 992ms | 1.34s | 1.27s | 1.34s |
| domContentLoaded | 733ms | 70ms | 690ms | 1.03s | 938ms | 1.03s |
| firstPaint | 78ms | 14ms | 60ms | 200ms | 92ms | 200ms |
| firstContentfulPaint | 78ms | 14ms | 60ms | 200ms | 92ms | 200ms |
| largestContentfulPaint | 0ms | 0ms | 0ms | 0ms | 0ms | 0ms |

Results generated automatically by MetaMask CI

@metamaskbot

Builds ready [72f9f8a]
UI Startup Metrics (1242 ± 71 ms)
[Startup metrics table not recoverable: the Platform, BuildType, Page, Metric, Mean (ms), Min (ms), Max (ms), Std Dev (ms), P 75 (ms), and P 95 (ms) cells were fused during extraction.]
Bundle size diffs [🚨 Warning! Bundle size has increased!]
  • background: 58 Bytes (0%)
  • ui: 0 Bytes (0%)
  • common: 10 Bytes (0%)

@mcmire mcmire force-pushed the reduce-circuit-break-threshold branch from 72f9f8a to 776b25f Compare October 28, 2025 19:08
@metamaskbot

📊 Page Load Benchmark Results

Current Commit: 776b25f | Date: 10/28/2025

📄 Localhost MetaMask Test Dapp

Samples: 100

Summary

  • pageLoadTime-> current mean value: 1.02s (±39ms) 🟡 | historical mean value: 1.03s ⬇️ (historical data)
  • domContentLoaded-> current mean value: 709ms (±36ms) 🟢 | historical mean value: 720ms ⬇️ (historical data)
  • firstContentfulPaint-> current mean value: 74ms (±11ms) 🟢 | historical mean value: 80ms ⬇️ (historical data)

📈 Detailed Results

| Metric | Mean | Std Dev | Min | Max | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- |
| pageLoadTime | 1.02s | 39ms | 991ms | 1.31s | 1.04s | 1.31s |
| domContentLoaded | 709ms | 36ms | 684ms | 983ms | 723ms | 983ms |
| firstPaint | 74ms | 11ms | 60ms | 172ms | 80ms | 172ms |
| firstContentfulPaint | 74ms | 11ms | 60ms | 172ms | 80ms | 172ms |
| largestContentfulPaint | 0ms | 0ms | 0ms | 0ms | 0ms | 0ms |

Results generated automatically by MetaMask CI

@metamaskbot

Builds ready [776b25f]
UI Startup Metrics (1280 ± 77 ms)
[Startup metrics table not recoverable: the Platform, BuildType, Page, Metric, Mean (ms), Min (ms), Max (ms), Std Dev (ms), P 75 (ms), and P 95 (ms) cells were fused during extraction.]
Bundle size diffs [🚨 Warning! Bundle size has increased!]
  • background: -21.71 KiB (-0.47%)
  • ui: 18.32 KiB (0.26%)
  • common: 23.67 KiB (0.27%)

@mcmire mcmire marked this pull request as ready for review October 28, 2025 21:34
cryptodev-2s previously approved these changes Oct 29, 2025

@cryptodev-2s cryptodev-2s left a comment

LGTM!

mikesposito previously approved these changes Oct 29, 2025
DDDDDanica previously approved these changes Oct 29, 2025

@DDDDDanica

LGTM !

@mcmire mcmire dismissed stale reviews from DDDDDanica, mikesposito, and cryptodev-2s via bb71a1c October 29, 2025 17:26
@metamaskbot

📊 Page Load Benchmark Results

Current Commit: bb71a1c | Date: 10/29/2025

📄 Localhost MetaMask Test Dapp

Samples: 100

Summary

  • pageLoadTime-> current mean value: 1.03s (±39ms) 🟡 | historical mean value: 1.06s ⬇️ (historical data)
  • domContentLoaded-> current mean value: 715ms (±36ms) 🟢 | historical mean value: 743ms ⬇️ (historical data)
  • firstContentfulPaint-> current mean value: 75ms (±11ms) 🟢 | historical mean value: 83ms ⬇️ (historical data)

📈 Detailed Results

| Metric | Mean | Std Dev | Min | Max | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- |
| pageLoadTime | 1.03s | 39ms | 1.00s | 1.31s | 1.05s | 1.31s |
| domContentLoaded | 715ms | 36ms | 694ms | 982ms | 732ms | 982ms |
| firstPaint | 75ms | 11ms | 56ms | 164ms | 84ms | 164ms |
| firstContentfulPaint | 75ms | 11ms | 56ms | 164ms | 84ms | 164ms |
| largestContentfulPaint | 0ms | 0ms | 0ms | 0ms | 0ms | 0ms |

Results generated automatically by MetaMask CI

@metamaskbot

Builds ready [bb71a1c]
UI Startup Metrics (1280 ± 84 ms)
[Startup metrics table not recoverable: the Platform, BuildType, Page, Metric, Mean (ms), Min (ms), Max (ms), Std Dev (ms), P 75 (ms), and P 95 (ms) cells were fused during extraction.]
Bundle size diffs [🚨 Warning! Bundle size has increased!]
  • background: -16.3 KiB (-0.35%)
  • ui: 24.19 KiB (0.34%)
  • common: 27.14 KiB (0.31%)

@Gudahtt Gudahtt left a comment

LGTM!

@MajorLift MajorLift left a comment

LGTM!

@mcmire mcmire added this pull request to the merge queue Oct 31, 2025
Merged via the queue into main with commit b4ce2c1 Oct 31, 2025
328 of 331 checks passed
@mcmire mcmire deleted the reduce-circuit-break-threshold branch October 31, 2025 16:17
@github-actions github-actions bot locked and limited conversation to collaborators Oct 31, 2025
@metamaskbot metamaskbot added the release-13.9.0 Issue or pull request that will be included in release 13.9.0 label Oct 31, 2025
@mcmire mcmire changed the title from "chore: Decrease threshold before RPC failover is activated" to "fix: Decrease threshold before RPC failover is activated" Nov 4, 2025
Labels

release-13.9.0 Issue or pull request that will be included in release 13.9.0 size-S team-core-platform Core Platform team
