Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: WebTransport session establishment failed. Too many pending WebTransport sessions (64) #1896

Open
SgtPooki opened this issue Jul 25, 2023 · 12 comments
Assignees
Labels
kind/bug A bug in existing code (including security flaws) status/blocked Unable to be worked further until needs are met

Comments

@SgtPooki
Copy link
Member

Sometimes I see thousands instances of this warning in Chrome:

WebTransport session establishment failed. Too many pending WebTransport sessions (64)

image

This module may need some sort of dial queue to ensure it doesn't open too many connections and trigger this error.

ported over from libp2p/js-libp2p-webtransport#64

@SgtPooki SgtPooki added the need/triage Needs initial labeling and prioritization label Jul 25, 2023
@SgtPooki SgtPooki changed the title WebTransport session establishment failed. Too many pending WebTransport sessions (64) bug: WebTransport session establishment failed. Too many pending WebTransport sessions (64) Jul 25, 2023
@SgtPooki SgtPooki added the kind/bug A bug in existing code (including security flaws) label Jul 25, 2023
@maschad maschad moved this to 🥞Weekly Candidates/Discuss🎙 in js-libp2p Jul 26, 2023
@maschad maschad self-assigned this Jul 26, 2023
@SgtPooki
Copy link
Member Author

SgtPooki commented Aug 8, 2023

I believe this is a blocker for ipfs/helia#182 because:

  1. webtransport is one of the few consistent ways we can connect to other nodes from the browser.
  2. A Helia nodes' connectivity become unstable once we have too many pending dials.

@maschad are you actively working on this?

@maschad
Copy link
Member

maschad commented Aug 9, 2023

I'm not actively working on it at the moment @SgtPooki although I think @achingbrain 's PR #1947 may be related.

@achingbrain
Copy link
Member

achingbrain commented Aug 9, 2023

I think #1947 will help with the unstable bit but I can't help but wonder if there's some cleanup we need to do that we're missing to prevent the "Too many pending sessions" thing in the first place.

@maschad maschad removed their assignment Aug 9, 2023
@SgtPooki SgtPooki self-assigned this Aug 11, 2023
achingbrain added a commit that referenced this issue Aug 15, 2023
Refactors session closing to happen in one function and call that
function when the session has closed or failed to init.

Doesn't quite solve the "Too many pending WebTransport Sessions"
problem but does slow it down a little bit.

Refs: #1896
achingbrain added a commit that referenced this issue Aug 15, 2023
…1969)

Refactors session closing to happen in one function and call that
function when the session has closed or failed to init.

Doesn't quite solve the "Too many pending WebTransport Sessions"
problem but does slow it down a little bit.

Refs: #1896
@achingbrain
Copy link
Member

achingbrain commented Aug 24, 2023

This may be a bug in Chrome.

When we forcibly close WebTransport connections whose .ready promise hasn't resolved within the connection timeout (the peer has gone away, is on a slow connection, is overloaded, is firewalled, etc), Chrome may not be cleaning up properly in the background, so we hit this limit.

More details here: https://bugs.chromium.org/p/chromium/issues/detail?id=1473980

@achingbrain
Copy link
Member

I've tried to add a global count to the WebTransport transport to ensure we don't go over 64 "pending" connections, taking "pending" as meaning "has yet to resolve/reject the .ready/.closed promises" but it doesn't solve the problem.

Counting the various WebTransport sessions that have been opened and what happened to them, it seems sessions that reject* their .ready/.closed promises are still counted as "pending".

Therefore regardless of any limit we set on how many connections we open simultaneously, once the number of errored connections plus the number of yet-to-resolve/reject connections reaches 65 no further WebTransport connections can be opened.

This is bad news and needs a browser fix because once 65 connections have errored it's essentially game over until the page is reloaded.

I've updated the chromium bug report with this information.


* = The rejection reasons are normal network things - an unreachable host, a handshake timeout, etc.

@achingbrain achingbrain added status/blocked Unable to be worked further until needs are met and removed need/triage Needs initial labeling and prioritization labels Aug 30, 2023
@achingbrain
Copy link
Member

achingbrain commented Aug 30, 2023

A comment on the Chromium bug links to this design doc - it seems Chromium unilaterally applies an anti-DOS measure by keeping "failed" connections in the "pending" state for 5 minutes after the failure.

This also seems to include sessions that have had their .close method called before .ready has resolved - which is how we cancel connections when (for example) dialling a peer on all available addresses then when one dial succeeds, aborting all the other dials.

This seriously limits the amount of connections that can be opened over time.

The Chromium bug is still valid, I think - because the 5 minute delay does not seem to be applied, failed connections are "pending" forever maybe not forever, but for a lot longer than 5 minutes.

I've put a simple demo page together here that doesn't have any libp2p code in it - https://webtransport-pending-sessions.on.fleek.co/

We can use this to see if the issue has been resolved over time.

Interestingly Firefox does not apply the 5 minute wait though it does crash quite reliably.

I've tried adding a dial queue to the WebTransport transport that applies the 5 minute wait for new dials once 64 have errored, but we request dial slots quicker than the old ones time out so everything sort of grinds to a halt.

We may be able to do something about this by increasing the auto dial retry threshold to something over 5 minutes, this should give Chromium enough time to reach it's internal timeout, after the bug that means it never reaches its internal timeout is fixed 🫠

@SgtPooki
Copy link
Member Author

Thanks for staying on top of this one and keeping us updated @achingbrain

@SgtPooki SgtPooki assigned achingbrain and unassigned SgtPooki Aug 30, 2023
@maschad maschad moved this from 🥞Weekly Candidates/Discuss🎙 to 🧱Blocked in js-libp2p Sep 5, 2023
This was referenced Jan 18, 2024
@achingbrain
Copy link
Member

@dhuseby dhuseby moved this from 🧱Blocked to 🏃‍♀️In Progress in js-libp2p Apr 30, 2024
@dhuseby
Copy link
Contributor

dhuseby commented Apr 30, 2024

@lidel and Javier from Igalia are working with the Chrome team to get a fix into Chrome. Firefox nightly does have WebTransport and seems to work.

@dhuseby
Copy link
Contributor

dhuseby commented Apr 30, 2024

link to test page: https://libp2p-webtransport-sessions.on.fleek.co/

@dhuseby dhuseby moved this from 🏃‍♀️In Progress to 🧱Blocked in js-libp2p May 7, 2024
@dhuseby
Copy link
Contributor

dhuseby commented May 7, 2024

Waiting on Igalia to submit a patch to Chrome that fixes this.

@achingbrain
Copy link
Member

Notes from Igalia work stream (under various Handling pending WebTransport sessions headers): https://hackmd.io/SaJIHZmyRUKfl_fQwoYfog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) status/blocked Unable to be worked further until needs are met
Projects
Status: 🧱Blocked
Development

Successfully merging a pull request may close this issue.

4 participants