r11s: enable socket.io long-polling as a fallback #8820
Conversation
Some observations:
This is correct when talking about the defaults used by socket.io:

```js
const socket = io("https://example.com", {
  transports: ["websocket", "polling"] // use WebSocket first, if available
});

socket.on("connect_error", () => {
  // revert to classic upgrade
  socket.io.opts.transports = ["polling", "websocket"];
});
```

That being said, looking at the code example from socket.io, it appears that in order for the failover to actually work we might need to add a `connect_error` handler like the one above.
See above, the "better" option here is what this PR does.
We have some very specific customers who have this requirement. Please reach out offline, and I can connect you with the right people to give you more information about the specific need.
It would be great to share your learnings in #7964, including impact on service (I believe there is non-zero impact here RE COGS).
⯅ @fluid-example/bundle-size-tests: +565 Bytes
Baseline commit: b71abdd
I was able to mimic a WebSocket failure -> polling downgrade using a TamperMonkey script that closes the websocket connection before the handshake can complete. Using this test, I was able to confirm that some additional changes were needed to make the failover actually happen. Previously, driver-base/DocumentDeltaConnection would fail and terminate on any `connect_error`; instead, we need to allow socket.io to re-attempt the connection with polling once before terminating. Please take a look at the latest commit to see how I implemented that. Scenarios I tested using Clicker, and their results:
Ideally, we would be able to tell when the `connect_error` is due to a transport-layer failure versus a socket.io server middleware error. I will continue to look into adding more specificity. TL;DR: this works, but I'm curious if there are other ideas on how to allow one reconnect attempt.
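For context, here is a minimal sketch of the kind of interception such a script can do (illustrative only, not the actual script used above): wrap the page's `WebSocket` constructor and close every connection while it is still CONNECTING, which fails the handshake and surfaces as a transport error to socket.io.

```ts
// Illustrative only: make every WebSocket fail before it opens, so the
// socket.io "websocket" transport errors out and the downgrade path runs.
const NativeWebSocket = window.WebSocket;
window.WebSocket = function (url: string, protocols?: string | string[]) {
    const ws = new NativeWebSocket(url, protocols);
    ws.close(); // closing while still CONNECTING fails the handshake
    return ws;
} as unknown as typeof WebSocket;
```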
@GaryWilber Could you take a look at this?
packages/drivers/routerlicious-driver/src/documentDeltaConnection.ts
One way I've heard of people blocking the WebSocket upgrade is a proxy inspecting traffic and dropping HTTP headers (I can probably try to dig up where I heard about it).
@vladsud that's a great idea. I think I can probably find a way to configure nginx in front of a local R11s to prevent the websocket upgrade.
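For instance, a minimal nginx sketch (assuming a local R11s ordering service on port 3003; adjust for your setup): nginx does not forward the hop-by-hop `Upgrade`/`Connection` headers by default, so omitting the usual websocket proxy directives blocks the upgrade and forces socket.io down to long-polling.

```nginx
server {
    listen 8080;
    location /socket.io/ {
        proxy_pass http://localhost:3003;
        proxy_http_version 1.1;
        # Intentionally omitted -- these are what would enable the upgrade:
        # proxy_set_header Upgrade $http_upgrade;
        # proxy_set_header Connection "upgrade";
    }
}
```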
```diff
@@ -365,7 +367,15 @@ export class DocumentDeltaConnection
                 description.target = undefined;
             }
         } catch (_e) {}
-        fail(true, this.createErrorObject("connectError", error));
+        if (!this.socket.io.reconnection() ||
+            this.reconnectAttempts >= (this.socket.io.reconnectionAttempts() ?? 0)) {
```
It would be great to incorporate proper telemetry into this flow from the start (see the sketch below):
- events for reconnection attempts (and their duration / protocol)
- once a connection is established, logging (in some form) whether long-polling or websocket is used

Ideally the form the telemetry takes is left to the caller, so maybe we just need hooks for each driver to be able to leverage? Maybe not, and this layer simply issuing events is fine.
@jatgarg, @markfields - FYI
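To illustrate the second bullet with socket.io's public client API (this sketch is not from the PR): the engine exposes the active transport's name and emits `upgrade` when it moves from polling to websocket, either of which could feed a telemetry event.

```ts
import { io } from "socket.io-client";

const socket = io("https://example.com", {
    transports: ["websocket", "polling"],
});

socket.on("connect", () => {
    // "websocket" or "polling" -- worth attaching to a telemetry event.
    console.log("connected via:", socket.io.engine.transport.name);

    // Fired if engine.io later upgrades the transport (polling -> websocket).
    socket.io.engine.on("upgrade", (transport) => {
        console.log("upgraded to:", transport.name);
    });
});
```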
Seems like since the logic is shared, the logging can live here too, but maybe I'm not seeing the benefit of each driver having more flexibility.
I'll be curious to see how often this happens, and how spammy these logs are for tenants configured this way. If there's some other error that triggers this reconnect but never succeeds, then we'll just sit and log the same error every 8 seconds (see `connectCore`). #8411 would help...
Created #8968 to track this as a follow-up to this PR.
```diff
@@ -33,9 +33,25 @@ export class R11sDocumentDeltaConnection extends DocumentDeltaConnection
                 tenantId,
             },
             reconnection: false,
-            transports: ["websocket"],
+            // Enable long-polling as a downgrade option when WebSockets are not available
+            transports: ["websocket", "polling"],
```
Curious, does it matter if polling is added here, given that you change transports below (on error)? Or do both places need to have it in the list?
Maybe another way to ask: if (for example) a host wants to control this behavior, where would we put the IF condition to disable/enable long-polling? Should we do it from the start (and bake it into the generic layer) instead of having it here, with each driver just controlling policy? That would be easier down the road for ODSP, and would avoid replicating bugs across layers due to duplicated code.
Putting it here allows polling to be used in scenarios where websockets are not available in the given runtime (e.g. IE 9). I think IE 9 is currently the only browser that doesn't support websockets, and engine.io-client has built-in isomorphic support, so it also works in Node.js via the `ws` package. Realistically, adding this line here adds very niche support, but I figured it's not really hurting anything.
As for where to put an IF condition, I was under the impression that we didn't want to bake socket.io specifics into the generic layer in case a driver/service wanted to use a different WebSockets implementation, but that might be an outdated understanding. If it stays here in the per-driver layer, I believe the IF condition would have to go both around this config line and around the following `.on("connect_error")` handler, since that is really where the majority of polling use cases are covered (see the sketch below).
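Concretely, something along these lines (a hypothetical sketch; `enableLongPolling` is an illustrative flag, not a real option in the driver):

```ts
import { io } from "socket.io-client";

// Hypothetical driver-level policy flag.
const enableLongPolling = true;

const socket = io("https://example.com", {
    reconnection: false,
    // Gate #1: include polling for runtimes where websockets are unavailable.
    transports: enableLongPolling ? ["websocket", "polling"] : ["websocket"],
});

socket.on("connect_error", () => {
    if (enableLongPolling) {
        // Gate #2: after a websocket failure, retry with polling first.
        socket.io.opts.transports = ["polling", "websocket"];
    }
});
```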
Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of one).
I'm not sure what you mean by a different WebSocket implementation. If a driver chooses to use this class, then it agrees to use socket.io and the transport protocols required/supported by this layer. This feels like a standard code-reuse question: I do not see much value in duplicating the same logic across drivers. If we expect the same logic in other drivers that want long-polling, then I'd rather move it to the shared layer and expose a policy to enable/disable it.
I'd go even further: you want a single driver to have the ability to easily control it, and thus allow end experiences to evaluate/enable it slowly, observing impact (COGS, perf, etc.) and potentially cutting it off through dynamic feature gates if things go wrong.
I can also see (maybe in some distant future) requests of the form: could you please cache that a given tenant does not support websockets, and thus move to long-polling faster? I'd prefer to give other layers (hosts) the tools to implement something like that, and simply expose enough policy controls for them to do so, versus baking it into drivers.
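A rough sketch of the shape being suggested here, with entirely hypothetical names: the shared layer owns the downgrade mechanism, while drivers and hosts only pass policy.

```ts
// Hypothetical policy surface -- names are illustrative, not from the codebase.
interface IDeltaConnectionPolicies {
    // Off by default; hosts can flip it through a dynamic feature gate.
    enableLongPollingDowngrade?: boolean;
}

// The shared layer consumes the policy; drivers just forward it.
class SharedDeltaConnection {
    constructor(private readonly policies: IDeltaConnectionPolicies = {}) {}

    protected get enableLongPollingDowngrades(): boolean {
        return this.policies.enableLongPollingDowngrade ?? false;
    }
}
```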
> Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of).

Nope, I just now found the FluidFramework browser support doc. I think I can safely remove this additional polling config, but will verify that everything still works.

> Other stuff

Totally agree. Thank you for explaining that driver-base's DocumentDeltaConnection requires socket.io (looking at the types now, this should have been obvious). I will work on an update to this PR that moves this handling into the shared space.
(force-pushed from 9772afe to ba448b2)
```diff
@@ -365,7 +368,24 @@ export class DocumentDeltaConnection
                 description.target = undefined;
             }
         } catch (_e) {}
-        fail(true, this.createErrorObject("connectError", error));
+        if (this.enableLongPollingDowngrades) {
```
I'd prefer a bit of a "defense in depth" approach here.
For example, can we get into `if (this.enableLongPollingDowngrades)` multiple times? It feels like the answer should be no. If that's the case, would it be worth resetting `this.enableLongPollingDowngrades` so that if we were to get here (for whatever reason) a second time, we bail out with an error right away?
Similarly, what happens if `this.enableLongPollingDowngrades && error?.type !== "TransportError"`?
Or `!this.enableLongPollingDowngrades && this.socket.io.reconnection()`?
Are these supported modes?
Can we have a couple of special conditions (escapes), with all other code paths (catch-all) resulting in `fail()` like it used to be?
I hear what you're saying. Looking at this if-chunk again, I agree that it's not clear what the behavior is in these scenarios, and even though the behavior is as expected for each of your described scenarios, the code path is not clean (it goes into `if ()` blocks unnecessarily), so I will try to clean it up. The current logic also assumes that the driver layer initially sets `reconnection: false`, which I don't think is ideal.
Okay, I've updated the logic to be much more specific.
It will only enter the long-polling downgrade block if (1) it's a websocket transport error, (2) the long-polling downgrade is enabled, and (3) the polling upgrade mechanism (`transports: ["polling", ...]`) is not already in place. Additionally, socket.io's internal reconnection behavior is not altered if it was already enabled by the driver layer. Lastly, the "reconnection enabled + reconnect attempts < max reconnect attempts" check is now an escape rather than a condition to fail.
Let me know what you think, thanks!
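For anyone following along, here is a rough sketch of the handler shape being described, as it might sit inside the class's connection setup. This is hypothetical, not the PR's verbatim code: `enableLongPollingDowngrades`, `reconnectAttempts`, `fail`, and `createErrorObject` echo the names in the diffs above.

```ts
this.socket.on("connect_error", (error: any) => {
    // (1) websocket transport error, and (3) the polling downgrade is not
    // already in place (i.e. "websocket" is still the first transport):
    const isWebSocketTransportError =
        error?.type === "TransportError" &&
        this.socket.io.opts.transports?.[0] === "websocket";

    // (2) the long-polling downgrade is enabled:
    if (isWebSocketTransportError && this.enableLongPollingDowngrades) {
        // Retry with long-polling first on the next attempt.
        this.socket.io.opts.transports = ["polling", "websocket"];
        if (!this.socket.io.reconnection()) {
            // Grant a single retry, but only if the driver layer didn't
            // already enable socket.io's internal reconnection itself.
            this.socket.io.reconnection(true);
            this.socket.io.reconnectionAttempts(1);
        }
        return; // escape: let socket.io retry with polling
    }

    if (this.socket.io.reconnection() &&
        this.reconnectAttempts < this.socket.io.reconnectionAttempts()) {
        // Escape: internal reconnect attempts remain, so don't fail yet.
        this.reconnectAttempts++;
        return;
    }

    // Catch-all: every other path terminates with an error, as before.
    fail(true, this.createErrorObject("connectError", error));
});
```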
packages/drivers/routerlicious-driver/src/documentDeltaConnection.ts
(force-pushed from ba448b2 to 96ea006)
This change allows long-polling as a fallback/downgrade option in the r11s driver and the r11s server. We should be able to release the driver change before the server change, because if WebSockets are not available in a client's browser/network, then their scenario is already broken today.