r11s: enable socket.io long-polling as a fallback #8820
Conversation
Some observations:
This is correct when talking about the defaults used by socket.io:

```js
const socket = io("https://example.com", {
  transports: ["websocket", "polling"] // use WebSocket first, if available
});

socket.on("connect_error", () => {
  // revert to classic upgrade
  socket.io.opts.transports = ["polling", "websocket"];
});
```

That being said, looking at the code example from socket.io, it appears that in order for the failover to actually work we might need to add a `connect_error` handler like the one above.
See above, the "better" option here is what this PR does.
We have some very specific customers who have this requirement. Please reach out offline, and I can connect you with the right people to give you more information about the specific need.
It would be great to share your learnings in #7964, including impact on service (I believe there is non-zero impact here RE COGS).
⯅ @fluid-example/bundle-size-tests: +565 Bytes
Baseline commit: b71abdd
I was able to mimic a WebSocket failure -> polling downgrade using a TamperMonkey script that closes the websocket connection before the handshake can complete. Using this test, I was able to confirm that some additional changes were needed to make the failover actually happen. Previously, driver-base/DocumentDeltaConnection would fail and terminate on any `connect_error`; instead, we need to allow socket.io to re-attempt the connection with polling once before terminating. Please take a look at the latest commit to see how I implemented that. Scenarios I tested using Clicker, and their results:
Ideally, we would be able to tell when the `connect_error` is due to a transport-layer failure versus a socket.io server middleware error. I will continue to look into adding more specificity. TL;DR: this works, but I'm curious if there are other ideas on how to allow one reconnect attempt.
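For context, here is a minimal sketch of the kind of interception such a script can do (illustrative only, not the actual script used above): wrap the page's `WebSocket` constructor and close every connection while it is still CONNECTING, which fails the handshake and surfaces as a transport error to socket.io.

```ts
// Illustrative only: make every WebSocket fail before it opens, so the
// socket.io "websocket" transport errors out and the downgrade path runs.
const NativeWebSocket = window.WebSocket;
window.WebSocket = function (url: string, protocols?: string | string[]) {
    const ws = new NativeWebSocket(url, protocols);
    ws.close(); // closing while still CONNECTING fails the handshake
    return ws;
} as unknown as typeof WebSocket;
```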
@GaryWilber Could you take a look at this?
packages/drivers/routerlicious-driver/src/documentDeltaConnection.ts
One way I've heard of people blocking the WebSocket upgrade is a proxy inspecting traffic and dropping HTTP headers (I can probably try to dig up where I heard about it).
@vladsud that's a great idea. I think I can probably find a way to configure nginx in front of a local R11s to prevent the websocket upgrade.
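For instance, a minimal nginx sketch (assuming a local R11s ordering service on port 3003; adjust for your setup): nginx does not forward the hop-by-hop `Upgrade`/`Connection` headers by default, so omitting the usual websocket proxy directives blocks the upgrade and forces socket.io down to long-polling.

```nginx
server {
    listen 8080;
    location /socket.io/ {
        proxy_pass http://localhost:3003;
        proxy_http_version 1.1;
        # Intentionally omitted -- these are what would enable the upgrade:
        # proxy_set_header Upgrade $http_upgrade;
        # proxy_set_header Connection "upgrade";
    }
}
```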
```diff
@@ -365,7 +367,15 @@ export class DocumentDeltaConnection
                 description.target = undefined;
             }
         } catch (_e) {}
-        fail(true, this.createErrorObject("connectError", error));
+        if (!this.socket.io.reconnection() ||
+            this.reconnectAttempts >= (this.socket.io.reconnectionAttempts() ?? 0)) {
```
It would be great to incorporate proper telemetry into this flow from the start (see the sketch below):
- events for reconnection attempts (and their duration / protocol)
- once a connection is established, logging (in some form) whether long-polling or websocket is used

Ideally the form the telemetry takes is left to the caller, so maybe we just need hooks for each driver to be able to leverage? Maybe not, and this layer simply issuing events is fine.
@jatgarg, @markfields - FYI
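To illustrate the second bullet with socket.io's public client API (this sketch is not from the PR): the engine exposes the active transport's name and emits `upgrade` when it moves from polling to websocket, either of which could feed a telemetry event.

```ts
import { io } from "socket.io-client";

const socket = io("https://example.com", {
    transports: ["websocket", "polling"],
});

socket.on("connect", () => {
    // "websocket" or "polling" -- worth attaching to a telemetry event.
    console.log("connected via:", socket.io.engine.transport.name);

    // Fired if engine.io later upgrades the transport (polling -> websocket).
    socket.io.engine.on("upgrade", (transport) => {
        console.log("upgraded to:", transport.name);
    });
});
```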
Seems like since the logic is shared, the logging can live here too, but maybe I'm not seeing the benefit of each driver having more flexibility.
I'll be curious to see how often this happens, and how spammy these logs are for tenants configured this way. If there's some other error that triggers this reconnect but never succeeds, then we'll just sit and log the same error every 8 seconds (see `connectCore`). #8411 would help...
Created #8968 to track this as a follow-up to this PR.
```diff
@@ -33,9 +33,25 @@ export class R11sDocumentDeltaConnection extends DocumentDeltaConnection
                 tenantId,
             },
             reconnection: false,
-            transports: ["websocket"],
+            // Enable long-polling as a downgrade option when WebSockets are not available
+            transports: ["websocket", "polling"],
```
Curious, does it matter if polling is added here, given that you change transports below (on error)? Or do both places need to have it in the list?
Maybe another way to ask: if (for example) a host wants to control this behavior, where would we put the IF condition to disable/enable long-polling? Should we do it from the start (and bake it into the generic layer) instead of having it here, with each driver just controlling policy? That would be easier down the road for ODSP, and would avoid replicating bugs across layers due to duplicated code.
Putting it here allows polling to be used in scenarios where websockets are not available in the given runtime (e.g. IE 9). I think IE 9 is currently the only browser that doesn't support websockets, and engine.io-client has built-in isomorphic support, so it also works in Node.js via the `ws` package. Realistically, adding this line here adds very niche support, but I figured it's not really hurting anything.
As for where to put an IF condition, I was under the impression that we didn't want to bake socket.io specifics into the generic layer in case a driver/service wanted to use a different WebSockets implementation, but that might be an outdated understanding. If it stays here in the per-driver layer, I believe the IF condition would have to go both around this config line and around the following `.on("connect_error")` handler, since that is really where the majority of polling use cases are covered (see the sketch below).
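Concretely, something along these lines (a hypothetical sketch; `enableLongPolling` is an illustrative flag, not a real option in the driver):

```ts
import { io } from "socket.io-client";

// Hypothetical driver-level policy flag.
const enableLongPolling = true;

const socket = io("https://example.com", {
    reconnection: false,
    // Gate #1: include polling for runtimes where websockets are unavailable.
    transports: enableLongPolling ? ["websocket", "polling"] : ["websocket"],
});

socket.on("connect_error", () => {
    if (enableLongPolling) {
        // Gate #2: after a websocket failure, retry with polling first.
        socket.io.opts.transports = ["polling", "websocket"];
    }
});
```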
Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of one).
I'm not sure what you mean by a different WebSocket implementation. If a driver chooses to use this class, then it agrees to use socket.io and the transport protocols required/supported by this layer. This feels like a standard code-reuse question: I do not see much value in duplicating the same logic across drivers. If we expect the same logic in other drivers that want long-polling, then I'd rather move it to the shared layer and expose a policy to enable/disable it.
I'd go even further: you want a single driver to have the ability to easily control it, and thus allow end experiences to evaluate/enable it slowly, observing impact (COGS, perf, etc.) and potentially cutting it off through dynamic feature gates if things go wrong.
I can also see (maybe in some distant future) requests of the form: could you please cache that a given tenant does not support websockets, and thus move to long-polling faster? I'd prefer to give other layers (hosts) the tools to implement something like that, and simply expose enough policy controls for them to do so, versus baking it into drivers.
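A rough sketch of the shape being suggested here, with entirely hypothetical names: the shared layer owns the downgrade mechanism, while drivers and hosts only pass policy.

```ts
// Hypothetical policy surface -- names are illustrative, not from the codebase.
interface IDeltaConnectionPolicies {
    // Off by default; hosts can flip it through a dynamic feature gate.
    enableLongPollingDowngrade?: boolean;
}

// The shared layer consumes the policy; drivers just forward it.
class SharedDeltaConnection {
    constructor(private readonly policies: IDeltaConnectionPolicies = {}) {}

    protected get enableLongPollingDowngrades(): boolean {
        return this.policies.enableLongPollingDowngrade ?? false;
    }
}
```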
> Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of).

Nope, I just now found the FluidFramework browser support doc. I think I can safely remove this additional polling config, but will verify that everything still works.

> Other stuff

Totally agree. Thank you for explaining that driver-base's DocumentDeltaConnection requires socket.io (looking at the types now, this should have been obvious). I will work on an update to this PR that moves this handling into the shared space.
(force-pushed from 9772afe to ba448b2)
```diff
@@ -365,7 +368,24 @@ export class DocumentDeltaConnection
                 description.target = undefined;
             }
         } catch (_e) {}
-        fail(true, this.createErrorObject("connectError", error));
+        if (this.enableLongPollingDowngrades) {
```
I'd prefer a bit of a "defense in depth" approach here.
For example, can we get into `if (this.enableLongPollingDowngrades)` multiple times? It feels like the answer should be no. If that's the case, would it be worth resetting `this.enableLongPollingDowngrades` so that if we were to get here (for whatever reason) a second time, we bail out with an error right away?
Similarly, what happens if `this.enableLongPollingDowngrades && error?.type !== "TransportError"`?
Or `!this.enableLongPollingDowngrades && this.socket.io.reconnection()`?
Are these supported modes?
Can we have a couple of special conditions (escapes), with all other code paths (catch-all) resulting in `fail()` like it used to be?
I hear what you're saying. Looking at this if-chunk again, I agree that it's not clear what the behavior is in these scenarios, and even though the behavior is as expected for each of your described scenarios, the code path is not clean (it goes into `if ()` blocks unnecessarily), so I will try to clean it up. The current logic also assumes that the driver layer initially sets `reconnection: false`, which I don't think is ideal.
Okay, I've updated the logic to be much more specific.
It will only enter the long-polling downgrade block if (1) it's a websocket transport error, (2) the long-polling downgrade is enabled, and (3) the polling upgrade mechanism (`transports: ["polling", ...]`) is not already in place. Additionally, socket.io's internal reconnection behavior is not altered if it was already enabled by the driver layer. Lastly, the "reconnection enabled + reconnect attempts < max reconnect attempts" check is now an escape rather than a condition to fail.
Let me know what you think, thanks!
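For anyone following along, here is a rough sketch of the handler shape being described, as it might sit inside the class's connection setup. This is hypothetical, not the PR's verbatim code: `enableLongPollingDowngrades`, `reconnectAttempts`, `fail`, and `createErrorObject` echo the names in the diffs above.

```ts
this.socket.on("connect_error", (error: any) => {
    // (1) websocket transport error, and (3) the polling downgrade is not
    // already in place (i.e. "websocket" is still the first transport):
    const isWebSocketTransportError =
        error?.type === "TransportError" &&
        this.socket.io.opts.transports?.[0] === "websocket";

    // (2) the long-polling downgrade is enabled:
    if (isWebSocketTransportError && this.enableLongPollingDowngrades) {
        // Retry with long-polling first on the next attempt.
        this.socket.io.opts.transports = ["polling", "websocket"];
        if (!this.socket.io.reconnection()) {
            // Grant a single retry, but only if the driver layer didn't
            // already enable socket.io's internal reconnection itself.
            this.socket.io.reconnection(true);
            this.socket.io.reconnectionAttempts(1);
        }
        return; // escape: let socket.io retry with polling
    }

    if (this.socket.io.reconnection() &&
        this.reconnectAttempts < this.socket.io.reconnectionAttempts()) {
        // Escape: internal reconnect attempts remain, so don't fail yet.
        this.reconnectAttempts++;
        return;
    }

    // Catch-all: every other path terminates with an error, as before.
    fail(true, this.createErrorObject("connectError", error));
});
```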
packages/drivers/routerlicious-driver/src/documentDeltaConnection.ts
(force-pushed from ba448b2 to 96ea006)
This change allows long-polling as a fallback/downgrade option in the r11s driver and the r11s server. We should be able to release the driver change before the server change, because if WebSockets are not available in a client's browser/network, then their scenario is already broken today.