Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support HTTP long polling as an alternative to WebSockets #859

Merged
merged 12 commits into from
Mar 28, 2024

Conversation

jonathanperret
Copy link
Collaborator

@jonathanperret jonathanperret commented Feb 19, 2024

Motivation

This PR is a proposed fix for #840 .

As mentioned in that issue, the motivation for supporting an alternative to WebSockets is that while all browsers supported by Grist offer native WebSocket support, some networking environments do not allow WebSocket traffic.

Implementation

Engine.IO was selected as the underlying implementation of HTTP long polling, after careful consideration as described here.

While Engine.IO offers both a WebSocket-based transport and HTTP long polling, and the ability to transparently upgrade from long polling to WebSockets, I chose to only use Engine.IO for its polling implementation, to avoid imposing the overhead of the Engine.IO protocol (and server-side memory usage) on the vast majority of clients, which can use plain WebSockets.

The Grist client will first attempt a regular WebSocket connection, using the same protocol and endpoints as before, but fall back to long polling using Engine.IO if the WebSocket connection fails.

To get all of the test cases to pass (particularly the Comm tests) I had to add some form of emulation of the behavior of the ws library to the Engine.IO-based implementation. In particular, while the Engine.IO socket's send method accepts a callback, it is only called upon success and not if the transfer fails. Because a failed transfer immediately causes the socket to be closed, I keep track of messages that have not had their callback called yet and invoke these upon close if any are still pending.

@paulfitz
Copy link
Member

Should the raw WebSocket backend remain available? In theory, Engine.IO will upgrade to the more efficient transport (i.e. WebSockets over long polling) and therefore it would seem unnecessary to maintain a separate implementation. However, this represents a significant change in a critical component of Grist and a conservative approach might be called for. Perhaps a typical sequence of making the new backend available, then making it the default, and finally deprecating the legacy backend would make sense?

There's a case for doing that, since CI style testing doesn't really capture the diversity of browsers and company firewalls and network latency out there. But I'd guess this would involve significant more work? An alternative would be, once we're happy with the PR, to tag it for release to a fly.io preview system we have before it lands. Then we could have volunteers from all over the world (recruited from our discord) hammer on it a bit. That could still miss a problem that only shows up after e.g. multiple days work with intermittent laptop suspends or whatever, but it would increase confidence that whatever problem remains it'll be something we can handle in follow-up.

@jonathanperret
Copy link
Collaborator Author

There's a case for doing that, since CI style testing doesn't really capture the diversity of browsers and company firewalls and network latency out there. But I'd guess this would involve significant more work? An alternative would be, once we're happy with the PR, to tag it for release to a fly.io preview system we have before it lands. Then we could have volunteers from all over the world (recruited from our discord) hammer on it a bit. That could still miss a problem that only shows up after e.g. multiple days work with intermittent laptop suspends or whatever, but it would increase confidence that whatever problem remains it'll be something we can handle in follow-up.

To make sure we're on the same page: if when saying "this would involve significant more work" you're referring to the complexity of having both implementations coexist in the code, this is something I have already handled. I briefly tried a wholesale substitution at the start but realized I was not going to be able to ensure the absence of regressions if I was not able to keep running the previous (raw WebSockets) implementation side-by-side.

Basically, with the code in this PR, setting (at run time — it has no effect on the build) GRIST_USE_ENGINE_IO to a falsy value (or not setting it at all) causes the server and client to use the legacy WebSocket-based implementation — albeit with a slightly more convoluted code path due to the GristSocketServer/GristClientSocket abstractions I introduced. I am highly confident that this mode of execution provides 100% compatibility with the existing code.

However you may have been referring to extra work on the release management and support side of things — in which case I agree that there would be an added burden in supporting these two configurations.

In any case, having this new implementation put through its paces by volunteers sound like a great idea. I have just managed to get all the tests to pass, so I'll mark this PR as ready for review and hopefully it can be deployed on your testing environment.

@jonathanperret jonathanperret marked this pull request as ready for review February 22, 2024 19:07
@paulfitz paulfitz added the preview Launch preview deployment of this PR label Feb 22, 2024
@jonathanperret
Copy link
Collaborator Author

I have updated the PR description to reflect the current implementation status.

I have also pushed a commit that adds GRIST_USE_ENGINE_IO=1 to the environment in fly-template.toml, hopefully this is how an environment variable is added to a deployment?

@jonathanperret
Copy link
Collaborator Author

The Fly deployment failure seems caused by the fact that PRs based on branches from forks do not have access to repository secrets, for obvious security reasons. You may have to clone this branch to one local to this repository?

@paulfitz paulfitz removed the preview Launch preview deployment of this PR label Feb 22, 2024
@paulfitz
Copy link
Member

The Fly deployment failure seems caused by the fact that PRs based on branches from forks do not have access to repository secrets, for obvious security reasons. You may have to clone this branch to one local to this repository?

Yes, you're right, done. And thanks for adding the environment variable, I was just kicking the preview machinery to make sure it still works. For anyone following along, a preview is over in https://grist-http-long-polling.fly.dev/

For the work: thanks for reminding me you had this feature under a flag already. If code review and preview all seem fine I think I'll be voting for making the new behavior the default, though it will be good that any installation that sees trouble has an easy way back to the old behavior.

Will get someone on reviewing this as soon as possible. Thanks a lot for your work! Will be exciting to have this extra level of robustness.

@paulfitz
Copy link
Member

(just did a quick test in firefox with network.websocket.max-connections set to 0 and working perfectly so far with no websockets, very nice)

@jonathanperret
Copy link
Collaborator Author

Testing with a more complex deployment including multiple doc workers and a load balancer revealed that the addition of the /engine.io/ prefix to socket requests meant that some load balancer configurations may need to be updated.

I'll be trying to get rid of that prefix so that Grist admins don't need to rework their setup.

@paulfitz paulfitz requested a review from dsagal February 28, 2024 13:50
@jonathanperret
Copy link
Collaborator Author

Hi! I added a few commits that :

  • remove the /engine.io/ prefix previously added to WebSocket and polling request paths; this should make the new protocol compatible with practically any existing deployment as the URLs are identical. This fixed the deployment on our local staging instances which use path-based routing;
  • fix reconnection behavior when GRIST_USE_ENGINE_IO=1 (the client wouldn't re-attempt a connection after an initial failure);
  • implement a proper terminate() method for Engine.IO-based sockets, for quicker connection interruption when necessary;
  • add the necessary CORS headers for polling to work even in a deployment where workers are on a different domain from the home server.
    @paulfitz if you update the local branch we can see if this still works on the Fly deployment.
    @dsagal I'm looking forward to your feedback when you find the time to look at this.

Copy link
Member

@dsagal dsagal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work, thank you @jonathanperret!

app/client/components/GristClientSocket.ts Outdated Show resolved Hide resolved
app/common/gristUrls.ts Outdated Show resolved Hide resolved
app/server/lib/GristServerSocket.ts Outdated Show resolved Hide resolved
log.debug("got socket close reason=%s description=%s messages=%s",
reason, description?.message ?? description, [...this._messageCallbacks.keys()]);

const err = description ?? new Error(reason);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is description an Error instance, when present?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see when it is constructed and forwarded, the description is indeed an Error instance. It is not documented as such however, unfortunately.

I'm not sure what to do here to make this cleaner. Maybe just ignore the description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, maybe just add a comment that in practice description is an instance of Error, or just change its type to Error if Typescript doesn't mind.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for a comment and hopefully robust way of always ending up with an Error object. I also changed GristServerSocket.onerror to take an (err: Error) => void callback.

Suggested change
const err = description ?? new Error(reason);
// In practice, when available, description has more error details,
// possibly in the form of an Error object.
const maybeErr = description ?? reason;
const err = maybeErr instanceof Error ? maybeErr : new Error(maybeErr);

app/server/lib/GristServerSocket.ts Outdated Show resolved Hide resolved
app/server/lib/GristServerSocket.ts Outdated Show resolved Hide resolved
app/server/lib/GristServerSocket.ts Outdated Show resolved Hide resolved
constructor(server: http.Server) {
super();
this._server = new EIO.Server({
allowUpgrades: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this false because we try websocket before long-polling, and this option is for upgrading from long-polling to websockets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just suggest adding a comment for anyone else wondering the same question.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

app/server/lib/GristSocketServer.ts Show resolved Hide resolved
// this CORS behavior will only apply to polling requests, which end up
// subjected downstream to the same Origin checks as those necessarily
// applied to WebSocket requests, it is safe to let any client attempt
// a connection here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eek, it looks like your security assumption isn't actually the reality today. It doesn't look like we are checking the Origin header for websocket requests. I had to test whether there is a security vulnerability because of that. It looks like there is not, currently, at least on modern browsers, because we set SameSite=Lax attribute on session cookies, so they are not actually sent with CORS requests. With websockets, I don't think there is a way to include them. So a CORS ws connection may be made, and used to access public documents, but it would not include credentials to access private ones.

Your assumption (that an Origin check would necessarily get applied to all WebSocket requests) makes sense. I think it's important even if SameSite=Lax provides protection, if only for consistency with other request handling where we have decided it's important (and also because my quick check does not rule out all possible shenanigans). Incidentally, I found this commit as our latest change to origin checking.

Is that something you would consider adding as part of this diff?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, good thing you looked into this then. Indeed, after a bit of testing on my side I can confirm it looks like SameSite is what is really keeping cross-origin WebSockets from being a vulnerability (WebSocket requests do include what cookies they can by default, without having to add e.g. credentials: "include" as is required with fetch).

It looks like the underlying assumption that "Opening up CORS for regular requests that are being handled by the same code that treats WebSocket requests (however insecurely) does not make things worse" still holds though.

But I agree that checking Origin would make for a more secure implementation all around. I would not mind looking into including the extra check, however I would require a few pointers to find the proper methods to call for verifying the origin.

The commit you referenced modifies the FlexServer.trustOriginHandler method but I cannot call this when validating a WebSocket request, since it is not in the (req, res, next) context of an Express middleware. The underlying requestUtils.trustOrigin is still bound to a Request/Response interface and can add response headers which again do not make sense in a WebSocket response (by the way, the fact that on this line req.hostname is checked rather than origin is puzzling).

It looks like the core of what trustOrigin does is in this line:

if (!allowHost(req, new URL(origin)) && !isEnvironmentAllowedHost(new URL(origin)))

I could add something like this early in Comm._onWebSocketConnection (where I found out WebSocket requests whose Host header does not match an allowed value are rejected, as a result of calling hosts.addOrgInfo).

But looking at allowHost, I see a cast of req to RequestWithOrg (when the raw WebSocket request clearly has no organization info, at least when it first arrives) and this gets into a part of the code I have no familiarity with yet. I wouldn't want to break a legitimate scenario, so I'm afraid some hand-holding will be necessary here.

If you can show me how and where you would add this Origin check to WebSocket requests in the current codebase, I think I would easily port this to my rearranged version of the code in this PR. Come to think of it, maybe it would be more prudent to add this check in a separate (prerequisite to this one) PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...require a few pointers to find the proper methods to call for verifying the origin.

What you suggested is exactly what I had in mind. addOrgInfo is what turns a req into RequestWithOrg, so if you add the allowHost check afterwards, it should almost make sense. "Almost" because you would need to adjust some methods to accept either express.Request or the underlying http.IncomingMessage. I think that's always possible by getting headers using req.headers[lowercase_name].

If allowHost returns false, we could then either fail when credentials are present (since we don't expect them to be), like trustOriginHandler does, or perhaps just strip the credentials and keep going.

the fact that on this line req.hostname is checked rather than origin is puzzling

Doh, it's puzzling to me too. It looks like a mistake. It seems it was added for tests, and in practice normally has no effect (because GRIST_HOST is either unset or set to something like 0.0.0.0), but if it did have an effect, it also looks like the wrong one to me. I'd try removing it and seeing if any tests fail. If you can try that, that would be great; if not, let me know, I'll make a note so it doesn't fall through the cracks. Thank you for noticing!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact that on this line req.hostname is checked rather than origin is puzzling

Doh, it's puzzling to me too. It looks like a mistake. It seems it was added for tests, and in practice normally has no effect (because GRIST_HOST is either unset or set to something like 0.0.0.0), but if it did have an effect, it also looks like the wrong one to me. I'd try removing it and seeing if any tests fail. If you can try that, that would be great; if not, let me know, I'll make a note so it doesn't fall through the cracks. Thank you for noticing!

I can confirm that removing this line has no effect on tests. I ran the whole test suite on GitHub Actions, even though it seems only the DocApi test suite would catch a regression in this code based on the fact that it seems to be the only suite where the Origin header is varied.

So I went ahead and removed that line. Now looking into getting websocket request origins checked as you described. This will involve adding some Comm tests I believe.

log.debug("got socket close reason=%s description=%s messages=%s",
reason, description?.message ?? description, [...this._messageCallbacks.keys()]);

const err = description ?? new Error(reason);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, maybe just add a comment that in practice description is an instance of Error, or just change its type to Error if Typescript doesn't mind.

constructor(server: http.Server) {
super();
this._server = new EIO.Server({
allowUpgrades: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just suggest adding a comment for anyone else wondering the same question.

public set onmessage(cb: null | ((data: string) => void)) {
this._ws.onmessage = cb ?
(event: WS.MessageEvent | MessageEvent<any>) => {
cb(event.data instanceof String ? event.data : event.data.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work on the client-side? It looks like it might string in practice, but if it's not, then according to documentation, it would be ArrayBuffer or Blob, and neither would do anything useful with toString(), I don't think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the client side, I think it is supposed to always be a string since we are only sending text frames, see https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/message_event#event_properties

If the message type is "text", then this field is a string.

On the server side (using ws), the call to .toString() here is meant to only make explicit what JSON.parse in GristWSConnection was doing implicitly.

But in fact, looking at what ws does to emulate the HTML5 WebSocket API, it seems the toString() is not even necessary since it will have been node by ws to convert its NodeJS Buffer object (which has a useful toString, as contrasted from ArrayBuffer or Blob) to a string to match standard WebSocket behavior.

So I've gone ahead and removed the toString() and added an explanatory comment.

Suggested change
cb(event.data instanceof String ? event.data : event.data.toString());
// event.data is guaranteed to be a string here because we only send text frames.
// https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/message_event#event_properties
cb(event.data);

cb(err);
}
for (const cb of this._messageCallbacks.values()) {
cb?.(err);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?. part shouldn't be needed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I removed it.

log.debug("calling cb for msg %d", msgNum);
cb();
if (this._messageCallbacks.delete(msgNum)) {
cb?.();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can cb be undefined? That's fine, but it would be good if its type reflected that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I changed send so that the callback is optional.

  public send(data: string, cb?: (err?: Error) => void) {
    const msgNum = this._messageCounter++;
    if (cb) {
      this._messageCallbacks.set(msgNum, cb);
    }
    this._socket.send(data, {}, () => {
      if (cb && this._messageCallbacks.delete(msgNum)) {
        cb();
      }
    });
  }

app/server/lib/GristSocketServer.ts Show resolved Hide resolved
// this CORS behavior will only apply to polling requests, which end up
// subjected downstream to the same Origin checks as those necessarily
// applied to WebSocket requests, it is safe to let any client attempt
// a connection here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...require a few pointers to find the proper methods to call for verifying the origin.

What you suggested is exactly what I had in mind. addOrgInfo is what turns a req into RequestWithOrg, so if you add the allowHost check afterwards, it should almost make sense. "Almost" because you would need to adjust some methods to accept either express.Request or the underlying http.IncomingMessage. I think that's always possible by getting headers using req.headers[lowercase_name].

If allowHost returns false, we could then either fail when credentials are present (since we don't expect them to be), like trustOriginHandler does, or perhaps just strip the credentials and keep going.

the fact that on this line req.hostname is checked rather than origin is puzzling

Doh, it's puzzling to me too. It looks like a mistake. It seems it was added for tests, and in practice normally has no effect (because GRIST_HOST is either unset or set to something like 0.0.0.0), but if it did have an effect, it also looks like the wrong one to me. I'd try removing it and seeing if any tests fail. If you can try that, that would be great; if not, let me know, I'll make a note so it doesn't fall through the cracks. Thank you for noticing!

Copy link
Member

@dsagal dsagal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! All the changes look good so far, including the removal of the GRIST_HOST check (which I am not seeing in the PR yet but I saw your note that you are removing it).

Thank you also for working on the origin checks!

The reference to "window" would unexpectedly break Node-based tests
where a client stays connected long enough to send a heartbeat.

GristWSSettings.getPageUrl seems explicitly made for this.
This could avoid false results from the tests.
It seems this was added for test purposes but the current tests all pass without this, and
it looks a bit safer to remove it.
These classes, used as an alternative to native WebSockets,
provide an automatic fallback to HTTP long polling (implemented
using Engine.IO) when a WebSocket connection fails.
@jonathanperret
Copy link
Collaborator Author

Hi @dsagal , I finally found the time to work on this again, putting in what will hopefully be the last big change.

For origin verification, I added a verifyClient callback to GristSocketServer which lets the caller filter both WebSocket and polling connections. When instantiating GristSocketServer, Comm passes a verifyClient callback that calls trustOrigin — which I modified as discussed to use IncomingMessage instead of Request. Hopefully this is not too far from what you had in mind.

More significantly, I have changed my approach to integrating Engine.IO. It is no longer introduced as a unifying abstraction over WebSockets and long polling, as I initially hoped to do. Instead, the existing WebSocket-based protocol remains the default communication method, with an automatic fallback to the long polling implementation provided by Engine.IO. Among other things, this solves a problem revealed by the ManyFetches test suite, which is that the memory usage of the server-side Engine.IO sockets was much higher than the basic ws sockets — I could not determine exactly why but this did not inspire much confidence.

The most commonly taken path will therefore remain the native WebSocket connection without the overhead of the Engine.IO protocol, which makes regressions much less likely.

This in turn led me to remove the GRIST_USE_ENGINE_IO environment variable, which had been an annoyance for which I couldn't formulate a clear exit plan. There is now only one implementation of GristSocketServer and GristClientSocket, which includes the aforementioned fallback behavior.

I have squashed most of the commits into one that introduces the GristSocketServer and GristClientSocket, keeping apart the handful of commits that were not directly related to the main work.

Apologies for the review churn, and as always I'm looking forward to your feedback.

}

export class GristClientSocket {
private _wsSocket: WS.WebSocket | WebSocket | undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to have a comment that exactly one of _wsSocket, _eioSocket is set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

private _onEIOOpen() {
this._openHandler?.();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need to set _openDone? And doesn't _onEIOMessage need to check that? (If not, a comment to explain would be good, since that's a sign of a difference with WS implementation)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned below, I reworked the logic and there's only one state check left, in _onWSError since that's the only point where we need to decide whether to emit an error or downgrade the socket.
_onWSMessage doesn't need to check the state (now called _wsConnected) because it can only occur after a successful call to _onWSOpen. None of the _onEIO handlers need to check the state either, since the very fact they are called implies the socket was downgraded to Engine.IO.

this._openHandler?.();
}

private _onWSError(ev: Event) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it's impossible to receive the error event for a socket that's already open? If possible, should we forward it to _errorHandler rather than reconnect using polling at this point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I ended up simplifying the logic, replacing _openDone with a _wsConnected bool that only tells us whether the next WebSocket error event means we should retry with Engine.IO or emit the error.

return true;
}

// Returns whether req satisfies the given allowedHost. Unless req is to a custom domain, it is
// enough if only the base domains match. Differing ports are allowed, which helps in dev/testing.
export function allowHost(req: Request, allowedHost: string|URL) {
export function allowHost(req: IncomingMessage, allowedHost: string|URL) {
const mreq = req as RequestWithOrg;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest avoiding creating mreq variable, since it's a risky type conversion (e.g. mreq.get() is presumably not always available but would be considered correct type), and instead doing this cast inline in the one place where it's used, i.e. (req as RequestWithOrg).isCustomHost.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea about scoping the cast more tightly.
Actually, looking into it I realized the check as I had written it in Comm would not work for custom hosts, because allowHost was called before the request object had been enriched with organization info. I reworked the check to have the call to addOrgInfo done before trustOrigin.

if (process.env.APP_HOME_URL) {
return new URL(process.env.APP_HOME_URL).protocol.replace(':', '');
}
return req.get("X-Forwarded-Proto") || req.protocol;
return req.headers["x-forwarded-proto"] || ((req.socket as TLSSocket).encrypted ? 'https' : 'http');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the implementation of protocol for express.Request. There is a difference around trust-proxy settings which protocol checks. But it's moot because we seem to have been blindly trusting x-forwarded-proto header anyway. That doesn't seem quite right (maybe it's safe in this case, but it's certanly not obvious to me). It's fine not to address here, but if you don't mind, I'd appreciate a TODO to look into when x-forwarded-proto should be trusted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO Comment added.

if (cb) {
this._messageCallbacks.set(msgNum, cb);
}
this._socket.send(data, {}, () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the callback ever have an argument here (e.g. if err can be received, should it be passed on)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Engine.IO only calls the send() callback on success, without an argument. So if we get here the err callback argument is correctly undefined. I'm adding a comment to clarify.


public set onmessage(handler: (data: string) => void) {
this._ws.on('message', (msg: Buffer) => handler(msg.toString()));
this._eventHandlers.push({ event: 'message', handler });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to remember the wrapper function to match on removal (handler is not exactly what got added with _ws.on('message'))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry about missing that again, fixed.

@@ -189,7 +184,7 @@ export class Client {

public interruptConnection() {
if (this._websocket) {
this._removeWebsocketListeners();
this._websocket?.removeAllListeners();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question mark unneeded here (given the conditional)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, removed.

@paulfitz
Copy link
Member

I just updated the preview at https://grist-http-long-polling.fly.dev/ and edited a doc using firefox with network.websocket.max-connections=0 on about:config. Worked fine! Checked fly.io backend logs and confirmed that I see transport=polling in logs. When I enable websockets again and reload, I don't see transport=polling.

Copy link
Member

@dsagal dsagal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you @jonathanperret !!

@dsagal dsagal merged commit 96b652f into gristlabs:main Mar 28, 2024
25 of 26 checks passed
@jonathanperret jonathanperret deleted the http-long-polling branch March 28, 2024 17:31
@paulfitz
Copy link
Member

Thanks @jonathanperret and @dsagal ! This is really great to have.

One thing that came to my mind recently as possible follow-up. @jonathanperret is it possible to determine whether long-polling fallback is happening consistently, and log a warning if so? Not setting up websocket forwarding correctly is a fairly common configuration problem (one example of many: #261). Until now people have been forced to notice and fix this problem since they can't use documents at all until they get it right. But now, they might never even notice. Which isn't terrible, but seems a bit sub-optimal?

@jonathanperret
Copy link
Collaborator Author

Thank you guys for helping bring this to fruition. Can't wait to deploy it and unlock the power of Grist for our restricted-network users.

@paulfitz I see what you mean. I'm not optimistic however that a Grist administrator would notice such a warning, particularly since it wouldn't occur right on server startup.

Maybe calling out WebSockets as worthy of attention in the self-managing docs would help more? It appears there are zero hits for "WebSocket" currently on https://support.getgrist.com.

@paulfitz
Copy link
Member

@paulfitz I see what you mean. I'm not optimistic however that a Grist administrator would notice such a warning, particularly since it wouldn't occur right on server startup.

I agree, that is true. There is an admin panel coming soon where installation problems can be highlighted. If the warning existed, then the developer working on that panel would just need to find it in the code and hook up a way to report it better.

Maybe calling out WebSockets as worthy of attention in the self-managing docs would help more? It appears there are zero hits for "WebSocket" currently on https://support.getgrist.com.

Good point! We have a systems engineer joining the team next week whose mandate will include smoothing out the self-hosting experience. I've added that to the list of easy things to start with :-)

if (!origin) { return true; } // Not a CORS request.
if (process.env.GRIST_HOST && req.hostname === process.env.GRIST_HOST) { return true; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking for an explanation for why this line was removed, as I believe it's responsible for certain requests failing under Docker and possibly under other circumstances.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From earlier in the review:

Jonathan:

the fact that on this line req.hostname is checked rather than origin is puzzling

Me:

Doh, it's puzzling to me too. It looks like a mistake. It seems it was added for tests, and in practice normally has no effect (because GRIST_HOST is either unset or set to something like 0.0.0.0), but if it did have an effect, it also looks like the wrong one to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When GRIST_HOST is set to 0.0.0.0, which seems to be the case when running all of Grist within the same Docker instance from the gristlabs/grist image, certain internal requests from the doc workers to the home servers seems to fail. This seems to be the underlying cause for why certain operations like duplicating documents fail when running the tests under Docker.

But surely running the Docker tests isn't the only reason to have a setup in which we might need such internal requests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a somewhat related discussion in #915

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jordigh would you have a repro scenario for requests failing since this change, under Docker or otherwise?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't fail anymore since 400937f, but if you want to see the failure, try running MOCHA_WEBDRIVER_HEADLESS=1 yarn run test:docker -g DuplicateDocument at its parent, (17ea97d).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants