Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close connection if server rejects writes while maintaining write connection #1722

Closed
wants to merge 2 commits into from
Closed

Close connection if server rejects writes while maintaining write connection #1722

wants to merge 2 commits into from

Conversation

vladsud
Copy link
Contributor

@vladsud vladsud commented Apr 6, 2020

+Fix Fluid debugger

@@ -812,6 +812,9 @@ export class DeltaManager extends EventEmitter implements IDeltaManager<ISequenc
if (this.readonlyPermissions) {
this.close(createWriteError("WriteOnReadOnlyDocument"));
}
if (this.connectionMode === "write") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we never retry on nack now? If not, then what is the scenario?
Seems like if we do not have write permissions, then readonlyPermissions is true, and we get "WriteOnReadOnlyDocument". If we do have write permissions, then connectionMode is "write", and we get "ServerRejectsWrites".
Is the last case of checking for autoReconnect still relevant then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm not sure when nack can happen beyond those two cases - read-only permissions and view-only connection.
I was thinking to wrap both IFs with IF (target === -1), but did not have it for first case, so not sure when target !== -1 can happen. @GaryWilber, @tanviraumi , can you please comment?

I believe checking for this.autoReconnect is still relevant, because if it's neither read-only permissions nor view-only mode (as optimization on top of r/w permissions), then we can reconnect, and we are getting nack because someone is generating an op (presumably because of user actions). In such case, if reconnect is off, that very likely points to a bug in host code that deals with user presence. Bots will not hit it, because they start with "write" mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the server can NACK a client if they send an op that has a reference seq# below the minimum seq#. We should reconnect in that case.


In reply to: 404885819 [](ancestors = 404885819)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the right way to differentiate between those cases? Is testing for target telling anything?
I'd really hate to go with solution that starts counting number of disconnects per period of time as the only signal that something went wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to prevent reconnection attempts when a "read-only permission" user sends an op? So what happens when a user switches from read to write? We have never formalized Nacks very well but in r11s, alfred sends you a nack when you can't write. That means either you don't have read permission or you joined as readonly. In both cases, the value is -1.(https://github.com/microsoft/FluidFramework/blob/master/server/routerlicious/packages/lambdas/src/utils/messageGenerator.ts#L15)

Deli nacks you on different occasions. And in all those occasions client should always reconnect. The value in all those cases is the current min sequence number (https://github.com/microsoft/FluidFramework/blob/master/server/routerlicious/packages/lambdas/src/deli/lambda.ts#L641). The cases are:

  1. There is a gap so kafka might have missed something. Client should reconnect to resend the op.
  2. refseq < minseq.
  3. Stale client.
  4. Tried to summarize without a summary scope. This is more future proofing and r11s/Push always provides a summary claim with the write claim. May be we should not reconnect in this case.

I am pretty sure Deli behaves similarly in Push and r11s. Not sure about the alfred part of Push but @GaryWilber will know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to add tenantId-documentId here just to be safe...

Copy link
Contributor

@GaryWilber GaryWilber Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it's sent to a specific room, but the above image is what's received on the websocket. Unfortunately the client doesn't know what room that was sent to.
I believe rooms are purely a server-side concept to aid in sending messages to specific clients.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gary and I talked. If we need an urgent fix, you can check with -2 and Gary can starting send them from push. We will work on a better way to handle all cases in the meantime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should got with right fix. I can wait of make other temp fix if needed (track rate of disconnects). Can you please share bug / issue links to track?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladsud
Copy link
Contributor Author

vladsud commented Apr 13, 2020

Forking Fluid debugger fix into #1773
Closing this PR as issue is opened to provide client proper signal to properly differentiate nack errors

@vladsud vladsud closed this Apr 13, 2020
@vladsud vladsud deleted the ConnectionBailSooner branch April 24, 2020 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants