Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add expired client identifier field and checks to improve old clientId check perf #3704

Merged
merged 7 commits into from
Sep 30, 2020

Conversation

heliocliu
Copy link
Contributor

@heliocliu heliocliu commented Sep 22, 2020

#3675
Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).

@heliocliu
Copy link
Contributor Author

heliocliu commented Sep 22, 2020

Will check in server/client parts separately; both included for now to show the entire change
no longer needed

this check doesn't work for older messages. could probably add a way to check if it's recent enough, but that seems kind of iffy and isn't necessary for the case we're trying to cover with telemetry here
if (!local) {
const client: IActiveSequencedClient | undefined =
this._protocolHandler?.quorum.getMember(message.clientId);
if (client?.isActive === false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't catach the issue. both clients were still active from the servers persepctive which is what quorum represents

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a client id is not active from the server's perspective, i'm pretty sure the server will nack any ops for that client, so this should also never happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this what we're looking for though? (Cases where these ops should have been nacked, but aren't)
The isActive flag is set locally on our side, so the only input the server has on it here is if the client gets removed from the quorum. This is the same place we added telemetry before, except this time the quorum is being used to remove clientIds that should no longer matter instead of keeping them indefinitely until they are bumped out of the MRU.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. That's not what happens. The old client id has not had a leave message sent by the server, as it was the client who decided it needed to reconnect. We see the first set of ops from the old client id, then the resubmitted ops from the new client id, and only after that do we see a leave message for the old client id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i updated #3627 to make this explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the steps as I understand and how this PR fits into it, which I think matches what's in #3627. If you could help me with which part I'm missing, I'd appreciate it.

  1. Old client join message
  2. First set of ops sent from old client id
  3. New client join message
    3a. During new client connect, we hit the change in L1398 above. When we change our client id, we see the old client is still in the quorum (because there has not been an old client leave message yet), so we mark it as isActive = false, but don't remove it from the quorum.
  4. Second set of ops sent from old client id
    4a. New client sees second set of ops and hits the change here. It doesn't recognize them as coming from the same client (!local), but because old client has not had its leave message yet (and is still in the quorum), new client will find it in the quorum marked !isActive (which was done earlier on new client connect). It then reports telemetry.
  5. Duplicate second set of ops are sent from new client (because the ops sent from old client were not recognized)
  6. Old client leave message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that looks correct. :) There might be many reasons for the issue, but this check is equivalent to what an keeping at array of 100 clientIDs did - we should evaluate equivalence here only.
If there are more checks or more cases to track, we should do it in separate PR.

if (!local) {
const client: IActiveSequencedClient | undefined =
this._protocolHandler?.quorum.getMember(message.clientId);
if (client?.isActive === false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this check is correct.
I'd make it
if (client === undefined || client.isActive === false)
i.e. the client that sends ops should be in quorum!

That said, this would work only if we make this change for !isSystemMessage(message), i.e. exclude "join" & "leave" messages as they are odd - we need to give this.protocolHandler.processMessage a chance to process join for this check to apply to it, but leave is other way around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'd consider dropping if (!local) check for this - the invariant is still valid - the client that is sending ops should be in quorum, and can't be non-active. Local or not is secondary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the client === undefined check work with unrelated clients playing back older ops to catch up to the latest? I had something similar in the first iteration and it was getting hit during the UTs and failing them, so I wasn't super sure if it wouldn't work outside of the live scenario, or if it was an issue with the local test -licious variant (which seemed unlikely)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your current check only caches cases that are still in quorum, i.e. server still thinks that client is alive (otherwise server would generate leave op for it and it would be deleted from quorum, loosing active flag tracing).

Increasing check to include client === undefined would not help to catch client error (or rather - client-only errors), because client missing from quorum should not happen - server should not allow to accept and retransmit ops from clients that are not in quorum. But it might be still valuable (in terms of increasing coverage), as maybe there are server bugs we are against, or maybe re-ordering of ops happens somewhere else (in driver?) that may lead to this condition - hitting it would be useful indicator of where to keep digging.

You likely hit problems with join & leave messages, as join message would be coming from client that is still not in quorum - we add it to quorum down when calling into protocolHandler. Reordering check with protocolHandler would not help as it would fix it for join op, but make it fail for leave op. That's the reason I think the best check is something like that::

if (client === undefined && op.type !== MessageType.ClientJoin || client?.isActive === false) {

If it's not the case, then it would be interesting to learn why you hit problems. I'm not aware of any other reason why it's wrong check to have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

problem was just that the check wasn't handling noops (because message.clientId was null 🙃). all good after handling that case

@vladsud
Copy link
Contributor

vladsud commented Sep 29, 2020

Looks great!

@heliocliu heliocliu changed the title Add isActive field and checks to improve old clientId check perf Add expired client identifier field and checks to improve old clientId check perf Sep 30, 2020
@heliocliu heliocliu merged commit 116d800 into microsoft:main Sep 30, 2020
@heliocliu heliocliu deleted the ImproveClientIdLogging branch September 30, 2020 18:42
heliocliu added a commit to heliocliu/FluidFramework-1 that referenced this pull request Oct 1, 2020
…d check perf (microsoft#3704)

microsoft#3675
Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).
heliocliu added a commit that referenced this pull request Oct 1, 2020
Undo #3656 and replace it with the change from #3704 which does the same thing but in a smarter way
heliocliu added a commit that referenced this pull request Oct 1, 2020
…d check perf (#3704) (#3801)

#3675
Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants