-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add expired client identifier field and checks to improve old clientId check perf #3704
Conversation
|
server/routerlicious/packages/protocol-definitions/src/clients.ts
Outdated
Show resolved
Hide resolved
this check doesn't work for older messages. could probably add a way to check if it's recent enough, but that seems kind of iffy and isn't necessary for the case we're trying to cover with telemetry here
if (!local) { | ||
const client: IActiveSequencedClient | undefined = | ||
this._protocolHandler?.quorum.getMember(message.clientId); | ||
if (client?.isActive === false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't catach the issue. both clients were still active from the servers persepctive which is what quorum represents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a client id is not active from the server's perspective, i'm pretty sure the server will nack any ops for that client, so this should also never happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this what we're looking for though? (Cases where these ops should have been nacked, but aren't)
The isActive
flag is set locally on our side, so the only input the server has on it here is if the client gets removed from the quorum. This is the same place we added telemetry before, except this time the quorum is being used to remove clientIds that should no longer matter instead of keeping them indefinitely until they are bumped out of the MRU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. That's not what happens. The old client id has not had a leave message sent by the server, as it was the client who decided it needed to reconnect. We see the first set of ops from the old client id, then the resubmitted ops from the new client id, and only after that do we see a leave message for the old client id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i updated #3627 to make this explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the steps as I understand and how this PR fits into it, which I think matches what's in #3627. If you could help me with which part I'm missing, I'd appreciate it.
- Old client join message
- First set of ops sent from old client id
- New client join message
3a. During new client connect, we hit the change in L1398 above. When we change our client id, we see the old client is still in the quorum (because there has not been an old client leave message yet), so we mark it asisActive = false
, but don't remove it from the quorum. - Second set of ops sent from old client id
4a. New client sees second set of ops and hits the change here. It doesn't recognize them as coming from the same client (!local
), but because old client has not had its leave message yet (and is still in the quorum), new client will find it in the quorum marked!isActive
(which was done earlier on new client connect). It then reports telemetry. - Duplicate second set of ops are sent from new client (because the ops sent from old client were not recognized)
- Old client leave message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that looks correct. :) There might be many reasons for the issue, but this check is equivalent to what an keeping at array of 100 clientIDs did - we should evaluate equivalence here only.
If there are more checks or more cases to track, we should do it in separate PR.
if (!local) { | ||
const client: IActiveSequencedClient | undefined = | ||
this._protocolHandler?.quorum.getMember(message.clientId); | ||
if (client?.isActive === false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this check is correct.
I'd make it
if (client === undefined || client.isActive === false)
i.e. the client that sends ops should be in quorum!
That said, this would work only if we make this change for !isSystemMessage(message), i.e. exclude "join" & "leave" messages as they are odd - we need to give this.protocolHandler.processMessage a chance to process join for this check to apply to it, but leave is other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I'd consider dropping if (!local) check for this - the invariant is still valid - the client that is sending ops should be in quorum, and can't be non-active. Local or not is secondary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the client === undefined
check work with unrelated clients playing back older ops to catch up to the latest? I had something similar in the first iteration and it was getting hit during the UTs and failing them, so I wasn't super sure if it wouldn't work outside of the live scenario, or if it was an issue with the local test -licious variant (which seemed unlikely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your current check only caches cases that are still in quorum, i.e. server still thinks that client is alive (otherwise server would generate leave op for it and it would be deleted from quorum, loosing active flag tracing).
Increasing check to include client === undefined would not help to catch client error (or rather - client-only errors), because client missing from quorum should not happen - server should not allow to accept and retransmit ops from clients that are not in quorum. But it might be still valuable (in terms of increasing coverage), as maybe there are server bugs we are against, or maybe re-ordering of ops happens somewhere else (in driver?) that may lead to this condition - hitting it would be useful indicator of where to keep digging.
You likely hit problems with join & leave messages, as join message would be coming from client that is still not in quorum - we add it to quorum down when calling into protocolHandler. Reordering check with protocolHandler would not help as it would fix it for join op, but make it fail for leave op. That's the reason I think the best check is something like that::
if (client === undefined && op.type !== MessageType.ClientJoin || client?.isActive === false) {
If it's not the case, then it would be interesting to learn why you hit problems. I'm not aware of any other reason why it's wrong check to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
problem was just that the check wasn't handling noops (because message.clientId was null 🙃). all good after handling that case
Looks great! |
…d check perf (microsoft#3704) microsoft#3675 Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).
…d check perf (#3704) (#3801) #3675 Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).
#3675
Change the old/expired clientId check to use the quorum instead of by keeping a list of old clientIds. This works by adding a shouldHaveLeft field to the ISequencedClient in the quorum, where a reconnecting client will set its old clientId shouldHaveLeft to true (if it still exists in the client). Then when we process a message, instead of checking against a list of old clientIds, we check if the clientId exists in the quorum (messages should only be coming from clients in the quorum) and if the clientId has been marked should have left. An inactive clientId in this situation would suggest a message from an old connection of the same client (which shouldn't happen).