Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture #4399

Closed
vladsud opened this issue Nov 18, 2020 · 26 comments · Fixed by #5484
Assignees
Labels: area: runtime (Runtime related issues), bug (Something isn't working)
Milestone

Comments

@vladsud
Contributor

vladsud commented Nov 18, 2020

This sounds like a fundamental issue with R11s and, by extension, the PUSH V2 design.

Scenario:

  1. Client is connected to ordering server A
  2. Client sends an op
  3. Client disconnects
  4. Client reconnects to ordering server B
  5. Client receives its own join op, sequenced as # 101
  6. Client resends the op from step 2
  7. Server A finally pushes its ops to Kafka, and the op from step 2 is sequenced as # 102
  8. The op resent in step 6 is sequenced as # 103

But:
# 102 and # 103 are the same op: it was sent and sequenced twice, which causes data corruption.

A possible client-only solution is rather complex and probability-based (a minimal sketch follows the list):

  • add a notion of a container instance ID and include it in op metadata
  • do not reset clientSequenceID on reconnects (the server needs to allow that)
  • each client tracks all container instances over some period of time to detect duplicate ops and ignore them
  • the summarizer preserves enough state for clients to adhere to this algorithm if they boot/join in the middle of potential issues.
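
A minimal sketch of that probability-based tracking, assuming hypothetical names (TrackedOpMetadata, DuplicateOpTracker) that are not part of the current runtime:

    // Hypothetical shape: each op carries a stable container instance ID plus a
    // client sequence ID that is never reset across reconnects.
    interface TrackedOpMetadata {
        containerInstanceId: string;
        clientSequenceId: number;
    }

    // Each client remembers which (instance ID, client sequence ID) pairs it has already
    // applied, over some bounded window, and ignores anything it has seen before.
    class DuplicateOpTracker {
        private readonly seen = new Map<string, Set<number>>();

        // Returns true if the op should be applied, false if it is a duplicate.
        public shouldApply(meta: TrackedOpMetadata): boolean {
            let applied = this.seen.get(meta.containerInstanceId);
            if (applied === undefined) {
                applied = new Set<number>();
                this.seen.set(meta.containerInstanceId, applied);
            }
            if (applied.has(meta.clientSequenceId)) {
                return false; // the same op was already sequenced under another clientId
            }
            applied.add(meta.clientSequenceId);
            return true;
        }

        // State the summarizer would persist so late-joining clients inherit the window.
        public serialize(): string {
            return JSON.stringify(Array.from(this.seen, ([id, s]) => [id, Array.from(s)]));
        }
    }
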
@vladsud vladsud added the bug Something isn't working label Nov 18, 2020
@vladsud vladsud added this to the December 2020 milestone Nov 18, 2020
@ghost ghost added the triage label Nov 18, 2020
@vladsud vladsud changed the title Client runtime can't properly track op acks in presence of distributed ordering service architecture Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture Nov 18, 2020
@anthony-murphy
Contributor

anthony-murphy commented Nov 18, 2020

I wonder if increasing the cost of join by doing a multi-phase commit could help here. All nodes involved in the session would need to acknowledge the join before it is sent, to ensure any outstanding ops come before the join message. I don't know the internals well enough to evaluate the feasibility of a solution like this.

@GaryWilber @tanviraumi

@tanviraumi
Contributor

@anthony-murphy Multi-phase commit for distributed transactions is a blocking protocol, so we should probably not consider a similar solution. Even a less strict solution would require a way to globally track all connected clients with consistent reads/writes. That would become a single point of failure, and we probably don't want to go there.

@vladsud I think this PR (https://github.com/microsoft/FluidFramework/pull/3704/files) is detecting the exact same bug. I may be wrong, but that looks like its purpose. If we are already detecting it, I am curious to know the frequency of occurrence. Do we have a number yet?

Looking at the PR, I am also not sure about closing the container. The session is still usable, clients are not stuck, and the service is not corrupted. The content got duplicated, which definitely violates the user's intention. Given the scenarios we support today, clients can possibly undo the behavior?

To be clear, I am not downplaying the importance of the bug. This is definitely a bad failure, but I just want to get a sense of the frequency of occurrence. That will help us figure out the cost of the solution.

@arinwt
Contributor

arinwt commented Nov 18, 2020

It depends on which op is duplicated; some can be more devastating than others, I think (attach ops?).

Edit: did not mean to close this issue with this comment!

@arinwt arinwt closed this as completed Nov 18, 2020
@arinwt arinwt reopened this Nov 18, 2020
@anthony-murphy
Contributor

anthony-murphy commented Nov 19, 2020

This is data corruption. We might get lucky with some scenarios, but there are plenty that will not recover. I don't think telling the client to recover is reasonable; they would basically need to build a system duplicating the Fluid framework to catch duplicates. We need to handle this on the server or the client.

@anthony-murphy
Contributor

@tanviraumi I'm not sure a multi-phase commit is that detrimental in this case. We wouldn't need to block ops, just slow down join. The joining client would still see ops and be able to edit locally. The join message might show up later in the op stream than it needs to, but that's a whole lot better than showing up too early, as is possible right now.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

@tanviraumi, yes, that's a good point. We already have a mechanism, but it only works for the client who sent the op; that same mechanism will happily apply the duplicate op for all other clients.
And as others pointed out, applying duplicate ops is wrong: things will break either silently or very visibly to the user. Even if things keep working properly from the runtime's perspective, users do not want to see the same text inserted twice!

So we need some solution: either a pure client solution, a pure server solution, or some combination.

@tanviraumi
Contributor

tanviraumi commented Nov 19, 2020

This is a very rough idea, and it may have issues I am not considering, so let me know if I am missing something obvious:

How about a disconnected client also tracking its own "leave" op? The main problem is that we have two active clients (c1 older and c2 newer) in the quorum for the same container. As soon as the older client (c1) leaves, Deli will stop sequencing any message for c1. So the idea is: the new client c2 cannot send ops before c1 actually leaves the quorum. It can still join the websocket, receive ops, and make local edits. But in addition to waiting for its own join message, it will also wait for c1 to leave. In regular cases, you will check the quorum and realize that c1 already left, or you expect a leave message soon (we need data for how soon). In this corruption case, you will discard your local ops one by one until you see a leave op (we already do that, but I don't think we wait for the leave message), and then you can submit the rest.

In the corruption case, you will wait "x" seconds. We need data to see what "x" is. But "x" only impacts when your ops go live. You can still edit locally and see other people's changes.

One caveat: Deli may never see a "leave" message from Alfred. We already handle that today by using a timer; today it's roughly 5 minutes (for both r11s and Push), but that is a tunable number and we can discuss a lower one.
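
A rough sketch of that wait, assuming a quorum-like object with getMembers() and a "removeMember" event (illustrative shape only, not the exact runtime API):

    import { EventEmitter } from "events";

    // Minimal quorum shape assumed for this sketch: clientId -> member, plus a
    // "removeMember" event raised when the server sequences a client's leave op.
    interface QuorumLike extends EventEmitter {
        getMembers(): Map<string, unknown>;
    }

    // Resolve once the previous connection's clientId has left the quorum, or after
    // timeoutMs; only then does the new connection resubmit its unacked ops.
    async function waitForPreviousClientLeave(
        quorum: QuorumLike,
        previousClientId: string,
        timeoutMs: number,
    ): Promise<"left" | "timeout"> {
        // Fast path: the leave was already sequenced before we caught up.
        if (!quorum.getMembers().has(previousClientId)) {
            return "left";
        }
        return new Promise((resolve) => {
            const onRemove = (clientId: string) => {
                if (clientId !== previousClientId) {
                    return;
                }
                clearTimeout(timer);
                quorum.off("removeMember", onRemove);
                resolve("left");
            };
            const timer = setTimeout(() => {
                quorum.off("removeMember", onRemove);
                resolve("timeout");
            }, timeoutMs);
            quorum.on("removeMember", onRemove);
        });
    }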

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

Thanks Tanvir! I'd describe it as "delay with the possibility of fast resolution".
I.e., at its core it is a delay of X seconds that gives us enough nines of probability, but if we see our own leave op, the client may stop waiting and proceed with sending ops, precisely for the reason you pointed out: it may take a while for the server to realize the client left.

We can likely go with that for the Push V2 deployment and collect more data if we need a better solution.

I'll add one more solution: we can add a container instance ID (an ID stable for the lifetime of the container instance) to IClient and start sending it as part of the connect_document message. It would be tracked in the quorum as a result. Whenever a new client shows up, all clients would start ignoring all ops coming from any other clientIds associated with the same container instance ID that just joined.
That's the simplest tracking mechanism, but equivalent (in outcome) to the one I described at the start of this issue.
The only potential issue I see is that any client could now "block" any other client on the wire by spoofing someone else's container instance ID.

And I think we can make it secure. I'll try :)
On first connect, the client sends an empty payload as part of a new field of connect_document and receives back a signed payload that contains the container instance ID.
On reconnect, the client resends the same signed payload it received earlier from the server; the server can thus extract the container instance ID from it, as well as validate that the payload was not tampered with. Given that nobody else has the same payload, only this client can use it.
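
A minimal sketch of what the signing could look like, assuming an HMAC with a server-held key; the field names and helpers below are illustrative, not an existing connect_document contract:

    import { createHmac, randomUUID, timingSafeEqual } from "crypto";

    // Server-held secret; in practice this would be a managed (likely per-tenant) key.
    const signingKey = process.env.INSTANCE_ID_SIGNING_KEY ?? "dev-only-secret";

    interface SignedInstanceId {
        containerInstanceId: string;
        signature: string; // HMAC over the instance ID, so the payload cannot be forged
    }

    // First connect: the client sends an empty payload; the server mints and signs an instance ID.
    function issueInstanceId(): SignedInstanceId {
        const containerInstanceId = randomUUID();
        const signature = createHmac("sha256", signingKey).update(containerInstanceId).digest("hex");
        return { containerInstanceId, signature };
    }

    // Reconnect: the client echoes the payload back; the server validates it was not tampered with.
    function verifyInstanceId(payload: SignedInstanceId): boolean {
        const expected = createHmac("sha256", signingKey).update(payload.containerInstanceId).digest("hex");
        return (
            expected.length === payload.signature.length &&
            timingSafeEqual(Buffer.from(expected), Buffer.from(payload.signature))
        );
    }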

@tanviraumi
Contributor

We should add telemetry to measure what "X" is. My suspicion is that a high percentile (many nines) will see zero delay, because your quorum is already up to date (the leave already came before you joined). For the rest, there is an X > 0. But now that I think about it, you probably don't need to wait for the "leave" at all if the number of unacked ops === 0 (I may be wrong here, so let me know).

Assuming the previous conditions are correct, we are now waiting for the "leave" only in a small number of cases (we need numbers to verify how small). Today we close the container when this happens (#3704). So this solution is a big improvement over our current experience, and then we will measure to improve it further.

I think your signing idea works, but my recent experience with FRS taught me that "signing" is not easy. Basically you need to store a key and make sure the key is never leaked. Then you realize that you need a per-tenant key and need to know how to roll it over. FRS has already done that work, and we can leverage it if required. But I am not sure about Push. As far as I know, Push does not deal with any keys and relies on ODSP to verify the "Push Token". So they may need to build another ODSP contract to verify the signing. @GaryWilber will know more about this.

@anthony-murphy
Contributor

@vladsud We could use a combined key of instance ID + user ID. This would prevent other clients from spoofing it.

@anthony-murphy
Contributor

Something I've also considered is adding a client-sent leave message that the client can send to the server to affirmatively leave, as we've seen cases where the client thinks it has left but the server doesn't. This can happen with the websocket connection pooling done by things like the ODSP driver. This should help ensure leaving is fast in most reconnect cases. Disconnect cases are less interesting, as presumably the client isn't coming back.

@GaryWilber
Contributor

The ODSP driver actually already sends that message today, primarily for the connection pooling scenario you described. When disconnecting, the client sends a "disconnect_document" message to tell the server to disconnect the provided client ID.

However, I'm not sure how helpful that is in this case, because this scenario involves a frontend machine having a delay/issue inserting messages for ordering, so the leave op created by this event would likely be stuck as well. And the [3. Client disconnects] step will already cause this logic to run (disconnecting all the client IDs for the websocket connection).
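
For illustration, roughly what the driver-side call could look like; the payload shape below is an assumption, not the actual ODSP driver contract:

    import type { Socket } from "socket.io-client";

    // Tell the ordering service which clientIds on this pooled websocket are going away,
    // so it can sequence their leave ops promptly instead of waiting for its own timer.
    // The documentId/clientIds fields are assumed, not the real message payload.
    function sendDisconnectDocument(socket: Socket, documentId: string, clientIds: string[]): void {
        socket.emit("disconnect_document", { documentId, clientIds });
    }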

@tylerbutler tylerbutler added design-required This issue requires design thought area: runtime Runtime related issues and removed triage area: runtime Runtime related issues labels Nov 19, 2020
@anthony-murphy
Contributor

Yeah, that makes sense, but in the non-error case it would help keep the leave fast. As long as the error case is rare, it might be OK for that case to be slow regardless.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

We should double check, but yes, SPO should already trigger the proper events for the server to generate the leave right away when the client initiates a disconnect; no new work should be required here, assuming it actually works :)

The delta manager already knows about the potential set of non-acked ops:

        // if we have any non-acked ops from last connection, reconnect as "write".
        // without that we would connect in view-only mode, which will result in immediate
        // firing of "connected" event from Container and switch of current clientId (as tracked
        // by all DDSes). This will make it impossible to figure out if ops actually made it through,
        // so DDSes will immediately resubmit all pending ops, and some of them will be duplicates, corrupting document
        if (this.clientSequenceNumberObserved !== this.clientSequenceNumber) {
            requestedMode = "write";
        }

(Note: this code is a bit aggressive, as it does not take into account noop coalescing).

So yes, we can leverage that signal, wait for our own leave, and report times to get a sense of whether this is good enough.
If we go that route, I'd love to see the code in one place. Right now, the logic for when to issue the "connected" event is in Container, but the logic to track non-acked ops and connect as "write" is in the delta manager.

That said, I'd keep some reasonable timeout and not wait forever, and actually track timeouts in telemetry as a key indicator. I'd start with 2 seconds, as anything close to 5 seconds would result in more disconnected-banner flashing in the app.
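
A sketch of the capped wait plus telemetry; the logger shape, event name, and the 2-second default below are placeholders to be tuned from data:

    // Assumed logger shape; not an existing runtime interface.
    interface TelemetryLoggerLike {
        sendTelemetryEvent(event: { eventName: string; duration: number; timedOut: boolean }): void;
    }

    // Cap how long the reconnecting client waits for its own leave op before it resubmits
    // pending ops anyway, and record how long the wait took and whether the cap was hit.
    async function waitForLeaveWithTimeout(
        leaveReceived: Promise<void>, // resolves when the old clientId's leave op is observed
        logger: TelemetryLoggerLike,
        timeoutMs = 2000,
    ): Promise<void> {
        const start = Date.now();
        let timedOut = false;
        await Promise.race([
            leaveReceived,
            new Promise<void>((resolve) =>
                setTimeout(() => { timedOut = true; resolve(); }, timeoutMs)),
        ]);
        logger.sendTelemetryEvent({
            eventName: "WaitForLeaveOnReconnect",
            duration: Date.now() - start,
            timedOut,
        });
    }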

@tanviraumi
Contributor

What is the timeout for? Is it for how long it takes for the leave message to show up? Clients can still connect and see other people's ops. They will also join the quorum. So I am not sure why this has to be smaller than the "disconnected banner flashing" timeout.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

Yes, the timeout is for waiting for our own leave op.
This is a regression (in terms of time to reconnect), and it will lead to longer / more frequent disconnect banners. I do not want to be in the position of explaining to people that "that's a limitation of the system".

@tanviraumi
Contributor

So today, do we have a 5-second timer for a client to receive its own join message? And what happens if we exceed 5 seconds? Do we show a disconnect banner?

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

We show the disconnect banner if the client did not get to the "connected" state within 5 seconds of the last disconnect. Every time we regressed here, there was rather strong feedback that the product was not meeting the bar, including disconnections flashing every once in a while.

@tanviraumi
Contributor

Makes sense. But I am not sure why this one only gets 2 seconds. Connected (receiving your own join op) is the fast path. When this bug hits, the "leave" op will show up after the "join" op. So if the "join" case gets 5 seconds, shouldn't this one also get 5? Assuming both timers fire at the same time (after the last disconnect).

My point is, these two have the same outcome: both of them block outgoing ops. So from a perf perspective, they should be held to a similar standard. Right?

@vladsud
Contributor Author

vladsud commented Nov 20, 2020

Tanvir and I have chatted. The 5 seconds is not specific to anything; it's the end-to-end process from disconnect to the "connected" event.
At the end of the day, it's all about collecting data and improving the situation; we will adjust the numbers based on what we learn.

@vladsud
Contributor Author

vladsud commented Dec 15, 2020

We need to start addressing this, so I've assigned it to myself and Jatin (let's see if either of us finds time to address the first portion of it this month).

@vladsud vladsud added the focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone label Dec 15, 2020
@vladsud vladsud removed their assignment Dec 17, 2020
@jatgarg jatgarg modified the milestones: December 2020, January 2021 Jan 4, 2021
@vladsud vladsud modified the milestones: January 2021, February 2021 Jan 20, 2021
@curtisman curtisman added area: runtime Runtime related issues and removed design-required This issue requires design thought focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone labels Jan 27, 2021
@agarwal-navin
Contributor

This IcM incident may be related to the issue being discussed here - https://portal.microsofticm.com/imp/v3/incidents/details/226938258/home

@jcmoore
Contributor

jcmoore commented Feb 13, 2021

Only just coming up to speed on the latest regarding incident 226938258 you mention @agarwal-navin, so maybe this is well understood by this point, but it sounds like there are idempotence characteristics that result in bad behavior in the face of unreliable network connections.

This is something I was trying to wrap my head around in conversations with @vladsud and @tanviraumi a couple weeks back. If there are design discussions going on here, I'd be interested to hear what people are thinking (I'm including some of my unsolicited questions/thoughts below "the fold" in the P.S.)


P.S. Assuming (...lots of assumptions):

  • the backend does the deduping of ops
  • each op is sent along with some guid which is used to dedupe
  • guids are a composite of:
    • some namespace (the client id?) managed by backend nodes and granted to frontend peers
    • some gapless monotonic counter managed by frontend peers
  • the namespace portion of a guid can change when a frontend peer disconnects from one backend node and reconnects to the backend (the node may be the same or different)

If that's the case (BIG if), a frontend peer could get backend nodes to sequence semantically equivalent ops (i.e. "create dds/data-object using a specific id") with differing guids. If that's the problem, I might recommend:

  • having the frontend peer send all guids (i.e. namespaces -- if the counter remains stable across reconnects) a backend node might have sequenced a given op with so it can reject duplicates even in the presence of arbitrarily faulty/slow network connections
  • make it so the client can use the same namespace even after it disconnects (presuming there is some way to validate that a namespace has not been hijacked -- perhaps by making them infeasible to guess)

Still learning a lot about the system, so not sure if this is actually the problem -- or if any of the above is feasible... Looking forward to understanding more!
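
To make the last two bullets concrete, here is a sketch of server-side dedupe keyed on a counter that stays stable across all namespaces a client may have used. As the next comment explains, this is not how Fluid's ordering service actually works, so treat it purely as an illustration of the assumed design:

    // Illustrative only: tracks the highest counter already sequenced per namespace.
    class ServerSideDeduper {
        private readonly highestSeen = new Map<string, number>();

        // Accept an op only if no namespace the client may have used before has already
        // sequenced this counter value.
        public accept(counter: number, currentNamespace: string, priorNamespaces: string[]): boolean {
            for (const ns of [currentNamespace, ...priorNamespaces]) {
                const seen = this.highestSeen.get(ns);
                if (seen !== undefined && counter <= seen) {
                    return false; // duplicate: already sequenced under an earlier namespace
                }
            }
            this.highestSeen.set(currentNamespace, counter);
            return true;
        }
    }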

@curtisman
Member

curtisman commented Feb 13, 2021

@jcmoore Here is a summary of my understanding of the problem and the discussion so far (@tanviraumi and @vladsud can correct me here):

  • When the client sends an op, it doesn't know whether the server received it until it hears back the sequenced op.
  • If the client disconnected after it put the op on the wire but before hearing back, it needs to resend those ops after it reconnects, if the server didn't receive them.
  • To determine whether the server received ops that were previously put on the wire:
    • First, after reconnect, it tries to catch up by downloading ops up to what has been broadcast on the reconnected websocket.
    • However, there might be ops that are still in transit or waiting to be sequenced, so the reconnected client has to wait to make sure it gets those ops back before resending.
    • Currently we use the ClientJoin message of the reconnecting client to indicate the wait is over, on the assumption that if the server received the op from the old connection, the ClientJoin message of the new connection has to come after it.

The crux of this issue: there are a couple of situations in which this assumption is violated and the order is inverted:

  • Originally, this issue was opened because there is a case with a distributed ordering service (PushV2): if the reconnection happens on a different server for some reason, it can't guarantee that the ops from the old connection get on the op queue before the join of the new connection on the new server.
  • I believe the newer incident that @agarwal-navin raised shows that even without a distributed ordering service, the order can be inverted, because, while TCP guarantees order within a single connection, the network topology can change across connections. So the new connection might get established before the data from the old connection arrives at the server at all.

The latest proposal on how to solve this issue that I know of is to wait for the ClientLeave message for the old connection instead. That message guarantees that any further messages from the old connection will be rejected by the server, so the new connection can safely resend ops it hasn't heard back about.

The concern with this solution is how long the wait could be until the server notices the old client disconnected, and how often that happens, as it could delay reconnection and impact the user experience.


Your assumptions are great and match many other existing distributed systems, but Fluid is different in that the backend isn't responsible for deduping ops. The ordering service is super simple (on purpose): all it does is listen to ops, give each a sequence number, track the minimum sequence number, and broadcast them. It doesn't try to understand the semantic meaning of the ops. All the reasoning about distributed state is the client's responsibility.
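
As a toy illustration of that contract (not Deli's actual implementation): stamp each op with the next sequence number, track the minimum sequence number across connected clients, and broadcast, with no semantic inspection or deduplication:

    interface InboundOp {
        clientId: string;
        clientSequenceNumber: number;
        referenceSequenceNumber: number;
        contents: unknown;
    }

    type SequencedOp = InboundOp & { sequenceNumber: number; minimumSequenceNumber: number };

    class ToyOrderingService {
        private sequenceNumber = 0;
        // Latest reference sequence number reported by each connected client.
        private readonly clientRefSeq = new Map<string, number>();

        constructor(private readonly broadcast: (op: SequencedOp) => void) {}

        public onOp(op: InboundOp): void {
            this.clientRefSeq.set(op.clientId, op.referenceSequenceNumber);
            const minimumSequenceNumber = Math.min(...Array.from(this.clientRefSeq.values()));
            // Assign the next sequence number and broadcast; the semantic meaning of the
            // op is never inspected here.
            this.broadcast({ ...op, sequenceNumber: ++this.sequenceNumber, minimumSequenceNumber });
        }
    }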

@jcmoore
Contributor

jcmoore commented Feb 13, 2021

Thanks for the explanation @curtisman , I'm following the previous discussion much more clearly now!

@markfields markfields added the msft: oce Pertaining to Msft-internal on call tickets label Feb 22, 2021
@markfields markfields modified the milestones: February 2021, March 2021 Feb 26, 2021
@ChumpChief ChumpChief removed the msft: oce Pertaining to Msft-internal on call tickets label Mar 24, 2021
@ChumpChief
Contributor

Removing OCE tag, as this appears to require design discussion rather than OCE mitigation.

Abe27342 added a commit to Abe27342/FluidFramework that referenced this issue Apr 14, 2021
…ed edits

[This](microsoft#4399) issue on the Fluid framework side can (and has) caused data corruption for affected fluid documents.

This PR adds a mitigation for that problem by tolerating duplicate edits on our end, thus making the documents loadable again.

Related work items: #54204