Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture #4399

Closed
vladsud opened this issue Nov 18, 2020 · 26 comments · Fixed by #5484
Assignees
Labels: area: runtime (Runtime related issues), bug (Something isn't working)
Milestone

Comments

@vladsud
Contributor

vladsud commented Nov 18, 2020

This sounds like a fundamental issue with R11s and, by extension, the PUSH V2 design.

Scenario:

  1. Client is connected to ordering server A
  2. Client sends an op
  3. Client disconnects
  4. Client reconnects to ordering server B
  5. Client receives its own join op, sequenced as # 101
  6. Client resends the op from step 2
  7. Server A finally pushes its ops to Kafka, and the op from step 2 is sequenced as # 102
  8. The op resent in step 6 is sequenced as # 103

But:
# 102 and # 103 are the same op: it was sent and sequenced twice, which causes data corruption.

A possible client-only solution is rather complex and probability-based (a minimal sketch follows the list):

  • add a notion of a container instance ID and include it in op metadata
  • do not reset clientSequenceID on reconnects (the server needs to allow that)
  • each client tracks all container instances over some period of time to detect duplicate ops and ignore them
  • the summarizer preserves enough state for clients to adhere to this algorithm if they boot/join in the middle of potential issues.
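
A minimal sketch of that probability-based tracking, assuming hypothetical names (TrackedOpMetadata, DuplicateOpTracker) that are not part of the current runtime:

    // Hypothetical shape: each op carries a stable container instance ID plus a
    // client sequence ID that is never reset across reconnects.
    interface TrackedOpMetadata {
        containerInstanceId: string;
        clientSequenceId: number;
    }

    // Each client remembers which (instance ID, client sequence ID) pairs it has already
    // applied, over some bounded window, and ignores anything it has seen before.
    class DuplicateOpTracker {
        private readonly seen = new Map<string, Set<number>>();

        // Returns true if the op should be applied, false if it is a duplicate.
        public shouldApply(meta: TrackedOpMetadata): boolean {
            let applied = this.seen.get(meta.containerInstanceId);
            if (applied === undefined) {
                applied = new Set<number>();
                this.seen.set(meta.containerInstanceId, applied);
            }
            if (applied.has(meta.clientSequenceId)) {
                return false; // the same op was already sequenced under another clientId
            }
            applied.add(meta.clientSequenceId);
            return true;
        }

        // State the summarizer would persist so late-joining clients inherit the window.
        public serialize(): string {
            return JSON.stringify(Array.from(this.seen, ([id, s]) => [id, Array.from(s)]));
        }
    }
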
@vladsud vladsud added the bug Something isn't working label Nov 18, 2020
@vladsud vladsud added this to the December 2020 milestone Nov 18, 2020
@ghost ghost added the triage label Nov 18, 2020
@vladsud vladsud changed the title Client runtime can't properly track op acks in presence of distributed ordering service architecture Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture Nov 18, 2020
@anthony-murphy
Contributor

anthony-murphy commented Nov 18, 2020

I wonder if increasing the cost of join by doing a multi-phase commit could help here. All nodes involved in the session would need to acknowledge the join before it is sent, to ensure any outstanding ops come before the join message. I don't know the internals well enough to evaluate the feasibility of a solution like this.

@GaryWilber @tanviraumi

@tanviraumi
Contributor

@anthony-murphy Multi-phase commit for distributed transactions is a blocking protocol, so we should probably not consider a similar solution. Even a less strict solution would require a way to globally track all connected clients with consistent reads/writes. That would become a single point of failure, and we probably don't want to go there.

@vladsud I think this PR (https://github.com/microsoft/FluidFramework/pull/3704/files) is detecting the exact same bug. I may be wrong, but that looks like its purpose. If we are already detecting it, I am curious to know the frequency of occurrence. Do we have a number yet?

Looking at the PR, I am also not sure about closing the container. The session is still usable, clients are not stuck, and the service is not corrupted. The content got duplicated, which definitely violates the user's intention. Given the scenarios we support today, clients can possibly undo the behavior?

To be clear, I am not downplaying the importance of the bug. This is definitely a bad failure, but I just want to get a sense of the frequency of occurrence. That will help us figure out the cost of the solution.

@arinwt
Contributor

arinwt commented Nov 18, 2020

It depends on which op is duplicated; some can be more devastating than others, I think (attach ops?).

Edit: did not mean to close this issue with this comment!

@arinwt arinwt closed this as completed Nov 18, 2020
@arinwt arinwt reopened this Nov 18, 2020
@anthony-murphy
Contributor

anthony-murphy commented Nov 19, 2020

This is data corruption. We might get lucky with some scenarios, but there are plenty that will not recover. I don't think telling the client to recover is reasonable; they would basically need to build a system duplicating the Fluid framework to catch duplicates. We need to handle this on the server or the client.

@anthony-murphy
Contributor

@tanviraumi I'm not sure a multi-phase commit is that detrimental in this case. We wouldn't need to block ops, just slow down join. The joining client would still see ops and be able to edit locally. The join message might show up later in the op stream than it needs to, but that's a whole lot better than showing up too early, as is possible right now.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

@tanviraumi, yes, that's a good point. We already have a mechanism, but it only works for the client who sent the op; that same mechanism will happily apply the duplicate op for all other clients.
And as others pointed out, applying duplicate ops is wrong: things will break either silently or very visibly to the user. Even if things keep working properly from the runtime's perspective, users do not want to see the same text inserted twice!

So we need some solution: either a pure client solution, a pure server solution, or some combination.

@tanviraumi
Contributor

tanviraumi commented Nov 19, 2020

This is a very rough idea, and it may have issues I am not considering, so let me know if I am missing something obvious:

How about a disconnected client also tracking its own "leave" op? The main problem is that we have two active clients (c1 older and c2 newer) in the quorum for the same container. As soon as the older client (c1) leaves, Deli will stop sequencing any message for c1. So the idea is: the new client c2 cannot send ops before c1 actually leaves the quorum. It can still join the websocket, receive ops, and make local edits. But in addition to waiting for its own join message, it will also wait for c1 to leave. In regular cases, you will check the quorum and realize that c1 already left, or you expect a leave message soon (we need data for how soon). In this corruption case, you will discard your local ops one by one until you see a leave op (we already do that, but I don't think we wait for the leave message), and then you can submit the rest.

In the corruption case, you will wait "x" seconds. We need data to see what "x" is. But "x" only impacts when your ops go live. You can still edit locally and see other people's changes.

One caveat: Deli may never see a "leave" message from Alfred. We already handle that today by using a timer; today it's roughly 5 minutes (for both r11s and Push), but that is a tunable number and we can discuss a lower one.
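
A rough sketch of that wait, assuming a quorum-like object with getMembers() and a "removeMember" event (illustrative shape only, not the exact runtime API):

    import { EventEmitter } from "events";

    // Minimal quorum shape assumed for this sketch: clientId -> member, plus a
    // "removeMember" event raised when the server sequences a client's leave op.
    interface QuorumLike extends EventEmitter {
        getMembers(): Map<string, unknown>;
    }

    // Resolve once the previous connection's clientId has left the quorum, or after
    // timeoutMs; only then does the new connection resubmit its unacked ops.
    async function waitForPreviousClientLeave(
        quorum: QuorumLike,
        previousClientId: string,
        timeoutMs: number,
    ): Promise<"left" | "timeout"> {
        // Fast path: the leave was already sequenced before we caught up.
        if (!quorum.getMembers().has(previousClientId)) {
            return "left";
        }
        return new Promise((resolve) => {
            const onRemove = (clientId: string) => {
                if (clientId !== previousClientId) {
                    return;
                }
                clearTimeout(timer);
                quorum.off("removeMember", onRemove);
                resolve("left");
            };
            const timer = setTimeout(() => {
                quorum.off("removeMember", onRemove);
                resolve("timeout");
            }, timeoutMs);
            quorum.on("removeMember", onRemove);
        });
    }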

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

Thanks Tanvir! I'd describe it as "delay with the possibility of fast resolution".
I.e., at its core it is a delay of X seconds that gives us enough nines of probability, but if we see our own leave op, the client may stop waiting and proceed with sending ops, precisely for the reason you pointed out: it may take a while for the server to realize the client left.

We can likely go with that for the Push V2 deployment and collect more data if we need a better solution.

I'll add one more solution: we can add a container instance ID (an ID stable for the lifetime of the container instance) to IClient and start sending it as part of the connect_document message. It would be tracked in the quorum as a result. Whenever a new client shows up, all clients would start ignoring all ops coming from any other clientIds associated with the same container instance ID that just joined.
That's the simplest tracking mechanism, but equivalent (in outcome) to the one I described at the start of this issue.
The only potential issue I see is that any client could now "block" any other client on the wire by spoofing someone else's container instance ID.

And I think we can make it secure. I'll try :)
On first connect, the client sends an empty payload as part of a new field of connect_document and receives back a signed payload that contains the container instance ID.
On reconnect, the client resends the same signed payload it received earlier from the server; the server can thus extract the container instance ID from it, as well as validate that the payload was not tampered with. Given that nobody else has the same payload, only this client can use it.
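
A minimal sketch of what the signing could look like, assuming an HMAC with a server-held key; the field names and helpers below are illustrative, not an existing connect_document contract:

    import { createHmac, randomUUID, timingSafeEqual } from "crypto";

    // Server-held secret; in practice this would be a managed (likely per-tenant) key.
    const signingKey = process.env.INSTANCE_ID_SIGNING_KEY ?? "dev-only-secret";

    interface SignedInstanceId {
        containerInstanceId: string;
        signature: string; // HMAC over the instance ID, so the payload cannot be forged
    }

    // First connect: the client sends an empty payload; the server mints and signs an instance ID.
    function issueInstanceId(): SignedInstanceId {
        const containerInstanceId = randomUUID();
        const signature = createHmac("sha256", signingKey).update(containerInstanceId).digest("hex");
        return { containerInstanceId, signature };
    }

    // Reconnect: the client echoes the payload back; the server validates it was not tampered with.
    function verifyInstanceId(payload: SignedInstanceId): boolean {
        const expected = createHmac("sha256", signingKey).update(payload.containerInstanceId).digest("hex");
        return (
            expected.length === payload.signature.length &&
            timingSafeEqual(Buffer.from(expected), Buffer.from(payload.signature))
        );
    }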

@tanviraumi
Contributor

We should add telemetry to measure what "X" is. My suspicion is that a high percentile (many nines) will see zero delay, because your quorum is already up to date (the leave already came before you joined). For the rest, there is an X > 0. But now that I think about it, you probably don't need to wait for the "leave" at all if the number of unacked ops === 0 (I may be wrong here, so let me know).

Assuming the previous conditions are correct, we are now waiting for the "leave" only in a small number of cases (we need numbers to verify how small). Today we close the container when this happens (#3704). So this solution is a big improvement over our current experience, and then we will measure to improve it further.

I think your signing idea works, but my recent experience with FRS taught me that "signing" is not easy. Basically you need to store a key and make sure the key is never leaked. Then you realize that you need a per-tenant key and need to know how to roll it over. FRS has already done that work, and we can leverage it if required. But I am not sure about Push. As far as I know, Push does not deal with any keys and relies on ODSP to verify the "Push Token". So they may need to build another ODSP contract to verify the signing. @GaryWilber will know more about this.

@anthony-murphy
Contributor

@vladsud We could use a combined key of instance ID + user ID. This would prevent other clients from spoofing it.

@anthony-murphy
Contributor

Something I've also considered is adding a client-sent leave message that the client can send to the server to affirmatively leave, as we've seen cases where the client thinks it has left but the server doesn't. This can happen with the websocket connection pooling done by things like the ODSP driver. This should help ensure leaving is fast in most reconnect cases. Disconnect cases are less interesting, as presumably the client isn't coming back.

@GaryWilber
Contributor

The ODSP driver actually already sends that message today, primarily for the connection pooling scenario you described. When disconnecting, the client sends a "disconnect_document" message to tell the server to disconnect the provided client ID.

However, I'm not sure how helpful that is in this case, because this scenario involves a frontend machine having a delay/issue inserting messages for ordering, so the leave op created by this event would likely be stuck as well. And the [3. Client disconnects] step will already cause this logic to run (disconnecting all the client IDs for the websocket connection).
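
For illustration, roughly what the driver-side call could look like; the payload shape below is an assumption, not the actual ODSP driver contract:

    import type { Socket } from "socket.io-client";

    // Tell the ordering service which clientIds on this pooled websocket are going away,
    // so it can sequence their leave ops promptly instead of waiting for its own timer.
    // The documentId/clientIds fields are assumed, not the real message payload.
    function sendDisconnectDocument(socket: Socket, documentId: string, clientIds: string[]): void {
        socket.emit("disconnect_document", { documentId, clientIds });
    }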

@tylerbutler tylerbutler added design-required This issue requires design thought area: runtime Runtime related issues and removed triage area: runtime Runtime related issues labels Nov 19, 2020
@anthony-murphy
Contributor

Yeah, that makes sense, but in the non-error case it would help keep the leave fast. As long as the error case is rare, it might be OK for that case to be slow regardless.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

We should double check, but yes, SPO should already trigger the proper events for the server to generate the leave right away when the client initiates a disconnect; no new work should be required here, assuming it actually works :)

The delta manager already knows about the potential set of non-acked ops:

        // if we have any non-acked ops from last connection, reconnect as "write".
        // without that we would connect in view-only mode, which will result in immediate
        // firing of "connected" event from Container and switch of current clientId (as tracked
        // by all DDSes). This will make it impossible to figure out if ops actually made it through,
        // so DDSes will immediately resubmit all pending ops, and some of them will be duplicates, corrupting document
        if (this.clientSequenceNumberObserved !== this.clientSequenceNumber) {
            requestedMode = "write";
        }

(Note: this code is a bit aggressive, as it does not take into account noop coalescing).

So yes, we can leverage that signal, wait for our own leave, and report times to get a sense of whether this is good enough.
If we go that route, I'd love to see the code in one place. Right now, the logic for when to issue the "connected" event is in Container, but the logic to track non-acked ops and connect as "write" is in the delta manager.

That said, I'd keep some reasonable timeout and not wait forever, and actually track timeouts in telemetry as a key indicator. I'd start with 2 seconds, as anything close to 5 seconds would result in more disconnected-banner flashing in the app.
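
A sketch of the capped wait plus telemetry; the logger shape, event name, and the 2-second default below are placeholders to be tuned from data:

    // Assumed logger shape; not an existing runtime interface.
    interface TelemetryLoggerLike {
        sendTelemetryEvent(event: { eventName: string; duration: number; timedOut: boolean }): void;
    }

    // Cap how long the reconnecting client waits for its own leave op before it resubmits
    // pending ops anyway, and record how long the wait took and whether the cap was hit.
    async function waitForLeaveWithTimeout(
        leaveReceived: Promise<void>, // resolves when the old clientId's leave op is observed
        logger: TelemetryLoggerLike,
        timeoutMs = 2000,
    ): Promise<void> {
        const start = Date.now();
        let timedOut = false;
        await Promise.race([
            leaveReceived,
            new Promise<void>((resolve) =>
                setTimeout(() => { timedOut = true; resolve(); }, timeoutMs)),
        ]);
        logger.sendTelemetryEvent({
            eventName: "WaitForLeaveOnReconnect",
            duration: Date.now() - start,
            timedOut,
        });
    }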

@tanviraumi
Contributor

What is the timeout for? Is it for how long it takes for the leave message to show up? Clients can still connect and see other people's ops. They will also join the quorum. So I am not sure why this has to be smaller than the "disconnected banner flashing" timeout.

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

Yes, the timeout is for waiting for our own leave op.
This is a regression (in terms of time to reconnect), and it will lead to longer / more frequent disconnect banners. I do not want to be in the position of explaining to people that "that's a limitation of the system".

@tanviraumi
Contributor

So today, do we have a 5-second timer for a client to receive its own join message? And what happens if we exceed 5 seconds? Do we show a disconnect banner?

@vladsud
Contributor Author

vladsud commented Nov 19, 2020

We show the disconnect banner if the client did not get to the "connected" state within 5 seconds of the last disconnect. Every time we regressed here, there was rather strong feedback that the product was not meeting the bar, including disconnections flashing every once in a while.

@tanviraumi
Contributor

Makes sense. But I am not sure why this one only gets 2 seconds. Connected (receiving your own join op) is the fast path. When this bug hits, the "leave" op will show up after the "join" op. So if the "join" case gets 5 seconds, shouldn't this one also get 5? Assuming both timers fire at the same time (after the last disconnect).

My point is, these two have the same outcome: both of them block outgoing ops. So from a perf perspective, they should be held to a similar standard. Right?

@vladsud
Contributor Author

vladsud commented Nov 20, 2020

Tanvir and I have chatted. The 5 seconds is not specific to anything; it's the end-to-end process from disconnect to the "connected" event.
At the end of the day, it's all about collecting data and improving the situation; we will adjust the numbers based on what we learn.

@vladsud
Contributor Author

vladsud commented Dec 15, 2020

We need to start addressing this, so I've assigned it to myself and Jatin (let's see if either of us finds time to address the first portion of it this month).

@vladsud vladsud added the focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone label Dec 15, 2020
@vladsud vladsud removed their assignment Dec 17, 2020
@jatgarg jatgarg modified the milestones: December 2020, January 2021 Jan 4, 2021
@vladsud vladsud modified the milestones: January 2021, February 2021 Jan 20, 2021
@curtisman curtisman added area: runtime Runtime related issues and removed design-required This issue requires design thought focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone labels Jan 27, 2021
@agarwal-navin
Contributor

This IcM incident may be related to the issue being discussed here - https://portal.microsofticm.com/imp/v3/incidents/details/226938258/home

@jcmoore
Contributor

jcmoore commented Feb 13, 2021

Only just coming up to speed on the latest regarding incident 226938258 you mention @agarwal-navin, so maybe this is well understood by this point, but it sounds like there are idempotence characteristics that result in bad behavior in the face of unreliable network connections.

This is something I was trying to wrap my head around in conversations with @vladsud and @tanviraumi a couple weeks back. If there are design discussions going on here, I'd be interested to hear what people are thinking (I'm including some of my unsolicited questions/thoughts below "the fold" in the P.S.)


P.S. Assuming (...lots of assumptions):

  • the backend does the deduping of ops
  • each op is sent along with some guid which is used to dedupe
  • guids are a composite of:
    • some namespace (the client id?) managed by backend nodes and granted to frontend peers
    • some gapless monotonic counter managed by frontend peers
  • the namespace portion of a guid can change when a frontend peer disconnects from one backend node and reconnects to the backend (the node may be the same or different)

If that's the case (BIG if), a frontend peer could get backend nodes to sequence semantically equivalent ops (i.e. "create dds/data-object using a specific id") with differing guids. If that's the problem, I might recommend:

  • having the frontend peer send all guids (i.e. namespaces -- if the counter remains stable across reconnects) a backend node might have sequenced a given op with so it can reject duplicates even in the presence of arbitrarily faulty/slow network connections
  • make it so the client can use the same namespace even after it disconnects (presuming there is some way to validate that a namespace has not been hijacked -- perhaps by making them infeasible to guess)

Still learning a lot about the system, so not sure if this is actually the problem -- or if any of the above is feasible... Looking forward to understanding more!
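
To make the last two bullets concrete, here is a sketch of server-side dedupe keyed on a counter that stays stable across all namespaces a client may have used. As the next comment explains, this is not how Fluid's ordering service actually works, so treat it purely as an illustration of the assumed design:

    // Illustrative only: tracks the highest counter already sequenced per namespace.
    class ServerSideDeduper {
        private readonly highestSeen = new Map<string, number>();

        // Accept an op only if no namespace the client may have used before has already
        // sequenced this counter value.
        public accept(counter: number, currentNamespace: string, priorNamespaces: string[]): boolean {
            for (const ns of [currentNamespace, ...priorNamespaces]) {
                const seen = this.highestSeen.get(ns);
                if (seen !== undefined && counter <= seen) {
                    return false; // duplicate: already sequenced under an earlier namespace
                }
            }
            this.highestSeen.set(currentNamespace, counter);
            return true;
        }
    }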

@curtisman
Member

curtisman commented Feb 13, 2021

@jcmoore Here is a summary of my understanding of the problem and the discussion so far (@tanviraumi and @vladsud can correct me here):

  • When the client sends an op, it doesn't know whether the server received it until it hears back the sequenced op.
  • If the client disconnected after it put the op on the wire but before hearing back, it needs to resend those ops after it reconnects, if the server didn't receive them.
  • To determine whether the server received ops that were previously put on the wire:
    • First, after reconnect, it tries to catch up by downloading ops up to what has been broadcast on the reconnected websocket.
    • However, there might be ops that are still in transit or waiting to be sequenced, so the reconnected client has to wait to make sure it gets those ops back before resending.
    • Currently we use the ClientJoin message of the reconnecting client to indicate the wait is over, on the assumption that if the server received the op from the old connection, the ClientJoin message of the new connection has to come after it.

The crux of this issue: there are a couple of situations in which this assumption is violated and the order is inverted:

  • Originally, this issue was opened because there is a case with a distributed ordering service (PushV2): if the reconnection happens on a different server for some reason, it can't guarantee that the ops from the old connection get on the op queue before the join of the new connection on the new server.
  • I believe the newer incident that @agarwal-navin raised shows that even without a distributed ordering service, the order can be inverted, because, while TCP guarantees order within a single connection, the network topology can change across connections. So the new connection might get established before the data from the old connection arrives at the server at all.

The latest proposal on how to solve this issue that I know of is to wait for the ClientLeave message for the old connection instead. That message guarantees that any further messages from the old connection will be rejected by the server, so the new connection can safely resend ops it hasn't heard back about.

The concern with this solution is how long the wait could be until the server notices the old client disconnected, and how often that happens, as it could delay reconnection and impact the user experience.


Your assumptions are great and match many other existing distributed systems, but Fluid is different in that the backend isn't responsible for deduping ops. The ordering service is super simple (on purpose): all it does is listen to ops, give each a sequence number, track the minimum sequence number, and broadcast them. It doesn't try to understand the semantic meaning of the ops. All the reasoning about distributed state is the client's responsibility.
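
As a toy illustration of that contract (not Deli's actual implementation): stamp each op with the next sequence number, track the minimum sequence number across connected clients, and broadcast, with no semantic inspection or deduplication:

    interface InboundOp {
        clientId: string;
        clientSequenceNumber: number;
        referenceSequenceNumber: number;
        contents: unknown;
    }

    type SequencedOp = InboundOp & { sequenceNumber: number; minimumSequenceNumber: number };

    class ToyOrderingService {
        private sequenceNumber = 0;
        // Latest reference sequence number reported by each connected client.
        private readonly clientRefSeq = new Map<string, number>();

        constructor(private readonly broadcast: (op: SequencedOp) => void) {}

        public onOp(op: InboundOp): void {
            this.clientRefSeq.set(op.clientId, op.referenceSequenceNumber);
            const minimumSequenceNumber = Math.min(...Array.from(this.clientRefSeq.values()));
            // Assign the next sequence number and broadcast; the semantic meaning of the
            // op is never inspected here.
            this.broadcast({ ...op, sequenceNumber: ++this.sequenceNumber, minimumSequenceNumber });
        }
    }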

@jcmoore
Contributor

jcmoore commented Feb 13, 2021

Thanks for the explanation @curtisman , I'm following the previous discussion much more clearly now!

@markfields markfields added the msft: oce Pertaining to Msft-internal on call tickets label Feb 22, 2021
@markfields markfields modified the milestones: February 2021, March 2021 Feb 26, 2021
@ChumpChief ChumpChief removed the msft: oce Pertaining to Msft-internal on call tickets label Mar 24, 2021
@ChumpChief
Contributor

Removing OCE tag, as this appears to require design discussion rather than OCE mitigation.

Abe27342 added a commit to Abe27342/FluidFramework that referenced this issue Apr 14, 2021
…ed edits

[This](microsoft#4399) issue on the Fluid framework side can (and has) caused data corruption for affected fluid documents.

This PR adds a mitigation for that problem by tolerating duplicate edits on our end, thus making the documents loadable again.

Related work items: #54204