Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture #4399
Comments
I wonder if increasing the cost of join by doing a multi-phase commit could help here. All nodes involved in the session would need to acknowledge the join before it is sent, to ensure any outstanding ops come before the join message. I don't know the internals well enough to evaluate the feasibility of a solution like this.
@anthony-murphy Multi-phase commit for distributed transactions is a blocking protocol, so we should probably not consider a similar solution. Even a less strict solution would require a way to globally track all connected clients with consistent reads/writes. That would become a single point of failure, and we probably don't want to go there. @vladsud I think this PR (https://github.com/microsoft/FluidFramework/pull/3704/files) is detecting the exact same bug. I may be wrong, but that looks like its purpose. If we are already detecting it, I am curious to know the frequency of occurrence. Do we have a number yet? Looking at the PR, I am also not sure about closing the container. The session is still usable, clients are not stuck, and the service is not corrupted either. The content got duplicated, which definitely violates the user's intention. Given the scenarios we support today, can clients possibly undo the behavior? To be clear, I am not downplaying the importance of the bug. This is definitely a bad failure, but I just want to get a sense of the frequency of occurrence. This will help us figure out the cost of the solution.
It depends on which op is duplicated; some can be more devastating, I think (attach ops?). Edit: did not mean to close this issue with this comment!
This is data corruption. We might get lucky with some scenarios, but there are tons that will not recover. I don't think telling the client to recover is reasonable; they would basically need to build a system duplicating the Fluid Framework to catch duplicates. We need to handle this on the server or the client.
@tanviraumi I'm not sure a multi-phase commit is that detrimental in this case. We wouldn't need to block ops, just slow down join. The joining client would still see ops and be able to edit locally. The join message might show up later in the op stream than it needs to, but that's a whole lot better than showing up too early, as is possible right now.
@tanviraumi, yes, it's a good point - we already have a mechanism, but it only works for the client who sent the op. That same mechanism will apply the op just fine for all other clients. So we need some solution: either a pure client solution, a pure server solution, or some combination.
This is a very rough idea and may have issues that I am not considering, so let me know if I am missing something obvious: how about a disconnected client also tracks its own "leave" op? The main problem is that we have two active clients (c1, older, and c2, newer) in the quorum for the same container. As soon as the older client (c1) leaves, Deli will stop sequencing any message for c1. So the idea is: the new client c2 cannot send ops before c1 actually leaves the quorum. It can still join the websocket, receive ops, and make local edits. But in addition to waiting for its own join message, it will also wait for c1 to leave. In regular cases you will check the quorum and realize that c1 already left, or you expect a leave message soon (need data on how soon). In this corruption case, you will discard your local ops one by one until you see a leave op (we already do that, but we don't wait for the leave message, I think), and then you can submit the rest. In the corruption case, you will wait "x" seconds. We need data to see what "x" is, but "x" only impacts when your ops go live. You can still edit locally and see other people's changes. One caveat: Deli may never see a "leave" message from Alfred. We already handle that today by using a timer; today it's roughly 5 minutes (both for r11s and Push), but that number is tunable and we can discuss lowering it.
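A minimal sketch of that wait, assuming a hypothetical quorum-like object with a `getMember` lookup and a `removeMember` event (the names here are illustrative, not actual Fluid Framework APIs):

```typescript
import { EventEmitter } from "events";

// Hypothetical quorum surface; the real Fluid interfaces differ.
interface IQuorumLike extends EventEmitter {
    getMember(clientId: string): unknown | undefined;
}

// Resolve once the previous client's leave is observed, or after a timeout
// so the new client never blocks its outgoing ops forever.
async function waitForPreviousClientLeave(
    quorum: IQuorumLike,
    previousClientId: string,
    timeoutMs: number,
): Promise<"left" | "timeout"> {
    // Fast path: the old client is already gone from the quorum.
    if (quorum.getMember(previousClientId) === undefined) {
        return "left";
    }
    return new Promise<"left" | "timeout">((resolve) => {
        const onRemove = (clientId: string) => {
            if (clientId === previousClientId) {
                cleanup();
                resolve("left");
            }
        };
        const timer = setTimeout(() => {
            cleanup();
            resolve("timeout"); // report to telemetry; don't wait indefinitely
        }, timeoutMs);
        const cleanup = () => {
            clearTimeout(timer);
            quorum.off("removeMember", onRemove);
        };
        quorum.on("removeMember", onRemove);
    });
}
```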
Thanks Tanvir! I'd describe it as "delay with possibility of fast resolution". We likely can go with it for the Push V2 deployment and collect more data if we need a better solution. I'll add one more solution: we can add a container instance ID (a stable ID for the lifetime of the container instance) to IClient and start sending it as part of the connect_document message. It would be tracked in the quorum as a result. Whenever a new client shows up, all clients would start ignoring all the ops coming from other client IDs that are associated with the same container instance ID that just joined. And I think we can make it secure. I'll try :)
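A rough sketch of that filtering idea, purely illustrative (a `containerInstanceId` field on the client and the bookkeeping below are assumptions, not existing Fluid Framework APIs):

```typescript
interface ISequencedOpLike {
    clientId: string | null;
}

class StaleClientFilter {
    // clientId -> containerInstanceId, built from quorum join/leave data.
    private readonly instanceByClientId = new Map<string, string>();
    // Most recent clientId seen for each container instance.
    private readonly latestClientByInstance = new Map<string, string>();

    public addClient(clientId: string, containerInstanceId: string): void {
        this.instanceByClientId.set(clientId, containerInstanceId);
        this.latestClientByInstance.set(containerInstanceId, clientId);
    }

    public removeClient(clientId: string): void {
        this.instanceByClientId.delete(clientId);
    }

    // Ops from an older clientId of the same container instance are ignored.
    public shouldIgnore(op: ISequencedOpLike): boolean {
        if (op.clientId === null) {
            return false; // server-generated ops are never filtered
        }
        const instanceId = this.instanceByClientId.get(op.clientId);
        if (instanceId === undefined) {
            return false;
        }
        return this.latestClientByInstance.get(instanceId) !== op.clientId;
    }
}
```

The key point is that the newest join for a given container instance implicitly invalidates all earlier client IDs from that same instance, so their late-sequenced ops get dropped by every client consistently.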
We should add telemetry to measure what "X" is. My suspicion: a high percentile (many nines) will see zero delay because your quorum is already up to date (the leave already came before you joined). For the rest, there is an X > 0. But now that I think about it, you probably don't need to wait for the "leave" if <number of unacked ops === 0> (I may be wrong here, so let me know). Assuming the previous conditions are correct, we are then waiting for the "leave" only in a small number of cases (need numbers to verify how small). Today we are closing the container when this happens (#3704), so this solution is a big improvement over our current experience. And then we will measure to improve it further. I think your signing idea works, but my recent experience with FRS taught me that "signing" is not easy. Basically you need to store a key and make sure the key is never leaked. Then you realize that you need a per-tenant key and you need to know how to roll it over. FRS has already done that work and we can leverage it if required. But I am not sure about Push. As far as I know, Push does not deal with any keys and relies on ODSP to verify the "Push Token". So they may need to build another ODSP contract to verify the signing. @GaryWilber will know more about this.
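A small sketch of that gating logic (both helpers here are assumed; `waitForLeave` is the kind of wait sketched above):

```typescript
// Only gate outgoing ops on the old client's leave when there is actually
// something in flight that could be double-sequenced.
async function gateOutboundOps(
    hasUnackedOps: boolean,
    waitForLeave: () => Promise<"left" | "timeout">,
    logEvent: (name: string, durationMs: number) => void,
): Promise<void> {
    if (!hasUnackedOps) {
        return; // nothing unacked, no duplication risk - send immediately
    }
    const start = Date.now();
    const result = await waitForLeave();
    // This duration is the "X" to measure; timeouts are the key indicator.
    logEvent(result === "timeout" ? "LeaveWaitTimeout" : "LeaveWaitResolved", Date.now() - start);
}
```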
@vladsud we could use a combined key of instance ID + user ID. This would prevent other clients from spoofing.
Something I've also considered is adding a client-sent leave message that the client can send to the server to affirmatively leave, as we've seen cases where the client thinks it has left but the server doesn't. This can happen with the websocket connection pooling done by things like the ODSP driver. This should help ensure leaving is fast in most reconnect cases. Disconnect cases are less interesting, as presumably the client isn't coming back.
The ODSP driver actually already sends that message today, primarily for the connection pooling scenario you described. When disconnecting, the client sends a "disconnect_document" message to tell the server to disconnect the provided client ID. However, I'm not sure how helpful that is for this case, because this scenario involves a frontend machine having a delay/issue inserting messages for ordering. So the leave op created by this event would likely be stuck as well. And the [3. Client disconnects] bullet point will already cause this logic to run (disconnect all the client IDs for the websocket connection).
Yeah. That makes sense, but in the non-error case it would help keep the leave fast. As long as the error case is rare it might be ok if that case is slow regardless. |
We should double check, but yes, SPO should already trigger proper events for the server to generate leave right away when the client initiates disconnect - no new work should be required here, assuming it actually works :) The delta manager already knows about the potential set of non-acked ops:
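The snippet referenced here isn't reproduced; roughly, the check amounts to comparing the last locally submitted client sequence number against the last one the server acked (field names below are assumed for illustration, not the actual DeltaManager implementation):

```typescript
// Sketch of how a delta manager can tell whether it has non-acked ops.
class DeltaManagerSketch {
    private clientSequenceNumber = 0;         // bumped on every submitted op
    private clientSequenceNumberObserved = 0; // last own op seen back from the server

    public submit(): number {
        return ++this.clientSequenceNumber;
    }

    public onOwnOpAcked(clientSequenceNumber: number): void {
        this.clientSequenceNumberObserved = clientSequenceNumber;
    }

    // True if some locally submitted ops have not come back sequenced yet.
    public get hasUnackedOps(): boolean {
        return this.clientSequenceNumber !== this.clientSequenceNumberObserved;
    }
}
```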
(Note: this code is a bit aggressive, as it does not take into account noop coalescing.) So yes, we can leverage that signal, wait for our own leave, and just report times to get a sense of whether this is good enough. That all said, I'd keep some reasonable timeout and not wait forever, and actually track timeouts in telemetry as a key indicator. I'd start with 2 seconds, as anything close to 5 seconds would result in more disconnected-banner flashing in the app.
What is the timeout for? Is it for how long it takes for a leave message to show up? Clients can still connect and see other people's ops. They will also join the quorum. So I am not sure why this has to be smaller than the "disconnected banner flashing" timeout.
Yes, timeout for waiting for own leave op. |
So today, do we have a 5-second timer for a client to receive its own join message? And what happens if we exceed 5 seconds? Do we show a disconnect banner?
We show the disconnect banner if the client did not get to the "connected" state within 5 seconds after the last disconnect. Every time we regressed here, there was rather strong feedback that the product is not meeting the bar, including flashing disconnections every once in a while.
Makes sense. But I am not sure why this one only gets 2 seconds. Connected (receiving your own join op) is the fast path. When this bug hits, the "leave" op will show up after the "join" op. So if the "join" case gets 5 seconds, shouldn't this one also get 5, assuming both timers get fired at the same time (after the last disconnect)? My point is, these two have the same outcome: both of them block outgoing ops. So from a perf perspective, they should be held to a similar standard. Right?
We have chatted with Tanvir. The 5 seconds are not specific to anything; it's the end-to-end process from disconnect to the "connected" event.
We need to start addressing it, so I've assigned it to myself & Jatin (let's see if either of us finds time to address the first portion of it this month).
This IcM incident may be related to the issue being discussed here - https://portal.microsofticm.com/imp/v3/incidents/details/226938258/home |
Only just coming up to speed on the latest regarding incident 226938258 you mention, @agarwal-navin, so maybe this is well understood by this point, but it sounds like there are idempotence characteristics that result in bad behavior in the face of unreliable network connections. This is something I was trying to wrap my head around in conversations with @vladsud and @tanviraumi a couple of weeks back. If there are design discussions going on here, I'd be interested to hear what people are thinking.
@jcmoore Here is a summary of my understanding of the problem and the discussion so far (@tanviraumi and @vladsud can correct me here):
The crux of this issue: there are a couple of situations where this assumption is violated and the order is inverted:
The latest proposal on how to solve this issue that I know of is to wait for the old client's leave message. The concern with this solution is how long the wait could be until the server notices the old client disconnected, and how often that happens, as it could delay reconnection and impact user experience. Your assumptions are great and match many other existing distributed systems, but Fluid is different in that the backend isn't responsible for deduping ops. The ordering service is super simple (on purpose): all it does is listen to ops, give them a sequence number, track the minimum sequence number, and broadcast them. It doesn't try to understand the semantic meaning of the ops. All the reasoning about distributed state is the client's responsibility.
Thanks for the explanation @curtisman, I'm following the previous discussion much more clearly now!
Removing OCE tag, as this appears to require design discussion rather than OCE mitigation. |
…ed edits [This](microsoft#4399) issue on the Fluid Framework side can cause (and has caused) data corruption for affected Fluid documents. This PR adds a mitigation for that problem by tolerating duplicate edits on our end, thus making the documents loadable again. Related work items: #54204
This sounds like a fundamental issue with R11s and, by extension, the PUSH V2 design.
Scenario:
But:
# 102 & # 103 are the same op - it was sent and sequenced twice, which causes data corruption.
A possible client-only solution is rather complex and probability-based:
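The details of that client-only approach aren't captured above. Purely as an illustration of what client-side dedup could look like (not necessarily the solution alluded to here), a filter keyed on clientId and clientSequenceNumber might be:

```typescript
interface IIncomingOp {
    clientId: string | null;
    clientSequenceNumber: number;
}

// Illustrative only: drops any incoming op whose clientSequenceNumber was
// already processed for that clientId, so a doubly-sequenced op is applied once.
class DuplicateOpFilter {
    private readonly lastSeen = new Map<string, number>();

    // Returns true when the op is a duplicate and should be skipped.
    public isDuplicate(op: IIncomingOp): boolean {
        if (op.clientId === null) {
            return false; // server ops carry no client sequence numbers
        }
        const last = this.lastSeen.get(op.clientId) ?? 0;
        if (op.clientSequenceNumber <= last) {
            return true;
        }
        this.lastSeen.set(op.clientId, op.clientSequenceNumber);
        return false;
    }
}
```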