
Data Corruption due to distributed ordering service. (Follow up) #5866

Closed
jatgarg opened this issue Apr 20, 2021 · 23 comments
Assignees
jatgarg
Labels
area: runtime (Runtime related issues), bug (Something isn't working)
Milestone
May 2021

Comments

@jatgarg
Contributor

jatgarg commented Apr 20, 2021

Describe the bug

The original bug is described here: Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture

To fix the issue, we implemented logic that waits up to 90 seconds (default) for the leave op of the last client before moving to the connected state with the new clientId; if the timeout expires without the leave op, we connect anyway and resend the pending ops from the last client.
PR: #5484

Now that we have telemetry to analyze, we are seeing that we mostly hit the 90-second timeout and connect without the last client having left, so the data corruption error still occurs. This is expected, as the server only sends the leave op after waiting for 5 minutes.
So we are trying to increase the timeout to more than 5 minutes to see if we can get rid of the error. It is configurable through loader options, so we don't need to do a patch release.
Also, 5 minutes seems like a long time for the server to wait before sending the leave op, so we can think about reducing that on the server side too.
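For context, here is a rough sketch of the wait-before-connect logic described above (illustrative only: IQuorumLike and waitForPreviousClientLeave are hypothetical names, not the actual loader API; the real timeout is wired through loader options):

import { EventEmitter } from "events";

// Hypothetical shape of the quorum we listen on for member changes.
interface IQuorumLike extends EventEmitter {
    getMember(clientId: string): unknown | undefined;
}

// Wait for the previous client's leave op before declaring the new connection ready.
// Resolves true if the leave arrived, false if we timed out and proceed anyway
// (the risky path this issue is about).
async function waitForPreviousClientLeave(
    quorum: IQuorumLike,
    previousClientId: string,
    timeoutMs = 90_000, // the 90-second default discussed above
): Promise<boolean> {
    // The leave may already have been processed before we started waiting.
    if (quorum.getMember(previousClientId) === undefined) {
        return true;
    }
    return new Promise<boolean>((resolve) => {
        const onRemoveMember = (clientId: string) => {
            if (clientId === previousClientId) {
                clearTimeout(timer);
                quorum.off("removeMember", onRemoveMember);
                resolve(true);
            }
        };
        const timer = setTimeout(() => {
            quorum.off("removeMember", onRemoveMember);
            resolve(false); // timed out: connect anyway and resubmit pending ops
        }, timeoutMs);
        quorum.on("removeMember", onRemoveMember);
    });
}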

@jatgarg jatgarg added the bug Something isn't working label Apr 20, 2021
@jatgarg jatgarg added this to the May 2021 milestone Apr 20, 2021
@jatgarg jatgarg self-assigned this Apr 20, 2021
@ghost ghost added the triage label Apr 20, 2021
@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

@anthony-murphy
Contributor

Additionally, I don't think a timeout ever makes sense here. It should just be part of the protocol that the client needs to wait for its own leave before it's truly connected; anything else can lead to corruption issues. If the leave takes too long, we should work on making that faster rather than having the client unsafely proceed.

@vladsud
Contributor

vladsud commented Apr 20, 2021

"We mostly hit the 90-second timeout and connect without the last client having left, so the data corruption error still occurs."

Is this a guess, or is there actual data supporting this claim?
Can you please post stats / queries pointing out how many times we hit the timeout, and how many times we actually see duplicate ops in those cases?

@anthony-murphy
Contributor

I believe the original fix went in 0.37, but we're still seeing messageClientIdShouldHaveLeft errors, any of which could be corruptions:

Data_docId Data_messageTimestamp set_Data_hostName set_Data_loaderVersion
iTrsm2r5fc%2BOSN3TkfanHFGYns%2Fp%2BZnvHMfF6lQTL3o%3D   [ "Office.com" ] [ "0.34.1" ]
nLHcLBrJ3Unz9z1iaBC%2F9JslN46r36z12gLLhEi2HEY%3D   [ "Office.com" ] [ "0.34.1" ]
FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D 2021-04-16 21:33:17.3360000 [ "Office.com" ] [ "0.37.2" ]
tLfQmK3UcvMqF2Ki9M8doLXrekk3jIG9yYHxcfnFaSQ%3D 2021-04-16 22:34:33.3510000 [ "Office.com" ] [ "0.37.2" ]
tLfQmK3UcvMqF2Ki9M8doLXrekk3jIG9yYHxcfnFaSQ%3D 2021-04-16 22:45:28.2430000 [ "Office.com" ] [ "0.37.2" ]
bE6fle9OPDm0PVSEzRrbKyjNsWlkWYjVfS2gXnvPimI%3D 2021-04-19 04:31:31.5340000 [ "Fluid Preview App" ] [ "0.37.4" ]
ug9m8y%2FdYvtus2RIIGaAPvYvlokhPmgUbfxSElZ3nXw%3D 2021-04-19 17:02:20.0270000 [ "Office.com" ] [ "0.37.2" ]
nVAzpfpoFYuXXlbW7grt38HBzilcRatcrbeO2vGHFuE%3D 2021-04-19 20:00:56.1180000 [ "Fluid Preview App" ] [ "0.37.4" ]
oMyZ8c8I0vYhYNRfqtL94DXMN91tvGJtmZVREXV%2FenU%3D 2021-04-20 10:37:19.8790000 [ "OneNoteMeetingsDefaultTab" ] [ "0.37.4" ]

@anthony-murphy
Contributor

I really don't understand how this was even supposed to work, as we know the server has a 5-minute timeout before sending leaves. From the data below, we can see that no client ever successfully waited for its leave. We need to be strict in this case and work on optimizing the leave message time; anything else leaves gaps that can lead to corruption.

Office_Fluid_FluidRuntime_Performance
| where Data_eventName has "WaitBeforeClientLeave"
| summarize count(), countif(Data_leaveReceived == 1)

count_ countif_
852 0

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

The result of this query is empty:
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(30d)
| where Data_eventName has "WaitBeforeClientLeave_end"
| where Data_leaveReceived == true
| summarize by Data_docId
This means we didn't receive the leave op while we were waiting, even though we had already received the join op.

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

Oh, Tony already had results for that.

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

Also, the results of this query:
let docs =
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(30d)
| where Data_eventName has "WaitBeforeClientLeave_end"
| summarize by Data_docId
| where isnotempty(Data_docId)
;
union Office_Fluid_FluidRuntime_Error
| where Data_message == "messageClientIdShouldHaveLeft"
| where Data_docId in (docs)
| project Data_docId, Data_loaderVersion

Data_docId Data_loaderVersion
FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D 0.37.2

It is weird that I found only 1 doc for which we recorded data corruption and where we were waiting on the leave op after receiving the join op, while in @anthony-murphy's query there are many docs with the "messageClientIdShouldHaveLeft" error.

@vladsud
Contributor

vladsud commented Apr 20, 2021

That's precisely why I'm asking for more data - up until Jatin's recent post there was no data on the overlap of those two circles - timeouts and data corruptions. Meaning that we could get rid of the timeout, wait a month for new data, and be in exactly the same position as today.

I glanced at some of the WaitBeforeClientLeave_end sessions and they all seem to follow a "Nack: Nonexistent client" disconnect. I do not know what it means, but it looks like the server assumes this is an invalid clientId for some reason, so it will likely never produce a leave op for it. Why - no clue; this needs to be looked at more deeply. We should also dig into why messageClientIdShouldHaveLeft happens and whether there is any relationship with the timeouts. There might be one (and it might be 100%), but I simply do not see that data right now, so I find it a bit premature to jump to code changes while we do not understand what's going on. And going to no timeout may result in a deadlock (if my theory about disconnects is correct). So more digging is required here.

@anthony-murphy
Contributor

Office_Fluid_FluidRuntime_Error
| where Data_message == "messageClientIdShouldHaveLeft"
| summarize make_set(Data_hostName), make_set(Data_loaderVersion) by Data_docId, unixtime_milliseconds_todatetime(Data_messageTimestamp)

@jatgarg
Contributor Author

jatgarg commented Apr 22, 2021

One problem is referred to in the issue above. One more is:
I have analyzed some sessions, and it seems like we are receiving the removeMember event before the disconnected event. That is why we always time out: the timer is never shut down by the leave, because the leave already occurred before the timer even started. I think this could happen when the server initiates the disconnection, in which case we first receive the removeMember event?

@anthony-murphy
Contributor

Could this be a situation like a laptop close, where the local state doesn't change but the server kicks the client out, and on laptop open we somehow fetch ops before we know we are disconnected?

@jatgarg
Contributor Author

jatgarg commented Apr 22, 2021

Maybe. I do see a time gap of a couple of minutes between the removeMember event and the last event before it whenever we receive this server nack ("Nonexistent client"), so it does seem like there might be no user activity during that time. So it looks like it is a wrong assumption that we will always receive removeMember after disconnect. If we receive removeMember before disconnect, then we need to take that into account and not wait for it.

@tanviraumi
Contributor

That's a great find! It's possible to receive a removeMember event before you see a disconnect. Given that websockets can be flaky and load balancers are still not perfect at keeping websocket connections alive, it's possible that the client thinks it's connected while the server thinks it is disconnected. This also explains the nack message: deli already considers you gone, so it will nack any further message from that clientId.

We need to handle this case as well. If the client sees a removeMember before seeing a disconnect, it needs to force a disconnect. This will generate duplicate leave messages, but deli already handles that.
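Roughly, the client-side handling could look like this (a sketch with hypothetical IConnectionLike and audience shapes; the actual container code differs):

import { EventEmitter } from "events";

// Hypothetical connection shape; the real connection object differs.
interface IConnectionLike {
    clientId: string | undefined;
    connected: boolean;
    disconnect(reason: string): void;
}

function forceDisconnectOnOwnLeave(audience: EventEmitter, connection: IConnectionLike): void {
    audience.on("removeMember", (clientId: string) => {
        // If the server removed *our* client while we still think we are connected,
        // the connection is effectively dead: force a disconnect so the reconnect
        // path (and the leave-op wait) starts from a consistent state.
        if (connection.connected && clientId === connection.clientId) {
            connection.disconnect("removeMember received for own clientId");
        }
    });
}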

@vladsud
Contributor

vladsud commented Apr 23, 2021

I think forcing disconnect on client's own leave op makes sense.
That will keep the rest of the state more consistent as well.

@jatgarg, how do you know that we received the leave op and minutes later received the "Nonexistent client" nack? I believe we do not have enough telemetry here to make such a conclusion. Is this from local testing?

@tanviraumi, is this condition due to ODSP reusing sockets? Because otherwise I'd assume the server would have closed the socket and thus not be in a position (especially minutes later) to receive anything from the client on that same socket to generate such a nack.

@tanviraumi
Contributor

r11s does not reuse sockets. @GaryWilber will know more about reusing sockets. I can see how it could happen seconds later, but I'm not sure how it could happen after minutes. Push uses a separate code path for handling websockets, so @GaryWilber can probably explain better.

@jatgarg
Contributor Author

jatgarg commented Apr 23, 2021

@vladsud "how do you know that we received leave op and minutes later receive Nonexistent client nack? I believe we do not have enough telemetry here to make such conclusion. Is this from local testing?"
Ans: No, the issue was that we received the nack first and then the disconnected event. We can deduce this because we listen to the removeMember event and log something based on it. Also, the time gap was between the last telemetry event and the "Nonexistent client" nack, not between the nack and the disconnection.

@tanviraumi We already listen for nacks and initiate disconnection based on that.

The above PR should resolve this issue.

@vladsud
Contributor

vladsud commented May 21, 2021

Note - I once was able to hit it in stress tests, which suggests the issue is likely not fully addressed.
I highly doubt that we would hit a time-out (i.e. that the timeout would be the reason for the error), for a couple of reasons:

  • Most of my recent runs are just 1-minute runs (though I do not remember for sure in which run I hit it).
  • The connection from my home to the endpoint is pretty good :) It's possible that a packet somewhere took a detour, but the chances are low.

Anyway, I wanted to share that it's likely possible to exercise this area via stress tests. We can actually adjust knobs to increase the number of forced disconnects to see if we can improve the chances of hitting it. I should be in a position to help here next week to add such knobs and fix some other stability issues in this test, so it can give you a more reliable harness.

@anthony-murphy
Contributor

We hit this in prod; @jatgarg has a theory.

@jatgarg
Contributor Author

jatgarg commented May 21, 2021

IMO we hit a rare case where a noop caused us to throw the data corruption error. When the issue occurred in prod, we did not wait for the leave op because our check said that we had received acks for all of our ops. But we don't check acks for trailing noops in the delta manager:

public shouldJoinWrite(): boolean {
    // We don't have to wait for ack for topmost NoOps. So subtract those.
    return this.clientSequenceNumberObserved < (this.clientSequenceNumber - this.trailingNoopCount);
}

So let's say clientSequenceNumberObserved was 9, clientSequenceNumber was 10, and the last op was a noop. Then it should be fine not to wait. So it looks like we can add a check where we throw the error:

else if (client?.shouldHaveLeft === true && message.type !== MessageType.Noop) {
    errorMsg = "messageClientIdShouldHaveLeft";
}
This also aligns with what @vladsud mentioned above.
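To make the trailing-noop arithmetic concrete, here is a standalone sketch (not the actual DeltaManager class) using the numbers above:

// Standalone version of the check above, using plain parameters.
function shouldJoinWrite(
    clientSequenceNumberObserved: number,
    clientSequenceNumber: number,
    trailingNoopCount: number,
): boolean {
    // We don't have to wait for acks of the topmost NoOps, so subtract those.
    return clientSequenceNumberObserved < (clientSequenceNumber - trailingNoopCount);
}

// Last sent op (sequence number 10) was a noop and only ops up to 9 were acked:
// 9 < (10 - 1) is false, so we do not wait for the previous client's leave op,
// and that unacked trailing noop can later trigger messageClientIdShouldHaveLeft
// unless noops are excluded from that check.
console.log(shouldJoinWrite(9, 10, 1)); // false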

@jatgarg
Contributor Author

jatgarg commented May 21, 2021

I think we can cherry-pick this change into 0.39 as well?

@anthony-murphy
Contributor

I think it is fine to cherry-pick, but I wouldn't release. If we have to release for another reason this will get bundled, but the issue is pretty minor and low probability.

@jatgarg
Contributor Author

jatgarg commented Jun 2, 2021

Closing this, as we are not seeing any new instances after the latest changes and the noop changes, having monitored for a month.
