
Data Corruption due to distributed ordering service. (Follow up) #5866

Closed
jatgarg opened this issue Apr 20, 2021 · 23 comments
Assignees
jatgarg
Labels
area: runtime (Runtime related issues), bug (Something isn't working)
Milestone
May 2021

Comments

@jatgarg
Contributor

jatgarg commented Apr 20, 2021

Describe the bug

The original bug is described here: Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture

To fix the issue, we implemented logic that waits up to 90 seconds (default) for the leave op of the last client before moving to the connected state with the new clientId; if the timeout expires without the leave op, we connect anyway and resend the pending ops from the last client.
PR: #5484

Now that we have telemetry to analyze, we are seeing that we mostly hit the 90-second timeout and connect without the last client having left, so the data corruption error still occurs. This is expected, as the server only sends the leave op after waiting for 5 minutes.
So we are trying to increase the timeout to more than 5 minutes to see if we can get rid of the error. It is configurable through loader options, so we don't need to do a patch release.
Also, 5 minutes seems like a long time for the server to wait before sending the leave op, so we can think about reducing that on the server side too.
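For context, here is a rough sketch of the wait-before-connect logic described above (illustrative only: IQuorumLike and waitForPreviousClientLeave are hypothetical names, not the actual loader API; the real timeout is wired through loader options):

import { EventEmitter } from "events";

// Hypothetical shape of the quorum we listen on for member changes.
interface IQuorumLike extends EventEmitter {
    getMember(clientId: string): unknown | undefined;
}

// Wait for the previous client's leave op before declaring the new connection ready.
// Resolves true if the leave arrived, false if we timed out and proceed anyway
// (the risky path this issue is about).
async function waitForPreviousClientLeave(
    quorum: IQuorumLike,
    previousClientId: string,
    timeoutMs = 90_000, // the 90-second default discussed above
): Promise<boolean> {
    // The leave may already have been processed before we started waiting.
    if (quorum.getMember(previousClientId) === undefined) {
        return true;
    }
    return new Promise<boolean>((resolve) => {
        const onRemoveMember = (clientId: string) => {
            if (clientId === previousClientId) {
                clearTimeout(timer);
                quorum.off("removeMember", onRemoveMember);
                resolve(true);
            }
        };
        const timer = setTimeout(() => {
            quorum.off("removeMember", onRemoveMember);
            resolve(false); // timed out: connect anyway and resubmit pending ops
        }, timeoutMs);
        quorum.on("removeMember", onRemoveMember);
    });
}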

@jatgarg jatgarg added the bug Something isn't working label Apr 20, 2021
@jatgarg jatgarg added this to the May 2021 milestone Apr 20, 2021
@jatgarg jatgarg self-assigned this Apr 20, 2021
@ghost ghost added the triage label Apr 20, 2021
@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

@anthony-murphy
Contributor

Additionally, I don't think a timeout ever makes sense here. It should just be part of the protocol that the client needs to wait for its own leave before it's truly connected; anything else can lead to corruption issues. If the leave takes too long, we should work on making that faster rather than having the client unsafely proceed.

@vladsud
Contributor

vladsud commented Apr 20, 2021

"We mostly hit the 90-second timeout and connect without the last client having left, so the data corruption error still occurs."

Is this a guess, or is there actual data supporting this claim?
Can you please post stats / queries pointing out how many times we hit the timeout, and how many times we actually see duplicate ops in those cases?

@anthony-murphy
Contributor

I believe the original fix went in 0.37, but we're still seeing messageClientIdShouldHaveLeft errors, any of which could be corruptions:

Data_docId Data_messageTimestamp set_Data_hostName set_Data_loaderVersion
iTrsm2r5fc%2BOSN3TkfanHFGYns%2Fp%2BZnvHMfF6lQTL3o%3D   [ "Office.com" ] [ "0.34.1" ]
nLHcLBrJ3Unz9z1iaBC%2F9JslN46r36z12gLLhEi2HEY%3D   [ "Office.com" ] [ "0.34.1" ]
FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D 2021-04-16 21:33:17.3360000 [ "Office.com" ] [ "0.37.2" ]
tLfQmK3UcvMqF2Ki9M8doLXrekk3jIG9yYHxcfnFaSQ%3D 2021-04-16 22:34:33.3510000 [ "Office.com" ] [ "0.37.2" ]
tLfQmK3UcvMqF2Ki9M8doLXrekk3jIG9yYHxcfnFaSQ%3D 2021-04-16 22:45:28.2430000 [ "Office.com" ] [ "0.37.2" ]
bE6fle9OPDm0PVSEzRrbKyjNsWlkWYjVfS2gXnvPimI%3D 2021-04-19 04:31:31.5340000 [ "Fluid Preview App" ] [ "0.37.4" ]
ug9m8y%2FdYvtus2RIIGaAPvYvlokhPmgUbfxSElZ3nXw%3D 2021-04-19 17:02:20.0270000 [ "Office.com" ] [ "0.37.2" ]
nVAzpfpoFYuXXlbW7grt38HBzilcRatcrbeO2vGHFuE%3D 2021-04-19 20:00:56.1180000 [ "Fluid Preview App" ] [ "0.37.4" ]
oMyZ8c8I0vYhYNRfqtL94DXMN91tvGJtmZVREXV%2FenU%3D 2021-04-20 10:37:19.8790000 [ "OneNoteMeetingsDefaultTab" ] [ "0.37.4" ]

@anthony-murphy
Contributor

I really don't understand how this was even supposed to work, as we know the server has a 5-minute timeout before sending leaves. From the data below, we can see that no client ever successfully waited for its leave. We need to be strict in this case and work on optimizing the leave message time; anything else leaves gaps that can lead to corruption.

Office_Fluid_FluidRuntime_Performance
| where Data_eventName has "WaitBeforeClientLeave"
| summarize count(), countif(Data_leaveReceived == 1)

count_ countif_
852 0

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

The result of this query is empty:
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(30d)
| where Data_eventName has "WaitBeforeClientLeave_end"
| where Data_leaveReceived == true
| summarize by Data_docId
This means we didn't receive the leave op while we were waiting, even though we had already received the join op.

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

Oh, Tony already had results for that.

@jatgarg
Contributor Author

jatgarg commented Apr 20, 2021

Also, the results of this query:
let docs =
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(30d)
| where Data_eventName has "WaitBeforeClientLeave_end"
| summarize by Data_docId
| where isnotempty(Data_docId)
;
union Office_Fluid_FluidRuntime_Error
| where Data_message == "messageClientIdShouldHaveLeft"
| where Data_docId in (docs)
| project Data_docId, Data_loaderVersion

Data_docId Data_loaderVersion
FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D 0.37.2

It is weird that I found only 1 doc for which we recorded data corruption and where we were waiting on the leave op after receiving the join op, while in @anthony-murphy's query there are many docs with the "messageClientIdShouldHaveLeft" error.

@vladsud
Contributor

vladsud commented Apr 20, 2021

That's precisely why I'm asking for more data - up until Jatin's recent post there was no data on the overlap of those two circles - timeouts and data corruptions. Meaning that we could get rid of the timeout, wait a month for new data, and be in exactly the same position as today.

I glanced at some of the WaitBeforeClientLeave_end sessions and they all seem to follow a "Nack: Nonexistent client" disconnect. I do not know what it means, but it looks like the server assumes this is an invalid clientId for some reason, so it will likely never produce a leave op for it. Why - no clue; this needs to be looked at more deeply. We should also dig into why messageClientIdShouldHaveLeft happens and whether there is any relationship with the timeouts. There might be one (and it might be 100%), but I simply do not see that data right now, so I find it a bit premature to jump to code changes while we do not understand what's going on. And going to no timeout may result in a deadlock (if my theory about disconnects is correct). So more digging is required here.

@anthony-murphy
Contributor

Office_Fluid_FluidRuntime_Error
| where Data_message == "messageClientIdShouldHaveLeft"
| summarize make_set(Data_hostName), make_set(Data_loaderVersion) by Data_docId, unixtime_milliseconds_todatetime(Data_messageTimestamp)

@jatgarg
Contributor Author

jatgarg commented Apr 22, 2021

One problem is referred to in the issue above. One more is:
I have analyzed some sessions, and it seems like we are receiving the removeMember event before the disconnected event. That is why we always time out: the timer is never shut down by the leave, because the leave already occurred before the timer even started. I think this could happen when the server initiates the disconnection, in which case we first receive the removeMember event?

@anthony-murphy
Contributor

Could this be a situation like a laptop close, where the local state doesn't change but the server kicks the client out, and on laptop open we somehow fetch ops before we know we are disconnected?

@jatgarg
Contributor Author

jatgarg commented Apr 22, 2021

Maybe. I do see a time gap of a couple of minutes between the removeMember event and the last event before it whenever we receive this server nack ("Nonexistent client"), so it does seem like there might be no user activity during that time. So it looks like it is a wrong assumption that we will always receive removeMember after disconnect. If we receive removeMember before disconnect, then we need to take that into account and not wait for it.

@tanviraumi
Contributor

That's a great find! It's possible to receive a removeMember event before you see a disconnect. Given that websockets can be flaky and load balancers are still not perfect at keeping websocket connections alive, it's possible that the client thinks it's connected while the server thinks it is disconnected. This also explains the nack message: deli already considers you gone, so it will nack any further message from that clientId.

We need to handle this case as well. If the client sees a removeMember before seeing a disconnect, it needs to force a disconnect. This will generate duplicate leave messages, but deli already handles that.
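Roughly, the client-side handling could look like this (a sketch with hypothetical IConnectionLike and audience shapes; the actual container code differs):

import { EventEmitter } from "events";

// Hypothetical connection shape; the real connection object differs.
interface IConnectionLike {
    clientId: string | undefined;
    connected: boolean;
    disconnect(reason: string): void;
}

function forceDisconnectOnOwnLeave(audience: EventEmitter, connection: IConnectionLike): void {
    audience.on("removeMember", (clientId: string) => {
        // If the server removed *our* client while we still think we are connected,
        // the connection is effectively dead: force a disconnect so the reconnect
        // path (and the leave-op wait) starts from a consistent state.
        if (connection.connected && clientId === connection.clientId) {
            connection.disconnect("removeMember received for own clientId");
        }
    });
}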

@vladsud
Contributor

vladsud commented Apr 23, 2021

I think forcing disconnect on client's own leave op makes sense.
That will keep the rest of the state more consistent as well.

@jatgarg, how do you know that we received the leave op and minutes later received the "Nonexistent client" nack? I believe we do not have enough telemetry here to make such a conclusion. Is this from local testing?

@tanviraumi, is this condition due to ODSP reusing sockets? Because otherwise I'd assume the server would have closed the socket and thus not be in a position (especially minutes later) to receive anything from the client on that same socket to generate such a nack.

@tanviraumi
Contributor

r11s does not reuse sockets. @GaryWilber will know more about reusing sockets. I can see how it could happen seconds later, but I'm not sure how it could happen after minutes. Push uses a separate code path for handling websockets, so @GaryWilber can probably explain better.

@jatgarg
Contributor Author

jatgarg commented Apr 23, 2021

@vladsud "how do you know that we received leave op and minutes later receive Nonexistent client nack? I believe we do not have enough telemetry here to make such conclusion. Is this from local testing?"
Ans: No, the issue was that we received the nack first and then the disconnected event. We can deduce this because we listen to the removeMember event and log something based on it. Also, the time gap was between the last telemetry event and the "Nonexistent client" nack, not between the nack and the disconnection.

@tanviraumi We already listen for nacks and initiate disconnection based on that.

The above PR should resolve this issue.

@vladsud
Contributor

vladsud commented May 21, 2021

Note - I once was able to hit it in stress tests, which suggests the issue is likely not fully addressed.
I highly doubt that we would hit a time-out (i.e. that the timeout would be the reason for the error), for a couple of reasons:

  • Most of my recent runs are just 1-minute runs (though I do not remember for sure in which run I hit it).
  • The connection from my home to the endpoint is pretty good :) It's possible that a packet somewhere took a detour, but the chances are low.

Anyway, I wanted to share that it's likely possible to exercise this area via stress tests. We can actually adjust knobs to increase the number of forced disconnects to see if we can improve the chances of hitting it. I should be in a position to help here next week to add such knobs and fix some other stability issues in this test, so it can give you a more reliable harness.

@anthony-murphy
Contributor

We hit this in prod; @jatgarg has a theory.

@jatgarg
Contributor Author

jatgarg commented May 21, 2021

IMO we hit a rare case where a noop caused us to throw the data corruption error. When the issue occurred in prod, we did not wait for the leave op because our check said that we had received acks for all of our ops. But we don't check acks for trailing noops in the delta manager:

public shouldJoinWrite(): boolean {
    // We don't have to wait for ack for topmost NoOps. So subtract those.
    return this.clientSequenceNumberObserved < (this.clientSequenceNumber - this.trailingNoopCount);
}

So let's say clientSequenceNumberObserved was 9, clientSequenceNumber was 10, and the last op was a noop. Then it should be fine not to wait. So it looks like we can add a check where we throw the error:

else if (client?.shouldHaveLeft === true && message.type !== MessageType.Noop) {
    errorMsg = "messageClientIdShouldHaveLeft";
}
This also aligns with what @vladsud mentioned above.
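To make the trailing-noop arithmetic concrete, here is a standalone sketch (not the actual DeltaManager class) using the numbers above:

// Standalone version of the check above, using plain parameters.
function shouldJoinWrite(
    clientSequenceNumberObserved: number,
    clientSequenceNumber: number,
    trailingNoopCount: number,
): boolean {
    // We don't have to wait for acks of the topmost NoOps, so subtract those.
    return clientSequenceNumberObserved < (clientSequenceNumber - trailingNoopCount);
}

// Last sent op (sequence number 10) was a noop and only ops up to 9 were acked:
// 9 < (10 - 1) is false, so we do not wait for the previous client's leave op,
// and that unacked trailing noop can later trigger messageClientIdShouldHaveLeft
// unless noops are excluded from that check.
console.log(shouldJoinWrite(9, 10, 1)); // false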

@jatgarg
Contributor Author

jatgarg commented May 21, 2021

I think we can cherry-pick this change into 0.39 as well?

@anthony-murphy
Contributor

I think it is fine to cherry-pick, but I wouldn't release. If we have to release for another reason this will get bundled, but the issue is pretty minor and low probability.

@jatgarg
Contributor Author

jatgarg commented Jun 2, 2021

Closing this, as we are not seeing any new instances after the latest changes and the noop changes, having monitored for a month.
