Data Corruption due to distributed ordering service. (Follow up) #5866
Comments
Additionally, I don't think a timeout ever makes sense here. It should just be part of the protocol that the client needs to wait for its own leave before it's truly connected; anything else can lead to corruption issues. If the leave takes too long, we should work on making that faster rather than have the client unsafely proceed.
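For concreteness, a minimal sketch of such a strict gate (no timeout) over a generic op stream; the types and event names here are illustrative assumptions, not the actual Fluid container-loader code:

```ts
import { EventEmitter } from "events";

// Illustrative shape of a system message on the op stream (assumption).
interface ISystemMessage {
  type: "join" | "leave";
  clientId: string;
}

class LeaveGate {
  constructor(private readonly opStream: EventEmitter) {}

  /**
   * Resolves only once the leave op for the previous clientId is observed.
   * Deliberately has no timeout: the client stays "connecting" until then.
   */
  public waitForLeave(previousClientId: string): Promise<void> {
    return new Promise<void>((resolve) => {
      const listener = (msg: ISystemMessage) => {
        if (msg.type === "leave" && msg.clientId === previousClientId) {
          this.opStream.off("op", listener);
          resolve();
        }
      };
      this.opStream.on("op", listener);
    });
  }
}
```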
Is this a guess, or is there actual data supporting this claim?
I believe the original fix went in 0.37, but we're still seeing messageClientIdShouldHaveLeft errors, any of which could be corruptions:
I really don't understand how this was even supposed to work, as we know the server has a 5 min timeout to send leaves. From the data (below) we can see that no client ever successfully waited for their leave. We need to be strict in this case and work on optimizing leave message time; anything else leaves gaps that can lead to corruption.
Office_Fluid_FluidRuntime_Performance
The result of this query is empty.
Oh, Tony already had results for that.
Also, the results of the query:
It is weird that I found only 1 doc for which we recorded data corruption and where we were waiting on the leave op after receiving the join op, while in @anthony-murphy's query there are many docs with the "messageClientIdShouldHaveLeft" error.
That's precisely why I'm asking for more data. Up until the recent post by Jatin there was no data on the overlap of those two circles (timeouts and data corruptions), meaning that we could get rid of the timeout, wait for a month to collect new data, and be in exactly the same position as today.

I glanced at some of the WaitBeforeClientLeave_end sessions and they all seem to follow a "Nack: Nonexistent client" disconnect. I do not know what it means, but it looks like the server assumes this is an invalid clientId for some reason, and so it will likely never produce a leave op for it. Why, no clue; this needs to be looked at deeper. We should also look deeper into why messageClientIdShouldHaveLeft happens and whether there is any relationship with timeouts. There might be one (and it might be 100%), but I simply do not see that data right now, so I find it a bit premature to jump to code changes when we do not understand what's going on. And going to no timeout may result in deadlock (if my theory about disconnects is correct). So more digging is required here.
Office_Fluid_FluidRuntime_Error
One problem is referred to in the above issue. One more is:
Could this be a situation like a laptop close, where the local state doesn't change but the server kicks the client out, and on laptop open we somehow fetch ops before we know we are disconnected?
Maybe. I do see a time gap of a couple of minutes between the removeMember event and the last event before it whenever we receive this server nack (nonexistent client), so it does seem like there might be no user activity during that time. So it seems the assumption that we will always receive removeMember after disconnect is wrong. If we receive removeMember before disconnect, we need to take that into account and not wait for it.
That's a great find! It's possible to receive a removeMember event before you see the disconnect. Given that websockets can be flaky and load balancers are still not perfect at keeping websocket connections alive, it's possible that the client thinks it's connected while the server thinks it's disconnected. This also explains the nack message: deli already considers you gone, so it will nack any further message from that clientId. We need to handle this case as well. If a client sees a removeMember before seeing disconnect, it needs to force a disconnect. This will generate duplicate leave messages, but deli already handles that.
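A hedged sketch of that handling, assuming an audience emitter and a connection manager with a forced-disconnect hook; the names are illustrative, not the exact Fluid APIs:

```ts
import { EventEmitter } from "events";

// Illustrative connection-manager surface (assumption).
interface IConnectionManager {
  readonly clientId: string | undefined;
  readonly connected: boolean;
  /** Tears down the socket and runs the normal reconnect flow. */
  forceDisconnect(reason: string): void;
}

function installRemoveMemberGuard(
  audience: EventEmitter,
  conn: IConnectionManager,
): void {
  audience.on("removeMember", (removedClientId: string) => {
    // The server already considers this client gone (it will nack further
    // messages), so reconcile local state by disconnecting ourselves.
    // Duplicate leave messages are fine; deli de-duplicates them.
    if (conn.connected && removedClientId === conn.clientId) {
      conn.forceDisconnect("removeMember received before disconnect");
    }
  });
}
```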
I think forcing disconnect on the client's own leave op makes sense. @jatgarg, how do you know that we received the leave op and minutes later received the nonexistent-client nack? I believe we do not have enough telemetry here to make such a conclusion. Is this from local testing? @tanviraumi, is this condition due to ODSP reusing sockets? Otherwise I'd assume the server should have closed the socket, and thus not be in a position (especially minutes later) to receive anything from the client on that same socket to generate such a nack.
r11s does not reuse sockets. @GaryWilber will know more about reusing sockets. I can see how it can happen seconds later, but I'm not sure how it can happen after minutes. Push uses a separate code path for handling websockets, so @GaryWilber can probably explain better.
@vladsud "how do you know that we received leave op and minutes later receive Nonexistent client nack? I believe we do not have enough telemetry here to make such conclusion. Is this from local testing?" @tanviraumi We already listen for nacks and initiate disconnection based on that. The above PR should resolve this issue.
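Roughly, the nack-driven disconnect described above could look like this; the nack shape and the matched message text are assumptions for illustration:

```ts
// Illustrative nack payload (assumption).
interface INack {
  code: number;
  message: string;
}

// Treat a "Nonexistent client" nack as an authoritative signal that the
// server has already expired this clientId, and disconnect locally so the
// normal reconnect path runs with a fresh clientId.
function onNack(nack: INack, forceDisconnect: (reason: string) => void): void {
  if (nack.message.includes("Nonexistent client")) {
    forceDisconnect(`nack: ${nack.message}`);
  }
}
```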
Note: I was once able to hit it in stress tests, which suggests the issue is likely not fully addressed.
Anyway, I wanted to share that it's likely possible to exercise this area via stress tests. We can adjust knobs to increase the number of forced disconnects to see if we can improve the chances of hitting it. I should be in a position to help here next week to add such knobs and fix some other stability issues in this test, so as to give you a more reliable harness.
We hit this in prod; @jatgarg has a theory.
IMO we hit a rare case where a noop caused us to throw the data corruption error. When the issue occurred in prod, we did not wait for the leave op because our check said that we had received acks for all of our ops. But we don't check acks for trailing noops in the delta manager.
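To illustrate the shape of that gap, here is a minimal sketch of pending-op accounting that skips noops; the types are hypothetical, not the delta manager's real interfaces:

```ts
// Illustrative outbound message shape (assumption).
interface IOutboundMessage {
  type: string; // e.g. "op", "noop"
  clientSequenceNumber: number;
}

class PendingOpTracker {
  private lastSubmitted = 0;
  private lastAcked = 0;

  public onSubmit(msg: IOutboundMessage): void {
    // The gap: noops are not counted, so a trailing noop is invisible here
    // even though the server will still sequence it under the old clientId.
    if (msg.type !== "noop") {
      this.lastSubmitted = msg.clientSequenceNumber;
    }
  }

  public onAck(clientSequenceNumber: number): void {
    this.lastAcked = clientSequenceNumber;
  }

  /** True when all *tracked* ops are acked; a trailing noop slips through,
   *  so the client skips the wait-for-leave path when it shouldn't. */
  public get allOpsAcked(): boolean {
    return this.lastAcked >= this.lastSubmitted;
  }
}
```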
I think we can cherry-pick this change to 0.39 also? |
I think it is fine to cherry-pick, but I wouldn't release. If we have to release for another reason this will get bundled, but the issue is pretty minor and low probability.
Closing this, as we are not seeing any new instances after the latest and noop changes, having monitored for a month.
Describe the bug
The original bug is described here: Data corruption: Client runtime can't properly track op acks in presence of distributed ordering service architecture
To fix the issue, we implemented logic to wait up to 90 sec (default) for the leave op of the last client before moving to the connected state with the new clientId. If the leave op is not received within that time, we connect anyway, which resends the pending ops from the last client.
PR: #5484
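A minimal sketch of that timed wait, reusing a `waitForLeave(clientId)` promise like the gate sketched earlier; the 90 sec default and the race-with-timeout shape follow the description above, not the exact code in PR #5484:

```ts
const DEFAULT_LEAVE_WAIT_MS = 90_000;

async function waitForLeaveWithTimeout(
  waitForLeave: (clientId: string) => Promise<void>,
  previousClientId: string,
  timeoutMs: number = DEFAULT_LEAVE_WAIT_MS,
): Promise<"leaveReceived" | "timedOut"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timedOut">((resolve) => {
    timer = setTimeout(() => resolve("timedOut"), timeoutMs);
  });
  const leave = waitForLeave(previousClientId).then(
    () => "leaveReceived" as const,
  );
  // On "timedOut" the caller connects anyway and resends pending ops,
  // which is exactly the window where corruption can still occur.
  const result = await Promise.race([leave, timeout]);
  clearTimeout(timer);
  return result;
}
```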
Now, with telemetry to analyze, we are seeing that we mostly hit the 90 sec timeout and connect without the last client's leave, causing the data corruption error to still occur. This is expected, as the server sends the leave op only after waiting for 5 min.
So we are trying to increase the timeout to more than 5 min to see if we can get rid of the error. It is configurable through loader options, so we don't need to do a patch release.
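Configuring it might look like the sketch below; the option name `maxClientLeaveWaitTime` is an assumption based on this description, so check the actual ILoaderOptions for the real name:

```ts
// Hypothetical loader options: wait longer than the server's 5 min leave
// timeout (value in milliseconds) so the leave op normally arrives in time.
const loaderOptions = {
  maxClientLeaveWaitTime: 6 * 60 * 1000,
};

// const loader = new Loader({ ...otherLoaderProps, options: loaderOptions });
```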
Also, 5 min seems like a long time for the server to wait before sending the leave op, so we can consider reducing that on the server side too.