Investigate duped ops due to socket reuse in odsp driver #3627
Comments
specifically, on reconnect we see:
2 should never happen after 1: the join message acts as a barrier in the op stream, so the client only watches for ops from itself until it sees its new join message. In this case it doesn't recognize the ops in step 2 as its own, so it resends them when the reconnection completes in 3.
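The barrier behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the loader's or the odsp driver's actual code: the `Message` and `PendingOp` types and the `opsToResend` function are hypothetical names made up for this example, and it assumes each reconnect gets a fresh client id. It shows why an op from the old connection that is sequenced after the new join message is not recognized as local and gets resent.

```typescript
// Hypothetical types for illustration only; not the Fluid Framework API.
type Message =
  | { type: "join"; clientId: string }
  | { type: "op"; clientId: string; clientSequenceNumber: number };

interface PendingOp {
  clientSequenceNumber: number;
}

// Decide which pending ops to resend after a reconnect.
// The client scans the op stream for its own (old connection) ops only
// until it reaches the join message of its new connection (the barrier).
function opsToResend(
  stream: Message[],
  pending: PendingOp[],
  oldClientId: string,
  newClientId: string,
): PendingOp[] {
  const acked = new Set<number>();
  for (const msg of stream) {
    if (msg.type === "join" && msg.clientId === newClientId) {
      break; // barrier: stop watching for ops from the old connection
    }
    if (msg.type === "op" && msg.clientId === oldClientId) {
      acked.add(msg.clientSequenceNumber);
    }
  }
  return pending.filter((op) => !acked.has(op.clientSequenceNumber));
}

const pending: PendingOp[] = [
  { clientSequenceNumber: 1 },
  { clientSequenceNumber: 2 },
];

// Expected ordering: the old connection's op lands before the new join.
const expectedOrder: Message[] = [
  { type: "op", clientId: "A", clientSequenceNumber: 1 },
  { type: "join", clientId: "B" },
];

// Problem ordering: the old connection's op is sequenced after the new join,
// so it is never matched against the pending list and is resent.
const problemOrder: Message[] = [
  { type: "join", clientId: "B" },
  { type: "op", clientId: "A", clientSequenceNumber: 1 },
];

const resendExpected = opsToResend(expectedOrder, pending, "A", "B");
const resendProblem = opsToResend(problemOrder, pending, "A", "B");
console.log(resendExpected.map((op) => op.clientSequenceNumber));
console.log(resendProblem.map((op) => op.clientSequenceNumber));
```

With the expected ordering only op 2 is resent. With the problem ordering both ops are resent even though op 1 was already sequenced, which is the duplication this issue tracks.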
Per discussion with Matt, removing hotpatch tag and pushing out to October.
@danielroney and @ChumpChief this is causing active data corruption issues. I think it should be a hot patch. @heliocliu can you provide the loader versions, and expected log entry for each version to detect this issue as it has changed a few times.
So for 0.24/0.25, there should be a telemetry event named For 0.26/0.27, there will be a container error with message either |
@heliocliu Any updates here? Is there concrete work we should do this month to move along this investigation or mitigate the issue? Or are we waiting for releases to get picked up by OWH and deployed by partners to see telemetry?
@markfields Don't really have any here... The old telemetry suggested this issue wasn't as prevalent as feared and the new telemetry (which hasn't been picked up yet afaik) introduces some throws, so we should have more to work with come 0.26 integration
0.26 integration is done but not deployed yet, so we'll keep an eye on this telemetry after that goes out.
0.26 bump deployed yesterday, no hits on the new telemetry yet. Will continue to monitor - possible we just haven't happened to hit it or that other issues are masking it.
No hits on the new telemetry (>0.26), and no hits on the old telemetry (<=0.25) since 10/27, so seems plausible that #3787 was successful in mitigating. Closing as there are no recent hits.
See: #3605 which adds some telemetry,
Teams thread
jk no teams thread
There's some issue resulting in multiple connections re-sending pending ops to the server, leading to data loss. We are observing in one instance that during network instability, multiple active reconnections are re-sending pending ops and not correctly identifying those ops as local, leading to duplication.
Ops list