Implementation for resolving data corruption issue for distributed data ordering service #5484
Conversation
Bundle size report:
- @fluidframework/base-host: +2.38 KB
- @fluid-example/bundle-size-tests: no change
- Baseline commit: 0b4889b
General feedback: can we encapsulate the logic we are adding to Container into a class? I don't like the way the logic and state are so spread out. The class should also then be unit testable, so fewer end-to-end tests will be needed; consider that when building the API.
&& this.client.mode === "write"
) {
    this.prevClientLeftP = new Deferred();
    // Max time is 5 min for which we are going to wait for its own "leave" message.
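The excerpt above creates a `Deferred` that is resolved when the previous client's "leave" message arrives. A minimal sketch of how such a wait can be raced against a timeout, assuming a hand-rolled `Deferred` for illustration (the real one lives elsewhere in the Fluid codebase) and a configurable wait time, since the exact value was still under review:

```typescript
// Illustrative Deferred: a promise whose resolution is exposed as a method.
class Deferred<T> {
    public readonly promise: Promise<T>;
    private res!: (value: T) => void;
    constructor() {
        this.promise = new Promise<T>((resolve) => { this.res = resolve; });
    }
    // Promise resolution is first-wins, so resolving twice is harmless.
    public resolve(value: T): void { this.res(value); }
}

// Wait for the previous client's leave, but give up after timeoutMs.
// "leave" means the leave op arrived; "timeout" means we stopped waiting.
async function waitForLeave(
    prevClientLeftP: Deferred<"leave" | "timeout">,
    timeoutMs: number,
): Promise<"leave" | "timeout"> {
    const timer = setTimeout(() => prevClientLeftP.resolve("timeout"), timeoutMs);
    const outcome = await prevClientLeftP.promise;
    clearTimeout(timer);
    return outcome;
}
```

When the container sees the old client's removeMember, it calls `prevClientLeftP.resolve("leave")`; otherwise the timer fires and normal operation resumes.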
I personally do not see much value in waiting 5 min - it is always the same as the server timeout (the server has a 5 min timeout; the only difference is the TCP/IP timeout that adds on top of it in the worst possible case).
I'm of the opinion that we want to go with a much smaller timeout, as we are not making things worse than they are today (data corruption). Going smaller will ensure that we can find a reasonable place where we feel OK.
Changed it to 90 sec. Should we reduce it more?
Consider moving all connectivity-based logic into a separate class as a precursor to this change.
Added unit tests too.
packages/loader/container-loader/src/test/connectionStateHandler.spec.ts
Consider sprinkling in telemetry events, so that when things go wrong we have at least something to debug with.
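One way to act on this suggestion is to record an event for each outcome of the wait. The logger interface and event names below are illustrative assumptions, not the actual Fluid Framework telemetry API:

```typescript
// Hypothetical minimal logger interface for illustration only.
interface TelemetryLogger {
    sendTelemetryEvent(event: { eventName: string; [key: string]: string | number }): void;
}

// Emit one event per wait outcome so that production failures leave a trail:
// did the old client's leave arrive, or did we give up after the timeout?
function reportWaitOutcome(
    logger: TelemetryLogger,
    outcome: "leave" | "timeout",
    waitDurationMs: number,
): void {
    logger.sendTelemetryEvent({
        eventName: "WaitBeforeClientLeave",
        outcome,
        waitDurationMs,
    });
}
```

A timeout event with a large `waitDurationMs` in the logs would then point directly at the data-corruption window this PR is trying to close.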
Fixes: #4399
Solution:
Whenever we receive a disconnected event, if we have outstanding ops for the disconnected client, that client was a write client, and we are not already waiting, then we start a timer to wait for the leave/timeout of that client. If we receive the addMember/join of the new client before receiving the removeMember/leave of the older client, we do not move the container to the Connected state; instead we keep waiting for the leave/timeout, and when the leave arrives we move the container to the Connected state. If the leave of the older client is received first, then we move the container to the Connected state immediately on join. In case of timeout, we stop waiting for the leave and continue with normal operation, hoping that nothing bad happens.
Current wait time for the leave op is 90 sec.
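The flow above can be sketched as a small state machine, in the spirit of the class-encapsulation feedback from the review. All names here are illustrative, not the actual container-loader implementation:

```typescript
// Tracks whether the container may transition to Connected, deferring the
// transition while we wait for the previous write client's leave op.
class LeaveWaiter {
    public connected = false;
    private waiting = false;
    private joinPending = false;
    private timer: ReturnType<typeof setTimeout> | undefined;

    // Default wait mirrors the 90 sec mentioned in the PR description.
    constructor(private readonly waitTimeMs = 90 * 1000) {}

    // Start waiting only for a write client with outstanding ops,
    // and only if we are not already waiting.
    public onDisconnect(hadOutstandingOps: boolean, mode: "read" | "write"): void {
        this.connected = false;
        if (hadOutstandingOps && mode === "write" && !this.waiting) {
            this.waiting = true;
            // On timeout, stop waiting and continue with normal operation.
            this.timer = setTimeout(() => this.stopWaiting(), this.waitTimeMs);
        }
    }

    // addMember/join of the new client: defer Connected if still waiting.
    public onJoin(): void {
        if (this.waiting) {
            this.joinPending = true;
        } else {
            this.connected = true;
        }
    }

    // removeMember/leave of the old client arrived before the timeout.
    public onLeave(): void {
        if (this.timer !== undefined) {
            clearTimeout(this.timer);
            this.timer = undefined;
        }
        this.stopWaiting();
    }

    private stopWaiting(): void {
        this.waiting = false;
        if (this.joinPending) {
            this.joinPending = false;
            this.connected = true;
        }
    }
}
```

Because the state lives in one class, the join-before-leave, leave-before-join, and timeout orderings can each be unit tested without a full end-to-end container setup.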