Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation for resolving data corruption issue for distributed data ordering service #5484

Merged
merged 35 commits into from
Mar 25, 2021

Conversation

jatgarg
Copy link
Contributor

@jatgarg jatgarg commented Mar 11, 2021

Fixes: #4399
Sol:
Whenever we receive a disconnected event, if we have outstanding ops for disconnected client and if that client was write client and we are not already waiting, then we start a timer for waiting for leave/timeout of that client. If we receive the addMember/join of new client before receiving removeMember/leave of older client, then we don't move container to connected state instead we wait for leave/timeout. So later when leave come, we move the container to Connected state. If leave of older client is received first, then we just move container to connected state on join immediately. In case of timeout, we stop waiting for leave and just continue with normal operation hoping that nothing bad happens.
Current wait time for leave op is 90 sec.

@jatgarg jatgarg added the area: loader Loader related issues label Mar 11, 2021
@jatgarg jatgarg added this to the March 2021 milestone Mar 11, 2021
@jatgarg jatgarg self-assigned this Mar 11, 2021
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Mar 11, 2021

@fluidframework/base-host: +2.38 KB
Metric NameBaseline SizeCompare SizeSize Diff
main.js 179.99 KB 182.37 KB +2.38 KB
Total Size 179.99 KB 182.37 KB +2.38 KB
@fluid-example/bundle-size-tests: No change
Metric NameBaseline SizeCompare SizeSize Diff
container.js 201.54 KB 201.54 KB No change
map.js 49.11 KB 49.11 KB No change
matrix.js 143.42 KB 143.42 KB No change
odspDriver.js 201.22 KB 201.22 KB No change
sharedString.js 158.14 KB 158.14 KB No change
Total Size 753.45 KB 753.45 KB No change

Baseline commit: 0b4889b

Generated by 🚫 dangerJS against 3c76d63

@anthony-murphy
Copy link
Contributor

anthony-murphy commented Mar 12, 2021

general feedback. can we encapsulat the logic we are adding to container into a class. i don't like the way the logic and state is so spread out. the class should also then be unit testable, so fewer end to end tests will be needed, consider then when builing the api

&& this.client.mode === "write"
) {
this.prevClientLeftP = new Deferred();
// Max time is 5 min for which we are going to wait for its own "leave" message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally do not see much value in waiting 5 min - it always same as server timeout (server has 5 min timeout, the difference is only in TCP/IP timeout that adds on top of it for worse possible case).
I'm of an opinion that we want to go with much smaller timeout as we are not making things worse then they are today (data corruption). Going smaller will ensure that we can find reasonable place where we feel Ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to 90 sec. Should we reduce it more?

@vladsud
Copy link
Contributor

vladsud commented Mar 15, 2021

Consider moving all connectivity-based logic into separate class as a pre-cursor to this change.
I believe this will allow you to

  • write UTs independently
  • I think (though not sure) it will be much easier to spell out states / transitions, and ideally not use promises - this will help in better understanding logic (asynchrony changes state and assumptions and it's in general very hard to see omissions, synchronous code forces to clearly articulate state when transition needs to happen). And I think it will lead to better code and better testability and hitting all conditions.

@jatgarg
Copy link
Contributor Author

jatgarg commented Mar 19, 2021

Added UTs too.

@github-actions github-actions bot requested a review from vladsud March 19, 2021 18:05
@github-actions github-actions bot requested a review from vladsud March 19, 2021 19:46
@github-actions github-actions bot requested a review from vladsud March 19, 2021 23:56
@vladsud
Copy link
Contributor

vladsud commented Mar 24, 2021

Consider sprinkling telemetry events, such that when things go wrong, we have at least something to debug it.

@github-actions github-actions bot requested a review from vladsud March 24, 2021 22:57
@jatgarg jatgarg merged commit 369384c into microsoft:main Mar 25, 2021
@jatgarg jatgarg deleted the datacorr branch March 25, 2021 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: loader Loader related issues
Projects
None yet
4 participants