Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Audience reliability using heartbeat via signals #7845

Closed
wants to merge 22 commits into from

Conversation

Dhoot
Copy link
Contributor

@Dhoot Dhoot commented Oct 14, 2021

Implementing heartbeat/ping-pong mechanism and get_clients signal for improving Audience reliability.

@github-actions github-actions bot requested a review from vladsud October 14, 2021 11:16
@github-actions github-actions bot added area: definitions area: driver Driver related issues area: loader Loader related issues area: server Server related issues (routerlicious) labels Oct 14, 2021
@github-actions github-actions bot requested a review from curtisman October 14, 2021 11:16
@Dhoot Dhoot requested review from tanviraumi and removed request for curtisman and vladsud October 14, 2021 11:16
@github-actions github-actions bot requested review from vladsud and curtisman October 14, 2021 11:26
Copy link
Contributor

@anthony-murphy anthony-murphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a work item that goes over the design here? this is a fairly large change to the server, protocol, and client that should be reviewed and agreed upon

Copy link
Contributor

@tanviraumi tanviraumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! This is a big protocol change. My initial thought: we should try to solve the Audience problem up the stack (runtime/application, preferably application) and see how reliable it is, before changing the protocol and new concepts.

server/routerlicious/packages/lambdas/src/alfred/index.ts Outdated Show resolved Hide resolved
@Dhoot
Copy link
Contributor Author

Dhoot commented Oct 18, 2021

@anthony-murphy it indirectly related to #6544 because to choose read-only leader we need to choose it from Audience list and for that we need to make our Audience list more reliable.

@anthony-murphy
Copy link
Contributor

@anthony-murphy it indirectly related to #6544 because to choose read-only leader we need to choose it from Audience list and for that we need to make our Audience list more reliable.

I understand that, but for a change like this we need to design review the proposal for the fix as well. please open an issue where the design for this can be discussed.

@Dhoot Dhoot linked an issue Oct 19, 2021 that may be closed by this pull request
@Dhoot Dhoot requested a review from a team as a code owner October 26, 2021 17:30
@github-actions github-actions bot added the area: dev experience Improving the experience of devs building on top of fluid label Oct 26, 2021
@github-actions github-actions bot removed the area: loader Loader related issues label Oct 26, 2021
// client missed addMember event.
if (this.audience.getMember(message.clientId) === undefined) {
this.emit(MessageType.ClientJoin, this.audience.getMember(message.clientId));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to deal with the case where we miss a client ping for 5 times, (thus emitted client leave), but we heard for that client again. (which I assume you want to emit join again?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am emitting MessageType.ClientJoin and MessageType.ClientLeave both

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me how the API could be used and what the expectation is. Can you write documentation and test to illustrate what can be built on top of this?

constructor(
audience: IAudience,
runtime: IFluidDataStoreRuntime,
frequency: number = 30000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

30000

Do we intent to have every readonly client to send ping every 30 seconds? That will quickly get to a lot of signal as more clients join the session.

I am wondering whether we are solving right problem here. I believe your intent is to make Audience more reliable so that it can be relied on it to pick a read only client to do some work. But there are alternatives to sove that problem. Can we solve the problem more directly?

I wonder would it suffice if the client that is supposed to do the work send the singal when it is doing the work, and the other validate it that way?

The key of validating it is understanding the error rate of signals, which can be measured

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I am also not sure about frequency that's why I thought of to keep it in params to keep it configurable, may be after some load test we will be to know what should be default frequency.

To make sure low load I added enableHeartBeat and disableHeartBeat so the moment we have one member in quorum we can disable it and when there is no member in quorum we can enable it again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should avoid bloating the network traffic that scales up by the number of client. Does all the client really need to know that all the read client is there for your scenario? As I indicated, I believe you only need to make sure that one that you chose to do the work is alive for your scenario. That's O(1) vs. O(n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, I am try to propose a generic scenario to improve reliability of audience.

For leader only signal I think we should handle it, FFX level as leader selection may vary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to avoid the multiple leader problem too as far as we can. For that signalling to all clients will be needed. @curtisman please let us know if you feel something different?

@github-actions github-actions bot added area: contributor experience area: dds Issues related to distributed data structures area: tests Tests to add, test infrastructure improvements, etc dependencies Pull requests that update a dependency file labels Nov 17, 2021
@ghost ghost added the status: stale label May 17, 2022
@ghost
Copy link

ghost commented May 17, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this May 25, 2022
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: contributor experience area: dds Issues related to distributed data structures area: dev experience Improving the experience of devs building on top of fluid area: tests Tests to add, test infrastructure improvements, etc dependencies Pull requests that update a dependency file status: stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve audience list reliability
9 participants