Improving Audience reliability using heartbeat via signals #7845
… improving Audience reliability.
Is there a work item that goes over the design here? This is a fairly large change to the server, protocol, and client, and it should be reviewed and agreed upon.
Agree! This is a big protocol change. My initial thought: we should try to solve the Audience problem up the stack (runtime/application, preferably application) and see how reliable it is before changing the protocol and introducing new concepts.
@anthony-murphy This is indirectly related to #6544: to choose a read-only leader we need to pick it from the Audience list, and for that the Audience list needs to be more reliable.
I understand that, but for a change like this we need to design-review the proposed fix as well. Please open an issue where the design can be discussed.
experimental/framework/audience-heartbeat/src/audienceWithHeartBeat.ts
// client missed addMember event.
if (this.audience.getMember(message.clientId) === undefined) {
    this.emit(MessageType.ClientJoin, this.audience.getMember(message.clientId));
}
We need to handle the case where we miss a client's ping 5 times (and thus emit client leave) but then hear from that client again. (I assume you want to emit join again?)
I am emitting both MessageType.ClientJoin and MessageType.ClientLeave.
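The leave-then-rejoin case raised in this thread can be illustrated with a minimal sketch. It is not the PR's implementation: `HeartbeatTracker`, `MembershipEvent`, `onPing`, and `sweep` are hypothetical names, timestamps are passed in explicitly rather than read from a clock, and the defaults (30000 ms, 5 missed pings) are taken from the numbers mentioned in the review.

```typescript
// Hypothetical sketch; none of these names come from the PR or from Fluid Framework.
type MembershipEvent = { type: "join" | "leave"; clientId: string };

class HeartbeatTracker {
    // Time (ms) of the last ping heard from each client currently considered present.
    private readonly lastSeen = new Map<string, number>();

    constructor(
        private readonly intervalMs: number = 30000,
        private readonly maxMissed: number = 5,
    ) {}

    // Record a ping. Returns a join event when the client was not considered present,
    // which covers both a missed join and a client that was previously swept out.
    public onPing(clientId: string, nowMs: number): MembershipEvent | undefined {
        const known = this.lastSeen.has(clientId);
        this.lastSeen.set(clientId, nowMs);
        return known ? undefined : { type: "join", clientId };
    }

    // Remove every client that has been silent for more than maxMissed intervals
    // and return one leave event per removed client.
    public sweep(nowMs: number): MembershipEvent[] {
        const expired: string[] = [];
        this.lastSeen.forEach((seen, clientId) => {
            if (nowMs - seen > this.intervalMs * this.maxMissed) {
                expired.push(clientId);
            }
        });
        return expired.map((clientId) => {
            this.lastSeen.delete(clientId);
            const event: MembershipEvent = { type: "leave", clientId };
            return event;
        });
    }
}
```

Because a swept client is forgotten entirely, the next ping from it is indistinguishable from a first ping, so the same code path emits the second join.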
It's not clear to me how the API could be used and what the expectations are. Can you write documentation and tests to illustrate what can be built on top of this?
constructor(
    audience: IAudience,
    runtime: IFluidDataStoreRuntime,
    frequency: number = 30000) {
Do we intend to have every read-only client send a ping every 30 seconds? That will quickly add up to a lot of signals as more clients join the session.
I am wondering whether we are solving the right problem here. I believe your intent is to make Audience more reliable so that it can be relied on to pick a read-only client to do some work. But there are alternatives for solving that problem. Can we solve the problem more directly?
I wonder whether it would suffice if the client that is supposed to do the work sends a signal while it is doing the work, and the others validate it that way?
The key to validating it is understanding the error rate of signals, which can be measured.
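One possible way to measure the signal error rate mentioned above is to have the sender number its signals and have each receiver count the gaps. The sketch below assumes sequence numbers start at 1, increase by 1 per signal, and are never delivered twice; `SignalLossMeter` and its members are hypothetical names, not part of any existing API.

```typescript
// Hypothetical sketch: estimates loss from gaps in sender-assigned sequence numbers.
class SignalLossMeter {
    private received = 0; // number of signals actually seen
    private highest = 0;  // highest sequence number seen so far

    // Call once per received signal with the sender's sequence number.
    public onSignal(sequenceNumber: number): void {
        this.received++;
        if (sequenceNumber > this.highest) {
            this.highest = sequenceNumber;
        }
    }

    // Fraction of signals up to the highest seen number that never arrived.
    public get lossRate(): number {
        if (this.highest === 0) {
            return 0;
        }
        return (this.highest - this.received) / this.highest;
    }
}
```

Signals lost after the last one received are invisible to this estimate, so it is a lower bound until the next signal arrives.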
I am not yet sure about the frequency either, which is why I made it a parameter so that it stays configurable. After some load testing we may be able to determine what the default frequency should be.
To keep the load low I added enableHeartBeat and disableHeartBeat, so the moment there is a member in the quorum we can disable the heartbeat, and when there is no member in the quorum we can enable it again.
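The toggling described here could look roughly like the sketch below. Only the names `enableHeartBeat` and `disableHeartBeat` come from the comment; the class `HeartbeatToggle`, the `sendPing` callback, and `onQuorumSizeChanged` are assumptions made for illustration, and how the quorum size is obtained is left to the caller.

```typescript
// Hypothetical sketch; only enableHeartBeat/disableHeartBeat are named in the PR discussion.
class HeartbeatToggle {
    private timer: ReturnType<typeof setInterval> | undefined;

    constructor(
        private readonly sendPing: () => void, // assumed callback that submits the ping signal
        private readonly frequencyMs: number = 30000,
    ) {}

    public get enabled(): boolean {
        return this.timer !== undefined;
    }

    // Start periodic pings; calling it again while enabled does nothing.
    public enableHeartBeat(): void {
        if (this.timer === undefined) {
            this.timer = setInterval(this.sendPing, this.frequencyMs);
        }
    }

    // Stop periodic pings; calling it again while disabled does nothing.
    public disableHeartBeat(): void {
        if (this.timer !== undefined) {
            clearInterval(this.timer);
            this.timer = undefined;
        }
    }

    // Heartbeat runs only while the quorum is empty, as described in the comment.
    public onQuorumSizeChanged(quorumSize: number): void {
        if (quorumSize > 0) {
            this.disableHeartBeat();
        } else {
            this.enableHeartBeat();
        }
    }
}
```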
In general, we should avoid adding network traffic that scales with the number of clients. Do all the clients really need to know that every read client is present for your scenario? As I indicated, I believe you only need to make sure that the one you chose to do the work is alive. That's O(1) vs. O(n).
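The O(1) vs. O(n) difference can be made concrete with a small calculation. The helper `signalsSentPerMinute` is hypothetical, and it counts only signals submitted by senders, ignoring any fan-out when the service delivers them.

```typescript
// Hypothetical helper: number of heartbeat signals submitted per minute
// when `senders` clients each ping once every `frequencyMs` milliseconds.
function signalsSentPerMinute(senders: number, frequencyMs: number): number {
    return senders * (60000 / frequencyMs);
}

// With every one of 100 clients pinging at the 30000 ms default:
const allClients = signalsSentPerMinute(100, 30000); // 200
// With only the chosen client pinging:
const leaderOnly = signalsSentPerMinute(1, 30000); // 2
```

The first figure grows linearly with the session size, while the second stays constant no matter how many clients join.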
In this PR, I am trying to propose a generic mechanism to improve the reliability of Audience.
For a leader-only signal, I think we should handle it at the FFX level, as leader selection may vary.
We also need to avoid the multiple-leader problem as far as we can. For that, signalling to all clients will be needed. @curtisman please let us know if you see it differently.
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
Implementing heartbeat/ping-pong mechanism and get_clients signal for improving Audience reliability.