Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change elected summarizer when not summarizing for too long #6335

Merged
merged 18 commits into from
Jul 12, 2021

Conversation

arinwt
Copy link
Contributor

@arinwt arinwt commented Jun 2, 2021

Fixes #5222
Fixes #1791

The goal of the change is to have all clients unanimously and deterministically change their elected summarizer client whenever too many ops pass without a summary ack. In order to ensure eventual consistency here, the decision is purely based on inbound op processing: we use the sequence numbers of summary acks and other ops to figure it out.

The current behavior is that the oldest interactive client is responsible for summarizing. If 3000 inbound ops are seen since the last summary ack, that client will be marked as invalid, and the next oldest client will be elected. If no summary ack is seen since that client is elected, we use the first inbound op since they were elected as the baseline for the sequence number comparison.

Implementation Details

OrderedClientCollection - doubly linked list which tracks all the clients in the Quorum, sorted by their join sequence number.
OrderedClientElection - this is an adapter over the OrderedClientCollection which allows the caller to specify an isEligibleFn function to filter out non-interactive clients (or other criteria) and track a client as an "elected" client. It provides APIs to increment and reset the elected client. The rule is that the oldest eligible client is the elected client, and when increment is called the next oldest eligible client is elected instead (which means it should only be called deterministically for all clients to achieve eventual consistency). At any point, the elected client can be reset back to the oldest eligible client again, which should also be deterministically called.

The above 2 classes achieve something similar to the old leader election or task picking, but without requiring any additional ops or signals to be used. This is because the information is based on other ops already being passed around: join op sequence numbers, summaryAck sequence numbers, and other op sequence numbers.

ElectedSummarizerManager - this is a helper class used by SummaryManager to handle electing which client is responsible for generating summaries. It leverages the above OrderedClientElection to achieve this, and decides to reelect after some number of ops have passed since the last successful summary. If all clients are elected, it directly resets and goes back to electing the oldest client again.

Added tests for all 3 above classes.

@github-actions github-actions bot requested review from vladsud and curtisman June 2, 2021 18:34
@github-actions github-actions bot added the area: runtime Runtime related issues label Jun 2, 2021
@arinwt arinwt requested a review from anthony-murphy June 2, 2021 18:35
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Jun 2, 2021

@fluidframework/base-host: No change
Metric NameBaseline SizeCompare SizeSize Diff
main.js 178.64 KB 178.64 KB No change
Total Size 178.64 KB 178.64 KB No change
@fluid-example/bundle-size-tests: +1.63 KB
Metric NameBaseline SizeCompare SizeSize Diff
container.js 150.96 KB 152.59 KB +1.63 KB
map.js 48.69 KB 48.69 KB No change
matrix.js 147.33 KB 147.33 KB No change
odspDriver.js 197.23 KB 197.23 KB No change
odspPrefetchSnapshot.js 28.17 KB 28.17 KB No change
sharedString.js 165.21 KB 165.21 KB No change
Total Size 737.59 KB 739.22 KB +1.63 KB

Baseline commit: 9758a97

Generated by 🚫 dangerJS against c21dab5

@@ -200,20 +257,14 @@ export class SummaryManager extends EventEmitter implements IDisposable {
return { shouldSummarize: false, stopReason: "parentShouldNotSummarize" };
} else if (this.disposed) {
return { shouldSummarize: false, stopReason: "disposed" };
} else if (this.orderedClients.getSummarizerCount() > 0) {
Copy link
Contributor

@anthony-murphy anthony-murphy Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else if (this.orderedClients.getSummarizerCount() > 0) {


based on the comment i get why we need to remove shouldStart, but is the case handled elsewhere? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it that now refreshSummarizer will detect this state based on this.electedSummarizerManager.electedClientId, so while getShouldSummarizeState will return true, it won't be electect so it it won't run? if so, i wonder if we should rename getShouldSummarizeState to canSummarizeState, or supportsSumarizingState, as whether it should or not is really controlled by election

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you could move the newSummarizerClientId logic and event emitting from refreshSummarizer to getShouldSummarizeState

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just no longer wait for the summarizer client(s) to leave the quorum.

@anthony-murphy
Copy link
Contributor

    // computed summarizer client id

this comment is no longer accurate given the above


Refers to: packages/runtime/container-runtime/src/summaryManager.ts:275 in a46ce7a. [](commit_id = a46ce7a, deletion_comment = False)

@@ -320,7 +371,7 @@ export class SummaryManager extends EventEmitter implements IDisposable {

private tryRestart(): void {
const shouldSummarizeState = this.getShouldSummarizeState();
if (shouldSummarizeState.shouldSummarize && shouldSummarizeState.shouldStart) {
if (shouldSummarizeState.shouldSummarize) {
Copy link
Contributor

@anthony-murphy anthony-murphy Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(

this check is no longer complete, as it doesn't valid the client is actuall the client that should summarize. does that matter here? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldSummarize is only true if it is the client that should summarize (i.e. it is elected). So this still holds the same as before. Previously shouldStart could be false if waiting for the previous lingering summarizer client to leave the quorum.

@anthony-murphy
Copy link
Contributor

anthony-murphy commented Jun 10, 2021

Still reviewing, but i think we should wait and check this in after the 0.41 release, so it has the full amount of time to bake in our stress runs. also sorry it took sooooo long to review here, been having trouble finding a solid chuck of time to really dig in here


In reply to: 858809955


In reply to: 858809955

) { }
public getAllEligibleClients(): ITrackedClient[] {
return this.orderedClientCollection.getAllClients().filter(this.isEligibleFn);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if it would be easier to just keep a map in this class as well as it will also remove the need for this._eligibleCount. i tend to dislike any kind of ref counting, as it's easy to break

anthony-murphy
anthony-murphy previously approved these changes Jun 10, 2021
Copy link
Contributor

@anthony-murphy anthony-murphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I have some question comments, but no major concerns. Please don't merge until after 0.41 cuts to ensure this has time to bake in stress.

@@ -604,11 +606,13 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>
};
const chunks = await tryFetchBlob<[string, string[]][]>(chunksBlobName) ?? [];
const metadata = await tryFetchBlob<IContainerRuntimeMetadata>(metadataBlobName);
const electedSummarizer = await tryFetchBlob<ISerializedElection>(electedSummarizerBlobName);
Copy link
Contributor

@anthony-murphy anthony-murphy Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const electedSummarizer = await tryFetchBlob(electedSummarizerBlobName);


do we want a new blob here? would it make sense in metadata? is it fine due to aggregation? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. I've done it both ways for now (in a separate commit).. open to suggestions. @vladsud

@anthony-murphy anthony-murphy self-requested a review June 23, 2021 17:31
@anthony-murphy anthony-murphy dismissed their stale review June 23, 2021 17:39

lots of changes

@arinwt arinwt force-pushed the ordered-clients branch from f93f8fe to 7b36312 Compare June 30, 2021 13:23
this._hasAnySummarizersInQuorum = this.clientElection.getSummarizerCount() > 0;
}

public serialize(): ISerializedElection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole serialize/deserialize flow seems over complicated with deserialization decoupled in orderedClientElection and loadOrderedClients. can we just add a static create funciton to this class that takes the serialize form, and whatever else is needed, rather than spreading across muliple classes and funcitons

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I can simplify it, which will probably make it easier to understand, but the intention in creating & loading both the OrderedClientCollection and all OrderedClientElections in a separate function is that it ensures they are instantiated together. It is important that they are created synchronously together, and this gives that direction via code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to just call all the constructors plainly in ContainerRuntime to show the difference, but I think it makes more sense to have the separate load function. The comparison is easy to see by looking at only this commit: 2a1dfb2

Copy link
Contributor

@anthony-murphy anthony-murphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd still like the deserialization code cleaned up, but otherwise looks good.

@vladsud
Copy link
Contributor

vladsud commented Jul 8, 2021

Some high level thoughts:

  1. 3K ops is too small of a number. Please take a look at some of the recent numbers of TAP40 simulation results - we continuously hit 3K, and even 10K in this run. There are a lot of questions on workload being representative, proper feedback to client, and throttling being not balanced (i.e. ideally it happens earlier and not allow that many ops to be submitted, if we can't summarize it). BUT even with all these ifs and pending investigations, we should assume that hitting 3K limit in production will be routine, and switching summarizer at that boundary will create more pressure on SPO (more calls to load container, more posted summaries), and it will also increase chances of two summarizers working in parallel. So I feel like this kind of switch should be really action of last resort.
  2. Is this mostly to address client misbehaving, or service misbehaving? Have we considered it driven from service side, in a form of forceful disconnect of lingering summarizer?
    • Quorum errors (non-existing client in quorum) would create other problems (leader election), so I'm not sure we want to address each feature that relies on quorum individually.
  3. I worry about summarization process becoming more and more complex.

@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label Jul 9, 2021
@vladsud vladsud mentioned this pull request Jul 9, 2021
32 tasks
@arinwt arinwt merged commit 2302632 into microsoft:main Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spawn summarizer when elected client is not summarizing Protect from lingering inactive summarizers
4 participants