DRAFT: Async summarization of DDSs #8422

Closed
wants to merge 18 commits into from

scarlettjlee
Contributor

@scarlettjlee scarlettjlee commented Nov 27, 2021

See issue #7615

Split summarization of DDSs into two parts:

  1. synchronous method to capture state needed to summarize the DDS at this point in time
  2. asynchronous method to turn the captured state into a summary
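A minimal sketch of the two-part split described above, using a toy map rather than a real DDS. `ToyMap`, `CapturedState`, `captureSummaryState`, and `summarizeState` are hypothetical names chosen for illustration and are not the Fluid Framework API; the point is only that the synchronous step copies the state, so edits made while the asynchronous step is in flight do not leak into the summary.

```typescript
// Hypothetical names throughout; this is not the Fluid Framework API.
interface CapturedState {
    readonly entries: ReadonlyArray<readonly [string, string]>;
}

class ToyMap {
    private readonly data = new Map<string, string>();

    public set(key: string, value: string): void {
        this.data.set(key, value);
    }

    // Step 1 (synchronous): copy everything the summary will need, so that
    // later edits to the live map cannot affect it.
    public captureSummaryState(): CapturedState {
        const entries: Array<readonly [string, string]> = [];
        this.data.forEach((value, key) => entries.push([key, value]));
        return { entries };
    }

    // Step 2 (asynchronous): turn the captured copy into a summary string.
    // The await stands in for slow work such as compression or chunking.
    public async summarizeState(state: CapturedState): Promise<string> {
        await Promise.resolve();
        const sorted = state.entries
            .slice()
            .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
        return JSON.stringify(sorted);
    }
}
```

In this sketch a caller can capture, keep editing the map, and only later await the summary; the result reflects the map as it was at capture time.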

Questions for reviewers:

  1. Should the breaking change in common be staged over several versions? I'm not sure of the best way to do this when a method becomes async.
  2. Should GC also have two methods instead of a single one [to prevent accidentally making something async that should be synchronous]?
  3. Should I leave snapshotCore for a version or two before deprecating it, or remove it now?
  4. Is there anything I should make sure I test?

Not all the existing DDSs and test code are fixed up yet and this change will need more testing.

@scarlettjlee scarlettjlee added this to the December 2021 milestone Nov 27, 2021
@scarlettjlee scarlettjlee self-assigned this Nov 27, 2021
@github-actions github-actions bot added the area: dds, area: dds: sharedstring, area: definitions, area: examples, area: loader, area: runtime, and public api change labels Nov 27, 2021
@github-actions github-actions bot requested a review from ChumpChief November 27, 2021 08:26
// See: https://github.com/microsoft/FluidFramework/issues/4547
const serializer = new SummarySerializer(this.runtime.channelsRoutingContext);
- this.snapshotCore(serializer);
+ const capture = this.captureSummaryStateCore(serializer, false);
+ await this.summarizeStateCore(serializer, capture);
Contributor Author

Is it okay to prevent use of serializer during this async call?

Contributor

I'm not sure what you mean. But looking at this function, it feels like it does not care about reentrancy at all (same for getGCData), so maybe we remove _isGCing checks?

Contributor Author

I meant the serializer getter, which we don't want used during GC. My concern is that making this async might allow some other code [not related to GC process] to run in the meantime which legitimately calls the serializer getter.

@scarlettjlee scarlettjlee linked an issue Dec 9, 2021 that may be closed by this pull request
// See: https://github.com/microsoft/FluidFramework/issues/4547
const serializer = new SummarySerializer(this.runtime.channelsRoutingContext);
const state = this.captureStateCore();
await state.summarize(serializer, false);
Contributor

By the time SharedObject captured state, it's too late to give it a serializer. I think the right model is the following:

  • captureStateCore (or captureState, not sure there is need for both here) would internally create a serializer and capture snapshot with it, keeping serializer
  • The object returned has summarize() method that does not take serializer, and simply returns a summary.
  • There is also another method, called getGCData(), that would use internally stored serializer to spit it out.

That way serializer is left to be implementation detail, and API is relatively high level:

  • get a revision (aka state)
  • ask revision to generate summary.
  • ask revision to generate GC info
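The model proposed above can be sketched roughly as follows. Everything here is a stand-in: `ToySerializer`, `ToyRevision`, and `ToyChannel` are hypothetical, and the real `SummarySerializer` and `IChannelRevision` in the PR may look quite different. The sketch only illustrates the shape of the idea, in which the serializer is created during capture and stays an implementation detail of the returned revision.

```typescript
// Hypothetical stand-ins; the real SummarySerializer and IChannelRevision differ.
class ToySerializer {
    private readonly routes = new Set<string>();

    // Serializes a value and records every handle route it encounters.
    public stringify(value: { text: string; handles: string[] }): string {
        value.handles.forEach((route) => this.routes.add(route));
        return JSON.stringify(value);
    }

    public getSerializedRoutes(): string[] {
        const result: string[] = [];
        this.routes.forEach((route) => result.push(route));
        return result.sort();
    }
}

interface ToyRevision {
    summarize(): Promise<string>; // takes no serializer
    getGCData(): string[]; // uses the serializer stored at capture time
}

class ToyChannel {
    public text = "";
    public handles: string[] = [];

    // Synchronous capture: the serializer is created and used here, then kept
    // privately by the revision object that is returned.
    public captureRevision(): ToyRevision {
        const serializer = new ToySerializer();
        const snapshot = serializer.stringify({
            text: this.text,
            handles: this.handles.slice(),
        });
        return {
            summarize: async () => snapshot,
            getGCData: () => serializer.getSerializedRoutes(),
        };
    }
}
```

Callers of this sketch never see the serializer: they get a revision, ask it for a summary, and ask it for GC data.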

Contributor Author

As @agarwal-navin comments, running GC can change what should be in the summary. Given that the potential change is only to say whether something is referenced or not, it seems possible to just fix that up in the IChannelRevision. That feels like a strange thing to do though.

@github-actions github-actions bot requested a review from vladsud December 16, 2021 02:29
/**
* {@inheritDoc (ISharedObject:interface).captureRevision}
*/
public captureRevision(): IChannelRevision {
Contributor

I see a couple of problems with having this API instead of separate APIs for summarize and getGCData:

  1. How will a DDS have custom implementation for one of these APIs? For example, SharedMatrix and SharedString work on a different set of data for generating GC state vs summary state. (Btw, this change will break them).
  2. The state returned by summarize is affected by GC. When GC runs on the data returned by getGCData, it may update the "reference" state of certain nodes in the system. This information needs to be encoded in the summary. Additionally, some nodes which did not need summarization before (because its data did not change) may need to be summarized because its "reference" state changed.
    This is not really a problem at this layer yet but as soon as we have sub-DDS GC, this will break. Also, IFluidDataStoreChannel's shape will also become similar to this and it will run into this problem now.

Contributor

The main challenge with point 2 above is that we don't fully know whether we need to re-summarize a node or not until GC has run.

Contributor Author

  1. I'm not sure if my changes for sequence and matrix are the right thing. For DDSs that want to do something custom, they should derive from SharedObjectCore and fully implement captureRevision themselves. They'd still need to capture all the state necessary for both summaries and GC.
  2. Is it possible to update the captured state and generate another summary off that? or does it need to happen on the sharedObject?

Contributor

I was asking myself that question (that Navin raised) - can summary and gc data come from same state or not, sounds not (sorry I forgot it). I think there are two ways to address it:

  • Leave API as is but capture revision twice (at different times).
  • Leave GC API the same, while revision to support only summary API.

It's possible that the answer will differ by layer. I.e., the general structure should likely allow flexibility, and thus we might want to leave the GC API as is (including at the SharedObjectCore layer). But if all DDSs except Whiteboard's can always generate GC data from the summary, then I'd argue we want to build the code so that this duplication of logic happens in a single place (SharedObject), by having an implementation of IChannelRevision here that can extract GC data from it (i.e., a concrete implementation of IChannelRevision that captures and exposes GC data). That way, SharedObject.getGCData() can just do this.captureRevision().getGcData().

Contributor

I don't think we can implement it the first way you mentioned. The reason is that the summary state of a node may then carry a reference state (whether it's unreferenced or not) that is based on an older summary state. Basically, the GC state can be from seq#100 whereas the summary state can be from seq#200. We cannot mark a node at seq#200 as referenced / unreferenced based on its data from seq#100; that would be inconsistent.

Both the summary and GC state in a summary should be off the same point in time (seq#).
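To make the constraint concrete, here is a small sketch of a guard that refuses to combine summary state and GC state taken at different sequence numbers. `SummaryState`, `GCState`, and `applyReferenceState` are hypothetical names for illustration only and do not correspond to types in the Fluid Framework.

```typescript
// Hypothetical types for illustration; not Fluid Framework types.
interface SummaryState {
    sequenceNumber: number;
    nodes: string[];
}

interface GCState {
    sequenceNumber: number;
    unreferenced: string[];
}

interface SummarizedNode {
    id: string;
    unreferenced: boolean;
}

// Encodes the reference state of each node into the summary, but only when
// both inputs describe the same point in time (the same seq#).
function applyReferenceState(summary: SummaryState, gc: GCState): SummarizedNode[] {
    if (summary.sequenceNumber !== gc.sequenceNumber) {
        throw new Error(
            `GC state (seq#${gc.sequenceNumber}) does not match summary state (seq#${summary.sequenceNumber})`,
        );
    }
    return summary.nodes.map((id) => ({
        id,
        unreferenced: gc.unreferenced.indexOf(id) !== -1,
    }));
}
```

With GC state from seq#100 and summary state from seq#200, this sketch throws instead of producing an inconsistent result.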

Contributor

I'm not sure how to interpret your comment :) DDS just exposes ability to capture these states, at any moment in time. External process governs when these methods are called - for same state or not. DDS itself does not capture any additional state, so there is no differences here. The only thing that differs is how internally it captures GC - if all SharedObject derived classes extract GC data in exactly same way (by essentially summarizing themselves), then there is no need to duplicate this code - SharedObject base class can help here.

Contributor

I misunderstood this change. I was under the impression that we are also changing the way summarization happens in the regular case in summarizer client.

@@ -367,16 +367,17 @@ export class DataStores implements IDisposable {
return summaryBuilder.getSummaryTree();
}

- public createSummary(): ISummaryTreeWithStats {
+ public async createSummary(): Promise<ISummaryTreeWithStats> {
Contributor

Same comment as above: it's possible to ensure (by code inspection) that this function returns the state of the container at the time the createSummary() call started. But we will not be able to sustain that over time, even with UTs. I believe that if we go that route, we have to commit to (and quickly implement as a follow-up) the same flow you are implementing for DDSs here, but for the whole container: a synchronous "capture revision" step and an async "generate snapshot out of it" step.

@@ -829,13 +829,13 @@ export class LocalFluidDataStoreContextBase extends FluidDataStoreContext {
});
}

- public generateAttachMessage(): IAttachMessage {
+ public async generateAttachMessage(): Promise<IAttachMessage> {
Contributor

This API existed for two reasons:

  1. synchronous nature of workflows that depended on it
  2. some subtle differences in content returned from async summarize methods.
    Given that # 1 is gone in this PR, it would be great to have a follow-up issue to examine # 2 and see if we can substantially reduce code duplication, i.e., make this logic rely as much as possible on the summarization flow, adding only the subtle differences (if any) that attachment workflows require.

@@ -778,7 +778,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i
if (!hasAttachmentBlobs) {
// Get the document state post attach - possibly can just call attach but we need to change the
// semantics around what the attach means as far as async code goes.
- const appSummary: ISummaryTree = this.context.createSummary();
+ const appSummary: ISummaryTree = await this.context.createSummary();
Contributor

Here is an example of the problem that I worry about when making createSummary() flow async. All capturing of state (aka revision) has to happen synchronously as otherwise we may not capture consistent state (i.e., parts of state would come from different points in time).
Here, we await container runtime summary creation, but capture protocol state and switch the container to the attaching state after that. That means the protocol state may change while we are blocked on the await. Similarly, any changes to the container while we await would not generate ops, because the container is not switched to the attaching state early enough; those changes would make it into neither the snapshot nor the ops that follow.

The only right way here is to ensure that all state is captured before any async activity is awaited, ideally through a 2-step process: get the revision synchronously, then generate the snapshot from the revision asynchronously. As a temporary solution, we can do code inspection + UTs to get something working quickly (fixing any cases like this one by ensuring that the promise is captured early but the await is pushed to line 793). But this can't be a long-term solution.
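The hazard and the temporary fix can be illustrated with a toy container. `ToyContainer`, `attachUnsafe`, and `attachSafe` are hypothetical and greatly simplified; the real Container attach flow has far more state. The sketch relies on one language fact: an async function runs synchronously up to its first `await`, so calling it without awaiting still performs its synchronous capture immediately.

```typescript
// Hypothetical, heavily simplified stand-in for the attach flow.
class ToyContainer {
    public protocolState = "v1";
    public appState = "a1";

    // Stand-in for an async createSummary(): the app state is read
    // synchronously, before the first await.
    private async createSummary(): Promise<string> {
        const captured = this.appState;
        await Promise.resolve(); // stands in for slow async work
        return captured;
    }

    // Hazard: protocol state is read only after the await, so it can come
    // from a later point in time than the app state.
    public async attachUnsafe(): Promise<string> {
        const app = await this.createSummary();
        const protocol = this.protocolState;
        return `${app}+${protocol}`;
    }

    // Temporary fix: start the summary (capturing app state), capture the
    // protocol state in the same synchronous turn, and only then await.
    public async attachSafe(): Promise<string> {
        const appPromise = this.createSummary();
        const protocol = this.protocolState;
        const app = await appPromise;
        return `${app}+${protocol}`;
    }
}
```

If the protocol state changes right after the call is issued, `attachUnsafe` mixes old app state with new protocol state, while `attachSafe` returns both from the same moment.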

@@ -829,7 +829,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i
}

// take summary and upload
- const appSummary: ISummaryTree = this.context.createSummary(redirectTable);
+ const appSummary: ISummaryTree = await this.context.createSummary(redirectTable);
Contributor

Same problem here - await needs to be after this.emit("attached"). I'm not even sure code can handle it correctly today (i.e. async inflight summary process while we emit attached event) - maybe, but I'd need to do deeper code inspection to be sure, as we never have had such state.

if (blobRedirectTable) {
this.blobManager.setRedirectTable(blobRedirectTable);
}

- const summarizeResult = this.dataStores.createSummary();
+ const summarizeResult = await this.dataStores.createSummary();
Contributor

Similar problem here (requiring changing existing flows / APIs) - addContainerBlobsToSummary() below should happen before await here, as we need to do all "initiation" of data collection synchronously in one go, to make sure it represents state at the same point in time.

assert(this.bindState === BindState.NotBound, 0x13b /* "datastore context is already in bound state" */);
this.bindState = BindState.Binding;
assert(this.channel !== undefined, 0x13c /* "undefined channel on datastore context" */);
- bindChannel(this.channel);
+ await bindChannel(this.channel);
Contributor

Similar to other places, it's extremely hard to say whether the conversion to async here would open a Pandora's box of bugs.
Given that this code is used in synchronous workflows (e.g., storing a handle as a value in a map, which causes attachment of the data store / DDS backing that handle, and which previously was a fully synchronous process), it's hard to believe that the async parts of this process can happen later, when the user has a chance to continue modifying the objects we are trying to snapshot. As in other places, we need to ensure we capture enough state (revisions) synchronously to represent the state of the universe before allowing async processes to run off that revision.

Contributor Author

I don't understand the full implications of this. Does it mean that it will be very hard to get things right? Or we should enforce that some things remain synchronous?

const summarizeResult = channel.summarize(fullTree, trackState);
): Promise<ISummaryTreeWithStats> {
const state = channel.captureSummaryState(fullTree);
const summarizeResult = await channel.summarizeState(state);
Contributor

I'm a bit at a loss here.
You have added IChannel.captureRevision().summarize() flow.
Why is there some other captureSummaryState() flow? I do not see IChannel having captureSummaryState()
Is this simply non-updated code from prior revisions?

Contributor Author

Yes, non-updated. The last couple commits were just to get feedback on the change to SharedObject before changing it everywhere.

@vladsud
Contributor

vladsud commented Dec 17, 2021

Looking deeper at the set of changes, I think you will not be able to escape from making 2-step process (capture revision, generate summary) across all layers.
If you really want to avoid it in single PR, I think you will have to leave dual workflows (synchronous, and async) across all layers, and simply introduce async flow (the way you are doing) to DDS (while leaving sync flow completely intact).
It is just too easy to make mistakes and capture state from different points in time, and it will take substantial effort to write all the UTs required to be confident that such a massive change across all layers does not regress anything.

@vladsud
Contributor

vladsud commented Dec 17, 2021

A thought worth sharing RE createSummary() flow: I think there is an incremental path how to convert everything to single flow.
What if we leave this PR for now and start another one (that would be the pre-game for this one)? It would change createSummary() to return an IChannelRevision instead of the current summary. Under the covers, it can capture a snapshot; IChannelRevision would have just one function (possibly even synchronous, to keep the changes incremental) to expose that snapshot. That way we can rather easily (with only one complication: back compat) change all existing layers to work with IChannelRevision with no other semantic or behavior changes. Then we can take further steps like

  • async summary on IChannelRevision - will require careful code inspection to make sure all awaits are after ALL state has been captured (some of the comments in this PR)
  • introduction of SharedObjectCore and allowing truly async behavior where it matters.

RE back compat: it's hard as we require back compat across two boundaries - Loader (Container + ContainerContext) to IRuntime (ContainerRuntime) and container runtime (IContainerRuntimeBase) to data store (IFluidDataStoreChannel). We will need parallel system (i.e. support old functions with old semantics while having new flow). Code can likely be reused but shape of API will need to support both for a while. Thus we would need createSummary2 in some format (i.e. can't just reuse existing name, unless there is something else telling each layer how to think about shape of this API)
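One way to picture the parallel system described above is a caller that prefers the new flow when a layer offers it and falls back to the old function otherwise. This is only a sketch under assumptions: `LegacyRuntime`, `RevisionRuntime`, `ToySummaryRevision`, and the shape of `createSummary2` are hypothetical, and the comment above does not settle how the new API would actually be named or detected.

```typescript
// Hypothetical interfaces; the real layer boundaries are much richer.
interface LegacyRuntime {
    createSummary(): string; // old function with old semantics
}

interface ToySummaryRevision {
    getSummary(): string;
}

interface RevisionRuntime extends LegacyRuntime {
    createSummary2(): ToySummaryRevision; // new flow, assumed name
}

// Feature detection across a version boundary: the other side may be older
// code that has never heard of createSummary2.
function hasRevisionFlow(runtime: LegacyRuntime): runtime is RevisionRuntime {
    return typeof (runtime as Partial<RevisionRuntime>).createSummary2 === "function";
}

function summarizeForAttach(runtime: LegacyRuntime): string {
    return hasRevisionFlow(runtime)
        ? runtime.createSummary2().getSummary()
        : runtime.createSummary();
}
```

The cost of this approach, as noted above, is that both API shapes have to be supported for a while even if the underlying code is shared.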

@agarwal-navin
Contributor

I have a question about the first step, i.e., synchronously getting summary state of a node, in the proposed 2-step summarization process:
How will this work at layers such as data store context and channel context with delay loading? For example, if a data store has not yet been realized during summary and it has changed, we need to realize it first and so we cannot capture its summary state synchronously.

@agarwal-navin
Contributor

This will work for the summary generated for detached container via createSummary but not for the regular summarization flow.

@vladsud
Contributor

vladsud commented Dec 17, 2021

@agarwal-navin, that's correct, today's captureSummary does not work for those cases. New flow will have exactly same limitations.
I think you are asking how we extend it to be the only summary flow in container, and make it work across all those different states that have different assumptions. I think the only answer - very slowly, and only when work in layers underneath layers is done to support it, and only if/when it happens. I.e. nothing prevents us from writing a bunch of code that will capture revision of current state for data store at any state. This is likely tremendous amount of work as we would need to duplicate current state / tree completely to isolate it from "live" state. Or bake immutability across layers. But when / if this is done, we can return revision in some form and work on it.

Note that we do not have to do it. We can continue to have two flows - async summary (as today) and sync capture revision, with revision flow working only where captureSummary() works today. It will still give us a boost in the form of leaf nodes (DDS) being able to switch to 2-step approach when generating summaries (i.e. DDSs will have only one flow to generate summary - 2-step process through revisions). This will solve Whiteboard problem, while maintaining correctness of the system. It's correctness that I'm after, plus at least one tiny place (DDSs) where we have only one way of doing summaries (not two as we have at data store layer), with possible long future capability of having one way of doing summaries across all layers.

@agarwal-navin
Contributor

Okay, I think I misunderstood this change. This is only changing the synchronous captureSummary for detached container by making it async. It is not changing the regular summarization flow that happens in the summarizer client. And we are looking for ways to make the DDS implement a single way of generating the summary for both the above flows.

@vladsud
Contributor

vladsud commented Dec 17, 2021

So I feel like PRs need to flow in this order:

  1. Create SharedObject & SharedObjectCore. SharedObject will implement getGCData() through summary flow.
    • (a) this assumes that most DDSs can leverage that. Those that can't need to use SharedObjectCore and have full flexibility as today (sounds like SharedMatrix & Sequence fall into this bucket).
    • (b) This step is optional - we can skip it if we believe it does not provide that much value. The goal here (after step 3) is to simplify most common case - most DDSs would care only about one thing providing summary method and that's it.
  2. Introduce async summary flow to DDS (IChannel), but leave sync flow as is
    • (a) FluidDataStoreRuntime.summarize() will call into IChannel.summarizeAsync()
    • (b) Do not touch any other layer. captureSummary() / getAttachSummary do not change and rely on synchronous IChannel.summarize().
    • (c) Whiteboard (DDSs based on SharedObjectCore) has to deal with two flows (async & sync) for now but gets ability to do stuff asynchronously (when not attached). They can implement both sync & async methods however they want
    • (d) async flow for SharedObject is implemented through sync flow (i.e., all non-Whiteboard DDSs implement just one sync summarize() method). All derived from SharedObject classes care only about implementing one sync summarize() method (same for steps 3-4).
  3. Introduce capturing revisions for all layers. Revision exposes only async summary method. The purpose of this step is to start conversion on single path across layers, but in the meantime - be able to do so at DDS level
    • (a) For now, only works in detached state across all layers (container runtime, data store, DDS), plus works for DDS in any other state that has no local changes (i.e. called by summarizer).
    • (b) Allows to remove 2c limitation (IChannel will have only captureRevision() and getGCData(), async summary method moves to revision), but keep two summary flows at higher levels.
    • (c) SharedObject can expose getGCData() on the concrete revision object it creates, by capturing the serializer while capturing the summary/revision. Nothing about the flow of control changes; we simply maintain simpler requirements for SharedObject-derived classes: they only need to implement a single method, the synchronous summarize() method. All other flows (revision capture / async summary, GC capture) are implemented by SharedObject by calling the synchronous summarize() method implemented by each concrete class.
    • (d) No changes RE getGCData() for any other layer / interface.
  4. Rework layers to be able to capture revisions at any moment in time (attached or not). Allows to unify summarization process across all layers.
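Step 2(d) in the list above, where the async flow of SharedObject is implemented through the sync flow, might look roughly like this. The class names are prefixed with `Toy` because they are hypothetical simplifications; in particular `summarizeCore` is an assumed name for the single synchronous method that derived classes would implement.

```typescript
// Hypothetical simplification of the SharedObjectCore / SharedObject split.
abstract class ToySharedObjectCore {
    // DDSs deriving directly from the core class implement both flows
    // however they want.
    public abstract summarize(): string;
    public abstract summarizeAsync(): Promise<string>;
}

abstract class ToySharedObject extends ToySharedObjectCore {
    // Derived classes implement only this one synchronous method.
    protected abstract summarizeCore(): string;

    public summarize(): string {
        return this.summarizeCore();
    }

    // The async flow is built on the sync flow: the body runs synchronously
    // when called, so the summary reflects state at call time.
    public async summarizeAsync(): Promise<string> {
        return this.summarizeCore();
    }
}

class ToyCounter extends ToySharedObject {
    public value = 0;

    protected summarizeCore(): string {
        return JSON.stringify({ value: this.value });
    }
}
```

A DDS with genuinely asynchronous needs would derive from the core class instead and supply its own `summarizeAsync`.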

# 2 is rather simple and immediately unblocks Whiteboard (even though in an ugly / hard to deal / explain way)
# 3 is not that hard, but back-compat requirements make it more problematic
# 4 is likely a ton of work. We would need to estimate it and make decision if it's worth it.

Note that # 3 gives us tool that can be used in other flows (and can be improved over time). I.e., ability to capture revision at DDS level is very powerful for all kinds of things (would be useful today for Word's transactions request). It also simplifies layer that 3rd party developers are most likely to implement, so while not having # 4 is bad for other layers, it simplifies where it matters the most. That said # 4 is what makes revisions really powerful :)

One possible pre-work: We need to be consistent on naming, and maybe that's PR # 0.
captureSummary() / getAttachSummary / async summarize() / sync summarize() - too many names with similar but different semantics. I think we can survive with fewer names, and names that clearly tell if it's sync or async method (at least when it comes to "summarize" naming)

@agarwal-navin
Contributor

That sounds like a good plan. Breaking this down into stages will definitely make it simpler to identify and catch issues.

@scarlettjlee scarlettjlee mentioned this pull request Dec 20, 2021
@scarlettjlee
Contributor Author

This may be reopened later. For now, it's replaced by #8592.


Successfully merging this pull request may close these issues.

DDS Async Summarization
4 participants