DRAFT: Async summarization of DDSs #8422

scarlettjlee · 2021-11-27T08:25:54Z

See issue #7615

Split summarization of DDSs into two parts:

synchronous method to capture state needed to summarize the DDS at this point in time
asynchronous method to turn the captured state into a summary
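
A minimal sketch of this split, assuming a hypothetical counter-like DDS. `CounterDds`, `captureState`, and `summarizeState` are illustrative names, not the methods introduced in this PR. The point is only that the async step reads the captured copy and never the live object.

```typescript
// Hypothetical illustration only: none of these names come from Fluid Framework.
interface CapturedState {
    readonly value: number;
    readonly capturedAtSeq: number;
}

class CounterDds {
    private value = 0;
    private seq = 0;

    public increment(): void {
        this.value += 1;
        this.seq += 1;
    }

    // Step 1 (synchronous): copy everything the summary will need, right now.
    public captureState(): CapturedState {
        return Object.freeze({ value: this.value, capturedAtSeq: this.seq });
    }

    // Step 2 (asynchronous): reads only the captured copy, never live state.
    public async summarizeState(state: CapturedState): Promise<string> {
        await Promise.resolve(); // stands in for slow work such as compression
        return JSON.stringify(state);
    }
}

const dds = new CounterDds();
dds.increment();
const captured = dds.captureState();
const summaryPromise = dds.summarizeState(captured);
dds.increment(); // a later edit must not leak into the in-flight summary
summaryPromise.then((summary) => console.log(summary));
```

Because the capture is a frozen copy, the second `increment()` cannot affect the summary that is already in flight.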

Questions for reviewers:

Should the breaking change in common be staged over several versions? I'm not sure of the best way to do this when a method becomes async.
Should GC also have two methods instead of a single one [to prevent accidentally making something async that should be synchronous]?
Should I leave snapshotCore for a version or two before deprecating? or remove it now?
Is there anything I should make sure I test?

Not all the existing DDSs and test code are fixed up yet and this change will need more testing.

packages/dds/shared-object-base/src/sharedObject.ts

scarlettjlee · 2021-11-27T18:25:50Z

packages/dds/shared-object-base/src/sharedObject.ts

        // See: https://github.com/microsoft/FluidFramework/issues/4547
        const serializer = new SummarySerializer(this.runtime.channelsRoutingContext);
-        this.snapshotCore(serializer);
+        const capture = this.captureSummaryStateCore(serializer, false);
+        await this.summarizeStateCore(serializer, capture);


Is it okay to prevent use of serializer during this async call?

I'm not sure what you mean. But looking at this function, it feels like it does not care about reentrancy at all (same for getGCData), so maybe we remove _isGCing checks?

I meant the serializer getter, which we don't want used during GC. My concern is that making this async might allow some other code [not related to GC process] to run in the meantime which legitimately calls the serializer getter.
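
To make the concern concrete, here is a small sketch under assumed names (`GuardedObject` and its members are hypothetical, not the real SharedObject). Once the guarded section spans an `await`, unrelated code that runs in the gap trips the guard.

```typescript
// Hypothetical sketch; class and member names are illustrative, not Fluid's.
class GuardedObject {
    private isGCing = false;

    // Mirrors the concern: the getter is meant to be off limits during GC.
    public get serializer(): string {
        if (this.isGCing) {
            throw new Error("serializer must not be used while GC is running");
        }
        return "serializer";
    }

    public async getGCData(): Promise<string[]> {
        this.isGCing = true;
        try {
            // While this await is pending, any other code may run.
            await Promise.resolve();
            return ["/route"];
        } finally {
            this.isGCing = false;
        }
    }
}

const guarded = new GuardedObject();
const gcPromise = guarded.getGCData(); // runs synchronously up to the await

let unrelatedCallerFailed = false;
try {
    // Unrelated code that legitimately wants the serializer now trips the guard.
    void guarded.serializer;
} catch (error) {
    unrelatedCallerFailed = true;
}

// Once GC has finished, the getter works again.
gcPromise.then(() => console.log(guarded.serializer));
```

With a fully synchronous `getGCData()`, no other code could run between setting and clearing the flag, so the guard could only ever catch misuse from inside the GC pass itself.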

vladsud · 2021-12-13T23:52:35Z

packages/dds/shared-object-base/src/sharedObjectBase.ts

+        // See: https://github.com/microsoft/FluidFramework/issues/4547
+        const serializer = new SummarySerializer(this.runtime.channelsRoutingContext);
+        const state = this.captureStateCore();
+        await state.summarize(serializer, false);


By the time SharedObject captured state, it's too late to give it a serializer. I think the right model is the following:

captureStateCore (or captureState, not sure there is need for both here) would internally create a serializer and capture snapshot with it, keeping serializer

The object returned has summarize() method that does not take serializer, and simply returns a summary.

There is also another method, called getGCData(), that would use internally stored serializer to spit it out.

That way serializer is left to be implementation detail, and API is relatively high level:

get a revision (aka state)

ask revision to generate summary.

ask revision to generate GC info
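
A rough sketch of that model, with assumptions labeled: only the name `IChannelRevision` appears in this thread. Its members here, `RecordingSerializer`, and `MapLikeObject` are hypothetical stand-ins. The serializer is created inside the capture step and stays an implementation detail of the revision.

```typescript
// Hypothetical shapes; only the name IChannelRevision appears in this thread.
interface IChannelRevision {
    summarize(): Promise<string>;
    getGCData(): string[];
}

// Stand-in serializer that records every handle route it encodes.
class RecordingSerializer {
    public readonly routes: string[] = [];

    public stringify(value: unknown): string {
        return JSON.stringify(value, (_key, v) => {
            if (typeof v === "object" && v !== null && "route" in v) {
                this.routes.push((v as { route: string }).route);
            }
            return v;
        });
    }
}

class MapLikeObject {
    public readonly data = new Map<string, unknown>();

    // Synchronous: creates the serializer internally and keeps it in the revision.
    public captureRevision(): IChannelRevision {
        const serializer = new RecordingSerializer();
        const content = serializer.stringify(Array.from(this.data.entries()));
        return {
            summarize: async () => content,
            getGCData: () => serializer.routes.slice(),
        };
    }
}

const mapLike = new MapLikeObject();
mapLike.data.set("child", { route: "/dataStore/child" });
const revision = mapLike.captureRevision();
mapLike.data.set("later", { route: "/dataStore/later" }); // not part of the revision
```

Callers never see a serializer: they get a revision, then ask it for a summary or for GC data.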

As @agarwal-navin comments, running GC can change what should be in the summary. Given that the potential change is only to say whether something is referenced or not, it seems possible to just fix that up in the IChannelRevision. That feels like a strange thing to do though.

agarwal-navin · 2021-12-16T17:14:18Z

packages/dds/shared-object-base/src/sharedObject.ts

+    /**
+     * {@inheritDoc (ISharedObject:interface).captureRevision}
+     */
+    public captureRevision(): IChannelRevision {


I see a couple of problems with having this API instead of separate APIs for summarize and getGCData:

1. How will a DDS provide a custom implementation for one of these APIs? For example, SharedMatrix and SharedString work on a different set of data for generating GC state vs summary state. (Btw, this change will break them).

2. The state returned by summarize is affected by GC. When GC runs on the data returned by getGCData, it may update the "reference" state of certain nodes in the system. This information needs to be encoded in the summary. Additionally, some nodes which did not need summarization before (because their data did not change) may need to be summarized because their "reference" state changed.
This is not really a problem at this layer yet, but as soon as we have sub-DDS GC, this will break. Also, IFluidDataStoreChannel's shape will become similar to this and it will run into this problem now.

The main challenge with point 2 above is that we don't fully know whether we need to re-summarize a node or not until GC has run.

I'm not sure if my changes for sequence and matrix are the right thing. For DDSs that want to do something custom, they should derive from SharedObjectCore and fully implement captureRevision themselves. They'd still need to capture all the state necessary for both summaries and GC.

Is it possible to update the captured state and generate another summary off that? or does it need to happen on the sharedObject?

I was asking myself that question (that Navin raised) - can summary and gc data come from same state or not, sounds not (sorry I forgot it). I think there are two ways to address it:

Leave API as is but capture revision twice (at different times).

Leave GC API the same, while revision to support only summary API.

It's possible that the answer will differ by layer. I.e. the general structure should likely allow flexibility, and thus we might want to leave the GC API as is (including on the SharedObjectCore layer). But if all but Whiteboard DDSs can always generate GC data from the summary, then I'd argue we want to build code in a way where this duplication of logic happens in a single place (SharedObject), by having an implementation of IChannelRevision here that can extract GC data out of it (i.e. have a concrete implementation of IChannelRevision that captures and exposes GC data). That way, SharedObject.getGCData() can just do this.captureRevision().getGcData().

I don't think we can implement it via the first way you mentioned. The reason is that the summary state of a node could then carry a reference state (whether it is unreferenced or not) derived from an older point in time. Basically, GC state can be from seq#100 whereas summary state can be from seq#200. We cannot mark a node at seq#200 referenced / unreferenced based on its data from seq#100 - that would be inconsistent.

Both the summary and GC state in a summary should be off the same point in time (seq#).
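
The constraint can be written as a guard. This is a sketch with hypothetical types (`SummaryState`, `GCState`, and `applyReferenceState` are not Fluid APIs). It only illustrates refusing to mix state from two different sequence numbers.

```typescript
// Hypothetical types for illustration; not Fluid Framework APIs.
interface SummaryState {
    seq: number;
    unreferenced: boolean;
    blob: string;
}

interface GCState {
    seq: number;
    referencedRoutes: string[];
}

// Stamps the reference state onto a node's summary, but only when both
// inputs were captured at the same sequence number.
function applyReferenceState(route: string, summary: SummaryState, gc: GCState): SummaryState {
    if (summary.seq !== gc.seq) {
        throw new Error(`GC state is from seq#${gc.seq} but summary state is from seq#${summary.seq}`);
    }
    return { ...summary, unreferenced: gc.referencedRoutes.indexOf(route) === -1 };
}

const summaryAt200: SummaryState = { seq: 200, unreferenced: false, blob: "data" };
const gcAt200: GCState = { seq: 200, referencedRoutes: ["/a"] };
const gcAt100: GCState = { seq: 100, referencedRoutes: ["/a", "/b"] };

// "/b" is no longer referenced at seq#200, so it gets marked unreferenced.
const marked = applyReferenceState("/b", summaryAt200, gcAt200);
```

Passing `gcAt100` instead would throw, which is the seq#100 vs seq#200 inconsistency described above.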

I'm not sure how to interpret your comment :) DDS just exposes ability to capture these states, at any moment in time. External process governs when these methods are called - for same state or not. DDS itself does not capture any additional state, so there is no differences here. The only thing that differs is how internally it captures GC - if all SharedObject derived classes extract GC data in exactly same way (by essentially summarizing themselves), then there is no need to duplicate this code - SharedObject base class can help here.

I misunderstood this change. I was under the impression that we are also changing the way summarization happens in the regular case in summarizer client.

vladsud · 2021-12-17T05:41:11Z

packages/runtime/container-runtime/src/dataStores.ts

@@ -367,16 +367,17 @@ export class DataStores implements IDisposable {
        return summaryBuilder.getSummaryTree();
    }

-    public createSummary(): ISummaryTreeWithStats {
+    public async createSummary(): Promise<ISummaryTreeWithStats> {


Same comment as above: It's possible to ensure (by code inspection) that this function returns the state of the container at the time when the createSummary() call started. But we will not be able to sustain that over time, even with UTs. I believe if we go that route, we have to commit to (and quickly implement as a follow-up) the same flow you are implementing for DDS here, i.e., for the whole container, with a synchronous "capture revision" and an async "generate snapshot out of it".

vladsud · 2021-12-17T05:43:36Z

packages/runtime/container-runtime/src/dataStoreContext.ts

@@ -829,13 +829,13 @@ export class LocalFluidDataStoreContextBase extends FluidDataStoreContext {
        });
    }

-    public generateAttachMessage(): IAttachMessage {
+    public async generateAttachMessage(): Promise<IAttachMessage> {


This API existed for two reasons:

1. synchronous nature of workflows that depended on it

2. some subtle differences in content returned from async summarize methods.

Given that # 1 is gone in this PR, it would be great to have a follow-up issue to examine # 2 and see if we can substantially reduce code duplication, i.e., make this logic rely as much as possible on the summarization flow, adding only the subtle differences (if any) that attachment workflows require.

vladsud · 2021-12-17T05:52:44Z

packages/loader/container-loader/src/container.ts

@@ -778,7 +778,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i
            if (!hasAttachmentBlobs) {
                // Get the document state post attach - possibly can just call attach but we need to change the
                // semantics around what the attach means as far as async code goes.
-                const appSummary: ISummaryTree = this.context.createSummary();
+                const appSummary: ISummaryTree = await this.context.createSummary();


Here is an example of the problem that I worry about when making the createSummary() flow async. All capturing of state (aka revision) has to happen synchronously, as otherwise we may not capture consistent state (i.e., parts of the state would come from different points in time).
Here, we await container runtime summary creation, but capture protocol state and switch the container to attaching state after that. That means that protocol state may change while we are blocked on await. Similarly, any changes to the container while we await would not generate ops, because the container is not switched to attaching state early enough, so those changes would make it into neither the snapshot nor the ops that follow.

The only right way here is to ensure that all state is captured before any async activity is awaited. Ideally through a 2-step process - get revision synchronously, generate snapshot from revision async. As a temp solution, we can do code inspection + UTs to get something working quickly (and fix any cases like this one by ensuring that the promise is captured early but the await is pushed to line 793). But this can't be a long-term solution.
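
The temporary fix mentioned above (capture the promise early, await it late) can be sketched as follows. All function names are hypothetical. The sketch relies only on the fact that an async function runs synchronously up to its first `await`.

```typescript
// Hypothetical attach flow; function names are illustrative only.
const steps: string[] = [];

async function createSummarySketch(): Promise<string> {
    steps.push("capture app state"); // synchronous part runs immediately
    await Promise.resolve();         // slow work happens after the capture
    return "app summary";
}

async function attachSketch(): Promise<string> {
    const appSummaryP = createSummarySketch(); // start, but do not await yet
    steps.push("capture protocol state");
    steps.push("switch to attaching");
    // Only now is it safe to yield: everything was captured in one synchronous go.
    const appSummary = await appSummaryP;
    steps.push("upload");
    return appSummary;
}

const attachResult = attachSketch();
attachResult.then((summary) => console.log(summary, steps));
```

This only holds as long as `createSummarySketch` really does all of its capturing before its first `await`, which is exactly the property that is hard to guarantee by inspection over time.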

vladsud · 2021-12-17T05:55:09Z

packages/loader/container-loader/src/container.ts

@@ -829,7 +829,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i
                }

                // take summary and upload
-                const appSummary: ISummaryTree = this.context.createSummary(redirectTable);
+                const appSummary: ISummaryTree = await this.context.createSummary(redirectTable);


Same problem here - await needs to be after this.emit("attached"). I'm not even sure code can handle it correctly today (i.e. async inflight summary process while we emit attached event) - maybe, but I'd need to do deeper code inspection to be sure, as we never have had such state.

vladsud · 2021-12-17T05:56:51Z

packages/runtime/container-runtime/src/containerRuntime.ts

        if (blobRedirectTable) {
            this.blobManager.setRedirectTable(blobRedirectTable);
        }

-        const summarizeResult = this.dataStores.createSummary();
+        const summarizeResult = await this.dataStores.createSummary();


Similar problem here (requiring changing existing flows / APIs) - addContainerBlobsToSummary() below should happen before await here, as we need to do all "initiation" of data collection synchronously in one go, to make sure it represents state at the same point in time.

vladsud · 2021-12-17T06:01:38Z

packages/runtime/container-runtime/src/dataStoreContext.ts

            assert(this.bindState === BindState.NotBound, 0x13b /* "datastore context is already in bound state" */);
            this.bindState = BindState.Binding;
            assert(this.channel !== undefined, 0x13c /* "undefined channel on datastore context" */);
-            bindChannel(this.channel);
+            await bindChannel(this.channel);


Similar to other places, it's extremely hard to say whether conversion to async here would open a Pandora's box of bugs.
This code is used in synchronous workflows (e.g., storing a handle as a value in a map causes attachment of the data store / DDS backing that handle, which previously was a fully synchronous process). Given that, it's hard to believe that the async parts of this process can happen later, when the user has a chance to continue modifying the objects that we are trying to snapshot. As in other places, we need to ensure we capture enough state (revisions) synchronously to represent the state of the universe before allowing async processes to run off that revision.

I don't understand the full implications of this. Does it mean that it will be very hard to get things right? Or we should enforce that some things remain synchronous?

vladsud · 2021-12-17T06:15:41Z

packages/runtime/datastore/src/channelContext.ts

-    const summarizeResult = channel.summarize(fullTree, trackState);
+): Promise<ISummaryTreeWithStats> {
+    const state = channel.captureSummaryState(fullTree);
+    const summarizeResult = await channel.summarizeState(state);


I'm a bit at a loss here.
You have added IChannel.captureRevision().summarize() flow.
Why is there some other captureSummaryState() flow? I do not see IChannel having captureSummaryState()
Is this simply non-updated code from prior revisions?

Yes, non-updated. The last couple commits were just to get feedback on the change to SharedObject before changing it everywhere.

vladsud · 2021-12-17T06:19:56Z

Looking deeper at the set of changes, I think you will not be able to escape making this a 2-step process (capture revision, generate summary) across all layers.
If you really want to avoid it in a single PR, I think you will have to leave dual workflows (synchronous and async) across all layers, and simply introduce the async flow (the way you are doing) to DDS (while leaving the sync flow completely intact).
It is just too easy to make mistakes and capture state from different points in time, and it will take substantial effort to write all the required UTs to get confidence that such a massive change across all layers actually does not regress anything.

vladsud · 2021-12-17T17:37:06Z

A thought worth sharing RE createSummary() flow: I think there is an incremental path how to convert everything to single flow.
What if we leave this PR for now, and start another one (that would be a pre-game for this one)? It would change createSummary() to return an IChannelRevision instead of the current summary. Under the covers, it can capture a snapshot; IChannelRevision would have just one function (possibly even synchronous, to make changes incremental) to expose that snapshot. That way we can rather easily (only one complication - back compat) change all existing layers to work with IChannelRevision with no other semantic or behavior changes. Then we can take further steps like

async summary on IChannelRevision - will require careful code inspection to make sure all awaits are after ALL state has been captured (some of the comments in this PR)
introduction of SharedObjectCore and allowing truly async behavior where it matters.

RE back compat: it's hard as we require back compat across two boundaries - Loader (Container + ContainerContext) to IRuntime (ContainerRuntime) and container runtime (IContainerRuntimeBase) to data store (IFluidDataStoreChannel). We will need parallel system (i.e. support old functions with old semantics while having new flow). Code can likely be reused but shape of API will need to support both for a while. Thus we would need createSummary2 in some format (i.e. can't just reuse existing name, unless there is something else telling each layer how to think about shape of this API)
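
One way the parallel system could look at a layer boundary, as a sketch only. `createSummary2` is the placeholder name floated above. `IRevisionSketch`, `IRuntimeSketch`, and `getRevision` are hypothetical. The caller feature-detects the new method and wraps the old result so downstream code sees a single shape.

```typescript
// Hypothetical back-compat shim. createSummary2 is the placeholder name floated
// above; all other names are illustrative.
interface IRevisionSketch {
    getSummary(): string;
}

interface IRuntimeSketch {
    createSummary(): string;             // old API, old semantics
    createSummary2?(): IRevisionSketch;  // new API, absent on older runtimes
}

function getRevision(runtime: IRuntimeSketch): IRevisionSketch {
    if (runtime.createSummary2 !== undefined) {
        return runtime.createSummary2();
    }
    // Older runtime: wrap the eager summary so callers see a single shape.
    const summary = runtime.createSummary();
    return { getSummary: () => summary };
}

const oldRuntime: IRuntimeSketch = { createSummary: () => "old" };
const newRuntime: IRuntimeSketch = {
    createSummary: () => "old",
    createSummary2: () => ({ getSummary: () => "new" }),
};
```

The same pattern would have to be repeated at each boundary that needs back compat, which is why the shape of the API has to support both flows for a while.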

agarwal-navin · 2021-12-17T17:45:47Z

I have a question about the first step, i.e., synchronously getting summary state of a node, in the proposed 2-step summarization process:
How will this work at layers such as data store context and channel context with delay loading? For example, if a data store has not yet been realized during summary and it has changed, we need to realize it first and so we cannot capture its summary state synchronously.

agarwal-navin · 2021-12-17T17:46:55Z

This will work for the summary generated for detached container via createSummary but not for the regular summarization flow.

vladsud · 2021-12-17T17:56:01Z

@agarwal-navin, that's correct, today's captureSummary does not work for those cases. New flow will have exactly same limitations.
I think you are asking how we extend it to be the only summary flow in the container, and make it work across all those different states that have different assumptions. I think the only answer is: very slowly, and only once the work in the layers underneath is done to support it, if and when that happens. I.e. nothing prevents us from writing a bunch of code that will capture a revision of the current state for a data store in any state. This is likely a tremendous amount of work, as we would need to duplicate the current state / tree completely to isolate it from "live" state. Or bake immutability across layers. But when / if this is done, we can return the revision in some form and work on it.

Note that we do not have to do it. We can continue to have two flows - async summary (as today) and sync capture revision, with revision flow working only where captureSummary() works today. It will still give us a boost in the form of leaf nodes (DDS) being able to switch to 2-step approach when generating summaries (i.e. DDSs will have only one flow to generate summary - 2-step process through revisions). This will solve Whiteboard problem, while maintaining correctness of the system. It's correctness that I'm after, plus at least one tiny place (DDSs) where we have only one way of doing summaries (not two as we have at data store layer), with possible long future capability of having one way of doing summaries across all layers.

agarwal-navin · 2021-12-17T18:05:33Z

Okay, I think I misunderstood this change. This is only changing the synchronous captureSummary for detached container by making it async. It is not changing the regular summarization flow that happens in the summarizer client. And we are looking for ways to make the DDS implement a single way of generating the summary for both the above flows.

vladsud · 2021-12-17T18:08:02Z

So I feel like PRs need to flow in this order:

1. Create SharedObject & SharedObjectCore. SharedObject will implement getGCData() through summary flow.
- (a) This assumes that most DDSs can leverage that. Those that can't need to use SharedObjectCore and have full flexibility as today (sounds like SharedMatrix & Sequence fall into this bucket).
- (b) This step is optional - we can skip it if we believe it does not provide that much value. The goal here (after step 3) is to simplify the most common case - most DDSs would care about only one thing, providing a summary method, and that's it.
2. Introduce async summary flow to DDS (IChannel), but leave sync flow as is.
- (a) FluidDataStoreRuntime.summarize() will call into IChannel.summarizeAsync().
- (b) Do not touch any other layer. captureSummary() / getAttachSummary do not change and rely on synchronous IChannel.summarize().
- (c) Whiteboard (DDSs based on SharedObjectCore) has to deal with two flows (async & sync) for now but gets the ability to do stuff asynchronously (when not attached). They can implement both sync & async methods however they want.
- (d) Async flow for SharedObject is implemented through sync flow (i.e., all non-Whiteboard DDSs implement just one sync summarize() method). All classes derived from SharedObject care only about implementing one sync summarize() method (same for steps 3-4).
3. Introduce capturing revisions for all layers. Revision exposes only an async summary method. The purpose of this step is to start conversion to a single path across layers, but in the meantime - be able to do so at DDS level.
- (a) For now, only works in detached state across all layers (container runtime, data store, DDS), plus works for DDS in any other state that has no local changes (i.e. called by summarizer).
- (b) Allows us to remove the 2c limitation (IChannel will have only captureRevision() and getGCData(), async summary method moves to revision), but keep two summary flows at higher levels.
- (c) SharedObject can expose getGCData() on the concrete revision object it creates by capturing the serializer while capturing summary/revision. Nothing about flow of control changes, we simply maintain simpler requirements for SharedObject derived classes - they only need to care about implementing a single method - the synchronous summarize() method. All other flows (revision capture / async summary, GC capture) are implemented by SharedObject by calling the synchronous summarize() method implemented by each concrete class.
- (d) No changes RE getGCData() for any other layer / interface.
4. Rework layers to be able to capture revisions at any moment in time (attached or not). Allows us to unify the summarization process across all layers.
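
For the first two steps, the division of labor between the base class and a derived DDS might look roughly like this. It is a sketch, not the eventual implementation. `summarizeAsync` and `getGCData` echo names used in the plan. `SharedObjectSketch`, `summarizeCore`, and `CellSketch` are hypothetical.

```typescript
// Hypothetical base class; summarizeAsync and getGCData echo names in the plan,
// everything else is illustrative.
abstract class SharedObjectSketch {
    // The only method a typical derived DDS implements.
    protected abstract summarizeCore(recordRoute: (route: string) => void): string;

    public summarize(): string {
        return this.summarizeCore(() => {});
    }

    // The async flow simply wraps the sync one.
    public async summarizeAsync(): Promise<string> {
        return this.summarize();
    }

    // GC data falls out of running the summary flow and recording routes.
    public getGCData(): string[] {
        const routes: string[] = [];
        this.summarizeCore((route) => routes.push(route));
        return routes;
    }
}

class CellSketch extends SharedObjectSketch {
    constructor(private readonly handleRoute: string) {
        super();
    }

    protected summarizeCore(recordRoute: (route: string) => void): string {
        recordRoute(this.handleRoute);
        return JSON.stringify({ handle: this.handleRoute });
    }
}

const cell = new CellSketch("/dataStore/a");
```

A DDS that needs different data for GC and for summaries would skip this base class and implement both flows itself, which is the role the plan gives to SharedObjectCore.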

# 2 is rather simple and immediately unblocks Whiteboard (even though in an ugly / hard to deal / explain way)
# 3 is not that hard, but back-compat requirements make it more problematic
# 4 is likely a ton of work. We would need to estimate it and make decision if it's worth it.

Note that # 3 gives us tool that can be used in other flows (and can be improved over time). I.e., ability to capture revision at DDS level is very powerful for all kinds of things (would be useful today for Word's transactions request). It also simplifies layer that 3rd party developers are most likely to implement, so while not having # 4 is bad for other layers, it simplifies where it matters the most. That said # 4 is what makes revisions really powerful :)

One possible pre-work: We need to be consistent on naming, and maybe that's PR # 0.
captureSummary() / getAttachSummary / async summarize() / sync summarize() - too many names with similar but different semantics. I think we can survive with fewer names, and names that clearly tell if it's sync or async method (at least when it comes to "summarize" naming)

agarwal-navin · 2021-12-17T18:18:02Z

That sounds like a good plan. Breaking this down into stages will definitely make it simpler to identify and catch issues.

scarlettjlee · 2022-01-03T20:05:41Z

This may be reopened later. For now, it's replaced by #8592.

scarlettjlee added 9 commits November 5, 2021 18:03

Add new methods

f639fa5

Change summarize calls up the stack

e04e404

Fix async behaviour further up stack

66db1c4

Make channel GC async

fa4d02d

Update docs for async GC methods

17a9c68

Fix up more async calls up the stack

911630b

Merge branch 'main' into asyncSummarization1

0be3bab

Fix merge

530b50c

Remove summarize method from SharedObject

13a77b5

scarlettjlee added this to the December 2021 milestone Nov 27, 2021

scarlettjlee requested review from DLehenbauer and agarwal-navin November 27, 2021 08:25

scarlettjlee self-assigned this Nov 27, 2021

github-actions bot requested review from vladsud, curtisman and anthony-murphy November 27, 2021 08:26

github-actions bot requested a review from ChumpChief November 27, 2021 08:26

scarlettjlee added 2 commits November 27, 2021 00:28

Remove summarize method from channel

0415039

Clean up unused lines

adda890

Change SharedObject._isSummarizing to isGCing

9e96cff

scarlettjlee linked an issue Dec 9, 2021 that may be closed by this pull request

DDS Async Summarization #7934

Closed

scarlettjlee added 2 commits December 13, 2021 01:44

Avoid 'any' as return type during summarization

31462ac

Avoid introducing more snap* naming into summarization

185572a

vladsud reviewed Dec 13, 2021

vladsud reviewed Dec 14, 2021

Move getGCData to IChannelRevision from IChannel

9c0c71c

Renames

github-actions bot requested a review from vladsud December 16, 2021 02:29

agarwal-navin reviewed Dec 16, 2021

vladsud reviewed Dec 17, 2021

scarlettjlee mentioned this pull request Dec 20, 2021

summarizeAsync #8592

Merged

scarlettjlee closed this Jan 3, 2022
