-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAFT: Async summarization of DDSs #8422
Conversation
// See: https://github.com/microsoft/FluidFramework/issues/4547 | ||
const serializer = new SummarySerializer(this.runtime.channelsRoutingContext); | ||
this.snapshotCore(serializer); | ||
const capture = this.captureSummaryStateCore(serializer, false); | ||
await this.summarizeStateCore(serializer, capture); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to prevent use of serializer
during this async call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean. But looking at this function, it feels like it does not care about reentrancy at all (same for getGCData), so maybe we remove _isGCing checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the serializer getter, which we don't want used during GC. My concern is that making this async might allow some other code [not related to GC process] to run in the meantime which legitimately calls the serializer getter.
// See: https://github.com/microsoft/FluidFramework/issues/4547 | ||
const serializer = new SummarySerializer(this.runtime.channelsRoutingContext); | ||
const state = this.captureStateCore(); | ||
await state.summarize(serializer, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the time SharedObject captured state, it's too late to give it a serializer. I think the right model is the following:
- captureStateCore (or captureState, not sure there is need for both here) would internally create a serializer and capture snapshot with it, keeping serializer
- The object returned has summarize() method that does not take serializer, and simply returns a summary.
- There is also another method, called getGCData(), that would use internally stored serializer to spit it out.
That way serializer is left to be implementation detail, and API is relatively high level:
- get a revision (aka state)
- ask revision to generate summary.
- ask revision to generate GC info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @agarwal-navin comments, running GC can change what should be in the summary. Given that the potential change is only to say whether something is referenced or not, it seems possible to just fix that up in the IChannelRevision. That feels like a strange thing to do though.
/** | ||
* {@inheritDoc (ISharedObject:interface).captureRevision} | ||
*/ | ||
public captureRevision(): IChannelRevision { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a couple of problems with having this API instead of separate APIs for summarize and getGCData:
- How will a DDS have custom implementation for one of these APIs? For example, SharedMatrix and SharedString work on a different set of data for generating GC state vs summary state. (Btw, this change will break them).
- The state returned by summarize is affected by GC. When GC runs on the data returned by getGCData, it may update the "reference" state of certain nodes in the system. This information needs to be encoded in the summary. Additionally, some nodes which did not need summarization before (because its data did not change) may need to be summarized because its "reference" state changed.
This is not really a problem at this layer yet but as soon as we have sub-DDS GC, this will break. Also, IFluidDataStoreChannel's shape will also become similar to this and it will run into this problem now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main challenge with point 2 above is that we don't fully know whether we need to re-summarize a node or not until GC has run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not sure if my changes for sequence and matrix are the right thing. For DDSs that want to do something custom, they should derive from SharedObjectCore and fully implement captureRevision themselves. They'd still need to capture all the state necessary for both summaries and GC.
- Is it possible to update the captured state and generate another summary off that? or does it need to happen on the sharedObject?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was asking myself that question (that Navin raised) - can summary and gc data come from same state or not, sounds not (sorry I forgot it). I think there are two ways to address it:
- Leave API as is but capture revision twice (at different times).
- Leave GC API the same, while revision to support only summary API.
It's possible that the answer will be different on the layer. I.e. the general structure should likely allow flexibility, and this we might want to leave GC API as is (including on SharedObjectCore layer). But if all but Whiteboard DDSs can always generage GC from summary, then I'd argue we want to build code in a way where this duplication of logic happens in single place (SharedObject) by having implementation of IChannelRevision here that can extract GC data out of it (i.e. have concrete implementation of IChannelRevision that captures and exposes GC data). That way, SharedObject.getGCData() can just do this.captureRevision().getGcData().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can implement it via the first way you mentioned. The reason is that the summary state of a node may have the reference state (whether its unreferenced or not) based on an older summary state. Basically, GC state can be from seq#100 whereas summary state can be from seq#200. We cannot mark a node at seq#200 referenced / unreferenced based on its data from seq#100 - that would be inconsistent.
Both the summary and GC state in a summary should be off the same point in time (seq#).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to interpret your comment :) DDS just exposes ability to capture these states, at any moment in time. External process governs when these methods are called - for same state or not. DDS itself does not capture any additional state, so there is no differences here. The only thing that differs is how internally it captures GC - if all SharedObject derived classes extract GC data in exactly same way (by essentially summarizing themselves), then there is no need to duplicate this code - SharedObject base class can help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood this change. I was under the impression that we are also changing the way summarization happens in the regular case in summarizer client.
@@ -367,16 +367,17 @@ export class DataStores implements IDisposable { | |||
return summaryBuilder.getSummaryTree(); | |||
} | |||
|
|||
public createSummary(): ISummaryTreeWithStats { | |||
public async createSummary(): Promise<ISummaryTreeWithStats> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: It's possible to ensure (by code inspection) that this function returns state of container at the time when createSummary() call started. But we will not be able to sustain that over time, even with UTs. I believe if we go that route, we have to commit (and implement quickly as a follow up) same flow you are implementing for DDS here, i.e., for whole container, with synchronous "capture revision" and async "generate snapshot out of it.
@@ -829,13 +829,13 @@ export class LocalFluidDataStoreContextBase extends FluidDataStoreContext { | |||
}); | |||
} | |||
|
|||
public generateAttachMessage(): IAttachMessage { | |||
public async generateAttachMessage(): Promise<IAttachMessage> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API existed for two reasons:
- synchronous nature of workflows that depended on it
- some subtle differences in content returned from async summarize methods.
Given that # 1 is gone in this PR, it would be great to have a follow up issue to examine # 2 and see if we can substantially reduce code duplication. I.e. make this logic rely as much as possible on summarization flow, only adding (if needed) subtle differences that are required (if they are) for attachment workflows
@@ -778,7 +778,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i | |||
if (!hasAttachmentBlobs) { | |||
// Get the document state post attach - possibly can just call attach but we need to change the | |||
// semantics around what the attach means as far as async code goes. | |||
const appSummary: ISummaryTree = this.context.createSummary(); | |||
const appSummary: ISummaryTree = await this.context.createSummary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example of the problem that I worry about when making createSummary() flow async. All capturing of state (aka revision) has to happen synchronously as otherwise we may not capture consistent state (i.e., parts of state would come from different points in time).
Here, we await container runtime summary creation, but capture protocol state and switch container to attaching state after that. That means that protocol state may change while we are blocked on await. Similar, any changes to container while we await would not generate ops because container is not switched to attaching state early enough, thus those changes would not make it to either snapshot nor ops that follow.
The only right way here is to ensure that all state is captured before any async activity is awaited. Ideally through 2-step process - get revision synchronously, generate snapshot from revision async. As a temp solution, we can do code inspection + UTs to get something working quickly (and fixing any cases like this one by ensuring that promise is captured early but await is pushed to line 793). But this can't be long-term solution
@@ -829,7 +829,7 @@ export class Container extends EventEmitterWithErrorHandling<IContainerEvents> i | |||
} | |||
|
|||
// take summary and upload | |||
const appSummary: ISummaryTree = this.context.createSummary(redirectTable); | |||
const appSummary: ISummaryTree = await this.context.createSummary(redirectTable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem here - await needs to be after this.emit("attached"). I'm not even sure code can handle it correctly today (i.e. async inflight summary process while we emit attached event) - maybe, but I'd need to do deeper code inspection to be sure, as we never have had such state.
if (blobRedirectTable) { | ||
this.blobManager.setRedirectTable(blobRedirectTable); | ||
} | ||
|
||
const summarizeResult = this.dataStores.createSummary(); | ||
const summarizeResult = await this.dataStores.createSummary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar problem here (requiring changing existing flows / APIs) - addContainerBlobsToSummary() below should happen before await here, as we need to do all "initiation" of data collection synchronously in one go, to make sure it represents state at the same point in time.
assert(this.bindState === BindState.NotBound, 0x13b /* "datastore context is already in bound state" */); | ||
this.bindState = BindState.Binding; | ||
assert(this.channel !== undefined, 0x13c /* "undefined channel on datastore context" */); | ||
bindChannel(this.channel); | ||
await bindChannel(this.channel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to other places, it's extremely hard to say if conversion to async here would not open pandora box of bugs.
Given that this code is used in synchronous workflows (i.e., storing a handle as a value in a map - that will cause attachment of data store / DDS that are backed by such handle, which previously was fully synchronous process), it's hard to believe that async parts of this process can happen later, when user has a chance to continue to modify objects that we are trying to snapshot. Similar as in other places, we need to ensure we capture enough state (revisions) synchronously to represent state of universe before allowing async processes run off that revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the full implications of this. Does it mean that it will be very hard to get things right? Or we should enforce that some things remain synchronous?
const summarizeResult = channel.summarize(fullTree, trackState); | ||
): Promise<ISummaryTreeWithStats> { | ||
const state = channel.captureSummaryState(fullTree); | ||
const summarizeResult = await channel.summarizeState(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit at a loss here.
You have added IChannel.captureRevision().summarize() flow.
Why is there some other captureSummaryState() flow? I do not see IChannel having captureSummaryState()
Is this simply non-updated code from prior revisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, non-updated. The last couple commits were just to get feedback on the change to SharedObject before changing it everywhere.
Looking deeper at the set of changes, I think you will not be able to escape from making 2-step process (capture revision, generate summary) across all layers. |
A thought worth sharing RE createSummary() flow: I think there is an incremental path how to convert everything to single flow.
RE back compat: it's hard as we require back compat across two boundaries - Loader (Container + ContainerContext) to IRuntime (ContainerRuntime) and container runtime (IContainerRuntimeBase) to data store (IFluidDataStoreChannel). We will need parallel system (i.e. support old functions with old semantics while having new flow). Code can likely be reused but shape of API will need to support both for a while. Thus we would need createSummary2 in some format (i.e. can't just reuse existing name, unless there is something else telling each layer how to think about shape of this API) |
I have a question about the first step, i.e., synchronously getting summary state of a node, in the proposed 2-step summarization process: |
This will work for the summary generated for detached container via |
@agarwal-navin, that's correct, today's captureSummary does not work for those cases. New flow will have exactly same limitations. Note that we do not have to do it. We can continue to have two flows - async summary (as today) and sync capture revision, with revision flow working only where captureSummary() works today. It will still give us a boost in the form of leaf nodes (DDS) being able to switch to 2-step approach when generating summaries (i.e. DDSs will have only one flow to generate summary - 2-step process through revisions). This will solve Whiteboard problem, while maintaining correctness of the system. It's correctness that I'm after, plus at least one tiny place (DDSs) where we have only one way of doing summaries (not two as we have at data store layer), with possible long future capability of having one way of doing summaries across all layers. |
Okay, I think I misunderstood this change. This is only changing the synchronous captureSummary for detached container by making it async. It is not changing the regular summarization flow that happens in the summarizer client. And we are looking for ways to make the DDS implement a single way of generating the summary for both the above flows. |
So I feel like PRs need to flow in this order:
# 2 is rather simple and immediately unblocks Whiteboard (even though in an ugly / hard to deal / explain way) Note that # 3 gives us tool that can be used in other flows (and can be improved over time). I.e., ability to capture revision at DDS level is very powerful for all kinds of things (would be useful today for Word's transactions request). It also simplifies layer that 3rd party developers are most likely to implement, so while not having # 4 is bad for other layers, it simplifies where it matters the most. That said # 4 is what makes revisions really powerful :) One possible pre-work: We need to be consistent on naming, and maybe that's PR # 0. |
That sounds like a good plan. Breaking this down into stages will definitely make it simpler to identify and catch issues. |
This may be reopened later. For now, it's replaced by #8592. |
See issue #7615
Split summarization of DDSs into two parts:
Questions for reviewers:
Not all the existing DDSs and test code are fixed up yet and this change will need more testing.