Consider adding notion of document revisions & more coarse document update model / events #9542

Closed
vladsud opened this issue Mar 21, 2022 · 5 comments
Assignees
Labels
design-required This issue requires design thought status: stale

Comments

vladsud (Contributor) commented Mar 21, 2022

This issue is opened to collect feedback.

Proposal:

  1. Add a notion of a document revision, which represents a diff of some sort between two document states
    • Most likely it would initially be in the same form we use today when delivering change events:
      • (a) final document state
      • (b) what changed
      • (c) previous values of the parts of the document that changed.
    • Long term, I'd love to see immutable revisions representing the whole document, and efficient mechanisms to clone (copy-on-write) and diff between revisions.
  2. All changes to the document are represented as revision updates at the container runtime level
    • Each object in the object graph (data stores, DDSs) can choose to raise its own events, scoped to only what changed under that node.
    • Note that such updates will differ in two ways from how DDSs raise events today: they will
      • (a) contain multiple changes (like multiple keys changed in a map)
      • (b) no longer be ordered relative to each other. This includes changes within a single DDS, as well as across DDSs.
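To make the proposal concrete, here is a minimal sketch of what a coarse revision update carrying (a) final state, (b) what changed, and (c) previous values might look like. All names here are hypothetical illustrations, not an actual Fluid Framework API; the document state is simplified to a flat key/value map.

```typescript
// Hypothetical shape of a coarse revision update, batching many changes:
// (a) final state, (b) which keys changed (unordered), (c) prior values.
type DocState = Map<string, unknown>;

interface RevisionUpdate {
  finalState: DocState;     // (a)
  changedKeys: string[];    // (b) — no ordering guarantees between changes
  previousValues: DocState; // (c)
}

// Compute one coarse update between two document states.
function diffRevisions(prev: DocState, next: DocState): RevisionUpdate {
  const changedKeys: string[] = [];
  const previousValues: DocState = new Map();
  const allKeys = new Set([...prev.keys(), ...next.keys()]);
  for (const key of allKeys) {
    if (prev.get(key) !== next.get(key)) {
      changedKeys.push(key);
      previousValues.set(key, prev.get(key));
    }
  }
  return { finalState: next, changedKeys, previousValues };
}
```

A consumer would receive one such update per batch, rather than one ordered event per key change.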

Where it helps:
This would allow the runtime / application to implement throttling of updates when the rate of changes exceeds the capability of the system. Some scenarios:

  1. Catch-up - the client needs to process thousands of ops.
  2. Boot - the client quite often starts with a stale (cached) snapshot and receives a more recent snapshot after the initial render - it would be great to move to the recent state in one step, rather than relying on processing (and first having to fetch) a lot of ops.
  3. Cases where the number of clients in the system, and the number of changes made by those clients, exceeds the capability of a single client to process that many ops in a given timeframe.
    • A rough metric here: a slow client can do at most 2K ops/second. That's not much, given that ODSP supports up to 500 clients per document session. Client throughput may be reduced substantially if the CPU is busy.
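The back-of-the-envelope arithmetic behind that metric, using only the two numbers above (2K ops/second per slow client, 500 clients per session), is worth spelling out:

```typescript
// A slow client processes at most 2,000 ops/second; ODSP allows up to
// 500 clients in one document session.
const maxOpsPerSecond = 2_000;
const maxClients = 500;

// If every client edits at the same rate, a slow client keeps up only
// while each client sends no more than this many ops per second:
const sustainableRatePerClient = maxOpsPerSecond / maxClients; // 4 ops/s
```

Four ops per second per client is easy to exceed during active editing, which is why coarse, batched updates matter at the high end of session size.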

Cons:
One big con of this model is the inability to reason about the order of changes. This needs to be strongly considered.
At the same time, as far as I can tell, this kind of update model is the primary update model in other systems (Firebase being an example).

@vladsud vladsud self-assigned this Mar 21, 2022
@markfields markfields added the design-required This issue requires design thought label Mar 21, 2022
vladsud (Contributor, Author) commented Mar 22, 2022

Here is one perf angle to this:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_count >= 100
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95) by Data_clientType

clientType count P95
Interactive 753,800 475
noninteractive/summarizer 350,469 123

The most likely explanation of the difference: we spend an unbelievable amount of time in application callbacks (DDS event handlers) - the summarizer does not have them.
This is supported by various threads on perf, like this one from Scriptor.

So the ability to reduce the number of updates to the app is key to maintaining performance.
Currently, based on P50 numbers, at best (i.e. when processing a big batch of ops), we can process 50 inbound ops/second. This is simply not acceptable. The summarizer can do 200 ops/second at P50 on 1000+ op batches.

@vladsud vladsud added this to the April 2022 milestone Mar 22, 2022
vladsud (Contributor, Author) commented Mar 22, 2022

Here is one more angle:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_count >= 100
| where Data_clientType == "interactive"
| where Data_hostScenarioName notcontains "Whiteboard"
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95)

Whiteboard - 231ms
All others - 478ms

  • ChatMessage (biggest counts) - 476ms

That clearly demonstrates the same information (from a different angle): app callbacks dominate processing time, and are the limiting factor in how quickly we can process ops, and thus in the scalability of the whole system.

vladsud (Contributor, Author) commented Mar 30, 2022

We may solve the problem of slow catch-up via a slightly different mechanism:
Not providing revisions (and the ability to diff them), but rather refreshing all container state and letting the application know that it needs to rebuild its model / presentation layer from scratch, the same way the app approaches it on container load.
The runtime may use this mechanism when it notices we are way behind and can move ahead by (let's say) applying a new snapshot to the data model, as opposed to processing ops.
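The decision the runtime would make can be sketched roughly as follows; the function name and the threshold are hypothetical tuning assumptions, not anything the framework defines today:

```typescript
// Hypothetical catch-up heuristic: when the backlog of unprocessed ops is
// large, apply a recent snapshot and tell the app to rebuild its model
// (as on container load), instead of replaying every op.
const REFRESH_FROM_SNAPSHOT_THRESHOLD = 5_000; // ops; tuning assumption

type CatchUpStrategy = "processOps" | "refreshFromSnapshot";

function chooseCatchUpStrategy(pendingOps: number): CatchUpStrategy {
  return pendingOps >= REFRESH_FROM_SNAPSHOT_THRESHOLD
    ? "refreshFromSnapshot" // one-step jump; app re-renders from scratch
    : "processOps";         // normal incremental op processing and events
}
```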

CraigMacomber (Contributor) commented

Whiteboard's async op processing benefits from batching: if the app is being too slow, it will end up with a backlog of remote edits to apply, but Whiteboard will process the entire backlog before sending an update to the app, which causes it to tend to catch up instead of falling behind (since the per-op cost decreases as it falls behind).
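That batching behavior can be sketched like this (a rough illustration of the pattern, not Whiteboard's actual code; all names are made up):

```typescript
// Drain the entire backlog of remote edits before notifying the app once.
// The fixed cost of the app callback is amortized over the whole batch,
// so a lagging client tends to catch up rather than fall further behind.
class BatchingQueue<T> {
  private backlog: T[] = [];
  constructor(private readonly onBatch: (edits: T[]) => void) {}

  // Remote edits arrive here as fast as the network delivers them.
  push(edit: T): void {
    this.backlog.push(edit);
  }

  // Called when the app is ready (e.g. on the next animation frame).
  // Emits at most one app update covering everything queued so far.
  flush(): number {
    const batch = this.backlog;
    this.backlog = [];
    if (batch.length > 0) {
      this.onBatch(batch);
    }
    return batch.length;
  }
}
```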

That said, there are fixed op processing costs that don't decrease this way (e.g. actually changing the internal tree model), and Whiteboard (and also Fluid?) doesn't have backpressure, so if it falls behind it can just get worse and worse, leaving an unbounded queue of both inbound and outbound ops to process.

We need a way to have end-to-end backpressure somehow (so the user can be informed/delayed when processing or upload/download is falling behind). It seems like #9618 could be applied to help separate DDS-caused vs app-caused costs, which we might use to inform backpressure handling differently: if the UI for a component can't keep up, unloading its UI and replacing it with a message to the user may have value (like browsers do with unresponsive tabs, though perhaps recovering once the op rate slows). But if the DDS itself can't keep up, not just the UI, I think we have to drop out of the collaborators (which makes sure we won't be the summarizer, so someone who can keep up will be), and maybe try to reconnect after the next summary (or once ops really slow down).

Applications (or DDSs) that incur the cost of ops asynchronously (like Whiteboard) may need additional APIs to express that backpressure to Fluid. We may also want a prioritization system as part of this (some app operations are more important to avoid delaying than others).
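One possible shape for such a backpressure API, sketched purely as a hypothesis (neither the names nor the watermark exist in Fluid today): the app reports how far behind its async processing is, and the runtime pauses inbound delivery above a high watermark.

```typescript
// Hypothetical backpressure signal from an app doing async op processing.
interface BackpressureSignal {
  pendingCount: number; // edits delivered but not yet applied by the app
}

// The runtime would stop delivering inbound ops while the app's backlog
// sits above a high watermark, bounding the queue instead of letting it
// grow without limit. The watermark value is an arbitrary assumption.
function shouldPauseInbound(
  signal: BackpressureSignal,
  highWatermark = 1_000
): boolean {
  return signal.pendingCount >= highWatermark;
}
```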

Is there documentation on how we do (or plan to do) backpressure? I'm curious what the UX is supposed to be when the op rate exceeds what some collaborators can handle.

microsoft-github-policy-service (Contributor) commented

This issue has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
