Consider adding notion of document revisions & more coarse document update model / events #9542

Closed
vladsud opened this issue Mar 21, 2022 · 5 comments
Assignees
Labels
design-required This issue requires design thought status: stale

Comments

vladsud (Contributor) commented Mar 21, 2022

This issue is opened to collect feedback.

Proposal:

  1. Add a notion of a document revision, which represents a diff of some sort between two document states
    • Most likely it would initially be in the same form we use today when delivering change events:
      • (a) final document state
      • (b) what changed
      • (c) previous values of the parts of the document that changed.
    • Long term, I'd love to see immutable revisions representing the whole document, and efficient mechanisms to clone (copy-on-write) and diff between revisions.
  2. All changes to the document are represented as revision updates at the container runtime level
    • Each object in the object graph (data stores, DDSs) can choose to raise its own events, scoped to only what changed under that node.
    • Note that such updates will differ in two ways from how DDSs raise events today: they will
      • (a) contain multiple changes (like multiple keys changed in a map)
      • (b) no longer be ordered relative to each other. This includes changes within a single DDS, as well as across DDSs.
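To make the proposal concrete, here is a minimal sketch of what a coarse revision update carrying (a) final state, (b) what changed, and (c) previous values might look like. All names here are hypothetical illustrations, not an actual Fluid Framework API; the document state is simplified to a flat key/value map.

```typescript
// Hypothetical shape of a coarse revision update, batching many changes:
// (a) final state, (b) which keys changed (unordered), (c) prior values.
type DocState = Map<string, unknown>;

interface RevisionUpdate {
  finalState: DocState;     // (a)
  changedKeys: string[];    // (b) — no ordering guarantees between changes
  previousValues: DocState; // (c)
}

// Compute one coarse update between two document states.
function diffRevisions(prev: DocState, next: DocState): RevisionUpdate {
  const changedKeys: string[] = [];
  const previousValues: DocState = new Map();
  const allKeys = new Set([...prev.keys(), ...next.keys()]);
  for (const key of allKeys) {
    if (prev.get(key) !== next.get(key)) {
      changedKeys.push(key);
      previousValues.set(key, prev.get(key));
    }
  }
  return { finalState: next, changedKeys, previousValues };
}
```

A consumer would receive one such update per batch, rather than one ordered event per key change.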

Where it helps:
This would allow the runtime / application to implement throttling of updates when the rate of changes exceeds the capability of the system. Some scenarios:

  1. Catch-up - the client needs to process thousands of ops.
  2. Boot - the client quite often starts with a stale (cached) snapshot and receives a more recent snapshot after the initial render - it would be great to move to the recent state in one step, rather than relying on processing (and first having to fetch) a lot of ops.
  3. Cases where the number of clients in the system, and the number of changes made by those clients, exceeds the capability of a single client to process that many ops in a given timeframe.
    • A rough metric here: a slow client can do at most 2K ops/second. That's not much, given that ODSP supports up to 500 clients per document session. Client throughput may be reduced substantially if the CPU is busy.
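The back-of-the-envelope arithmetic behind that metric, using only the two numbers above (2K ops/second per slow client, 500 clients per session), is worth spelling out:

```typescript
// A slow client processes at most 2,000 ops/second; ODSP allows up to
// 500 clients in one document session.
const maxOpsPerSecond = 2_000;
const maxClients = 500;

// If every client edits at the same rate, a slow client keeps up only
// while each client sends no more than this many ops per second:
const sustainableRatePerClient = maxOpsPerSecond / maxClients; // 4 ops/s
```

Four ops per second per client is easy to exceed during active editing, which is why coarse, batched updates matter at the high end of session size.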

Cons:
One big con of this model is the inability to reason about the order of changes. This needs to be strongly considered.
At the same time, as far as I can tell, this kind of update model is the primary update model in other systems (Firebase being an example).

@vladsud vladsud self-assigned this Mar 21, 2022
@markfields markfields added the design-required This issue requires design thought label Mar 21, 2022
vladsud (Contributor, Author) commented Mar 22, 2022

Here is one perf angle to this:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_count >= 100
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95) by Data_clientType

clientType count P95
Interactive 753,800 475
noninteractive/summarizer 350,469 123

The most likely explanation of the difference: we spend an unbelievable amount of time in application callbacks (DDS event handlers) - the summarizer does not have them.
This is supported by various threads on perf, like this one from Scriptor.

So the ability to reduce the number of updates to the app is key to maintaining performance.
Currently, based on P50 numbers, at best (i.e. when processing a big batch of ops), we can process 50 inbound ops/second. This is simply not acceptable. The summarizer can do 200 ops/second at P50 on 1000+ op batches.

@vladsud vladsud added this to the April 2022 milestone Mar 22, 2022
vladsud (Contributor, Author) commented Mar 22, 2022

Here is one more angle:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_count >= 100
| where Data_clientType == "interactive"
| where Data_hostScenarioName notcontains "Whiteboard"
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95)

Whiteboard - 231ms
All others - 478ms

  • ChatMessage (biggest counts) - 476ms

That clearly demonstrates the same information (from a different angle): app callbacks dominate processing time, and are the limiting factor in how quickly we can process ops, and thus in the scalability of the whole system.

vladsud (Contributor, Author) commented Mar 30, 2022

We may solve the problem of slow catch-up via a slightly different mechanism:
Not providing revisions (and the ability to diff them), but rather refreshing all container state and letting the application know that it needs to rebuild its model / presentation layer from scratch, the same way the app approaches it on container load.
The runtime may use this mechanism when it notices we are way behind and can move ahead by (let's say) applying a new snapshot to the data model, as opposed to processing ops.
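The decision the runtime would make can be sketched roughly as follows; the function name and the threshold are hypothetical tuning assumptions, not anything the framework defines today:

```typescript
// Hypothetical catch-up heuristic: when the backlog of unprocessed ops is
// large, apply a recent snapshot and tell the app to rebuild its model
// (as on container load), instead of replaying every op.
const REFRESH_FROM_SNAPSHOT_THRESHOLD = 5_000; // ops; tuning assumption

type CatchUpStrategy = "processOps" | "refreshFromSnapshot";

function chooseCatchUpStrategy(pendingOps: number): CatchUpStrategy {
  return pendingOps >= REFRESH_FROM_SNAPSHOT_THRESHOLD
    ? "refreshFromSnapshot" // one-step jump; app re-renders from scratch
    : "processOps";         // normal incremental op processing and events
}
```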

CraigMacomber (Contributor) commented

Whiteboard's async op processing benefits from batching: if the app is being too slow, it will end up with a backlog of remote edits to apply, but Whiteboard will process the entire backlog before sending an update to the app, which causes it to tend to catch up instead of falling behind (since the per-op cost decreases as it falls behind).
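That batching behavior can be sketched like this (a rough illustration of the pattern, not Whiteboard's actual code; all names are made up):

```typescript
// Drain the entire backlog of remote edits before notifying the app once.
// The fixed cost of the app callback is amortized over the whole batch,
// so a lagging client tends to catch up rather than fall further behind.
class BatchingQueue<T> {
  private backlog: T[] = [];
  constructor(private readonly onBatch: (edits: T[]) => void) {}

  // Remote edits arrive here as fast as the network delivers them.
  push(edit: T): void {
    this.backlog.push(edit);
  }

  // Called when the app is ready (e.g. on the next animation frame).
  // Emits at most one app update covering everything queued so far.
  flush(): number {
    const batch = this.backlog;
    this.backlog = [];
    if (batch.length > 0) {
      this.onBatch(batch);
    }
    return batch.length;
  }
}
```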

That said, there are fixed op processing costs that don't decrease this way (e.g. actually changing the internal tree model), and Whiteboard (and also Fluid?) doesn't have backpressure, so if it falls behind it can just get worse and worse, leaving an unbounded queue of both inbound and outbound ops to process.

We need a way to have end-to-end backpressure somehow (so the user can be informed/delayed when processing or upload/download is falling behind). It seems like #9618 could be applied to help separate DDS-caused vs app-caused costs, which we might use to inform backpressure handling differently: if the UI for a component can't keep up, unloading its UI and replacing it with a message to the user may have value (like browsers do with unresponsive tabs, though perhaps recovering once the op rate slows). But if the DDS itself can't keep up, not just the UI, I think we have to drop out of the collaborators (which makes sure we won't be the summarizer, so someone who can keep up will be), and maybe try to reconnect after the next summary (or once ops really slow down).

Applications (or DDSs) that incur the cost of ops asynchronously (like Whiteboard) may need additional APIs to express that backpressure to Fluid. We may also want a prioritization system as part of this (some app operations are more important to avoid delaying than others).
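One possible shape for such a backpressure API, sketched purely as a hypothesis (neither the names nor the watermark exist in Fluid today): the app reports how far behind its async processing is, and the runtime pauses inbound delivery above a high watermark.

```typescript
// Hypothetical backpressure signal from an app doing async op processing.
interface BackpressureSignal {
  pendingCount: number; // edits delivered but not yet applied by the app
}

// The runtime would stop delivering inbound ops while the app's backlog
// sits above a high watermark, bounding the queue instead of letting it
// grow without limit. The watermark value is an arbitrary assumption.
function shouldPauseInbound(
  signal: BackpressureSignal,
  highWatermark = 1_000
): boolean {
  return signal.pendingCount >= highWatermark;
}
```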

Is there documentation on how we do (or plan to do) backpressure? I'm curious what the UX is supposed to be when the op rate exceeds what some collaborators can handle.

microsoft-github-policy-service (Contributor) commented

This issue has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
