Measure impact of running DDS callbacks when processing ops #9618

Closed
Tracked by #9613
vladsud opened this issue Mar 25, 2022 · 8 comments
Labels
ado (Issue is tracked in the internal Azure DevOps backlog), perf, status: stale

Comments

@vladsud
Contributor

vladsud commented Mar 25, 2022

(Update) The numbers below are not what they claim to be. This is due to a mistake on my part: I computed percentile(toint(metric), 95) when it should have been toint(percentile(metric, 95)). The general observations still hold, and the work needed to better understand the impact on apps is the same, but we should be careful when using the numbers below (better to get the latest numbers using the right queries).


We need to be able to track op processing time cleanly, without the cost of running DDS callbacks.
Or, the other way around, be able to track the cost of application callbacks separately.

Here is some data showing why it matters, looking at op processing times:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend Summarizer = Data_clientType != "interactive"
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95) by WB, Summarizer

WB     Summarizer   count    P95
false  false        721491   498
true   false        11198    200
false  true         295747   128
true   true         63241    80

While usage differs greatly between the Whiteboard and Chapter 1 scenarios, I think the data clearly shows that performance of the Chapter 1 scenario is substantially (and statistically significantly) worse than Whiteboard's. Summarizer data for both scenarios (where app callbacks should not be installed) supports these claims as well.

This makes it really hard to:

  • Clearly articulate the cost to apps, as all we can do is rely on this kind of indirect comparison.
  • Easily track framework time and see issues on our side.

Note that these numbers are really bad. If you add the filter "Data_count > 1000", you will still see them drop, but they remain in pretty bad territory:

WB     Summarizer   count    P95
false  false        3077     168
true   false        655      39
false  true         9600     34
true   true         48131    26

At P50, we get to 6-10 ms in the best case (processing 1000+ ops, and not Chapter 1), and 28 ms in the best case for Chapter 1.
At 28 ms per op, that's only ~35 ops/sec for Chapter 1 in the best-case scenario! Ouch!

@vladsud vladsud added this to the April 2022 milestone Mar 25, 2022
@vladsud
Contributor Author

vladsud commented Mar 25, 2022

@anthony-murphy, any thoughts on how to do it centrally (and apply it to all future DDSs)?
We already have one potential bottleneck (SharedObject) that applies to 90% of cases, but there is no enforcement here unless it's part of the DDS API.

#9542 is a related issue that is worth considering. In discussion with Curtis, I think we will need to do something here, and one other direction popped up - maybe we will expose a "reboot your data model" kind of flow, where the runtime detects that we are so far behind that we would prefer to move the data model to the latest state in one go, and then let the application reconnect to the data model (the same way it does on container load).
"Move data model to latest state" may take two forms - apply all the ops without firing any callbacks, or just apply the latest state (load the latest snapshot and move to its state; this assumes no local changes, or the presence of 3-way merge in the future).

"Apply ops without firing callbacks" is interesting, as it's the simplest model if we can make our system really fast (which is another issue to be opened & investigated). And that calls for more control by the container runtime over callbacks.

@anthony-murphy
Contributor

anthony-murphy commented Mar 25, 2022

The issue is that op processing, dds op application, and dds delta callbacks are all synchronous, and currently most dds combine op application and delta callbacks.

The fastest and easiest approach is just to accept that most dds inherit from SharedObject and leverage that.

If we are ok with that constraint, we need to remove EventEmitter from SharedObject and instead make the event emitter a private member of SharedObject, and have SharedObject implement the interface IEventProvider, which will proxy to the internal event emitter. We then expose our own protected emit function to be used by DDSs to emit their events. We can then track the duration of those events, which will allow us to remove the event duration from the processing duration.
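A minimal sketch of roughly what that could look like, assuming a hypothetical base class (the real SharedObject / IEventProvider signatures aren't reproduced here):

```typescript
import { EventEmitter } from "events";
import { performance } from "perf_hooks";

// Hypothetical base class: the emitter is a private member, DDS code raises
// events only through the protected emit(), and the time spent in listeners
// is accumulated so it can be subtracted from op-processing time.
abstract class TimedEventObject {
    private readonly emitter = new EventEmitter();
    private listenerTimeMs = 0;

    // Subscription is proxied through to the internal emitter (stand-in for IEventProvider).
    public on(event: string, listener: (...args: any[]) => void): this {
        this.emitter.on(event, listener);
        return this;
    }

    public off(event: string, listener: (...args: any[]) => void): this {
        this.emitter.off(event, listener);
        return this;
    }

    // DDS implementations call this to emit; listener duration is measured here.
    protected emit(event: string, ...args: any[]): boolean {
        const start = performance.now();
        try {
            return this.emitter.emit(event, ...args);
        } finally {
            this.listenerTimeMs += performance.now() - start;
        }
    }

    // Telemetry reads (and resets) the accumulated listener time, e.g. per op batch.
    protected takeListenerTimeMs(): number {
        const total = this.listenerTimeMs;
        this.listenerTimeMs = 0;
        return total;
    }
}
```

Op-processing telemetry could then report the accumulated listener time separately from the overall processing duration, giving both a "framework time" and an "app callback time" metric.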

Going farther would require a much more drastic API change. Basically, DDSs wouldn't be in full control of their API. For this I would probably have the processMessage callback implemented by channels return the event data (name and args) for that message. The application would also only get some proxy object rather than the real DDS. The proxy would pass through the methods of the DDS and then extend it with eventing based on the returned events. I'm not sure such a model would be worth it, or ergonomic for developers.
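Purely to illustrate the shape (all names below are hypothetical, not the actual Channel API):

```typescript
// Hypothetical event descriptor a channel would return instead of emitting directly.
interface IChannelEvent {
    name: string;
    args: unknown[];
}

// Hypothetical channel contract: applying an op returns the events to raise.
interface ITimedChannel {
    processMessage(message: unknown, local: boolean): IChannelEvent[];
}

// The application would only see a proxy that forwards method calls to the real
// DDS and layers eventing on top, driven by whatever processMessage returned.
function createDdsProxy<T extends object & ITimedChannel>(dds: T) {
    const listeners = new Map<string, Set<(...args: unknown[]) => void>>();

    // Called by the runtime after op application; the runtime controls (and can
    // measure) exactly when and for how long application listeners run.
    const raise = (events: IChannelEvent[]): void => {
        for (const event of events) {
            for (const listener of listeners.get(event.name) ?? []) {
                listener(...event.args);
            }
        }
    };

    const proxy = new Proxy(dds, {
        get(target, prop, receiver) {
            if (prop === "on") {
                return (name: string, listener: (...args: unknown[]) => void) => {
                    let set = listeners.get(name);
                    if (set === undefined) {
                        set = new Set();
                        listeners.set(name, set);
                    }
                    set.add(listener);
                };
            }
            return Reflect.get(target, prop, receiver);
        },
    });

    return { proxy, raise };
}
```

In this model the runtime would itself call something like raise(dds.processMessage(msg, local)), which is what would give it full visibility into, and control over, application callback time.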

@vladsud
Contributor Author

vladsud commented Mar 25, 2022

Yes, I was thinking about the second part.
But I think we should start simple: leverage the existing SharedObjectCore infra, and maybe add one more intermediate object in the inheritance chain that has only this responsibility (to make it easier to use in places where the rest of the SharedObject chain is not used today, though inheritance has its own challenges).
This will give us enough initial data to reason about next steps.

@CraigMacomber
Contributor

One thing to consider when comparing to Whiteboard data is that Whiteboard has asynchronous op processing (so it can programmatically control which blobs get prefetched before telling the app about the new state from Fluid). Fluid Framework does not know about this, since Fluid does not actually support async op application: Whiteboard has to do this in a way Fluid does not understand. Combine this with the experimental SharedTree only computing the result of applying ops lazily, and I would not be surprised if the cost of applying ops in Whiteboard, as measured by fluid-framework, drastically underestimates the actual cost of processing the op in the DDS, in addition to not including any of the application-side costs either.

@vladsud
Contributor Author

vladsud commented Apr 4, 2022

Makes sense. This basically moves more responsibility to SharedTree and Whiteboard to measure the impact of such processing on the responsiveness of the application.
One side-effect to share - #9542 talks about potential solutions to the problem of too many ops / slow op application. Whatever solution we come up with will depend on our ability to notice when the system (as a whole) is not catching up and thus needs a way out. With async processing, the runtime will not have a full picture of these costs.

At the same time, you have more capabilities to detect such cases and take action on your side, which is similar in nature.

@vladsud
Contributor Author

vladsud commented May 16, 2022

@justus-camp, are you planning on working on this issue this month? It is assigned to April; I'll move it to May.

@vladsud vladsud modified the milestones: April 2022, May 2022 May 16, 2022
@justus-camp
Contributor

Don't think I'll be getting to this one. This was initially going to be one of my ramp-up tasks, but I don't think it's planned for me currently. Going to unassign myself.

@justus-camp justus-camp removed their assignment May 23, 2022
@curtisman curtisman removed this from the May 2022 milestone May 24, 2022
@curtisman curtisman added the ado Issue is tracked in the internal Azure DevOps backlog label Jun 15, 2022
@ghost ghost added the status: stale label Dec 12, 2022
@ghost

ghost commented Dec 12, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this as completed Dec 20, 2022
This issue was closed.