Measure impact of running DDS callbacks when processing ops #9618

Closed
Tracked by #9613
vladsud opened this issue Mar 25, 2022 · 8 comments
Labels
ado (Issue is tracked in the internal Azure DevOps backlog), perf, status: stale

Comments

@vladsud
Contributor

vladsud commented Mar 25, 2022

(Update) The numbers below are not what they claim to be. This is due to a mistake on my part: I computed percentile(toint(metric), 95) when it should have been toint(percentile(metric, 95)). The general observations still hold, and the work needed to better understand the impact on apps is the same, but we should be careful when using the numbers below (better to get the latest numbers using the right queries).


We need to be able to track op processing time cleanly, without the cost of running DDS callbacks.
Or, the other way around, be able to track the cost of application callbacks separately.

Here is some data showing why it matters, looking at op processing times:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend Summarizer = Data_clientType != "interactive"
| summarize count(), percentile(toint(Data_duration / Data_count * 1000), 95) by WB, Summarizer

WB     Summarizer   count    P95
false  false        721491   498
true   false        11198    200
false  true         295747   128
true   true         63241    80

While usage differs greatly between the Whiteboard and Chapter 1 scenarios, I think the data clearly shows that performance of the Chapter 1 scenario is substantially (and statistically significantly) worse than Whiteboard's. Summarizer data for both scenarios (where app callbacks should not be installed) supports these claims as well.

This makes it really hard to:

  • Clearly articulate the cost to apps, as all we can do is rely on this kind of indirect comparison.
  • Easily track framework time and see issues on our side.

Note that these numbers are really bad. If you add the filter "Data_count > 1000", you will still see them drop, but they remain in pretty bad territory:

WB     Summarizer   count    P95
false  false        3077     168
true   false        655      39
false  true         9600     34
true   true         48131    26

At P50, we get to 6-10 ms in the best case (processing 1000+ ops, and not Chapter 1), and 28 ms in the best case for Chapter 1.
At 28 ms per op, that's only ~35 ops/sec for Chapter 1 in the best-case scenario! Ouch!

@vladsud vladsud added this to the April 2022 milestone Mar 25, 2022
@vladsud
Contributor Author

vladsud commented Mar 25, 2022

@anthony-murphy, any thoughts on how to do it centrally (and apply it to all future DDSs)?
We already have one potential bottleneck (SharedObject) that applies to 90% of cases, but there is no enforcement here unless it's part of the DDS API.

#9542 is a related issue that is worth considering. In discussion with Curtis, I think we will need to do something here, and one other direction popped up - maybe we will expose a "reboot your data model" kind of flow, where the runtime detects that we are so far behind that we would prefer to move the data model to the latest state in one go, and then let the application reconnect to the data model (the same way it does on container load).
"Move data model to latest state" may take two forms - apply all the ops without firing any callbacks, or just apply the latest state (load the latest snapshot and move to its state; this assumes no local changes, or the presence of 3-way merge in the future).

"Apply ops without firing callbacks" is interesting, as it's the simplest model if we can make our system really fast (which is another issue to be opened & investigated). And that calls for more control by the container runtime over callbacks.

@anthony-murphy
Contributor

anthony-murphy commented Mar 25, 2022

The issue is that op processing, dds op application, and dds delta callbacks are all synchronous, and currently most dds combine op application and delta callbacks.

The fastest and easiest approach is just to accept that most dds inherit from SharedObject and leverage that.

If we are ok with that constraint, we need to remove EventEmitter from SharedObject and instead make the event emitter a private member of SharedObject, and have SharedObject implement the interface IEventProvider, which will proxy to the internal event emitter. We then expose our own protected emit function to be used by DDSs to emit their events. We can then track the duration of those events, which will allow us to remove the event duration from the processing duration.
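A minimal sketch of roughly what that could look like, assuming a hypothetical base class (the real SharedObject / IEventProvider signatures aren't reproduced here):

```typescript
import { EventEmitter } from "events";
import { performance } from "perf_hooks";

// Hypothetical base class: the emitter is a private member, DDS code raises
// events only through the protected emit(), and the time spent in listeners
// is accumulated so it can be subtracted from op-processing time.
abstract class TimedEventObject {
    private readonly emitter = new EventEmitter();
    private listenerTimeMs = 0;

    // Subscription is proxied through to the internal emitter (stand-in for IEventProvider).
    public on(event: string, listener: (...args: any[]) => void): this {
        this.emitter.on(event, listener);
        return this;
    }

    public off(event: string, listener: (...args: any[]) => void): this {
        this.emitter.off(event, listener);
        return this;
    }

    // DDS implementations call this to emit; listener duration is measured here.
    protected emit(event: string, ...args: any[]): boolean {
        const start = performance.now();
        try {
            return this.emitter.emit(event, ...args);
        } finally {
            this.listenerTimeMs += performance.now() - start;
        }
    }

    // Telemetry reads (and resets) the accumulated listener time, e.g. per op batch.
    protected takeListenerTimeMs(): number {
        const total = this.listenerTimeMs;
        this.listenerTimeMs = 0;
        return total;
    }
}
```

Op-processing telemetry could then report the accumulated listener time separately from the overall processing duration, giving both a "framework time" and an "app callback time" metric.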

Going farther would require a much more drastic API change. Basically, DDSs wouldn't be in full control of their API. For this I would probably have the processMessage callback implemented by channels return the event data (name and args) for that message. The application would also only get some proxy object rather than the real DDS. The proxy would pass through the methods of the DDS and then extend it with eventing based on the returned events. I'm not sure such a model would be worth it, or ergonomic for developers.
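Purely to illustrate the shape (all names below are hypothetical, not the actual Channel API):

```typescript
// Hypothetical event descriptor a channel would return instead of emitting directly.
interface IChannelEvent {
    name: string;
    args: unknown[];
}

// Hypothetical channel contract: applying an op returns the events to raise.
interface ITimedChannel {
    processMessage(message: unknown, local: boolean): IChannelEvent[];
}

// The application would only see a proxy that forwards method calls to the real
// DDS and layers eventing on top, driven by whatever processMessage returned.
function createDdsProxy<T extends object & ITimedChannel>(dds: T) {
    const listeners = new Map<string, Set<(...args: unknown[]) => void>>();

    // Called by the runtime after op application; the runtime controls (and can
    // measure) exactly when and for how long application listeners run.
    const raise = (events: IChannelEvent[]): void => {
        for (const event of events) {
            for (const listener of listeners.get(event.name) ?? []) {
                listener(...event.args);
            }
        }
    };

    const proxy = new Proxy(dds, {
        get(target, prop, receiver) {
            if (prop === "on") {
                return (name: string, listener: (...args: unknown[]) => void) => {
                    let set = listeners.get(name);
                    if (set === undefined) {
                        set = new Set();
                        listeners.set(name, set);
                    }
                    set.add(listener);
                };
            }
            return Reflect.get(target, prop, receiver);
        },
    });

    return { proxy, raise };
}
```

In this model the runtime would itself call something like raise(dds.processMessage(msg, local)), which is what would give it full visibility into, and control over, application callback time.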

@vladsud
Contributor Author

vladsud commented Mar 25, 2022

Yes, I was thinking about the second part.
But I think we should start simple: leverage the existing SharedObjectCore infra, and maybe add one more intermediate object in the inheritance chain that has only this responsibility (to make it easier to use in places where the rest of the SharedObject chain is not used today, though inheritance has its own challenges).
This will give us enough initial data to reason about next steps.

@CraigMacomber
Contributor

One thing to consider when comparing to Whiteboard data is that Whiteboard has asynchronous op processing (so it can programmatically control which blobs get prefetched before telling the app about the new state from Fluid). Fluid Framework does not know about this, since Fluid does not actually support async op application: Whiteboard has to do this in a way Fluid does not understand. Combine this with the experimental SharedTree only computing the result of applying ops lazily, and I would not be surprised if the cost of applying ops in Whiteboard, as measured by fluid-framework, drastically underestimates the actual cost of processing the op in the DDS, in addition to not including any of the application-side costs either.

@vladsud
Contributor Author

vladsud commented Apr 4, 2022

Makes sense. This basically moves more responsibility to SharedTree and Whiteboard to measure the impact of such processing on the responsiveness of the application.
One side-effect to share - #9542 talks about potential solutions to the problem of too many ops / slow op application. Whatever solution we come up with will depend on our ability to notice when the system (as a whole) is not catching up and thus needs a way out. With async processing, the runtime will not have a full picture of these costs.

At the same time, you have more capabilities to detect such cases and take action on your side, which is similar in nature.

@vladsud
Contributor Author

vladsud commented May 16, 2022

@justus-camp, are you planning on working on this issue this month? It is assigned to April; I'll move it to May.

@vladsud vladsud modified the milestones: April 2022, May 2022 May 16, 2022
@justus-camp
Contributor

Don't think I'll be getting to this one. This was initially going to be one of my ramp-up tasks, but I don't think it's planned for me currently. Going to unassign myself.

@justus-camp justus-camp removed their assignment May 23, 2022
@curtisman curtisman removed this from the May 2022 milestone May 24, 2022
@curtisman curtisman added the ado Issue is tracked in the internal Azure DevOps backlog label Jun 15, 2022
@ghost ghost added the status: stale label Dec 12, 2022
@ghost

ghost commented Dec 12, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this as completed Dec 20, 2022
This issue was closed.