Measure impact of running DDS callbacks when processing ops #9618
@anthony-murphy, any thoughts on how to do it centrally (and apply it to all future DDSs)? #9542 is a related issue that is worth considering. In discussion with Curtis, I think we will need to do something here, and one other direction popped up: maybe we will expose a "reboot your data model" kind of flow where the runtime detects that we are so far behind that we would prefer to move the data model to the latest state in one go, and then let the application reconnect to the data model (the same way it does on container load). "Apply ops without firing callbacks" is interesting, as it's the simplest model if we can make our system really fast (which is another issue to be opened & investigated). And that asks for more control from the container runtime over callbacks.
The issue is that op processing, DDS op application, and DDS delta callbacks are all synchronous, and currently most DDSs combine op application and the delta callback. The fastest and easiest approach is to accept that most DDSs inherit from SharedObject and leverage that. If we are ok with that constraint, we need to remove EventEmitter from SharedObject's inheritance chain, instead make the event emitter a private member of SharedObject, and have SharedObject implement the interface IEventProvider, which will proxy to the internal event emitter. We then expose our own protected emit function to be used by DDSs to emit their events. We can then track the duration of those events, which will allow us to subtract the event duration from the processing duration.

To go further would require a much more drastic API change. Basically, DDSs wouldn't be in full control of their API. For this, I would probably have the processMessage callback implemented by Channels return the event data (name and args) for that message. The application would also only get some proxy object rather than the real DDS. The proxy would pass through the methods of the DDS and extend it with eventing based on the returned events. I'm not sure such a model would be worth it, or ergonomic for developers.
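A rough sketch of what that first option could look like, assuming a plain Node.js EventEmitter; this is not the actual SharedObject implementation, and names like takeEventDuration() are hypothetical:

```typescript
// Sketch only: the real class would also implement IEventProvider and the
// rest of the channel contract. takeEventDuration() is a hypothetical hook.
import { EventEmitter } from "events";

abstract class SharedObjectSketch {
    // Private emitter: DDSs can no longer emit directly on "this".
    private readonly emitter = new EventEmitter();

    // Accumulated time spent inside application event listeners.
    private eventDurationMs = 0;

    public on(event: string, listener: (...args: unknown[]) => void): this {
        this.emitter.on(event, listener);
        return this;
    }

    public off(event: string, listener: (...args: unknown[]) => void): this {
        this.emitter.off(event, listener);
        return this;
    }

    // DDS implementations call this instead of EventEmitter.emit, which lets
    // the base class measure how long application callbacks take.
    protected emit(event: string, ...args: unknown[]): boolean {
        const start = performance.now();
        try {
            return this.emitter.emit(event, ...args);
        } finally {
            this.eventDurationMs += performance.now() - start;
        }
    }

    // Telemetry can read and reset this to report processing time net of callbacks.
    public takeEventDuration(): number {
        const duration = this.eventDurationMs;
        this.eventDurationMs = 0;
        return duration;
    }
}
```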
Yes, I was thinking about the second part.
One thing to consider when comparing to whiteboard data is that whiteboard has asynchronous op processing (so it can programmatically control which blobs get prefetched before telling the app about the new state from Fluid). Fluid Framework does not know about this, since Fluid does not actually support async op application: whiteboard has to do this in a way Fluid does not understand. Combine this with experimental shared tree only computing the result of applying ops lazily, and I would not be surprised if the cost of applying ops in whiteboard, as seen by fluid-framework, drastically underestimates the actual cost of processing the op in the DDS, in addition to not including any of the application-side costs either.
Makes sense. This basically moves more responsibility to SharedTree and Whiteboard to measure the impact of such processing on the responsiveness of the application. At the same time, you have more capabilities to detect such cases and take action on your side, which is similar in nature.
@justus-camp, are you planning on working on this issue this month? It is assigned to April. I'll move it to May.
Don't think I'll be getting to this one. This was initially going to be one of my ramp-up tasks, but I don't think it's planned for me currently. Going to un-assign myself.
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
(Update) Numbers below are not what they claim to be. This is due to me making a mistake and doing percentile(toint(metric), 95), while it should have been toint(percentile(metric, 95)). The general observations are still the same, and the work that is needed to better understand the impact of apps is the same, but we should be careful with using the numbers below (and it is better to get the latest numbers using the right queries).
We need to clearly track op processing time without the cost of running DDS callbacks.
Or, the other way around, be able to track the cost of application callbacks.
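As a rough illustration (not an existing API), a runtime-side wrapper could separate total processing time from callback time, assuming the DDS exposes something like the hypothetical takeEventDuration() hook sketched in the comment above:

```typescript
// Hypothetical sketch: measure op processing and subtract the time spent in
// application callbacks, assuming the channel reports its callback duration.
interface MeasurableChannel {
    process(op: unknown): void;
    takeEventDuration(): number; // hypothetical, not part of the current API
}

function processWithTelemetry(
    channel: MeasurableChannel,
    op: unknown,
    log: (event: { totalMs: number; callbackMs: number; processMs: number }) => void,
): void {
    const start = performance.now();
    channel.process(op);
    const totalMs = performance.now() - start;
    const callbackMs = channel.takeEventDuration();
    log({ totalMs, callbackMs, processMs: totalMs - callbackMs });
}
```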
Here is some data on why it matters, looking at op processing times:
While usage differs greatly between the Whiteboard and Chapter 1 scenarios, I think the data clearly points out that performance of the Chapter 1 scenario is substantially (and statistically significantly) worse than Whiteboard's. Summarizer data for both scenarios (where app callbacks should not be installed) supports these claims as well.
This makes it really hard to
Please note that these numbers are really bad. But if you add the filter "Data_count > 1000", you still see them drop, but they remain in pretty bad territory:
At P50, we get to 6-10 ms in the best case (processing 1000+ ops, and not Chapter 1), and 28 ms in the best case for Chapter 1.
That's only about 35 ops/sec (1000 ms / 28 ms) for Chapter 1 in the best-case scenario! Ouch!