BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed #3788
Comments
Looking at this one now, trying to determine how safe it is to just go ahead and remove the
So taking the intersection of these limitations, the dependent code would need to depend on pseudo-batches smaller than 100 ops, in scenarios where other messages are not being generated in overly large quantities; it must not need the first op in the pseudo-batch to be received with the rest of the pseudo-batch, and it must not yield the stack while generating the pseudo-batch. I don't know if we'll be able to further reduce this to an empty set, but it at least seems rarer than we initially thought. *Edit: I think I mistyped here and meant eight messages of 1, 99, 1, 99, 1, 99, 1, 9 but I don't have the experiment still up. Leaving as-is unless/until I verify.
@ChumpChief, thanks for looking into it. That said, I think it's less about BatchManager itself and more about what guarantees we need to deliver, or start delivering, to developers (i.e. it's more about #4048). The rest is about removing pseudo-guarantees: either removing them completely, or making the behavior predictable enough that it becomes a de facto guarantee people eventually rely on, or adding tools / processes / randomization so people can easily find the bugs in their code caused by relying on undocumented behavior. I've put more thoughts into #4048 on why I think batching should be fully controlled by ContainerRuntime with different defaults and, by virtue of that, no BatchManager (as it simply gets in the way of delivering ContainerRuntime promises).
Should be tackled after #4048 is done.
BatchManager is currently behind a feature gate. I've been running the Bohemia tests with the feature gate enabled and some issues have been fixed. The code will be removed after #7365 is closed.
The config management feature will be merged soon (#8497). It then needs to be wired in Bohemia first, then we can properly enable this and roll it out.
BatchManager hides bugs - it adds implicit batching, even if the container runtime uses Automatic flush mode. But because its behavior is random (from the POV of the container runtime - it is based on a timer), nobody pays attention to using batches correctly where they are needed: things sort of work most of the time, but occasionally ops get no batching and the result is random failures.
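To illustrate why timer-based grouping looks random to the caller, here is a minimal sketch. It is not the real BatchManager implementation: `groupByTimerWindow`, the fixed-window model, and all the numbers are assumptions made up for the example. The only point it shows is that the same two ops can land in one batch or two depending on where the timer boundary happens to fall.

```typescript
// Hypothetical model of timer-driven implicit batching (NOT the real BatchManager).
// Ops are represented only by their submission time in milliseconds.
function groupByTimerWindow(
    opTimesMs: number[],
    windowMs: number,
    timerStartMs: number,
): number[][] {
    const batches = new Map<number, number[]>();
    for (const t of opTimesMs) {
        // Every op submitted within the same timer window ends up in the same batch.
        const windowIndex = Math.floor((t - timerStartMs) / windowMs);
        const batch = batches.get(windowIndex) ?? [];
        batch.push(t);
        batches.set(windowIndex, batch);
    }
    return Array.from(batches.values());
}

// Two ops that the caller assumes will be processed together.
const opTimes = [104, 106];

// Timer started at 100 ms: both ops fall into the window [100, 110).
const sameWindow = groupByTimerWindow(opTimes, 10, 100);

// Timer started at 95 ms: the boundary at 105 ms splits the two ops.
const splitWindow = groupByTimerWindow(opTimes, 10, 95);

console.log(JSON.stringify(sameWindow), JSON.stringify(splitWindow));
```

Code that implicitly relies on the first outcome passes most of the time and fails only when the timer phase shifts, which is the kind of hidden bug described above.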
I believe the right behavior is that we test all of our code in automatic mode (which should be renamed - it's "no batching" mode), but for all existing workloads (i.e. Office container) we use manual flush mode (which likely should also be renamed to JS-single-turn mode). And no BatchManager.
This will expose a ton of bugs - places where we need to use explicit batches, but we do not use them.
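A rough sketch of the difference between the two modes and an explicit batch follows. `SketchRuntime`, `FlushPolicy`, `submit`, `batch`, and `flush` are hypothetical names invented for this example and are not the ContainerRuntime API; the explicit `flush()` call stands in for whatever hook would run at the end of a JS turn.

```typescript
// Hypothetical names throughout; this is not the Fluid ContainerRuntime API.
type Op = string;
type FlushPolicy = "immediate" | "turnBased";

class SketchRuntime {
    // Each entry is one batch as it would be sent to the service.
    public readonly sent: Op[][] = [];
    private pending: Op[] = [];
    private batchDepth = 0;

    constructor(private readonly policy: FlushPolicy) {}

    submit(op: Op): void {
        this.pending.push(op);
        // "immediate" ("no batching") mode: an op outside an explicit batch goes out alone.
        if (this.policy === "immediate" && this.batchDepth === 0) {
            this.flush();
        }
    }

    // Explicit batch: every op submitted by the callback is sent together.
    batch(callback: () => void): void {
        this.batchDepth++;
        try {
            callback();
        } finally {
            this.batchDepth--;
            if (this.policy === "immediate" && this.batchDepth === 0) {
                this.flush();
            }
        }
    }

    // In "turnBased" mode the host would call this once the JS turn ends.
    flush(): void {
        if (this.pending.length > 0) {
            this.sent.push(this.pending);
            this.pending = [];
        }
    }
}

// No batching and no explicit batch: related ops are sent separately.
const unbatched = new SketchRuntime("immediate");
unbatched.submit("insert");
unbatched.submit("annotate");

// No batching mode with an explicit batch: the ops travel together.
const explicitBatch = new SketchRuntime("immediate");
explicitBatch.batch(() => {
    explicitBatch.submit("insert");
    explicitBatch.submit("annotate");
});

// Turn-based mode: everything submitted before the end of the turn travels together.
const turnBased = new SketchRuntime("turnBased");
turnBased.submit("insert");
turnBased.submit("annotate");
turnBased.flush(); // stands in for the end of the JS turn

console.log(unbatched.sent, explicitBatch.sent, turnBased.sent);
```

Under these assumptions, testing in the "immediate" policy makes a missing explicit batch show up deterministically as two separate messages, rather than only when a timer happens to split the ops.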
I think the number of bugs in the Fluid repo is likely low, but stress testing is required to find them.
(Hopefully detached creation of containers, components & DDSs already addresses most of the potential issues.)
That said, applications using Fluid likely have way more issues, and thus deserve to start in safe territory - i.e. JS-turn batching, adding manual flushing where needed. While it feels like a big drop in our ability to push ops quickly and have responsive collaborative typing, the reality is that an app needs to yield the JS turn frequently to stay responsive, so usually it's not a problem unless bugs exist that need to be fixed anyway.
Also worth pointing out that we use the same safe strategy in rich Office applications on all platforms where UI & Model threads are separated, and this strategy has worked really well (with only one exception - progress-UI kinds of workflows).