
BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed #3788

Closed
Tracked by #7912
vladsud opened this issue Oct 1, 2020 · 7 comments


vladsud commented Oct 1, 2020

BatchManager hides bugs: it adds implicit batching even when the container runtime uses the Automatic flushing mode. And because its behavior is effectively random from the container runtime's point of view (it is timer-based), it ensures that nobody pays attention to using batches correctly where they are needed; things sort of work most of the time, but occasionally there is no batching and we get random failures.
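To make the failure mode concrete, here is a minimal sketch of the timer-based behavior described above (illustrative names and structure, not the actual BatchManager source; the 100-op limit matches the threshold discussed later in this thread):

```typescript
// Simplified sketch of a timer-based batcher; NOT the real BatchManager code.
class TimerBatchManager {
    private pending: unknown[] = [];
    private timer: ReturnType<typeof setTimeout> | undefined;

    constructor(private readonly send: (ops: unknown[]) => void) {}

    public push(op: unknown): void {
        this.pending.push(op);
        if (this.pending.length >= 100) {
            // Hitting the max batch size forces an immediate flush, silently
            // splitting whatever the caller thought was one batch.
            this.flush();
        } else if (this.timer === undefined) {
            // Otherwise ops accumulate until the timer fires, which from the
            // container runtime's point of view is an arbitrary moment.
            this.timer = setTimeout(() => this.flush(), 0);
        }
    }

    private flush(): void {
        if (this.timer !== undefined) {
            clearTimeout(this.timer);
            this.timer = undefined;
        }
        const batch = this.pending;
        this.pending = [];
        if (batch.length > 0) {
            this.send(batch);
        }
    }
}
```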

I believe the right behavior is that we test all of our code in automatic mode (which should be renamed; it is really a "no batching" mode), while all existing workloads (i.e. the Office container) use manual flush mode (which likely should also be renamed, to something like "JS-single-turn" mode). And no BatchManager.
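For concreteness, a sketch of what the two modes mean for calling code, against the runtime surface of this era (FlushMode / setFlushMode / flush); treat the exact names as approximate rather than authoritative:

```typescript
import { FlushMode } from "@fluidframework/runtime-definitions";

// Illustrative only: assumes a ContainerRuntime-like `runtime` and a
// SharedMap-like `map` are passed in; exact shapes approximate the API
// surface of this era.
function demonstrateFlushModes(
    runtime: { setFlushMode(mode: FlushMode): void; flush(): void },
    map: { set(key: string, value: unknown): void },
): void {
    // "No batching" mode: every op is flushed on its own.
    runtime.setFlushMode(FlushMode.Automatic);
    map.set("x", 1); // sent immediately
    map.set("y", 2); // sent immediately; remote clients can observe x without y

    // "JS-single-turn" mode: ops accumulate until an explicit flush
    // (or the end of the JS turn, once that is wired up).
    runtime.setFlushMode(FlushMode.Manual);
    map.set("x", 1);
    map.set("y", 2);
    runtime.flush(); // both ops leave the client as a single batch
}
```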

This will expose a ton of bugs: places where we need to use explicit batches but do not.
I think the number of such bugs in the Fluid repo is likely low, but stress testing is required to find them.
(Hopefully detached creation of containers, components & DDSs already addresses most of the potential issues.)

That said, applications using Fluid likely have far more issues, and thus deserve to start in safe territory, i.e. JS-turn batching, adding manual flushing where needed. While that feels like a big drop in our ability to push ops quickly and keep collaborative typing responsive, the reality is that an app needs to yield the JS turn frequently to stay responsive anyway, so this is usually not a problem unless there are bugs that need to be fixed regardless.
It is also worth pointing out that we use the same safe strategy in rich Office applications on all platforms where the UI and model threads are separated, and this strategy has worked really well (with only one exception: progress-UI kinds of workflows).

@vladsud vladsud added the bug Something isn't working label Oct 1, 2020
@ghost ghost added the triage label Oct 1, 2020
@tylerbutler tylerbutler added this to the Next milestone Oct 1, 2020
@ghost ghost added triage and removed triage labels Oct 1, 2020
@tylerbutler tylerbutler removed the triage label Oct 1, 2020
@ghost ghost added the triage label Oct 1, 2020
@tylerbutler tylerbutler added area: loader Loader related issues and removed bug Something isn't working triage labels Oct 1, 2020

ChumpChief commented Oct 20, 2020

Looking at this one now, trying to determine how safe it is to just go ahead and remove the BatchManager. As Vlad mentions above, there's a risk that customers are depending on the implicit batching from BatchManager, but my initial investigation makes me question how possible/likely that dependence would be, for the following reasons:

  1. Anything depending on this behavior must not yield the stack as it generates the ops. The timer-based submission makes any stack-yielding code highly unpredictable as to where the submission would occur within the pseudo-batch.
  2. Anything depending on this behavior must not generate more than 100 ops in the pseudo-batch. More than that triggers an immediate flush, breaking the batch.
  3. Anything depending on this behavior must not happen to cross the 100-op threshold with its pseudo-batch, e.g. if some other source of messages has already queued up 95 ops on the current timer, and the code in question then generates 10 more (see the sketch at the end of this comment).
  4. Both r11s and odsp have an odd behavior I'm observing: even when submitting a batch synchronously (say, 50 ops), I get an ack for the very first op of the batch as an individual websocket message, followed by a second websocket message that acks the remaining 49 ops at once. So on the receiving side at least, dependent code must not rely on the first op of the pseudo-batch actually arriving as part of the pseudo-batch. This kind of seems like a bug, but it is nevertheless also probably proof that batching is not a problem for the first op.
  5. r11s and odsp differ in their very-large-batch acks: if the total op count is over 100 (the max batch size), odsp seems to at least try to consolidate some of the large batches together. E.g. when submitting 310 ops simultaneously, we send four outgoing websocket messages with 100, 100, 100, and 10 ops respectively. On odsp, the acks come back in three websocket messages with 1, 199, and 110 ops respectively. On r11s I instead see the acks arrive in six incoming websocket messages with 1, 99, 1, 99, 1, and 9 ops respectively*. Again, this seems like a bug to me, but it is also further proof that pseudo-batches approaching 100 ops are far too inconsistent to take a dependency on.

So, taking the intersection of these limitations: the dependent code would need to depend on pseudo-batches smaller than 100 ops, in scenarios where other messages are not being generated in overly large quantities; it must not need the first op of the pseudo-batch to be received with the rest of it; and it must not yield the stack while generating the pseudo-batch. I don't know if we'll be able to further reduce this to an empty set, but it at least seems rarer than we initially thought.

*Edit: I think I mistyped here and meant eight messages of 1, 99, 1, 99, 1, 99, 1, and 9, but I no longer have the experiment up. Leaving as-is unless/until I verify.
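A toy reproduction of constraint 3, reusing the TimerBatchManager sketch from the first comment (illustrative only, and assuming that class is in scope):

```typescript
// Demonstrates a pseudo-batch being split when it crosses the 100-op threshold.
const sent: unknown[][] = [];
const manager = new TimerBatchManager((ops) => sent.push(ops));

// Some other source of messages has already queued 95 ops on the current timer.
for (let i = 0; i < 95; i++) {
    manager.push({ source: "other", index: i });
}
// The dependent code now generates its 10-op "batch".
for (let i = 0; i < 10; i++) {
    manager.push({ source: "dependent", index: i });
}
// One 100-op flush has gone out: the first 5 dependent ops left with the
// unrelated traffic, while the remaining 5 are still pending on the timer,
// so the pseudo-batch was split mid-stream.
console.log(sent.length, sent[0].length); // 1 100
```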

@curtisman curtisman added the refactor Code refactoring and cleanup label Oct 26, 2020
@curtisman curtisman modified the milestones: Next, December 2020 Oct 30, 2020
@curtisman curtisman modified the milestones: December 2020, January 2021 Dec 1, 2020
@curtisman curtisman modified the milestones: January 2021, Next Jan 8, 2021
@vladsud vladsud modified the milestones: Next, July 2021 Jun 29, 2021

vladsud commented Jun 29, 2021

Assigning a couple of tasks to myself to make progress on the design decisions:
#4048 - Correctness: 'FlushMode.Manual' should be the default
#3788 - BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed

@vladsud vladsud self-assigned this Jun 30, 2021

vladsud commented Jul 13, 2021

@ChumpChief, thanks for looking into it.
I think it's a matter of probabilities. I.e., if we ask whether we can construct a realistic case that implicitly depends on such batching under all those constraints, I think the answer is yes.
It's also not a great experience when a single-line change (setting one extra property on a map) results in a failure in an unrelated part of the code (some other component that the developer does not control directly, but that is called into on the same JS turn).

All that said, I think this is less about BatchManager itself and more about what guarantees we need to deliver, or start delivering, to developers (i.e. it's more about #4048). The rest is about removing pseudo-guarantees: either removing them completely, or making them predictable enough that they essentially become de facto guaranteed behavior people eventually rely on, or adding tools / processes / randomization so that people can easily find and recognize bugs in their code caused by reliance on undocumented behavior.

I've put more thoughts into #4048 on why I think this should be fully controlled by ContainerRuntime, with different defaults, and as a consequence no BatchManager (it simply gets in the way of delivering on ContainerRuntime's promises).

@vladsud vladsud modified the milestones: July 2021, August 2021 Jul 28, 2021
@vladsud vladsud assigned andre4i and unassigned vladsud Jul 28, 2021

vladsud commented Jul 28, 2021

This should be tackled after #4048 is done.
Note that this change is in the driver, while the issue above is in the runtime.
So we need to separate them by at least one version, to make sure the newer runtime has had a chance to go through the deployment process first.


andre4i commented Oct 18, 2021

BatchManager is currently behind a feature gate. I've been running the Bohemia tests with the feature gate enabled and some issues have been fixed.

The code will be removed after #7365 is closed.
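For context, a hypothetical sketch of what such a gate can look like; the setting key and surrounding names below are invented for illustration (the real plumbing is the config work tracked in #8497, and the legacy path reuses the TimerBatchManager sketch from the first comment):

```typescript
// Hypothetical illustration only: the setting key and names are invented.
interface IConfigProvider {
    getBoolean(key: string): boolean | undefined;
}

function createOpSender(
    config: IConfigProvider,
    send: (ops: unknown[]) => void,
): (op: unknown) => void {
    // The gate defaults to the legacy timer-based path until the rollout flips it.
    const disableBatchManager =
        config.getBoolean("Fluid.ContainerRuntime.DisableBatchManager") ?? false;
    if (disableBatchManager) {
        // New path: each op goes out per the runtime's FlushMode, no hidden timer.
        return (op) => send([op]);
    }
    // Legacy path, to be deleted once #7365 is closed.
    const manager = new TimerBatchManager(send);
    return (op) => manager.push(op);
}
```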

@andre4i andre4i added the epic An issue that is a parent to several other issues label Oct 18, 2021
@andre4i andre4i changed the title BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed Epic: BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed Oct 21, 2021
@andre4i andre4i modified the milestones: October 2021, November 2021 Oct 22, 2021
@danielroney danielroney removed the epic An issue that is a parent to several other issues label Oct 26, 2021
@danielroney danielroney changed the title Epic: BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed BatchManager should be removed, all the bugs in Runtime related to not using batches in right places fixed Oct 26, 2021
@vladsud vladsud modified the milestones: November 2021, December 2021 Dec 6, 2021

andre4i commented Dec 27, 2021

The config management feature will be merged soon (#8497). It then needs to be wired up in Bohemia first; after that, we can properly enable this and roll it out.


andre4i commented Mar 1, 2022

The removal hit #9279. However, #9223 should work around that issue, letting us move forward with the change.

@andre4i andre4i modified the milestones: February 2022, March 2022 Mar 1, 2022
@andre4i andre4i closed this as completed Mar 3, 2022