Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove feature gates for flushmode and batchmanager #8833

Merged
merged 36 commits into from
Mar 3, 2022

Conversation

andre4i
Copy link
Contributor

@andre4i andre4i commented Jan 21, 2022

closes #7365

This will be reviewed/merged after the respective experiments are concluded.

@github-actions github-actions bot added area: driver Driver related issues area: runtime Runtime related issues public api change Changes to a public API labels Jan 21, 2022
@github-actions github-actions bot added area: examples Changes that focus on our examples area: tests Tests to add, test infrastructure improvements, etc labels Jan 22, 2022
@andre4i
Copy link
Contributor Author

andre4i commented Feb 18, 2022

Small update, @vladsud

The test infra changes have been split off and merged in a separate PR, to keep things nice and isolated.

There are two problematic tests with this change:

packages/test/test-end-to-end-tests/src/test/stashedOps.spec.ts@"doesn't resend successful chunked op"

  • timeout. I investigated this before, but I can't find anything wrong with this. My suspicion is that without the batch manager and with flushmode.manual, this test may be irrelevant. Will disable it with an open issue and will follow up with Wes.

packages/test/test-end-to-end-tests/src/test/attachRegisterLocalApiTests.spec.ts@"Registering DDS in attached dataStore should attach it"

  • which fails due to the new feature in testing to catch all errors:
         "after each" hook for "Registering DDS in attached dataStore should attach it":
     Error: Unexpected Errors in Logs. Use itExpects to specify expected errors:
[
  {
    "eventName": "fluid:telemetry:DeltaManager:OpenBatchOnDisconnect",
    "length": 1,
    "category": "error",
    "clientType": "interactive",
    "containerId": "20256d54-0cb6-4ba1-be82-2e541cf02fd0",
    "docId": "0d6e77f2-0cb4-44c1-ab4d-2147b2de61eb",
    "containerAttachState": "Attached",
    "containerLifecycleState": "loaded",
    "containerConnectionState": "Connected",
    "dmInitialSeqNumber": 0,
    "dmLastKnownSeqNumber": 0,
    "connectionStateDuration": 1220.784600019455,
    "loaderId": "ca374d26-f185-4b3e-838d-e47be4961193",
    "loaderVersion": "0.58.0"
  }
]
  • what happens here is that the disconnect handler in the DeltaManager when the client reconnects due to read->write pushes the error event as there are ops in the queue.
  • this seems harmless, as ops should be retried by the runtime anyway
  • the failure was actually injected by this change and the new feature in the test infra uncovered it
  • will try to change the test to avoid ending up in this situation
  • I'm wondering if this event should be downgraded from error.. Maybe the current change would make it too chatty

@andre4i
Copy link
Contributor Author

andre4i commented Feb 18, 2022

Actually, packages/test/test-end-to-end-tests/src/test/stashedOps.spec.ts@"doesn't resend successful chunked op" may be a legitimate concern. The behavior of the test is cryptic, as it fails because the ensuresynchronized call times out. But the reason for that is that the container does not fire the saved event -> because replayPendingStates cannot send ops -> because the container is not connected -> because it closed with 0x16f.

I believe this is because https://github.com/microsoft/FluidFramework/blob/main/packages/runtime/container-runtime/src/containerRuntime.ts#L2355:

            // Note: Chunking will increase content beyond maxOpSize because we JSON'ing JSON payload -
            // there will be a lot of escape characters that can make it up to 2x bigger!
            // This is Ok, because DeltaManager.shouldSplit() will have 2 * maxMessageSize limit
            if (!serializedContent || serializedContent.length <= maxOpSize) {
                clientSequenceNumber = this.submitRuntimeMessage(
                    type,
                    content,
                    /* batch: */ this._flushMode === FlushMode.TurnBased,
                    opMetadataInternal);
            } else {
                clientSequenceNumber = this.submitChunkedMessage(type, serializedContent, maxOpSize);
            }

as if the message is split into chunks, even if it was part of a batch, it won't be sent as a batch message from the perspective of its metadata.

@github-actions github-actions bot added area: dds Issues related to distributed data structures and removed area: tests Tests to add, test infrastructure improvements, etc labels Mar 2, 2022
@andre4i andre4i marked this pull request as ready for review March 2, 2022 18:42
@andre4i andre4i requested review from a team as code owners March 2, 2022 18:42
@andre4i
Copy link
Contributor Author

andre4i commented Mar 2, 2022

The feature gates are enabled and have been for the past few weeks. Production rollout is at 100%, so this code is safe to remove. I will do so before the release.

@@ -575,9 +576,13 @@ export class DeltaManager<TConnectionManager extends IConnectionManager>
// DeltaManager overall policy - drop all ops on disconnection and rely on
// container runtime to deal with resubmitting any ops that did not make it through.
// So drop them, but also raise error event to look into details.
this.logger.sendErrorEvent({
this.logger.sendTelemetryEvent({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think there is a value in this event at all. I.e. it's either an error (i.e. we believe it should not happen), or it's normal flow that is handled, covered by UT, but really does not matter in terms of logging.
Also reason is not package data - it comes from socket (in majority cases) and we freely log it already.

@andre4i andre4i merged commit 079349f into microsoft:main Mar 3, 2022
andre4i added a commit that referenced this pull request Mar 17, 2022
This event has already been demoted from error in #8833 as it started firing when FlushMode.TurnBased was set as the default and causing test failures due to undeclared log errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: dds Issues related to distributed data structures area: driver Driver related issues area: loader Loader related issues area: runtime Runtime related issues public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove batchmanager and default FlushMode related feature gates
3 participants