
fluid:telemetry:DeltaManager:NoJoinOp is hit in production and ODSP scalability tests #7312

Closed
Tracked by #6686
vladsud opened this issue Aug 31, 2021 · 21 comments

@vladsud
Contributor

vladsud commented Aug 31, 2021

This event is logged when a client does not receive its own join op within 45 seconds of connecting as "write".
It indicates something is going badly wrong: the user cannot receive the latest document changes, nor save their own changes back to the file. The recovery process attempts to reconnect, but if the same container keeps hitting this error, recovery is not helping and the client stays stuck in a broken state.

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:DeltaManager:ConnectionRecovery"
| summarize count() by Data_containerId

Top results:
42d1b5e1-640b-4bec-b83f-a2776bb81338 32
acb0fbd8-1a63-490e-a42f-8e656fe3cdcc 25
9a1bc224-0abc-4155-a6e5-c5a25f9ad623 20
cf369d46-3d6b-4d79-8977-0d7ef38d41b3 18
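
For orientation, here is a minimal sketch of the 45-second watchdog semantics described above (TypeScript; names like JOIN_OP_TIMEOUT_MS and startJoinOpWatchdog are hypothetical, and this is not the actual DeltaManager code):

    // Hypothetical sketch of the "no join op" watchdog.
    const JOIN_OP_TIMEOUT_MS = 45_000;

    function startJoinOpWatchdog(
        connectionMode: "read" | "write",
        onNoJoinOp: () => void,        // e.g. log NoJoinOp telemetry and trigger reconnect
    ): () => void {
        // The timer applies only to "write" connections, which must see their own join op.
        if (connectionMode !== "write") {
            return () => {};
        }
        const timer = setTimeout(onNoJoinOp, JOIN_OP_TIMEOUT_MS);
        // The caller invokes the returned function once the client's join op is processed.
        return () => clearTimeout(timer);
    }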

@vladsud vladsud added the bug Something isn't working label Aug 31, 2021
@vladsud vladsud added this to the September 2021 milestone Aug 31, 2021
@vladsud
Contributor Author

vladsud commented Aug 31, 2021

@jatgarg - I will most likely need help here, as I don't think I'll have enough time in September to look into it.
Note that in the past, most of these kinds of errors were due to the client not being able to get prior ops from storage / the ordering service, so it was running in a loop trying to fetch them, delaying processing of ops from the socket.
Most of these problems show up as errors like this:

union Office_Fluid_FluidRuntime_*
| where Data_error contains "too many retries"
| summarize count() by Data_loaderVersion

A bunch of work was done on both the client and PUSH to fix various bugs we found, and it made a huge improvement in reducing these errors, but as of now there are still 20 errors for 0.46, plus some low counts in ODSP stress tests.
The prod numbers suggest that this might be the core reason we are hitting this issue, but in ODSP stress tests we hit fluid:telemetry:DeltaManager:ConnectionRecovery substantially more frequently than "too many retries" (roughly a 50:1 ratio).
So I don't think all of these issues can be explained by the client not getting ops from the server(s).

@vladsud
Contributor Author

vladsud commented Aug 31, 2021

It's actually the top error in loaderVersion >= 0.45 once I exclude the 10K unsummarized-ops nacks & 403 errors.
So we should prioritize looking into why the client can't make progress here.

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

Here is an interesting statistic: currently there are 39K / 30K sessions on the 0.45 / 0.46 versions of the loader, respectively.
But the distribution of fluid:telemetry:DeltaManager:ConnectionRecovery events is 63 / 778!
So 0.46 hits it much more often.
I think it's because the 0.46 version of the driver advertises that it supports the get_ops flow:

        (connectMessage as any).supportedFeatures = { [feature_get_ops]: true };

And PUSH not providing IConnected.initialMessages in the response, likely creating a situation where the client never receives its own join op due to some race condition somewhere (in PUSH?) with the broadcast subscription.

Note that DeltaManager has this code that tries to remedy a similar situation for "read" connections (no ops flowing, so the client does not know how far it is behind), but there is an exclusion for "write" clients, based on the assumption that the join op should always come from the socket:

        if (initialMessages.length === 0) {
            if (checkpointSequenceNumber !== undefined) {
                // We know how far we are behind (roughly). If it's a non-zero gap, fetch ops right away.
                if (checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
                    this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
                }
            // We do not know the gap, and we will not learn about it if the socket is quiet - have to ask.
            } else if (connection.mode === "read") {
                this.fetchMissingDeltas("AfterReadConnection", this.lastQueuedSequenceNumber);
            }
        }

It's a bit hard to prove this theory (without actually making changes and deploying / running through ODSP stress tests), but anecdotally, if I change this code to always fetch ops, I no longer hit fluid:telemetry:DeltaManager:ConnectionRecovery in local (single-machine) testing, where previously I was hitting it with some frequency (so there is some statistical significance in this result, though it's hard to say whether I simply got lucky).
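
For illustration, here is a sketch of the "always fetch ops" experiment mentioned above, expressed as a tweak to the snippet quoted earlier (an experiment, not a shipped change):

        if (initialMessages.length === 0) {
            if (checkpointSequenceNumber !== undefined) {
                if (checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
                    this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
                }
            } else {
                // Experiment: drop the `connection.mode === "read"` exclusion and fetch for
                // "write" connections too, instead of relying on the join op arriving via socket.
                this.fetchMissingDeltas("AfterReadConnection", this.lastQueuedSequenceNumber);
            }
        }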

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

Not sure if there is any significance in this, but here are some results from the latest ODSP scalability runs:
Far fewer cases in testUid == "209ac8ea-fd30-400c-b046-3958639bbc10" (tonight's run); all but one container hit it only once.
The situation is worse in testUid == "0e7916dd-d314-44ed-bbb8-4dfd25daef49" (the run from two days ago), where there are substantially more hits, many taking 4 attempts (the max) to get out, though the majority still recovered after 1 attempt.

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

More experiments suggest that no, it's not the cause :)

@vladsud
Contributor Author

vladsud commented Sep 6, 2021

Here is an interesting data point:

  1. Last 10 days of ODSP scalability runs:
    • 5049 out of 6237 ConnectionRecovery events had fetchReason !== undefined
    • 4781 out of 6237 had fetchReason == "AfterConnection_fetch"
  2. In production:
    • 843 out of 884 had no fetchReason

So the ODSP cases are blocked by a pending fetch from the prior connection (the "AfterConnection" reason is used only for the "read" connection path). Cases like that explain why the join op was not processed: we can't process it because there was a gap in ops, and fetching that gap was likely blocked by another long-running fetch, as we only allow one concurrent fetch at a time. It also looks strange given that we use a 30-second timeout for fetches, so this very likely points to multiple fetches timing out in a row (need to double-check that).
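
For context, a minimal sketch of the single-fetch constraint described above (hypothetical names; the actual DeltaManager implementation differs) - one long-running fetch blocks any later fetch request:

    // Hypothetical illustration: only one storage fetch in flight at a time,
    // so a slow or retrying fetch started earlier blocks later, more urgent fetches.
    class SingleFlightOpFetcher {
        private pending: Promise<void> | undefined;

        public fetchMissingDeltas(reason: string, from: number): void {
            if (this.pending !== undefined) {
                // A prior fetch is still running; this request is simply not started.
                return;
            }
            this.pending = this.fetchFromStorage(reason, from)
                .finally(() => { this.pending = undefined; });
        }

        private async fetchFromStorage(reason: string, from: number): Promise<void> {
            // Placeholder for the real storage call (with its ~30 second timeout and retries).
        }
    }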

But prod distribution looks different.

One thing to note about local runs - I run into these problems very easily because the connection is established early in container boot, and occasionally boot takes a very long time (with TreesLatest timing out and us retrying). So we hit the error quite often while DeltaManager.handler === undefined, and thus we are not yet processing any ops whatsoever. Dropping the mode in this line in Container.load() fixes this, as we then start with a "read" connection and thus never arm the timer (the timer applies only to "write" connections):

const connectionArgs: IConnectionArgs = { reason: "DocumentOpen", mode: "write", fetchOpsFromStorage: false };
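
A sketch of the change being described - omitting mode so the initial connection defaults to "read" (the IConnectionArgs shape is taken from the line above; treating "read" as the fallback when mode is omitted is my assumption here):

    // Omit `mode` so the initial connection is "read" rather than "write";
    // the NoJoinOp timer is armed only for "write" connections.
    const connectionArgs: IConnectionArgs = { reason: "DocumentOpen", fetchOpsFromStorage: false };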

In prod, Data_fetchReason == "" is hit predominantly by the "OneNoteMeetingNotes" & "Fluid Preview App" apps, even though the top apps in terms of usage are "Whiteboard" & "Microsoft Teams Web". I've checked OneNote sessions and they do not connect to the socket before boot is over, so they are no different from, say, Teams.

The good news is that both apps that hit it the most in prod are not shipping yet, so the impact on prod at the moment is pretty low.

@vladsud
Contributor Author

vladsud commented Sep 7, 2021

Note: ALL issues are gone from stress tests with the change to connectionArgs.mode mentioned above.
At least in the one run we had. I'm sure we will find more issues, but that seems to account for the majority of issues in stress tests.
Prod is still not fully understood.

@vladsud
Contributor Author

vladsud commented Sep 7, 2021

This is the main offender in prod - 921 hits:

union Office_Fluid_FluidRuntime_*
| where Data_eventName == "fluid:telemetry:DeltaManager:ConnectionRecovery"
| where Data_loaderVersion == "0.46.0"
| where Data_fetchReason == ""
| where Data_hostName == "OneNoteMeetingNotes"
| summarize count()

vladsud added a commit that referenced this issue Sep 7, 2021
Issue #7312: fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests

I've been going around this area for quite a while, both trying to better understand it and trying various recovery options.
Unfortunately, none of the recovery options tried so far really works. So I'm removing the recovery and switching to only logging more data, to keep building a better understanding of the problem.

With that, it's also worth providing a bit more info on what I've learned so far: a big part of these issues, at least in stress tests, is that we boot the container with a "write" connection, and the connection is established before the container is loaded. As a result, because the load may take a while, we do not process ops for a very long time.
I'll attempt to solve that in a separate PR, but adding the DeltaManager.handler & Container.loaded properties to the payload is super valuable, as is learning that we do sit on a ton of unprocessed ops.
@vladsud
Contributor Author

vladsud commented Sep 8, 2021

PR #7392 added the right telemetry; we need it deployed to prod to understand things better.
Given that no shipping apps are affected in big numbers, moving to October.

@vladsud
Contributor Author

vladsud commented Sep 8, 2021

Latest ODSP Scalability run has more telemetry. This one is interesting:

customEvents
| where customDimensions.testUid == "f3976e4b-b286-457d-be00-4541967e1d9f"
| where name == "fluid:telemetry:DeltaManager:NoJoinOp"
| where customDimensions.fetchReason == ""
| where customDimensions.haveHandler == true
| where toint(customDimensions.lastQueuedSequenceNumber) == toint(customDimensions.lastProcessedSequenceNumber)
| summarize count(), countif(toint(customDimensions.lastProcessedSequenceNumber) == toint(customDimensions.lastObserved))

It gives us 27 hits, with 16 cases where all sequence numbers are the same.
In all cases we have no pending ops!
In the remaining 27 − 16 = 11 cases there is a difference between lastProcessedSequenceNumber & lastObserved, and given that there are no pending ops, that points to the fact that we simply lost pending ops somewhere!

We also have fluid:telemetry:DeltaManager:ReceivedJoinOp events showing us when the client gets out of this state, and the tail is pretty bad - the median is at 1 minute!

@jatgarg
Contributor

jatgarg commented Sep 8, 2021

Interesting that we lost the pending ops somewhere. Did all these cases start appearing after we removed initialMessages from the connection details?

@vladsud
Contributor Author

vladsud commented Sep 8, 2021

So I've added a ton of telemetry locally, and what I see (on the rare occasions when I hit this case) is that it's not that we lost pending ops. lastObserved in the query above can be higher than lastProcessedSequenceNumber simply because we learned about the max sequence number through checkpointSequenceNumber on IConnected. So the client knows that there are more ops, but we never saw them (or at least we believe that's the case).
It still does not answer many questions, but my effort so far has been to confirm that we never saw these ops, or prove otherwise.
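
A sketch of the mechanism described above (hypothetical field names; the real DeltaManager bookkeeping differs): lastObserved can move ahead of anything queued or processed purely from the connection handshake, so a gap between lastObserved and lastProcessedSequenceNumber does not by itself mean ops were lost:

    // On connection, the service reports roughly where the end of the op stream is.
    function onConnected(
        checkpointSequenceNumber: number | undefined,
        state: { lastObserved: number; lastProcessedSequenceNumber: number },
    ): void {
        if (checkpointSequenceNumber !== undefined) {
            // Raises lastObserved without any op ever arriving over the socket.
            state.lastObserved = Math.max(state.lastObserved, checkpointSequenceNumber);
        }
    }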

yunho-microsoft pushed a commit to yunho-microsoft/FluidFramework that referenced this issue Sep 9, 2021
Issue microsoft#7312: fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests

vladsud added a commit that referenced this issue Sep 9, 2021
…h up and does not realize that (#7430)

Please see issue #7312 for more details.
This change does two things:
1. It adds telemetry to detect cases where we know that we are behind, but we are not actively trying to catch up. We do not fully understand these problems, but this telemetry (along with other PRs that add telemetry data points in various places) will help us better understand the problem.
2. It also engages recovery for such cases (a rough sketch follows after the list of related PRs below).

We know for sure how we can get into such cases with "read" connections. Those are normal (expected) workflows.
"write" connections are much more mysterious, as the service (PUSH for ODSP) guarantees that the join op should not be missed and should be delivered to the client quickly after connection. But telemetry shows that we may be in a position where the join op is not processed for minutes, and it's not clear at the moment whether that's a bug on the client or the server.

Semi-related PRs that together help us move in the right direction here:
PR #7428
PR #7427
PR #7429
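
As a rough illustration of item 2 above, a hedged sketch of the kind of check such recovery could perform (hypothetical names, not the code from the PR): if we know we are behind and nothing is fetching, kick off a fetch and log it.

    // Hypothetical recovery check: the stream extends past what we've queued,
    // yet no fetch is in flight - so start one and record the event.
    function maybeRecoverCatchUp(state: {
        lastKnownSeqNumber: number;         // e.g. checkpointSequenceNumber or last observed op
        lastQueuedSequenceNumber: number;
        fetchInProgress: boolean;
        fetchMissingDeltas(reason: string, from: number): void;
        logEvent(name: string, props: object): void;
    }): void {
        const behind = state.lastKnownSeqNumber - state.lastQueuedSequenceNumber;
        if (behind > 0 && !state.fetchInProgress) {
            state.logEvent("CatchUpRecovery", { behind });
            state.fetchMissingDeltas("CatchUpRecovery", state.lastQueuedSequenceNumber);
        }
    }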
@vladsud
Contributor Author

vladsud commented Sep 28, 2021

Need to reassess once PUSH addresses the existing issues with op-delivery latencies, tracked by https://onedrive.visualstudio.com/SPIN/_queries/edit/1204718/?triage=true

@vladsud vladsud modified the milestones: October 2021, November 2021 Sep 28, 2021
@vladsud
Contributor Author

vladsud commented Oct 22, 2021

Latest query to track:

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(10d)
| where Data_eventName contains "fluid:telemetry:DeltaManager:NoJoinOp"
| summarize count() by Data_loaderVersion

The set of bugs to track:

https://onedrive.visualstudio.com/SPIN/_workitems/edit/1204718/?triage=true
https://onedrive.visualstudio.com/SPIN/_queries/edit/1186588/?triage=true

#7145
#7406

@vladsud
Contributor Author

vladsud commented Oct 25, 2021

Dates (October) for SPDF / load test cluster / PROD rollout:
Bug 1186588: 12 / 20 / still rolling to prod
Bug 1204718: 5 / 5 / 20

@vladsud vladsud changed the title fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests fluid:telemetry:DeltaManager:NoJoinOp is hit in production and ODSP scalability tests Oct 31, 2021
@vladsud
Contributor Author

vladsud commented Oct 31, 2021

99% of these issues in ODSP stress tests are due to #7772.
I.e. boot is slow, and because the connection happens before boot is over, we end up waiting for boot to finish, which occasionally (due to retries and 30+ second fetch timeouts) results in this timer firing. Switching to the default "read" mode will eliminate this reporting (as it is only reported for "write" connections).

The remaining number of cases is small and has different origins. Some are due to very long fetches; some look like PUSH slowness in inserting the join op (as ops are otherwise flowing). Some are due to #7145, as unneeded storage calls block future storage calls that are needed to cover the ops gap.

@vladsud
Contributor Author

vladsud commented Oct 31, 2021

This might be the most interesting way to look at the prod data:

Office_Fluid_FluidRuntime_Error
| where Data_loaderVersion contains "0.49"
| where Event_Time > ago(5d)
| where Data_eventName contains "NoJoinOp" or Data_eventName contains "ReceivedJoinOp"
| summarize countif(Data_eventName contains "NoJoinOp"), countif(Data_eventName contains "ReceivedJoinOp"), make_set(Data_fetchReason) by Data_containerId

For example, containerId == "e62239fa-f161-41f1-af6a-610398d49169" hit it and recovered 4 times, and was not fetching ops during that time.
Similarly, containerId == "16666f71-1c22-4141-971b-b4123d3c57bb" on 0.48 hit it 18 times!
Cases where we are not fetching ops are the most interesting to drill into.

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

containerId = "71aa9dcb-0b72-49c3-970c-323e6a5829eb" (odsp stress tests) is the same as the case described in #7406 (comment of 26 days ago). Per Gary, join op has seq = 271.
It's pretty clear we processed it (at least put into DeltaManager.inbound queue).

But NoJoinOp has this payload:
lastObserved = lastQueuedSequenceNumber = 445
lastProcessedSequenceNumber = 151

So op processing is paused! Someone is screwing op processing badly!

That said, this query gives me only 8 out of 134 cases where it's the case:

customEvents
| where timestamp > ago(10d)
| where name contains "NoJoinOp"
| summarize count(), countif(toint(customDimensions.lastProcessedSequenceNumber) < toint(customDimensions.lastQueuedSequenceNumber))

The same thing in prod gives me 110 out of 212 - a much larger share:

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(10d)
| where Data_eventName contains "NoJoinOp"
| summarize count(), countif(Data_lastProcessedSequenceNumber < Data_lastQueuedSequenceNumber)
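
Reading the two queries above: the countif condition flags sessions where ops are sitting in the inbound queue but not being processed. A trivial sketch of that check, using the telemetry fields from the event payload (hypothetical helper, not production code):

    // If we queued more ops than we processed, the inbound queue holds a backlog -
    // e.g. lastQueued = 445, lastProcessed = 151 means ops 152..445 (including the
    // join op at seq 271 in the case above) are queued but stalled.
    function inboundBacklog(lastProcessedSequenceNumber: number,
                            lastQueuedSequenceNumber: number): number {
        return Math.max(0, lastQueuedSequenceNumber - lastProcessedSequenceNumber);
    }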

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

What is even more interesting is that this accounts for the large majority of interactive-client cases (105 out of 126 in prod) vs. only 5 out of 86 for the summarizer!

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

Ugh, the average number of ops an interactive client is behind by is 3193! I think we have a bug somewhere with pausing that probably never resumes. That said, we know that 73 out of 127 interactive clients that ran into this case eventually recovered, so op processing was resumed eventually.

vladsud added a commit to vladsud/FluidFramework that referenced this issue Nov 3, 2021
vladsud added a commit that referenced this issue Nov 3, 2021
There are a couple of active investigations around op round-trip latency and join-op receive latency:
#7312
#7406

While not all conditions are well understood, it's pretty clear that many cases run into the issue of pending storage op fetch requests blocking op processing when they are no longer needed. I.e. the reason (translating to a range of ops) we are asking for is no longer valid / required, but the active fetch request can block future op requests, as the system (as it stands today) allows only one request at a time.

This change changes the following behavior:

1. Requests with a known end limit ("to" argument) are cancelled only when all ops in the range came in through some other path. That ensures we do not cancel (and report cancellation for) storage requests that are about to complete anyway.

2. Main change: cancel unbound storage op requests when an op comes in from the ordering service that is above the last known op (which is usually equal to the checkpointSequenceNumber the ordering service provided on connection, indicating where the end of the file is). A sketch follows below.
Potential future change to remove further bottlenecks: we might revisit the one-request-at-a-time design.
I think allowing one bound and one unbound request makes sense, or maybe 2 requests of any type - not sure.
We will monitor the data from these changes to make a further determination.
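
A hedged sketch of the main change (item 2 above); the names and shapes here are hypothetical, not the actual delta-manager code: once ops arriving from the ordering service pass the last known op, an open-ended ("unbound") storage fetch can only get in the way, so it is cancelled.

    // Hypothetical illustration of cancelling an unbound (no "to" limit) storage fetch
    // once socket ops have caught us up past the last known op.
    interface PendingFetch {
        to: number | undefined;          // undefined => unbound request
        cancel(reason: string): void;
    }

    function onOpFromOrderingService(
        opSequenceNumber: number,
        lastKnownSeqNumber: number,      // usually checkpointSequenceNumber from connection
        pendingFetch: PendingFetch | undefined,
    ): void {
        if (pendingFetch !== undefined
            && pendingFetch.to === undefined
            && opSequenceNumber > lastKnownSeqNumber) {
            // Storage cannot give us anything the socket hasn't already passed;
            // the unbound fetch only blocks future requests - cancel it.
            pendingFetch.cancel("opsReceivedFromOrderingService");
        }
    }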
tylerbutler pushed a commit to tylerbutler/FluidFramework that referenced this issue Nov 5, 2021
vladsud added a commit that referenced this issue Nov 11, 2021
…#8206)

After PR #8137, it's now easier to see what's wrong with ScheduleManager.
You can see the impact of the misbehavior described in detail in issue #7312.

The core of the problem is that ScheduleManager pauses processing when an incomplete batch comes in, even if there are plenty of ops before that incomplete batch that could be processed just fine.
What likely happens (and makes this really bad in production) is that we get multiple chunks of ops from storage, and each chunk may end up splitting some batch. Because we then always have an incomplete batch, we keep the inbound queue paused until the problem resolves itself by fetching a chunk of ops that does not break a batch.

This change also ensures that we pause the inbound queue only when we have a partial batch and no other ops.
It will not pause the queue if the inbound queue is empty, thus reducing the number of paused/resumed transitions when we are receiving non-batched ops.
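
A rough sketch of the pausing decision described above (hypothetical shapes, not the actual ScheduleManager): ops before a trailing incomplete batch remain processable, and the queue pauses only when that incomplete batch is all that is left.

    // Hypothetical sketch: decide whether the inbound queue must pause.
    // A batch is "incomplete" if its begin marker arrived but its end marker has not.
    interface QueuedOp {
        sequenceNumber: number;
        batchBegin?: boolean;
        batchEnd?: boolean;
    }

    function shouldPauseInbound(queue: QueuedOp[]): boolean {
        if (queue.length === 0) {
            return false; // nothing queued - no reason to pause
        }
        const begins = queue.map((op) => op.batchBegin === true);
        const lastBegin = begins.lastIndexOf(true);
        const endedAfter = lastBegin >= 0
            && queue.slice(lastBegin).some((op) => op.batchEnd === true);
        const trailingIncomplete = lastBegin >= 0 && !endedAfter;
        // Pause only when the incomplete batch is the only content of the queue;
        // anything queued before it can still be processed.
        return trailingIncomplete && lastBegin === 0;
    }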
@vladsud
Contributor Author

vladsud commented Nov 11, 2021

Closing this issue. Here is a summary of where we are:

  1. Most of the prod issues should be addressed by PR Fix ScheduleManager: Allow ops to be processed up to incomplete batch #8206 in 0.52
    • Most of the remaining prod issues are long fetches, i.e. a fetch (usually fetchReason == "DocumentOpen_fetch") blocking the ability to process incoming ops. There is not much we can do here. I see 81 cases over the last 10 days (on deployed 0.49 bits).
    • There is an even smaller set of cases where it seems we are waiting for the leave op from the previous session (based on the fluid:telemetry:Container:connectedStateRejected event that follows). I see 32 cases over the last 10 days (on deployed 0.49 bits). We might want to look closer at it, but in the context of that event / process - opened Too many fluid:telemetry:Container:connectedStateRejected events with source = timeout #8235 to look closer into it.
  2. Most of the ODSP scalability issues are due to, and tracked by, issue Change default connection mode from "write" to "read" #7772
    • There are some remaining cases in ODSP stress tests that are harder to follow (as NoJoinOp happens when the container is already loaded), but inspecting individual sessions shows that they are the same as above, i.e. the connection was established before the container was loaded.

@vladsud vladsud closed this as completed Nov 11, 2021
@vladsud vladsud mentioned this issue Nov 11, 2021
pleath pushed a commit that referenced this issue Nov 13, 2021
…#8206)
