
fluid:telemetry:DeltaManager:NoJoinOp is hit in production and ODSP scalability tests #7312

Closed
Tracked by #6686
vladsud opened this issue Aug 31, 2021 · 21 comments

@vladsud
Contributor

vladsud commented Aug 31, 2021

This event is logged when a client does not receive its own join op within 45 seconds of connecting as "write".
It indicates something is going badly wrong: the user cannot receive the latest document changes, nor save their own changes back to the file. The recovery process attempts to reconnect, but if the same container keeps hitting this error, recovery is not helping and the client stays stuck in a broken state.

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:DeltaManager:ConnectionRecovery"
| summarize count() by Data_containerId

Top results:
42d1b5e1-640b-4bec-b83f-a2776bb81338 32
acb0fbd8-1a63-490e-a42f-8e656fe3cdcc 25
9a1bc224-0abc-4155-a6e5-c5a25f9ad623 20
cf369d46-3d6b-4d79-8977-0d7ef38d41b3 18
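
For orientation, here is a minimal sketch of the 45-second watchdog semantics described above (TypeScript; names like JOIN_OP_TIMEOUT_MS and startJoinOpWatchdog are hypothetical, and this is not the actual DeltaManager code):

    // Hypothetical sketch of the "no join op" watchdog.
    const JOIN_OP_TIMEOUT_MS = 45_000;

    function startJoinOpWatchdog(
        connectionMode: "read" | "write",
        onNoJoinOp: () => void,        // e.g. log NoJoinOp telemetry and trigger reconnect
    ): () => void {
        // The timer applies only to "write" connections, which must see their own join op.
        if (connectionMode !== "write") {
            return () => {};
        }
        const timer = setTimeout(onNoJoinOp, JOIN_OP_TIMEOUT_MS);
        // The caller invokes the returned function once the client's join op is processed.
        return () => clearTimeout(timer);
    }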

@vladsud vladsud added the bug Something isn't working label Aug 31, 2021
@vladsud vladsud added this to the September 2021 milestone Aug 31, 2021
@vladsud
Contributor Author

vladsud commented Aug 31, 2021

@jatgarg - I will most likely need help here, as I don't think I'll have enough time in September to look into it.
Note that in the past, most of these kinds of errors were due to the client not being able to get prior ops from storage / the ordering service, so it was running in a loop trying to fetch them, delaying processing of ops from the socket.
Most of these problems show up as errors like this:

union Office_Fluid_FluidRuntime_*
| where Data_error contains "too many retries"
| summarize count() by Data_loaderVersion

A bunch of work was done on both the client and PUSH to fix various bugs we found, and it made a huge improvement in reducing these errors, but as of now there are still 20 errors for 0.46, plus some low counts in ODSP stress tests.
The prod numbers suggest that this might be the core reason we are hitting this issue, but in ODSP stress tests we hit fluid:telemetry:DeltaManager:ConnectionRecovery substantially more frequently than "too many retries" (roughly a 50:1 ratio).
So I don't think all of these issues can be explained by the client not getting ops from the server(s).

@vladsud
Contributor Author

vladsud commented Aug 31, 2021

It's actually the top error in loaderVersion >= 0.45 once I exclude the 10K unsummarized-ops nacks & 403 errors.
So we should prioritize looking into why the client can't make progress here.

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

Here is an interesting statistic: currently there are 39K / 30K sessions on the 0.45 / 0.46 versions of the loader, respectively.
But the distribution of fluid:telemetry:DeltaManager:ConnectionRecovery events is 63 / 778!
So 0.46 hits it much more often.
I think it's because the 0.46 version of the driver advertises that it supports the get_ops flow:

        (connectMessage as any).supportedFeatures = { [feature_get_ops]: true };

And PUSH not providing IConnected.initialMessages in the response, likely creating a situation where the client never receives its own join op due to some race condition somewhere (in PUSH?) with the broadcast subscription.

Note that DeltaManager has this code that tries to remedy a similar situation for "read" connections (no ops flowing, so the client does not know how far it is behind), but there is an exclusion for "write" clients, based on the assumption that the join op should always come from the socket:

        if (initialMessages.length === 0) {
            if (checkpointSequenceNumber !== undefined) {
                // We know how far we are behind (roughly). If it's a non-zero gap, fetch ops right away.
                if (checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
                    this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
                }
            // We do not know the gap, and we will not learn about it if the socket is quiet - have to ask.
            } else if (connection.mode === "read") {
                this.fetchMissingDeltas("AfterReadConnection", this.lastQueuedSequenceNumber);
            }
        }

It's a bit hard to prove this theory (without actually making changes and deploying / running through ODSP stress tests), but anecdotally, if I change this code to always fetch ops, I no longer hit fluid:telemetry:DeltaManager:ConnectionRecovery in local (single-machine) testing, where previously I was hitting it with some frequency (so there is some statistical significance in this result, though it's hard to say whether I simply got lucky).
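
For illustration, here is a sketch of the "always fetch ops" experiment mentioned above, expressed as a tweak to the snippet quoted earlier (an experiment, not a shipped change):

        if (initialMessages.length === 0) {
            if (checkpointSequenceNumber !== undefined) {
                if (checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
                    this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
                }
            } else {
                // Experiment: drop the `connection.mode === "read"` exclusion and fetch for
                // "write" connections too, instead of relying on the join op arriving via socket.
                this.fetchMissingDeltas("AfterReadConnection", this.lastQueuedSequenceNumber);
            }
        }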

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

Not sure if there is any significance in this, but here are some results from the latest ODSP scalability runs:
Far fewer cases in testUid == "209ac8ea-fd30-400c-b046-3958639bbc10" (tonight's run); all but one container hit it only once.
The situation is worse in testUid == "0e7916dd-d314-44ed-bbb8-4dfd25daef49" (the run from two days ago), where there are substantially more hits, many taking 4 attempts (the max) to get out, though the majority still recovered after 1 attempt.

@vladsud
Contributor Author

vladsud commented Sep 3, 2021

More experiments suggest that no, it's not the cause :)

@vladsud
Contributor Author

vladsud commented Sep 6, 2021

Here is an interesting data point:

  1. Last 10 days of ODSP scalability runs:
    • 5049 out of 6237 ConnectionRecovery events had fetchReason !== undefined
    • 4781 out of 6237 had fetchReason == "AfterConnection_fetch"
  2. In production:
    • 843 out of 884 had no fetchReason

So the ODSP cases are blocked by a pending fetch from the prior connection (the "AfterConnection" reason is used only for the "read" connection path). Cases like that explain why the join op was not processed: we can't process it because there was a gap in ops, and fetching that gap was likely blocked by another long-running fetch, as we only allow one concurrent fetch at a time. It also looks strange given that we use a 30-second timeout for fetches, so this very likely points to multiple fetches timing out in a row (need to double-check that).
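
For context, a minimal sketch of the single-fetch constraint described above (hypothetical names; the actual DeltaManager implementation differs) - one long-running fetch blocks any later fetch request:

    // Hypothetical illustration: only one storage fetch in flight at a time,
    // so a slow or retrying fetch started earlier blocks later, more urgent fetches.
    class SingleFlightOpFetcher {
        private pending: Promise<void> | undefined;

        public fetchMissingDeltas(reason: string, from: number): void {
            if (this.pending !== undefined) {
                // A prior fetch is still running; this request is simply not started.
                return;
            }
            this.pending = this.fetchFromStorage(reason, from)
                .finally(() => { this.pending = undefined; });
        }

        private async fetchFromStorage(reason: string, from: number): Promise<void> {
            // Placeholder for the real storage call (with its ~30 second timeout and retries).
        }
    }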

But prod distribution looks different.

One thing to note about local runs - I run into these problems very easily because the connection is established early in container boot, and occasionally boot takes a very long time (with TreesLatest timing out and us retrying). So we hit the error quite often while DeltaManager.handler === undefined, and thus we are not yet processing any ops whatsoever. Dropping the mode in this line in Container.load() fixes this, as we then start with a "read" connection and thus never arm the timer (the timer applies only to "write" connections):

const connectionArgs: IConnectionArgs = { reason: "DocumentOpen", mode: "write", fetchOpsFromStorage: false };
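
A sketch of the change being described - omitting mode so the initial connection defaults to "read" (the IConnectionArgs shape is taken from the line above; treating "read" as the fallback when mode is omitted is my assumption here):

    // Omit `mode` so the initial connection is "read" rather than "write";
    // the NoJoinOp timer is armed only for "write" connections.
    const connectionArgs: IConnectionArgs = { reason: "DocumentOpen", fetchOpsFromStorage: false };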

In prod, Data_fetchReason == "" is hit predominantly by the "OneNoteMeetingNotes" & "Fluid Preview App" apps, even though the top apps in terms of usage are "Whiteboard" & "Microsoft Teams Web". I've checked OneNote sessions and they do not connect to the socket before boot is over, so they are no different from, say, Teams.

The good news is that both apps that hit it the most in prod are not shipping yet, so the impact on prod at the moment is pretty low.

@vladsud
Contributor Author

vladsud commented Sep 7, 2021

Note: ALL issues are gone from stress tests with the change to connectionArgs.mode mentioned above.
At least in the one run we had. I'm sure we will find more issues, but that seems to account for the majority of issues in stress tests.
Prod is still not fully understood.

@vladsud
Contributor Author

vladsud commented Sep 7, 2021

This is the main offender in prod - 921 hits:

union Office_Fluid_FluidRuntime_*
| where Data_eventName == "fluid:telemetry:DeltaManager:ConnectionRecovery"
| where Data_loaderVersion == "0.46.0"
| where Data_fetchReason == ""
| where Data_hostName == "OneNoteMeetingNotes"
| summarize count()

vladsud added a commit that referenced this issue Sep 7, 2021
Issue #7312: fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests

I've been going around this area for quite a while, both trying to better understand it and trying various recovery options.
Unfortunately, none of the recovery options tried so far really works. So I'm removing the recovery and switching to only logging more data, to keep building a better understanding of the problem.

With that, it's also worth providing a bit more info on what I've learned so far: a big part of these issues, at least in stress tests, is that we boot the container with a "write" connection, and the connection is established before the container is loaded. As a result, because the load may take a while, we do not process ops for a very long time.
I'll attempt to solve that in a separate PR, but adding the DeltaManager.handler & Container.loaded properties to the payload is super valuable, as is learning that we do sit on a ton of unprocessed ops.
@vladsud
Contributor Author

vladsud commented Sep 8, 2021

PR #7392 added the right telemetry; we need it deployed to prod to understand things better.
Given that no shipping apps are affected in big numbers, moving to October.

@vladsud
Contributor Author

vladsud commented Sep 8, 2021

Latest ODSP Scalability run has more telemetry. This one is interesting:

customEvents
| where customDimensions.testUid == "f3976e4b-b286-457d-be00-4541967e1d9f"
| where name == "fluid:telemetry:DeltaManager:NoJoinOp"
| where customDimensions.fetchReason == ""
| where customDimensions.haveHandler == true
| where toint(customDimensions.lastQueuedSequenceNumber) == toint(customDimensions.lastProcessedSequenceNumber)
| summarize count(), countif(toint(customDimensions.lastProcessedSequenceNumber) == toint(customDimensions.lastObserved))

It gives us 27 hits, with 16 cases where all sequence numbers are the same.
In all cases we have no pending ops!
In the remaining 27 − 16 = 11 cases there is a difference between lastProcessedSequenceNumber & lastObserved, and given that there are no pending ops, that points to the fact that we simply lost pending ops somewhere!

We also have fluid:telemetry:DeltaManager:ReceivedJoinOp events showing us when the client gets out of this state, and the tail is pretty bad - the median is at 1 minute!

@jatgarg
Contributor

jatgarg commented Sep 8, 2021

Interesting that we lost the pending ops somewhere. Did all these cases start appearing after we removed initialMessages from the connection details?

@vladsud
Contributor Author

vladsud commented Sep 8, 2021

So I've added a ton of telemetry locally, and what I see (on the rare occasions when I hit this case) is that it's not that we lost pending ops. lastObserved in the query above can be higher than lastProcessedSequenceNumber simply because we learned about the max sequence number through checkpointSequenceNumber on IConnected. So the client knows that there are more ops, but we never saw them (or at least we believe that's the case).
It still does not answer many questions, but my effort so far has been to confirm that we never saw these ops, or prove otherwise.
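
A sketch of the mechanism described above (hypothetical field names; the real DeltaManager bookkeeping differs): lastObserved can move ahead of anything queued or processed purely from the connection handshake, so a gap between lastObserved and lastProcessedSequenceNumber does not by itself mean ops were lost:

    // On connection, the service reports roughly where the end of the op stream is.
    function onConnected(
        checkpointSequenceNumber: number | undefined,
        state: { lastObserved: number; lastProcessedSequenceNumber: number },
    ): void {
        if (checkpointSequenceNumber !== undefined) {
            // Raises lastObserved without any op ever arriving over the socket.
            state.lastObserved = Math.max(state.lastObserved, checkpointSequenceNumber);
        }
    }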

yunho-microsoft pushed a commit to yunho-microsoft/FluidFramework that referenced this issue Sep 9, 2021
Issue microsoft#7312: fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests

vladsud added a commit that referenced this issue Sep 9, 2021
…h up and does not realize that (#7430)

Please see issue #7312 for more details.
This change does two things:
1. It adds telemetry to detect cases where we know that we are behind, but we are not actively trying to catch up. We do not fully understand these problems, but this telemetry (along with other PRs that add telemetry data points in various places) will help us better understand the problem.
2. It also engages recovery for such cases (a rough sketch follows after the list of related PRs below).

We know for sure how we can get into such cases with "read" connections. Those are normal (expected) workflows.
"write" connections are much more mysterious, as the service (PUSH for ODSP) guarantees that the join op should not be missed and should be delivered to the client quickly after connection. But telemetry shows that we may be in a position where the join op is not processed for minutes, and it's not clear at the moment whether that's a bug on the client or the server.

Semi-related PRs that together help us move in the right direction here:
PR #7428
PR #7427
PR #7429
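
As a rough illustration of item 2 above, a hedged sketch of the kind of check such recovery could perform (hypothetical names, not the code from the PR): if we know we are behind and nothing is fetching, kick off a fetch and log it.

    // Hypothetical recovery check: the stream extends past what we've queued,
    // yet no fetch is in flight - so start one and record the event.
    function maybeRecoverCatchUp(state: {
        lastKnownSeqNumber: number;         // e.g. checkpointSequenceNumber or last observed op
        lastQueuedSequenceNumber: number;
        fetchInProgress: boolean;
        fetchMissingDeltas(reason: string, from: number): void;
        logEvent(name: string, props: object): void;
    }): void {
        const behind = state.lastKnownSeqNumber - state.lastQueuedSequenceNumber;
        if (behind > 0 && !state.fetchInProgress) {
            state.logEvent("CatchUpRecovery", { behind });
            state.fetchMissingDeltas("CatchUpRecovery", state.lastQueuedSequenceNumber);
        }
    }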
@vladsud
Contributor Author

vladsud commented Sep 28, 2021

Need to reassess once PUSH addresses the existing issues with op-delivery latencies, tracked by https://onedrive.visualstudio.com/SPIN/_queries/edit/1204718/?triage=true

@vladsud vladsud modified the milestones: October 2021, November 2021 Sep 28, 2021
@vladsud
Contributor Author

vladsud commented Oct 22, 2021

Latest query to track:

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(10d)
| where Data_eventName contains "fluid:telemetry:DeltaManager:NoJoinOp"
| summarize count() by Data_loaderVersion

The set of bugs to track:

https://onedrive.visualstudio.com/SPIN/_workitems/edit/1204718/?triage=true
https://onedrive.visualstudio.com/SPIN/_queries/edit/1186588/?triage=true

#7145
#7406

@vladsud
Contributor Author

vladsud commented Oct 25, 2021

Dates (October) for SPDF / load test cluster / PROD rollout:
Bug 1186588: 12 / 20 / still rolling to prod
Bug 1204718: 5 / 5 / 20

@vladsud vladsud changed the title fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests fluid:telemetry:DeltaManager:NoJoinOp is hit in production and ODSP scalability tests Oct 31, 2021
@vladsud
Contributor Author

vladsud commented Oct 31, 2021

99% of these issues in ODSP stress tests are due to #7772.
I.e. boot is slow, and because the connection happens before boot is over, we end up waiting for boot to finish, which occasionally (due to retries and 30+ second fetch timeouts) results in this timer firing. Switching to the default "read" mode will eliminate this reporting (as it is only reported for "write" connections).

The remaining number of cases is small and has different origins. Some are due to very long fetches; some look like PUSH slowness in inserting the join op (as ops are otherwise flowing). Some are due to #7145, as unneeded storage calls block future storage calls that are needed to cover the ops gap.

@vladsud
Contributor Author

vladsud commented Oct 31, 2021

This might be the most interesting way to look at the prod data:

Office_Fluid_FluidRuntime_Error
| where Data_loaderVersion contains "0.49"
| where Event_Time > ago(5d)
| where Data_eventName contains "NoJoinOp" or Data_eventName contains "ReceivedJoinOp"
| summarize countif(Data_eventName contains "NoJoinOp"), countif(Data_eventName contains "ReceivedJoinOp"), make_set(Data_fetchReason) by Data_containerId

For example, containerId == "e62239fa-f161-41f1-af6a-610398d49169" hit it and recovered 4 times, and was not fetching ops during that time.
Similarly, containerId == "16666f71-1c22-4141-971b-b4123d3c57bb" on 0.48 hit it 18 times!
Cases where we are not fetching ops are the most interesting to drill into.

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

containerId = "71aa9dcb-0b72-49c3-970c-323e6a5829eb" (odsp stress tests) is the same as the case described in #7406 (comment of 26 days ago). Per Gary, join op has seq = 271.
It's pretty clear we processed it (at least put into DeltaManager.inbound queue).

But NoJoinOp has this payload:
lastObserved = lastQueuedSequenceNumber = 445
lastProcessedSequenceNumber = 151

So op processing is paused! Someone is screwing op processing badly!

That said, this query gives me only 8 out of 134 cases where it's the case:

customEvents
| where timestamp > ago(10d)
| where name contains "NoJoinOp"
| summarize count(), countif(toint(customDimensions.lastProcessedSequenceNumber) < toint(customDimensions.lastQueuedSequenceNumber))

The same thing in prod gives me 110 out of 212 - a much larger share:

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(10d)
| where Data_eventName contains "NoJoinOp"
| summarize count(), countif(Data_lastProcessedSequenceNumber < Data_lastQueuedSequenceNumber)
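
Reading the two queries above: the countif condition flags sessions where ops are sitting in the inbound queue but not being processed. A trivial sketch of that check, using the telemetry fields from the event payload (hypothetical helper, not production code):

    // If we queued more ops than we processed, the inbound queue holds a backlog -
    // e.g. lastQueued = 445, lastProcessed = 151 means ops 152..445 (including the
    // join op at seq 271 in the case above) are queued but stalled.
    function inboundBacklog(lastProcessedSequenceNumber: number,
                            lastQueuedSequenceNumber: number): number {
        return Math.max(0, lastQueuedSequenceNumber - lastProcessedSequenceNumber);
    }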

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

What is even more interesting is that this accounts for the large majority of interactive-client cases (105 out of 126 in prod) vs. only 5 out of 86 for the summarizer!

@vladsud
Contributor Author

vladsud commented Nov 3, 2021

Ugh, the average number of ops an interactive client is behind by is 3193! I think we have a bug somewhere with pausing that probably never resumes. That said, we know that 73 out of 127 interactive clients that ran into this case eventually recovered, so op processing was resumed eventually.

vladsud added a commit to vladsud/FluidFramework that referenced this issue Nov 3, 2021
vladsud added a commit that referenced this issue Nov 3, 2021
There are a couple of active investigations around op round-trip latency and join-op receive latency:
#7312
#7406

While not all conditions are well understood, it's pretty clear that many cases run into the issue of pending storage op fetch requests blocking op processing when they are no longer needed. I.e. the reason (translating to a range of ops) we are asking for is no longer valid / required, but the active fetch request can block future op requests, as the system (as it stands today) allows only one request at a time.

This change changes the following behavior:

1. Requests with a known end limit ("to" argument) are cancelled only when all ops in the range came in through some other path. That ensures we do not cancel (and report cancellation for) storage requests that are about to complete anyway.

2. Main change: cancel unbound storage op requests when an op comes in from the ordering service that is above the last known op (which is usually equal to the checkpointSequenceNumber the ordering service provided on connection, indicating where the end of the file is). A sketch follows below.
Potential future change to remove further bottlenecks: we might revisit the one-request-at-a-time design.
I think allowing one bound and one unbound request makes sense, or maybe 2 requests of any type - not sure.
We will monitor the data from these changes to make a further determination.
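
A hedged sketch of the main change (item 2 above); the names and shapes here are hypothetical, not the actual delta-manager code: once ops arriving from the ordering service pass the last known op, an open-ended ("unbound") storage fetch can only get in the way, so it is cancelled.

    // Hypothetical illustration of cancelling an unbound (no "to" limit) storage fetch
    // once socket ops have caught us up past the last known op.
    interface PendingFetch {
        to: number | undefined;          // undefined => unbound request
        cancel(reason: string): void;
    }

    function onOpFromOrderingService(
        opSequenceNumber: number,
        lastKnownSeqNumber: number,      // usually checkpointSequenceNumber from connection
        pendingFetch: PendingFetch | undefined,
    ): void {
        if (pendingFetch !== undefined
            && pendingFetch.to === undefined
            && opSequenceNumber > lastKnownSeqNumber) {
            // Storage cannot give us anything the socket hasn't already passed;
            // the unbound fetch only blocks future requests - cancel it.
            pendingFetch.cancel("opsReceivedFromOrderingService");
        }
    }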
tylerbutler pushed a commit to tylerbutler/FluidFramework that referenced this issue Nov 5, 2021
vladsud added a commit that referenced this issue Nov 11, 2021
…#8206)

After PR #8137, it's now easier to see what's wrong with ScheduleManager.
You can see the impact of the misbehavior described in detail in issue #7312.

The core of the problem is that ScheduleManager pauses processing when an incomplete batch comes in, even if there are plenty of ops before that incomplete batch that could be processed just fine.
What likely happens (and makes this really bad in production) is that we get multiple chunks of ops from storage, and each chunk may end up splitting some batch. Because we then always have an incomplete batch, we keep the inbound queue paused until the problem resolves itself by fetching a chunk of ops that does not break a batch.

This change also ensures that we pause the inbound queue only when we have a partial batch and no other ops.
It will not pause the queue if the inbound queue is empty, thus reducing the number of paused/resumed transitions when we are receiving non-batched ops.
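
A rough sketch of the pausing decision described above (hypothetical shapes, not the actual ScheduleManager): ops before a trailing incomplete batch remain processable, and the queue pauses only when that incomplete batch is all that is left.

    // Hypothetical sketch: decide whether the inbound queue must pause.
    // A batch is "incomplete" if its begin marker arrived but its end marker has not.
    interface QueuedOp {
        sequenceNumber: number;
        batchBegin?: boolean;
        batchEnd?: boolean;
    }

    function shouldPauseInbound(queue: QueuedOp[]): boolean {
        if (queue.length === 0) {
            return false; // nothing queued - no reason to pause
        }
        const begins = queue.map((op) => op.batchBegin === true);
        const lastBegin = begins.lastIndexOf(true);
        const endedAfter = lastBegin >= 0
            && queue.slice(lastBegin).some((op) => op.batchEnd === true);
        const trailingIncomplete = lastBegin >= 0 && !endedAfter;
        // Pause only when the incomplete batch is the only content of the queue;
        // anything queued before it can still be processed.
        return trailingIncomplete && lastBegin === 0;
    }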
@vladsud
Contributor Author

vladsud commented Nov 11, 2021

Closing this issue. Here is a summary of where we are:

  1. Most of the prod issues should be addressed by PR Fix ScheduleManager: Allow ops to be processed up to incomplete batch #8206 in 0.52
    • Most of the remaining prod issues are long fetches, i.e. a fetch (usually fetchReason == "DocumentOpen_fetch") blocking the ability to process incoming ops. There is not much we can do here. I see 81 cases over the last 10 days (on deployed 0.49 bits).
    • There is an even smaller set of cases where it seems we are waiting for the leave op from the previous session (based on the fluid:telemetry:Container:connectedStateRejected event that follows). I see 32 cases over the last 10 days (on deployed 0.49 bits). We might want to look closer at it, but in the context of that event / process - opened Too many fluid:telemetry:Container:connectedStateRejected events with source = timeout #8235 to look closer into it.
  2. Most of the ODSP scalability issues are due to, and tracked by, issue Change default connection mode from "write" to "read" #7772
    • There are some remaining cases in ODSP stress tests that are harder to follow (as NoJoinOp happens when the container is already loaded), but inspecting individual sessions shows that they are the same as above, i.e. the connection was established before the container was loaded.

@vladsud vladsud closed this as completed Nov 11, 2021
@vladsud vladsud mentioned this issue Nov 11, 2021
pleath pushed a commit that referenced this issue Nov 13, 2021
…#8206)
