fluid:telemetry:DeltaManager:NoJoinOp is hit in production and ODSP scalability tests #7312
@jatgarg - I'd most likely need help here, as I don't think I'll have enough time in September to look into it. Query: `union Office_Fluid_FluidRuntime_*`. There was a bunch of work done on both the client & PUSH to fix various bugs we found, and it made a huge improvement in reducing these errors, but as of now there are 20 errors for 0.46 and some number (low counts) in ODSP stress tests.
It's actually the top error in loaderVersion >= 0.45 if I remove the 10K unsummarized ops nacks & 403 errors.
Here is an interesting statistic: currently there are 39K / 30K sessions on the 0.45 / 0.46 versions of the loader.

With

```ts
(connectMessage as any).supportedFeatures = { [feature_get_ops]: true };
```

and PUSH not providing `IConnected.initialMessages` in the response, we likely end up in a situation where the client never receives its own join op, due to some race condition somewhere (PUSH?) with broadcast subscription.

Note that DeltaManager has this code that tries to remedy a similar situation for "read" connections (no ops flowing, so the client does not know how far it is behind), but there is an exclusion for "write" clients, as there is an assumption that the join op should always come from the socket:

```ts
if (initialMessages.length === 0) {
    if (checkpointSequenceNumber !== undefined) {
        // We know how far we are behind (roughly). If it's a non-zero gap, fetch ops right away.
        if (checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
            this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
        }
    // we do not know the gap, and we will not learn about it if the socket is quiet - have to ask.
    } else if (connection.mode === "read") {
        this.fetchMissingDeltas("AfterReadConnection", this.lastQueuedSequenceNumber);
    }
}
```

It's a bit hard to prove this theory (without actually making changes and deploying / running through ODSP stress tests), but anecdotally, if I change this code to always fetch ops, it seems that I'm not hitting fluid:telemetry:DeltaManager:ConnectionRecovery any more in local (single-machine) testing, where previously I was hitting it with some frequency (so there is some statistical significance in this result, but it's hard to say if I simply got lucky).
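For illustration, here is a minimal self-contained sketch of that "always fetch ops" experiment, modeled on the snippet above. The class, interface, and callback shapes are assumptions made for the sketch, not the actual DeltaManager implementation:

```ts
// Hypothetical, simplified model of the post-connection catch-up logic,
// modified to always ask storage for ops when the gap is unknown, instead of
// doing so only for "read" connections.
type ConnectionMode = "read" | "write";

interface IConnectionSnapshot {
    mode: ConnectionMode;                   // intentionally not consulted below
    initialMessages: unknown[];
    checkpointSequenceNumber?: number;
}

export class CatchUpSketch {
    constructor(
        private readonly lastQueuedSequenceNumber: number,
        private readonly fetchMissingDeltas: (reason: string, from: number) => void,
    ) {}

    public onConnection(connection: IConnectionSnapshot): void {
        if (connection.initialMessages.length === 0) {
            if (connection.checkpointSequenceNumber !== undefined) {
                // We know (roughly) how far behind we are; fetch only if there is a gap.
                if (connection.checkpointSequenceNumber > this.lastQueuedSequenceNumber) {
                    this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
                }
            } else {
                // Experiment: fetch for both "read" and "write" connections, instead of
                // assuming the join op will always arrive over the socket for "write".
                this.fetchMissingDeltas("AfterConnection", this.lastQueuedSequenceNumber);
            }
        }
    }
}
```

The only difference from the production snippet is that the `else` branch no longer checks `connection.mode === "read"`, which is exactly the change the local experiment made.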
Not sure if there is any significance in that, but here are some results from the latest ODSP scalability runs:
More experiments suggest that no, it's not the cause :)
Here is an interesting data point:
So the ODSP cases are blocked by a pending fetch from a prior connection (as the "AfterConnection" reason is used only for "read" connections). Cases like that explain why the join op was not processed (we can't process it because there was a gap in ops, and fetching this gap was likely blocked by another long-running fetch, as we only do one concurrent fetch at a time). But it also looks strange given that we use a 30 second timeout for fetches, so very likely that points to multiple fetches that timed out (need to double check that). But the prod distribution looks different.

One thing to note about local runs - I run into these problems very easily due to the fact that the connection is established early on container boot, and occasionally boot takes a very long time (with TreesLatest timing out and us retrying). So we hit the error quite often when DeltaManager.handler === undefined, and thus we are not yet processing any ops whatsoever. Dropping mode in this line in Container.load() fixes this, as we start with a "read" connection and thus do not hit the timer (the timer applies only to write connections):

```ts
const connectionArgs: IConnectionArgs = { reason: "DocumentOpen", mode: "write", fetchOpsFromStorage: false };
```

In prod, Data_fetchReason == "" is hit predominantly by the "OneNoteMeetingNotes" & "Fluid Preview App" apps, even though the top apps in terms of usage are "Whiteboard" & "Microsoft Teams Web". I've checked OneNote sessions and they do not connect to the socket before boot is over, so they are not different from, let's say, Teams. The good news is that both apps that hit it the most in prod are not shipping, so the impact on prod at the moment is pretty low.
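For reference, a minimal sketch of the change described above - dropping the explicit "write" mode so the container starts with a "read" connection. The interface is a simplified assumption (named differently on purpose), not the real IConnectionArgs definition:

```ts
// Simplified, assumed shape of the Container.load() connection arguments.
interface IConnectionArgsSketch {
    reason: string;
    mode?: "read" | "write";
    fetchOpsFromStorage: boolean;
}

// Before: connect as "write" right away, arming the join-op timer during a
// potentially long container boot.
const connectionArgsBefore: IConnectionArgsSketch = {
    reason: "DocumentOpen",
    mode: "write",
    fetchOpsFromStorage: false,
};

// After: omit mode so the initial connection is "read"; the NoJoinOp timer
// applies only to "write" connections, so it is not armed while the container
// is still loading.
const connectionArgsAfter: IConnectionArgsSketch = {
    reason: "DocumentOpen",
    fetchOpsFromStorage: false,
};

console.log(connectionArgsBefore, connectionArgsAfter);
```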
Note: ALL issues are gone from stress tests with the change to connectionArgs.mode mentioned above.
This is the main offender in prod - 921 hits: `union Office_Fluid_FluidRuntime_*`
Issue #7312: fluid:telemetry:DeltaManager:ConnectionRecovery is hit in production and ODSP scalability tests. I've been going around this area for quite a while, both trying to better understand it and trying various recovery options. Unfortunately, none of the recovery options tried so far really works, so I'm removing it and switching to only logging more data to keep understanding the problem better. With that, it's worth also providing a bit more info on what I've learned so far: a big part of these issues, at least in stress tests, is the fact that we boot the container with a "write" connection, and the connection is established before the container is loaded. As a result, because load may take a while, we are not processing ops for a very long time. I'll attempt to solve this in a separate PR, but adding DeltaManager.handler & Container.loaded properties to the payload is super valuable, as is learning that we do sit on a ton of unprocessed ops.
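To make the idea concrete, here is a rough sketch of enriching the error payload with container state as described above. The logger interface and property names are illustrative assumptions, not the exact shape used by the Fluid telemetry stack or PR #7392:

```ts
// Assumed minimal logger surface for the sketch.
interface ITelemetryLoggerSketch {
    sendErrorEvent(event: Record<string, string | number | boolean | undefined>): void;
}

function logNoJoinOp(
    logger: ITelemetryLoggerSketch,
    containerLoaded: boolean,
    handlerAttached: boolean,
    pendingOpCount: number,
    lastProcessedSequenceNumber: number,
): void {
    logger.sendErrorEvent({
        eventName: "fluid:telemetry:DeltaManager:NoJoinOp",
        // Was the container done loading when the timer fired?
        containerLoaded,
        // Is DeltaManager.handler set, i.e. are we processing ops at all?
        handlerAttached,
        // How many ops are sitting unprocessed in the inbound queue?
        pendingOpCount,
        lastProcessedSequenceNumber,
    });
}
```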
PR #7392 added the right telemetry; we need it to be deployed to prod to understand things better.
The latest ODSP scalability run has more telemetry. This one is interesting: `customEvents`. It gives us 27 hits, with 16 cases where all sequence numbers are the same. We also have fluid:telemetry:DeltaManager:ReceivedJoinOp events showing us when the client gets out of this state, and the tail is pretty bad - the median is at 1 min!
Interesting that we lost the pending ops somewhere. Did all these cases start coming in after we removed initialMessages from the connection details?
So I've added a ton of telemetry locally, and what I see (on the infrequent occasions when I hit this case) is that it's not that we lost pending ops. lastObserved in the query above can be higher than lastProcessedSequenceNumber simply because we learned about the max sequence number through checkpointSequenceNumber on IConnected. So the client knows that there are more ops, but we never saw them (or at least we believe that's the case).
…h up and does not realize that (#7430) Please see issue #7312 for more details. This change does two things: 1. It adds telemetry to detect cases where we know that we are behind, but we are not actively trying to catch up. We do not fully understand these problems, but this telemetry (along with other PRs that add telemetry data points in various places) will help us better understand the problem. 2. It also engages recovery for such cases. We know for sure how we can get into such cases with "read" connections; these are normal (expected) workflows. "write" connections are much more mysterious, as the service (PUSH for ODSP) guarantees that the join op should not be missed and should be delivered to the client quickly after connection. But telemetry shows that we may be in a position where the join op is not processed for minutes, and it's not clear at the moment whether that's a bug on the client or the server. Semi-related PRs that together help us move in the right direction here: PR #7428 PR #7427 PR #7429
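A hedged sketch of what such a "behind but not catching up" check could look like; the state shape, event name, and callbacks here are hypothetical illustrations, not the actual code from PR #7430:

```ts
// If we know the end of the file is ahead of what we have queued and no fetch
// is in flight, log a telemetry event and kick off recovery by fetching the gap.
interface ICatchUpState {
    lastKnownSeqNumber: number;        // e.g. checkpointSequenceNumber from connection
    lastQueuedSequenceNumber: number;  // highest op we have received so far
    fetchInFlight: boolean;
    connectionMode: "read" | "write";
}

function checkCatchUp(
    state: ICatchUpState,
    logEvent: (name: string, props: Record<string, number | string>) => void,
    fetchMissingDeltas: (reason: string, from: number) => void,
): void {
    const gap = state.lastKnownSeqNumber - state.lastQueuedSequenceNumber;
    if (gap > 0 && !state.fetchInFlight) {
        // Telemetry: we believe we are behind, yet nothing is trying to close the gap.
        logEvent("NoCatchUp", { gap, connectionMode: state.connectionMode });
        // Recovery: explicitly ask storage for the missing range.
        fetchMissingDeltas("CatchUpMonitor", state.lastQueuedSequenceNumber);
    }
}
```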
Need to reassess once PUSH addresses the existing issues with op delivery latencies tracked by https://onedrive.visualstudio.com/SPIN/_queries/edit/1204718/?triage=true
Latest query to track: `union Office_Fluid_FluidRuntime_*`
The set of bugs to track: https://onedrive.visualstudio.com/SPIN/_workitems/edit/1204718/?triage=true
Dates (October) for SPDF / load test cluster / PROD rollout:
99% of these issues in ODSP stress tests are due to #7772. The remaining number of cases is small and has different origins. Some are due to very long fetches, some look like PUSH slowness at inserting the join op (as ops otherwise are flowing), and some are due to #7145, as unneeded storage calls block future storage calls that are needed to cover the ops gap.
This might be the most interesting way to look at the prod data: `Office_Fluid_FluidRuntime_Error`. For example, containerId == "e62239fa-f161-41f1-af6a-610398d49169" hit and recovered 4 times and was not fetching ops during this time.
containerId = "71aa9dcb-0b72-49c3-970c-323e6a5829eb" (odsp stress tests) is the same as the case described in #7406 (comment of 26 days ago). Per Gary, join op has seq = 271. But NoJoinOp has this payload: So op processing is paused! Someone is screwing op processing badly! That said, this query gives me only 8 out of 134 cases where it's the case:
The same thing in prod gives me 110 out of 212 - a much larger share:
What is even more interesting is that it's the cause in almost 90% of cases for interactive clients (105 out of 126 in prod) vs. the summarizer (5 out of 86)!
Ugh, the average number of ops we are behind for interactive clients is 3193! I think we have a bug somewhere with pausing that probably never resumes. That said, we know that 73 out of 127 interactive clients that ran into this case actually recovered, so op processing was eventually resumed.
See microsoft#7312 for more details
There are a couple of active investigations around op roundtrip latency and join op receiving latency: #7312 #7406. While not all conditions are well understood, it's pretty clear that many cases run into the issue of pending storage op fetch requests blocking op processing where they are not needed. I.e. the reason (translating to a range of ops) we are asking for is no longer valid / required, but the active fetch request might block future op requests, as the system (as it stands today) allows only one request at a time. This change changes the following behavior: 1. Cancellation of requests with a known end limit (`to` argument): they are cancelled only when all ops in the range have come in not through that request. That ensures we do not cancel (and report cancellation for) storage requests that are about to complete either way. 2. Main change: cancel unbound storage op requests when an op comes in from the ordering service that is above the last known op (which is usually equal to the checkpointSequenceNumber the ordering service provided on connection, indicating where the end of the file is). Potential future change to remove further bottlenecks: we might revisit the design of one request at a time. I think allowing one bound and one unbound request makes sense, or maybe 2 requests of any type, not sure. We will monitor data based on these results to make a further determination.
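To illustrate the two behavior changes, here is a sketch of the cancellation policy under the stated single-in-flight-request assumption; the request type and function are hypothetical, not the actual DeltaManager / storage code:

```ts
// One storage op fetch may be in flight at a time (the current design).
interface IOpFetchRequest {
    from: number;            // first sequence number requested
    to?: number;             // exclusive upper bound; undefined => unbound request
    cancel(reason: string): void;
}

function maybeCancelFetch(
    inFlight: IOpFetchRequest | undefined,
    lastQueuedSequenceNumber: number,  // highest op received so far
    lastKnownSequenceNumber: number,   // end of file as reported by the ordering service
): void {
    if (inFlight === undefined) {
        return;
    }
    if (inFlight.to !== undefined) {
        // Bound request: cancel only once every op in [from, to) has arrived through
        // other channels, so we never cancel a request that is about to complete
        // usefully anyway.
        if (lastQueuedSequenceNumber >= inFlight.to - 1) {
            inFlight.cancel("all ops in range received from ordering service");
        }
    } else {
        // Unbound request: cancel as soon as an op above the last known op (the end
        // of file at connection time) arrives from the ordering service - the gap
        // the request was meant to cover no longer exists.
        if (lastQueuedSequenceNumber >= lastKnownSequenceNumber) {
            inFlight.cancel("caught up past known end of file");
        }
    }
}
```

The "one bound plus one unbound request" idea mentioned above would relax the `inFlight` singleton into two slots, one per request kind.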
…#8206) After PR #8137, it's now easier to see what's wrong with ScheduleManager. You can see the impact of the wrongdoing described in detail in issue #7312. The core of the problem is that ScheduleManager will pause processing on an incomplete batch coming in, even if we have plenty of ops before that incomplete batch that could be processed just fine. What likely happens (and makes it really bad in production) is that we can get multiple chunks of ops from storage, but each chunk might end up splitting some batch. Because we always have an incomplete batch, we keep the inbound queue in a paused state until the problem resolves by fetching a chunk of ops that does not break some batch. This change also ensures that we pause the inbound queue only when we have a partial batch and no other ops. It will not pause the queue if the inbound queue is empty, thus reducing the number of paused/resumed transitions if we are getting non-batched ops.
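A small sketch of the pause condition described in this change; the queue shape and per-op batch flags below are simplified assumptions, not the real ScheduleManager data model:

```ts
// Pause the inbound queue only when the *only* thing left in it is a trailing
// partial batch; complete batches and non-batched ops can be processed.
interface IQueuedOp {
    sequenceNumber: number;
    batchStart?: boolean;   // first op of a batch
    batchEnd?: boolean;     // last op of a batch
}

function shouldPauseInbound(queue: readonly IQueuedOp[]): boolean {
    if (queue.length === 0) {
        // Nothing queued: pausing here would only add needless pause/resume churn.
        return false;
    }
    // Walk backwards to find whether the queue ends in an incomplete batch.
    let lastBatchStart = -1;
    for (let i = queue.length - 1; i >= 0; i--) {
        if (queue[i].batchEnd) {
            // The queue ends at a batch boundary; everything is processable.
            return false;
        }
        if (queue[i].batchStart) {
            lastBatchStart = i;
            break;
        }
    }
    // Pause only if the partial batch is all we have; otherwise let the ops
    // before it be processed first.
    return lastBatchStart === 0;
}
```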
Closing this issue. Here is the summary of where we are:
This event is logged when a client does not receive its own join op for 45 seconds after connecting as "write".
It indicates something going really badly: the user is not able to receive the latest document changes, nor to save their own changes back to the file. The recovery process attempts to reconnect, but if the same container keeps hitting this error, that means recovery does not help in any way and the client stays in a broken state.
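A minimal sketch of the watchdog behavior described above, assuming a simple timer-based implementation; the class and callback names are illustrative, not the actual DeltaManager code:

```ts
// Armed on a "write" connection, cleared when the client processes its own
// join op, and firing the error event (plus reconnect-based recovery) after
// 45 seconds otherwise.
class NoJoinOpWatchdogSketch {
    private timer: ReturnType<typeof setTimeout> | undefined;

    constructor(
        private readonly logError: (eventName: string) => void,
        private readonly triggerReconnect: () => void,
        private readonly timeoutMs = 45_000,
    ) {}

    public onConnected(mode: "read" | "write"): void {
        this.clear();
        if (mode === "write") {
            this.timer = setTimeout(() => {
                this.logError("fluid:telemetry:DeltaManager:NoJoinOp");
                this.triggerReconnect();
            }, this.timeoutMs);
        }
    }

    public onOwnJoinOpProcessed(): void {
        this.clear();
    }

    public clear(): void {
        if (this.timer !== undefined) {
            clearTimeout(this.timer);
            this.timer = undefined;
        }
    }
}
```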
```
union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:DeltaManager:ConnectionRecovery"
| summarize count() by Data_containerId
```
Top results:
42d1b5e1-640b-4bec-b83f-a2776bb81338 32
acb0fbd8-1a63-490e-a42f-8e656fe3cdcc 25
9a1bc224-0abc-4155-a6e5-c5a25f9ad623 20
cf369d46-3d6b-4d79-8977-0d7ef38d41b3 18