Better telemetry for fetching ops (#6947)
Problem statement: The newly added NoJoinOp telemetry event points to a condition where ops are not processed for a very long time. Telemetry shows that all such cases have one thing in common - there is an outstanding ops request to the service that takes a long time. In almost all of these cases the actual network request (as indicated by the OpsFetch event) is relatively short, but the overall process (GetDeltas_end) takes a long time, occasionally minutes.

I believe that in all these cases the ops never get to storage (in reasonable time), but in the majority of cases the client actually receives the missing ops through the websocket (though not in all cases, read on). DeltaManager does cancel the request in such a case (see the ExtraStorageCall event), but the request is not cancelled immediately, which blocks future requests (see fetchMissingDeltasCore - it allows only one outstanding call). As a result, the whole process does not move forward for a long time.

I do not have an in-depth understanding of where we get stuck in the process, but there are two candidates:
- waitForConnectedState() is one obvious case - it's possible that the browser lies to us or does not react quickly to online/offline changes, which may cause the process to get stuck for up to 30 seconds.
- The other, more likely reason is 429s returned from SPO when fetching ops. We do not have logging for individual retryable attempts, so this goes unnoticed today.

Fix:
1. Make the op fetching process return immediately on cancellation by listening for the cancellation event.
2. Add telemetry for some sub-processes, like fetching ops from cache, if they take longer than 1 second.
3. Remove the ExtraStorageCall event, as it fires on all successful fetches, and make the core op fetching logic raise a GetDeltas_cancel event instead if the cancel was processed before all ops were fetched.
4. Add telemetry (logNetworkFailure in getSingleOpBatch) for individual failed fetches, so that we get insight into things like 429s that may block the fetching process (but are currently not visible in telemetry).
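Fix item 1 can be pictured with a minimal sketch. This is not the code from this commit: the `raceWithCancel` helper and the `FetchResult` type are hypothetical names, and the sketch assumes cancellation is delivered through a standard `AbortSignal`. The idea is only that the caller stops waiting as soon the "abort" event fires, even if the underlying request is still in flight.

```typescript
// Hypothetical result type: either the fetched value, or an early exit on cancel.
type FetchResult<T> =
    | { cancelled: false; value: T }
    | { cancelled: true };

// Resolves as soon as either the work finishes or the signal fires,
// so a slow or stuck request no longer blocks the caller.
function raceWithCancel<T>(work: Promise<T>, signal: AbortSignal): Promise<FetchResult<T>> {
    if (signal.aborted) {
        // Already cancelled: do not wait for the work at all.
        const result: FetchResult<T> = { cancelled: true };
        return Promise.resolve(result);
    }
    return new Promise<FetchResult<T>>((resolve, reject) => {
        const onAbort = () => resolve({ cancelled: true });
        signal.addEventListener("abort", onAbort, { once: true });
        work.then(
            (value) => {
                signal.removeEventListener("abort", onAbort);
                resolve({ cancelled: false, value });
            },
            (error) => {
                // If cancel already won the race, this reject is a no-op.
                signal.removeEventListener("abort", onAbort);
                reject(error);
            },
        );
    });
}
```

A caller that sees `cancelled: true` before all ops arrived is the natural place to raise something like the GetDeltas_cancel event described in item 3.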
Outcome: This addresses many, but not all, NoJoinOp issues (the remaining ones need a deeper look). It also brings back "too many retries" errors, indicating that one of the reasons we ran into the initial problem is that the client is not able to find the relevant ops (and, on top of that, hangs instead of failing sooner). These errors also need a deeper look to understand whether the bugs are on the client or the server side.
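The interplay between per-attempt failure logging (fix item 4) and the "too many retries" errors can be sketched as below. This is an illustration under assumptions, not the commit's implementation: `fetchWithRetries`, `AttemptEvent`, and the `OpsFetchAttemptFailed` event name are hypothetical, and backoff delays are omitted for brevity.

```typescript
// Hypothetical shape of a telemetry event; not the actual logger API.
interface AttemptEvent {
    eventName: string;
    attempt: number;
    error: string;
}

// Retries a single fetch up to maxAttempts times, logging every failed
// attempt so that throttling (for example 429 responses) is visible in
// telemetry, and failing with an explicit error instead of hanging.
async function fetchWithRetries<T>(
    fetchOnce: () => Promise<T>,
    maxAttempts: number,
    log: (event: AttemptEvent) => void,
): Promise<T> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fetchOnce();
        } catch (error) {
            log({ eventName: "OpsFetchAttemptFailed", attempt, error: String(error) });
        }
    }
    throw new Error(`too many retries (${maxAttempts} attempts)`);
}
```

With each attempt logged, a long GetDeltas_end duration can be attributed to repeated failures rather than to a single slow request.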
Showing 4 changed files with 51 additions and 24 deletions.