Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve summary failure logging #6649

Closed
Tracked by #6686
vladsud opened this issue Jul 7, 2021 · 3 comments · Fixed by #7326
Closed
Tracked by #6686

Improve summary failure logging #6649

vladsud opened this issue Jul 7, 2021 · 3 comments · Fixed by #7326
Assignees
Labels
bug Something isn't working

Comments

@vladsud
Copy link
Contributor

vladsud commented Jul 7, 2021

Based on #6555.
Feel free to break it into smaller issues, we could also distribute it across more people.

Here are key observations (crossed out are things I believe I addressed):

  1. We should be consistent in how we log failure. There is SummaryAckWaitTimeout, SummaryOp, but there is no Nack event any more. We should either log all of them, or remove all of the and propagate relevant info into the rest of the flow (i.e. Summarize_[end|cancel] events).
    • I'm not sure there is a value in SummaryOp, and in tracking submitOpDuration on GenerateSummary event - the former tracks overall latency of connection, and later - latency of calling couple APIs?
  2. It's very hard to say how many nacks are there. One can aggregate error property from fluid:telemetry:Summarizer:Running:Summarize_cancel and guess which ones are coming from ordering service ("Failed to summarize the document") and which are not ("disconnecte", "Error 404").
  3. fluid:telemetry:Summarizer:Running:Summarize_cancel should have category = "error", not generic.
  4. There are fluid:telemetry:Summarizer:Running:Summarize_cancel events with empty error field. It's hard to say what these errors represent - would be great to figure out why that happens and how to address it.
  5. fluid:telemetry:Summarizer:Running:GenerateSummary is logging both failures and successes. Looking at stats, I believe we can drop this event on failure path, as fluid:telemetry:Summarizer:Running:Summarize_cancel duplicates it, though it does not log same data. Maybe it should be dropped altogether if Summarize_end & Summarize_cancel could convey same info? Not sure, need to look at how we report durations for various phases of summarization process.
    • Should we use exceptions for error paths (including expected errors)?
  6. timeWaiting property is not very telling. I think it should be called duration on SummaryOp, and ackWaitDuration on Summarize_end event.

Here are examples of payloads (relevant bits mostly, I removed some):

GenerateSummary:
"refSequenceNumber":"70418",
"opsSinceLastAttempt":"267",
"opsSinceLastSummary":"267",
"generateDuration":"3.9183210134506226",
"submitOpDuration":"0.06549999117851257",
"uploadDuration":"1097.4087789952755",

Summarize_end
"duration":"3775",
"sequenceNumber":"65519",
"summarySequenceNumber":"65518",
"reason":"maxTime",
"timeWaiting":"1475",
"timeSinceLastAttempt":"60639",
"timeSinceLastSummary":"60639",
"message":"summaryAck",

Summarize_cancel
"duration":"383",
"dmInitialSeqNumber":"45499",
"dmLastMsqSeqTimestamp":"1625638803886",
"dmLastKnownSeqNumber":"53696",
"dmLastMsqSeqClientId":"5a41c6c1-144a-43fd-a2dd-bdcba64c8d13",
"dmLastMsqSeqNumber":"53696",
"reason":"retry2",
"timeSinceLastAttempt":"120001",
"timeSinceLastSummary":"216331",
"error":"disconnected",
"message":"generateSummaryFailure",

SummaryOp
"summarySequenceNumber":"80051",
"refSequenceNumber":"80047",
"timeWaiting":"75",

@vladsud vladsud added the bug Something isn't working label Jul 7, 2021
@vladsud vladsud added this to the August 2021 milestone Jul 7, 2021
@vladsud vladsud mentioned this issue Jul 9, 2021
32 tasks
@vladsud
Copy link
Contributor Author

vladsud commented Aug 5, 2021

Would be nice to make progress here.
For example, I was not able to easily answer the question - do we ever got summary nack with 429 / retryAfter property?

I do not see retryAfter property on fluid:telemetry:Summarizer:Running:Summarize_cancel, and it's hard to trace this code to even answer basic question - what properties are available on it.

I finally was able to get to it via this query, but that's not very obvious (I'll raise PR to change that event to be more clear):

customEvents
| where customDimensions.eventName contains "SummarizeAttemptDelay"
| where customDimensions.retryAfterSeconds != customDimensions.regularDelaySeconds
| summarize count() by customDimensions.eventName, customDimensions.retryAfterSeconds

@pleath
Copy link
Contributor

pleath commented Sep 1, 2021

@vladsud, regarding a possible Nack event, would you recommend having RunningSummarizer.trySummarize detect that the attempt failed with summaryNack !== undefined and, if so, issue a "SummaryNack" event that includes the message and retryAfter fields?

@vladsud
Copy link
Contributor Author

vladsud commented Sep 1, 2021

I'd rather put telemetry event as deep as possible to ensure that it covers as many code paths as possible, but watch out for only summarizer client to issue these events.
So first suspect seems to be Summary.ackNack(), though it might be too broad (i.e. not satisfy requirement # 2 above).
Next obvious place would be SummaryGenerator.summarizeCore(), else branch here:

            if (ackNackOp.type === MessageType.SummaryAck) {
                this.heuristicData.markLastAttemptAsSuccessful();
                summarizeEvent.end({ ...telemetryProps, handle: ackNackOp.contents.handle, message: "summaryAck" });
                resultsBuilder.receivedSummaryAckOrNack.resolve({ success: true, data: {
                    summaryAckNackOp: ackNackOp,
                    ackNackDuration,
                }});
            } else {
                // Check for retryDelay in summaryNack response.
                // back-compat: cast needed until dep on protocol-definitions version bump
                const summaryNack = ackNackOp.type === MessageType.SummaryNack

RunningSummarizer.trySummarize is not ideal, as it only covers heuristic summaries. It will not include on-demand and enqueues summaries (see RunningSummarizer.summarizeOnDemand, etc.). That said, I'm creating a single entry point with #7298, so we can use it, but I'd rather put it one level deeper (SummaryGenerator.summarizeCore)

@pleath pleath modified the milestones: August 2021, September 2021 Sep 1, 2021
@pleath pleath linked a pull request Sep 1, 2021 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants