Improve summary failure logging #6649
Comments
It would be nice to make progress here. I do not see a retryAfter property on fluid:telemetry:Summarizer:Running:Summarize_cancel, and it's hard to trace this code to answer even a basic question: what properties are available on it? I was finally able to get to it via this query, but that's not very obvious (I'll raise a PR to make that event clearer): customEvents |
@vladsud, regarding a possible Nack event, would you recommend having RunningSummarizer.trySummarize detect that the attempt failed with summaryNack !== undefined and, if so, issue a "SummaryNack" event that includes the message and retryAfter fields?
I'd rather put the telemetry event as deep as possible to ensure that it covers as many code paths as possible, but watch out that only the summarizer client issues these events.

    if (ackNackOp.type === MessageType.SummaryAck) {
        this.heuristicData.markLastAttemptAsSuccessful();
        summarizeEvent.end({ ...telemetryProps, handle: ackNackOp.contents.handle, message: "summaryAck" });
        resultsBuilder.receivedSummaryAckOrNack.resolve({ success: true, data: {
            summaryAckNackOp: ackNackOp,
            ackNackDuration,
        }});
    } else {
        // Check for retryDelay in summaryNack response.
        // back-compat: cast needed until dep on protocol-definitions version bump
        const summaryNack = ackNackOp.type === MessageType.SummaryNack

RunningSummarizer.trySummarize is not ideal, as it only covers heuristic summaries. It will not include on-demand and enqueued summaries (see RunningSummarizer.summarizeOnDemand, etc.). That said, I'm creating a single entry point with #7298, so we can use it, but I'd rather put it one level deeper (SummaryGenerator.summarizeCore).
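A minimal sketch of the idea discussed above, assuming simplified stand-in types rather than the real Fluid Framework ones: the ack/nack op is inspected in one deep place, and a "SummaryNack" event carrying message and retryAfter is sent only when the current client is the summarizer. AckNackOp, TelemetryEvent, Logger, and reportAckNack are hypothetical names used only for illustration.

```typescript
// Hypothetical, simplified shapes; NOT the actual Fluid Framework types.
type AckNackOp =
    | { type: "summaryAck"; handle: string }
    | { type: "summaryNack"; message?: string; retryAfter?: number };

interface TelemetryEvent {
    eventName: string;
    category: "generic" | "error";
    [key: string]: string | number | undefined;
}

interface Logger {
    send(event: TelemetryEvent): void;
}

// Intended to be called from the single, deepest point that observes the
// ack/nack op, so heuristic, on-demand and enqueued summaries share it.
// Returns true for an ack and false for a nack.
function reportAckNack(logger: Logger, op: AckNackOp, isSummarizerClient: boolean): boolean {
    if (op.type === "summaryAck") {
        return true;
    }
    // Only the summarizer client should issue the nack event.
    if (isSummarizerClient) {
        logger.send({
            eventName: "SummaryNack",
            category: "error",
            message: op.message,
            retryAfter: op.retryAfter,
        });
    }
    return false;
}

// Usage with an in-memory logger (assumed here for demonstration only).
const sent: TelemetryEvent[] = [];
const logger: Logger = { send: (e) => { sent.push(e); } };
reportAckNack(logger, { type: "summaryNack", message: "example nack", retryAfter: 10 }, true);
console.log(sent);
```

Keeping the check in one function like this means the caller (heuristic, on-demand, or enqueued) does not need its own nack logging.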
Based on #6555.
Feel free to break it into smaller issues, we could also distribute it across more people.
Here are key observations (crossed out are things I believe I addressed):
- submitOpDuration on GenerateSummary event - the former tracks overall latency of connection, and the latter - latency of calling a couple of APIs?
- fluid:telemetry:Summarizer:Running:Summarize_cancel should have category = "error", not generic.
- There are fluid:telemetry:Summarizer:Running:Summarize_cancel events with an empty error field. It's hard to say what these errors represent - would be great to figure out why that happens and how to address it.
- The timeWaiting property is not very telling. I think it should be called duration on SummaryOp, and ackWaitDuration on Summarize_end event.

Here are examples of payloads (relevant bits mostly, I removed some):
GenerateSummary:
"refSequenceNumber":"70418",
"opsSinceLastAttempt":"267",
"opsSinceLastSummary":"267",
"generateDuration":"3.9183210134506226",
"submitOpDuration":"0.06549999117851257",
"uploadDuration":"1097.4087789952755",
Summarize_end:
"duration":"3775",
"sequenceNumber":"65519",
"summarySequenceNumber":"65518",
"reason":"maxTime",
"timeWaiting":"1475",
"timeSinceLastAttempt":"60639",
"timeSinceLastSummary":"60639",
"message":"summaryAck",
Summarize_cancel:
"duration":"383",
"dmInitialSeqNumber":"45499",
"dmLastMsqSeqTimestamp":"1625638803886",
"dmLastKnownSeqNumber":"53696",
"dmLastMsqSeqClientId":"5a41c6c1-144a-43fd-a2dd-bdcba64c8d13",
"dmLastMsqSeqNumber":"53696",
"reason":"retry2",
"timeSinceLastAttempt":"120001",
"timeSinceLastSummary":"216331",
"error":"disconnected",
"message":"generateSummaryFailure",
SummaryOp:
"summarySequenceNumber":"80051",
"refSequenceNumber":"80047",
"timeWaiting":"75",
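For illustration, the naming and category changes proposed in the observations could be prototyped as a small transform over payloads like the ones listed. This is only a sketch: applyProposedNaming, the short event names used as keys, and the string-valued property bag are assumptions, not existing Fluid Framework code.

```typescript
// Hypothetical property bag; the sample payloads carry every value as a string.
type EventProps = { [key: string]: string };

// Proposed replacement name for timeWaiting, per (short) event name.
const proposedNames = new Map<string, string>([
    ["SummaryOp", "duration"],
    ["Summarize_end", "ackWaitDuration"],
]);

// Returns a copy of props with the proposed renames applied and, for
// Summarize_cancel, category set to "error". The input is not modified.
function applyProposedNaming(eventName: string, props: EventProps): EventProps {
    const result: EventProps = { ...props };
    const newName = proposedNames.get(eventName);
    if (newName !== undefined && "timeWaiting" in result) {
        result[newName] = result.timeWaiting;
        delete result.timeWaiting;
    }
    if (eventName === "Summarize_cancel") {
        result.category = "error";
    }
    return result;
}

// Usage with the SummaryOp sample: timeWaiting "75" is reported as duration.
const summaryOp = applyProposedNaming("SummaryOp", {
    summarySequenceNumber: "80051",
    refSequenceNumber: "80047",
    timeWaiting: "75",
});
console.log(summaryOp.duration); // "75"
```

Summarize_end already has a duration property, which is why the sketch maps its timeWaiting to ackWaitDuration instead.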