Analyze latest ODSP stress test run #6555
Some analysis. First, the likely less interesting issues - we need to look into them all, but my intuition / knowledge tells me they are not blocking scalability testing and are not (by themselves) the issues that are manifesting to users:

| eventName | error | count_ |
| --- | --- | --- |
| fluid:telemetry:Summarizer:Running:FailToSummarize | | 443 |
| | MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 removeMember listeners added to [QuorumProxy]. Use emitter.setMaxListeners() to increase limit | 152 |
Which leaves us with these more important things to look into:
Looking at how long it takes to hear an ack in successful cases - I believe this is captured in |
@arinwt - FYI, please take a look. It would be great if you could double-check my logic.
Bigger run: testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
Latest issue identified: #6596
Acks are now "Summarize_end" events with message "summaryAck". The message is actually redundant, since all other code paths will result in "Summarize_cancel".
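As a rough Kusto sketch (property names are assumptions - I have not verified that the duration and message land in customDimensions under these names), time-to-ack in the successful cases could be pulled out like this:

customEvents
| where customDimensions.testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
| where customDimensions.eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| where customDimensions.message == "summaryAck"
// assumed: duration of the summarize attempt up to the ack, in milliseconds
| extend durationMs = todouble(customDimensions.duration)
| summarize count(), avg(durationMs), percentiles(durationMs, 50, 90, 99)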
We do sort of log this in "GenerateSummary" events with the property "opsSinceLastSummary" - this is the count of ops since the last summary ack, up to the corresponding summarize op's reference sequence number, so we actually are counting the ops being summarized. Note that the "GenerateSummary" event should fire even if the summary process fails.
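A minimal sketch for looking at the distribution of that property, assuming it is emitted as customDimensions.opsSinceLastSummary on the GenerateSummary event:

customEvents
| where customDimensions.testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
| where customDimensions.eventName == "fluid:telemetry:Summarizer:Running:GenerateSummary"
// assumed property name; counts the ops covered by the summarize attempt
| extend opsSinceLastSummary = tolong(customDimensions.opsSinceLastSummary)
| summarize count(), percentiles(opsSinceLastSummary, 50, 90, 99)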
I do think there's a need for correlating summarize attempts with their retry attempts and eventual failure. It's possible now (summaryGenTag increments by 1 with each retry, the container id remains the same, and the reason changes from X to "retry1" to "retry2"), but it's not feasible for aggregating. I think the solution could be as simple as not incrementing summarizeCount (summaryGenTag) on retries. This works well for "Summarize" events, but for any other downstream events that leverage the summaryLogger it might be better to have a 2-part schema for summaryGenTag: "3.2" for the second retry (annoying to query), or we could just add another property, "retry", that increments on the logger, etc. Either way, the solution will probably involve adding another property to the logger.
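For illustration, a sketch of how the correlation can be eyeballed today (docId is a hypothetical name for the document/container id property; summaryGenTag and reason as described above):

customEvents
| where customDimensions.testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
| extend eventName = tostring(customDimensions.eventName)
| where eventName startswith "fluid:telemetry:Summarizer:Running:Summarize"
| extend docId = tostring(customDimensions.docId)               // hypothetical property name
| extend summaryGenTag = tolong(customDimensions.summaryGenTag)
| extend reason = tostring(customDimensions.reason)             // becomes "retry1" / "retry2" on retries
| order by docId asc, summaryGenTag asc, timestamp asc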
Also, I can confirm that your analysis looks correct to me. I think the main point about "SummaryAckWaitTimeout" leading to "FailToSummarize" is almost definitely right, and it is probably related to server downtime as indicated in the other issue. I do think there are probably also missing acks in many of these cases, but for at least some cases the client can recover on retry. This is not clearly indicated in the telemetry you linked, but since documents are recovering, I think it still holds true.
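One way to sanity-check the "SummaryAckWaitTimeout" leading to "FailToSummarize" theory, as a sketch only (docId is again a hypothetical property name, and the event-name matches are approximate):

let ackTimeoutDocs = customEvents
    | where customDimensions.testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
    | extend eventName = tostring(customDimensions.eventName)
    | where eventName has "SummaryAckWaitTimeout"
    | extend docId = tostring(customDimensions.docId)
    | distinct docId;
customEvents
| where customDimensions.testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
| extend eventName = tostring(customDimensions.eventName)
| where eventName has "FailToSummarize"
| extend docId = tostring(customDimensions.docId)
| where docId in (ackTimeoutDocs)      // documents that also hit an ack wait timeout
| summarize FailToSummarizeWithPriorAckTimeout = count()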
Latest TAP40 run: 980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630
Key errors (counts above 500), grouped into related groups:

| eventName | error | count_ |
| --- | --- | --- |
| fluid:telemetry:ElectedClientNotSummarizing | | 658,485 |
| fluid:telemetry:DeltaManager:GetDeltas_Exception | Failed to retrieve ops from storage: too many retries | 10,175 |
| fluid:telemetry:SummaryManager:CreateSummarizerError | Container was closed while load() | 4,352 |

Tap30: c79d0f4c-1019-4b29-9a2c-275475d1d0e8
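A variant of the base query at the bottom of this issue reproduces this kind of grouping (counts above 500) for the TAP40 run, roughly:

customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName)
| extend error = tostring(customDimensions.error)
| summarize count_ = count() by eventName, error
| where count_ > 500
| order by count_ desc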
One more thing to add - fluid:telemetry:Summarizer:Running:Summarize_cancel escaped my queries because it does not have category = "error". fluid:telemetry:Summarizer:Running:GenerateSummary is similar - it logs successes & failures into one bucket. Top hits:

| eventName | error | category | count_ |
| --- | --- | --- | --- |
| fluid:telemetry:Summarizer:Running:GenerateSummary | | generic | 535,120 |
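To avoid missing such events going forward, the category filter can simply be dropped and category included in the grouping, e.g.:

customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| extend eventName = tostring(customDimensions.eventName)
| extend category = tostring(customDimensions.category)
| extend error = tostring(customDimensions.error)
| where eventName in (
    "fluid:telemetry:Summarizer:Running:Summarize_cancel",
    "fluid:telemetry:Summarizer:Running:GenerateSummary")
| summarize count_ = count() by eventName, category, error
| order by count_ desc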
Looks like all the issues (including "too many retries") are connected to PUSH being throttled. And we are hitting 10K unsummarized ops all the time, so the system (as it stands) is beyond capacity.

One thing to consider: if the client had the ability to fetch ops from redis (indirectly, possibly through fetching ops from SPO), then (1) would have a recovery path. It would also allow PUSH not to flush ops so frequently (on client connection), and thus reduce the chances of being throttled and reduce the impact on summary failures. It should substantially reduce pressure. That might also be a stepping stone for single-phase summary commit, i.e. SPO pulling ops from PUSH on summary upload. This might eliminate almost all the paths where PUSH flushes ops (no clients in session would be the primary reason for flushes; flushing on a timer would be rare, as most of the time a client would summarize and cause a flush that way, also receiving throttling feedback from SPO that way).

We should also consider an alternative in the form of push (i.e. the client asking PUSH to flush ops). With proper feedback (success/failure reporting from PUSH), the client would be able to leverage it in the same workflows (gap in ops, summarization). A bit harder to stage, and a slightly more complicated protocol, but possibly simpler on the server side, with fewer dependencies of SPO on PUSH.
Disconnects and their impact on stress tests are tracked in https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true.
I'm closing this issue as it has served its purpose. All the issues identified here are logged as their respective GitHub issues. The epic issue will continue to track the overall process and any new investigations on an as-needed basis.
Opening a tracking issue to investigate the results of the latest ODSP stress test run:
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+05:30) .. datetime(2021-06-23T12:30:00+05:30))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName)
| extend error = tostring(customDimensions.error)
| summarize count() by eventName, error
Logs