Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905
Comments
Negative values account for 0.6% of the total, and ALL of the occurrences came from the first summary: union Office_Fluid_FluidRuntime_*
|
We should eventually get to the bottom of it, but it impacts very few sessions (percentage-wise), so it does not block us from making decisions at the moment. Moving to June. |
It only happens on the first summary (summarizeCount == 1) and in a specific set of host scenarios: union Office_Fluid_FluidRuntime_*
|
Another interesting piece of information: negative numbers only show up with SummaryReason == MaxOps or Idle, which makes sense because we calculate timeSinceLastSummary before running the heuristics. This indicates that heuristicData is indeed where the problem is: summarizerHeuristics.ts, public run() { |
We use a combination of server and client timestamps to get the
|
Once we generate a summary, we update the
|
Is this because we process the summary op / ack (for the summary we loaded from) before any of this code runs? |
Yes, that is correct. I believe we could use the timestamp from the last op we have received (similar to what GC does). Since we will always have the join op (the summarizer connects in write mode before summarizing), we should always have a server timestamp to work with. |
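A minimal sketch of that idea, assuming each sequenced op carries a server-assigned `timestamp` in milliseconds. `OpLike` and `ServerClockTracker` are hypothetical names used only for illustration; they are not part of the summarizer code:

```typescript
// Hypothetical shape: only the server-assigned timestamp matters here.
interface OpLike {
    timestamp: number; // milliseconds, stamped by the service (assumption)
}

// Remembers the most recent server timestamp seen on any op.
class ServerClockTracker {
    private lastOpTimestamp: number | undefined;

    public onOp(op: OpLike): void {
        // Keep the maximum so an older timestamp never moves time backwards.
        if (this.lastOpTimestamp === undefined || op.timestamp > this.lastOpTimestamp) {
            this.lastOpTimestamp = op.timestamp;
        }
    }

    // Time since `referenceServerTime`, measured only on the server clock.
    // Returns undefined until at least one op (for example, the join op) has arrived.
    public elapsedSince(referenceServerTime: number): number | undefined {
        if (this.lastOpTimestamp === undefined) {
            return undefined;
        }
        return this.lastOpTimestamp - referenceServerTime;
    }
}

// Usage: the reference time and the "now" value both come from the service.
const tracker = new ServerClockTracker();
const beforeAnyOp = tracker.elapsedSince(10000);
tracker.onOp({ timestamp: 10000 }); // e.g. the join op
tracker.onOp({ timestamp: 12500 });
const elapsed = tracker.elapsedSince(10000);
```

Because `Date.now()` never enters the calculation, the local clock of the summarizer machine cannot affect the result.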
It has to be more than just the server timestamp, since this happens in a very small set of sessions, and that would not explain the HUGE values, like +600 days, or the large negative numbers. The large values seem to share the same underlying cause, as they also happen only on the first summary. Anyway, I tried to see whether it could be related to the document's age, but it does not seem to be. let loadStatus = union Office_Fluid_FluidRuntime_*
|
@NicholasCouri, the problem here is the usage of Date.now() and comparing it with values generated on some other machine. Clocks are not required to be synchronized across machines. We should either use times from the service (ops) across all the code, or only use times generated on a given machine. Mixing them will never work. It's also worth spelling out the spec clearly: what exactly are we measuring for the first attempt? Or what do we want to measure?
BTW, it's a bit weird to see both SummarizeHeuristicData's ctor and initialize() method setting up new state. I think we should have only one way to initialize an object. |
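To make the clock-mixing point concrete, here is a small sketch of the arithmetic (not the actual summarizer code). The function names and the skew values are hypothetical; `clientNow` stands in for `Date.now()` on the summarizer machine, and the other timestamps stand in for values stamped by the service:

```typescript
// Elapsed time computed by mixing a local clock with a server timestamp.
function mixedElapsed(clientNow: number, serverStampedTime: number): number {
    return clientNow - serverStampedTime;
}

// Elapsed time computed from two timestamps taken from the same clock.
function sameClockElapsed(laterServerTime: number, earlierServerTime: number): number {
    return laterServerTime - earlierServerTime;
}

const lastSummaryServerTime = 1000000; // ms, server clock
const trueElapsedMs = 5000;            // real time that has passed
const serverNow = lastSummaryServerTime + trueElapsedMs;

// Client clock running 60 s behind the server: the mixed result goes negative.
const slowClientNow = serverNow - 60000;
const negativeResult = mixedElapsed(slowClientNow, lastSummaryServerTime); // -55000

// Client clock running 1 hour ahead of the server: the mixed result is inflated.
const fastClientNow = serverNow + 3600000;
const inflatedResult = mixedElapsed(fastClientNow, lastSummaryServerTime); // 3605000

// Both timestamps from the server clock: the result matches the real elapsed time.
const consistentResult = sameClockElapsed(serverNow, lastSummaryServerTime); // 5000
```

This only shows that skew alone is enough to produce negative or inflated values; it does not establish that skew is the cause of every outlier reported in this issue.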
I like this idea much better. |
I think the right name should be overrideWithPendingSummaryAckInfo or something like that, instead of initialize - we are overriding. We have initialized when we created the object. |
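A rough sketch of what that shape could look like, with the constructor as the only initialization path and a separately named override. `AttemptInfo`, `HeuristicDataSketch`, and their fields are hypothetical simplifications for illustration, not the real SummarizeHeuristicData:

```typescript
// Hypothetical, simplified record of a summary attempt.
interface AttemptInfo {
    refSequenceNumber: number;
    summaryTime: number; // ms
}

class HeuristicDataSketch {
    private attempt: AttemptInfo;

    // The constructor is the only place that establishes initial state.
    constructor(initial: AttemptInfo) {
        this.attempt = { ...initial };
    }

    public get lastAttempt(): Readonly<AttemptInfo> {
        return this.attempt;
    }

    // Named for what it does: replaces already-initialized state with
    // information taken from a pending summary ack.
    public overrideWithPendingSummaryAckInfo(ack: AttemptInfo): void {
        this.attempt = { ...ack };
    }
}

// Usage: construct once, then override only when ack info is available.
const data = new HeuristicDataSketch({ refSequenceNumber: 100, summaryTime: 1000 });
const initialTime = data.lastAttempt.summaryTime;
data.overrideWithPendingSummaryAckInfo({ refSequenceNumber: 250, summaryTime: 4000 });
const overriddenTime = data.lastAttempt.summaryTime;
```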
This bug tracks why we see a few negative values for timeSinceLastAttempt:
union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| summarize count(), Seconds=avg(Data_timeSinceLastAttempt) / 1000
by Data_eventName, WB, Data_summarizeReason, First
From Vlad:
It shows where the problems are:
maxOps, first -> getting negative
maxTime, first -> really big number
All other items are OK
This is a continuation of #9635