Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

Closed
NicholasCouri opened this issue Apr 14, 2022 · 12 comments · Fixed by #10033
Closed

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

NicholasCouri opened this issue Apr 14, 2022 · 12 comments · Fixed by #10033
Assignees
Labels
bug Something isn't working
Milestone

Comments

@NicholasCouri
Copy link
Contributor

NicholasCouri commented Apr 14, 2022

This bug is to track why we have a few negative numbers on timeSinceLastAttempt

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| summarize count(), Seconds=avg(Data_timeSinceLastAttempt) / 1000
by Data_eventName, WB, Data_summarizeReason, First)

Data_eventName WB Data_summarizeReason First count_ Seconds
fluid:telemetry:Summarizer:Running:Summarize_generate 0 maxOps 1 612 -27.3636258169935
fluid:telemetry:Summarizer:Running:Summarize_generate 0 idle 1 779462 5.71102638614839
fluid:telemetry:Summarizer:Running:Summarize_generate 0 lastSummary 1 393 20.3479134860051
fluid:telemetry:Summarizer:Running:Summarize_generate 0 maxOps 0 872 28.1187855504587
fluid:telemetry:Summarizer:Running:Summarize_generate 0 idle 0 414794 36.5621380830002
fluid:telemetry:Summarizer:Running:Summarize_generate 0 lastSummary 0 135 40.2513185185185
fluid:telemetry:Summarizer:Running:Summarize_generate 0 maxTime 0 950839 254.752058535672
fluid:telemetry:Summarizer:Running:Summarize_generate 0 maxTime 1 245274 162041.76329836
	</HTML>

From Vlad:
It shows where problems are:

maxOps, First -> getting negative

maxTime, first -> really big number

All other items are Ok

Thisis a continuation from #9635

@NicholasCouri NicholasCouri added the bug Something isn't working label Apr 14, 2022
@NicholasCouri NicholasCouri self-assigned this Apr 14, 2022
@NicholasCouri NicholasCouri added this to the April 2022 milestone Apr 14, 2022
@NicholasCouri
Copy link
Contributor Author

The number of negative numbers is 0.6% from the total and ALL of the occurrences came from the first summary:

union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| summarize count() by IsNegative, First

IsNegative First count_
0 1 1058499
0 0 1432781
1 1 15374
	</BODY></HTML>

@vladsud
Copy link
Contributor

vladsud commented Apr 18, 2022

We should eventually get to the bottom of it, but it impacts very few sessions (%% wise), and thus does not block us from making decisions at the moment. Thus, moving to June.

@vladsud vladsud modified the milestones: April 2022, June 2022 Apr 18, 2022
@NicholasCouri
Copy link
Contributor Author

It only happens on the first time (== 1) and in a specific set of Host scenarios:

union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| where IsNegative == 1
| summarize count() by First,Data_hostScenarioName

First Data_hostScenarioName count_
1 ChatMessage 13021
1   3782
1 meeting_details_fluid_create 64
1 CreateNewCompose 1372
1 PasteComponent 413
1 meeting_details_fluid_load 44
1 Load_MeetingNotes_MeetingDetails 28
1 Create_Component_Mail 9
1 DirectNav 8
1 Create_MeetingNotes_MeetingDetails 19
	</HTML>

@NicholasCouri
Copy link
Contributor Author

Another interesting piece of information: we only set negative numbers with SummaryReason == MaxOps or Idle which makes sense as we calculate the timeSinceLastSummary before running the heuristics. Which indicates the heuristicData is where the problem is indeed:

summarizerHeuristics.tss@public run() {
const timeSinceLastSummary = Date.now() - this.heuristicData.lastSuccessfulSummary.summaryTime;
const opsSinceLastAck = this.opsSinceLastAck;
if (timeSinceLastSummary > this.configuration.maxTime) { // maxtime is 600000
this.idleTimer.clear();
this.trySummarize("maxTime");
} else if (opsSinceLastAck > this.configuration.maxOps) {
this.idleTimer.clear();
this.trySummarize("maxOps");
} else {
this.idleTimer.restart();
}
}

@agarwal-navin
Copy link
Contributor

We use a combination of server and client timestamps to get the timeSinceLastAttempt / timeSinceLastSummary. This can result in this being negative if the client time is behind the server time. Consider the following scenario:

@agarwal-navin
Copy link
Contributor

Once we generate a summary, we update the lastAttemptTime to Date.now(). All subsequent events have positive values because the comparison to get timeSinceLastAttempt / timeSinceLastSummary is against a client's local time.

this.heuristicData.recordAttempt(summaryData?.referenceSequenceNumber);

@vladsud
Copy link
Contributor

vladsud commented Apr 21, 2022

is this because we process summary op / ack (for the summary we loaded from) before any of this code is running?
I.e. we have no way to only use Deli (service) timestamps?

@agarwal-navin
Copy link
Contributor

Yes, that is correct. I believe we could use the timestamp from the last op we have received (similar to what GC does). Since we will have always have the join op (summarizer connects in write mode before summarizing), we should always have a server timestamp to work with.

@NicholasCouri
Copy link
Contributor Author

It has to be more than simply the server timestamp as it happens in a super small set of sessions, and it would not explain the values that are HUGE like +600 days or the large negative numbers. These large numbers seem to have the same underlying cause as they happen only on the first summary as well.

Anyways, I was trying to see if it could be related to the document's age but it does not seem to be related.
Navin was letting me know we added Data_createContainer that is logged with the ContainerLoadStats:

let loadStatus = union Office_Fluid_FluidRuntime_*
| where Data_eventName == "fluid:telemetry:ContainerLoadStats"
| where isnotnull(Data_createContainerTimestamp)
| summarize Data_createContainerTimestamp=min(Data_createContainerTimestamp) by Data_docId;
union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| where Event_Time > ago(7d)
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| where IsNegative == 1
| order by Data_timeSinceLastAttempt, Data_runtimeVersion, Data_loaderVersion desc
| project-away Data_createContainerTimestamp
| join (loadStatus) on Data_docId
| extend Data_createContainer = unixtime_milliseconds_todatetime(Data_createContainerTimestamp)
| extend Data_hostScenario = unixtime_milliseconds_todatetime(Data_hostScenarioStartTime)
| extend DataDiff = Data_hostScenarioStartTime - Data_createContainerTimestamp
| project Data_createContainer, Data_hostScenario, Data_timeSinceLastAttempt, Data_runtimeVersion, Data_loaderVersion, DataDiff, Data_docId
| order by Data_createContainer asc
| take 20

Data_createContainer Data_hostScenario Data_timeSinceLastAttempt Data_runtimeVersion Data_loaderVersion DataDiff Data_docId
2022-01-11 12:11:05.4790000 2022-04-21 12:20:16.2480000 -37636550 0.58.2002 0.58.2002 8640550769 5Zp2CHL3dmQeoVNA%2BCq8%2BNdK%2B7unjn5rKiy%2ByETAmWc%3D
2022-01-13 06:42:18.0750000 2022-04-18 13:36:54.6360000 -591056 0.58.2002 0.58.2002 8232876561 ZsobjUH7KsZyr1RtLldLlAMZG%2FsNw4aIP6mKT85fyww%3D
2022-01-14 09:31:20.2120000 2022-04-21 09:33:47.3080000 -4577 0.58.2002 0.54.2 8380947096 edM89u6ub6aUExScAIUi8KzI5b2XV0fJFhMuAopKxZQ%3D
2022-01-18 08:11:35.8000000 2022-04-22 06:49:37.9490000 -25334 0.58.2002 0.54.2 8116682149 UEnthraNL4i4WuW0vSE2%2BCaTHoQ6XDz0CaAK6%2FfbbUQ%3D
2022-01-19 01:37:58.2030000 2022-04-22 01:23:58.4930000 -316382 0.58.2002 0.58.2002 8034360290 ynzZxXzSsv5qjo3bA5MM%2BsjkUuzOMUhG8ZKG1HqxRkI%3D
2022-01-19 01:40:42.1370000 2022-04-22 04:05:40.2210000 -243417 0.58.2002 0.58.2002 8043898084 x2M0eWqDMeoPiNYbiDfD0VGpQRc4OWmMdM2mDsC1LkU%3D
2022-01-19 08:41:18.1540000 2022-04-22 04:16:27.0880000 -1127425 0.58.2002 0.58.2002 8019308934 9aNuIZLtn5jtwBtE3YcpW4Xcm8h%2Bm%2BkIgTof8%2FqJjh8%3D
2022-01-20 20:48:03.0750000 2022-04-20 14:19:35.8280000 -40337 0.58.2002 0.58.2002 7752692753 GbxaK6Ph%2Fx6zlf%2Fd7cVji1P2LhHX6YsE5PJwbw%2Bb93A%3D
2022-01-21 14:12:59.0050000 2022-04-21 20:29:41.9370000 -44398 0.58.2002 0.54.2 7798602932 ss4KRVR8W7bvGjXu1ruZ%2FUmjtJOXTTtZ6egjLVduzps%3D
2022-01-24 21:52:04.1310000 2022-04-20 19:26:35.2790000 -97723 0.58.2002 0.54.2 7421671148 pEx2COjJIK8sPHCig726b6j6fUTBU1ZuWkM8VWDy6Yk%3D
2022-01-26 19:13:18.4810000 2022-04-21 07:59:58.3570000 -21282 0.58.2002 0.58.2002 7303599876 FDf3FkDBWzM9gHMuNPYLhlzqUnbam2naDrZmJQ4Pe4I%3D
2022-01-27 10:00:39.0510000 2022-04-21 14:55:40.4500000 -12850 0.58.2002 0.58.2002 7275301399 thXbXKygapbJueMOlsEx4HDsT6%2FLGSTruDgsu8jBcbU%3D
2022-01-28 07:39:22.2160000 2022-04-21 08:31:00.1230000 -3194509 0.58.2002 0.58.2002 7174297907 cdSacGGQnsgM%2FjQ2enlF9MHJ9bK42T2YDtmhRuO3v1s%3D
2022-02-02 14:19:48.2030000 2022-04-18 14:37:28.5750000 -22456 0.58.2002 0.54.2 6481060372 hXM6x%2Bo%2FEPe1MHQVtNOW2iZASIfv%2FDc0hLLSsHfg1Pw%3D
2022-02-03 08:41:15.8270000 2022-04-21 07:41:53.0280000 -126219 0.58.2002 0.58.2002 6649237201 21RacQsMiRVfGb1WOnooz%2BpYOg67rq144uS0cXl36UI%3D
2022-02-07 05:53:09.9930000 2022-04-21 10:49:26.1440000 -53891 0.58.2002 0.58.2002 6324976151 b7CfBHSzxPucPxNYsUOqgcQy3%2F%2FxzYqSnJjsmBR80Lc%3D
2022-02-08 09:53:46.2220000 2022-04-19 16:34:59.4330000 -15530 0.58.2002 0.58.2002 6072073211 LJYjwaF522DNT%2FfKmC7mPG8oZjBK4FiXLeTA3U%2F8uls%3D
2022-02-08 10:19:25.7170000 2022-04-17 16:04:03.8960000 -5445 0.58.2002 0.54.2 5895878179 kOax9kWYoIjsPtL7wPCXayXWVUrViC0muCUkUfH6MmI%3D
2022-02-10 08:53:37.5620000 2022-04-19 06:06:34.9230000 -22676 0.58.2002 0.58.2002 5865177361 DSnnbc31iP%2FOMI8PXtb6DGhoZUprAypGYdPhVNS3VaU%3D
2022-02-11 20:20:32.3420000 2022-04-19 18:19:38.8960000 -42290 0.58.2002 0.58.2002 5781546554 qEVZIBA3vdpV7F2K1k0WyLOVAM0YmfTDctev8ws1z5o%3D
	</HTML>

@vladsud
Copy link
Contributor

vladsud commented Apr 22, 2022

@NicholasCouri, the problem here is usage of Date.now() and comparing it values generated on some other machine. Clocks are not required to be synchronized across machines.
We can assume that ordering service clock is stable (I do not remember who puts timestamp on op, but even if it's front-end, front-ends will synchronized and may differs only slightly from one another).

We should either use times from service (ops) across all the code, or only use times generated on a given machine. Mixing them will never work.

It's also worth clearly spelling the spec: what exactly are we measuring for first attempt? Or want to measure?
Measuring actual time between summaries is not very useful, as it includes time document was not active - this data is not very interesting and should be excluded in any analyzes.
So I see only two solutions:

  • We do not record this time for first attempt at all (keep it undefined), the rest is local machine time.
  • We record time for first iteration as well, but record time from boot till start of summary; all times are based on Date.now().

BTW, it's a bit weird to see both SummarizeHeuristicData's ctor and initialize() methods setting up new state. I think we should have only one way to initialize object.

@NicholasCouri
Copy link
Contributor Author

  • We do not record this time for first attempt at all (keep it undefined), the rest is local machine time.

I like this idea much better.

@NicholasCouri
Copy link
Contributor Author

BTW, it's a bit weird to see both SummarizeHeuristicData's ctor and initialize() methods setting up new state. I think we should have only one way to initialize object.

I think the right name should be overrideWithPendingSummaryAckInfo or something like that, instead of initialize - we are overriding. We have initialized when we created the object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants