Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

NicholasCouri · 2022-04-14T20:32:03Z

This bug is to track why we have a few negative numbers on timeSinceLastAttempt

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| summarize count(), Seconds=avg(Data_timeSinceLastAttempt) / 1000
by Data_eventName, WB, Data_summarizeReason, First)

Data_eventName	Data_summarizeReason	First	count_	Seconds
fluid:telemetry:Summarizer:Running:Summarize_generate	maxOps	1	612	-27.3636258169935
fluid:telemetry:Summarizer:Running:Summarize_generate	idle	1	779462	5.71102638614839
fluid:telemetry:Summarizer:Running:Summarize_generate	lastSummary	1	393	20.3479134860051
fluid:telemetry:Summarizer:Running:Summarize_generate	maxOps	0	872	28.1187855504587
fluid:telemetry:Summarizer:Running:Summarize_generate	idle	0	414794	36.5621380830002
fluid:telemetry:Summarizer:Running:Summarize_generate	lastSummary	0	135	40.2513185185185
fluid:telemetry:Summarizer:Running:Summarize_generate	maxTime	0	950839	254.752058535672
fluid:telemetry:Summarizer:Running:Summarize_generate	maxTime	1	245274	162041.76329836

	</HTML>

From Vlad:
It shows where problems are:

maxOps, First -> getting negative

maxTime, first -> really big number

All other items are Ok

Thisis a continuation from #9635

The text was updated successfully, but these errors were encountered:

NicholasCouri · 2022-04-15T16:00:34Z

The number of negative numbers is 0.6% from the total and ALL of the occurrences came from the first summary:

union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| summarize count() by IsNegative, First

IsNegative	First	count_
0	1	1058499
0	0	1432781
1	1	15374

	</BODY></HTML>

vladsud · 2022-04-18T20:37:10Z

We should eventually get to the bottom of it, but it impacts very few sessions (%% wise), and thus does not block us from making decisions at the moment. Thus, moving to June.

NicholasCouri · 2022-04-19T23:20:24Z

It only happens on the first time (== 1) and in a specific set of Host scenarios:

union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| where IsNegative == 1
| summarize count() by First,Data_hostScenarioName

First	Data_hostScenarioName	count_
1	ChatMessage	13021
1		3782
1	meeting_details_fluid_create	64
1	CreateNewCompose	1372
1	PasteComponent	413
1	meeting_details_fluid_load	44
1	Load_MeetingNotes_MeetingDetails	28
1	Create_Component_Mail	9
1	DirectNav	8
1	Create_MeetingNotes_MeetingDetails	19

	</HTML>

NicholasCouri · 2022-04-20T02:47:31Z

Another interesting piece of information: we only set negative numbers with SummaryReason == MaxOps or Idle which makes sense as we calculate the timeSinceLastSummary before running the heuristics. Which indicates the heuristicData is where the problem is indeed:

summarizerHeuristics.tss@public run() {
const timeSinceLastSummary = Date.now() - this.heuristicData.lastSuccessfulSummary.summaryTime;
const opsSinceLastAck = this.opsSinceLastAck;
if (timeSinceLastSummary > this.configuration.maxTime) { // maxtime is 600000
this.idleTimer.clear();
this.trySummarize("maxTime");
} else if (opsSinceLastAck > this.configuration.maxOps) {
this.idleTimer.clear();
this.trySummarize("maxOps");
} else {
this.idleTimer.restart();
}
}

agarwal-navin · 2022-04-21T21:19:41Z

We use a combination of server and client timestamps to get the timeSinceLastAttempt / timeSinceLastSummary. This can result in this being negative if the client time is behind the server time. Consider the following scenario:

When a new summarizer is created, it initializes HeursiticsData's lastAttemptTime and lastSummaryTime to Date.now() -

FluidFramework/packages/runtime/container-runtime/src/summarizer.ts

Line 252 in 965b460

new SummarizeHeuristicData(

.
It then waits for any pending acks and if it receives one, it initializes those times to the timestamp of the corresponding summary op which is added by the server -

FluidFramework/packages/runtime/container-runtime/src/runningSummarizer.ts

Line 261 in 965b460

this.heuristicData.initialize({

.
Now, when the summarizer attempts to summarize, it calculates timeSinceLastAttempt / timeSinceLastSummary as Date.now() - lastAttemptTime/lastSummaryTime. If the client's clock is behind the server's, the time difference will be negative.

agarwal-navin · 2022-04-21T21:20:44Z

Once we generate a summary, we update the lastAttemptTime to Date.now(). All subsequent events have positive values because the comparison to get timeSinceLastAttempt / timeSinceLastSummary is against a client's local time.

FluidFramework/packages/runtime/container-runtime/src/summaryGenerator.ts

Line 362 in 17c7b27

this.heuristicData.recordAttempt(summaryData?.referenceSequenceNumber);

vladsud · 2022-04-21T23:09:03Z

is this because we process summary op / ack (for the summary we loaded from) before any of this code is running?
I.e. we have no way to only use Deli (service) timestamps?

agarwal-navin · 2022-04-21T23:16:09Z

Yes, that is correct. I believe we could use the timestamp from the last op we have received (similar to what GC does). Since we will have always have the join op (summarizer connects in write mode before summarizing), we should always have a server timestamp to work with.

NicholasCouri · 2022-04-22T14:29:14Z

It has to be more than simply the server timestamp as it happens in a super small set of sessions, and it would not explain the values that are HUGE like +600 days or the large negative numbers. These large numbers seem to have the same underlying cause as they happen only on the first summary as well.

Anyways, I was trying to see if it could be related to the document's age but it does not seem to be related.
Navin was letting me know we added Data_createContainer that is logged with the ContainerLoadStats:

let loadStatus = union Office_Fluid_FluidRuntime_*
| where Data_eventName == "fluid:telemetry:ContainerLoadStats"
| where isnotnull(Data_createContainerTimestamp)
| summarize Data_createContainerTimestamp=min(Data_createContainerTimestamp) by Data_docId;
union Office_Fluid_FluidRuntime_*
| where Data_hostCorrelationId !in (InvalidatedHostCorrelationIds() )
| where Data_eventName contains "fluid:telemetry:Summarizer:Running:Summarize_generate"
| where Event_Time > ago(7d)
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend First = Data_summarizeCount == 1
| extend Seconds=Data_timeSinceLastAttempt/ 1000
| extend IsNegative = Data_timeSinceLastAttempt < 0
| where IsNegative == 1
| order by Data_timeSinceLastAttempt, Data_runtimeVersion, Data_loaderVersion desc
| project-away Data_createContainerTimestamp
| join (loadStatus) on Data_docId
| extend Data_createContainer = unixtime_milliseconds_todatetime(Data_createContainerTimestamp)
| extend Data_hostScenario = unixtime_milliseconds_todatetime(Data_hostScenarioStartTime)
| extend DataDiff = Data_hostScenarioStartTime - Data_createContainerTimestamp
| project Data_createContainer, Data_hostScenario, Data_timeSinceLastAttempt, Data_runtimeVersion, Data_loaderVersion, DataDiff, Data_docId
| order by Data_createContainer asc
| take 20

Data_createContainer	Data_hostScenario	Data_timeSinceLastAttempt	Data_runtimeVersion	Data_loaderVersion	DataDiff	Data_docId
2022-01-11 12:11:05.4790000	2022-04-21 12:20:16.2480000	-37636550	0.58.2002	0.58.2002	8640550769	5Zp2CHL3dmQeoVNA%2BCq8%2BNdK%2B7unjn5rKiy%2ByETAmWc%3D
2022-01-13 06:42:18.0750000	2022-04-18 13:36:54.6360000	-591056	0.58.2002	0.58.2002	8232876561	ZsobjUH7KsZyr1RtLldLlAMZG%2FsNw4aIP6mKT85fyww%3D
2022-01-14 09:31:20.2120000	2022-04-21 09:33:47.3080000	-4577	0.58.2002	0.54.2	8380947096	edM89u6ub6aUExScAIUi8KzI5b2XV0fJFhMuAopKxZQ%3D
2022-01-18 08:11:35.8000000	2022-04-22 06:49:37.9490000	-25334	0.58.2002	0.54.2	8116682149	UEnthraNL4i4WuW0vSE2%2BCaTHoQ6XDz0CaAK6%2FfbbUQ%3D
2022-01-19 01:37:58.2030000	2022-04-22 01:23:58.4930000	-316382	0.58.2002	0.58.2002	8034360290	ynzZxXzSsv5qjo3bA5MM%2BsjkUuzOMUhG8ZKG1HqxRkI%3D
2022-01-19 01:40:42.1370000	2022-04-22 04:05:40.2210000	-243417	0.58.2002	0.58.2002	8043898084	x2M0eWqDMeoPiNYbiDfD0VGpQRc4OWmMdM2mDsC1LkU%3D
2022-01-19 08:41:18.1540000	2022-04-22 04:16:27.0880000	-1127425	0.58.2002	0.58.2002	8019308934	9aNuIZLtn5jtwBtE3YcpW4Xcm8h%2Bm%2BkIgTof8%2FqJjh8%3D
2022-01-20 20:48:03.0750000	2022-04-20 14:19:35.8280000	-40337	0.58.2002	0.58.2002	7752692753	GbxaK6Ph%2Fx6zlf%2Fd7cVji1P2LhHX6YsE5PJwbw%2Bb93A%3D
2022-01-21 14:12:59.0050000	2022-04-21 20:29:41.9370000	-44398	0.58.2002	0.54.2	7798602932	ss4KRVR8W7bvGjXu1ruZ%2FUmjtJOXTTtZ6egjLVduzps%3D
2022-01-24 21:52:04.1310000	2022-04-20 19:26:35.2790000	-97723	0.58.2002	0.54.2	7421671148	pEx2COjJIK8sPHCig726b6j6fUTBU1ZuWkM8VWDy6Yk%3D
2022-01-26 19:13:18.4810000	2022-04-21 07:59:58.3570000	-21282	0.58.2002	0.58.2002	7303599876	FDf3FkDBWzM9gHMuNPYLhlzqUnbam2naDrZmJQ4Pe4I%3D
2022-01-27 10:00:39.0510000	2022-04-21 14:55:40.4500000	-12850	0.58.2002	0.58.2002	7275301399	thXbXKygapbJueMOlsEx4HDsT6%2FLGSTruDgsu8jBcbU%3D
2022-01-28 07:39:22.2160000	2022-04-21 08:31:00.1230000	-3194509	0.58.2002	0.58.2002	7174297907	cdSacGGQnsgM%2FjQ2enlF9MHJ9bK42T2YDtmhRuO3v1s%3D
2022-02-02 14:19:48.2030000	2022-04-18 14:37:28.5750000	-22456	0.58.2002	0.54.2	6481060372	hXM6x%2Bo%2FEPe1MHQVtNOW2iZASIfv%2FDc0hLLSsHfg1Pw%3D
2022-02-03 08:41:15.8270000	2022-04-21 07:41:53.0280000	-126219	0.58.2002	0.58.2002	6649237201	21RacQsMiRVfGb1WOnooz%2BpYOg67rq144uS0cXl36UI%3D
2022-02-07 05:53:09.9930000	2022-04-21 10:49:26.1440000	-53891	0.58.2002	0.58.2002	6324976151	b7CfBHSzxPucPxNYsUOqgcQy3%2F%2FxzYqSnJjsmBR80Lc%3D
2022-02-08 09:53:46.2220000	2022-04-19 16:34:59.4330000	-15530	0.58.2002	0.58.2002	6072073211	LJYjwaF522DNT%2FfKmC7mPG8oZjBK4FiXLeTA3U%2F8uls%3D
2022-02-08 10:19:25.7170000	2022-04-17 16:04:03.8960000	-5445	0.58.2002	0.54.2	5895878179	kOax9kWYoIjsPtL7wPCXayXWVUrViC0muCUkUfH6MmI%3D
2022-02-10 08:53:37.5620000	2022-04-19 06:06:34.9230000	-22676	0.58.2002	0.58.2002	5865177361	DSnnbc31iP%2FOMI8PXtb6DGhoZUprAypGYdPhVNS3VaU%3D
2022-02-11 20:20:32.3420000	2022-04-19 18:19:38.8960000	-42290	0.58.2002	0.58.2002	5781546554	qEVZIBA3vdpV7F2K1k0WyLOVAM0YmfTDctev8ws1z5o%3D

	</HTML>

vladsud · 2022-04-22T15:00:52Z

@NicholasCouri, the problem here is usage of Date.now() and comparing it values generated on some other machine. Clocks are not required to be synchronized across machines.
We can assume that ordering service clock is stable (I do not remember who puts timestamp on op, but even if it's front-end, front-ends will synchronized and may differs only slightly from one another).

We should either use times from service (ops) across all the code, or only use times generated on a given machine. Mixing them will never work.

It's also worth clearly spelling the spec: what exactly are we measuring for first attempt? Or want to measure?
Measuring actual time between summaries is not very useful, as it includes time document was not active - this data is not very interesting and should be excluded in any analyzes.
So I see only two solutions:

We do not record this time for first attempt at all (keep it undefined), the rest is local machine time.
We record time for first iteration as well, but record time from boot till start of summary; all times are based on Date.now().

BTW, it's a bit weird to see both SummarizeHeuristicData's ctor and initialize() methods setting up new state. I think we should have only one way to initialize object.

NicholasCouri · 2022-04-22T17:41:43Z

We do not record this time for first attempt at all (keep it undefined), the rest is local machine time.

I like this idea much better.

NicholasCouri · 2022-04-23T01:28:46Z

BTW, it's a bit weird to see both SummarizeHeuristicData's ctor and initialize() methods setting up new state. I think we should have only one way to initialize object.

I think the right name should be overrideWithPendingSummaryAckInfo or something like that, instead of initialize - we are overriding. We have initialized when we created the object.

NicholasCouri added the bug Something isn't working label Apr 14, 2022

NicholasCouri self-assigned this Apr 14, 2022

NicholasCouri added this to the April 2022 milestone Apr 14, 2022

vladsud modified the milestones: April 2022, June 2022 Apr 18, 2022

NicholasCouri mentioned this issue Apr 25, 2022

Fix for 9905 - Investigate negative numbers on Summary's timeSinceLastAttempt field #10033

Merged

NicholasCouri closed this as completed in #10033 Apr 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

NicholasCouri commented Apr 14, 2022 •

edited

Loading

NicholasCouri commented Apr 15, 2022

vladsud commented Apr 18, 2022

NicholasCouri commented Apr 19, 2022

NicholasCouri commented Apr 20, 2022

agarwal-navin commented Apr 21, 2022

agarwal-navin commented Apr 21, 2022

vladsud commented Apr 21, 2022

agarwal-navin commented Apr 21, 2022

NicholasCouri commented Apr 22, 2022

vladsud commented Apr 22, 2022

NicholasCouri commented Apr 22, 2022

NicholasCouri commented Apr 23, 2022

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

Comments

NicholasCouri commented Apr 14, 2022 • edited Loading

NicholasCouri commented Apr 15, 2022

vladsud commented Apr 18, 2022

NicholasCouri commented Apr 19, 2022

NicholasCouri commented Apr 20, 2022

agarwal-navin commented Apr 21, 2022

agarwal-navin commented Apr 21, 2022

vladsud commented Apr 21, 2022

agarwal-navin commented Apr 21, 2022

NicholasCouri commented Apr 22, 2022

vladsud commented Apr 22, 2022

NicholasCouri commented Apr 22, 2022

NicholasCouri commented Apr 23, 2022

NicholasCouri commented Apr 14, 2022 •

edited

Loading