Summary data: Notes based on observation of telemetry data & code #9635

vladsud · 2022-03-28T16:08:50Z

I was looking at the summarize telemetry code, and I'm not sure I follow this logic:

        const summarizeEvent = PerformanceEvent.start(logger, {
            eventName: "Summarize",
            refreshLatestAck,
            ...summarizeTelemetryProps,
        });

            const generateSummaryEvent = PerformanceEvent.start(logger, {
                eventName: "Summarize",
                ...summarizeTelemetryProps,
            });

Why do we need two events with exactly the same name in the same function? I think you can just leverage the first one; they represent the same starting point in time.
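A minimal sketch of the suggestion above. `SketchLogger` and `SketchPerfEvent` are hypothetical stand-ins written for illustration; they are not the `PerformanceEvent` API from the Fluid telemetry utilities, and the real signatures differ. The point is only that one event started at the top of the function already captures the shared start time, so phase-specific details can be attached when that event ends instead of opening a second event with the same name.

```typescript
// Hypothetical stand-ins for illustration only; not the Fluid telemetry API.
type Props = Record<string, unknown>;

interface TelemetryRecord {
    eventName: string;
    category: "start" | "end";
    [key: string]: unknown;
}

class SketchLogger {
    public readonly records: TelemetryRecord[] = [];

    public send(record: TelemetryRecord): void {
        this.records.push(record);
    }
}

class SketchPerfEvent {
    private readonly startTime = Date.now();

    private constructor(
        private readonly logger: SketchLogger,
        private readonly eventName: string,
        private readonly props: Props,
    ) {}

    // Records the start time, emits a "start" record, and returns the event.
    public static start(logger: SketchLogger, eventName: string, props: Props = {}): SketchPerfEvent {
        const event = new SketchPerfEvent(logger, eventName, props);
        logger.send({ ...props, eventName, category: "start" });
        return event;
    }

    // Emits a single "end" record carrying the duration and any phase details.
    public end(extraProps: Props = {}): void {
        this.logger.send({
            ...this.props,
            ...extraProps,
            eventName: this.eventName,
            category: "end",
            duration: Date.now() - this.startTime,
        });
    }
}

// One event spans the whole attempt; no second event with the same name is needed.
function summarizeSketch(logger: SketchLogger, refreshLatestAck: boolean): void {
    const summarizeEvent = SketchPerfEvent.start(logger, "Summarize", { refreshLatestAck });
    // ... generate and upload the summary here (elided) ...
    summarizeEvent.end({ stage: "generate" });
}
```

With this shape, the "Summarize" start/end pair alone carries both the shared start time and the details that a second same-named event would otherwise duplicate.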

I'm not sure there is value in recording constants such as:

message: "summaryAck",

Why is the Whiteboard data different from Chapter1? opsSinceLastSummary and totalBlobSize are missing:
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| summarize count(), avg(Data_timeSinceLastAttempt), avg(Data_opsSinceLastSummary), avg(Data_totalBlobSize), avg(Data_duration)
by Data_eventName, WB

Data_eventName	WB	count_	avg_Data_timeSinceLastAttempt	avg_Data_opsSinceLastSummary	avg_Data_totalBlobSize	avg_Data_duration
fluid:telemetry:Summarizer:Running:Summarize_end	false	1,580,649	8450170	30	117863	2518
fluid:telemetry:Summarizer:Running:Summarize_end	true	388,603	21678480	NaN	NaN	3533

Note that if we look at the fluid:telemetry:Summarizer:Running:GenerateSummary data, we see that Whiteboard events do carry both the opsSinceLastSummary and timeSinceLastAttempt payloads. So perhaps a mix of runtime versions, each with its own payload format, is skewing the data.

I'm not sure how to interpret timeSinceLastAttempt: as shown above, the average is in the millions!


NicholasCouri · 2022-03-31T18:55:34Z

@vladsud

#1. Done.

#2. There is no need to have that summaryAck constant.

#3. As for avg_Data_opsSinceLastSummary and avg_Data_totalBlobSize showing up on Summarize, that should happen once Whiteboard updates its runtime version. See below.

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| where Data_hostScenarioName contains "Whiteboard"
| summarize count() by Data_runtimeVersion

Data_runtimeVersion	count_
0.55.4	81225
0.54.1	1738

Here are the builds in which the field is populated (starting with 0.57.2):
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| where isnotnull(Data_opsSinceLastSummary)
| summarize count() by Data_runtimeVersion

Data_runtimeVersion	count_
0.58.1001	16769
0.58.2002	559
0.58.2001	39
0.57.2	14
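A minimal sketch of filtering rows down to builds expected to carry the field, so averages are not mixed across payload formats. It assumes, from the one-day query above, that 0.57.2 is the first build emitting opsSinceLastSummary on Summarize_end; that is only the earliest build observed in that window, not a confirmed release boundary. `mayCarryOpsSinceLastSummary`, `filterComparableRows` and `SummarizeEndRow` are hypothetical names, and the version parsing ignores any pre-release suffix.

```typescript
// Assumption: earliest build observed populating opsSinceLastSummary (one day of data).
const FIRST_POPULATED_VERSION: number[] = [0, 57, 2];

// "0.58.1001" -> [0, 58, 1001]; anything after a "-" is ignored.
function parseVersion(version: string): number[] {
    return version.split("-")[0].split(".").map((part) => Number.parseInt(part, 10));
}

// Negative if a < b, zero if equal, positive if a > b; missing parts count as 0.
function compareVersions(a: number[], b: number[]): number {
    for (let i = 0; i < Math.max(a.length, b.length); i++) {
        const left = i < a.length ? a[i] : 0;
        const right = i < b.length ? b[i] : 0;
        if (left !== right) {
            return left - right;
        }
    }
    return 0;
}

function mayCarryOpsSinceLastSummary(runtimeVersion: string): boolean {
    return compareVersions(parseVersion(runtimeVersion), FIRST_POPULATED_VERSION) >= 0;
}

// Hypothetical row shape for Summarize_end telemetry.
interface SummarizeEndRow {
    runtimeVersion: string;
    opsSinceLastSummary?: number;
}

// Keeps only rows from builds whose payload format should include the field.
function filterComparableRows(rows: SummarizeEndRow[]): SummarizeEndRow[] {
    return rows.filter((row) => mayCarryOpsSinceLastSummary(row.runtimeVersion));
}
```

Against the versions listed above, 0.55.4 and 0.54.1 are excluded while 0.57.2 and the 0.58.x builds are kept.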

#4. The large numbers come from the fact that we store the time in ms.
So 21678480 ms is approximately 6 hours (361 minutes).
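A small sketch of that unit conversion, assuming (as stated above) that timeSinceLastAttempt is recorded in milliseconds. `describeMs` is a hypothetical helper name.

```typescript
// Converts a millisecond duration such as timeSinceLastAttempt into minutes and hours.
const MS_PER_MINUTE = 60 * 1000;
const MS_PER_HOUR = 60 * MS_PER_MINUTE;

function describeMs(ms: number): { minutes: number; hours: number } {
    return { minutes: ms / MS_PER_MINUTE, hours: ms / MS_PER_HOUR };
}

const whiteboardAvg = describeMs(21678480);
console.log(whiteboardAvg.minutes.toFixed(1), whiteboardAvg.hours.toFixed(2)); // prints "361.3 6.02"
```

Applying the same conversion to the later averages in this thread (14801757 ms and 11025799 ms) gives roughly 4.1 and 3.1 hours.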

For example:
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend timeSinceLastSummaryHours = iff(isnull(Data_timeSinceLastSummary), 0.0, Data_timeSinceLastSummary / (1000 * 3600))
| summarize count(), round(avg(timeSinceLastSummaryHours)), round(avg(Data_timeSinceLastAttempt)), round(avg(Data_opsSinceLastSummary)), avg(Data_totalBlobSize), avg(Data_duration)
by Data_eventName, WB

Data_eventName	WB	count_	avg_timeSinceLastSummaryHours	avg_Data_timeSinceLastAttempt	avg_Data_opsSinceLastSummary	avg_Data_totalBlobSize	avg_Data_duration
fluid:telemetry:Summarizer:Running:Summarize_end	0	1609287	4	14801757	28	79123.0775024632	2542.43151532325
fluid:telemetry:Summarizer:Running:Summarize_end	1	885106	3	11025799	NaN	NaN	4380.5799587846


…also Adding new test when election is disabled (#9857)
* Fix small outstanding issues from #9635
* Adding new test when election is disabled
* Adding additional info for the UnexpectedElectionSequenceNumber event

vladsud · 2022-04-14T19:00:08Z

Hey @NicholasCouri, any chance you can provide some insight into what was wrong with item #4 above?
Also, can we create a query (a data filter) to look at the valid data within the telemetry that is already available?
It would be useful to poke at the data sooner, if there is any way to do so, without waiting for the changes to propagate through the system.

NicholasCouri · 2022-04-14T23:58:02Z

@vladsud - Working on it

vladsud added the bug Something isn't working label Mar 28, 2022

vladsud added this to the April 2022 milestone Mar 28, 2022

vladsud assigned pleath and NicholasCouri Mar 28, 2022

curtisman added the area: runtime: summarizer label Apr 4, 2022

pleath modified the milestones: April 2022, May 2022 Apr 6, 2022

pleath removed their assignment Apr 6, 2022

vladsud modified the milestones: May 2022, April 2022 Apr 12, 2022

NicholasCouri linked a pull request Apr 14, 2022 that will close this issue

Fix #9635 Summary data: Notes based on observation of telemetry data also Adding new test when election is disabled #9857

Merged

NicholasCouri removed a link to a pull request Apr 14, 2022


NicholasCouri linked a pull request Apr 14, 2022 that will close this issue


NicholasCouri closed this as completed in #9857 Apr 14, 2022

NicholasCouri mentioned this issue Apr 14, 2022

Investigate negative numbers on Summary's timeSinceLastAttempt field. #9905

Closed
