Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary data: Notes based on observation of telemetry data & code #9635

Closed
vladsud opened this issue Mar 28, 2022 · 3 comments · Fixed by #9857
Closed

Summary data: Notes based on observation of telemetry data & code #9635

vladsud opened this issue Mar 28, 2022 · 3 comments · Fixed by #9857
Assignees
Labels
area: runtime: summarizer bug Something isn't working
Milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Mar 28, 2022

  1. I was looking at summarize telemetry code, and I'm not sure I follow this logic:
        const summarizeEvent = PerformanceEvent.start(logger, {
            eventName: "Summarize",
            refreshLatestAck,
            ...summarizeTelemetryProps,
        });
            const generateSummaryEvent = PerformanceEvent.start(logger, {
                eventName: "Summarize",
                ...summarizeTelemetryProps,
            });

Why do we need to events with exactly same name in same function? I think you can leverage first one just fine - they represent the same starting point in time.

  1. I'm not sure there is a value in recording constants:

message: "summaryAck",

  1. Why Whiteboard data is different from Chapter1? opsSinceLastSummary & timeSinceLastAttempt is missing:
    union Office_Fluid_FluidRuntime_*
    | where Event_Time > ago(7d)
    | where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
    | extend WB = Data_hostScenarioName contains "Whiteboard"
    | summarize count(), avg(Data_timeSinceLastAttempt), avg(Data_opsSinceLastSummary), avg(Data_totalBlobSize), avg(Data_duration)
    by Data_eventName, WB
Data_eventName WB count_ avg_Data_timeSinceLastAttempt avg_Data_opsSinceLastSummary avg_Data_totalBlobSize avg_Data_duration
fluid:telemetry:Summarizer:Running:Summarize_end false 1,580,649 8450170 30 117863 2518
fluid:telemetry:Summarizer:Running:Summarize_end true 388,603 21678480 NaN NaN 3533

Note that if we look at fluid:telemetry:Summarizer:Running:GenerateSummary data, we will see that Whiteboard has both opsSinceLastSummary & timeSinceLastAttempt payload. So maybe it's a mix of different versions and their format that screws the data.

  1. I'm not sure how to interpret timeSinceLastAttempt - see above: average is in millions!
@vladsud vladsud added the bug Something isn't working label Mar 28, 2022
@vladsud vladsud added this to the April 2022 milestone Mar 28, 2022
@NicholasCouri
Copy link
Contributor

NicholasCouri commented Mar 31, 2022

@vladsud

#1. Done.

#2. there is no need to have that constant summaryAck

#3 As for avg_Data_opsSinceLastSummary | avg_Data_totalBlobSize showing on Summarize, it should happen after they update their runtime version. See below.

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| where Data_hostScenarioName contains "Whiteboard"
| summarize count() by Data_runtimeVersion

Data_runtimeVersion count_
0.55.4 81225
0.54.1 1738

Here are the builds where it starts to be populated (0.57.2):
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| where isnotnull(Data_opsSinceLastSummary)
| summarize count() by Data_runtimeVersion

Data_runtimeVersion count_
0.58.1001 16769
0.58.2002 559
0.58.2001 39
0.57.2 14

#4 -The large numbers comes from the fact we store the time in ms.
So 21678480 is approx = 6 hours (361 mins)

Ex.
union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| extend WB = Data_hostScenarioName contains "Whiteboard"
| extend timeSinceLastSummaryHours = iff(isnull(Data_timeSinceLastSummary ), 0.0, Data_timeSinceLastSummary / (1000* 3600))
| summarize count(), round(avg(timeSinceLastSummaryHours)) , round(avg(Data_timeSinceLastAttempt)), round(avg(Data_opsSinceLastSummary)), avg(Data_totalBlobSize), avg(Data_duration)
by Data_eventName, WB

Data_eventName WB count_ avg_timeSinceLastSummaryHours avg_Data_timeSinceLastAttempt avg_Data_opsSinceLastSummary avg_Data_totalBlobSize avg_Data_duration
fluid:telemetry:Summarizer:Running:Summarize_end 0 1609287 4 14801757 28 79123.0775024632 2542.43151532325
fluid:telemetry:Summarizer:Running:Summarize_end 1 885106 3 11025799 NaN NaN 4380.5799587846
	</BODY></HTML>

@pleath pleath modified the milestones: April 2022, May 2022 Apr 6, 2022
@pleath pleath removed their assignment Apr 6, 2022
@vladsud vladsud modified the milestones: May 2022, April 2022 Apr 12, 2022
NicholasCouri added a commit that referenced this issue Apr 14, 2022
…also Adding new test when election is disabled (#9857)

* Fix small outstanding issues from #9635 
* Adding new test when election is disabled
* Adding additional info for the UnexpectedElectionSequenceNumber event
@vladsud
Copy link
Contributor Author

vladsud commented Apr 14, 2022

Hey @NicholasCouri , any chance you can provide some insights here on what was wrong with # 4 item from above?
Also can we create some query (filter data) to look at valid data based on earlier available telemetry?
It would be useful to poke at data earlier if there is any way to do so, without waiting for changes to propagate through the system

@NicholasCouri
Copy link
Contributor

@vladsud - Working on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime: summarizer bug Something isn't working
Projects
None yet
4 participants