fluid:telemetry:InitialElectedClientNotFound errors in telemetry #6948

Closed · Tracked by #6686
vladsud opened this issue Aug 1, 2021 · 5 comments

Comments

vladsud (Contributor) commented Aug 1, 2021

We do see these events in automation:

union office_fluid_ffautomation_*
| where Data_eventName contains "InitialElectedClientNotFound"
| summarize count() by Data_eventName

I see 3977 cases.

Plus, this event should have a reasonable prefix, i.e. fluid:telemetry:OrderedClientElection:InitialElectedClientNotFound

vladsud added the "bug" label on Aug 1, 2021
vladsud added this to the August 2021 milestone on Aug 1, 2021
anthony-murphy (Contributor) commented:

We should definitely fix the prefix here. It looks like the raw logger is getting passed in rather than the summarizer's or containerRuntime's sub-logger.
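
For illustration, a minimal sketch (not the actual fix) of how passing a namespaced sub-logger instead of the raw logger would yield the fully prefixed event name. It assumes the ChildLogger helper from @fluidframework/telemetry-utils; the factory function below is hypothetical.

import { ITelemetryLogger } from "@fluidframework/common-definitions";
import { ChildLogger } from "@fluidframework/telemetry-utils";

// Hypothetical helper: wrap the container runtime's logger in a namespaced
// child logger before handing it to the election code.
function createElectionLogger(runtimeLogger: ITelemetryLogger): ITelemetryLogger {
    // Events sent through this logger pick up the "OrderedClientElection" namespace,
    // so they surface as
    // "fluid:telemetry:OrderedClientElection:InitialElectedClientNotFound"
    // rather than the bare "InitialElectedClientNotFound".
    return ChildLogger.create(runtimeLogger, "OrderedClientElection");
}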

arinwt (Contributor) commented Aug 4, 2021

After reviewing the telemetry, it looks like the bug is caused by an older summary (at seq x) being accepted while the server stamps it with protocol state at a later sequence number (x + k), which is the same sequence number as another, previously nacked summary attempt.

This needs to be fixed in the scribe lambda; rejecting summaries that are older than the current protocol state is a must. Maybe a regression from #934? I still see similar code that now lives in SummaryWriter, but I'm not sure what other server implementations use.

Additionally, I think we should fail faster by recording the runtime's reference sequence number in the .metadata blob and failing immediately on load if it doesn't match the one in the .protocol tree. Opened issue #7002 for this, and PR #7015 to address it.
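
As a rough sketch of the fail-fast idea (the names below are hypothetical, not taken from PR #7015): record the runtime's reference sequence number in the .metadata blob at summary time, then compare it against the .protocol tree's sequence number at load and fail immediately on mismatch.

// Hypothetical shape of the data recorded in the ".metadata" blob.
interface RuntimeSummaryMetadata {
    // Reference sequence number the runtime had when it generated the summary.
    summaryRefSeqNumber: number;
}

// Validate snapshot consistency at load time.
function validateSnapshotConsistency(
    metadata: RuntimeSummaryMetadata,
    protocolSequenceNumber: number, // sequence number stored in the ".protocol" tree
): void {
    if (metadata.summaryRefSeqNumber !== protocolSequenceNumber) {
        // The server stamped the summary with protocol state from a different
        // sequence number, so the snapshot is internally inconsistent. Failing
        // here is clearer than hitting errors like InitialElectedClientNotFound later.
        throw new Error(
            `Runtime refSeq ${metadata.summaryRefSeqNumber} does not match ` +
            `protocol sequence number ${protocolSequenceNumber}`);
    }
}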

vladsud mentioned this issue on Aug 6, 2021
pleath (Contributor) commented Aug 6, 2021

Is there something particular about the way the stress test issues summary nacks that could be bypassing the original fix in the scribe lambda? Or is it more likely that something (a race?) is causing that fix not to work?

pleath modified the milestones: August 2021 → September 2021 on Sep 1, 2021
pleath (Contributor) commented Sep 9, 2021

This seems to have dropped off the radar. No hits in current Kusto data.

pleath modified the milestones: September 2021 → Next on Sep 9, 2021
vladsud (Contributor, Author) commented Sep 28, 2021

There are no events in Prod, our scalability tests, or ODSP scalability tests.
All issues are resolved thanks to the latest set of improvements to the summarizer logic, including the reliable last-summary flow.

vladsud closed this as completed on Sep 28, 2021