fluid:telemetry:InitialElectedClientNotFound errors in telemetry #6948

Closed · Tracked by #6686
vladsud opened this issue Aug 1, 2021 · 5 comments

Comments

vladsud (Contributor) commented Aug 1, 2021

We do see these events in automation:

union office_fluid_ffautomation_*
| where Data_eventName contains "InitialElectedClientNotFound"
| summarize count() by Data_eventName

I see 3977 cases.

Plus, this event should have a reasonable prefix, i.e. fluid:telemetry:OrderedClientElection:InitialElectedClientNotFound

vladsud added the "bug" label on Aug 1, 2021
vladsud added this to the August 2021 milestone on Aug 1, 2021
anthony-murphy (Contributor) commented:

We should definitely fix the prefix here. It looks like the raw logger is getting passed in rather than the summarizer's or containerRuntime's sub-logger.
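
For illustration, a minimal sketch (not the actual fix) of how passing a namespaced sub-logger instead of the raw logger would yield the fully prefixed event name. It assumes the ChildLogger helper from @fluidframework/telemetry-utils; the factory function below is hypothetical.

import { ITelemetryLogger } from "@fluidframework/common-definitions";
import { ChildLogger } from "@fluidframework/telemetry-utils";

// Hypothetical helper: wrap the container runtime's logger in a namespaced
// child logger before handing it to the election code.
function createElectionLogger(runtimeLogger: ITelemetryLogger): ITelemetryLogger {
    // Events sent through this logger pick up the "OrderedClientElection" namespace,
    // so they surface as
    // "fluid:telemetry:OrderedClientElection:InitialElectedClientNotFound"
    // rather than the bare "InitialElectedClientNotFound".
    return ChildLogger.create(runtimeLogger, "OrderedClientElection");
}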

arinwt (Contributor) commented Aug 4, 2021

After reviewing the telemetry, it looks like the bug is caused by an older summary (at seq x) being accepted while the server stamps it with protocol state at a later sequence number (x + k), which is the same sequence number as another, previously nacked summary attempt.

This needs to be fixed in the scribe lambda; rejecting summaries that are older than the current protocol state is a must. Maybe a regression from #934? I still see similar code that now lives in SummaryWriter, but I'm not sure what other server implementations use.

Additionally, I think we should fail faster by recording the runtime's reference sequence number in the .metadata blob and failing immediately on load if it doesn't match the one in the .protocol tree. Opened issue #7002 for this, and PR #7015 to address it.
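
As a rough sketch of the fail-fast idea (the names below are hypothetical, not taken from PR #7015): record the runtime's reference sequence number in the .metadata blob at summary time, then compare it against the .protocol tree's sequence number at load and fail immediately on mismatch.

// Hypothetical shape of the data recorded in the ".metadata" blob.
interface RuntimeSummaryMetadata {
    // Reference sequence number the runtime had when it generated the summary.
    summaryRefSeqNumber: number;
}

// Validate snapshot consistency at load time.
function validateSnapshotConsistency(
    metadata: RuntimeSummaryMetadata,
    protocolSequenceNumber: number, // sequence number stored in the ".protocol" tree
): void {
    if (metadata.summaryRefSeqNumber !== protocolSequenceNumber) {
        // The server stamped the summary with protocol state from a different
        // sequence number, so the snapshot is internally inconsistent. Failing
        // here is clearer than hitting errors like InitialElectedClientNotFound later.
        throw new Error(
            `Runtime refSeq ${metadata.summaryRefSeqNumber} does not match ` +
            `protocol sequence number ${protocolSequenceNumber}`);
    }
}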

vladsud mentioned this issue on Aug 6, 2021
pleath (Contributor) commented Aug 6, 2021

Is there something particular about the way the stress test issues summary nacks that could be bypassing the original fix in the scribe lambda? Or is it more likely that something (a race?) is causing that fix not to work?

pleath modified the milestones: August 2021 → September 2021 on Sep 1, 2021
pleath (Contributor) commented Sep 9, 2021

This seems to have dropped off the radar. No hits in current Kusto data.

pleath modified the milestones: September 2021 → Next on Sep 9, 2021
vladsud (Contributor, Author) commented Sep 28, 2021

There are no events in Prod, our scalability tests, or ODSP scalability tests.
All issues are resolved thanks to the latest set of improvements to the summarizer logic, including the reliable last-summary flow.

vladsud closed this as completed on Sep 28, 2021