Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summarizer telemetry and cleanup [0.10] #327

Merged
merged 11 commits into from
Oct 14, 2019

Conversation

arinwt
Copy link
Contributor

@arinwt arinwt commented Oct 12, 2019

Several small fixes to improve telemetry around summaries.

  1. Telemetry Precision: Add initialSequenceNumber to DeltaManager, which tracks the reference sequence number at the time that the protocol handler was set. This is useful to know what op we initially loaded from and is used by the summarizer as the initial value of lastSummarySequenceNumber.
  2. Add Telemetry: When summarizing from idle, there is a message "idle" that gets added to the telemetry event. Added "maxTime" and "maxOps" for when the summary was triggered from maximum active timeout and maximum number of ops respectively. Previously these were empty string.
  3. Add Telemetry: Add a performance event around the generateSummary() call in ContainerRuntime. This resembles the one in Summarizer, but has more information and stages. It reports progress after each step: "start", "loadedVersions", "generatedTree", "uploadedTree", and "end" (submitted op). It also includes the fullTree argument in the start event, and the handle in the end.
  4. Performance/Functionality: Use actual timer instead of op event for the timeout when waiting for the summary ack. This way it will timeout even if no ops are coming in.
  5. Add Telemetry: Add a timer for summarizing timeout which can fire multiple times, backing off, up to a max. It will not cause an error, but will report events to telemetry. After 20 sec it will trigger the first time, then after 40 more sec it will trigger again, then continue to double up to 5 times (20 + 40 + 80 + 160 + 320 = 620 sec total).
  6. Refactor: consolidated the various timers in Summarizer to use a Timer class which basically just reuses the wrapper functions around clearTimeout and setTimeout acting like a single object.
  7. Performance/Functionality: Prevent premature disconnects which lead to the runDeferred resolving too early. Now the Summarizer will try to make sure that it has at least been connected once before letting a disconnect event cause the runDeferred to complete. This was needed because the Container propagates its connection state after loading, which sometimes can be disconnected or connecting (or connected for readonly), but it is a race condition. We should consider removing this initial propagation altogether.
  8. Performance/Functionality: After runDeferred completes, now it will do some cleanup: clearing the various timeouts and resetting some of the state. (A summarizer can still not be reran though).
  9. Bug Fix: Change the summary configuration argument to a getter so we can defer getting it from the ContainerRuntime until after we are surely connected. It was a race condition before, causing it to use the default sometimes.

@arinwt arinwt requested a review from vladsud October 12, 2019 00:21
@vladsud
Copy link
Contributor

vladsud commented Oct 13, 2019

nice description! #Resolved

// initialize values (not exact)
this.lastSummarySeqNumber = this.runtime.deltaManager.referenceSequenceNumber;
this.lastSummarySeqNumber = this.runtime.deltaManager.initialSequenceNumber;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastSummarySeqNumber [](start = 13, length = 20)

Something to consider:
We could update it when we see summaryAck while processing ops (on the way to become connected), by asking protocol layer for latest ack sequence number
That way we do not need to make changes in DeltaManager.
And I think it will make it more correct when summarizer is run not on document load, but when another summarizer exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, but it would be a little tedious to implement in the current state, since we'd need to get it from the summary op (not ack), meaning we'd need to pair up all summary ops to acks/nacks all the time, not just while pending. I think if we do this, we should add the reference sequence number to the summary ack op, and work to bypass listening for summary ops altogether, which would require a little more info in the summary ack op.

I think right now, we should only switch to this method if the initialSequenceNumber in DeltaManager is not useful anywhere else. If it has potential to be useful, we might as well keep it there. If not, I'm all for keeping DeltaManager as simple as possible.
And for now, there is little difference between document load and another summarizer exiting, because summarizer clients spin up only when elected summarizer. Unless a summarizer happens to start before the previous one from another client is acked (fairly unlikely).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful as a concept. We just need to be aware that telemetry we are producing might be not very accurate.
It will likely create false positives (signal that we are not summarizing too often) and we will need to fix it to be able to see past it.


In reply to: 334381006 [](ancestors = 334381006)

Copy link
Contributor

@vladsud vladsud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@arinwt arinwt merged commit c30627c into microsoft:release/0.10 Oct 14, 2019
@arinwt arinwt deleted the summary-telemetry branch October 14, 2019 22:55
@arinwt arinwt changed the title Summarizer telemetry and cleanup Summarizer telemetry and cleanup [0.10] Oct 14, 2019
arinwt pushed a commit that referenced this pull request Oct 15, 2019
* Summary cleanup [0.10] (#244)

* Add logger to SummaryManager; move ITreeEntry implementations to core-utils package

* Cleanup SummaryManager

* Fix tests

* Export summaryTreeConverter

* Rename properties on container sublogger

* Do not try to submit summary op while disconnected [0.10] (#263)

* Do not try to submit summary op while disconnected

* Move comment to correct place

* Add more checks for disconnected during generate summary

* Prevent summarizer from pending timeout while still generating summary (#276)

* Summarizer telemetry and cleanup (#327)

* Add initialSequenceNumber to DeltaManager and use it for summarizer

* Add message for summary reason

* Add more wrapping performance events around both summarize and generateSummary

* Add progress steps to generate summary perf event

* Add timer for pending timeout and summarize timeout and prevent disconnect too early

* Wait and fetch summaryConfiguration after we are connected

* Fix unit tests

* Small PR changes

* Return early if disconnected before run and add telemetry event

* Rename and do not wait for connected if stopped.

* Only realize component if changed while summarizing (#328)

* Only realize component if changed while summarizing

* Elaborate on comments

* Fix tslint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants