Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More telemetry, asserts and some recovery for cases where client is not up to date #7422

Closed
wants to merge 3 commits into from

Conversation

vladsud
Copy link
Contributor

@vladsud vladsud commented Sep 8, 2021

More progress on understanding Issue #7312:

  1. A bunch of asserts to ensure correctness, mostly around fetching ops and getting sequential, no duplicate results.
  2. More telemetry around where we fetch ops ("GetOps") and also when we get into trouble with not being current.
  3. Smallish bug in ops caching that might have result in returning result without first requested op.
  4. Changed where we short-circuit code when dealing with closed & no-handler conditions - pushed it out for most of the code to not care.
  5. Biggest change is in processPendingOps() to detect that client is behind, and fetch ops proactively.

@vladsud vladsud requested a review from jatgarg September 8, 2021 23:39
@github-actions github-actions bot added area: driver Driver related issues area: loader Loader related issues area: odsp-driver labels Sep 8, 2021
@@ -137,44 +137,55 @@ export class OdspDeltaStorageWithCache implements IDocumentDeltaStorageService {
let opsFromCache = 0;
let opsFromStorage = 0;

const stream = requestOps(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there are no changes here (code moved around and indentation change). Other than added asserts on lines 183+.

@@ -135,9 +135,9 @@ export class OpsCache {
break;
}
if (messages.length === 0) {
if (op.sequenceNumber > from + 1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: That's defiantly a bug - it treated "from" as exclusive boundary, while all the code has been switched to treat "from" as inclusive. Added asserts in other layers to catch issues like that, such that we can find them sooner (and validate this code change as well).

// actionably to report gaps in this range.
this.enqueueMessages(pendingSorted, `${reason}_pending`, true /* allowGaps */);

// See issue #7312 for more details
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is the main chance in behavior.
This code get hit a lot (in stress tests) in "read" mode and that condition is describe in comments.
I hit it only once for "write" connection and that's where I believe we have the key to understanding better the underlying problem that is described in 7312. This new code will help "fix" about 50% of cases where we do not see join op for a long time, but telemetry we get in such cases should have a key to why we run into these problems, and hopefully - path to full fix.

@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label Sep 9, 2021
@msfluid-bot
Copy link
Collaborator

@fluidframework/base-host: +632 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
main.js 186.29 KB 186.91 KB +632 Bytes
Total Size 213.88 KB 214.5 KB +632 Bytes
@fluid-example/bundle-size-tests: +665 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
container.js 162.13 KB 162.13 KB No change
map.js 44.74 KB 44.74 KB No change
matrix.js 143.54 KB 143.54 KB No change
odspDriver.js 174.42 KB 175.06 KB +665 Bytes
odspPrefetchSnapshot.js 38.95 KB 38.95 KB No change
sharedString.js 165.12 KB 165.12 KB No change
Total Size 761.58 KB 762.23 KB +665 Bytes

Baseline commit: d4b1448

Generated by 🚫 dangerJS against 8c34b22

This was referenced Sep 9, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Sep 9, 2021

This PR has been split into 4 individual PRs, closing

@vladsud vladsud closed this Sep 9, 2021
@vladsud vladsud deleted the ConnectionRecovery branch September 9, 2021 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: driver Driver related issues area: loader Loader related issues area: odsp-driver area: tests Tests to add, test infrastructure improvements, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants