Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design/change how container is returned during loader.resolve() #4519

Closed
jatgarg opened this issue Dec 3, 2020 · 2 comments
Closed

Design/change how container is returned during loader.resolve() #4519

jatgarg opened this issue Dec 3, 2020 · 2 comments
Assignees
Labels
api area: loader Loader related issues
Milestone

Comments

@jatgarg
Copy link
Contributor

jatgarg commented Dec 3, 2020

Describe the bug

After the infinite blob/tree fetch retry for storage we need to change how we return the container to the app because app needs to know what is happening with the container when it is busy retrying fetching during load.
Today, policy of Container only retry after we've done loading (i.e. Container.load() returned and Container.loaded === true).
We do not retry and fail immediately while within Container.load. This is sort of required today as user experience would be not great if we loop forever waiting for bits and pieces while client is offline - hosting app has no object to check status / cancel operations, as we return Container form loader.resolve() only after Container.load() is done. We either need to maintain this invariant, or change when / how Container is returned to client. I think later is better, but we need to design it separately.

#4427

@jatgarg jatgarg added the bug Something isn't working label Dec 3, 2020
@jatgarg jatgarg added this to the January 2021 milestone Dec 3, 2020
@jatgarg jatgarg self-assigned this Dec 3, 2020
@ghost ghost added the triage label Dec 3, 2020
@curtisman curtisman added api area: loader Loader related issues and removed bug Something isn't working triage labels Dec 3, 2020
@ghost ghost added the triage label Dec 3, 2020
@curtisman curtisman removed the triage label Dec 3, 2020
@vladsud vladsud modified the milestones: January 2021, February 2021 Dec 17, 2020
@vladsud vladsud modified the milestones: February 2021, March 2021 Jan 6, 2021
@vladsud vladsud modified the milestones: March 2021, Future Jan 20, 2021
vladsud added a commit to vladsud/FluidFramework that referenced this issue Feb 12, 2021
Today, we return container only after we are done loading snapshot. That's too late for these scenarios:
- we hit 429 and keep retrying. There is nowhere to report progress or cancel as host does not have container yet
- People register for "connected" and "disconnected" events but never check initial state (Container.connected), assuming that is is always disconnected. Similar problem exists in other places (and more places will be in same bucket, like dirty flag). Ideally initial state of container is always pre-defined, but we also notify users of it by calling appropriate handler on registration.
- We have no good point in time to console.error if someone did not register for "close" - doing it when we raised an error is a bit too late (this is next follow up)

This solution allows us to return shell container object (not hitting anything that can fail) much earlier, to give a chance for host to put all handlers, and then continue loading.
@vladsud
Copy link
Contributor

vladsud commented May 27, 2021

So some history:
We added infinite retries for all storage operations, as long as errors are retryable.
That caused a serious bug in ODSP as TreesLatest call is made only once, and then (on retry) we fallback to getVersion, getSnapshotsTree and loading blobs one by one, tanking perf and possibly breaking document.
So code was changed not to retry on that path, plus I changed code in TreesLatest such that if we ever come back to infinite retries, it will keep retrying TreesLatest.
On top of that, stress tests showed that doing at least one retry helps improve substantially ability to load container if machine under stress.

That all said, it's really not good practice to have network (or any other blocking) calls without ability to cancel and track progress.
I believe #5114 (Loader Event on new Container resolve/request) is the the answer - that item IMHO is more about enforcing same policy across containers under loader, I'd not use it as a way to expose new container to caller.
I think returning container sooner (i.e. before any single network call (i.e. PR #5141) is a better way of doing it.

Note that this helps a lot (and resolves this issue), but we also should holistically look into overall problem that is tracked by issue #5192 (Add cancelation tokens to read/write blob APIs)

@vladsud
Copy link
Contributor

vladsud commented Jun 21, 2022

Closing old issues.

@vladsud vladsud closed this as completed Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api area: loader Loader related issues
Projects
None yet
Development

No branches or pull requests

3 participants