Design/change how container is returned during loader.resolve() #4519

jatgarg · 2020-12-03T19:20:28Z

Describe the bug

After the infinite blob/tree fetch retry for storage we need to change how we return the container to the app because app needs to know what is happening with the container when it is busy retrying fetching during load.
Today, policy of Container only retry after we've done loading (i.e. Container.load() returned and Container.loaded === true).
We do not retry and fail immediately while within Container.load. This is sort of required today as user experience would be not great if we loop forever waiting for bits and pieces while client is offline - hosting app has no object to check status / cancel operations, as we return Container form loader.resolve() only after Container.load() is done. We either need to maintain this invariant, or change when / how Container is returned to client. I think later is better, but we need to design it separately.

#4427

Today, we return container only after we are done loading snapshot. That's too late for these scenarios: - we hit 429 and keep retrying. There is nowhere to report progress or cancel as host does not have container yet - People register for "connected" and "disconnected" events but never check initial state (Container.connected), assuming that is is always disconnected. Similar problem exists in other places (and more places will be in same bucket, like dirty flag). Ideally initial state of container is always pre-defined, but we also notify users of it by calling appropriate handler on registration. - We have no good point in time to console.error if someone did not register for "close" - doing it when we raised an error is a bit too late (this is next follow up) This solution allows us to return shell container object (not hitting anything that can fail) much earlier, to give a chance for host to put all handlers, and then continue loading.

vladsud · 2021-05-27T19:31:49Z

So some history:
We added infinite retries for all storage operations, as long as errors are retryable.
That caused a serious bug in ODSP as TreesLatest call is made only once, and then (on retry) we fallback to getVersion, getSnapshotsTree and loading blobs one by one, tanking perf and possibly breaking document.
So code was changed not to retry on that path, plus I changed code in TreesLatest such that if we ever come back to infinite retries, it will keep retrying TreesLatest.
On top of that, stress tests showed that doing at least one retry helps improve substantially ability to load container if machine under stress.

That all said, it's really not good practice to have network (or any other blocking) calls without ability to cancel and track progress.
I believe #5114 (Loader Event on new Container resolve/request) is the the answer - that item IMHO is more about enforcing same policy across containers under loader, I'd not use it as a way to expose new container to caller.
I think returning container sooner (i.e. before any single network call (i.e. PR #5141) is a better way of doing it.

Note that this helps a lot (and resolves this issue), but we also should holistically look into overall problem that is tracked by issue #5192 (Add cancelation tokens to read/write blob APIs)

vladsud · 2022-06-21T04:08:48Z

Closing old issues.

jatgarg added the bug Something isn't working label Dec 3, 2020

jatgarg added this to the January 2021 milestone Dec 3, 2020

jatgarg self-assigned this Dec 3, 2020

ghost added the triage label Dec 3, 2020

jatgarg mentioned this issue Dec 3, 2020

Implement retry logic for blob & tree reads in runtime #3963

Closed

curtisman added api area: loader Loader related issues and removed bug Something isn't working triage labels Dec 3, 2020

ghost added the triage label Dec 3, 2020

curtisman removed the triage label Dec 3, 2020

vladsud modified the milestones: January 2021, February 2021 Dec 17, 2020

vladsud modified the milestones: February 2021, March 2021 Jan 6, 2021

vladsud modified the milestones: March 2021, Future Jan 20, 2021

jatgarg assigned vladsud Feb 17, 2021

vladsud mentioned this issue May 27, 2021

Collection of 1.0 items to consider #6272

Closed

61 tasks

vladsud closed this as completed Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design/change how container is returned during loader.resolve() #4519

Design/change how container is returned during loader.resolve() #4519

jatgarg commented Dec 3, 2020

vladsud commented May 27, 2021

vladsud commented Jun 21, 2022

Design/change how container is returned during loader.resolve() #4519

Design/change how container is returned during loader.resolve() #4519

Comments

jatgarg commented Dec 3, 2020

Describe the bug

vladsud commented May 27, 2021

vladsud commented Jun 21, 2022