Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RetriableDocumentStorageService is not handling retries correctly #4861

Closed
vladsud opened this issue Jan 21, 2021 · 2 comments · Fixed by #4876
Closed

RetriableDocumentStorageService is not handling retries correctly #4861

vladsud opened this issue Jan 21, 2021 · 2 comments · Fixed by #4876
Assignees
Labels
area: driver Driver related issues bug Something isn't working
Milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Jan 21, 2021

APIs know to be broken:

  • RetriableDocumentStorageService.write
  • RetriableDocumentStorageService.getVersion
  • RetriableDocumentStorageService.writeSummaryTree

In particular, we do not handle 429s correctly, or any other retryable error.

Can we move away from using DocumentStorageServiceProxy here? It hides details, it's not obvious on initial review what APIs are wrapped vs. not.
As other new APIs would get added, developers would keep missing the need to change that class.

Note that while it's right place to solve these issues, it creates another problem - too much noise in telemetry when we are offline. That's because actual driver calls will generate events, and if we call them every 8 seconds, they will generate a ton of data for couple hours of being offline (all of that data is cached and flushed to Aria on connection).
We likely need to create a separate issue to think about solution to this problem.
For example, op fetching does not log anything in the driver, DeltaManager is responsible for all logging. Maybe we need to adopt similar thing and say that RetriableDocumentStorageService is doing all loging?

@vladsud vladsud added the bug Something isn't working label Jan 21, 2021
@vladsud vladsud added this to the January 2021 milestone Jan 21, 2021
@ghost ghost added the triage label Jan 21, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Jan 21, 2021

Also we need to ensure there is telemetry in place (number of attempts, total duration, if possible - last error we deal with).
I do not remember if we have dedicated issue to just that. If not, would be great to take care of it here.

@vladsud vladsud changed the title RetriableDocumentStorageService.write() is not handling retries correctly RetriableDocumentStorageService is not handling retries correctly Jan 21, 2021
@curtisman curtisman added area: driver Driver related issues and removed triage labels Jan 21, 2021
@jatgarg
Copy link
Contributor

jatgarg commented Jan 21, 2021

Moving the telemetry part to other issue: #4855

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: driver Driver related issues bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants