[Scheduled Reports] error handling (timeouts, errors, kbn restarts...) #75603

Closed
tsullivan opened this issue Aug 20, 2020 · 2 comments · Fixed by #64853
Labels
(Deprecated) Feature:Reporting (use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead), discuss

Comments

@tsullivan
Member

tsullivan commented Aug 20, 2020

Note: Start with #75605, which contains info about:

  • async workflows in Reporting
  • the reporting:execute type of task
  • export type run functions.

Problem statement:
If a report execution fails, the retry should be triggered immediately. Task Manager schedules retries with an exponential backoff for timeouts, which would create a bad user experience when someone has manually requested a report.

Proposed solution:
Reporting will handle tracking retries and marking jobs as failed when the number of retries reaches the maximum.

How it would work:
We can preserve the current ESQueue-like retry behavior when switching Reporting to Task Manager. Our reporting:execute run function will do the following:

  • Catch any errors from the export type's run function on its own
  • Using the ES document in the reporting system index for state, decide whether to increment the attempts and try again, or fail the job.
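The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not the actual Reporting code: the `ReportState` shape, the field names `attempts` and `maxAttempts`, and the `runExport` callback are all hypothetical stand-ins for the document in the reporting index and the export type's run function.

```typescript
// Hypothetical shape of the queue state kept on a report document.
interface ReportState {
  status: 'pending' | 'processing' | 'completed' | 'failed';
  attempts: number; // executions started so far
  maxAttempts: number; // assumed configurable limit
}

type Decision = { action: 'retry' | 'fail'; next: ReportState };

// Decide what to do after the export type's run function has thrown.
// `doc.attempts` already includes the attempt that just failed.
function decideAfterError(doc: ReportState): Decision {
  if (doc.attempts < doc.maxAttempts) {
    return { action: 'retry', next: { ...doc, status: 'pending' } };
  }
  return { action: 'fail', next: { ...doc, status: 'failed' } };
}

// Run the export, retrying immediately (no backoff) until it succeeds
// or the attempts are used up. `runExport` is a hypothetical callback.
async function executeWithRetries(
  doc: ReportState,
  runExport: () => Promise<void>
): Promise<ReportState> {
  let state = doc;
  while (true) {
    state = { ...state, status: 'processing', attempts: state.attempts + 1 };
    try {
      await runExport();
      return { ...state, status: 'completed' };
    } catch (err) {
      const decision = decideAfterError(state);
      state = decision.next;
      if (decision.action === 'fail') {
        return state;
      }
      // Otherwise loop around and try again right away.
    }
  }
}
```

In a real implementation each state change would be written back to the reporting index, so that the document stays the source of truth for the job.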

Why it makes sense
The reporting index already contains all the reports, with fields that describe their state in a queue: number of attempts, time that processing jobs expire, etc. By continuing to use those documents to describe the state in the queue, we can preserve the behavior to retry immediately in case of an error.

What are the risks

  1. If the Kibana server crashes during job execution, a "processing" document will be sitting in the reporting index with no way of it being found and retried. The document does have an "expiration time" field, so there could be a secondary task registered by the Reporting plugin to search for these "stuck" documents.
  2. As Reporting usage grows for the scheduling use case, it may decrease for the ad-hoc use case. In that case, having an exponential backoff time for retries might make more sense.
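For the first risk, the check a secondary task might perform could look something like this. It is a minimal sketch over in-memory objects: the `QueuedReport` shape and the `processExpiration` field name are assumptions made for illustration, and a real task would query the reporting index instead of filtering an array.

```typescript
// Hypothetical subset of a report document's fields.
interface QueuedReport {
  id: string;
  status: string;
  processExpiration: string; // assumed ISO 8601 timestamp
}

// A report is treated as "stuck" if it is still marked as processing
// after its expiration time has passed.
function findStuckReports(reports: QueuedReport[], now: Date): QueuedReport[] {
  return reports.filter(
    (report) =>
      report.status === 'processing' &&
      new Date(report.processExpiration).getTime() < now.getTime()
  );
}
```

Documents returned by such a check could then go through the same retry-or-fail decision as a job whose run function threw an error.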

Alternative options
We could work with the Task Manager owners on an enhancement that would let us override its retry logic and not use exponential backoff. Doing so would avoid ending up in a state where a server crash leaves processing jobs "stuck", as Task Manager would hold on to those tasks and run the retries.

@elasticmachine
Contributor

Pinging @elastic/kibana-reporting-services (Team:Reporting Services)

@tsullivan tsullivan added WIP Work in progress and removed WIP Work in progress labels Aug 21, 2020
@tsullivan tsullivan added (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead Team:AppServices and removed Team:Reporting Services labels Dec 17, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app-services (Team:AppServices)
