[Scheduled Reports] error handling (timeouts, errors, kbn restarts...) #75603

tsullivan · 2020-08-20T20:23:29Z

Note: Start with #75605, which contains info about:

async workflows in Reporting
the reporting:execute type of task
export type run functions.

Problem statement:
If a report execution fails, the retry should be triggered immediately. Task Manager schedules retries with an exponential backoff for timeouts, which would create a bad user experience when someone has manually requested a report.

Proposed solution:
Reporting will handle tracking retries and marking jobs as failed when the number of retries reaches the maximum.

How it would work:
We can preserve the current ESQueue-like retry behavior when switching Reporting to Task Manager. Our reporting:execute run function will do the following:

Catch any errors thrown by the export type's run function itself.
Using the ES document in the reporting system index for state, decide whether to increment the attempts and try again, or fail the job.
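
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Reporting code: the `ReportDoc` shape, the `maxAttempts` field, and the `decideAfterError` and `executeReport` names are all hypothetical, and a real implementation would persist each state change to the reporting index instead of passing objects around in memory.

```typescript
// Hypothetical shape of the queue state stored in the reporting index document.
interface ReportDoc {
  id: string;
  status: 'pending' | 'processing' | 'completed' | 'failed';
  attempts: number; // how many times execution has been started
  maxAttempts: number; // assumed per-job retry limit
}

type NextStep = 'retry' | 'fail';

// Decide what to do after a failed attempt, using only the document state.
function decideAfterError(doc: ReportDoc): NextStep {
  return doc.attempts < doc.maxAttempts ? 'retry' : 'fail';
}

// Run the export type's function and catch its errors here, so the retry
// decision is made by Reporting rather than by Task Manager's backoff.
async function executeReport(
  doc: ReportDoc,
  runExport: (doc: ReportDoc) => Promise<void>
): Promise<ReportDoc> {
  let current: ReportDoc = { ...doc };
  while (true) {
    // Claim the job: mark it as processing and count the attempt.
    current = { ...current, status: 'processing', attempts: current.attempts + 1 };
    try {
      await runExport(current);
      return { ...current, status: 'completed' };
    } catch (err) {
      if (decideAfterError(current) === 'fail') {
        return { ...current, status: 'failed' };
      }
      // Otherwise fall through and retry immediately, with no backoff.
    }
  }
}
```

With `maxAttempts` set to 3, a job whose run function always throws would be attempted three times and then marked as failed.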

Why it makes sense
We already have the reporting index, which contains all the reports and has fields that describe their state in a queue: number of attempts, the time at which processing jobs expire, etc. By continuing to use those documents to describe the state in the queue, we can preserve the behavior to retry immediately in case of an error.

What are the risks

If the Kibana server crashes during job execution, a "processing" document will be sitting in the reporting index with no way of it being found and retried. The document does have an "expiration time" field, so there could be a secondary task registered by the Reporting plugin to search for these "stuck" documents.
As Reporting usage grows for the scheduling use case, ad-hoc usage may decrease. In that case, having an exponential backoff time for retries might make more sense.
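
For the first risk, the check that a secondary "stuck job" task might perform could look something like the sketch below. The `ProcessingDoc` shape, the `expiresAt` field name, and the `findStuckReports` function are assumptions made for illustration; the real field names in the reporting index may differ, and a real task would query Elasticsearch rather than filter an in-memory array.

```typescript
// Hypothetical subset of a reporting index document.
interface ProcessingDoc {
  id: string;
  status: string;
  expiresAt: string; // ISO 8601 timestamp; assumed name for the "expiration time" field
}

// Return documents still marked "processing" whose expiration time has passed.
// These are the candidates to retry or to mark as failed.
function findStuckReports(docs: ProcessingDoc[], now: Date): ProcessingDoc[] {
  return docs.filter(
    (doc) => doc.status === 'processing' && Date.parse(doc.expiresAt) < now.getTime()
  );
}
```

A document that is still within its expiration window is left alone, since another Kibana instance may legitimately be working on it.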

Alternative options
We could work with the Task Manager owners on an enhancement that would let us override its retry logic and not use exponential backoff. Doing so would avoid ending up in a state where a server crash leaves processing jobs "stuck", as Task Manager would hold on to those tasks and run the retries.

elasticmachine · 2020-08-20T21:58:54Z

Pinging @elastic/kibana-reporting-services (Team:Reporting Services)

elasticmachine · 2020-12-17T18:39:42Z

Pinging @elastic/kibana-app-services (Team:AppServices)

tsullivan mentioned this issue Aug 20, 2020

[Scheduled Reports] Handling create/execute jobs with scheduling in mind #75605

Closed

tsullivan added discuss Team:Reporting Services labels Aug 20, 2020

tsullivan mentioned this issue Aug 20, 2020

Switch Reporting to Task Manager #64853

Merged

7 tasks

tsullivan added WIP Work in progress and removed WIP Work in progress labels Aug 21, 2020

tsullivan added (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead Team:AppServices and removed Team:Reporting Services labels Dec 17, 2020

tsullivan added the Team:Reporting Services label Feb 25, 2021

tsullivan closed this as completed in #64853 Mar 9, 2021
