Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch Reporting to Task Manager #64853

Merged
merged 35 commits into from
Mar 9, 2021

Conversation

tsullivan
Copy link
Member

@tsullivan tsullivan commented Apr 29, 2020

Summary

Remove the ES-Queue sub-service of Reporting and replace the functionality with Task Manager integration.

Part of #53900

Closes #69340
Closes #75603
Closes #75605
Closes #76411
Closes #87074

Blocked on:

Summary

Using Task Manager for Reporting can be done by registering 2 new task types:

  • Execute Report Task
  • Monitor Reports Task

Execute Report Task
The Execute Report task is a "one-off" task that Task Manager will trigger as soon as it is executed. The Reporting API that handles requests to generate new reports will be responsible for scheduling these tasks, and also storing a historical document for the report output in the .reporting-* indices. The scheduled task references this placeholder report document.\

This design splits report execution into 2 phases for async report generation:

  • The schedule task phase uses logic migrated from the ESQueue "create job" phase, and calls the CreateJobFn of an Export Type.
    • Enrich the parameters provided from the Generation API into job parameters for an export type.
    • Store a historical document for a "Pending" report job.
    • New: schedule a task with Task Manager to asynchronously run the next phase
  • The run task phase uses logic migrated from the the "run task" phase in ESQueue, and calls the RunTaskFn of an Export type.
    • set the status of the historical document is set from "Pending" to "Processing."
    • generate the report output
    • save the output into the historical document, and set the status to "Completed"

Generate Report: Create Job Phase:
Reporting Create Job Flowchart


Generate Report: Run Task Phase:
Reporting Run Task Flowchart


Monitor for Pending Reports Task
It is possible for a task in the run task phase to fail without rescheduling itself. This will happen in situations such as:

  • The Kibana server restarts in the middle of running an Execute Report task
  • Report jobs were queued with ESQueue before a stack upgrade, and still have "Pending" status.
  • Unexpected situations (server crashes, task data gets deleted accidentally, etc)

ESQueue was able to self-heal against this problem since it used the .reporting-* as the queue index and the historical storage index. That means "stuck" jobs pending would be discovered in the historical data in the same cycle ESQueue uses to find queued work that needs to be claimed. ESQueue would immediately claim the stuck/expired jobs it found.

In Task Manager, we will need to define a 2nd task that discovers stuck jobs, by running searches on "Pending" or "Processing" jobs in the historical data and filtering on the expiration_time field. Unlike ESQueue, it will not immediately claim these jobs once discovered; instead it will re-schedule the execute report task using the job parameters in the historical document.


Monitor for Pending Reports:
Reporting Monitor for Pending Job Flowchart


Checklist

Delete any items that are not applicable to this PR.

For maintainers

Release Note for breaking change

Make sure there are no pending reporting jobs before upgrading to 8.0. If there are any, they will remain in pending state indefinitely.

@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch 2 times, most recently from 215e3b7 to 6b4a034 Compare May 5, 2020 19:37
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch from 323fac3 to 347f61a Compare June 29, 2020 23:39
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch 2 times, most recently from 694524a to c9fed97 Compare July 8, 2020 00:00
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch from c9fed97 to 848bae9 Compare July 9, 2020 17:30
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch 2 times, most recently from c796f3f to d17e97a Compare July 15, 2020 18:45
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch 2 times, most recently from 65b569a to 07dae6f Compare August 3, 2020 21:40
@tsullivan tsullivan force-pushed the reporting/switch-to-tm branch 5 times, most recently from ba31b8a to 8af9656 Compare August 6, 2020 00:30
Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the TM code and it looks right.
I started Kibana and saw that the Execute Report task did in fact have the limited concurrency expected.

I haven't tested that reports are generated or anything along those lines.

@tsullivan
Copy link
Member Author

@streamich shared logs that appeared as though reporting jobs are running twice, so I'm holding off on this for a bit while I investigate.

@tsullivan
Copy link
Member Author

@elasticmachine merge upstream

Copy link
Contributor

@streamich streamich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested on Mac/Chrome, everything seems to work, could not reproduce the errors I was seeing a few weeks ago.

@tsullivan
Copy link
Member Author

This PR adds a breaking change: if you have pending reports in the cluster, they will not be continued after upgrading to a version with these changes.

Because of this fact, @kobelb and I have chatted and decided it's best for this change to roll out in a major version. Therefore this PR will not be backported to 7.x.

@tsullivan tsullivan merged commit 9fef424 into elastic:master Mar 9, 2021
@tsullivan tsullivan deleted the reporting/switch-to-tm branch March 9, 2021 00:33
@tsullivan tsullivan added backport:skip This commit does not require backporting release_note:breaking and removed v7.12.0 release_note:skip Skip the PR/issue when compiling release notes labels Mar 9, 2021
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
reporting 95.9KB 96.2KB +328.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead release_note:breaking v7.14.0 v8.0.0
Projects
None yet
9 participants