Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task manager enhancements for error handling in alerting and actions #39829

Merged
merged 62 commits into from
Jul 25, 2019

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Jun 27, 2019

In this PR, we're making the following changes to the task manager to support having better error handling for alerting and actions:

  • Timeout configuration now supports seconds syntax (10s)
  • Interval configuration now supports seconds syntax (10s)
  • Definitions can now define their own maxAttempts configuration
  • Intervals get re-scheduled from when task started and not when it finished
  • maxAttempts has a minimum of 1
  • Definitions can define a getRetryDelay function to return delay in seconds to wait before attempting the task again
  • maxAttempts configuration now behaves like maximum total attempts and not how many retries after first failure
  • runAt is no longer extended by timeout when the task runs
  • Timed out tasks no longer retry right away, the next retry is the same calculation as if it the execution failed
  • Convert task manager to use saved objects

Fixes #40872

Dev-Docs

Task manager now uses saved objects

Starting up kibana will convert .kibana_task_manager to an alias and indices will follow the .kibana_task_manager_1 syntax. A migration will execute to prefix the ids with task: as it converts to saved objects.

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services

@mikecote mikecote added the release_note:skip Skip the PR/issue when compiling release notes label Jul 2, 2019
@elasticmachine

This comment has been minimized.

@elasticmachine

This comment has been minimized.

@chrisdavies
Copy link
Contributor

I didn't do a full review, but @mikecote asked me to have a look at the migration changes, and those look peachy. 💯

@tsullivan
Copy link
Member

Hi, sorry I need a little more time looking at this. I pulled it down, and getting a server crash, but there's a chance it's because my ES is messed up or other reasons.

Also, there are 2 conflicts that need to be resolved on this PR

@tsullivan
Copy link
Member

tsullivan commented Jul 19, 2019

I think I keep hitting stuck migrations, even when I clear all my documents out.

red   open .kibana_task_manager           KGvBhGvrRxChRbBaY0aQPw 1 0                    
red   open .triggered_watches             B2uluxE8RkuuFQN9dxu4pw 1 0                    
green open .reporting-2019.07.07          WnwbjZh5TWmPcgr_LyyQ-A 1 0  0  0   282b   282b
red   open .kibana_1                      AO6sVyQIQ9GpMe51C3coqw 1 0                    
green open .tasks                         DkIurr8qR-uhX0dLVewoQw 1 0  1  0  6.7kb  6.7kb
red   open babynames                      khbAdNhvQnaOWDqNFHC5aw 2 0                    
green open .security-7                    SUKdf30cSbGyXmI7jyCQRA 1 0 35 12 29.5kb 29.5kb
red   open .watcher-history-10-2019.07.18 nNQ41CvOSpaUQjiObCRowA 1 0                    
green open .kibana_task_manager_2         zTQ4tZfOQ6WQX2Um4nvfdA 1 0  0  0   283b   283b
red   open .watcher-history-10-2019.07.15 _DzsaVRJT7OYL4KOpjTSNQ 1 0                    
green open .kibana_task_manager_1         2gEgoAhtR1SLLmbQmpCW6g 1 0  0  0   283b   283b
red   open .reporting-2019.07.14          CmTknvxjRMydEv_FK4cCEQ 1 0                    
red   open .watcher-history-10-2019.07.17 ZO8b43n2RvmnbqOxwo-BOQ 1 0                    
red   open my_index                       3yIRlLUtTsS8DzEQS0FTaA 1 1                    
red   open .watcher-history-10-2019.07.16 IUc0KYeaRv6IltYjJiVEWg 1 0                    
red   open .watcher-history-10-2019.07.13 DiFOHVVySaKQMvt8vujhcw 1 0                    
red   open .watcher-history-10-2019.07.12 JsicjQ55SUq8xZ53MlBTYw 1 0                    
red   open metricbeat-8.0.0               Krh9VIqVSGeQRc0tjytG4g 1 1                    
red   open .watches                       smFKpBwFQ2-bc2ZTRmRHxQ 1 0                    

I can fix the problems with missing shards in the indices (hence why they are red) but for now, this is causing Kibana to completely crash. I'll need to look a little more to see if this is just a distraction

@elasticmachine

This comment has been minimized.

Copy link
Member

@tsullivan tsullivan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Looked at the changes from the latest commits to clear my "requested changes."

@joshdover
Copy link
Contributor

I'd feel way more comfortable if the SavedObjects/Migration changes were split into a separate PR from this one.

@mikecote
Copy link
Contributor Author

@joshdover I have moved it to another PR #41815, will update this one once merged. 👍

@elasticmachine

This comment has been minimized.

@mikecote mikecote removed the request for review from a team July 24, 2019 14:27
@mikecote
Copy link
Contributor Author

@joshdover The migration changes have been removed from this PR now that they are in master. Let me know if this looks good to you now.

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@mikecote mikecote merged commit 2cd3094 into elastic:master Jul 25, 2019
mikecote added a commit to mikecote/kibana that referenced this pull request Jul 26, 2019
…lastic#39829)

* Allow mtask definitions to overwrite default setting maxAttemps

* Leverage scheduledAt from task manager

* Treat maxAttempts like attempts and not retries

* Add support for second intervals

* Min 1 attempt

* Reverse relying on scheduledAt

* Add new startedAt attribute in task manager that keeps track when task started running

* Don't extend runAt when claiming a task

* Remove startedAt from state

* Attempt trying to define custom getBackpressureDelay function

* Pass error object to getBackpressureDelay

* Cleanup processResultForRecurringTask code

* Add backpressure to timed out tasks

* Change default timeout backpressure calculation

* getBackpressureDelay to return seconds instead of milliseconds

* Add comment for task store query

* Compress query

* Revert alert / actions specific code

* Add more interval tests

* Fix failing jest tests

* Fix test

* Add more unit tests

* Fix integration tests

* Fix sorting of tasks to process

* WIP

* Always provide error when getBackpressureDelay is called

* Rename getBackpressureDelay to getRetryDelay

* retryAt to be calculated from timeout time by default

* Remove invalid test

* Add unit tests

* Consider timeout before scheduling a retryAt

* Remove backpressure terminology

* Remove support for 0 based intervals and timeouts

* Apply PR feedback

* Fix last place using Math.abs

* Modify migrations to allow running a script when converting an index to an alias

* Convert task manager to use saved objects

* Fix broken test

* Fix broken tests pt1

* Remove index from task manager config schema

* Accept platform changes

* PR feedback

* Apply PR feedback

* Apply PR feedback pt2

* Apply PR feedback pt3

* Apply PR feedback pt4

* Fix feedback pt3

* Rename RawSavedObjectDoc to SavedObjectsRawDoc
mikecote added a commit that referenced this pull request Jul 26, 2019
…39829) (#42004)

* Allow mtask definitions to overwrite default setting maxAttemps

* Leverage scheduledAt from task manager

* Treat maxAttempts like attempts and not retries

* Add support for second intervals

* Min 1 attempt

* Reverse relying on scheduledAt

* Add new startedAt attribute in task manager that keeps track when task started running

* Don't extend runAt when claiming a task

* Remove startedAt from state

* Attempt trying to define custom getBackpressureDelay function

* Pass error object to getBackpressureDelay

* Cleanup processResultForRecurringTask code

* Add backpressure to timed out tasks

* Change default timeout backpressure calculation

* getBackpressureDelay to return seconds instead of milliseconds

* Add comment for task store query

* Compress query

* Revert alert / actions specific code

* Add more interval tests

* Fix failing jest tests

* Fix test

* Add more unit tests

* Fix integration tests

* Fix sorting of tasks to process

* WIP

* Always provide error when getBackpressureDelay is called

* Rename getBackpressureDelay to getRetryDelay

* retryAt to be calculated from timeout time by default

* Remove invalid test

* Add unit tests

* Consider timeout before scheduling a retryAt

* Remove backpressure terminology

* Remove support for 0 based intervals and timeouts

* Apply PR feedback

* Fix last place using Math.abs

* Modify migrations to allow running a script when converting an index to an alias

* Convert task manager to use saved objects

* Fix broken test

* Fix broken tests pt1

* Remove index from task manager config schema

* Accept platform changes

* PR feedback

* Apply PR feedback

* Apply PR feedback pt2

* Apply PR feedback pt3

* Apply PR feedback pt4

* Fix feedback pt3

* Rename RawSavedObjectDoc to SavedObjectsRawDoc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Task Manager release_note:plugin_api_changes Contains a Plugin API changes section for the breaking plugin API changes section. review v7.4.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task manager to have migrations
7 participants