[Response Ops][Task Manager] Add more functional tests for resource based task claiming #189111

ymao1 · 2024-07-24T17:16:47Z

Followup to #185043

Basic mget functional tests were updated so that they would pass but we should add more FT for claiming tasks with different costs. Not sure if there is a straightforward way of orchestrating specific tasks to claim but it would be interesting to try to add a functional test where a large cost task is left unclaimed due to capacity constraints and then claimed in the next cycle.

elasticmachine · 2024-07-24T17:16:48Z

Pinging @elastic/response-ops (Team:ResponseOps)

Resolves #185043 ## Summary ### Task types can define a `cost` associated with running it - Optional definition that defaults to `Normal` cost ### New `xpack.task_manager.capacity` setting - Previous `xpack.task_manager.max_workers` setting is deprecated, changed to optional, and a warning will be logged if used - New optional `xpack.task_manager.capacity` setting is added. This represents the number of normal cost tasks that can be run at one time. - When `xpack.task_manager.max_workers` is defined and `xpack.task_manager.capacity` is not defined, a deprecation warning is logged and the value for max workers will be used as the capacity value. - When `xpack.task_manager.capacity` is defined and `xpack.task_manager.max_workers` is not defined, the capacity value will be used. For the `default` claiming strategy, this capacity value will be used as the `max_workers` value - When both values are set, a warning will be logged and the value for `xpack.task_manager.capacity` will be used - When neither value is set, the `DEFAULT_CAPACITY` value will be used. ### Updates to `TaskPool` class - Moves the logic to determine used and available capacity so that we can switch between capacity calculators based on claim strategy. For the `default` claim strategy, the capacity will be in units of workers. For the `mget` claim strategy, the capacity will be in units of task cost. ### Updates to `mget` task claimer - Updated `taskStore.fetch` call to take a new parameter that will return a slimmer task document that excludes that task state and task params. This will improve the I/O efficiency of returning up to 400 task docs in one query - Applies capacity constraint to the candidate tasks. - Bulk gets the full task documents for the tasks we have capacity for in order to update them to `claiming` status. Uses the `SavedObjectsClient.bulkGet` which uses an `mget` under the hood. ### Updates the monitoring stats - Emitting capacity config value and also capacity as translated into workers and cost. - Added total cost of running and overdue tasks to the health report ## Tasks for followup issues - Update mget functional tests to include tasks with different costs. - #189111 - Update cost of indicator match rule to be Extra Large - #189112 - Set `xpack.task_manager.capacity` on ECH based on the node size - #189117 --------- Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

…acity based claiming (#189431) Resolves #189111 ## Summary Adds jest integration test to test cost capacity based claiming with the `mget` claim strategy. Using this integration test, we can exclude running other tasks other than our test types. We register a normal cost task and an XL cost task. We test both that we can claim tasks up to 100% capacity and that we will stop claiming tasks if the next task puts us over capacity, even if that means we're leaving capacity on the table. --------- Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

ymao1 added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 24, 2024

ymao1 mentioned this issue Jul 24, 2024

[Response Ops][Task Manager] Resource based task scheduling #187999

Merged

ymao1 self-assigned this Jul 26, 2024

ymao1 mentioned this issue Jul 29, 2024

[Response Ops][Task Manager] Adding jest integration test to test capacity based claiming #189431

Merged

mikecote mentioned this issue Aug 19, 2024

Scaling the alerting throughput ceiling from 3,200 to 32,000+ rules per minute #188194

Open

48 tasks

ymao1 closed this as completed in #189431 Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Response Ops][Task Manager] Add more functional tests for resource based task claiming #189111

[Response Ops][Task Manager] Add more functional tests for resource based task claiming #189111

ymao1 commented Jul 24, 2024

elasticmachine commented Jul 24, 2024

[Response Ops][Task Manager] Add more functional tests for resource based task claiming #189111

[Response Ops][Task Manager] Add more functional tests for resource based task claiming #189111

Comments

ymao1 commented Jul 24, 2024

elasticmachine commented Jul 24, 2024