Task Manager health - average interval #94937

Open
kobelb opened this issue Mar 18, 2021 · 2 comments

Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@kobelb
Contributor

kobelb commented Mar 18, 2021

Context

To auto-scale Kibana, the following rules will determine when Kibana scales up and scales down:

Scale-up: Task load > 45% of Task Capacity OR 90th percentile CPU > 0.85
Scale-down: Task load < 15% of Task Capacity AND 90th percentile CPU < 0.25
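To make the two rules concrete, here is a minimal sketch of the decision logic. The function and constant names are hypothetical, and it assumes the task load is passed in as a fraction of task capacity (0.45 meaning 45%) and the CPU figure as a 0 to 1 value, as the thresholds above suggest.

```typescript
type ScalingDecision = "scale-up" | "scale-down" | "hold";

// Thresholds taken from the rules above; the names are illustrative only.
const SCALE_UP_LOAD = 0.45;
const SCALE_UP_CPU = 0.85;
const SCALE_DOWN_LOAD = 0.15;
const SCALE_DOWN_CPU = 0.25;

// taskLoadRatio: task load divided by task capacity (assumed to be a fraction).
// cpuP90: 90th percentile CPU (assumed to be a 0..1 value).
function decideScaling(taskLoadRatio: number, cpuP90: number): ScalingDecision {
  // Scale-up needs only one of the two conditions (OR).
  if (taskLoadRatio > SCALE_UP_LOAD || cpuP90 > SCALE_UP_CPU) {
    return "scale-up";
  }
  // Scale-down needs both conditions at once (AND).
  if (taskLoadRatio < SCALE_DOWN_LOAD && cpuP90 < SCALE_DOWN_CPU) {
    return "scale-down";
  }
  // Anything in between leaves the deployment as it is.
  return "hold";
}
```

Because scale-up is an OR and scale-down is an AND, a low task load with a moderately busy CPU (for example 10% load and 0.5 CPU) results in neither action.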

This relies on us being able to calculate the task load and task capacity:

Task load = number of scheduled tasks x average task interval
Task capacity = number of Kibana instances x task concurrency x ( 3,600,000 / poll interval)
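A rough sketch of these two calculations follows. The capacity function mirrors the formula above directly. For the load, the sketch assumes the intent is "task executions per hour", i.e. number of scheduled tasks x (3,600,000 / average task interval), so that it can be compared against capacity in the same unit; that reading is an assumption, as are all function and parameter names.

```typescript
const MS_PER_HOUR = 3600000;

// Capacity as stated above: instances x concurrency x polls per hour.
function taskCapacityPerHour(
  kibanaInstances: number,
  taskConcurrency: number,
  pollIntervalMs: number
): number {
  if (pollIntervalMs <= 0) {
    throw new Error("pollIntervalMs must be positive");
  }
  return kibanaInstances * taskConcurrency * (MS_PER_HOUR / pollIntervalMs);
}

// ASSUMPTION: load is interpreted as executions per hour, so each task
// contributes (one hour / average interval) runs.
function taskLoadPerHour(scheduledTasks: number, averageIntervalMs: number): number {
  if (averageIntervalMs <= 0) {
    throw new Error("averageIntervalMs must be positive");
  }
  return scheduledTasks * (MS_PER_HOUR / averageIntervalMs);
}

// Example: 2 instances, concurrency 10, polling every 3 seconds,
// and 100 recurring tasks that run once a minute on average.
const capacity = taskCapacityPerHour(2, 10, 3000);
const load = taskLoadPerHour(100, 60000);
const loadRatio = load / capacity;
```

With these example numbers the capacity is 24,000 executions per hour, the load is 6,000, and the load ratio is 0.25, which sits between the scale-down and scale-up thresholds.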

The Task Manager health API coupled with the information from Cloud provides us with enough information to perform these calculations, except for the ability to determine the average task interval.

The Task Manager health API response does include the workload schedule; however, behind the scenes, it's using an Elasticsearch terms aggregation without a specified size, so we only get 10 buckets.
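For illustration, a terms aggregation returns 10 buckets by default unless `size` is set. The sketch below builds a request body with an explicit `size`; the field path `task.schedule.interval`, the aggregation name and the helper name are assumptions, not taken from the Task Manager source.

```typescript
interface ScheduleAggregationBody {
  size: number;
  aggs: {
    schedule: {
      terms: { field: string; size: number };
    };
  };
}

// Hypothetical helper: builds a search body that asks for up to
// `maxBuckets` distinct schedule intervals instead of the default 10.
function buildScheduleAggregation(
  maxBuckets: number,
  field: string = "task.schedule.interval" // ASSUMPTION: illustrative field path
): ScheduleAggregationBody {
  return {
    size: 0, // no hits needed, only the aggregation result
    aggs: {
      schedule: {
        terms: { field, size: maxBuckets },
      },
    },
  };
}

const scheduleRequest = buildScheduleAggregation(1000);
```

Raising `size` only lifts the cap; if there are more distinct intervals than `maxBuckets`, the result is still truncated, so this alone may not be enough for an exact average.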

Feature request

Ideally, the Task Manager health API would return an additional field that specifies the average interval for all recurring tasks in milliseconds. This would allow the auto-scaling logic to use this value in its calculations.

There are likely some complications here because schedule.interval currently uses Elasticsearch's date math notation (for example, "1m" represents one minute), and the avg aggregation doesn't work natively with this field.
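One possible workaround, sketched below, is to compute the average outside Elasticsearch: convert each interval string to milliseconds and take a count-weighted mean over the schedule buckets. The helper names are hypothetical, the `[interval, count]` bucket shape is an assumption about the workload schedule, and only the `s`, `m`, `h` and `d` units are handled.

```typescript
// Milliseconds per supported unit (assumed set of units).
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

// Converts a string such as "1m" or "30s" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (match === null) {
    throw new Error("Unsupported interval: " + interval);
  }
  const amount = Number(match[1]);
  const unitMs = UNIT_MS[match[2] ?? ""];
  if (unitMs === undefined || isNaN(amount)) {
    throw new Error("Unsupported interval: " + interval);
  }
  return amount * unitMs;
}

// Count-weighted average interval over [interval, taskCount] buckets.
// Returns undefined when there are no recurring tasks.
function averageIntervalMs(buckets: Array<[string, number]>): number | undefined {
  let totalMs = 0;
  let totalTasks = 0;
  for (const bucket of buckets) {
    const intervalMs = parseIntervalMs(bucket[0]);
    const count = bucket[1];
    totalMs += intervalMs * count;
    totalTasks += count;
  }
  return totalTasks === 0 ? undefined : totalMs / totalTasks;
}

// Three tasks every minute and one task every five minutes.
const exampleAverage = averageIntervalMs([
  ["1m", 3],
  ["5m", 1],
]);
```

In this example the average is 120,000 ms (two minutes). The result is only as complete as the buckets it is given, so it inherits the bucket limit of the terms aggregation.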

@kobelb kobelb added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 18, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 15, 2021
@gmmorris
Contributor

We've recently added an experimental Capacity Estimation feature to Task Manager's health API.
These estimations use, among other things, the averaged intervals. Could this be used to achieve the metric needed for auto-scaling?

https://www.elastic.co/guide/en/kibana/master/task-manager-troubleshooting.html#task-manager-health-evaluate-the-capacity-estimation

The PR has some detailed diagrams designed to express how we perform this estimation. I'd love your thoughts:
#100475

In hindsight, I wish we'd produced an RFC for how this estimation works... I'll see if I can find capacity to write up a design doc.

@gmmorris gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022