
[Response Ops][Alerting] Exposing background worker utilization load metric #153600

Merged
merged 21 commits, May 2, 2023

Conversation

@ymao1 (Contributor) commented Mar 23, 2023

Resolves #155762, #155761

Summary

This PR exposes a metric that represents the background task worker utilization load at the end of each polling cycle. It is calculated as (# of workers already busy + claimed tasks) / max workers, i.e. the number of workers in use at the end of each claim cycle divided by the max worker count. The metric is then averaged over the previous 15 seconds (or 5 polling cycles). The window size is configurable via xpack.task_manager.worker_utilization_running_average_window.
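For illustration, here is a minimal TypeScript sketch of that calculation, assuming a percentage scale (0-100) and a default window of 5 polling cycles; the function and class names are hypothetical and do not mirror the actual Task Manager source:

```ts
/** Worker utilization at the end of one claim cycle, as a percentage of max workers. */
function calculateWorkerUtilization(
  busyWorkers: number, // workers still running tasks when the cycle ends
  claimedTasks: number, // tasks claimed during this polling cycle
  maxWorkers: number
): number {
  return Math.min(100, Math.round(((busyWorkers + claimedTasks) / maxWorkers) * 100));
}

/**
 * Running average over the last N polling cycles. With a 3s poll interval,
 * a window of 5 cycles covers roughly 15 seconds; the window size is what
 * xpack.task_manager.worker_utilization_running_average_window controls.
 */
class UtilizationRunningAverage {
  private readonly values: number[] = [];

  constructor(private readonly windowSize: number = 5) {}

  push(utilization: number): void {
    this.values.push(utilization);
    if (this.values.length > this.windowSize) {
      this.values.shift(); // drop the oldest cycle once the window is full
    }
  }

  load(): number {
    if (this.values.length === 0) return 0;
    return this.values.reduce((sum, v) => sum + v, 0) / this.values.length;
  }
}
```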

This PR exposes this metric in the existing /internal/task_manager/_background_task_utilization API and also adds a public version of this API (/api/task_manager/_background_task_utilization) that exposes only this metric. We need the public API for serverless, but I thought we could keep the private route as well to expose experimental metrics without the overhead of supporting them long term.
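As a rough example of consuming the public endpoint from a script, the sketch below assumes the response exposes the averaged metric under stats.value.load (per the verification steps) and uses a placeholder Kibana URL and credentials; it is not part of this PR:

```ts
// Hypothetical helper for reading the load metric from the public API.
// Assumes Node 18+ (global fetch); the response shape, credentials, and URL
// are placeholders, not guaranteed by this PR.
interface BackgroundTaskUtilizationResponse {
  stats?: {
    value?: {
      load?: number; // averaged worker utilization, 0-100
    };
  };
}

async function fetchBackgroundTaskLoad(
  kibanaUrl = 'http://localhost:5601',
  credentials = 'elastic:changeme'
): Promise<number | undefined> {
  const res = await fetch(`${kibanaUrl}/api/task_manager/_background_task_utilization`, {
    headers: { Authorization: `Basic ${Buffer.from(credentials).toString('base64')}` },
  });
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  const body = (await res.json()) as BackgroundTaskUtilizationResponse;
  return body.stats?.value?.load;
}

// Example usage: poll the metric every 15 seconds.
// setInterval(() => fetchBackgroundTaskLoad().then((load) => console.log({ load })), 15_000);
```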

To Verify

  1. Run ES and Kibana with this branch
  2. Navigate to /internal/task_manager/_background_task_utilization and see the new metric exposed as stats.value.load along with the existing adhoc and recurring metrics
  3. Navigate to /api/task_manager/_background_task_utilization and see only the load metric returned from the public API
  4. You can also create some rules to see the load metric increase.

@ymao1 (Contributor, Author) commented Mar 29, 2023

@elasticmachine merge upstream

@ymao1 (Contributor, Author) commented Apr 17, 2023

@elasticmachine merge upstream

@ymao1 ymao1 changed the title Adding worker utilization event to calculate load at end of claim cyc… [Response Ops][Alerting] Exposing background worker utilization load metric Apr 21, 2023
@ymao1 ymao1 self-assigned this Apr 24, 2023
@ymao1 ymao1 added the Feature:Alerting, release_note:skip, Feature:Task Manager, Team:ResponseOps, and v8.9.0 labels Apr 24, 2023
@ymao1 ymao1 marked this pull request as ready for review April 24, 2023 22:49
@ymao1 ymao1 requested a review from a team as a code owner April 24, 2023 22:49
@elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 requested a review from kobelb April 24, 2023 22:49
@ymao1 (Contributor, Author) commented Apr 24, 2023

@elasticmachine merge upstream

@ymao1 (Contributor, Author) commented Apr 25, 2023

@kobelb I made an /api and an /internal endpoint so we could have experimental metrics without the overhead of having to support them, but if you think that's overkill, I can remove the /internal endpoint and remove the adhoc and recurring counters.

@kobelb (Contributor) commented Apr 25, 2023

> I made an /api and an /internal endpoint so we could have experimental metrics without the overhead of having to support them, but if you think that's overkill, I can remove the /internal endpoint and remove the adhoc and recurring counters.

Works for me!

@ymao1 (Contributor, Author) commented May 1, 2023

@elasticmachine merge upstream

@ymao1 ymao1 removed the ci:cloud-deploy Create or update a Cloud deployment label May 1, 2023
@ymao1 (Contributor, Author) commented May 1, 2023

@elasticmachine merge upstream

@mikecote (Contributor) left a comment

Tested locally and changes LGTM! Saw the load go up to 100 when overloaded and back down to ~6 when idle from alerting rules.

@ymao1 ymao1 requested a review from a team as a code owner May 2, 2023 13:33
@jbudz (Member) left a comment

kibana-docker

@kibana-ci (Collaborator) commented

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| securitySolution | 398 | 401 | +3 |
| taskManager | 24 | 23 | -1 |
| total | | | +4 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| securitySolution | 478 | 481 | +3 |
| taskManager | 24 | 23 | -1 |
| total | | | +4 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

Labels
backport:skip (This commit does not require backporting), Feature:Alerting, Feature:Task Manager, release_note:skip (Skip the PR/issue when compiling release notes), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams), v8.9.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Change background task utilization endpoint to be public and no longer internal
7 participants