[Task Manager] adds capacity estimation to the TM health endpoint #100475

gmmorris · 2021-05-24T16:51:29Z

Summary

Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.

Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.

Documentation changes:

Notes for Reviewers

I recommend reading the documentation I have added, both because it is user facing and needs review, and because it should give you the context needed to review the code.
Do you think the diagram belongs in the docs? Or is it too technically detailed and adds more noise than signal?
These estimations are very rough and are difficult to put together, as we never really know how many Kibana instances are in the cluster, and it's very hard to infer from the current stats what the non-recurring workload looks like. I hope the variable naming makes the calculations clearer, but I am fearful this might be hard to maintain. If you have thoughts on how we can restructure the calculation code to make it easier to maintain, I'm all ears.
You will see references to ephemeral tasks in the docs and API output. This is in preparation for @chrisronline's PR, [Alerting] Change execution of alerts from async to sync #97311. As you can see in the code, we don't actually detect any ephemeral tasks yet, because they don't exist in the code in any way (so the percentage of ephemeral tasks will always be zero). If #97311 doesn't get merged in 7.14, we can make a small fix to this code to remove the references to ephemeral tasks. If it does get merged, we'll need a follow-up PR to hook the two together, but I expect that will be a relatively small PR.

Checklist

Delete any items that are not applicable to this PR.

~~Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support~~
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
~~Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)~~
~~Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)~~
~~If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list~~
~~This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)~~
~~This was checked for cross-browser compatibility~~

For maintainers

This was checked for breaking API changes and was labeled appropriately


elasticmachine · 2021-06-03T09:26:42Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris · 2021-06-03T09:27:02Z

Hi @elastic/kibana-docs 👋 ,
I've added extensive docs here, and would appreciate a review. :)

Thanks

gmmorris · 2021-06-03T09:44:47Z

x-pack/plugins/task_manager/server/monitoring/task_run_statistics.ts

+function persistenceOf(task: ConcreteTaskInstance) {
+  return task.schedule ? TaskPersistence.Recurring : TaskPersistence.NonRecurring;
+}


As you can see here, we only ever identify a task as having a "recurring" or a "non-recurring" persistence.
"Ephemeral" tasks never get detected because they don't exist yet. This is plumbing in preparation for #97311.

If that PR isn't merged in 7.14, then we'll have to merge a small fix here removing ephemeral tasks from the output of this API.
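
For readers skimming the thread, here is a minimal sketch of why the ephemeral share always comes out as zero. It is not the code in this PR: `TaskLike`, `persistenceBreakdown`, and the string values of `TaskPersistence` are hypothetical stand-ins, and only `persistenceOf` mirrors the diff above.

```typescript
// Hypothetical stand-ins for the plugin's types; the string values are assumptions.
const TaskPersistence = {
  Recurring: 'recurring',
  NonRecurring: 'non_recurring',
  Ephemeral: 'ephemeral',
} as const;
type TaskPersistence = (typeof TaskPersistence)[keyof typeof TaskPersistence];

// Simplified task shape: only the optional schedule matters here.
interface TaskLike {
  schedule?: { interval: string };
}

// Same rule as the diff: a schedule means recurring, anything else is non-recurring.
function persistenceOf(task: TaskLike): TaskPersistence {
  return task.schedule ? TaskPersistence.Recurring : TaskPersistence.NonRecurring;
}

// Percentage of tasks per persistence type (hypothetical helper, rounded to integers).
function persistenceBreakdown(tasks: TaskLike[]): Record<TaskPersistence, number> {
  const counts: Record<TaskPersistence, number> = { recurring: 0, non_recurring: 0, ephemeral: 0 };
  for (const task of tasks) {
    counts[persistenceOf(task)] += 1;
  }
  const total = tasks.length;
  const toPercent = (n: number): number => (total === 0 ? 0 : Math.round((n * 100) / total));
  return {
    recurring: toPercent(counts.recurring),
    non_recurring: toPercent(counts.non_recurring),
    // Never incremented above, because persistenceOf cannot return Ephemeral.
    ephemeral: toPercent(counts.ephemeral),
  };
}
```

Because `persistenceOf` has no branch that returns `Ephemeral`, the `ephemeral` entry stays at zero for any input until something like #97311 introduces such tasks.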

ymao1

Looked through the docs but not the code yet. Everything makes sense, but the docs are quite dense. Can we consider adding a TL;DR section to the capacity estimation? Maybe direct users upfront to look at proposed.proposed_kibana to see the suggested number of Kibana instances, but keep the detailed docs for users who want to understand why that number was proposed?

ymao1 · 2021-06-03T12:28:47Z

docs/user/production-considerations/task-manager-health-monitoring.asciidoc

+
+| This section provides a rough estimate about the sufficiency of its capacity. As the name suggests, these are estimates based on historical data and should not be used as predictions. We recommend using these estimations when following the Task Manager <<task-manager-scaling-guidance>>.
+
+[NOTE]


This is showing up kind of funky in the docs.

Oops, looks like you can't put a Note inside of a definition list 😬

docs/user/production-considerations/task-manager-production-considerations.asciidoc

ymao1 · 2021-06-03T12:42:22Z

docs/user/production-considerations/task-manager-troubleshooting.asciidoc

+  }
+}
+--------------------------------------------------
+<1> These estimates assume that there is one {kib} instance actively executing tasks


To me observed_kibana_instances means the number of Kibanas running, while this description says estimate. Is this an estimate or an actual number?

It's an estimate based on the observed number of OwnerIds.
There is no way, currently, for Kibana to know how many instances are running, which is frustrating but we have to work with it.
In the code you'll see a bunch of calculations where we use the term "assumed", as in: we assume this to be the case, but it's not 100% certain because we're basing it on the observed number of Kibana instances. To produce any kind of estimate for the non-recurring workload we need some assumption to work from, and observing the OwnerIds is the closest we can come to producing an estimate.

I spoke to @chrisronline and he thinks Stack Monitoring could provide us a more reliable number.
I'll file a follow up issue for us to look into that in the near future, so we can improve this incrementally.
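
To make the "observed number of OwnerIds" idea concrete, here is a rough sketch, assuming each polling cycle yields a list of tasks with an `ownerId`. The names `ClaimedTask`, `countDistinctOwners`, and `createOwnerIdAverage` are hypothetical, as are the rounding up and the floor of one instance; the PR's actual calculation may differ. The running window reflects the commit note that there may be cycles where not all Task Managers are active.

```typescript
// Hypothetical task shape: ownerId identifies the Kibana instance that claimed the task.
interface ClaimedTask {
  ownerId: string | null;
}

// Count the distinct, non-null owner IDs seen in one polling cycle.
function countDistinctOwners(tasks: ClaimedTask[]): number {
  const owners = new Set<string>();
  for (const task of tasks) {
    if (task.ownerId !== null) {
      owners.add(task.ownerId);
    }
  }
  return owners.size;
}

// Keep a running window of per-cycle counts, since some cycles may not show every instance.
// Rounding up and never going below one instance are assumptions of this sketch.
function createOwnerIdAverage(windowSize: number): (tasks: ClaimedTask[]) => number {
  const samples: number[] = [];
  return (tasks: ClaimedTask[]): number => {
    samples.push(countDistinctOwners(tasks));
    if (samples.length > windowSize) {
      samples.shift();
    }
    const mean = samples.reduce((sum, n) => sum + n, 0) / samples.length;
    return Math.max(1, Math.ceil(mean));
  };
}
```

The result is still only an assumption about cluster size: an instance that claimed no tasks during the window is invisible to this approach, which is why a more reliable source such as Stack Monitoring is worth a follow-up.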

ymao1 · 2021-06-03T12:44:12Z

docs/user/production-considerations/task-manager-troubleshooting.asciidoc

+}
+--------------------------------------------------
+<1> These estimates assume that there is one {kib} instance actively executing tasks
+<2> Based on past throughput the overdue tasks in the system could be executed within 1 minute


Is there a guideline as to how many minutes_to_drain_overdue is too many? Comparing this example to the below example of an overloaded deployment, 1 minute to drain vs 12 minutes to drain doesn't seem too bad?

Not really.
In theory, we shouldn't have any overdue tasks, so anything > 0 is a sign that we are low on capacity.
12 minutes isn't too bad until you realize that during those 12 minutes, additional overdue tasks will build up.

I tried to clarify that in the docs by saying:

* There appear to be so many overdue tasks that it would take 12 minutes of executions to catch up with that backlog. This does not take into account tasks that might become overdue during those 12 minutes, so while this congestion might be temporary, the system could also remain consistently under provisioned and might never drain the backlog entirely.
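
The arithmetic behind that caveat can be sketched as follows. This is an illustration under simple assumptions, not the formula used in the PR: both function names and their parameters are hypothetical, and the second function models a steady rate of newly overdue tasks, which the real `minutes_to_drain_overdue` value does not take into account.

```typescript
// Minutes needed to work through the current backlog at the observed throughput,
// ignoring any tasks that become overdue in the meantime.
function minutesToDrainOverdue(overdueTasks: number, tasksExecutedPerMinute: number): number {
  if (overdueTasks <= 0) return 0;
  if (tasksExecutedPerMinute <= 0) return Infinity;
  return Math.ceil(overdueTasks / tasksExecutedPerMinute);
}

// Same estimate, but assuming a constant number of tasks become overdue every minute.
// If arrivals match or exceed throughput, the backlog never drains.
function minutesToDrainWithArrivals(
  overdueTasks: number,
  tasksExecutedPerMinute: number,
  newlyOverduePerMinute: number
): number {
  if (overdueTasks <= 0) return 0;
  const netPerMinute = tasksExecutedPerMinute - newlyOverduePerMinute;
  if (netPerMinute <= 0) return Infinity;
  return Math.ceil(overdueTasks / netPerMinute);
}
```

For example, a backlog of 120 tasks at 10 executions per minute drains in 12 minutes on paper, but if 4 more tasks become overdue each minute it takes 20 minutes, and at 10 per minute it never drains.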


ymao1

LGTM!


kibanamachine · 2021-06-14T12:50:45Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #130419 succeeded cc347ea
💚 Build #130312 succeeded 53e584b
💔 Build #130122 failed 982c513
💚 Build #130082 succeeded 2b14158
💚 Build #130066 succeeded 0270133

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream


gmmorris added 16 commits May 24, 2021 17:50

added capacity estimation to TM health endpoint

e3515ad

separated estimation from capacity requirmenets

8cefde7

track owner IDs in running average as there may be cycles where not a…

9aab904

…ll TMs are active

added docs

ae2bd52

made assumptions clearer in estimations

fc0b23b

split estimations by observations and proposal

ea2da9c

fixed doc

37bf439

fixed test

f754d51

split porposed from min

dfb113f

clarified that instacnes need to be identical

ae5e884

reworded some docs

9839c36

use average execution durationwhen estimating task capacity

367b9e0

fixed typing issues

3a52341

gmmorris changed the title ~~added capacity estimation to TM health endpoint~~ [Task Manager] adds capacity estimation to the TM health endpoint Jun 2, 2021

gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps v7.14.0 v8.0.0 labels Jun 3, 2021

gmmorris marked this pull request as ready for review June 3, 2021 09:26

gmmorris requested a review from a team as a code owner June 3, 2021 09:26

gmmorris requested a review from a team June 3, 2021 09:27

gmmorris commented Jun 3, 2021


tweaked doc

e41d413

ymao1 reviewed Jun 3, 2021


gmmorris and others added 9 commits June 8, 2021 11:28

removed unused import

89ce8a2

Update docs/user/production-considerations/task-manager-production-co…

86b24fe

…nsiderations.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

Update docs/user/production-considerations/task-manager-production-co…

c05279c

…nsiderations.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

Update docs/user/production-considerations/task-manager-production-co…

5ac3c8d

…nsiderations.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

Apply grammer suggestions

c17730e

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

gramatical corrections

bdddc96

marked as experimental

727c5a5

cant use notes in tables

0b0daf1

ymao1 approved these changes Jun 8, 2021


gmmorris added 11 commits June 9, 2021 10:20

tweaked docs

2fa1abd

fixed docs

6bfc2b8

cleaned up docs

2b14158

mark entire health monitoring endpoitn as experimental

982c513

rename proposed_kibana to provisioned_kibana

3b9f2b1

fixed grammer

49c8579

improved grammer

cc347ea

gmmorris merged commit daa3f62 into elastic:master Jun 14, 2021

gmmorris mentioned this pull request Jun 14, 2021

[7.x] [Task Manager] adds capacity estimation to the TM health endpoint (#100475) #102054

Merged

spong mentioned this pull request Jun 24, 2021

[Task Manager] Throttle TM health logging #102783

Closed

chrisronline mentioned this pull request Jun 24, 2021

[Task Manager] Typo in capacity estimation output for TM health api #103303

Closed

gmmorris mentioned this pull request Jul 15, 2021

Task Manager health - average interval #94937

Open
