[Task Manager] adds basic observability into Task Manager's runtime operations #77868
* master: (92 commits) [ILM] Data tiers for 7.10 (elastic#76126) [ML] Transforms: Fixes styling of preview grid pagination in summary step (elastic#77789) [Drilldowns] Beta badge support. Mark URL Drilldown as Beta (elastic#75654) Re-enable session lifespan, idle timeout api integration tests and use unique names for the security test reports. (elastic#77746) [Alerting] renames code in alerting RBAC exemption to make it easier to maintain (elastic#77598) [Alerting & Actions] Overwrite SOs when updating instead of partially updating (elastic#73688) fixed react warning in Suspense in alert flyout (elastic#77777) [APM] Track usage of Gold+ features (elastic#77630) Visualize: Bad request when working with histogram aggregation (elastic#77684) remove legacy ES plugin (elastic#77703) [Lens] change name of custom query to filters (elastic#77725) skip flaky suite (elastic#76239) remove visual aspects of baseline job (elastic#77815) skip flaky suite (elastic#77835) Fixes typo in data recognizer text (elastic#77691) management/update trusted_apps jest snapshot [build] Use Elastic hosted UBI minimal base image (elastic#77776) [APM] Add transaction error rate alert (elastic#76933) [Security Solution] [Detections] Remove file validation on import route (elastic#77770) [Enterprise Search][tech debt] Add Kea logic paths for easier debugging/defaults (elastic#77698) ...
* master: (226 commits) [Enterprise Search] Added Logic for the Credentials View (elastic#77626) [CSM] Js errors (elastic#77919) Add the @kbn/apm-config-loader package (elastic#77855) [Security Solution] Refactor useSelector (elastic#75297) Implement tagcloud renderer (elastic#77910) [APM] Alerting: Add global option to create all alert types (elastic#78151) [Ingest pipelines] Upload indexed document to test a pipeline (elastic#77939) TypeScript cleanup in visualizations plugin (elastic#78428) Lazy load metric & mardown visualizations (elastic#78391) [Detections][EQL] EQL rule execution in detection engine (elastic#77419) Update tutorial-full-experience.asciidoc (elastic#75836) Update tutorial-define-index.asciidoc (elastic#75754) Add support for runtime field types to mappings editor. (elastic#77420) [Monitoring] Usage collection (elastic#75878) [Docs][Actions] Add docs for Jira and IBM Resilient (elastic#78316) [Security Solution][Resolver] Update @timestamp formatting (elastic#78166) [Security Solution] Fix app layout (elastic#76668) [Security Solution][Resolver] 2 new functions to DAL (elastic#78477) Adds new elasticsearch client to telemetry plugin (elastic#78046) skip flaky suite (elastic#78512) (elastic#78511) (elastic#78510) (elastic#78509) (elastic#78508) (elastic#78507) (elastic#78506) (elastic#78505) (elastic#78504) (elastic#78503) (elastic#78502) (elastic#78501) (elastic#78500) ...
* master: Fix APM lodash imports (elastic#78438) Add deprecated message to tile_map and region_map visualizations. (elastic#77683) Fix Lens smokescreen flaky tests (elastic#78566) updated discover with alt text (elastic#77660) Fix types (elastic#78619) Update tutorial-visualizing.asciidoc (elastic#76977) Update tutorial-discovering.asciidoc (elastic#76976) [Search] Error notification alignment (elastic#77788) Update tutorial-define-index.asciidoc (elastic#76975) [Lens] Fieldless operations (elastic#78080) [Usage Collection] [schema] Explicit "array" definition (elastic#78141) Update tutorial-define-index.asciidoc (elastic#76973) Fix --no-basepath references in doc (elastic#78570) Move StubIndexPattern to data plugin and convert to TS. (elastic#78518) Index pattern class - remove unused methods (elastic#78538) [Security Solution] [ALL] Eliminates all console.error and console.warn from Jest output (elastic#78523) [Actions] avoids setting a default dedupKey on PagerDuty (elastic#77773) First stab at developer-focussed saved objects docs (elastic#71430) remove unnecessary config validations (elastic#78527)
* master: (288 commits) add core-js production dependency (elastic#79395) Add support for sharing saved objects to all spaces (elastic#76132) [Alerting UI] Display a banner to users when some alerts have failures, added alert statuses column and filters (elastic#79038) load js-yaml lazily (elastic#79092) skip flaky suite (elastic#77278) Fix agentPolicyUpdateEventHandler() to use app context soClient for creation of actions (elastic#79341) [Security Solution] Untitled Timeline created when first action is to add note (elastic#78988) [Security Solutions][Detection Engine] Updates the edit rules page to only have what is selected for editing (elastic#79233) Cleanup yarn.lock from duplicates (elastic#66617) [kbn/optimizer] implement more efficient auto transpilation for node (elastic#79052) [Ingest Manager] Rename Fleet setup and requirement, Fleet => Central… (elastic#79291) [core/server/plugins] don't run discovery in dev server parent process (take 2) (elastic#79358) [babel/register] remove from build (take 2) (elastic#79379) [Security Solution] Changes rules table tag display (elastic#77102) define integrationTestRoot in config file and use to define screensho… (elastic#79247) Revert "[babel/register] remove from build (elastic#79176)" skip flaky suite (elastic#75241) [Uptime] Synthetics UI (elastic#77960) [Security Solution] [Detections] Only display actions options if user has "read" privileges (elastic#78812) [babel/register] remove from build (elastic#79176) ...
LGTM; I was hoping this wouldn't be quite so complex, but obviously it's going to be somewhat complex, so ... it is what it is :-)
I gave this a whirl on my 100 alerts w/4 active instances load, ran it for a while, checking the health, looked good, deleted all the alerts, leaving the queued actions to error (since the alert is no longer available) to look at the "error" side as well. Works as expected!
@@ -145,6 +145,15 @@ export interface AggregationOptionsByType {
>;
keyed?: boolean;
} & AggregationSourceOptions;
range: {
are these changes needed? I'm wondering if we will need to pre-req the apm plugin, if not now, in some future where the compilation depends on plugin deps.
We are investigating moving these types into the JS client (see #77720.) No timeframe on that, but worth considering.
I'm always in favor of explicitly declaring dependencies, though in this case we would create a circular one since APM has Task Manager as an optional dependency.
As for the type changes here I'd like to defer to @dgieselaar on that one before approving.
Yeah I wasn't crazy about this, but checked with @dgieselaar and he gave a thumbsup.
He said he'd review on the PR 👍
 * you may not use this file except in compliance with the Elastic License.
 */

import stats from 'stats-lite';
fun factoid - `stats-lite` is from my former NodeSource colleague and all-around great person, Bryce Baril.
return res.ok({
  body: lastMonitoredStats
    ? calculateStatus(lastMonitoredStats)
    : { id: taskManagerId, timestamp: new Date().toISOString(), status: HealthStatus.Error },
This is really more of a "NO DATA" condition, rather than an error, right? Probably worth indicating that somehow, not sure it's worthy of a new status if instead it could just be in a message somehow.
I considered it, but arguably it could also be an error if you consider that the goal of monitoring is to confirm that TM is working and having no data here indicates that TM is not working.
My thinking was that when a Kibana starts up it goes from Error to OK, and that makes sense to me from a monitoring perspective...
I'd opt to keep it this way until (unless) a customer complains :)
Pinging @elastic/apm-ui (Team:apm)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
* master: (63 commits) [KP] Fix Headers timeout issue (elastic#81140) [ML] Functional tests - stabilize typing with checks service method (elastic#81338) tabify - support docs (elastic#80351) [Security Solution][Detections] Look-back time logic fix (elastic#81383) [Workplace Search] Add top-level tests for Groups (elastic#81215) [Fleet] Fix agent action observable for long polling (elastic#81376) [Maps] fix feature tooltip remains open when zoom level change hides layer (elastic#81373) skip flaky suite (elastic#78689) chore(NA): add spec-to-console and plugin-helpers as devOnly dependencies (elastic#81357) Ensure some data is returned (elastic#81375) Change dumb-init to tini (elastic#81126) [Reporting/Tech Debt] Convert PdfMaker class to TypeScript (elastic#81242) Use Storybook Controls instead of Knobs (elastic#80705) [junit] make sure that report paths are unique (elastic#81255) bump elastic/elasticsearch-js version to 7.10.0-rc1 (elastic#81288) run ssl tests on CI (elastic#81320) Fix alert defaults (elastic#81207) [ML] DF Analytics wizard: ensure user can set mml manually or select to use given estimate (elastic#81078) Add UI notifier to indicate secret fields and to remember / reenter values (elastic#80657) [Monitoring] Use async/await (elastic#81200) ...
Looks good, great to see more teams using these types! Left a small suggestion for type inference in `aggregate`.
@elasticmachine merge upstream
LGTM 👍 This will be good to have insight into!
* master: (37 commits) [ILM] Migrate Warm phase to Form Lib (elastic#81323) [Security Solutions][Detection Engine] Fixes critical bug with error reporting that was doing a throw (elastic#81549) [Detection Rules] Add 7.10 rules (elastic#81676) [kbn/optimizer] ignore missing metrics when updating limits with --focus (elastic#81696) [SECURITY SOLUTIONS] Bugs overview page + investigate eql in timeline (elastic#81550) [Maps] fix unable to edit cluster vector styles styled by count when switching to super fine grid resolution (elastic#81525) Fixed migration issue for case specific actions, by extending email action migrator checks (elastic#81673) [CI] Preparation for APM tracking on CI (elastic#80399) [Home] Fixes Kibana app description order on home page and updates Canvas copy (elastic#80057) Make sure `to` is 'now' and not the same as `from` (elastic#81524) Nitpicking the 8.0 Breaking Change issue template (elastic#81678) [SECURITY_SOLUTION] Fix text on onboarding screen (elastic#81672) [data.search] Skip async search tests in build candidates and production builds (elastic#81547) Fix previousStartedAt by not changing when execution fails (elastic#81388) [Monitoring] Fix a couple of issues with the cpu usage alert (elastic#80737) Telemetry collection xpack to ts project references (elastic#81269) Elasticsearch: don't use url authentication for new client (elastic#81564) [App Search] Credentials: implement working flyout form (elastic#81541) Properly encode links to edit user page (elastic#81562) [Alerting UI] Don't wait for health check before showing Create Alert flyout (elastic#80996) ...
* master: [Security Solution][Endpoint][Admin] Malware Protections Notify User Version (elastic#81415) Revert "[Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)" [Maps] Use default format when proxying EMS-files (elastic#79760) [Discover] Deangularize context.html (elastic#81442) Use the displayName property in default editor (elastic#73311) adds bug label to Bug report for Security Solution template (elastic#81643) [ML] Transforms: Remove index field limitation for custom query. (elastic#81467) [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390) [Task Manager] Mark task as failed if maxAttempts has been met. (elastic#80681) [Lens] Fix URL query loss on redirect (elastic#81475) Log reason for 404 in field existence check (elastic#81315)
💚 Build Succeeded
…perations (elastic#77868) This PR adds an an internal monitoring mechanism in Task Manager which keep track of a variety of metrics and a health api endpoint which makes the monitored statistics accessible. # Conflicts: # x-pack/test/plugin_api_integration/test_suites/task_manager/index.ts
* master: (87 commits) [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81778) [i18n] add get_kibana_translation_paths tests (elastic#81624) [UX] Fix search term reset from url (elastic#81654) [Lens] Improved range formatter (elastic#80132) [Resolver] `SideEffectContext` changes, remove `@testing-library` uses (elastic#81077) [Time to Visualize] Update Library Text with Call to Action (elastic#81667) [docs] Resolving failed Kibana upgrade migrations (elastic#80999) [ftr/menuToggle] provide helper for enhanced menu toggle handling (elastic#81709) [Task Manager] adds basic observability into Task Manager's runtime operations (elastic#77868) Doc changes for stack management and grouped feature privileges (elastic#80486) Added functional test for alerts list filters by status, alert type and action type. Did a code refactoring and splitting for alerts tests. (elastic#81422) [Security Solution][Endpoint][Admin] Malware Protections Notify User Version (elastic#81415) Revert "[Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)" [Maps] Use default format when proxying EMS-files (elastic#79760) [Discover] Deangularize context.html (elastic#81442) Use the displayName property in default editor (elastic#73311) adds bug label to Bug report for Security Solution template (elastic#81643) [ML] Transforms: Remove index field limitation for custom query. (elastic#81467) [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390) [Task Manager] Mark task as failed if maxAttempts has been met. (elastic#80681) ...
Documented as part of #89997
Summary
closes #77456
This PR adds an internal monitoring mechanism to Task Manager which keeps track of a variety of metrics, and a `health` api endpoint which makes the monitored statistics accessible.
Exposed Metrics
There are three different sections to the stats returned by the `health` api.
* `configuration`: Summarizes Task Manager's current configuration.
* `workload`: Summarizes the workload in the current deployment.
* `runtime`: Tracks Task Manager's performance.
Configuring the Stats
There are four new configurations:
* `xpack.task_manager.monitored_stats_required_freshness` - The required freshness of critical "Hot" stats, which means that if key stats (last polling cycle time, for example) haven't been refreshed within the specified duration, the `_health` endpoint and service will report an `Error` status. By default this is inferred from the configured `poll_interval` and is set to `poll_interval` plus a `1s` buffer.
* `xpack.task_manager.monitored_aggregated_stats_refresh_rate` - Dictates how often we refresh the "Cold" metrics. These metrics require an aggregation against Elasticsearch and add load to the system, hence we want to limit how often we execute these. We also infer the required freshness of these "Cold" metrics from this configuration, which means that if these stats have not been updated within the required duration then the `_health` endpoint and service will report an `Error` status. This covers the entire `workload` section of the stats. By default this is configured to `60s`, and as a result the required freshness defaults to `61s` (refresh plus a `1s` buffer).
* `xpack.task_manager.monitored_stats_running_average_window` - Dictates the size of the window used to calculate the running average of various "Hot" stats, such as the time it takes to run a task, the drift that tasks experience etc. These stats are collected throughout the lifecycle of tasks and this window will dictate how large the queue we keep in memory would be, and how many values we need to calculate the average against. We do not calculate the average on every new value, but rather only when the time comes to summarize the stats before logging them or returning them to the API endpoint.
* `xpack.task_manager.monitored_task_execution_thresholds` - Configures the threshold of failed task executions at which point the `warn` or `error` health status will be set either at a default level or a custom level for specific task types. This will allow you to mark the health as `error` when any task type fails 90% of the time, but set it to `error` at 50% of the time for task types that you consider critical. This value can be set to any number between 0 and 100, and a threshold is hit when the value exceeds this number. This means that you can avoid setting the status to `error` by setting the threshold at 100, or hit `error` the moment any task fails by setting the threshold to 0 (as it will exceed 0 once a single failure occurs).
For example, in your `Kibana.yml`:
Consuming Health Stats
Task Manager exposes a `/api/task_manager/_health` api which returns the latest stats.
Calling this API is designed to be fast and doesn't actually perform any checks; rather, it returns the result of the latest stats in the system, and is designed in such a way that you could call it from an external service on a regular basis without worrying that you'll be adding substantial load to the system.
Additionally, the metrics are logged out into Task Manager's `DEBUG` logger at a regular cadence (dictated by the Polling Interval).
If you wish to enable DEBUG logging in your Kibana instance, you will need to add the following to your `Kibana.yml`:
Please bear in mind that these stats are logged as often as your `poll_interval` configuration, which means it could add substantial noise to your logs.
We would recommend only enabling this level of logging temporarily.
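The "fast, no checks" behaviour of the `_health` api, together with the freshness rule from the configuration section, could be sketched roughly as below. This is only an illustration, not Task Manager's actual code: `LatestStats`, `inferredFreshness` and the `HealthStatus` union are hypothetical names, and the sketch assumes that a missing or stale summary is reported as `Error`, as the description and the review thread suggest.

```typescript
// Hypothetical sketch: none of these names come from Task Manager itself.
type HealthStatus = 'OK' | 'Warning' | 'Error';

interface CachedSummary {
  status: HealthStatus;
  lastUpdate: Date;
}

// Required freshness inferred from an interval plus a 1s buffer
// (poll_interval for "Hot" stats, the refresh rate for "Cold" stats).
function inferredFreshness(intervalMs: number): number {
  return intervalMs + 1000;
}

// Holds the most recent summary so that reading it never triggers any work.
class LatestStats {
  private latest?: CachedSummary;

  // Called by whatever collects the stats, whenever a new summary is ready.
  update(status: HealthStatus, lastUpdate: Date): void {
    this.latest = { status, lastUpdate };
  }

  // Called by the endpoint: returns the cached status, or Error when there
  // is no data yet or the data is older than the required freshness.
  read(now: Date, requiredFreshnessMs: number): HealthStatus {
    const latest = this.latest;
    if (latest === undefined) {
      return 'Error';
    }
    const ageMs = now.getTime() - latest.lastUpdate.getTime();
    return ageMs > requiredFreshnessMs ? 'Error' : latest.status;
  }
}
```

With a `poll_interval` of 3000ms (an assumed value), `inferredFreshness(3000)` gives 4000ms, so a summary last updated 3s ago still reads as its own status while one updated more than 4s ago reads as `Error`.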
Understanding the Exposed Stats
As mentioned above, the `health` api exposes three sections: `configuration`, `workload` and `runtime`.
Each section has a `timestamp` and a `status` which indicate when the last update to this section took place and whether the health of this section was evaluated as `OK`, `Warning` or `Error`.
The root has its own `status` which indicates the state of the system overall as inferred from the `status` of the sections.
An `Error` status in any section will cause the whole system to display as `Error`.
A `Warning` status in any section will cause the whole system to display as `Warning`.
An `OK` status will only be displayed when all sections are marked as `OK`.
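The three roll-up rules above amount to "the worst section status wins". A minimal sketch of that reading follows; `summarizeStatus`, `SectionHealth` and the severity table are hypothetical names, and it assumes that `Error` takes precedence over `Warning` when both are present and that an empty list of sections counts as `OK`.

```typescript
// Hypothetical sketch of the roll-up rule; not the PR's implementation.
type HealthStatus = 'OK' | 'Warning' | 'Error';

interface SectionHealth {
  timestamp: string;
  status: HealthStatus;
}

// Higher number means more severe.
const severity: Record<HealthStatus, number> = { OK: 0, Warning: 1, Error: 2 };

// Returns the most severe status found across all sections.
function summarizeStatus(sections: SectionHealth[]): HealthStatus {
  return sections.reduce<HealthStatus>(
    (worst, section) => (severity[section.status] > severity[worst] ? section.status : worst),
    'OK'
  );
}
```

Encoding the ordering in a lookup table keeps the rule in one place, so adding another status level later would only mean extending `severity`.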
The root `timestamp` is the time in which the summary was exposed (either to the DEBUG logger or the http api) and the `last_update` is the last time any one of the sections was updated.
The Configuration Section
The `configuration` section summarizes Task Manager's current configuration, including dynamic configurations which change over time, such as `poll_interval` and `max_workers`, which adjust in reaction to changing load on the system.
These are "Hot" stats which are updated whenever a change happens in the configuration.
The Workload Section
The `workload` section summarizes the workload in the current deployment, listing the tasks in the system, their types and what their current status is.
It includes three sub sections:
* overdue tasks, whose `runAt` has expired.
These are "Cold" stats which are updated at a regular cadence, configured by the `monitored_aggregated_stats_refresh_rate` config.
The Runtime Section
The `runtime` section tracks Task Manager's performance as it runs, making note of task execution time, drift etc.
These include:
* … (`50` by default)
* … [`No tasks | Filled task pool | Unexpectedly ran out of workers`] frequency over the past 50 polling cycles (using the same window size as the one used for running averages)
* `Success | Retry | Failure ratio` by task type. This is different than the workload stats which tell you what's in the queue, but can't keep track of retries and of non recurring tasks as they're wiped off the index when completed.
These are "Hot" stats which are updated reactively as Tasks are executed and interacted with.
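To make the interaction between the running window and `monitored_task_execution_thresholds` more concrete, here is a rough sketch of a bounded window of task results for one task type, summarized on demand and compared against thresholds. All names (`ResultWindow`, `statusFor`, `warnThreshold`, `errorThreshold`) are hypothetical. It also assumes that only `Failure` results count toward the failure percentage and that the ratio is windowed the same way as the running averages; the description does not confirm either.

```typescript
// Hypothetical sketch; names and the windowing of the ratio are assumptions.
type HealthStatus = 'OK' | 'Warning' | 'Error';
type TaskResult = 'Success' | 'Retry' | 'Failure';

// Percentages between 0 and 100; a threshold is hit when the value exceeds it.
interface Thresholds {
  warnThreshold: number;
  errorThreshold: number;
}

// Bounded queue holding the most recent results for a single task type.
class ResultWindow {
  private readonly results: TaskResult[] = [];
  private readonly size: number;

  constructor(size: number) {
    this.size = size;
  }

  push(result: TaskResult): void {
    this.results.push(result);
    if (this.results.length > this.size) {
      this.results.shift(); // drop the oldest value once the window is full
    }
  }

  // Computed only when a summary is requested, not on every push.
  failurePercentage(): number {
    if (this.results.length === 0) {
      return 0;
    }
    const failures = this.results.filter((r) => r === 'Failure').length;
    return (failures / this.results.length) * 100;
  }
}

// Strictly "exceeds": a threshold of 100 can never be hit, while a threshold
// of 0 is hit as soon as a single failure is in the window.
function statusFor(failurePct: number, thresholds: Thresholds): HealthStatus {
  if (failurePct > thresholds.errorThreshold) {
    return 'Error';
  }
  if (failurePct > thresholds.warnThreshold) {
    return 'Warning';
  }
  return 'OK';
}
```

A per-task-type map of `ResultWindow` instances, each checked with either the default or a custom `Thresholds` value, would then give the default-versus-critical behaviour described in the configuration section.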
Example Stats
For example, if you curl the `/api/task_manager/_health` endpoint, you might get these stats: