[Response Ops][Alerting] Exposing background worker utilization load metric (#153600)

Resolves #155762, #155761

## Summary

This PR exposes a metric that represents the background task worker
utilization load at the end of each polling cycle. It is calculated as
`(# of workers already busy + claimed tasks) / max workers`, which gives
the share of workers in use at the end of each claim cycle. The metric is
then averaged over the previous 5 polling cycles (15 seconds at the
default 3-second poll interval). The window size is configurable using
`xpack.task_manager.worker_utilization_running_average_window`.
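
To make the calculation above concrete, here is a minimal sketch of the per-cycle load and its running average. The names `workerLoad` and `RunningAverage` are hypothetical and are not taken from this PR; only the formula and the default window of 5 come from the description. The actual implementation may also scale or round the value differently.

```typescript
// Hypothetical helpers for illustration; not the code added by this PR.

// Load for one polling cycle: (busy workers + claimed tasks) / max workers.
function workerLoad(busyWorkers: number, claimedTasks: number, maxWorkers: number): number {
  if (maxWorkers <= 0) {
    throw new RangeError('maxWorkers must be positive');
  }
  return (busyWorkers + claimedTasks) / maxWorkers;
}

// Keeps the last `windowSize` samples and reports their mean.
class RunningAverage {
  private readonly windowSize: number;
  private readonly samples: number[] = [];

  constructor(windowSize: number = 5) {
    if (!Number.isInteger(windowSize) || windowSize < 1) {
      throw new RangeError('windowSize must be an integer >= 1');
    }
    this.windowSize = windowSize;
  }

  // Record one sample and return the average over the current window.
  add(sample: number): number {
    this.samples.push(sample);
    if (this.samples.length > this.windowSize) {
      this.samples.shift(); // drop the oldest sample
    }
    const sum = this.samples.reduce((acc, value) => acc + value, 0);
    return sum / this.samples.length;
  }
}
```

With a window of 5 and one sample per polling cycle, the reported value lags a sudden change in load by up to five cycles, which is the smoothing the description refers to.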

This PR exposes this metric in the existing
`/internal/task_manager/_background_task_utilization` API and also adds
a public version of this API
(`/api/task_manager/_background_task_utilization`) that exposes only
this metric. We need the public API for serverless, but I thought we
could keep the private route as well to expose experimental metrics
without the overhead of supporting them long term.

## To Verify

1. Run ES and Kibana with this branch
2. Navigate to `/internal/task_manager/_background_task_utilization` and
see the new metric exposed as `stats.value.load` along with the existing
`adhoc` and `recurring` metrics
3. Navigate to `/api/task_manager/_background_task_utilization` and see
only the load metric returned from the public API
4. You can also create some rules to see the load metric increase.
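
For scripted verification, the check in the steps above can be sketched as a small helper that reads the load from a parsed response body. The `stats.value.load` path is the one named in step 2; that the public route uses the same path is an assumption, and `UtilizationResponse` and `readLoad` are hypothetical names.

```typescript
// Assumed response shape: only the `stats.value.load` path is named in the
// description; every other detail here is hypothetical.
interface UtilizationResponse {
  stats?: { value?: { load?: unknown } };
}

// Returns the load when it is present and numeric, otherwise undefined.
function readLoad(body: UtilizationResponse): number | undefined {
  const load = body.stats?.value?.load;
  return typeof load === 'number' && Number.isFinite(load) ? load : undefined;
}
```

Returning `undefined` rather than throwing lets a caller distinguish "metric not reported yet" from a failed request.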

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
ymao1 and kibanamachine committed May 2, 2023
1 parent d41da82 commit 17487b8
Showing 19 changed files with 812 additions and 505 deletions.
@@ -413,6 +413,7 @@ kibana_vars=(
xpack.task_manager.version_conflict_threshold
xpack.task_manager.event_loop_delay.monitor
xpack.task_manager.event_loop_delay.warn_threshold
xpack.task_manager.worker_utilization_running_average_window
xpack.uptime.index
serverless
)
3 changes: 3 additions & 0 deletions x-pack/plugins/task_manager/server/config.test.ts
@@ -44,6 +44,7 @@ describe('config validation', () => {
"exclude_task_types": Array [],
},
"version_conflict_threshold": 80,
"worker_utilization_running_average_window": 5,
}
`);
});
@@ -95,6 +96,7 @@ describe('config validation', () => {
"exclude_task_types": Array [],
},
"version_conflict_threshold": 80,
"worker_utilization_running_average_window": 5,
}
`);
});
@@ -149,6 +151,7 @@ describe('config validation', () => {
"exclude_task_types": Array [],
},
"version_conflict_threshold": 80,
"worker_utilization_running_average_window": 5,
}
`);
});
12 changes: 10 additions & 2 deletions x-pack/plugins/task_manager/server/config.ts
@@ -18,9 +18,12 @@ export const DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY = MAX_WORKERS_LIMIT;
// ===================
// Refresh aggregated monitored stats at a default rate of once a minute
export const DEFAULT_MONITORING_REFRESH_RATE = 60 * 1000;
-export const DEFAULT_MONITORING_STATS_RUNNING_AVERGAE_WINDOW = 50;
+export const DEFAULT_MONITORING_STATS_RUNNING_AVERAGE_WINDOW = 50;
export const DEFAULT_MONITORING_STATS_WARN_DELAYED_TASK_START_IN_SECONDS = 60;

// At the default poll interval of 3sec, this averages over the last 15sec.
export const DEFAULT_WORKER_UTILIZATION_RUNNING_AVERAGE_WINDOW = 5;

export const taskExecutionFailureThresholdSchema = schema.object(
{
error_threshold: schema.number({
@@ -98,7 +101,7 @@ export const configSchema = schema.object(
}),
/* The size of the running average window for monitored stats. */
monitored_stats_running_average_window: schema.number({
-defaultValue: DEFAULT_MONITORING_STATS_RUNNING_AVERGAE_WINDOW,
+defaultValue: DEFAULT_MONITORING_STATS_RUNNING_AVERAGE_WINDOW,
max: 100,
min: 10,
}),
@@ -130,6 +133,11 @@ export const configSchema = schema.object(
}),
}),
event_loop_delay: eventLoopDelaySchema,
worker_utilization_running_average_window: schema.number({
defaultValue: DEFAULT_WORKER_UTILIZATION_RUNNING_AVERAGE_WINDOW,
max: 100,
min: 1,
}),
/* These are not designed to be used by most users. Please use caution when changing these */
unsafe: schema.object({
exclude_task_types: schema.arrayOf(schema.string(), { defaultValue: [] }),
@@ -77,6 +77,7 @@ describe('EphemeralTaskLifecycle', () => {
monitor: true,
warn_threshold: 5000,
},
worker_utilization_running_average_window: 5,
...config,
},
elasticsearchAndSOAvailability$,
@@ -71,6 +71,7 @@ describe('managed configuration', () => {
monitor: true,
warn_threshold: 5000,
},
worker_utilization_running_average_window: 5,
});
logger = context.logger.get('taskManager');
