Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Actions] Treat failures as successes for Task Manager #109655

Merged
merged 23 commits into from
Sep 9, 2021
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
c9ddb78
Support retry with email as an example
chrisronline Aug 24, 2021
366af39
Fix tests
chrisronline Aug 25, 2021
195944b
Add logic to treat as failure if there is a retry
chrisronline Aug 25, 2021
ac9e9c6
Handle retry better
chrisronline Aug 26, 2021
705fd7a
Make this optional
chrisronline Aug 26, 2021
e9c036d
Tweaks
chrisronline Aug 26, 2021
3150ce9
Merge remote-tracking branch 'elastic/master' into actions/failed_tasks
chrisronline Aug 26, 2021
70df414
Remove unnecessary code
chrisronline Aug 26, 2021
53e369f
Merge remote-tracking branch 'elastic/master' into actions/failed_tasks
chrisronline Aug 31, 2021
58d8946
Fix existing tests
chrisronline Aug 31, 2021
df80d91
Add some unit tests
chrisronline Aug 31, 2021
b3b316a
Add test
chrisronline Sep 1, 2021
a89fd86
Add doc note
chrisronline Sep 1, 2021
a2b7506
More docs
chrisronline Sep 2, 2021
69f1e3c
Merge remote-tracking branch 'elastic/master' into actions/failed_tasks
chrisronline Sep 3, 2021
975ad85
PR feedback
chrisronline Sep 3, 2021
ac96c26
Update docs/management/action-types.asciidoc
chrisronline Sep 7, 2021
07ce53c
Update docs/management/action-types.asciidoc
chrisronline Sep 7, 2021
5ddf97c
Update docs/management/action-types.asciidoc
chrisronline Sep 7, 2021
1b9eb7f
Update docs/management/action-types.asciidoc
chrisronline Sep 7, 2021
1ce8251
Update docs/management/action-types.asciidoc
chrisronline Sep 7, 2021
bb234ea
Merge remote-tracking branch 'elastic/master' into actions/failed_tasks
chrisronline Sep 7, 2021
1ad1073
Merge branch 'master' into actions/failed_tasks
kibanamachine Sep 9, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions docs/management/action-types.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -135,5 +135,15 @@ image::images/connectors-with-missing-secrets.png[Connectors with missing secret
For out-of-the-box and standardized connectors, you can <<preconfigured-connector-example, preconfigure connectors>>
before {kib} starts.

[float]
[[montoring-connectors]]
=== Monitoring
chrisronline marked this conversation as resolved.
Show resolved Hide resolved

The <<task-manager-health-monitoring,Task Manager health API>> is useful to understand the performance of all tasks in your environment.
chrisronline marked this conversation as resolved.
Show resolved Hide resolved
However, if connectors fail to execute for any reason, they will report as successful to Task Manager. This means the failure stats will not
chrisronline marked this conversation as resolved.
Show resolved Hide resolved
accurately depict the performance of connectors.

To get a better sense of connector failures, please refer to the <<event-log-index,Event log index>>
chrisronline marked this conversation as resolved.
Show resolved Hide resolved
for more accurate context into failures and successes.
chrisronline marked this conversation as resolved.
Show resolved Hide resolved

include::connectors/index.asciidoc[]
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ a| Runtime

| This section tracks excution performance of Task Manager, tracking task _drift_, worker _load_, and execution stats broken down by type, including duration and execution results.


a| Capacity Estimation

| This section provides a rough estimate about the sufficiency of its capacity. As the name suggests, these are estimates based on historical data and should not be used as predictions. Use these estimations when following the Task Manager <<task-manager-scaling-guidance>>.
Expand All @@ -123,6 +124,14 @@ The root `status` indicates the `status` of the system overall.

The Runtime `status` indicates whether task executions have exceeded any of the <<task-manager-configuring-health-monitoring,configured health thresholds>>. An `OK` status means none of the threshold have been exceeded. A `Warning` status means that at least one warning threshold has been exceeded. An `Error` status means that at least one error threshold has been exceeded.

[IMPORTANT]
==============================================
Some tasks (such as <<action-types,connectors>>) will incorrectly report their status as successful even if the task failed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this part should be more generic since task manager can run any type of task, and we should add a section to the Actions and Connectors docs that specifically reference the event log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you thinking it should say? actions are the only known culprit of this and it does say Some tasks to make it seem like it's not an only action thing. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering if it might lead people reading these docs to try to use the event log to look up failures for other tasks.

@gchaps Any suggestions for wording?

Copy link
Contributor

@gchaps gchaps Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need more context to provide wording.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I missed this. I saw you added feedback. Did you still need more context?

The runtime and workload block will return data about success and failures and will not take this into consideration.

To get a better sense of action failures, please refer to the <<event-log-index,Event log index>> for more accurate context into failures and successes.
==============================================

The Capacity Estimation `status` indicates the sufficiency of the observed capacity. An `OK` status means capacity is sufficient. A `Warning` status means that capacity is sufficient for the scheduled recurring tasks, but non-recurring tasks often cause the cluster to exceed capacity. An `Error` status means that there is insufficient capacity across all types of tasks.

By monitoring the `status` of the system overall, and the `status` of specific task types of interest, you can evaluate the health of the {kib} Task Management system.
3 changes: 2 additions & 1 deletion x-pack/plugins/actions/server/action_type_registry.ts
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,8 @@ export class ActionTypeRegistry {
// Don't retry other kinds of errors
return false;
},
createTaskRunner: (context: RunContext) => this.taskRunnerFactory.create(context),
createTaskRunner: (context: RunContext) =>
this.taskRunnerFactory.create(context, actionType.maxAttempts),
},
});
// No need to notify usage on basic action types
Expand Down
2 changes: 2 additions & 0 deletions x-pack/plugins/actions/server/lib/action_executor.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -187,10 +187,12 @@ test('successfully executes as a task', async () => {

const scheduleDelay = 10000; // milliseconds
const scheduled = new Date(Date.now() - scheduleDelay);
const attempts = 1;
await actionExecutor.execute({
...executeParams,
taskInfo: {
scheduled,
attempts,
},
});

Expand Down
2 changes: 2 additions & 0 deletions x-pack/plugins/actions/server/lib/action_executor.ts
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ export interface ActionExecutorContext {

export interface TaskInfo {
scheduled: Date;
attempts: number;
}

export interface ExecuteOptions<Source = unknown> {
Expand Down Expand Up @@ -210,6 +211,7 @@ export class ActionExecutor {
config: validatedConfig,
secrets: validatedSecrets,
isEphemeral,
taskInfo,
});
} catch (err) {
rawResult = {
Expand Down
153 changes: 149 additions & 4 deletions x-pack/plugins/actions/server/lib/task_runner_factory.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@ test('executes the task by calling the executor with proper parameters, using gi
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});

Expand Down Expand Up @@ -191,6 +192,7 @@ test('executes the task by calling the executor with proper parameters, using st
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});

Expand Down Expand Up @@ -341,6 +343,7 @@ test('uses API key when provided', async () => {
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});

Expand Down Expand Up @@ -401,6 +404,7 @@ test('uses relatedSavedObjects merged with references when provided', async () =
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});
});
Expand Down Expand Up @@ -451,6 +455,7 @@ test('uses relatedSavedObjects as is when references are empty', async () => {
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});
});
Expand Down Expand Up @@ -499,6 +504,7 @@ test('sanitizes invalid relatedSavedObjects when provided', async () => {
relatedSavedObjects: [],
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});
});
Expand Down Expand Up @@ -538,6 +544,7 @@ test(`doesn't use API key when not provided`, async () => {
}),
taskInfo: {
scheduled: new Date(),
attempts: 0,
},
});

Expand All @@ -549,9 +556,15 @@ test(`doesn't use API key when not provided`, async () => {
});

test(`throws an error when license doesn't support the action type`, async () => {
const taskRunner = taskRunnerFactory.create({
taskInstance: mockedTaskInstance,
});
const taskRunner = taskRunnerFactory.create(
{
taskInstance: {
...mockedTaskInstance,
attempts: 1,
},
},
2
);

mockedEncryptedSavedObjectsClient.getDecryptedAsInternalUser.mockResolvedValueOnce({
id: '3',
Expand Down Expand Up @@ -579,6 +592,138 @@ test(`throws an error when license doesn't support the action type`, async () =>
} catch (e) {
expect(e instanceof ExecutorError).toEqual(true);
expect(e.data).toEqual({});
expect(e.retry).toEqual(false);
expect(e.retry).toEqual(true);
}
});

test(`treats errors as errors if the task is retryable`, async () => {
const taskRunner = taskRunnerFactory.create({
taskInstance: {
...mockedTaskInstance,
attempts: 0,
},
});

mockedEncryptedSavedObjectsClient.getDecryptedAsInternalUser.mockResolvedValueOnce({
id: '3',
type: 'action_task_params',
attributes: {
actionId: '2',
params: { baz: true },
apiKey: Buffer.from('123:abc').toString('base64'),
},
references: [
{
id: '2',
name: 'actionRef',
type: 'action',
},
],
});
mockedActionExecutor.execute.mockResolvedValueOnce({
status: 'error',
actionId: '2',
message: 'Error message',
data: { foo: true },
retry: false,
});

let err;
try {
await taskRunner.run();
} catch (e) {
err = e;
}
expect(err).toBeDefined();
expect(err instanceof ExecutorError).toEqual(true);
expect(err.data).toEqual({ foo: true });
expect(err.retry).toEqual(false);
expect(taskRunnerFactoryInitializerParams.logger.error as jest.Mock).toHaveBeenCalledWith(
`Action '2' failed and will not retry: Error message`
);
});

test(`treats errors as successes if the task is not retryable`, async () => {
const taskRunner = taskRunnerFactory.create({
taskInstance: {
...mockedTaskInstance,
attempts: 1,
},
});

mockedEncryptedSavedObjectsClient.getDecryptedAsInternalUser.mockResolvedValueOnce({
id: '3',
type: 'action_task_params',
attributes: {
actionId: '2',
params: { baz: true },
apiKey: Buffer.from('123:abc').toString('base64'),
},
references: [
{
id: '2',
name: 'actionRef',
type: 'action',
},
],
});
mockedActionExecutor.execute.mockResolvedValueOnce({
status: 'error',
actionId: '2',
message: 'Error message',
data: { foo: true },
retry: false,
});

let err;
try {
await taskRunner.run();
} catch (e) {
err = e;
}
expect(err).toBeUndefined();
expect(taskRunnerFactoryInitializerParams.logger.error as jest.Mock).toHaveBeenCalledWith(
`Action '2' failed and will not retry: Error message`
);
});

test('treats errors as errors if the error is thrown instead of returned', async () => {
const taskRunner = taskRunnerFactory.create({
taskInstance: {
...mockedTaskInstance,
attempts: 0,
},
});

mockedEncryptedSavedObjectsClient.getDecryptedAsInternalUser.mockResolvedValueOnce({
id: '3',
type: 'action_task_params',
attributes: {
actionId: '2',
params: { baz: true },
apiKey: Buffer.from('123:abc').toString('base64'),
},
references: [
{
id: '2',
name: 'actionRef',
type: 'action',
},
],
});
mockedActionExecutor.execute.mockRejectedValueOnce({});

let err;
try {
await taskRunner.run();
} catch (e) {
err = e;
}
expect(err).toBeDefined();
expect(err instanceof ExecutorError).toEqual(true);
expect(err.data).toEqual({});
expect(err.retry).toEqual(true);
expect(taskRunnerFactoryInitializerParams.logger.error as jest.Mock).toHaveBeenCalledWith(
`Action '2' failed and will retry: undefined`
);
});
44 changes: 35 additions & 9 deletions x-pack/plugins/actions/server/lib/task_runner_factory.ts
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@ import { ActionExecutorContract } from './action_executor';
import { ExecutorError } from './executor_error';
import { RunContext } from '../../../task_manager/server';
import { EncryptedSavedObjectsClient } from '../../../encrypted_saved_objects/server';
import { ActionTypeDisabledError } from './errors';
import {
ActionTaskParams,
ActionTypeRegistryContract,
Expand Down Expand Up @@ -62,7 +61,7 @@ export class TaskRunnerFactory {
this.taskRunnerContext = taskRunnerContext;
}

public create({ taskInstance }: RunContext) {
public create({ taskInstance }: RunContext, maxAttempts: number = 1) {
if (!this.isInitialized) {
throw new Error('TaskRunnerFactory not initialized');
}
Expand All @@ -78,6 +77,7 @@ export class TaskRunnerFactory {

const taskInfo = {
scheduled: taskInstance.runAt,
attempts: taskInstance.attempts,
};

return {
Expand Down Expand Up @@ -119,7 +119,14 @@ export class TaskRunnerFactory {

basePathService.set(fakeRequest, path);

let executorResult: ActionTypeExecutorResult<unknown>;
// Throwing an executor error means we will attempt to retry the task
// TM will treat a task as a failure if `attempts >= maxAttempts`
// so we need to handle that here to avoid TM persisting the failed task
const isRetryableBasedOnAttempts = taskInfo.attempts < (maxAttempts ?? 1);
const willRetryMessage = `and will retry`;
const willNotRetryMessage = `and will not retry`;

let executorResult: ActionTypeExecutorResult<unknown> | undefined;
try {
executorResult = await actionExecutor.execute({
params,
Expand All @@ -131,20 +138,39 @@ export class TaskRunnerFactory {
relatedSavedObjects: validatedRelatedSavedObjects(logger, relatedSavedObjects),
});
} catch (e) {
if (e instanceof ActionTypeDisabledError) {
// We'll stop re-trying due to action being forbidden
throw new ExecutorError(e.message, {}, false);
logger.error(
`Action '${actionId}' failed ${
isRetryableBasedOnAttempts ? willRetryMessage : willNotRetryMessage
}: ${e.message}`
);
if (isRetryableBasedOnAttempts) {
// In order for retry to work, we need to indicate to task manager this task
// failed
throw new ExecutorError(e.message, {}, true);
}
throw e;
}

if (executorResult.status === 'error') {
if (
executorResult &&
executorResult?.status === 'error' &&
executorResult?.retry !== undefined &&
isRetryableBasedOnAttempts
) {
logger.error(
`Action '${actionId}' failed ${
!!executorResult.retry ? willRetryMessage : willNotRetryMessage
}: ${executorResult.message}`
);
// Task manager error handler only kicks in when an error thrown (at this time)
// So what we have to do is throw when the return status is `error`.
throw new ExecutorError(
executorResult.message,
executorResult.data,
executorResult.retry == null ? false : executorResult.retry
executorResult.retry as boolean | Date
);
} else if (executorResult && executorResult?.status === 'error') {
logger.error(
`Action '${actionId}' failed ${willNotRetryMessage}: ${executorResult.message}`
);
}

Expand Down
2 changes: 2 additions & 0 deletions x-pack/plugins/actions/server/types.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import {
SavedObjectReference,
} from '../../../../src/core/server';
import { ActionTypeExecutorResult } from '../common';
import { TaskInfo } from './lib/action_executor';
export { ActionTypeExecutorResult } from '../common';
export { GetFieldsByIssueTypeResponse as JiraGetFieldsResponse } from './builtin_action_types/jira/types';
export { GetCommonFieldsResponse as ServiceNowGetFieldsResponse } from './builtin_action_types/servicenow/types';
Expand Down Expand Up @@ -59,6 +60,7 @@ export interface ActionTypeExecutorOptions<Config, Secrets, Params> {
secrets: Secrets;
params: Params;
isEphemeral?: boolean;
taskInfo?: TaskInfo;
}

export interface ActionResult<Config extends ActionTypeConfig = ActionTypeConfig> {
Expand Down
Loading