Create task to cleanup action execution failures #96971
Conversation
x-pack/plugins/actions/server/cleanup_failed_executions/lib/space_id_to_namespace.ts (outdated; review thread resolved)
@@ -50,6 +50,11 @@ export const configSchema = schema.object({
   rejectUnauthorized: schema.boolean({ defaultValue: true }),
   maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
   responseTimeout: schema.duration({ defaultValue: '60s' }),
+  cleanupFailedExecutionsTask: schema.object({
+    enabled: schema.boolean({ defaultValue: true }),
+    interval: schema.duration({ defaultValue: '15s' }),
15 seconds? Is that just so you can test this? :-)
I was thinking more like 12 or 24 hours ... maybe an hour?
Yeah, this is a mistake, though I was thinking something small, like 5m. I'm trying to figure out how long it would take to clean up environments for customers with 400,000 SOs..

1h processing 100 at a time = ~167 days 🤔
5m processing 100 at a time = ~14 days
It feels like there are two modes, one to clean up a large amount of data (short interval) and one to clean up executions that slipped (long interval). In theory, if I fix the task to always be successful, this feature would only be for the former (clean up a large amount of data) and would never be necessary again.
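As a sanity check on those estimates, here is a small sketch of the arithmetic (doc count, page size, and intervals taken from the comment above; the helper itself is just illustrative):

```ts
// Rough estimate of how long a paged cleanup takes:
// (totalDocs / pageSize) runs, one run per interval.
const estimateDays = (totalDocs: number, pageSize: number, intervalMinutes: number): number =>
  ((totalDocs / pageSize) * intervalMinutes) / (60 * 24);

console.log(estimateDays(400_000, 100, 60)); // ~166.7 days with a 1h interval
console.log(estimateDays(400_000, 100, 5));  // ~13.9 days with a 5m interval
```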
Ya, that kinda seems right - long interval that always runs, and a special one that cleans up a large backlog. Could this be two separate tasks? Maybe we'd have the large backlog task be disabled, and then enable it if we see > 1000 docs that need to be cleaned. Once that task went under that threshold (1000 in this case), it would disable itself.
> Could this be two separate tasks?
We can return different intervals after each run. I could return, e.g., 5m if there are still docs left to clean up after a run, or return 1h / 12h / 24h if there are none, or something like that. This way, we have one task to worry about. What are your thoughts?
> We can return different intervals after each run.

That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often or too infrequently.

I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill; I did notice some debug logging, which is probably good enough.
> That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often or too infrequently.

I capture runs and total_cleaned_up within the task manager state. Do you think that would suffice?
> I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill; I did notice some debug logging, which is probably good enough.

I fear that if the user has 400k docs to delete, it would generate 4,000 INFO logs (one every 5m) and flood their logs over time.
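For illustration, a minimal sketch of what a single run could look like in the task runner. The state counters match the runs / total_cleaned_up fields mentioned above, but cleanupOnePage, the interval values, and the exact return shape are assumptions rather than the PR's actual implementation:

```ts
interface CleanupTaskState {
  runs: number;
  total_cleaned_up: number;
}

// Hypothetical page-cleanup step; a real run would bulk-delete one page of failed
// action executions and report how many are still left.
async function cleanupOnePage(): Promise<{ cleanedUp: number; remaining: number }> {
  return { cleanedUp: 0, remaining: 0 }; // stub for illustration
}

// Hedged sketch of a single task run: bump the counters kept in task manager state
// and return a short interval while a backlog remains, a long one once it is drained.
async function runCleanup(state: CleanupTaskState) {
  const { cleanedUp, remaining } = await cleanupOnePage();
  return {
    state: {
      runs: state.runs + 1,
      total_cleaned_up: state.total_cleaned_up + cleanedUp,
    },
    schedule: { interval: remaining > 0 ? '5m' : '1h' },
  };
}
```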
@elasticmachine merge upstream

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Will this PR be somewhat moot based on the work done for #90888? I'm not sure we know the full direction yet, but I'm worried these changes might all need to be backed out in the near future, unless I'm missing something?

@chrisronline The assumption is correct. This PR is the short-term fix to clean up existing objects on existing setups. The work done in #90888 will prevent creating future objects that need to be cleaned up. At some point we should be able to revert this code.
… into actions/cleanup-failures
LGTM! Verified by building up a bunch of bad email actions, then switching to this PR and watching them all disappear!
x-pack/plugins/actions/server/cleanup_failed_executions/cleanup_tasks.ts (outdated; review thread resolved)
    logger.debug(
      `Removing ${result.saved_objects.length} of ${result.total} failed execution task(s)`
    );
    const cleanupResult = await cleanupTasks({
nit: should we check for result.saved_objects.length > 0 and skip calling cleanupTasks if no failed tasks are found?
The bulk delete request fails when there isn't anything to bulk. I've moved the check to https://github.com/elastic/kibana/pull/96971/files#diff-676722bd4c174cfa74b5a0f77c8a5687e081ff306bc2376dbc2cdfa87b199333R16, but I'm happy to add another here.
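For reference, a hedged sketch of the guard being discussed; the FindResult type and the cleanupTasks signature are made up for this example and don't reflect the PR's actual API:

```ts
interface FindResult {
  total: number;
  saved_objects: Array<{ id: string; type: string }>;
}

// Skip the bulk delete entirely when the find query returned nothing, since an
// empty bulk request would fail.
async function cleanupIfNeeded(
  result: FindResult,
  cleanupTasks: (objects: Array<{ id: string; type: string }>) => Promise<number>
): Promise<number> {
  if (result.saved_objects.length === 0) {
    return 0; // nothing to clean up
  }
  return cleanupTasks(result.saved_objects);
}
```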
Do we still need to allow-list this in cloud, even if it's not officially documented, in case there is a cloud deployment that needs to change these settings?

I think it would be worth it, just in case a customer needs to change a value on short notice.
LGTM - a few non-critical comments / questions
@@ -3,7 +3,7 @@
   "version": "1.0.0",
   "kibanaVersion": "kibana",
   "configPath": ["xpack"],
-  "requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects"],
+  "requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects", "actions"],
"actions"
was already in the list, don't think it needs to be added here.
🤦 I guess I really realllly wanted to make sure the plugin was required 😄 Fixed in 4b82a46.
    actionTypeRegistry
      .list()
      .map((actionType) =>
        nodeBuilder.is('task.attributes.taskType', `actions:${actionType.id}`)
I guess this means we won't clean up anything but failed actions? Seems ok for now, but I suspect we'll end up with some other task types in the future which also leave tombstones.
Yeah, agreed. I felt safe doing so for actions to begin with since they are traced with the event log while other ad-hoc tasks aren't (yet).
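For context, a hedged sketch of how those per-action-type nodes could be combined into a single filter that only matches failed action tasks. The action type ids, the status field, and the import path are assumptions for illustration; the PR's actual filter may differ:

```ts
import { nodeBuilder } from '@kbn/es-query'; // import path may differ by Kibana version

// Illustrative stand-in for actionTypeRegistry.list()
const actionTypeIds = ['.email', '.slack', '.webhook'];

// OR together one "is" node per action type, then AND with a failed-status check.
const filter = nodeBuilder.and([
  nodeBuilder.is('task.attributes.status', 'failed'),
  nodeBuilder.or(
    actionTypeIds.map((id) => nodeBuilder.is('task.attributes.taskType', `actions:${id}`))
  ),
]);
```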
@@ -50,6 +50,12 @@ export const configSchema = schema.object({
   rejectUnauthorized: schema.boolean({ defaultValue: true }),
   maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
   responseTimeout: schema.duration({ defaultValue: '60s' }),
+  cleanupFailedExecutionsTask: schema.object({
hmm, doesn't this mean you HAVE to include this property, and {} would be fine since all the sub-properties are optional? Or I guess config-schema is smart enough to realize all the sub-properties are optional, so the parent is also essentially optional?
I did a local test. It's smart enough to fill in the default sub-properties automatically without requiring the root property to be defined.
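A quick way to reproduce that behavior with @kbn/config-schema directly (the interval default below is just an illustrative value):

```ts
import { schema } from '@kbn/config-schema';

const configSchema = schema.object({
  cleanupFailedExecutionsTask: schema.object({
    enabled: schema.boolean({ defaultValue: true }),
    interval: schema.duration({ defaultValue: '1h' }), // illustrative default
  }),
});

// Validating an empty config fills in the nested defaults without requiring the
// parent key to be present.
console.log(configSchema.validate({}));
// -> { cleanupFailedExecutionsTask: { enabled: true, interval: Duration('1h') } }
```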
* Initial commit
* Add tests and support for concurrency
* Ability to disable functionality, use bulk APIs
* Fix type check
* Fix jest tests
* Cleanup
* Cleanup pt2
* Add unit tests
* Fix type check
* Fixes
* Update test failures
* Split schedule between cleanup and idle
* Add functional tests
* Add one more test
* Cleanup repeated code
* Remove duplicate actions plugin requirement

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
💚 Backport successful. This backport PR will be merged automatically after passing CI.
Resolves #96577
In this PR, I'm creating a task that runs periodically to clean up action execution failures and their related action_task_params. As discussed here, two intervals are used: a short one while there are more failures to clean up, and a longer one when there aren't any failures to clean up at a given time.
The configuration is unofficially available under xpack.actions.cleanupFailedExecutionsTask to enable/disable the task, change the intervals, and set the page size.
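For example, disabling the task from kibana.yml would look roughly like this. Only the enabled key appears in the diffs above; the exact names of the interval and page-size sub-settings aren't confirmed in this thread, so treat this as a sketch:

```yaml
# Sketch only: toggling the unofficial cleanup task setting.
xpack.actions.cleanupFailedExecutionsTask:
  enabled: false
```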