Create task to cleanup action execution failures #96971
Conversation
x-pack/plugins/actions/server/cleanup_failed_executions/lib/space_id_to_namespace.ts (outdated; review thread resolved)
@@ -50,6 +50,11 @@ export const configSchema = schema.object({
   rejectUnauthorized: schema.boolean({ defaultValue: true }),
   maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
   responseTimeout: schema.duration({ defaultValue: '60s' }),
+  cleanupFailedExecutionsTask: schema.object({
+    enabled: schema.boolean({ defaultValue: true }),
+    interval: schema.duration({ defaultValue: '15s' }),
15 seconds? Is that just so you can test this? :-)
I was thinking more like 12 or 24 hours ... maybe an hour?
Yeah, this is a mistake, though I was thinking something small, like 5m. I'm trying to figure out how long it would take to clean up environments for customers with 400,000 SOs..

1h processing 100 at a time = ~167 days 🤔
5m processing 100 at a time = ~14 days
It feels like there are two modes, one to clean up a large amount of data (short interval) and one to clean up executions that slipped (long interval). In theory, if I fix the task to always be successful, this feature would only be for the former (clean up a large amount of data) and would never be necessary again.
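As a sanity check on those estimates, here is a small sketch of the arithmetic (doc count, page size, and intervals taken from the comment above; the helper itself is just illustrative):

```ts
// Rough estimate of how long a paged cleanup takes:
// (totalDocs / pageSize) runs, one run per interval.
const estimateDays = (totalDocs: number, pageSize: number, intervalMinutes: number): number =>
  ((totalDocs / pageSize) * intervalMinutes) / (60 * 24);

console.log(estimateDays(400_000, 100, 60)); // ~166.7 days with a 1h interval
console.log(estimateDays(400_000, 100, 5));  // ~13.9 days with a 5m interval
```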
Ya, that kinda seems right - long interval that always runs, and a special one that cleans up a large backlog. Could this be two separate tasks? Maybe we'd have the large backlog task be disabled, and then enable it if we see > 1000 docs that need to be cleaned. Once that task went under that threshold (1000 in this case), it would disable itself.
> Could this be two separate tasks?
We can return different intervals after each run. I could return, e.g., 5m if there are still docs left to clean up after a run, or return 1h / 12h / 24h if there are none, or something like that. This way, we have one task to worry about. What are your thoughts?
> We can return different intervals after each run.

That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often or too infrequently.

I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill; I did notice some debug logging, which is probably good enough.
> That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often or too infrequently.

I capture runs and total_cleaned_up within the task manager state. Do you think that would suffice?
> I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill; I did notice some debug logging, which is probably good enough.

I fear that if the user has 400k docs to delete, it would generate 4,000 INFO logs (one every 5m) and flood their logs over time.
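For illustration, a minimal sketch of what a single run could look like in the task runner. The state counters match the runs / total_cleaned_up fields mentioned above, but cleanupOnePage, the interval values, and the exact return shape are assumptions rather than the PR's actual implementation:

```ts
interface CleanupTaskState {
  runs: number;
  total_cleaned_up: number;
}

// Hypothetical page-cleanup step; a real run would bulk-delete one page of failed
// action executions and report how many are still left.
async function cleanupOnePage(): Promise<{ cleanedUp: number; remaining: number }> {
  return { cleanedUp: 0, remaining: 0 }; // stub for illustration
}

// Hedged sketch of a single task run: bump the counters kept in task manager state
// and return a short interval while a backlog remains, a long one once it is drained.
async function runCleanup(state: CleanupTaskState) {
  const { cleanedUp, remaining } = await cleanupOnePage();
  return {
    state: {
      runs: state.runs + 1,
      total_cleaned_up: state.total_cleaned_up + cleanedUp,
    },
    schedule: { interval: remaining > 0 ? '5m' : '1h' },
  };
}
```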
@elasticmachine merge upstream

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Will this PR be somewhat moot based on the work done for #90888? I'm not sure we know the full direction yet, but I'm worried these changes might all need to be backed out in the near future, unless I'm missing something?

@chrisronline The assumption is correct. This PR is the short-term fix to clean up existing objects on existing setups. The work done in #90888 will prevent creating future objects that need to be cleaned up. At some point we should be able to revert this code.
… into actions/cleanup-failures
LGTM! Verified by building up a bunch of bad email actions, then switching to this PR and watching them all disappear!
x-pack/plugins/actions/server/cleanup_failed_executions/cleanup_tasks.ts (outdated; review thread resolved)
    logger.debug(
      `Removing ${result.saved_objects.length} of ${result.total} failed execution task(s)`
    );
    const cleanupResult = await cleanupTasks({
nit: should we check for result.saved_objects.length > 0 and skip calling cleanupTasks if no failed tasks are found?
The bulk delete request fails when there isn't anything to bulk. I've moved the check to https://github.com/elastic/kibana/pull/96971/files#diff-676722bd4c174cfa74b5a0f77c8a5687e081ff306bc2376dbc2cdfa87b199333R16, but I'm happy to add another here.
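For reference, a hedged sketch of the guard being discussed; the FindResult type and the cleanupTasks signature are made up for this example and don't reflect the PR's actual API:

```ts
interface FindResult {
  total: number;
  saved_objects: Array<{ id: string; type: string }>;
}

// Skip the bulk delete entirely when the find query returned nothing, since an
// empty bulk request would fail.
async function cleanupIfNeeded(
  result: FindResult,
  cleanupTasks: (objects: Array<{ id: string; type: string }>) => Promise<number>
): Promise<number> {
  if (result.saved_objects.length === 0) {
    return 0; // nothing to clean up
  }
  return cleanupTasks(result.saved_objects);
}
```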
Do we still need to allow-list this in cloud, even if it's not officially documented, in case there is a cloud deployment that needs to change these settings?

I think it would be worth it, just in case a customer needs to change a value on short notice.
LGTM - a few non-critical comments / questions
@@ -3,7 +3,7 @@
   "version": "1.0.0",
   "kibanaVersion": "kibana",
   "configPath": ["xpack"],
-  "requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects"],
+  "requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects", "actions"],
"actions"
was already in the list, don't think it needs to be added here.
🤦 I guess I really realllly wanted to make sure the plugin was required 😄 Fixed in 4b82a46.
    actionTypeRegistry
      .list()
      .map((actionType) =>
        nodeBuilder.is('task.attributes.taskType', `actions:${actionType.id}`)
I guess this means we won't clean up anything but failed actions? Seems ok for now, but I suspect we'll end up with some other task types in the future which also leave tombstones.
Yeah, agreed. I felt safe doing so for actions to begin with since they are traced with the event log while other ad-hoc tasks aren't (yet).
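For context, a hedged sketch of how those per-action-type nodes could be combined into a single filter that only matches failed action tasks. The action type ids, the status field, and the import path are assumptions for illustration; the PR's actual filter may differ:

```ts
import { nodeBuilder } from '@kbn/es-query'; // import path may differ by Kibana version

// Illustrative stand-in for actionTypeRegistry.list()
const actionTypeIds = ['.email', '.slack', '.webhook'];

// OR together one "is" node per action type, then AND with a failed-status check.
const filter = nodeBuilder.and([
  nodeBuilder.is('task.attributes.status', 'failed'),
  nodeBuilder.or(
    actionTypeIds.map((id) => nodeBuilder.is('task.attributes.taskType', `actions:${id}`))
  ),
]);
```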
@@ -50,6 +50,12 @@ export const configSchema = schema.object({
   rejectUnauthorized: schema.boolean({ defaultValue: true }),
   maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
   responseTimeout: schema.duration({ defaultValue: '60s' }),
+  cleanupFailedExecutionsTask: schema.object({
hmm, doesn't this mean you HAVE to include this property, and {} would be fine since all the sub-properties are optional? Or I guess config-schema is smart enough to realize all the sub-properties are optional, so the parent is also essentially optional?
I did a local test. It's smart enough to fill in the default sub-properties automatically without requiring the root property to be defined.
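A quick way to reproduce that behavior with @kbn/config-schema directly (the interval default below is just an illustrative value):

```ts
import { schema } from '@kbn/config-schema';

const configSchema = schema.object({
  cleanupFailedExecutionsTask: schema.object({
    enabled: schema.boolean({ defaultValue: true }),
    interval: schema.duration({ defaultValue: '1h' }), // illustrative default
  }),
});

// Validating an empty config fills in the nested defaults without requiring the
// parent key to be present.
console.log(configSchema.validate({}));
// -> { cleanupFailedExecutionsTask: { enabled: true, interval: Duration('1h') } }
```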
* Initial commit
* Add tests and support for concurrency
* Ability to disable functionality, use bulk APIs
* Fix type check
* Fix jest tests
* Cleanup
* Cleanup pt2
* Add unit tests
* Fix type check
* Fixes
* Update test failures
* Split schedule between cleanup and idle
* Add functional tests
* Add one more test
* Cleanup repeated code
* Remove duplicate actions plugin requirement

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
💚 Backport successful. This backport PR will be merged automatically after passing CI.
Resolves #96577
In this PR, I'm creating a task that runs periodically to clean up action execution failures and their related action_task_params. As discussed here, two intervals are used: a short one while there are more failures to clean up, and a longer one when there aren't any failures to clean up at a given time.
The configuration is unofficially available under xpack.actions.cleanupFailedExecutionsTask to enable/disable the task, change the intervals, and set the page size.
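For example, disabling the task from kibana.yml would look roughly like this. Only the enabled key appears in the diffs above; the exact names of the interval and page-size sub-settings aren't confirmed in this thread, so treat this as a sketch:

```yaml
# Sketch only: toggling the unofficial cleanup task setting.
xpack.actions.cleanupFailedExecutionsTask:
  enabled: false
```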