-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RAM] adds bulkUpdatesSchedules method to Task Manager API #132637
[RAM] adds bulkUpdatesSchedules method to Task Manager API #132637
Conversation
const scheduleOperation = options.operations.find( | ||
(op): op is Extract<BulkEditOperation, { field: Extract<BulkEditFields, 'schedule'> }> => | ||
op.field === 'schedule' | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what other operations should affect rescheduling of tasks?
Should notifyWhen
or throttle
trigger this bulkUpdateSchedules as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only logic that causes a rule's task to get rescheduled at this time is when the rule interval changes. We should be ok to skip the other operations.
…liidm/kibana into task-manager-bulk-schedules
@@ -34,6 +38,27 @@ const operationsSchema = schema.arrayOf( | |||
field: schema.literal('actions'), | |||
value: schema.arrayOf(ruleActionSchema), | |||
}), | |||
schema.object({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently functional tests are skipped due to #132195
I plan to fix them and add additional tests in the following PR, so won't mix everything together.
I can do it in this PR, if it preferable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reviewers:
I believe this is the follow up PR @vitaliidm is referring to 😉
https://github.com/elastic/kibana/pull/133635/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see them here, but TBH doesn't make sense if they're skipped, so ok in the follow-up. I added a comment to that PR as a reminder.
Pinging @elastic/response-ops (Team:ResponseOps) |
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/security-solution (Team: SecuritySolution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code and tests look good!
There is this from the original issue:
The update(...) function of the rulesClient will move to use the newly proposed updateSchedule API (see below) to keep the behaviour the same as bulkUpdate(...).
Should that be done in this issue or is there a separate issue for that?
I would like to address it in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Verified it works as expected. Believe @pmuellr is also going to take a look at this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good to me, except I don't see any rationale why we only update idle
tasks. And I'm concerned about some tasks not getting updated, because they happen to be running when a bulk update occurs, with no indication to the caller that the task was skipped.
@@ -12,6 +12,10 @@ import { ILicenseState, RuleTypeDisabledError } from '../lib'; | |||
import { verifyAccessAndContext, rewriteRule, handleDisabledApiKeysError } from './lib'; | |||
import { AlertingRequestHandlerContext, INTERNAL_BASE_ALERTING_API_PATH } from '../types'; | |||
|
|||
const scheduleSchema = schema.object({ | |||
interval: schema.string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a schema config validator we use for intervals, as seen here:
interval: schema.string({ validate: validateDurationSchema }), |
Using that, a bad interval sent will "fail fast" and not somewhere else, deeper in the framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I've replaced it with validateDurationSchema
@@ -34,6 +38,27 @@ const operationsSchema = schema.arrayOf( | |||
field: schema.literal('actions'), | |||
value: schema.arrayOf(ruleActionSchema), | |||
}), | |||
schema.object({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see them here, but TBH doesn't make sense if they're skipped, so ok in the follow-up. I added a comment to that PR as a reminder.
}, | ||
{ | ||
term: { | ||
'task.status': 'idle', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why we only look at idle
tasks. Basically, why aren't we operating on running tasks as well? I'm guessing because the runAt
would get overwritten when the task completes and reschedules itself.
But this seems like a pain to the users. Ask to update X rules, but some aren't updated because they happen to be running?
I feel like we should at least return the rules which weren't idle, in the response back to the user, so they could "try again", if we can't make this work for running tasks. Or at a minimum log something, so we at least have an indication that some rules were skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikecote filled me in - we don't NEED to update the running tasks, as their schedule is recalculated after the run, so it should recalculate the new runAt with the new schedule when it's complete. In case of a race condition - the rule is finished before the schedule is updated in it's SO - then there should be at most one more run with the old schedule. Which seems like the best we can do.
@@ -899,6 +909,62 @@ export default function ({ getService }: FtrProviderContext) { | |||
}); | |||
}); | |||
|
|||
it('should bulk updates schedules for multiple tasks', async () => { | |||
const initialTime = Date.now(); | |||
const tasks = await Promise.all([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it might be hard to arrange, but can we test this with a "running" task? I think it would be some new task that would take a long time to run, and then we'd check to see if we're "ready" by querying the task state looking to see if it's "running". And then verify that it's runAt wasn't changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, looks good, thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got my comment about not updating non-idle tasks answered by Mike - we simply don't NEED to, as they will pick up the new schedule when they recalculate when the rule finishes.
However, I think we should add a comment about that - presumably in the place where we're filtering on idle, to indicate that. Otherwise I'll be confused about it next year :-)
💚 Build SucceededMetrics [docs]History
To update your PR or re-run it, just comment with: cc @vitaliidm |
thanks @pmuellr for the feedback and approval
Added it to method description |
…nal tests for alerting bulk_edit (#133635) ## Summary Addresses #132195 Follow-up to #132637 - Fixes flaky tests `x-pack/test/alerting_api_integration/spaces_only/tests/alerting/bulk_edit.ts` - Adds new tests in spaces_only/tests/alerting/bulk_edit.ts for the following operations: `schedule`, `notifyWhen`, `throttle` - Adds units test in x-pack/plugins/alerting/server/rules_client/lib/apply_bulk_edit_operation.test.ts ### Details to fixing flaky tests Despite having bulkEdit retry when 409 conflict happens during bulk update of SO, sometimes there was still failing test. So, all tests in bulk_edit.ts were skipped 93ffd78 The reason, it happens I believe because mocked rule that used for test is enabled, its update after run clashes with `SavedObject.bulkUpdate` method that used inside `RulesClient.bulkEdit`. So, to fix it, I made this mocked rule disabled, which seems like fixed the flakiness Here is flaky test runner builds with ENABLED rule: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/717 https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/713 and with DISABLED: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/718 https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/714 In both enabled runs there was test failure, but not for disabled. Another possible way to fix: use retry in test for `rules.bulkEdit` call and assertion. Let me know if it can more preferable way to fix it ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…kUpdateSchedules (#134027) ## Summary - follow-up to #132637, #124850 - replaces in `RulesClient.update` method TaskManager API `runNow` to `bulkUpdateSchedules` When using runNow in scale, there can be situations, when TaskManager capacity is full, thus leading failure of `runNow`. Instead, new API `bulkUpdateSchedules` will be used, which in case if rule schedule is getting updated: will update underlying task schedule and will calculate new `runAt` time. More details on new TaskManager API: #132637, #124850 ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Addresses: #124850
Summary
bulkUpdateSchedules
taskManager.bulkUpdateSchedules
in rulesClient.bulkEdit to update tasks if updated rules havescheduleTaskId
propertybulkUpdateSchedules
Using
bulkUpdatesSchedules
you can instruct TaskManager to update interval of tasks that are inidle
status.When interval updated, new
runAt
will be computed and task will be updated with that valuein follow-up PRs
taskManager.bulkUpdateSchedules
in rulesClient.update ([RAM] refactors RulesClient.update, by using new Task Manager API bulkUpdateSchedules #134027)Checklist
Release note
Adds new method to Task Manager - bulkUpdatesSchedules, that allow bulk updates of scheduled tasks.
Adds 3 new operations to rulesClient.bulkUpdate: update of schedule, notifyWhen, throttle.