Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAM] adds bulkUpdatesSchedules method to Task Manager API #132637

Merged
merged 29 commits into from
Jun 13, 2022

Conversation

vitaliidm
Copy link
Contributor

@vitaliidm vitaliidm commented May 20, 2022

Addresses: #124850

Summary

  • Adds new method Task Manager API bulkUpdateSchedules
  • Adds calling taskManager.bulkUpdateSchedules in rulesClient.bulkEdit to update tasks if updated rules have scheduleTaskId property
  • Enables the rest of operations for rulesClient.bulkEdit (set schedule, notifyWhen, throttle)

bulkUpdateSchedules

Using bulkUpdatesSchedules you can instruct TaskManager to update interval of tasks that are in idle status.
When interval updated, new runAt will be computed and task will be updated with that value

export class Plugin {
  constructor() {
  }

  public setup(core: CoreSetup, plugins: { taskManager }) {
  }

  public start(core: CoreStart, plugins: { taskManager }) {
    try {
      const bulkUpdateResults = await taskManager.bulkUpdateSchedule(
        ['97c2c4e7-d850-11ec-bf95-895ffd19f959', 'a5ee24d1-dce2-11ec-ab8d-cf74da82133d'],
        { interval: '10m' },
      );
      // If no error is thrown, the bulkUpdateSchedule has completed successfully.
      // But some updates of some tasks can be failed, due to OCC 409 conflict for example
    } catch(err: Error) {
      // if error is caught, means the whole method requested has failed and tasks weren't updated
    }    
  }
}

in follow-up PRs

Checklist

Release note

Adds new method to Task Manager - bulkUpdatesSchedules, that allow bulk updates of scheduled tasks.
Adds 3 new operations to rulesClient.bulkUpdate: update of schedule, notifyWhen, throttle.

@vitaliidm vitaliidm self-assigned this May 20, 2022
const scheduleOperation = options.operations.find(
(op): op is Extract<BulkEditOperation, { field: Extract<BulkEditFields, 'schedule'> }> =>
op.field === 'schedule'
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what other operations should affect rescheduling of tasks?

Should notifyWhen or throttle trigger this bulkUpdateSchedules as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only logic that causes a rule's task to get rescheduled at this time is when the rule interval changes. We should be ok to skip the other operations.

@@ -34,6 +38,27 @@ const operationsSchema = schema.arrayOf(
field: schema.literal('actions'),
value: schema.arrayOf(ruleActionSchema),
}),
schema.object({
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently functional tests are skipped due to #132195
I plan to fix them and add additional tests in the following PR, so won't mix everything together.

I can do it in this PR, if it preferable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reviewers:
I believe this is the follow up PR @vitaliidm is referring to 😉
https://github.com/elastic/kibana/pull/133635/files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see them here, but TBH doesn't make sense if they're skipped, so ok in the follow-up. I added a comment to that PR as a reminder.

@vitaliidm vitaliidm added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team v8.4.0 release_note:feature Makes this part of the condensed release notes labels Jun 6, 2022
@vitaliidm vitaliidm changed the title [RAM]bulkUpdatesSchedules for taskManager [RAM] adds bulkUpdatesSchedules method to Task Manager API Jun 6, 2022
@vitaliidm vitaliidm added the backport:skip This commit does not require backporting label Jun 6, 2022
@vitaliidm vitaliidm marked this pull request as ready for review June 6, 2022 15:10
@vitaliidm vitaliidm requested a review from a team as a code owner June 6, 2022 15:10
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and tests look good!

There is this from the original issue:

The update(...) function of the rulesClient will move to use the newly proposed updateSchedule API (see below) to keep the behaviour the same as bulkUpdate(...).

Should that be done in this issue or is there a separate issue for that?

@vitaliidm
Copy link
Contributor Author

There is this from the original issue:

The update(...) function of the rulesClient will move to use the newly proposed updateSchedule API (see below) to keep the behaviour the same as bulkUpdate(...).
Should that be done in this issue or is there a separate issue for that?

I would like to address it in a separate PR.
Will prepare it, right after this one merged.
Will update description, noting that update will come next

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Verified it works as expected. Believe @pmuellr is also going to take a look at this PR

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me, except I don't see any rationale why we only update idle tasks. And I'm concerned about some tasks not getting updated, because they happen to be running when a bulk update occurs, with no indication to the caller that the task was skipped.

@@ -12,6 +12,10 @@ import { ILicenseState, RuleTypeDisabledError } from '../lib';
import { verifyAccessAndContext, rewriteRule, handleDisabledApiKeysError } from './lib';
import { AlertingRequestHandlerContext, INTERNAL_BASE_ALERTING_API_PATH } from '../types';

const scheduleSchema = schema.object({
interval: schema.string(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a schema config validator we use for intervals, as seen here:

interval: schema.string({ validate: validateDurationSchema }),

Using that, a bad interval sent will "fail fast" and not somewhere else, deeper in the framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I've replaced it with validateDurationSchema

@@ -34,6 +38,27 @@ const operationsSchema = schema.arrayOf(
field: schema.literal('actions'),
value: schema.arrayOf(ruleActionSchema),
}),
schema.object({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see them here, but TBH doesn't make sense if they're skipped, so ok in the follow-up. I added a comment to that PR as a reminder.

},
{
term: {
'task.status': 'idle',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we only look at idle tasks. Basically, why aren't we operating on running tasks as well? I'm guessing because the runAt would get overwritten when the task completes and reschedules itself.

But this seems like a pain to the users. Ask to update X rules, but some aren't updated because they happen to be running?

I feel like we should at least return the rules which weren't idle, in the response back to the user, so they could "try again", if we can't make this work for running tasks. Or at a minimum log something, so we at least have an indication that some rules were skipped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikecote filled me in - we don't NEED to update the running tasks, as their schedule is recalculated after the run, so it should recalculate the new runAt with the new schedule when it's complete. In case of a race condition - the rule is finished before the schedule is updated in it's SO - then there should be at most one more run with the old schedule. Which seems like the best we can do.

@@ -899,6 +909,62 @@ export default function ({ getService }: FtrProviderContext) {
});
});

it('should bulk updates schedules for multiple tasks', async () => {
const initialTime = Date.now();
const tasks = await Promise.all([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it might be hard to arrange, but can we test this with a "running" task? I think it would be some new task that would take a long time to run, and then we'd check to see if we're "ready" by querying the task state looking to see if it's "running". And then verify that it's runAt wasn't changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pmuellr , I added a test to cover this bit, here is my solution: 52b794a

Let me know if you are fine with that, and I will merge this PR then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, looks good, thx!

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got my comment about not updating non-idle tasks answered by Mike - we simply don't NEED to, as they will pick up the new schedule when they recalculate when the rule finishes.

However, I think we should add a comment about that - presumably in the place where we're filtering on idle, to indicate that. Otherwise I'll be confused about it next year :-)

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id before after diff
taskManager 77 80 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @vitaliidm

@vitaliidm
Copy link
Contributor Author

thanks @pmuellr for the feedback and approval

However, I think we should add a comment about that - presumably in the place where we're filtering on idle, to indicate that. Otherwise I'll be confused about it next year :-)

Added it to method description

@vitaliidm vitaliidm merged commit 6e0086d into elastic:main Jun 13, 2022
vitaliidm added a commit that referenced this pull request Jun 15, 2022
…nal tests for alerting bulk_edit (#133635)

## Summary

Addresses #132195
Follow-up to #132637

- Fixes flaky tests `x-pack/test/alerting_api_integration/spaces_only/tests/alerting/bulk_edit.ts`
- Adds new tests in spaces_only/tests/alerting/bulk_edit.ts for the following operations: `schedule`, `notifyWhen`, `throttle`
- Adds units test in x-pack/plugins/alerting/server/rules_client/lib/apply_bulk_edit_operation.test.ts

### Details to fixing flaky tests

Despite having bulkEdit retry when 409 conflict happens during bulk update of SO, sometimes there was still failing test.
So, all tests in bulk_edit.ts were skipped 93ffd78

The reason, it happens I believe because mocked rule that used for test is enabled, its update after run clashes with `SavedObject.bulkUpdate` method that used inside `RulesClient.bulkEdit`. So, to fix it, I made this mocked rule disabled, which seems like fixed the flakiness

Here is flaky test runner builds with ENABLED rule: 
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/717
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/713

and with DISABLED:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/718
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/714

In both enabled runs there was test failure, but not for disabled.

Another possible way to fix:
use retry in test for `rules.bulkEdit` call and assertion. Let me know if it can more preferable way to fix it

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
vitaliidm added a commit that referenced this pull request Jun 15, 2022
…kUpdateSchedules (#134027)

## Summary

 - follow-up to #132637, #124850
 - replaces in `RulesClient.update` method TaskManager API `runNow` to `bulkUpdateSchedules`

When using runNow in scale, there can be situations, when TaskManager capacity is full, thus leading failure of `runNow`.
Instead, new API `bulkUpdateSchedules` will be used, which in case if rule schedule is getting updated: will update underlying task schedule and will calculate new `runAt` time. 
More details on new TaskManager API: #132637, #124850


### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
@vitaliidm vitaliidm deleted the task-manager-bulk-schedules branch March 4, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Task Manager release_note:feature Makes this part of the condensed release notes Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.4.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

7 participants