Task Manager plugin API to bulk update task schedules #124850

Closed
mikecote opened this issue Feb 7, 2022 · 11 comments
Assignees
Labels
8.3 candidate Feature:Task Manager research Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@mikecote
Contributor

mikecote commented Feb 7, 2022

To support #124715 and the existing functionality when updating a rule, we need to reflect the rule's schedule in the task's schedule. We previously ran the rule right away and let it update the task schedule at the end. However, when thinking of doing so in bulk, we encounter thundering herd and worker capacity problems.

We need to think of a way to allow bulk updates of tasks with a new schedule that doesn't cause thundering herd or Kibana capacity problems. Maybe something like rescheduling each task to a random time between now and the interval. This functionality could also be re-used in the current rule update API to keep behaviour the same (bulk vs single update).
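
To make the random rescheduling idea concrete, here is a minimal sketch. The helper name `spreadRunAt` and the injectable `random` parameter are assumptions for illustration only and are not part of Task Manager.

```typescript
// Hypothetical helper (not a Task Manager API): pick a random runAt in
// [now, now + intervalMs). `random` is injectable so the behaviour can be
// checked deterministically.
function spreadRunAt(
  now: Date,
  intervalMs: number,
  random: () => number = Math.random
): Date {
  const offsetMs = Math.floor(random() * intervalMs);
  return new Date(now.getTime() + offsetMs);
}

// Spread three tasks with a 5 minute interval across the next 5 minutes.
const startTime = new Date('2022-02-07T12:00:00.000Z');
const fiveMinutesMs = 5 * 60 * 1000;
const spreadTasks = ['task-a', 'task-b', 'task-c'].map((id) => ({
  id,
  runAt: spreadRunAt(startTime, fiveMinutesMs),
}));
```

Because each task lands at a different offset, the tasks no longer become due at the same instant, which is the point of avoiding the thundering herd.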

Requirements

  1. If a user reduces the interval (ex: 24hr to 12hr), the system should run the rule at "new interval" from its last run.
  2. If a user increases the interval (ex: 12hr to 24hr), the system should run the rule at "new interval" from its last run.
  3. If the new calculated run is in the past, we should replace it with "now" to avoid skipping the queue.

Proposal

For Alerting and Task Manager to support updating task schedules in bulk, we should avoid clogging the Task Manager queue and instead calculate the next runAt based on the new interval and the last run using the formula below. Updating only the task document allows any Kibana instance to pick up these re-scheduled tasks.

newRunAt = oldRunAt - oldInterval + newInterval
Note: If newRunAt is < now, we will set it to now
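
As a worked example of the formula and the note, here is a small sketch. The function name `computeNewRunAt` is hypothetical, all times are milliseconds, and it assumes `oldRunAt - oldInterval` is the time of the last run.

```typescript
// Hypothetical helper illustrating the formula above; not the actual
// Task Manager implementation.
function computeNewRunAt(
  oldRunAtMs: number,
  oldIntervalMs: number,
  newIntervalMs: number,
  currentMs: number
): number {
  // Assumption: the task was scheduled at "last run + old interval".
  const impliedLastRunMs = oldRunAtMs - oldIntervalMs;
  const newRunAtMs = impliedLastRunMs + newIntervalMs;
  // If the result is in the past, use "now" instead.
  return Math.max(newRunAtMs, currentMs);
}

const HOUR_MS = 60 * 60 * 1000;
const lastRunMs = Date.parse('2022-05-04T00:00:00.000Z');
const nowMs = lastRunMs + 6 * HOUR_MS; // 6 hours after the last run

// 24h -> 12h: the next run moves to 12h after the last run.
const reduced = computeNewRunAt(lastRunMs + 24 * HOUR_MS, 24 * HOUR_MS, 12 * HOUR_MS, nowMs);
// 12h -> 24h: the next run moves to 24h after the last run.
const increased = computeNewRunAt(lastRunMs + 12 * HOUR_MS, 12 * HOUR_MS, 24 * HOUR_MS, nowMs);
// 24h -> 1h: 1h after the last run is already in the past, so use now.
const clamped = computeNewRunAt(lastRunMs + 24 * HOUR_MS, 24 * HOUR_MS, 1 * HOUR_MS, nowMs);
```

The three calls correspond to requirements 1, 2 and 3 respectively.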

Task Manager will only update task schedules of non-running tasks. Tasks that are claiming or running will have their schedule updated automatically at the end of their run.

The update(...) function of the rulesClient will move to use the newly proposed updateSchedule API (see below) to keep the behaviour the same as bulkUpdate(...).

Task Manager API proposal
updateSchedule(taskIds: string[], schedule: string)

Since the bulk edit API in alerting takes a single schedule and applies it to all alerting rules, we can do the same here by taking a single schedule parameter and applying it to all taskIds passed in.

The function should do the following:

  1. Load all tasks part of taskIds, filtering by status:idle
    • We should probably limit the fields returned in case the state object is huge for each task (serialization / event loop blocking)
  2. Calculate the runAt for each task
  3. Bulk update task documents using OCC (Optimistic Concurrency Control / versioning)

Note: The alerting API shouldn’t return an error if we can’t update all tasks successfully (they will eventually be updated, this is a best effort mechanism)
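
The three steps above can be sketched against an in-memory store. Everything here (`TaskDoc`, `InMemoryTaskStore`, `updateScheduleSketch`, numeric intervals in milliseconds) is a hypothetical stand-in used to illustrate the flow; the real task store and document shape may differ.

```typescript
// Hypothetical task document; `state` may be large and is never loaded below.
interface TaskDoc {
  id: string;
  version: number;
  taskStatus: 'idle' | 'claiming' | 'running';
  runAt: number; // ms since epoch
  intervalMs: number;
  state: unknown;
}

// Only the fields needed to recompute the schedule.
type TaskFields = Pick<TaskDoc, 'id' | 'version' | 'runAt' | 'intervalMs'>;

class InMemoryTaskStore {
  private docs = new Map<string, TaskDoc>();

  constructor(initial: TaskDoc[]) {
    initial.forEach((doc) => this.docs.set(doc.id, { ...doc }));
  }

  // Step 1: load idle tasks only, returning a limited set of fields.
  loadIdle(taskIds: string[]): TaskFields[] {
    const result: TaskFields[] = [];
    for (const id of taskIds) {
      const doc = this.docs.get(id);
      if (doc && doc.taskStatus === 'idle') {
        result.push({ id: doc.id, version: doc.version, runAt: doc.runAt, intervalMs: doc.intervalMs });
      }
    }
    return result;
  }

  // Step 3: write only when the version still matches (OCC); skip otherwise.
  bulkUpdate(updates: TaskFields[]): string[] {
    const updatedIds: string[] = [];
    for (const update of updates) {
      const doc = this.docs.get(update.id);
      if (!doc || doc.version !== update.version) continue; // version conflict
      this.docs.set(update.id, {
        ...doc,
        runAt: update.runAt,
        intervalMs: update.intervalMs,
        version: doc.version + 1,
      });
      updatedIds.push(update.id);
    }
    return updatedIds;
  }

  get(id: string): TaskDoc | undefined {
    return this.docs.get(id);
  }
}

function updateScheduleSketch(
  store: InMemoryTaskStore,
  taskIds: string[],
  newIntervalMs: number,
  currentMs: number
): string[] {
  const idleTasks = store.loadIdle(taskIds);
  // Step 2: newRunAt = oldRunAt - oldInterval + newInterval, never before now.
  const updates = idleTasks.map((task) => ({
    ...task,
    intervalMs: newIntervalMs,
    runAt: Math.max(task.runAt - task.intervalMs + newIntervalMs, currentMs),
  }));
  return store.bulkUpdate(updates);
}

const HOUR = 60 * 60 * 1000;
const taskStore = new InMemoryTaskStore([
  { id: 'a', version: 1, taskStatus: 'idle', runAt: 24 * HOUR, intervalMs: 24 * HOUR, state: {} },
  { id: 'b', version: 1, taskStatus: 'running', runAt: 24 * HOUR, intervalMs: 24 * HOUR, state: {} },
]);
const updatedIds = updateScheduleSketch(taskStore, ['a', 'b'], 12 * HOUR, 6 * HOUR);
// A write carrying the old version of 'a' is now rejected by the version check.
const staleWrite = taskStore.bulkUpdate([{ id: 'a', version: 1, runAt: 0, intervalMs: HOUR }]);
```

Skipped tasks are not reported as errors in this sketch, matching the best-effort note above.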

@mikecote mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Feb 7, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote mikecote changed the title Task Manager bulk API to update task schedule Task Manager bulk plugin API to update task schedule Feb 7, 2022
@mikecote mikecote changed the title Task Manager bulk plugin API to update task schedule Task Manager plugin API to bulk update task schedules Feb 7, 2022
@mikecote mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Feb 17, 2022
@mikecote
Contributor Author

mikecote commented Mar 8, 2022

Moving to the backlog as updating the interval is not a requirement for the first iteration of bulk update rules API.

@banderror banderror added Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team labels Apr 27, 2022
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@mikecote
Contributor Author

I am working on a proposal and meeting with @vitaliidm in the coming weeks. Thank you for helping with the PR! 🙏

@mikecote
Contributor Author

mikecote commented May 4, 2022

I've updated the issue description to have a Requirements and Proposal section reflecting what was discussed and proposed to the team.

@vitaliidm
Contributor

vitaliidm commented May 11, 2022

After discussion with @mikecote and @XavierM, we agreed on the following implementation:

  1. Create a new method in taskManager, bulkUpdateSchedules, which does the following:
bulkUpdateSchedules(taskIds: string[], schedule: string)
    • Loads all tasks part of taskIds, filtering by status:idle. We should probably limit the fields returned in case the state object is huge for each task (serialization / event loop blocking)
    • Calculates the runAt for each task:
newRunAt = oldRunAt - oldInterval + newInterval
    • Bulk updates task documents using OCC (Optimistic Concurrency Control / versioning)
  2. We don't handle 409 conflict errors, as the task's schedule may already have been updated by another instance of Kibana
  3. The Alerting API shouldn’t return an error if we can’t update all tasks successfully (they will eventually be updated, this is a best effort mechanism)

@mikecote
Contributor Author

mikecote commented May 11, 2022

^^ correct, I removed the handling of 409 errors from the issue description. 👍 (because a 409 error would usually mean the idle task changed to claiming or running and will update its schedule after its run).
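
A minimal sketch of what "not handling" 409s could look like when reading per-task results. The `BulkItemResult` shape and `summarizeBulkResults` are hypothetical and do not mirror the actual Elasticsearch or Task Manager response types.

```typescript
// Hypothetical per-task result of a bulk update.
interface BulkItemResult {
  id: string;
  statusCode: number;
}

// 409s are recorded as skipped rather than retried or raised: the task was
// most likely claimed or started running and will pick up the new schedule
// at the end of its run.
function summarizeBulkResults(results: BulkItemResult[]): {
  updated: string[];
  skippedConflicts: string[];
  failed: string[];
} {
  const updated: string[] = [];
  const skippedConflicts: string[] = [];
  const failed: string[] = [];
  for (const result of results) {
    if (result.statusCode >= 200 && result.statusCode < 300) {
      updated.push(result.id);
    } else if (result.statusCode === 409) {
      skippedConflicts.push(result.id);
    } else {
      failed.push(result.id);
    }
  }
  return { updated, skippedConflicts, failed };
}

const bulkSummary = summarizeBulkResults([
  { id: 'a', statusCode: 200 },
  { id: 'b', statusCode: 409 },
  { id: 'c', statusCode: 500 },
]);
```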

@mikecote mikecote moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors May 30, 2022
@ymao1 ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jun 9, 2022
vitaliidm added a commit that referenced this issue Jun 13, 2022
Addresses: #124850
## Summary

- Adds a new Task Manager API method, `bulkUpdateSchedules`
- Adds a call to `taskManager.bulkUpdateSchedules` in rulesClient.bulkEdit to update tasks if updated rules have the `scheduleTaskId` property
- Enables the rest of the operations for rulesClient.bulkEdit (set schedule, notifyWhen, throttle)

#### bulkUpdateSchedules
Using `bulkUpdateSchedules` you can instruct TaskManager to update the interval of tasks that are in `idle` status.
When the interval is updated, a new `runAt` is computed and the task is updated with that value.

```ts
export class Plugin {
  constructor() {
  }

  public setup(core: CoreSetup, plugins: { taskManager }) {
  }

  // `start` is async so that the bulk update below can be awaited
  public async start(core: CoreStart, plugins: { taskManager }) {
    try {
      const bulkUpdateResults = await plugins.taskManager.bulkUpdateSchedules(
        ['97c2c4e7-d850-11ec-bf95-895ffd19f959', 'a5ee24d1-dce2-11ec-ab8d-cf74da82133d'],
        { interval: '10m' },
      );
      // If no error is thrown, bulkUpdateSchedules has completed successfully.
      // Updates of individual tasks can still fail, for example due to an OCC 409 conflict.
    } catch (err) {
      // If an error is caught, the whole request has failed and the tasks weren't updated.
    }
  }
}
```
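
The example above passes the interval as a string such as `'10m'`. As a rough illustration of turning such a string into milliseconds for the `runAt` calculation, here is a sketch; `parseIntervalToMs` is a hypothetical helper, and the set of supported units (`s`, `m`, `h`, `d`) is an assumption rather than Task Manager's documented format.

```typescript
// Hypothetical helper: convert "<number><unit>" into milliseconds.
// The supported units are an assumption made for this sketch.
function parseIntervalToMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  const amount = Number(match[1]);
  switch (match[2]) {
    case 's':
      return amount * 1000;
    case 'm':
      return amount * 60 * 1000;
    case 'h':
      return amount * 60 * 60 * 1000;
    case 'd':
      return amount * 24 * 60 * 60 * 1000;
    default:
      throw new Error(`Invalid interval unit in: ${interval}`);
  }
}

const tenMinutesMs = parseIntervalToMs('10m');
```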
### In follow-up PRs
- use `taskManager.bulkUpdateSchedules` in rulesClient.update (#134027)
- functional test for bulkEdit (#133635)

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### Release note
Adds a new method to Task Manager, `bulkUpdateSchedules`, that allows bulk updates of scheduled tasks.
Adds 3 new operations to rulesClient.bulkEdit: update of schedule, notifyWhen, throttle.
@ymao1
Contributor

ymao1 commented Jun 13, 2022

@vitaliidm This issue is closed by #132637 right?

@vitaliidm
Contributor

vitaliidm commented Jun 13, 2022

@vitaliidm This issue is closed by #132637 right?

Hey @ymao1
There are 2 follow-ups to that PR:

Let's close it after the first one gets merged (the RulesClient.update refactoring), when the full scope of this task is addressed.

vitaliidm added a commit that referenced this issue Jun 15, 2022
…kUpdateSchedules (#134027)

## Summary

 - follow-up to #132637, #124850
 - replaces the TaskManager API `runNow` with `bulkUpdateSchedules` in the `RulesClient.update` method

When `runNow` is used at scale, TaskManager capacity can be full, which causes `runNow` to fail.
Instead, the new `bulkUpdateSchedules` API will be used: when a rule's schedule is updated, it updates the underlying task's schedule and calculates the new `runAt` time.
More details on new TaskManager API: #132637, #124850
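
A rough sketch of the flow described above, under stated assumptions: `TaskManagerSketch`, `RuleSketch`, the `taskId` field and `syncTaskSchedule` are hypothetical names for illustration and do not reproduce the actual `RulesClient.update` code.

```typescript
// Hypothetical minimal view of the Task Manager API used here.
interface TaskManagerSketch {
  bulkUpdateSchedules(taskIds: string[], schedule: { interval: string }): Promise<unknown>;
}

// Hypothetical minimal rule shape; `taskId` stands in for the rule's task reference.
interface RuleSketch {
  taskId?: string;
  schedule: { interval: string };
}

// Instead of running the task immediately, only reschedule it, and only when
// the interval actually changed and the rule has an underlying task.
async function syncTaskSchedule(
  taskManager: TaskManagerSketch,
  before: RuleSketch,
  after: RuleSketch
): Promise<boolean> {
  const scheduleChanged = before.schedule.interval !== after.schedule.interval;
  if (!scheduleChanged || !after.taskId) {
    return false;
  }
  await taskManager.bulkUpdateSchedules([after.taskId], after.schedule);
  return true;
}

// A fake Task Manager that records calls, for demonstration.
const recordedCalls: Array<{ taskIds: string[]; interval: string }> = [];
const fakeTaskManager: TaskManagerSketch = {
  bulkUpdateSchedules: (taskIds, schedule) => {
    recordedCalls.push({ taskIds, interval: schedule.interval });
    return Promise.resolve();
  },
};

// Interval changed: the task is rescheduled.
void syncTaskSchedule(
  fakeTaskManager,
  { taskId: 'task-1', schedule: { interval: '5m' } },
  { taskId: 'task-1', schedule: { interval: '10m' } }
);
// Interval unchanged: nothing is sent to Task Manager.
void syncTaskSchedule(
  fakeTaskManager,
  { taskId: 'task-2', schedule: { interval: '5m' } },
  { taskId: 'task-2', schedule: { interval: '5m' } }
);
```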


### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios