
Task Manager - Long claiming duration breaks flow control #151938

Closed
kobelb opened this issue Feb 22, 2023 · 4 comments · Fixed by #153491
Assignees
Labels
bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments


kobelb commented Feb 22, 2023

When claiming tasks takes a long time, it causes two independent behaviors that violate the intended flow controls:

  1. When claiming cycles time out, ongoing work should be cancelled and subsequent work should not be started. Each task-manager claiming cycle currently has 30 seconds to complete; if it doesn't finish within that time period, the promise is rejected and we start the next claiming cycle. However, even though the promise is rejected after 30 seconds, task-manager isn't cancelling any ongoing work or preventing additional work associated with the timed-out claiming cycle from completing.

  2. When a claiming cycle takes a long time to complete, missed claiming intervals shouldn't "queue up" and immediately cause additional claims. Every 3 seconds, a new claim is triggered. The use of a concatMap prevents these claims from running while a prior claim is incomplete, which is great; however, it also causes the claims to "queue up" and run back-to-back as soon as the prior claim completes (see the sketch below).
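
A minimal sketch (not the task-manager source) of the queuing behavior described in item 2, assuming an RxJS interval feeding claims through concatMap:

```ts
import { interval } from 'rxjs';
import { concatMap } from 'rxjs/operators';

// Simulate a claiming cycle that takes 30 seconds.
const claim = async (n: number) => {
  console.log(`claim ${n} started`, new Date().toISOString());
  await new Promise((resolve) => setTimeout(resolve, 30_000));
  console.log(`claim ${n} finished`, new Date().toISOString());
};

// interval() keeps emitting every 3 seconds regardless of whether claim() has
// finished; concatMap buffers those emissions, so one slow 30-second claim is
// followed by ~10 queued claims running back-to-back instead of a single claim
// on the next interval.
interval(3_000)
  .pipe(concatMap((n) => claim(n)))
  .subscribe();
```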

@kobelb added the bug, Team:ResponseOps, and Feature:Alerting/RulesFramework labels Feb 22, 2023
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)


mikecote commented Mar 6, 2023

@pmuellr mentioned that it would be good to create and share tests exposing these symptoms, so we can discuss what options we have and know whether a given option would solve all the problems.


mikecote commented Mar 22, 2023

I took a look at one option for how we could approach this, mainly after looking into how to cancel updateByQuery calls.

  1. We could move away from the observable pattern to queue polling requests and instead use setTimeout at the end of a polling cycle
  2. When the 30s timeout is hit, we should cancel any ongoing "search for claimed tasks" queries.
  3. When the 30s timeout is hit, we should cancel any ongoing "update by query" request. If we really need to do this, I found a theoretical way to do so by using the wait_for_completion=false flag to detach from the request and then using Elasticsearch's Task Management APIs with wait_for_completion=true there (https://www.elastic.co/guide/en/elasticsearch/reference/master/tasks.html). This allows the updateByQuery call to return a taskId that we can monitor, and also cancel if ever necessary. However, the API is beta and adds a lot of overhead given that the majority of the time we will not need to cancel. There might be a way to simply use the tasks API to cancel an ongoing update by query without having to detach. (See the sketch below.)
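
A rough sketch of that detach-and-monitor idea, using the Elasticsearch JS client (v8-style flattened params) directly; the index name, query, and 30s budget are placeholders rather than what task manager actually uses:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function claimWithCancellation() {
  // Detach: with wait_for_completion=false the call returns a task id right away
  // instead of waiting for the update by query to finish.
  const { task } = await client.updateByQuery({
    index: '.kibana_task_manager', // placeholder index
    conflicts: 'proceed',
    wait_for_completion: false,
    query: { term: { 'task.status': 'idle' } }, // placeholder query
  });
  if (task == null) return;

  const taskId = String(task);
  try {
    // Re-attach with our own time budget via the Tasks API.
    await client.tasks.get({ task_id: taskId, wait_for_completion: true, timeout: '30s' });
  } catch (e) {
    // If waiting fails or times out, cancel the detached update by query.
    await client.tasks.cancel({ task_id: taskId });
    throw e;
  }
}
```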

I will keep digging for better approaches; I experimented with timeouts on updateByQuery without success. There might also be a way to set a header so we can query the tasks API independently and cancel the updateByQuery behind the scenes.

EDIT: Working on something that may not be as involved here: #153491

@mikecote mikecote self-assigned this Mar 22, 2023
mikecote added a commit that referenced this issue Mar 30, 2023
This PR is preparation for
#151938, to avoid having one large
PR to review all at once. Here, I'm removing support for on-demand
polling (`pollRequests$`). This code path was used when we supported
`runNow`, and is no longer used after transitioning to `runSoon`. A lot
of underlying code within the `task_poller.ts` file exists to support
that capability, including a queue of events (a combination of
`pushOptionalIntoSet` and `pullFromSet`) before calling `concatMap`,
which is no longer needed.

## To verify

Use the following to log timestamps whenever task manager is polling for
work
```diff
--- a/x-pack/plugins/task_manager/server/polling_lifecycle.ts
+++ b/x-pack/plugins/task_manager/server/polling_lifecycle.ts
@@ -243,6 +243,7 @@ export class TaskPollingLifecycle {
   }

   private pollForWork = async (): Promise<TimedFillPoolResult> => {
+    console.log('*** pollForWork', new Date().toISOString());
     return fillPool(
       // claim available tasks
       () => {
```

1. Ensure that when Kibana is running, pollForWork is logged on 3s
intervals.
2. Modify the code to slow down the updateByQuery to 5s, and notice that the
next pollForWork isn't called until two update by queries have run (main
pool and reporting). It should be every 10s instead of 3s.

```diff
--- a/x-pack/plugins/task_manager/server/task_store.ts
+++ b/x-pack/plugins/task_manager/server/task_store.ts
@@ -444,6 +444,7 @@ export class TaskStore {
     // eslint-disable-next-line @typescript-eslint/naming-convention
     { max_docs: max_docs }: UpdateByQueryOpts = {}
   ): Promise<UpdateByQueryResult> {
+    await new Promise((resolve) => setTimeout(resolve, 5000));
     const { query } = ensureQueryOnlyReturnsTaskObjects(opts);
     try {
       const // eslint-disable-next-line @typescript-eslint/naming-convention
```

3. Slow down the `setTimeout` added in step two by setting it to `45000`
and observe pollForWork getting called on a 30s interval with some
timeout errors getting logged (same as main).
4. Undo the code changes from step two in favour of the one below. Observe
pollForWork getting called on 3s intervals even though an error is
thrown.

```diff
--- a/x-pack/plugins/task_manager/server/task_store.ts
+++ b/x-pack/plugins/task_manager/server/task_store.ts
@@ -444,6 +444,7 @@ export class TaskStore {
     // eslint-disable-next-line @typescript-eslint/naming-convention
     { max_docs: max_docs }: UpdateByQueryOpts = {}
   ): Promise<UpdateByQueryResult> {
+    throw new Error('oh no!');
     const { query } = ensureQueryOnlyReturnsTaskObjects(opts);
     try {
       const // eslint-disable-next-line @typescript-eslint/naming-convention
```

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

mikecote commented Apr 3, 2023

I'm working on a POC (#153491). What I'm doing is the following:

  • Write a custom poller that uses setTimeout instead of observables (this solves the issue of polling requests piling up; see the sketch after this list)
  • Remove the 30s timeout (timeoutPromiseAfter) in favour of no timeout at all. This makes the promise rely on the Elasticsearch request timeouts instead, which should properly cancel any ongoing work and prevent subsequent code from running.
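
A bare-bones sketch of the setTimeout-based poller idea (names are illustrative, not the actual task_poller implementation): the next cycle is only scheduled once the current one has fully finished, so slow or failed cycles can never pile up behind each other.

```ts
function createPoller(pollForWork: () => Promise<void>, intervalMs: number) {
  let stopped = false;

  const runCycle = async () => {
    try {
      await pollForWork();
    } catch (e) {
      // Swallow errors here; the real implementation would log them.
    }
    if (!stopped) {
      // Schedule the next cycle only after this one has completed.
      setTimeout(runCycle, intervalMs);
    }
  };

  return {
    start: () => void runCycle(),
    stop: () => {
      stopped = true;
    },
  };
}

// Usage: poll every 3 seconds without overlapping or queued cycles.
const poller = createPoller(async () => {/* claim and run tasks */}, 3_000);
poller.start();
```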

A polling cycle would now look something like this:

  1. Run update by query for main tasks (takes up to 30s)
  2. If tasks got claimed, run a search query to find the claimed tasks (takes up to 30s)
  3. Run update by query for reporting tasks (takes up to 30s)
  4. If reporting tasks got claimed, run a search query to find the claimed reporting tasks (takes up to 30s)

The worst-case polling cycle time jumps from 30s to 2m, but we can adjust as needed. It should be quite rare that these requests take that long.

Internally, relying on the Elasticsearch client requestTimeout and setting maxRetries: 0 ensures the requests don't take longer than the timeout to finish and are not retried if they ever do time out.
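
For reference, a sketch of what those client options look like with the Elasticsearch JS client (the real code configures this inside the task store; the node URL and values here are placeholders):

```ts
import { Client } from '@elastic/elasticsearch';

const baseClient = new Client({ node: 'http://localhost:9200' });

// With maxRetries: 0, a timed-out request surfaces after a single requestTimeout
// instead of requestTimeout * (1 + maxRetries); the next polling cycle acts as
// the retry.
const taskStoreClient = baseClient.child({
  maxRetries: 0,
  requestTimeout: 30_000,
});
```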

mikecote added a commit that referenced this issue May 3, 2023
Fixes #151938

In this PR, I'm rewriting the Task Manager poller so it doesn't run
concurrently when timeouts occur, while also fixing the issue where
polling requests would pile up when polling takes time. To support this,
I've also made the following changes:
- Removed the observable monitor and the
`xpack.task_manager.max_poll_inactivity_cycles` setting
- Made the task store `search` and `updateByQuery` functions have no
retries. This prevents the request from retrying 5x whenever a timeout
occurs, which caused each call to take up to 2 1/2 minutes before Kibana
saw the error (now down to 30s each). We have polling to manage retries
in these situations.
- Switched the task poller tests to use `sinon` for faking timers (a
sketch of the pattern follows this list)
- Removed the `assertStillInSetup` checks on plugin setup. They felt
like a maintenance burden that was no longer necessary with my code
changes.
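
A minimal, illustrative sketch of the sinon fake-timer pattern for a setTimeout-based poller (not the actual task_poller test file; assumes Jest-style `test`/`expect` globals):

```ts
import sinon from 'sinon';

test('schedules the next polling cycle only after the interval elapses', async () => {
  const clock = sinon.useFakeTimers();
  try {
    const work = sinon.stub().resolves();
    const poll = async () => {
      await work();
      setTimeout(poll, 3000); // next cycle is scheduled only once this one is done
    };

    await poll();
    expect(work.callCount).toBe(1);

    await clock.tickAsync(3000); // advance fake time to trigger the next cycle
    expect(work.callCount).toBe(2);
  } finally {
    clock.restore();
  }
});
```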

The main code changes are within these files (to review thoroughly so
the polling cycle doesn't suddenly stop):
- x-pack/plugins/task_manager/server/polling/task_poller.ts
- x-pack/plugins/task_manager/server/polling_lifecycle.ts (easier to
review if you disregard whitespace `?w=1`)

## To verify
1. Tasks run normally (create a rule or something that goes through task
manager regularly).
2. When the update by query takes a while, the request is cancelled
after 30s or the time manually configured.
3. When the search for claimed tasks query takes a while, the request is
cancelled after 30s or the time manually configured.

**Tips:**
<details><summary>how to slow down the search for claimed tasks queries</summary>

```diff
diff --git a/x-pack/plugins/task_manager/server/queries/task_claiming.ts b/x-pack/plugins/task_manager/server/queries/task_claiming.ts
index 07042650a37..2caefd63672 100644
--- a/x-pack/plugins/task_manager/server/queries/task_claiming.ts
+++ b/x-pack/plugins/task_manager/server/queries/task_claiming.ts
@@ -247,7 +247,7 @@ export class TaskClaiming {
         taskTypes,
       });

-    const docs = tasksUpdated > 0 ? await this.sweepForClaimedTasks(taskTypes, size) : [];
+    const docs = await this.sweepForClaimedTasks(taskTypes, size);

     this.emitEvents(docs.map((doc) => asTaskClaimEvent(doc.id, asOk(doc))));

@@ -346,6 +346,13 @@ export class TaskClaiming {
       size,
       sort: SortByRunAtAndRetryAt,
       seq_no_primary_term: true,
+      aggs: {
+        delay: {
+          shard_delay: {
+            value: '40s',
+          },
+        },
+      },
     });

     return docs;
```
</details>

<details><summary>how to slow down update by queries</summary>
Not the cleanest way, but you'll see occasional request timeouts from the
updateByQuery calls. I had more luck after creating rules that run every 1s.

```diff
diff --git a/x-pack/plugins/task_manager/server/task_store.ts b/x-pack/plugins/task_manager/server/task_store.ts
index a06ee7b918a..07aa81e5388 100644
--- a/x-pack/plugins/task_manager/server/task_store.ts
+++ b/x-pack/plugins/task_manager/server/task_store.ts
@@ -126,6 +126,7 @@ export class TaskStore {
       // Timeouts are retried and make requests timeout after (requestTimeout * (1 + maxRetries))
       // The poller doesn't need retry logic because it will try again at the next polling cycle
       maxRetries: 0,
+      requestTimeout: 900,
     });
   }

@@ -458,6 +459,7 @@ export class TaskStore {
           ignore_unavailable: true,
           refresh: true,
           conflicts: 'proceed',
+          requests_per_second: 1,
           body: {
             ...opts,
             max_docs,
```
</details>

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>