Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix task manager polling flow controls #153491

Merged
merged 36 commits into from
May 3, 2023

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Mar 22, 2023

Fixes #151938

In this PR, I'm re-writing the Task Manager poller so it doesn't run concurrently when timeouts occur while also fixing the issue where polling requests would pile up when polling takes time. To support this, I've also made the following changes:

  • Removed the observable monitor and the xpack.task_manager.max_poll_inactivity_cycles setting
  • Make the task store search and updateByQuery functions have no retries. This prevents the request from retrying 5x whenever a timeout occurs, causing each call taking up to 2 1/2 minutes before Kibana sees the error (now down to 30s each). We have polling to manage retries in these situations.
  • Switch the task poller tests to use sinon for faking timers
  • Removing the assertStillInSetup checks on plugin setup. Felt like a maintenance burden that wasn't necessary to fix with my code changes.

The main code changes are within these files (to review thoroughly so the polling cycle doesn't suddenly stop):

  • x-pack/plugins/task_manager/server/polling/task_poller.ts
  • x-pack/plugins/task_manager/server/polling_lifecycle.ts (easier to review if you disregard whitespace ?w=1)

To verify

  1. Tasks run normally (create a rule or something that goes through task manager regularly).
  2. When the update by query takes a while, the request is cancelled after 30s or the time manually configured.
  3. When the search for claimed tasks query takes a while, the request is cancelled after 30s or the time manually configured.

Tips:

how to slowdown search for claimed task queries
diff --git a/x-pack/plugins/task_manager/server/queries/task_claiming.ts b/x-pack/plugins/task_manager/server/queries/task_claiming.ts
index 07042650a37..2caefd63672 100644
--- a/x-pack/plugins/task_manager/server/queries/task_claiming.ts
+++ b/x-pack/plugins/task_manager/server/queries/task_claiming.ts
@@ -247,7 +247,7 @@ export class TaskClaiming {
         taskTypes,
       });

-    const docs = tasksUpdated > 0 ? await this.sweepForClaimedTasks(taskTypes, size) : [];
+    const docs = await this.sweepForClaimedTasks(taskTypes, size);

     this.emitEvents(docs.map((doc) => asTaskClaimEvent(doc.id, asOk(doc))));

@@ -346,6 +346,13 @@ export class TaskClaiming {
       size,
       sort: SortByRunAtAndRetryAt,
       seq_no_primary_term: true,
+      aggs: {
+        delay: {
+          shard_delay: {
+            value: '40s',
+          },
+        },
+      },
     });

     return docs;
how to slow down update by queries Not the cleanest way but you'll see occasional request timeouts from the updateByQuery calls. I had more luck creating rules running every 1s.
diff --git a/x-pack/plugins/task_manager/server/task_store.ts b/x-pack/plugins/task_manager/server/task_store.ts
index a06ee7b918a..07aa81e5388 100644
--- a/x-pack/plugins/task_manager/server/task_store.ts
+++ b/x-pack/plugins/task_manager/server/task_store.ts
@@ -126,6 +126,7 @@ export class TaskStore {
       // Timeouts are retried and make requests timeout after (requestTimeout * (1 + maxRetries))
       // The poller doesn't need retry logic because it will try again at the next polling cycle
       maxRetries: 0,
+      requestTimeout: 900,
     });
   }

@@ -458,6 +459,7 @@ export class TaskStore {
           ignore_unavailable: true,
           refresh: true,
           conflicts: 'proceed',
+          requests_per_second: 1,
           body: {
             ...opts,
             max_docs,

@mikecote mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 22, 2023
@mikecote mikecote self-assigned this Mar 22, 2023
@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote mikecote added release_note:skip Skip the PR/issue when compiling release notes v8.9.0 labels Apr 28, 2023
@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

mikecote commented May 1, 2023

@elasticmachine merge upstream

@mikecote mikecote marked this pull request as ready for review May 2, 2023 14:45
@mikecote mikecote requested review from a team as code owners May 2, 2023 14:45
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@@ -400,7 +400,6 @@ kibana_vars=(
xpack.securitySolution.prebuiltRulesPackageVersion
xpack.spaces.maxSpaces
xpack.task_manager.max_attempts
xpack.task_manager.max_poll_inactivity_cycles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to go through the deprecation / breaking change process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setting isn't documented so I figured we could simply remove it. I could change it to a no-op so it doesn't become breaking for anyone. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbudz ^^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it up to the team - not familiar with the setting and whether its use has been advised in any production scenarios.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
enterpriseSearch 19 21 +2
securitySolution 398 401 +3
total +5

Total ESLint disabled count

id before after diff
enterpriseSearch 20 22 +2
securitySolution 478 481 +3
total +5

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Created a bunch of rules and saw them running as expected. Verified the ES timeouts as per PR instructions as well.

@mikecote mikecote merged commit cb2e28d into elastic:main May 3, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task Manager - Long claiming duration breaks flow control
6 participants