[Task Manager] Correctly handle `running` tasks when calling RunNow and reduce flakiness in related tests #73244

gmmorris · 2020-07-27T08:35:53Z

Summary

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM, we added a pinned query that returns specific tasks by ID.
This query does not have the usual filter applied to it, so tasks are returned even when they are already marked as `running`. We didn't handle that case correctly, which (rightfully) caused flakiness in the tests. We now handle it by reporting the task's state correctly in the error.
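As a rough illustration of the corrected handling, here is a minimal sketch. The `TaskDoc` shape, the `describeRunNowFailure` helper, the status values, and the message wording are all assumptions made for this example and are not taken from the Task Manager source.

```typescript
// Hypothetical shapes for illustration only; not the real Task Manager types.
type TaskStatus = 'idle' | 'claiming' | 'running' | 'failed';

interface TaskDoc {
  id: string;
  status: TaskStatus;
}

// The pinned by-ID query skips the usual status filter, so a task that is
// already `running` can come back from it. Report that case explicitly
// instead of treating it like any other failed claim.
function describeRunNowFailure(task: TaskDoc | undefined, id: string): string {
  if (!task) {
    return `Failed to run task "${id}" as it does not exist`;
  }
  if (task.status === 'running') {
    return `Failed to run task "${id}" as it is currently running`;
  }
  // Any other state: surface the status so the failure is easier to reason about.
  return `Failed to run task "${id}" (current status is "${task.status}")`;
}
```

The point of the sketch is only that the `running` case gets its own branch and message, so a caller of `runNow` can tell "already running" apart from "missing" or "unclaimable".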

It seems that sometimes, especially when the ES queue is overworked, it can take some time for the update to the underlying task to become visible (we intentionally don't use `refresh: true`). The tests now wait for the index to refresh so that the task is updated in time for the next stage of the test.
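A helper for waiting on the refresh might look roughly like the sketch below. The name `ensureTasksIndexRefreshed` appears later in this review, but its body here is a guess. The `RefreshClient` interface is a local stand-in for an Elasticsearch client, and the index name is an assumption.

```typescript
// Minimal slice of an Elasticsearch client, declared locally so the sketch
// is self-contained. This is a stand-in, not the real client type.
interface RefreshClient {
  indices: {
    refresh(params: { index: string }): Promise<unknown>;
  };
}

// Assumed index name; adjust to whichever index the tests actually use.
const TASK_MANAGER_INDEX = '.kibana_task_manager';

async function ensureTasksIndexRefreshed(
  client: RefreshClient,
  index: string = TASK_MANAGER_INDEX
): Promise<void> {
  // Force a refresh so that earlier task updates are visible to the next search.
  await client.indices.refresh({ index });
}
```

In a test, awaiting this helper between "update the task" and "assert on the task" replaces a fixed sleep with a wait on the thing the test actually depends on.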

closes #71390

Checklist

Delete any items that are not applicable to this PR.

Unit or functional tests were updated or added to match the most common scenarios

For maintainers

This was checked for breaking API changes and was labeled appropriately

gmmorris · 2020-07-29T12:13:21Z

x-pack/test/plugin_api_integration/test_suites/task_manager/task_manager_integration.js

+      await delay(100);
+


The additional delay introduced by buffering updates lets this fail every once in a while: `runNow` sees the task as still running and doesn't try to run it concurrently.

So we delay a few ms here to sidestep that.

gmmorris · 2020-07-29T16:01:04Z

x-pack/plugins/task_manager/server/task_manager.ts

@@ -456,7 +456,7 @@ export async function awaitTaskRunResult(
                )
              );
            } else if (isTaskClaimEvent(taskEvent)) {
-              reject(
+              return reject(


This addresses the socket timeout: we weren't short-circuiting with a `return` here, which caused some callbacks to fail weirdly at times.
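To make the control-flow point concrete: calling `reject` does not end the surrounding function, so any code after it still runs unless there is a `return`. The sketch below is not the code from `awaitTaskRunResult`. The `TaskEvent` shape and `handleEvent` helper are hypothetical, and the `reject`/`resolve` callbacks just record what was called.

```typescript
// Hypothetical event shape, for illustration only.
interface TaskEvent {
  type: 'claim' | 'run';
  error?: string;
}

// Returns the list of callbacks invoked, so the control flow is observable.
function handleEvent(event: TaskEvent, shortCircuit: boolean): string[] {
  const actions: string[] = [];
  const reject = (reason: string): void => {
    actions.push(`reject: ${reason}`);
  };
  const resolve = (value: string): void => {
    actions.push(`resolve: ${value}`);
  };

  const settle = (): void => {
    if (event.type === 'claim' && event.error) {
      if (shortCircuit) {
        return reject(event.error); // execution stops here
      }
      reject(event.error); // without `return`, execution falls through
    }
    resolve('done');
  };

  settle();
  return actions;
}

// With the `return`, only the rejection is recorded.
console.log(handleEvent({ type: 'claim', error: 'boom' }, true));
// Without it, the code after the rejection also runs.
console.log(handleEvent({ type: 'claim', error: 'boom' }, false));
```

With a real `Promise`, the later `resolve` would be ignored once the promise has settled, but any other side effects after the `reject` call would still happen.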

elasticmachine · 2020-07-29T16:06:03Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris · 2020-07-30T11:12:24Z

@elasticmachine merge upstream

pmuellr

LGTM

mikecote

LGTM!

mikecote · 2020-08-05T13:13:06Z

x-pack/test/plugin_api_integration/test_suites/task_manager/task_manager_integration.js

      });

+      await delay(1000);


Is there a reason for a delay after a retry.try? Seems like a future cause of flakiness. The logic of the code above could be changed to wait until the task finished running.

gmmorris

Oh, that was a mistake. This was meant to be `await ensureTasksIndexRefreshed();` too 👍
Good catch!

mikecote · 2020-08-05T13:13:23Z

x-pack/test/plugin_api_integration/test_suites/task_manager/task_manager_integration.js

      });

+      await delay(1000);


…nd reduce flakiness in related tests (elastic#73244)

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID. This query does not have the filter applied to it, which causes tasks to be returned when they're already marked as `running`, but we didn't handle these correctly, which caused flakiness in the tests. This didn't cause broken behaviour, but it did cause behaviour that was hard to reason about. We now handle them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't use `refresh:true` on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.

…nd reduce flakiness in related tests (#73244) (#74386)

…nd reduce flakiness in related tests (#73244) (#74387)

kibanamachine · 2020-08-05T21:42:38Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: d034480

Build metrics

✅ unchanged

History

💚 Build #66634 succeeded d034480
💚 Build #66581 succeeded 5907ed1
💚 Build #65990 succeeded dffcdf9
💛 Build #65781 was flaky ce2ee00
💚 Build #65492 succeeded cd70954

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…chedule every polling interval (#74606)

Fixes flaky tests in Task Manager and Alerting. The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over. This prevents that edge case, which was affecting both TM in general and Alerting specifically.

…chedule every polling interval (#74606) (#74940)

…chedule every polling interval (#74606) (#74941)

gmmorris added 2 commits July 27, 2020 09:34

removed skip

0879f0b

improved messaging around failed runNow

e9f2847

gmmorris changed the title ~~removed skip~~ [Task Manager] Better handle RunNow running out of workers Jul 27, 2020

gmmorris changed the title ~~[Task Manager] Better handle RunNow running out of workers~~ [Task Manager] Better handle RunNow running out of workers in tests Jul 27, 2020

gmmorris added 5 commits July 27, 2020 16:35

corrected test string

5767420

corrected error in unit tests

187b6ee

added more LOGGING DONT MERGE

66a83ba

reduce flakyness due to the delayed update

8faaee2

removed console logging

5391c0d

gmmorris commented Jul 29, 2020

bump delay

af3d139

gmmorris commented Jul 29, 2020

gmmorris marked this pull request as ready for review July 29, 2020 16:02

gmmorris requested a review from a team as a code owner July 29, 2020 16:02

gmmorris added the 7.9.0, Feature:Task Manager, Team:ResponseOps, v7.10.0, and v8.0.0 labels Jul 29, 2020

gmmorris added the release_note:skip label Jul 29, 2020

gmmorris changed the title ~~[Task Manager] Better handle RunNow running out of workers in tests~~ [Task Manager] Prevent flaky RunNow tests Jul 29, 2020

lukasolson added v7.9.0 and removed 7.9.0 labels Jul 29, 2020

add status check and bump delay

59ca702

elasticmachine and others added 3 commits July 30, 2020 05:12

Merge branch 'master' into task-manager/fix-flaky-test

cd70954

correctly handle sweep by id when task is already running

9e6f63f

gmmorris added 2 commits August 2, 2020 15:25

wait for index refresh

9be0b3e

gmmorris changed the title ~~[Task Manager] Prevent flaky RunNow tests~~ [Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests Aug 2, 2020

pmuellr approved these changes Aug 4, 2020

gmmorris added 2 commits August 5, 2020 10:07

skip flaky alerts test

b44d375

mikecote approved these changes Aug 5, 2020

Update task_manager_integration.js

d034480

gmmorris merged commit 5c770e5 into elastic:master Aug 5, 2020

gmmorris mentioned this pull request Aug 5, 2020

[7.x] [Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests (#73244) #74386

Merged

gmmorris mentioned this pull request Aug 5, 2020

[7.9] [Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests (#73244) #74387

Merged

gmmorris added the release_note:fix label and removed the release_note:skip label Aug 5, 2020

gmmorris mentioned this pull request Aug 6, 2020

[Task manager] Prevents edge case where already running tasks are reschedule every polling interval #74606

Merged

2 tasks
