Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests #73244

Merged
merged 18 commits into from
Aug 5, 2020

Conversation

gmmorris
Copy link
Contributor

@gmmorris gmmorris commented Jul 27, 2020

Summary

This PR addresses two issues which caused several tests to be flaky in TM.

When runNow was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it which causes task to return when they're marked as running but we didn't address these correctly which caused flakyness in the tests (rightfully) - we now address them correctly by messaging this correctly in the error.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't user refresh:true on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.

closes #71390

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@gmmorris gmmorris changed the title removed skip [Task Manager] Better handle RunNow running out of workers Jul 27, 2020
@gmmorris gmmorris changed the title [Task Manager] Better handle RunNow running out of workers [Task Manager] Better handle RunNow running out of workers in tests Jul 27, 2020
Comment on lines 422 to 423
await delay(100);

Copy link
Contributor Author

@gmmorris gmmorris Jul 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional delay caused by the buffering of updates allows it to fail every once in a while as the runNow sees the task as still running and doesn't try to run it concurrently.

So we delay here a few ms to side step that.

@@ -456,7 +456,7 @@ export async function awaitTaskRunResult(
)
);
} else if (isTaskClaimEvent(taskEvent)) {
reject(
return reject(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses the Socket timeout - we weren't short circuiting the return here which caused some callbacks to fail weirdly at times.

* master: (126 commits)
  [ML] Disabling ML if license feature is disabled (elastic#73187)
  [ML] Fixing old _xpack style es endpoint paths (elastic#73667)
  [DOCS] [Lens] 7.9 docs refresh (elastic#72301)
  [ML] DF Analytics results: ensure `View` link is only enabled when job has successfully completed (elastic#73539)
  Set timeRange to default to trigger the error message (elastic#73629)
  [ML] Functional tests - stabilize DFA navigation and index pattern handling (elastic#73660)
  [ILM] Add links to "Snapshot and Restore" from ILM "wait for snapshot policy" (elastic#72473)
  [kbn-storybook] Update Storybook to 5.3.19 (elastic#73320)
  [Metrics UI] Fix hasData call to ensure it has data not just indices (elastic#72969)
  [Uptime] Use `service.name` to link from Uptime -> APM where available (elastic#73618)
  allow others to update `URL.revokeObjectURL` property if needed (elastic#73639)
  regen docs (elastic#73650)
  [Visualize] Fix inspector download filename issue when saving in-place (elastic#72605)
  [Data] Query Input String manager (elastic#72093)
  [Security Solutions] Add tooltips (elastic#73436)
  Do not render descriptionless actions within an EuiCard (elastic#73611)
  [Security Solution][Detections] Value Lists Modal supports multiple exports (elastic#73532)
  [Security Solution][Resolver] Handle disabled process collection (elastic#73592)
  [Security_Solution][Bug] Fix user name/domain to ECS structure (elastic#73530)
  [Security Solution][Exceptions] - Update rule.exceptions_list to include exception list list_id (elastic#73349)
  ...
@gmmorris gmmorris marked this pull request as ready for review July 29, 2020 16:02
@gmmorris gmmorris requested a review from a team as a code owner July 29, 2020 16:02
@gmmorris gmmorris added 7.9.0 Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Jul 29, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris gmmorris added the release_note:skip Skip the PR/issue when compiling release notes label Jul 29, 2020
@gmmorris gmmorris changed the title [Task Manager] Better handle RunNow running out of workers in tests [Task Manager] Prevent flaky RunNow tests Jul 29, 2020
@lukasolson lukasolson added v7.9.0 and removed 7.9.0 labels Jul 29, 2020
@gmmorris
Copy link
Contributor Author

@elasticmachine merge upstream

elasticmachine and others added 3 commits July 30, 2020 05:12
* master: (38 commits)
  [Discover] Context unskip date nanos functional tests (elastic#73781)
  [ML] Migrate to React BrowserRouter and Kibana provided History. (elastic#71941)
  [Discover] Improve  saveSearch functional test handling (elastic#73626)
  [Metrics UI] Fix all threshold alert conditions disappearing due to alert prefill (elastic#73708)
  [Metrics UI] Fix alert previews of ungrouped alerts (elastic#73735)
  [SIEM] Fixes "include building block button" to operate (elastic#73900)
  [Metrics UI] Fix alert management to open without refresh (elastic#73739)
  [Security Solution][Lists] - Tests cleanup and remove unnecessary import (elastic#73865)
  [Ingest Management] main branch uses epr-snapshot. Others production (elastic#73555)
  [Canvas][tech-debt] Fix SVG not shrinking vertically properly (elastic#73867)
  [Maps] upgrade turf (elastic#73816)
  [Security Solution][Telemetry] Concurrent telemetry requests (elastic#73558)
  [Security Solution][Exceptions] - Update how nested entries are displayed in exceptions viewer (elastic#73745)
  [Security Solution][Exceptions] Adds autocomplete workaround for .text fields (elastic#73761)
  [Metrics UI] Fix previewing of No Data results (elastic#73753)
  Closes elastic#72914 by hiding anomaly detection settings links when the ml plugin is disabled. (elastic#73638)
  [Ingest Manager] Fix config selection in enrollment flyout from config list page (elastic#73833)
  [DOCS] Fixes typo in Alerting actions (elastic#73756)
  [APM] fixes linking errors to ML and Discover (elastic#73758)
  Handle promise rejections when building artifacts (elastic#73831)
  ...
* master: (39 commits)
  [Canvas][tech-debt] Rename __examples__ to __stories__ (elastic#73853)
  [Canvas] Storybook Redux Addon (elastic#73227)
  Use "Apply_filter_trigger" in "explore underlying data" action (elastic#71445)
  [maps] convert top nav config to TS (elastic#73851)
  [maps] fix fit to bounds for ES document layers with joins (elastic#73985)
  [Canvas][tech-debt] Refactor Toolbar (completes Kill Recompose.pure) (elastic#73309)
  [CI] In-progress Slack notifications (elastic#74012)
  [SIEM][Detection Engine] Fixes tags to accept characters such as AND, OR, (, ), ", * (elastic#74003)
  [SECURITY_SOLUTION][ENDPOINT] Fix host list Configuration Status cell link loosing list page/size state (elastic#73989)
  Tweak injected metadata (elastic#73990)
  Closes elastic#73998 by using `canAccessML` in the ML capabilities API to (elastic#73999)
  [SIEM] Fixes toaster errors when siemDefault index is an empty or empty spaces (elastic#73991)
  [Security Solution] Fix timeline pin event callback (elastic#73981)
  [Security Solution] Fix unexpected redirect (elastic#73969)
  [Metrics UI] Fix Metrics Explorer TSVB link to use workaround pattern (elastic#73986)
  [APM] docs: Update machine learning integration (elastic#73597)
  [Ingest Manager] Fix limited concurrency helper (elastic#73976)
  [build/sysv] fix missing env variable rename (elastic#73977)
  Fix a typo. (elastic#73948)
  [Ingest Manager] Revert fleet config concurrency rollout to rate limit (elastic#73940)
  ...
@gmmorris gmmorris changed the title [Task Manager] Prevent flaky RunNow tests [Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests Aug 2, 2020
Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* master: (74 commits)
  [Discover] Inline noWhiteSpace function (elastic#74331)
  [DOCS] Add Observability topic (elastic#73041)
  skip flaky suite (elastic#74327)
  [Security Solution][Detections] Fixes Severity Override not matching for Elastic Endpoint Security rule (elastic#74317)
  [Security Solution][Exceptions] - Fixes exceptions builder nested deletion issue and adds unit tests (elastic#74250)
  Fixed Alert details does not update page title and breadcrumb (elastic#74214)
  [src/dev/build] build Kibana Platform bundles from source (elastic#73591)
  [Reporting] Shorten asset path to help CLI FS Watcher (elastic#74185)
  Fix TMS not loaded in legacy maps (elastic#73570)
  [Security Solution] styling for notes' panel (elastic#74274)
  [Security Solution][Tech Debt] cleans up ts-ignore issues and some smaller linter issues  (elastic#74268)
  Make the actions plugin support generics (elastic#71439)
  [Security Solution] Keep original note creator (elastic#74203)
  [CI] Fix xpack kibana build dir in xpack visual regression script
  [CI] Fix baseline_capture job by adding parallel process number back
  [Monitoring] Ensure setup mode works on cloud but only for alerts (elastic#73127)
  [Maps] Custom color ramps should show correctly on the map for mvt layers (elastic#74169)
  [kbn/optimizer] remove unused modules (elastic#74195)
  [CI] Add pipeline task queue framework and merge workers into one (elastic#71268)
  Using msearch for tree api endpoint (elastic#73813)
  ...
Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

});

await delay(1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for a delay after a retry.try? Seems like a future cause of flakiness. The logic of the code above could be changed to wait until the task finished running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that was a mistake - this was meant to be await ensureTasksIndexRefreshed(); too 👍
Good catch!

});

await delay(1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@gmmorris gmmorris merged commit 5c770e5 into elastic:master Aug 5, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 5, 2020
…nd reduce flakiness in related tests (elastic#73244)

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it which causes task to return when they're already marked as `running` but we didn't address these correctly which caused flakyness in the tests.
This didn't cause a broken beahviour, but it did cause beahviour that was hard to reason about - we now address them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't user `refresh:true` on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 5, 2020
…nd reduce flakiness in related tests (elastic#73244)

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it which causes task to return when they're already marked as `running` but we didn't address these correctly which caused flakyness in the tests.
This didn't cause a broken beahviour, but it did cause beahviour that was hard to reason about - we now address them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't user `refresh:true` on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
gmmorris added a commit that referenced this pull request Aug 5, 2020
…nd reduce flakiness in related tests (#73244) (#74386)

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it which causes task to return when they're already marked as `running` but we didn't address these correctly which caused flakyness in the tests.
This didn't cause a broken beahviour, but it did cause beahviour that was hard to reason about - we now address them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't user `refresh:true` on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
gmmorris added a commit that referenced this pull request Aug 5, 2020
…nd reduce flakiness in related tests (#73244) (#74387)

This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it which causes task to return when they're already marked as `running` but we didn't address these correctly which caused flakyness in the tests.
This didn't cause a broken beahviour, but it did cause beahviour that was hard to reason about - we now address them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to be visible (we don't user `refresh:true` on purpose), so adding a wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
@gmmorris gmmorris added release_note:fix and removed release_note:skip Skip the PR/issue when compiling release notes labels Aug 5, 2020
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 6, 2020
* master: (208 commits)
  Observability Overview fix extra basepath prepend for alerting fetch (elastic#74465)
  [Lens] Clean and inline disabling of react-hooks/exhaustive-deps eslint rule (elastic#70010)
  Skip "space with index pattern management disabled" functional test for cloud env (elastic#74073)
  Filter out non-security jobs when collecting Detections telemetry (elastic#74456)
  [Security Solution][Test] Enzyme test for related events button (elastic#74411)
  [SECURITY_SOLUTION] add z-index to get over nav bar (elastic#74427)
  Rename package configs SO to package policies (elastic#74422)
  [DOCS] Add Kibana alerts to Stack Monitoring (elastic#73762)
  skip flaky suite (elastic#71390)
  [ML] DF Analytics: adds functional tests for edit form (elastic#73885)
  Rename agent configs SO to agent policies (elastic#74397)
  [Jenkins] run CI when plugin readmes change (elastic#74388)
  [Metrics UI] Fix validating Metrics Explorer URL (elastic#74311)
  fixing encoding issue with \ for enroll command (elastic#74379)
  [Ingest Manager] Update package registry for testing to f6b01d (elastic#74341)
  Change experimental message for visualizations (elastic#74354)
  [Alerting] Reload the Alerts List when alerts are deleted (elastic#73715)
  [Enterprise Search] Fix/DRY out plugin i18n strings (elastic#74323)
  update empty prompt in analytics list (elastic#74174)
  [Task Manager] Correctly handle `running` tasks when calling RunNow and reduce flakiness in related tests (elastic#73244)
  ...
gmmorris added a commit that referenced this pull request Aug 13, 2020
…chedule every polling interval (#74606)

Fixes flaky tests in Task Manager and Alerting.

The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case which was effecting both TM in general and Alerting specifically.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 13, 2020
…chedule every polling interval (elastic#74606)

Fixes flaky tests in Task Manager and Alerting.

The fix in elastic#73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case which was effecting both TM in general and Alerting specifically.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 13, 2020
…chedule every polling interval (elastic#74606)

Fixes flaky tests in Task Manager and Alerting.

The fix in elastic#73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case which was effecting both TM in general and Alerting specifically.
gmmorris added a commit that referenced this pull request Aug 13, 2020
…chedule every polling interval (#74606) (#74940)

Fixes flaky tests in Task Manager and Alerting.

The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case which was effecting both TM in general and Alerting specifically.
gmmorris added a commit that referenced this pull request Aug 13, 2020
…chedule every polling interval (#74606) (#74941)

Fixes flaky tests in Task Manager and Alerting.

The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case which was effecting both TM in general and Alerting specifically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Task Manager release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.9.0 v7.10.0 v8.0.0
Projects
None yet
6 participants