[Alerting] Shift polling interval by random amount when Task Manager experiences consistent claim version conflicts #88020

Merged

Conversation

gmmorris

@gmmorris gmmorris commented Jan 12, 2021

Summary

Closes #87808

This PR introduces a `pollingDelay`, which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).

Notes:

  1. There is some implementation overlap between the new delayOnClaimConflicts function and the Monitoring Stats components. I've kept these apart on purpose to avoid changing unnecessary code as part of this PR due to it being a post-FF fix.
  2. The average is calculated based on the max_workers rather than the available workers. I'm hoping this will be sufficient to handle the issue we've identified, but we might have to change this in the future.
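
To illustrate the idea described above, here is a minimal sketch in plain TypeScript. It is not the code from this PR: the function names, the window of recent polling cycles, and the choice of a random delay in the range [0, pollInterval) are all assumptions made for illustration. The only details taken from the description are that the conflict percentage is measured against max_workers, that it is averaged, and that a delay is applied when the average is higher than the threshold (80% by default).

```typescript
// Hypothetical sketch; names and the delay range are assumptions, not the PR's code.

// Percentage of workers that hit a version conflict in one polling cycle,
// measured against max_workers (not the currently available workers).
function conflictPercentage(conflicts: number, maxWorkers: number): number {
  if (maxWorkers <= 0) {
    return 0;
  }
  return Math.min(100, (conflicts * 100) / maxWorkers);
}

// Arithmetic mean; an empty window counts as "no conflicts".
function average(values: number[]): number {
  if (values.length === 0) {
    return 0;
  }
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Returns the delay (in ms) to add to the polling interval.
// recentConflicts holds the number of version conflicts seen in each of the
// most recent polling cycles (the window size is left to the caller).
function nextPollingDelay(
  recentConflicts: number[],
  maxWorkers: number,
  pollInterval: number,
  thresholdPercentage: number = 80,
  random: () => number = Math.random // injectable so the sketch is testable
): number {
  const averageConflicts = average(
    recentConflicts.map((conflicts) => conflictPercentage(conflicts, maxWorkers))
  );
  if (averageConflicts > thresholdPercentage) {
    // Assumed range: shift by a random amount smaller than one polling interval.
    return Math.floor(random() * pollInterval);
  }
  return 0;
}
```

With `maxWorkers` set to 10, a window such as `[10, 9, 10]` averages above 80%, so `nextPollingDelay([10, 9, 10], 10, 3000)` yields some delay below 3000 ms, while a window such as `[1, 2, 1]` yields no delay. Because each Kibana instance would draw its own random value, instances that were polling in lockstep should drift apart.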

@gmmorris gmmorris changed the title from "added delay in response to claim conflicts" to "[Alerting] Shift polling interval by random amount when Task Manager experiences consistent claim version conflicts" Jan 12, 2021
@gmmorris gmmorris marked this pull request as ready for review January 12, 2021 15:30
@gmmorris gmmorris requested a review from a team as a code owner January 12, 2021 15:30
@gmmorris gmmorris added the Feature:Alerting, release_note:fix, Team:ResponseOps, v7.11.0, v7.12.0, and v8.0.0 labels Jan 12, 2021
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)


@ymao1 ymao1 left a comment

LGTM!

@mikecote mikecote self-requested a review January 12, 2021 19:26

@mikecote mikecote left a comment

Changes LGTM! Tested locally and saw 2/3 of Kibana instances delaying their polling when they were previously in sync. 👍


@pmuellr pmuellr left a comment

Noted a typo, and have a question about the "initial delay" in the new timer() usage. Likely my Rx inexperience showing :-)

Otherwise, LGTM.

Review thread on x-pack/plugins/task_manager/README.md (outdated, resolved)
gmmorris and others added 3 commits January 12, 2021 20:22
Co-authored-by: Patrick Mueller <pmuellr@gmail.com>
…mmorris/kibana into tm/shift-polling-on-persistent-clashes

* 'tm/shift-polling-on-persistent-clashes' of github.com:gmmorris/kibana:
  typo
@kibanamachine

💚 Build Succeeded

Metrics [docs]

✅ unchanged

@gmmorris gmmorris merged commit 5e4402c into elastic:master Jan 12, 2021
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jan 12, 2021
…experiences consistent claim version conflicts (elastic#88020)

This PR introduces a `pollingDelay`, which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).
gmmorris added a commit that referenced this pull request Jan 13, 2021
…experiences consistent claim version conflicts (#88020) (#88113)

This PR introduces a `pollingDelay`, which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).
gmmorris added a commit that referenced this pull request Jan 13, 2021
…experiences consistent claim version conflicts (#88020) (#88114)

This PR introduces a `pollingDelay`, which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).
@gmmorris

Added docs for this to the Scaling docs in ad4fde6

Labels
Feature:Alerting, release_note:fix, Team:ResponseOps, v7.11.0, v7.12.0, v8.0.0
Development

Successfully merging this pull request may close these issues.

[Alerting] Sometimes, when there are multiple Kibana, some instances fail to pick up any tasks
6 participants